
Save the linked list of the executor from suffering GC nepotism #74

Merged: 1 commit, Apr 23, 2020

Conversation

franz1981
Collaborator

@franz1981 franz1981 commented Apr 21, 2020

This has turned out to be a relevant GC issue that can cause several problems (including OOM) in many linked-list implementations.
For reference:

I've solved the problem in the same way CLQ explains here: https://github.com/AdoptOpenJDK/openjdk-jdk14/blob/f7165c322a6abefa34afa9eeb76dbd56f70aae09/src/java.base/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L149

We use the trick of
linking a Node that has just been dequeued to itself.  Such a
self-link implicitly means to advance to head.

In addition, I see that this comment gives an additional hint on how to handle, on the producer side, the case where tail.next links back to tail (i.e. the self-link that implicitly means to advance to head):

  • if tail is unchanged (from the last read) it means that head is the most up-to-date node to append to and we can use it
  • if tail is changed (from the last read) it just means that we've fallen behind and can use the new tail again

I haven't implemented this strategy, but I've let it re-attempt reading this.tail if a consumer has passed over a producer.
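
For anyone not familiar with the trick, here is a minimal, self-contained sketch of a Michael-Scott style queue that self-links dequeued nodes; all names are hypothetical and this is not the EnhancedQueueExecutor code, just an illustration of the mechanism being discussed:

    // Hypothetical sketch of the self-link trick; NOT the jboss-threads code.
    import java.util.concurrent.atomic.AtomicReference;

    final class SelfLinkQueueSketch<T> {
        static final class Node<T> {
            T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>();
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head;
        private final AtomicReference<Node<T>> tail;

        SelfLinkQueueSketch() {
            Node<T> dummy = new Node<>(null);
            head = new AtomicReference<>(dummy);
            tail = new AtomicReference<>(dummy);
        }

        void offer(T value) {
            Node<T> node = new Node<>(value);
            for (;;) {
                Node<T> t = tail.get();
                Node<T> next = t.next.get();
                if (next == t) {
                    // t was dequeued and self-linked: the current head is a better
                    // starting point, because a node is never self-linked while it
                    // is still the head
                    tail.compareAndSet(t, head.get());
                } else if (next != null) {
                    tail.compareAndSet(t, next); // help advance a lagging tail
                } else if (t.next.compareAndSet(null, node)) {
                    tail.compareAndSet(t, node); // best effort, tail may lag by one
                    return;
                }
            }
        }

        T poll() {
            for (;;) {
                Node<T> h = head.get();
                Node<T> next = h.next.get();
                if (next == h) {
                    continue; // h was already dequeued and self-linked: re-read head
                }
                if (next == null) {
                    return null; // queue looks empty
                }
                if (head.compareAndSet(h, next)) {
                    // GC nepotism: self-link the dequeued node so that, once promoted,
                    // it no longer keeps its (possibly young) successor reachable
                    h.next.lazySet(h);
                    T value = next.value;
                    next.value = null;
                    return value;
                }
            }
        }
    }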

@franz1981 franz1981 force-pushed the gc_nepotism branch 2 times, most recently from 694da73 to acc2b67 on April 21, 2020 07:00
@franz1981
Collaborator Author

franz1981 commented Apr 21, 2020

@dmlloyd @carterkozak
I was sure I had broken the current logic with this change (maybe I have done it anyway, eh :P), but it seems I'm able to make master fail with this test (on my 4-core laptop with HT):

    @Test
    public void gcNepotism() throws ExecutionException, InterruptedException {
        System.setProperty("jboss.threads.eqe.park-spins", Long.toString(Integer.MAX_VALUE));
        System.setProperty("jboss.threads.eqe.tail-lock", Boolean.FALSE.toString());
        Assume.assumeFalse(EnhancedQueueExecutor.TAIL_LOCK);
        EnhancedQueueExecutor executor = (new EnhancedQueueExecutor.Builder())
                .setCorePoolSize(10)
                .setMaximumPoolSize(20)
                .build();
        final int producers = 10;
        final int total = 100_000;
        final CountDownLatch totalExecution = new CountDownLatch(total * producers);
        ExecutorService service = Executors.newFixedThreadPool(producers);
        Future<?>[] tasks = new Future[producers];
        for (int i = 0; i < producers; i++) {
            tasks[i] = service.submit(() -> {
                for (int j = 0; j < total; j++) {
                    executor.execute(totalExecution::countDown);
                }
            });
        }
        for (int i = 0; i < producers; i++) {
            tasks[i].get();
        }
        Assert.assertTrue(totalExecution.await(30, TimeUnit.SECONDS));
        service.shutdown();
        executor.shutdown();
    }

On master I'm getting:

java.util.concurrent.ExecutionException: java.lang.AssertionError

	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.jboss.threads.EnhancedQueueExecutorTest.gcNepotism(EnhancedQueueExecutorTest.java:181)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
Caused by: java.lang.AssertionError
	at org.jboss.threads.EnhancedQueueExecutor.tryExecute(EnhancedQueueExecutor.java:1740)
	at org.jboss.threads.EnhancedQueueExecutor.execute(EnhancedQueueExecutor.java:755)
	at org.jboss.threads.EnhancedQueueExecutorTest.lambda$gcNepotism$0(EnhancedQueueExecutorTest.java:176)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

i.e. it's failing on assert tail instanceof TaskNode && tail.task == null; specifically on the tail.task == null check.
Is that assertion still valid?

// GC Nepotism:
// We can save consumerNode::next from being dragged into
// old generation, if possible
consumerNode.compareAndSetNext(tailNextNext, null);
Collaborator Author

I've used compareAndSetNext because I'm not sure whether this case is single-writer; maybe it is.

Contributor

Maybe we should do a weakCompareAndSet here to avoid the heavier memory barrier?

Collaborator Author

IMO weakCompareAndSet is safe, but it will probably make a difference only on ARM/PowerPC servers: AFAIK newish x86 won't get any specific code path in OpenJDK for this (but I could be wrong).

Contributor

OK. Well, we do care about ARM and AArch64, so I think that makes it worthwhile.

@franz1981
Collaborator Author

If I need to provide a test to make the GC nepotism problem evident, I can do that as well :+1:

@franz1981 franz1981 force-pushed the gc_nepotism branch 4 times, most recently from fafc219 to db0355c on April 21, 2020 17:25
TaskNode taskNode = (TaskNode) headNext;
if (compareAndSetHead(head, taskNode)) {
// save from GC nepotism
head.setNextRelaxed(head);
Contributor

Can you prove that this can never cause nodes to be skipped?

Collaborator Author

Good point: looking at the original code, I see that it drains TaskNodes until tail::getNext is null after a shutdown; I can preserve the original logic to handle the case of a concurrent call to it.

Collaborator Author

I've changed the visibility guarantees while setting next too, taking "inspiration" from CLQ:
https://github.com/openjdk-mirror/jdk7u-jdk/blob/8aaca72128077594ecf4ecdee761e78d4ecb8b7d/src/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L305

    final void updateHead(Node<E> h, Node<E> p) {
        if (h != p && casHead(h, p))
            h.lazySetNext(h);
    }

// cross-generational references, so better to "clean-up" head::next
// to save dragging head::next into the old generation.
// Clean-up cannot just null out next
head.setNextOrdered(head);
Contributor

I need to see a correctness proof here. What are the possible observable output states, and why? Are the states all valid and already described above? What are the preconditions and postconditions?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

It's mostly expressed in terms of acquire-release relationships between atomic operations:

  • only one thread can win compareAndSetHead(head, taskNode) for a specific head, taskNode pair
  • other threads can reach head while its head::next holds either the value from before or from after head.setNextOrdered
  1. "slow" producers (i.e. producers descheduled after reading this.tail for long enough that other producers have moved on and a consumer has already moved head):
  • before head.setNextOrdered(head): they try compareAndSetNext(null, newValue) and fail, because the consumer could only have moved past the node if next != null
  • after head.setNextOrdered(head): they try compareAndSetNext(null, newValue) and fail, because next == head, which is != null
  2. slow consumers (i.e. consumers that have been descheduled after reading this.head and are about to read head::getNext):
  • before head.setNextOrdered(head): they see the same head::next value as the consumer that won casHead, hence they attempt casHead and fail, because head has already moved on
  • after head.setNextOrdered(head): they see the new value (head::next == head) and re-attempt reading this.head. Given that head.setNextOrdered releases any previous write (including the winning casHead), loading head::next and finding it == head acquires the new value of head, hence reading head will return at least the value released by the consumer that won casHead

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

The states already written in the overview of the class should still be valid, because I've tried hard not to break the way we use null, and only to add checks so that head::next == head means that "something" has changed: in particular head and, as a consequence, tail.

Contributor

OK I think I understand all the possibilities here. But couldn't this be setNextRelaxed? We're not using node.next == node as the solitary indicator that it has been removed, so it's probably OK if some threads don't observe that (they would just spin past the node using the current behavior). Does it make a difference performance-wise?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

Does it make a difference performance-wise?

setNextOrdered on x86 isn't using any specific hardware barrier, thanks to the strong memory model of x86; the only perf hit would have come from using "updaters" (often negligible).

From a correctness PoV, yes: the only thing we care about here is atomicity, but sadly Unsafe doesn't provide the right method to get just that: ideally what we need here is https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/VarHandle.AccessMode.html#SET_OPAQUE , because the compareAndSet preceding it already provides the sequentially consistent write we need.
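
For illustration only, on JDK 9+ the semantics being discussed map onto VarHandle roughly like this (a hypothetical NodeSketch holder; jboss-threads itself stays on Unsafe-based accessors, so this is a sketch of setOpaque / weak CAS, not a proposed change):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    final class NodeSketch {
        volatile NodeSketch next;

        private static final VarHandle NEXT;
        static {
            try {
                NEXT = MethodHandles.lookup().findVarHandle(NodeSketch.class, "next", NodeSketch.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void setNextOpaque(NodeSketch value) {
            // atomic (no word tearing) per-variable write, without the ordering
            // guarantees (and cost) of a release or volatile store
            NEXT.setOpaque(this, value);
        }

        boolean weakCasNext(NodeSketch expect, NodeSketch update) {
            // may fail spuriously; with plain memory ordering it can be cheaper
            // than a full CAS on LL/SC architectures such as ARM/PowerPC
            return NEXT.weakCompareAndSetPlain(this, expect, update);
        }
    }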

Contributor

Is this safe in cases where EQE.head passes EQE.tail?

The tail reference may lag (currently by only one node, but the potential jdk14 CLQ optimization could allow more lag), which I think allows EQE.head to be EQE.tail.next. Clearing references may unlink tail from head and orphan submitted tasks.

It's very possible that I've missed an important piece here, I'm not confident that I've gotten this right.

Collaborator Author

It shouldn't, AFAIK.
A producer does prev_tail::casNext(null, new_tail) and then casTail(prev_tail, new_tail), while a consumer can consume only prev_tail, not new_tail (yet).
Given that other producers are able to read this.tail (i.e. new_tail), and eventually this.tail == new_tail, others are always able to append new elements without losing anything along the way.

Collaborator Author
@franz1981 franz1981 Apr 24, 2020

@carterkozak Thinking about it again and again, we're probably "lucky" that we cannot lag by more than 1, but if we use the same jdk 14 optimization (as you've suggested) we would need, on the producer side, to chase the last tail starting from head (which in theory should be near(ish)), given that head has passed the lagging tail, whenever node.next == node (see https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L348).
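
As a sketch of what that producer-side chase could look like (hypothetical AtomicReference-based nodes, mirroring CLQ's offer loop rather than the EQE code):

    import java.util.concurrent.atomic.AtomicReference;

    final class TailChaseSketch {
        static final class Node {
            final AtomicReference<Node> next = new AtomicReference<>();
        }

        // Appends node after the real last element, tolerating a lagging tail.
        static void append(AtomicReference<Node> head, AtomicReference<Node> tail, Node node) {
            Node p = tail.get();
            for (;;) {
                Node q = p.next.get();
                if (q == null) {
                    if (p.next.compareAndSet(null, node)) {
                        tail.compareAndSet(p, node); // best-effort tail update
                        return;
                    }
                    // lost the race to another producer: re-read p.next and walk on
                } else if (q == p) {
                    // p was dequeued and self-linked: tail has lagged behind head,
                    // so restart the chase from head, which is never self-linked
                    // while it is still the current head
                    p = head.get();
                } else {
                    p = q; // keep walking towards the real last node
                }
            }
        }
    }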

@nitsanw nitsanw Apr 24, 2020

From a correctness PoV, yes: the only thing we care about here is atomicity

@franz1981 you can use a plain write in this case, since reference writes are always atomic IIRC

Collaborator Author
@franz1981 franz1981 Apr 24, 2020

@nitsanw thanks, yep, the JLS should ensure that reference writes work fine here: no tearing on the reference value indeed, thanks!

// GC Nepotism:
// save dragging headNext into old generation
// (although being a PoolThreadNode it won't make a big difference)
nextPoolThreadNode.setNextRelaxed(null);
Contributor

Am I correct in understanding that this is presumed safe because nextPoolThreadNode has not yet been published?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

Exactly, and because nextPoolThreadNode (if not published) is still owned by a single thread (the consumer thread itself), setNextRelaxed is fine: no other thread would ever be able to read nextPoolThreadNode::next while this is happening.

Contributor
@dmlloyd dmlloyd left a comment

OK, I think it looks good. Any closing remarks or final testing before I merge?

@franz1981
Collaborator Author

franz1981 commented Apr 22, 2020

@dmlloyd I'm concerned about the assert in #74 (comment) that's failing on master pre-lock-removal too. Is the assertion still valid?

@dmlloyd
Contributor

dmlloyd commented Apr 22, 2020

i.e. it's failing on assert tail instanceof TaskNode && tail.task == null; specifically on the tail.task == null check.
Is that assertion still valid?

The assertion is based on tail being a dead task node. I'd say the assertion is still valid. If tail is not a dead task, how does it come to pass that the task was not executed by the current thread?

@franz1981
Collaborator Author

franz1981 commented Apr 22, 2020

@dmlloyd

The assertion is based on tail being a dead task node. I'd say the assertion is still valid. If tail is not a dead task, how does it come to pass that the task was not executed by the current thread?

Agreed: in this case, the test I've written was failing on master pre-TAIL_LOCK removal (with System.setProperty("jboss.threads.eqe.tail-lock", Boolean.FALSE.toString())), is failing on master now, and is failing on this PR as well...

@carterkozak
Contributor

This may be naive, but isn't it possible that tail is destined to become a dead task node, but getAndClearTask hasn't run yet? In that case I think head may have progressed past tail, but it should self-correct.

Aside from this assert, I'm curious why TaskNode.task is volatile.

@dmlloyd
Contributor

dmlloyd commented Apr 22, 2020

IIRC it was originally volatile because that was the means of publishing the task from the producer to the consumer. But maybe it no longer needs to be; if someone has time to do the requisite analysis I'd be open to the change.

@carterkozak
Contributor

Running the gcNepotism test on 3.1.0.Final I'm getting assertion errors with and without locks (modified to allow the test to run with tail-lock enabled), and with and without park-spins. The failure rate isn't quite as high as without locks, but it still fails every two to three runs.

@carterkozak
Contributor

I've started a round of benchmarks; I'll post the results here in the morning.

@franz1981
Collaborator Author

franz1981 commented Apr 23, 2020

@carterkozak Re the assertion error: do you think it's worth investigating and fixing, even though it has been there for some time?

I suggest running the benchmark with the CondCardMark JVM arg on/off.

@carterkozak
Contributor

carterkozak commented Apr 23, 2020

Full suite of benchmarks for raw comparison (still using UseParallelOldGC):

https://gist.github.com/carterkozak/0f265ca33bd668b4f1641da362f613dd

Using -XX:+UseG1GC -XX:-UseCondCardMark:
This PR:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score   Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  177.902 ± 3.426  us/op

Master:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score   Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  175.534 ± 5.646  us/op

Using -XX:+UseG1GC -XX:+UseCondCardMark:

This PR:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score    Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  178.436 ± 10.158  us/op

Master:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score    Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  176.804 ± 13.321  us/op

There's not much difference in the values, but our tasks run very quickly and nodes die young for the most part.

@franz1981
Copy link
Collaborator Author

franz1981 commented Apr 23, 2020

@carterkozak Parallel GC needs UseCondCardMark to perform well on high-core machines; G1GC already has conditional card marking by default. Regardless of tasks getting old in human terms, it mostly depends on the (young) heap size, which affects the promotion rate. Probably worth using the GC profiler (-prof gc) on the bench, IMO.
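
For example, assuming the benchmarks are packaged as the usual JMH uber-jar (the path and GC flags below are placeholders to adapt):

    java -jar target/benchmarks.jar ExecutorBenchmarks.benchmarkSubmit -prof gc -jvmArgs "-XX:+UseParallelOldGC -XX:+UseCondCardMark"
    java -jar target/benchmarks.jar ExecutorBenchmarks.benchmarkSubmit -prof gc -jvmArgs "-XX:+UseParallelOldGC -XX:-UseCondCardMark"

The gc profiler reports allocation and churn rates per benchmark, which should make any extra promotion caused by nepotism visible.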

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Just want to note that the assertion problem does require a solution. I'm looking into it, but you are all welcome to look as well (chances are it will take me longer to find the problem anyway).

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Well, I solved the assert mystery. It is indeed invalid (see #75 for the explanation).

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Any other remarks before I merge?

@carterkozak
Contributor

Looks great, thanks!

I've opened a PR to address TaskNode.task volatility (#76), describing why I believe it's a safe change.
