
Save the linked list of the executor from suffering GC nepotism #74

Merged: 1 commit, Apr 23, 2020

Conversation

franz1981
Collaborator

@franz1981 franz1981 commented Apr 21, 2020

This has turned out to be a relevant GC issue that can cause several problems (including OOM) in many linked-list implementations.
For reference:

I've solved the problem in the same way CLQ explains here: https://github.com/AdoptOpenJDK/openjdk-jdk14/blob/f7165c322a6abefa34afa9eeb76dbd56f70aae09/src/java.base/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L149

We use the trick of
linking a Node that has just been dequeued to itself.  Such a
self-link implicitly means to advance to head.

In addition, I see that this comment gives an additional hint on how to handle, on the producer side, the case where tail.next links back to tail (i.e. the self-link that implicitly means to advance to head):

  • if tail is unchanged (from the last read) it means that head is the most up-to-date node to append to and we can use it
  • if tail is changed (from the last read) it just means that we've fallen behind and can use the new tail again

I haven't implemented this strategy, but I've let it re-attempt reading this.tail if a consumer has passed over a producer.
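
For anyone not familiar with the trick, here is a minimal, self-contained sketch of a Michael-Scott style queue that self-links dequeued nodes; all names are hypothetical and this is not the EnhancedQueueExecutor code, just an illustration of the mechanism being discussed:

    // Hypothetical sketch of the self-link trick; NOT the jboss-threads code.
    import java.util.concurrent.atomic.AtomicReference;

    final class SelfLinkQueueSketch<T> {
        static final class Node<T> {
            T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>();
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head;
        private final AtomicReference<Node<T>> tail;

        SelfLinkQueueSketch() {
            Node<T> dummy = new Node<>(null);
            head = new AtomicReference<>(dummy);
            tail = new AtomicReference<>(dummy);
        }

        void offer(T value) {
            Node<T> node = new Node<>(value);
            for (;;) {
                Node<T> t = tail.get();
                Node<T> next = t.next.get();
                if (next == t) {
                    // t was dequeued and self-linked: the current head is a better
                    // starting point, because a node is never self-linked while it
                    // is still the head
                    tail.compareAndSet(t, head.get());
                } else if (next != null) {
                    tail.compareAndSet(t, next); // help advance a lagging tail
                } else if (t.next.compareAndSet(null, node)) {
                    tail.compareAndSet(t, node); // best effort, tail may lag by one
                    return;
                }
            }
        }

        T poll() {
            for (;;) {
                Node<T> h = head.get();
                Node<T> next = h.next.get();
                if (next == h) {
                    continue; // h was already dequeued and self-linked: re-read head
                }
                if (next == null) {
                    return null; // queue looks empty
                }
                if (head.compareAndSet(h, next)) {
                    // GC nepotism: self-link the dequeued node so that, once promoted,
                    // it no longer keeps its (possibly young) successor reachable
                    h.next.lazySet(h);
                    T value = next.value;
                    next.value = null;
                    return value;
                }
            }
        }
    }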

@franz1981 franz1981 force-pushed the gc_nepotism branch 2 times, most recently from 694da73 to acc2b67 on April 21, 2020 07:00
@franz1981
Collaborator Author

franz1981 commented Apr 21, 2020

@dmlloyd @carterkozak
I was sure I had broken the current logic with this change (maybe I have done it anyway, eh :P), but it seems I'm able to make master fail with this test (on my 4-core laptop with HT):

    @Test
    public void gcNepotism() throws ExecutionException, InterruptedException {
        System.setProperty("jboss.threads.eqe.park-spins", Long.toString(Integer.MAX_VALUE));
        System.setProperty("jboss.threads.eqe.tail-lock", Boolean.FALSE.toString());
        Assume.assumeFalse(EnhancedQueueExecutor.TAIL_LOCK);
        EnhancedQueueExecutor executor = (new EnhancedQueueExecutor.Builder())
                .setCorePoolSize(10)
                .setMaximumPoolSize(20)
                .build();
        final int producers = 10;
        final int total = 100_000;
        final CountDownLatch totalExecution = new CountDownLatch(total * producers);
        ExecutorService service = Executors.newFixedThreadPool(producers);
        Future<?>[] tasks = new Future[producers];
        for (int i = 0; i < producers; i++) {
            tasks[i] = service.submit(() -> {
                for (int j = 0; j < total; j++) {
                    executor.execute(totalExecution::countDown);
                }
            });
        }
        for (int i = 0; i < producers; i++) {
            tasks[i].get();
        }
        Assert.assertTrue(totalExecution.await(30, TimeUnit.SECONDS));
        service.shutdown();
        executor.shutdown();
    }

On master I'm getting:

java.util.concurrent.ExecutionException: java.lang.AssertionError

	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.jboss.threads.EnhancedQueueExecutorTest.gcNepotism(EnhancedQueueExecutorTest.java:181)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
Caused by: java.lang.AssertionError
	at org.jboss.threads.EnhancedQueueExecutor.tryExecute(EnhancedQueueExecutor.java:1740)
	at org.jboss.threads.EnhancedQueueExecutor.execute(EnhancedQueueExecutor.java:755)
	at org.jboss.threads.EnhancedQueueExecutorTest.lambda$gcNepotism$0(EnhancedQueueExecutorTest.java:176)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

i.e. it's failing on assert tail instanceof TaskNode && tail.task == null; specifically on the tail.task == null check.
Is that assertion still valid?

// GC Nepotism:
// We can save consumerNode::next from being dragged into
// old generation, if possible
consumerNode.compareAndSetNext(tailNextNext, null);
Collaborator Author

I've used compareAndSetNext because I'm not sure whether this case is single-writer; maybe it is.

Contributor

Maybe we should do a weakCompareAndSet here to avoid the heavier memory barrier?

Collaborator Author

IMO weakCompareAndSet is safe, but it will probably make a difference only on ARM/PowerPC servers: AFAIK newish x86 won't get any specific code path in OpenJDK for this (but I could be wrong).

Contributor

OK. Well, we do care about ARM and AArch64, so I think that makes it worthwhile.

@franz1981
Collaborator Author

If I need to provide a test to make the GC nepotism problem evident, I can do that as well :+1:

@franz1981 franz1981 force-pushed the gc_nepotism branch 4 times, most recently from fafc219 to db0355c on April 21, 2020 17:25
TaskNode taskNode = (TaskNode) headNext;
if (compareAndSetHead(head, taskNode)) {
// save from GC nepotism
head.setNextRelaxed(head);
Contributor

Can you prove that this can never cause nodes to be skipped?

Collaborator Author

Good point: looking at the original code, I see that it drains TaskNodes until tail::getNext is null after a shutdown; I can preserve the original logic to handle the case of a concurrent call to it.

Collaborator Author

I've changed the visibility guarantees while setting next too, taking "inspiration" from CLQ:
https://github.com/openjdk-mirror/jdk7u-jdk/blob/8aaca72128077594ecf4ecdee761e78d4ecb8b7d/src/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L305

    final void updateHead(Node<E> h, Node<E> p) {
        if (h != p && casHead(h, p))
            h.lazySetNext(h);
    }

// cross-generational references, so better to "clean-up" head::next
// to save dragging head::next into the old generation.
// Clean-up cannot just null out next
head.setNextOrdered(head);
Contributor

I need to see a correctness proof here. What are the possible observable output states, and why? Are the states all valid and already described above? What are the preconditions and postconditions?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

It's mostly expressed in terms of acquire-release relationships between atomic operations:

  • only one thread can win compareAndSetHead(head, taskNode) for a specific head, taskNode pair
  • other threads can reach head while its head::next holds either the value from before or from after head.setNextOrdered
  1. "slow" producers (i.e. producers descheduled after reading this.tail for long enough that other producers have moved on and a consumer has already moved head):
  • before head.setNextOrdered(head): they try compareAndSetNext(null, newValue) and fail, because the consumer could only have moved past the node if next != null
  • after head.setNextOrdered(head): they try compareAndSetNext(null, newValue) and fail, because next == head, which is != null
  2. slow consumers (i.e. consumers that have been descheduled after reading this.head and are about to read head::getNext):
  • before head.setNextOrdered(head): they see the same head::next value as the consumer that won casHead, hence they attempt casHead and fail, because head has already moved on
  • after head.setNextOrdered(head): they see the new value (head::next == head) and re-attempt reading this.head. Given that head.setNextOrdered releases any previous write (including the winning casHead), loading head::next and finding it == head acquires the new value of head, hence reading head will return at least the value released by the consumer that won casHead

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

The states already written in the overview of the class should still be valid, because I've tried hard not to break the way we use null, and only to add checks so that head::next == head means that "something" has changed: in particular head and, as a consequence, tail.

Contributor

OK I think I understand all the possibilities here. But couldn't this be setNextRelaxed? We're not using node.next == node as the solitary indicator that it has been removed, so it's probably OK if some threads don't observe that (they would just spin past the node using the current behavior). Does it make a difference performance-wise?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

Does it make a difference performance-wise?

setNextOrdered on x86 isn't using any specific hardware barrier, thanks to the strong memory model of x86; the only perf hit would have come from using "updaters" (often negligible).

From a correctness PoV, yes: the only thing we care about here is atomicity, but sadly Unsafe doesn't provide the right method to get just that: ideally what we need here is https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/VarHandle.AccessMode.html#SET_OPAQUE , because the compareAndSet preceding it already provides the sequentially consistent write we need.
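
For illustration only, on JDK 9+ the semantics being discussed map onto VarHandle roughly like this (a hypothetical NodeSketch holder; jboss-threads itself stays on Unsafe-based accessors, so this is a sketch of setOpaque / weak CAS, not a proposed change):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    final class NodeSketch {
        volatile NodeSketch next;

        private static final VarHandle NEXT;
        static {
            try {
                NEXT = MethodHandles.lookup().findVarHandle(NodeSketch.class, "next", NodeSketch.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void setNextOpaque(NodeSketch value) {
            // atomic (no word tearing) per-variable write, without the ordering
            // guarantees (and cost) of a release or volatile store
            NEXT.setOpaque(this, value);
        }

        boolean weakCasNext(NodeSketch expect, NodeSketch update) {
            // may fail spuriously; with plain memory ordering it can be cheaper
            // than a full CAS on LL/SC architectures such as ARM/PowerPC
            return NEXT.weakCompareAndSetPlain(this, expect, update);
        }
    }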

Contributor

Is this safe in cases where EQE.head passes EQE.tail?

The tail reference may lag (currently by only one node, but the potential jdk14 CLQ optimization could allow more lag), which I think allows EQE.head to be EQE.tail.next. Clearing references may unlink tail from head and orphan submitted tasks.

It's very possible that I've missed an important piece here, I'm not confident that I've gotten this right.

Collaborator Author

It shouldn't, AFAIK.
A producer does prev_tail::casNext(null, new_tail) and then casTail(prev_tail, new_tail), while a consumer can consume only prev_tail, not new_tail (yet).
Given that other producers are able to read this.tail (i.e. new_tail), and eventually this.tail == new_tail, others are always able to append new elements without losing anything along the way.

Collaborator Author
@franz1981 franz1981 Apr 24, 2020

@carterkozak Thinking about it again and again, we're probably "lucky" that we cannot lag by more than 1, but if we use the same jdk 14 optimization (as you've suggested) we would need, on the producer side, to chase the last tail starting from head (which in theory should be near(ish)), given that head has passed the lagging tail, whenever node.next == node (see https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/util/concurrent/ConcurrentLinkedQueue.java#L348).
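
As a sketch of what that producer-side chase could look like (hypothetical AtomicReference-based nodes, mirroring CLQ's offer loop rather than the EQE code):

    import java.util.concurrent.atomic.AtomicReference;

    final class TailChaseSketch {
        static final class Node {
            final AtomicReference<Node> next = new AtomicReference<>();
        }

        // Appends node after the real last element, tolerating a lagging tail.
        static void append(AtomicReference<Node> head, AtomicReference<Node> tail, Node node) {
            Node p = tail.get();
            for (;;) {
                Node q = p.next.get();
                if (q == null) {
                    if (p.next.compareAndSet(null, node)) {
                        tail.compareAndSet(p, node); // best-effort tail update
                        return;
                    }
                    // lost the race to another producer: re-read p.next and walk on
                } else if (q == p) {
                    // p was dequeued and self-linked: tail has lagged behind head,
                    // so restart the chase from head, which is never self-linked
                    // while it is still the current head
                    p = head.get();
                } else {
                    p = q; // keep walking towards the real last node
                }
            }
        }
    }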

@nitsanw nitsanw Apr 24, 2020

From a correctness PoV, yes: the only thing we care about here is atomicity

@franz1981 you can use a plain write in this case, since reference writes are always atomic IIRC

Collaborator Author
@franz1981 franz1981 Apr 24, 2020

@nitsanw thanks, yep, the JLS should ensure that reference writes work fine here: no tearing on the reference value indeed, thanks!

// GC Nepotism:
// save dragging headNext into old generation
// (although being a PoolThreadNode it won't make a big difference)
nextPoolThreadNode.setNextRelaxed(null);
Contributor

Am I correct in understanding that this is presumed safe because nextPoolThreadNode has not yet been published?

Collaborator Author
@franz1981 franz1981 Apr 22, 2020

Exactly, and because nextPoolThreadNode (if not published) is still owned by a single thread (the consumer thread itself), setNextRelaxed is fine: no other thread would ever be able to read nextPoolThreadNode::next while this is happening.

Contributor
@dmlloyd dmlloyd left a comment

OK, I think it looks good. Any closing remarks or final testing before I merge?

@franz1981
Collaborator Author

franz1981 commented Apr 22, 2020

@dmlloyd I'm concerned about the assert in #74 (comment) that's failing on master pre-lock-removal too. Is the assertion still valid?

@dmlloyd
Contributor

dmlloyd commented Apr 22, 2020

i.e. it's failing on assert tail instanceof TaskNode && tail.task == null; specifically on the tail.task == null check.
Is that assertion still valid?

The assertion is based on tail being a dead task node. I'd say the assertion is still valid. If tail is not a dead task, how does it come to pass that the task was not executed by the current thread?

@franz1981
Collaborator Author

franz1981 commented Apr 22, 2020

@dmlloyd

The assertion is based on tail being a dead task node. I'd say the assertion is still valid. If tail is not a dead task, how does it come to pass that the task was not executed by the current thread?

Agreed: in this case, the test I've written was failing on master pre-TAIL_LOCK removal (with System.setProperty("jboss.threads.eqe.tail-lock", Boolean.FALSE.toString())), is failing on master now, and is failing on this PR as well...

@carterkozak
Contributor

This may be naive, but isn't it possible that tail is destined to become a dead task node, but getAndClearTask hasn't run yet? In that case I think head may have progressed past tail, but it should self-correct.

Aside from this assert, I'm curious why TaskNode.task is volatile.

@dmlloyd
Contributor

dmlloyd commented Apr 22, 2020

IIRC it was originally volatile because that was the means of publishing the task from the producer to the consumer. But maybe it no longer needs to be; if someone has time to do the requisite analysis I'd be open to the change.

@carterkozak
Contributor

Running the gcNepotism test on 3.1.0.Final I'm getting assertion errors with and without locks (modified to allow the test to run with tail-lock enabled), and with and without park-spins. The failure rate isn't quite as high as without locks, but it still fails every two to three runs.

@carterkozak
Contributor

I've started a round of benchmarks; I'll post the results here in the morning.

@franz1981
Collaborator Author

franz1981 commented Apr 23, 2020

@carterkozak Re the assertion error: do you think it's worth investigating and fixing, even though it has been there for some time?

I suggest running the benchmark with the CondCardMark JVM arg on/off.

@carterkozak
Contributor

carterkozak commented Apr 23, 2020

Full suite of benchmarks for raw comparison (still using UseParallelOldGC):

https://gist.github.com/carterkozak/0f265ca33bd668b4f1641da362f613dd

Using -XX:+UseG1GC -XX:-UseCondCardMark:
This PR:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score   Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  177.902 ± 3.426  us/op

Master:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score   Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  175.534 ± 5.646  us/op

Using -XX:+UseG1GC -XX:+UseCondCardMark:

This PR:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score    Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  178.436 ± 10.158  us/op

Master:

Benchmark                           (burstSize)  (cores)  (executorThreads)     (factory)  (spinWaitCompletion)  Mode  Cnt    Score    Error  Units
ExecutorBenchmarks.benchmarkSubmit           10       28              16,16  EQE_NO_LOCKS                 false  avgt    4  176.804 ± 13.321  us/op

There's not much difference in the values, but our tasks run very quickly and nodes die young for the most part.

@franz1981
Copy link
Collaborator Author

franz1981 commented Apr 23, 2020

@carterkozak Parallel GC needs UseCondCardMark to perform well on high-core machines; G1GC already has conditional card marking by default. Regardless of tasks getting old in human terms, it mostly depends on the (young) heap size, which affects the promotion rate. Probably worth using the GC profiler (-prof gc) on the bench, IMO.
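
For example, assuming the benchmarks are packaged as the usual JMH uber-jar (the path and GC flags below are placeholders to adapt):

    java -jar target/benchmarks.jar ExecutorBenchmarks.benchmarkSubmit -prof gc -jvmArgs "-XX:+UseParallelOldGC -XX:+UseCondCardMark"
    java -jar target/benchmarks.jar ExecutorBenchmarks.benchmarkSubmit -prof gc -jvmArgs "-XX:+UseParallelOldGC -XX:-UseCondCardMark"

The gc profiler reports allocation and churn rates per benchmark, which should make any extra promotion caused by nepotism visible.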

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Just want to note that the assertion problem does require a solution. I'm looking into it, but you are all welcome to look as well (chances are it will take me longer to find the problem anyway).

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Well, I solved the assert mystery. It is indeed invalid (see #75 for the explanation).

@dmlloyd
Contributor

dmlloyd commented Apr 23, 2020

Any other remarks before I merge?

@carterkozak
Contributor

Looks great, thanks!

I've opened a PR to address TaskNode.task volatility (#76), describing why I believe it's a safe change.
