Attempting to fix failures due to LCM/GCM flags #3094

krmahadevan · 2024-03-17T13:26:35Z

Fixes # .

Did you remember to?

Add test case(s)
Update CHANGES.txt
Auto applied styling via ./gradlew autostyleApply

We encourage pull requests that:

Add new features to TestNG (or)
Fix bugs in TestNG

If your pull request involves fixing SonarQube issues then we would suggest that you please discuss this with the
TestNG-dev before you spend time working on it.

Note: For more information on contribution guidelines, please refer to our Contributing section for a detailed set of steps.

coderabbitai · 2024-03-17T13:26:39Z

Important

Auto Review Skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

krmahadevan · 2024-03-18T07:21:18Z

@juherr @vlsi - I need help figuring out where to start looking in order to fix the failures. They are obviously flaky tests, which seems to suggest that either the functionality has a bug that's not obvious to the naked eye or the test needs to be refactored.

The failures seem to be tied to the StressLCM and StressGCM flags on the JVM. I am also having a hard time figuring out how to reproduce the issues locally.

For example, the build suggests that I can use the command below to reproduce the problem

TZ="Pacific/Chatham" ./gradlew build -Duser.country=FR -Duser.language=fr \
-Ptestng.test.extra.jvmargs="-Duser.country=FR -Duser.language=fr -XX:+UnlockDiagnosticVMOptions -XX:+StressGCM -XX:+StressLCM" \
-PtestDisableCaching="JIT randomization should not be cached"

But running this locally on my Mac, with the same JDK (I used sdkman to switch between JDK flavours) is not yielding the same failure and the build passes.

Action runs logs available at https://github.com/testng-team/testng/actions/runs/8321962523/job/22768998250?pr=3094

The test that I am trying to fix: https://scans.gradle.com/s/ibv2y77of77xc/tests/task/:testng-core:test/details/test.listeners.ListenersTest/ensureOrderingForTestListenersViaApi?top-execution=1

Obviously, this test runs fine without the flags.

vlsi · 2024-03-18T07:39:07Z

It turns out there's a StressSeed=N parameter that makes StressGCM/StressLCM repeatable: https://bugs.openjdk.org/browse/JDK-8253765. However, the option was only added in Java 16.

So the options could be:
a) Try running the test multiple times
b) Try searching for global shared fields with "listeners". It might be that there's a global static field that gets overwritten when tests execute concurrently
c) Try adding StressSeed. However, it would be Java 16+ only, and it would probably require the same architecture. In other words, if your machine is ARM and CI uses Intel, then you might be unable to reproduce the failure exactly. It is worth adding StressSeed, though I am not sure it would help much in this case
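
As an illustration of option (c), here is a minimal sketch of assembling the stress flags so that StressSeed is only added on JVMs that support it. The class and method names (`StressArgs`, `build`, `currentFeatureVersion`) are hypothetical and are not part of TestNG or its build; only the JVM flags themselves come from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TestNG or its Gradle build.
// It assembles the C2 stress flags discussed in this thread.
final class StressArgs {
  private StressArgs() {}

  // featureVersion is the Java major version (8, 11, 17, ...).
  static List<String> build(int featureVersion, long seed) {
    List<String> args = new ArrayList<>();
    args.add("-XX:+UnlockDiagnosticVMOptions");
    args.add("-XX:+StressGCM");
    args.add("-XX:+StressLCM");
    if (featureVersion >= 16) {
      // StressSeed exists from Java 16 on (JDK-8253765). By default, older
      // JVMs refuse to start when given an unrecognized -XX option.
      args.add("-XX:StressSeed=" + seed);
    }
    return args;
  }

  // Reads the major version from a standard system property ("1.8" or "17").
  static int currentFeatureVersion() {
    String spec = System.getProperty("java.specification.version");
    if (spec.startsWith("1.")) {
      spec = spec.substring(2);
    }
    return Integer.parseInt(spec);
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", build(currentFeatureVersion(), 10)));
  }
}
```

Even with a fixed seed, the caveat above still applies: a different CPU architecture or JDK build may compile the code differently, so the same seed may not reproduce the same failure.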

krmahadevan · 2024-03-18T07:42:52Z

@vlsi - Thank you so much for responding. I am still on the Intel architecture :) So that part can be ruled out. Let me see if I can bump up to JDK17 and use StressSeed

krmahadevan · 2024-03-18T14:36:58Z

@vlsi - I tried all 3 options but I am still not able to reproduce the issue locally. I even went back to check whether anything stands out in this PR, which introduced the test. Nothing there either.

I have additionally tried to create a main() method that looks like below

  // Uses java.lang.management.ManagementFactory and java.util.concurrent.TimeUnit
  public static void main(String[] args) throws InterruptedException {
    // Print the JVM flags so that the run confirms the stress options are active
    String arguments = String.join(" ", ManagementFactory.getRuntimeMXBean().getInputArguments());
    System.err.println("Arguments = " + arguments);
    ListenersTest object = new ListenersTest();
    for (int i = 1; i <= 10000; i++) {
      object.ensureOrderingForTestListenersViaApi();
      TimeUnit.MILLISECONDS.sleep(500);
    }
  }

and have executed this with the JVM flags

-Duser.country=FR -Duser.language=fr -XX:+UnlockDiagnosticVMOptions -XX:+StressGCM -XX:StressSeed=10 -XX:+StressLCM

Still no luck. It feels like I am running around in circles :( Any other pointers on how to reproduce the issue? In the meantime, I will try to refactor the tests to see if they can capture the problem properly. I have a vague feeling that the test that is expected to time out doesn't do so (at least that is what I am inferring from the failures)

vlsi · 2024-03-18T15:46:27Z

InvokedMethodListenerHolder and SuiteListenerHolder look suspicious as they contain static shared LOGS list that might be modified from multiple concurrent threads. Have you checked race conditions for those lists? Can the lists be made non-static? (==avoid sharing the lists between tests)
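
To make the suggestion concrete, here is a minimal sketch of a listener-style class that keeps its log in an instance field backed by a synchronized list, so that nothing is shared between tests and concurrent writes are not lost. `RecordingListener` and `RecordingListenerDemo` are hypothetical stand-ins, not the actual InvokedMethodListenerHolder or SuiteListenerHolder classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a listener that records callbacks.
final class RecordingListener {
  // Instance field: each listener, and so each test, owns its own log.
  private final List<String> logs = Collections.synchronizedList(new ArrayList<>());

  void record(String event) {
    logs.add(event);
  }

  // Copy under the list's lock so that callers get a stable view.
  List<String> snapshot() {
    synchronized (logs) {
      return new ArrayList<>(logs);
    }
  }
}

final class RecordingListenerDemo {
  private RecordingListenerDemo() {}

  // Writes to one listener from several threads and returns the number of entries.
  static int recordConcurrently(RecordingListener listener, int threads, int perThread) {
    CountDownLatch start = new CountDownLatch(1);
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      final int id = t;
      workers[t] = new Thread(() -> {
        try {
          start.await(); // release all writers at once to maximize contention
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        for (int i = 0; i < perThread; i++) {
          listener.record("thread-" + id + "-event-" + i);
        }
      });
      workers[t].start();
    }
    start.countDown();
    try {
      for (Thread worker : workers) {
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for writers", e);
    }
    return listener.snapshot().size();
  }

  public static void main(String[] args) {
    RecordingListener first = new RecordingListener();
    RecordingListener second = new RecordingListener();
    System.out.println(recordConcurrently(first, 8, 1000)); // prints 8000
    System.out.println(second.snapshot().isEmpty());        // prints true
  }
}
```

With a plain static ArrayList in place of the instance field, the same writers could lose entries or corrupt the list, and two tests would see each other's events.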

krmahadevan · 2024-03-18T16:02:57Z

InvokedMethodListenerHolder and SuiteListenerHolder look suspicious as they contain static shared LOGS list that might be modified from multiple concurrent threads. Have you checked race conditions for those lists? Can the lists be made non-static? (==avoid sharing the lists between tests)

Yes, that's the part that I was thinking of refactoring, because that's the only place which looks suspicious to me as well. Am glad that I was thinking in the same direction. Let me try and refactor the tests and see if that helps. Thanks for taking a closer look and helping out @vlsi

vlsi · 2024-03-20T07:13:27Z

testng-core/src/test/java/test/listeners/issue2916/LogContainer.java

+import org.testng.internal.AutoCloseableLock;
+
+public enum LogContainer {
+  instance;


A single instance per JVM would still have issues with concurrent test execution. For instance, if multiple tests execute at the same time, they would log into the same LogContainer and thus corrupt each other's logs
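
One possible way around the shared-instance problem, sketched here under the assumption that each test can supply a unique key, is to partition a JVM-wide container by that key. `KeyedLogContainer` and the key names are illustrative only and do not appear in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical alternative to a single shared log; names are illustrative only.
final class KeyedLogContainer {
  private static final Map<String, Queue<String>> LOGS = new ConcurrentHashMap<>();

  private KeyedLogContainer() {}

  // Appends a message to the log that belongs to one test.
  static void log(String testKey, String message) {
    LOGS.computeIfAbsent(testKey, k -> new ConcurrentLinkedQueue<>()).add(message);
  }

  // Returns and removes the entries for one test, so reruns start clean.
  // Intended to be called only after that test has finished logging.
  static List<String> drain(String testKey) {
    Queue<String> queue = LOGS.remove(testKey);
    return queue == null ? new ArrayList<>() : new ArrayList<>(queue);
  }

  public static void main(String[] args) {
    log("apiMode", "MasterOogway.onTestSuccess");
    log("cliMode", "MasterShifu.onTestSuccess");
    System.out.println(drain("apiMode")); // prints [MasterOogway.onTestSuccess]
    System.out.println(drain("apiMode")); // prints []
  }
}
```

This keeps the static entry point that reflection-created listeners would need, while tests running at the same time write to separate queues.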

True that. Some additional context:

The tests are not running in parallel.

The tests in question basically work in two modes: API mode, wherein we feed in the listener instance, and CLI mode, wherein the listener instances are created by TestNG using reflection.

When working with CLI mode, I am not sure as to how to access the listener instance that TestNG created so that I can access the logs captured by the test.

I seem to be seeing a pattern in the failure. The failure is always confined to the tests with timeouts. When we run the test with the StressLCM/StressGCM flags, the tests don't seem to be timing out, but instead they are running to completion as seen by the below output

but some elements were not found:
  ["DragonWarrior.onTestFailedWithTimeout",
   "MasterShifu.onTestFailedWithTimeout",
   "MasterOogway.onTestFailedWithTimeout"]
and others were not expected:
  ["DragonWarrior.onTestSuccess",
   "MasterShifu.onTestSuccess",
   "MasterOogway.onTestSuccess"]

This makes me think that there's something wrong with the codebase when it comes to running tests that time out, but I am not able to figure out what, since I don't know what to look for. I am looking for static state (non-existent). The shared state is being tracked via thread-safe variants (Atomic*).

Any other pointers you can suggest?

This output confirms my hunch. When running with the stress flags, a test that is supposed to time out does not time out. I am not sure what would cause that, because the code paths related to timeouts don't have any global variables etc.

@juherr @vlsi - I have hit a dead end with this. I need some help so that I can proceed further.

When running with the stress flags, a test that is supposed to be timing out, does not time out

What is the test that should time out and fails to do so?
Have you investigated if the timeout logic is implemented properly? It might be there is a concurrency issue within the timeout implementation.

@vlsi - It's a very simple test that uses the timeout feature that TestNG provides.

This is what the test looks like

  @Test(timeOut = 25, priority = 4)
  public void timingOutTest() throws InterruptedException {
    TimeUnit.MILLISECONDS.sleep(100);
    new Throwable().printStackTrace();
    if (counter != 3) {
      Assert.fail();
    }
  }

The test passes consistently when I don't use those flags. That is what makes me believe that when those flags are enabled, TestNG's approach to running these tests has a problem. But no matter how many times I have skimmed through the code, I am still not able to identify the problem :(

This is where TestNG is basically running those tests driven/controlled by timeouts.

Well, having a timeout of 25ms vs a sleep of 100ms might be too optimistic. In other words, "sleep 100ms" within the test might be too short to "guarantee" that the interrupt logic has enough time to execute.

Could you try increasing sleep to 10sec?
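
The effect of the margin can be sketched with a generic timeout harness built on `ExecutorService` and `Future.get`. This is an assumption made for illustration, not TestNG's actual timeout implementation, and `TimeoutSketch` is a hypothetical name.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Generic timeout harness for illustration only; not TestNG's implementation.
final class TimeoutSketch {
  private TimeoutSketch() {}

  // Returns true if the body was still running when the timeout expired.
  static boolean timesOut(long timeoutMillis, long bodySleepMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> future = executor.submit(() -> {
      try {
        TimeUnit.MILLISECONDS.sleep(bodySleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // cancelled by the harness
      }
    });
    try {
      // If the waiting thread is delayed for longer than the body runs,
      // the body is already done here and no timeout is reported.
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return false;
    } catch (TimeoutException e) {
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException e) {
      return false;
    } finally {
      future.cancel(true); // interrupt the body so the sketch exits quickly
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // A 10 second body against a 25ms timeout leaves a wide margin.
    System.out.println(timesOut(25, 10_000)); // prints true
  }
}
```

With a 100ms body, a stall of a few tens of milliseconds on the waiting side (plausible when the JIT is under stress) may be enough for the body to finish first, which would match the onTestSuccess callbacks seen in the failures.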

@vlsi - Lol.. I should have started off with this as the question. It could have at least saved me a day's time. Bumping up the sleep time worked. I am now going to create a new PR from master again, with just the sleep bumped, to see if the fix is consistent. If it is, then I think I will stick with just that change instead of this elaborate change, which still doesn't guarantee thread safety when the tests are run in parallel (which you also rightly called out).

Because there's reflection and since there's no straightforward way of retrieving all the logs from the listeners (especially when the test invokes TestNG via the CLI), I will want the changeset to be as little as possible.

Thank you so much @vlsi for spending time and helping me out.

krmahadevan · 2024-03-21T09:46:34Z

Closing this PR since I have now raised a new PR #3098

krmahadevan force-pushed the fix_failures_related_to_lcm_gcm branch from e1fe4cb to a5fcbb4 Compare March 18, 2024 05:58

Attempting to fix failures due to LCM/GCM flags

0b09262

krmahadevan force-pushed the fix_failures_related_to_lcm_gcm branch from a5fcbb4 to 0b09262 Compare March 20, 2024 04:05

vlsi reviewed Mar 20, 2024

krmahadevan force-pushed the fix_failures_related_to_lcm_gcm branch from 23c3720 to c695b4d Compare March 20, 2024 15:38

Adding some debug statements

e5e22a7

krmahadevan force-pushed the fix_failures_related_to_lcm_gcm branch from c695b4d to e5e22a7 Compare March 20, 2024 16:18

Bump the wait to 10 seconds

16a7b32

krmahadevan closed this Mar 21, 2024

krmahadevan deleted the fix_failures_related_to_lcm_gcm branch March 21, 2024 09:46
