-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempting to fix failures due to LCM/GCM flags #3094
Attempting to fix failures due to LCM/GCM flags #3094
Conversation
Important Auto Review SkippedDraft detected. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
e1fe4cb
to
a5fcbb4
Compare
@juherr @vlsi - I need help with figuring out where to start looking in terms of fixing the failures. They are obviously flaky tests which seems to suggest either the functionality has a bug that's not obvious to the naked eye or perhaps the test needs to be refactored. The failures seem to be tied to the For e.g., the build suggests that I can use the below command to reproduce the problem TZ="Pacific/Chatham" ./gradlew build -Duser.country=FR -Duser.language=fr \
-Ptestng.test.extra.jvmargs="-Duser.country=FR -Duser.language=fr -XX:+UnlockDiagnosticVMOptions -XX:+StressGCM -XX:+StressLCM" \
-PtestDisableCaching="JIT randomization should not be cached" But running this locally on my Mac, with the same JDK (I used Action runs logs available at https://github.com/testng-team/testng/actions/runs/8321962523/job/22768998250?pr=3094 The test that am trying to fix https://scans.gradle.com/s/ibv2y77of77xc/tests/task/:testng-core:test/details/test.listeners.ListenersTest/ensureOrderingForTestListenersViaApi?top-execution=1 As obvious, this test runs fine without the flags. |
It turns out there's So the options could be: |
@vlsi - Thank you so much for responding back. I am still on the Intel architecture :) So that part can be excluded out. Let me see if I can bump up to JDK17 and use |
@vlsi - I tried all the 3 options but still not able to reproduce the issue locally. I even went back to checking if there's anything that stands out in this PR, which introduced the test, to see if anything that stands out. Nothing there as well. I have additionally tried to create a public static void main(String[] args) throws InterruptedException {
String arguments = String.join(" ", ManagementFactory.getRuntimeMXBean().getInputArguments());
System.err.println("Arguments = " + arguments);
ListenersTest object = new ListenersTest();
for (int i=1; i<= 10000; i++) {
object.ensureOrderingForTestListenersViaApi();
TimeUnit.MILLISECONDS.sleep(500);
}
} and have executed this with the JVM flags
Still no luck. It feels like I am running around in circles :( Any other pointers on how to be able to reproduce the issue? In the meantime, I will try to refactor the tests to see if it can capture the problem properly. I have a vague feel that the test that is expected to timeout, doesn't seem to do so (at-least that is what i am inferring from the failures) |
InvokedMethodListenerHolder and SuiteListenerHolder look suspicious as they contain |
Yes, that's the part that I was thinking of refactoring, because that's the only place which looks suspicious to me as well. Am glad that I was thinking in the same direction. Let me try and refactor the tests and see if that helps. Thanks for taking a closer look and helping out @vlsi |
a5fcbb4
to
0b09262
Compare
import org.testng.internal.AutoCloseableLock; | ||
|
||
public enum LogContainer { | ||
instance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single instance per JVM would still have issues with concurrent test execution. For instance, if multiple tests execute at the same time, they would log into the same LogContainer
thus they would corrupt each others' logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that. Some additional context:
- The tests are not running in parallel.
- The tests in question basically work in 2 modes [ API mode wherein we feed in the listener instance and CLI mode wherein the listener instances are created by TestNG using reflection ]
- When working with CLI mode, I am not sure as to how to access the listener instance that TestNG created so that I can access the logs captured by the test.
- I seem to be seeing a pattern in the failure. The failure is always confined to the tests with timeouts. When we run the test with the StressLCM/StressGCM flags, the tests don't seem to be timing out, but instead they are running to completion as seen by the below output
but some elements were not found: |
-- | --
| ["DragonWarrior.onTestFailedWithTimeout", |
| "MasterShifu.onTestFailedWithTimeout", |
| "MasterOogway.onTestFailedWithTimeout"] |
| and others were not expected: |
| ["DragonWarrior.onTestSuccess", |
| "MasterShifu.onTestSuccess", |
| "MasterOogway.onTestSuccess"]
This makes me think that there's something wrong with the codebase when it comes to running timing out tests, but am not able to figure out what, since I dont know what to look for. I am looking for static states (non existent). The shared states are being tracked via threadsafe variants (Atomic*).
Any other pointers you can suggest ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output confirms my hunch. When running with the stress flags, a test that is supposed to be timing out, does not time out. Not sure what will cause that because the code paths that are related to timeouts dont have any global variables etc.,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running with the stress flags, a test that is supposed to be timing out, does not time out
What is the test that should time out and fails to do so?
Have you investigated if the timeout logic is implemented properly? It might be there is a concurrency issue within the timeout implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vlsi - Its a very simple test that uses the timeout
feature that TestNG provides.
This is what the test looks like
@Test(timeOut = 25, priority = 4)
public void timingOutTest() throws InterruptedException {
TimeUnit.MILLISECONDS.sleep(100);
new Throwable().printStackTrace();
if (counter != 3) {
Assert.fail();
}
}
The test passed consistently when I don't use those flags. That is what makes me believe that when those flags are enabled, TestNG's approach of basically running these tests has a problem. But no matter how many times I have skimmed through them, I still am not able to identify them :(
This is where TestNG is basically running those tests driven/controlled by timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, having timeout of 25ms
vs sleep
of 100ms
might be too optimistic. In other words, "sleep 100ms" within the test might be too small amount of time to "guarantee" the interrupt logic would have enough time to execute.
Could you try increasing sleep
to 10sec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vlsi - Lol.. I should have started off with this as the question. It could have atleast saved me a day's time. Bumping up the sleep time worked. I am now going to create a new PR from master
again, with just the timeout bumped to see if the fix is consistent. If it is, then I think I will perhaps want to stick with just that change instead of this elaborate change which still doesn't guarantee thread safety when the tests are run in parallel (which you also rightly called out).
Because there's reflection and since there's no straightforward way of retrieving all the logs from the listeners (especially when the test invokes TestNG via the CLI), I will want the changeset to be as little as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @vlsi for spending time and helping me out.
23c3720
to
c695b4d
Compare
c695b4d
to
e5e22a7
Compare
Closing this PR since I have now raised a new PR #3098 |
Fixes # .
Did you remember to?
CHANGES.txt
./gradlew autostyleApply
We encourage pull requests that:
If your pull request involves fixing SonarQube issues then we would suggest that you please discuss this with the
TestNG-dev before you spend time working on it.
Note: For more information on contribution guidelines please make sure you refer our Contributing section for detailed set of steps.