[NativeAOT] Fix Lock's usage of NativeRuntimeEventSource.Log to account for recursive accesses during its own class construction #94873
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details
Fixes #94728
With this and #94241, what is our confidence that ...

In #94515 I used ...
When we were stabilizing the old Lock, the same question was asked. I think the same reasoning still applies: the old lock did not have this problem ... eventually. The implementation in #93879 does not seem to be affected. This is a fixable issue.
The only legitimate use of LowLevelLock in NativeAOT is in the implementation of the Wait subsystem. Since Lock uses that, we can't use Lock there.
@kouvel - is there a good understanding of the repro? It worries me a bit that the bug reproduces exclusively on linux-arm64; that was also the case with the previous bug. Relying on linux-arm64 for stress coverage might not be a good place to be. If there were a way to force this to repro on other platforms, especially win-x64, it would be very useful. I wonder what makes linux-arm64 special here?
Thanks, I sent out a PR to use that instead: #94991.
It shouldn't be linux-arm64 specific, but we definitely enable EventSource for fewer tests. EventSource is disabled by default in the shipping configuration (some workloads like ASP.NET will enable it back). The tests mirror that. Only tests that exercise EventSource enable it back (search for ...).
It's a timing issue that can lead to the stack trace in #94728 or something like it. It's not specific to a platform or architecture, and it's probably not easy to repro without inducing specific timings.
There appears to be a circular dependency: Lock does some lazy initialization that involves class construction, which in turn uses Lock for synchronization. That makes the system a bit brittle to changes. For instance, new issues were uncovered when raising events through ...
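To make the cycle concrete, here is an illustrative sketch; these are stand-in types, not the actual runtime code:

```csharp
// Illustrative sketch of the cycle described above; the member names
// are stand-ins for the real Lock and NativeRuntimeEventSource types.
internal sealed class SketchLock
{
    public void Enter()
    {
        // On contention, Lock lazily touches the event source's Log,
        // which can trigger that type's static constructor...
        SketchEventSource.Log?.ContentionStart();
        // ...then acquires the lock.
    }
}

internal sealed class SketchEventSource
{
    // ...and class construction is coordinated by the class constructor
    // runner, which itself synchronizes with Lock. If this same thread
    // is already mid-construction of this type, Log is still null here.
    public static SketchEventSource Log { get; } = new SketchEventSource();

    public void ContentionStart() { /* raise the contention event */ }
}
```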
Somehow on arm64 we start seeing contention from sockets before module initialization is complete. Not sure if that was expected by whoever put ...
I think the most reliable fix for this would be to stop calling static initialization from the Lock entirely, and make something else do the initialization instead.
In either case there is a chance of a small window of operating with uninitialized statics, but that is ok as long as the initialization is optional. The advantage of [ModuleInitializer] is being more explicit and upfront, but there could be a concern that module init would run at program startup, and the NativeRuntimeEventSource initialization could be a bit heavy (there appears to be some reflection use, type loading, attribute parsing, ...). The finalizer approach allows waiting until we are sure that we have a program that may have contentions to report.
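A hedged sketch of the two options, reusing the `SketchEventSource` stand-in from the sketch above (the real target is `NativeRuntimeEventSource`, which is internal to CoreLib):

```csharp
using System.Runtime.CompilerServices;

internal static class EagerInit
{
    // Option 1: explicit and upfront, but the (possibly heavy)
    // event source initialization cost is paid at program startup.
    [ModuleInitializer]
    internal static void Run() => _ = SketchEventSource.Log;
}

internal sealed class DeferredInit
{
    // Option 2: allocate an unrooted instance at startup; the heavy
    // initialization then runs on the finalizer thread, by which point
    // the program is doing real work that may have contention to report.
    [ModuleInitializer]
    internal static void Register() => new DeferredInit();

    ~DeferredInit() => _ = SketchEventSource.Log;
}
```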
Maybe an ordering issue with this: runtime/src/libraries/Common/tests/System/Net/QuicLoad.cs, lines 12-20 at commit 784537a.
The code looks to be Linux-only, so that would explain why we don't see it on Windows. Cc @LakshanF. IIRC, RuntimeEventSource initialization was made a ModuleInitializer because running it as part of CoreLib initialization was too early, and we don't have any other initialization phase before Main. Maybe we need one.
We do a topological sort of ModuleInitializers. The CoreLib module initializer should run first and the Quic module initializer should run second. Both the CoreLib module initializer and the Quic module initializer start new threads. The problem seems to be triggered when the background work done by the Quic module initializer happens faster than the background work done by the CoreLib module initializer. I think the root cause of the problem is the use of managed ...
A second related problem is that the managed locks use a high-level wait that can be overridden using SynchronizationContext. Try running this:
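The original snippet isn't preserved in this thread, but a minimal sketch of the kind of program being described might look like this (using .NET 9's `System.Threading.Lock` with its default, alertable waits):

```csharp
using System;
using System.Threading;

// When a context registered via SetWaitNotificationRequired() is current,
// a contended managed lock forwards its blocking wait to Wait() below,
// which can run arbitrary user code from inside the lock implementation.
class InterceptingContext : SynchronizationContext
{
    public InterceptingContext() => SetWaitNotificationRequired();

    public override int Wait(IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout)
    {
        Console.WriteLine("Lock wait routed through SynchronizationContext.Wait");
        return WaitHelper(waitHandles, waitAll, millisecondsTimeout);
    }
}

class Program
{
    static readonly Lock s_lock = new();

    static void Main()
    {
        SynchronizationContext.SetSynchronizationContext(new InterceptingContext());

        // Hold the lock on another thread long enough that Main must block.
        var t = new Thread(() => { s_lock.Enter(); Thread.Sleep(500); s_lock.Exit(); });
        t.Start();
        Thread.Sleep(100);

        s_lock.Enter(); // contended: the wait goes through the override above
        s_lock.Exit();
        t.Join();
    }
}
```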
This makes the managed locks a fairly high-level service that is not appropriate for low-level services like the class constructor runner. Using the managed locks in low-level services like the class constructor runner or the frozen object allocator is a breeding ground for re-entrancy crashes. I think the rule of thumb should be that regular CoreCLR and native AOT use the same type of lock in equivalent services: if CoreCLR uses a low-level lock for something, native AOT should do the same.
Given that it wouldn't be feasible to use ..., should I make the relevant changes to ClassConstructorRunner also?
I think so. And delete/simplify the code that has been working around the Lock use in ClassConstructorRunner. Thank you!
Is that a problem or something expected?
It is expected. The problem is that two threads running in parallel cause contention in the constructor runner, which leads to the crash.
It is a bit surprising why we decided to do this to locks in the first place (perhaps something with COM/STA, as usual) and keep maintaining it. In normal use, Lock has two levels of optimization to avoid waiting, and ...
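For context, the usual shape of such two-level contention avoidance is a non-blocking fast path, then a bounded spin, and only then a true wait. A minimal sketch, not the runtime's actual Lock code:

```csharp
using System.Threading;

// Illustrative spin-then-wait lock, not the actual Lock implementation.
public sealed class SpinThenWaitLock
{
    private int _state; // 0 = free, 1 = held
    private readonly AutoResetEvent _event = new(initialState: false);

    public void Enter()
    {
        // Level 1: non-blocking fast path.
        if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
            return;

        // Level 2: bounded spinning, hoping the holder exits quickly.
        var spinner = new SpinWait();
        while (!spinner.NextSpinWillYield)
        {
            spinner.SpinOnce();
            if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
                return;
        }

        // Only after both levels fail does the thread do a true wait.
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
            _event.WaitOne();
    }

    public void Exit()
    {
        Volatile.Write(ref _state, 0);
        _event.Set(); // release one waiter, if any
    }
}
```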
The argument seems equally applicable to any feature X: if I use feature X in SynchronizationContext.Wait, then I may manufacture recursions when X uses locks.
How far will this go? What about type unifiers, the sync table, ConditionalWeakTable, the weak reference implementation, COM stuff, and other places that use locks? Are they all broken because someone might do weird things with SynchronizationContext? I think it is doable and a lot cheaper to just keep ...
Yes, it needs to stop somewhere. The split between ordinary locks and managed locks is very arbitrary today. My point is that, in the current state of the world, the best way to fix reentrancy issues specific to native AOT is to use the same type of lock across runtimes.
I would be open to the idea of making the new Lock use primitive waits. It would require investigating what SynchronizationContext overrides are used for today and whether they would be broken by it.
Right. I was going to literally suggest "I'd go as far as using an uninterruptible wait in Lock", but decided to first search for anything that could possibly be broken (other than tests). So far I have found nothing interesting. SynchronizationContext is used, but mostly for Post/Send/Complete purposes. No one seems to be overriding Wait in any nontrivial way. I found only two cases; both just forward to a non-alertable WaitForMultipleObjectsEx, ironically, to suppress the message-pumping behavior of the standard wait.
I think the root cause of the reentrancy is that Lock initializes the EventSource as part of its regular operation; that runs arbitrary code, and that causes reentrancies. If the static initialization happens on the side (in a module initializer or in a finalizer), the problem disappears.

I do not think that CoreCLR is good guidance on what lock to use. The reasons in CoreCLR could be different: often just a different native/managed split, dances around COOP/PREEMPT modes, or some appdomain/SQL legacy that may not be applicable anymore. The static initialization is actually a good example: LowLevelLock lacks functionality for that. It also has fairly poor performance. That was never an issue, since LowLevelLock was basically a "backstop" lock for cases that happen rarely or nearly always result in a blocking wait, so perf improvements would not "move the needle". It would be a roundabout solution to something that could be solved inside the Lock itself. Alternatively, Lock could have an internal ctor/flag to not initialize the EventSource, but I think plugging into the diagnostics would be nice even for system locks.
I do not think we need that for static initializers. The static initializers are scheduled by a "deadlock-aware lock". The regular (low-level) lock is only needed to protect the deadlock-aware lock's state, so there should not be any recursion. The low-level lock for this purpose does not need to be fine-tuned, given the other overheads involved in running static constructors. We have the low-level lock exposed by the System.Native PAL, and I think it would be hard to get rid of all uses of it. As long as we have it, it should be fine to use it in S.P.CoreLib as needed.
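A rough sketch of the layering being described, purely illustrative and not the actual ClassConstructorRunner code (a plain `lock` stands in for the low-level lock):

```csharp
using System.Threading;

// The deadlock-aware lock records which thread is running a cctor so
// that circular class construction can be detected, while its own
// bookkeeping is guarded by a non-reentrant low-level lock that is
// held only for short critical sections - hence no recursion.
internal sealed class DeadlockAwareCctorLock
{
    private readonly object _stateLock = new(); // stands in for the low-level lock
    private Thread? _holder;

    // Returns true if the caller should run the cctor, false if the
    // current thread is already running it (a construction cycle).
    public bool TryBeginConstruction()
    {
        lock (_stateLock)
        {
            if (_holder == Thread.CurrentThread)
                return false; // recursion detected; do not wait on ourselves
            while (_holder != null)
                Monitor.Wait(_stateLock); // another thread is constructing
            _holder = Thread.CurrentThread;
            return true;
        }
    }

    public void EndConstruction()
    {
        lock (_stateLock)
        {
            _holder = null;
            Monitor.PulseAll(_stateLock);
        }
    }
}
```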
The fact that we run into contention, often enough to surface the reentrancy issue, indicates that this lock can be a hot spot and thus a potential bottleneck if the lock does not handle contention well.
It is ok to use it. It works. It is just that there are generally no reasons to use it over the regular Lock. There is the issue of "what if someone does things with SynchronizationContext.Wait", but to shield against that, most locks would need to be LowLevelLock. That is probably why no one "does things with SynchronizationContext.Wait": it would be dangerous.
What about locks taken on UI threads? I imagine they would still need to use a message-pumping wait for correctness, and that could run user code and potentially lead to reentrancy. Should locks used in the runtime (in low-level situations) also use message-pumping waits?
Yes. I'm more in favor of reusing ...
@kouvel, do you have any timelines for this? It is a heavy hitter in CI (I personally see it failing every day, and CI stats say we hit it 15 times a day on average), and we're about to snap .NET 9 Preview 1.
…sive accesses during its own class construction

- `NativeRuntimeEventSource.Log` can be null if it's being constructed on the same thread earlier in the stack; added null checks

Fixes dotnet#94728
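The shape of the fix is a diff-style fragment like the following (simplified; `ContentionLockCreated` is used here as a representative event, and the actual call sites are in the PR diff):

```csharp
// Before: assumes the event source is always constructed.
// NativeRuntimeEventSource.Log.ContentionLockCreated(this);

// After: Log can still be null if this same thread is in the middle of
// constructing NativeRuntimeEventSource further up the stack.
NativeRuntimeEventSource.Log?.ContentionLockCreated(this);
```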
The proposed change in this PR would, I believe, fix the CI issue. The discussion in this PR has evolved into resolving a larger issue, which I'm currently looking into; I hope to have something out soon. After thinking about it more, I may continue to employ the same technique for the CI issue: using a finalizer doesn't guarantee timely initialization, and I'd prefer to avoid going down the wait path before initialization is complete, both to avoid other potential reentrancy issues and to have a way to fix a reentrancy issue on that path if one is introduced. I'm still evaluating it, though.
Thanks
Could you please open an issue for this if there is not one already?
Rebased, and added a commit with one more null check in the wait path used by Lock. Opened #97105 to track making some of the waits non-alertable.
Looks like all remaining CI failures are known issues.
- Added an internal constructor that enables the lock to use non-alertable waits
- Non-alertable waits are not forwarded to `SynchronizationContext` wait overrides, are non-message-pumping waits, and are not interruptible
- Updated most of the uses of `Lock` in NativeAOT to use non-alertable waits
- Also simplified the fix in dotnet#94873 to avoid having to do the relevant null checks in various places along the wait path, by limiting the scope of the null checks to the initialization phase

Fixes dotnet#97105
…nt for recursive accesses during its own class construction (dotnet#94873)

* Fix Lock's usage of NativeRuntimeEventSource.Log to account for recursive accesses during its own class construction
- `NativeRuntimeEventSource.Log` can be null if it's being constructed on the same thread earlier in the stack; added null checks

Fixes dotnet#94728
…7227)

* Add an internal mode to `Lock` to have it use trivial (non-alertable) waits
- Added an internal constructor that enables the lock to use trivial waits
- Trivial waits are non-alertable waits that are not forwarded to `SynchronizationContext` wait overrides, are non-message-pumping waits, and are not interruptible
- Updated most of the uses of `Lock` in NativeAOT to use trivial waits
- Also simplified the fix in #94873 to avoid having to do the relevant null checks in various places along the wait path, by limiting the scope of the null checks to the initialization phase

Fixes #97105
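In usage terms, the opt-in looks something like this sketch (the constructor is internal, so only runtime code can use it; `RuntimeLocks` is a hypothetical holder type, not from the PR):

```csharp
using System.Threading;

internal static class RuntimeLocks
{
    // Runtime-internal services opt in to trivial waits: no
    // SynchronizationContext forwarding, no message pumping, no
    // thread interruption.
    internal static readonly Lock CctorRunnerLock = new Lock(useTrivialWaits: true);

    // User-facing code keeps the public constructor and the normal
    // alertable-wait behavior.
    internal static readonly Lock OrdinaryLock = new Lock();
}
```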
Unfortunately, Build Analysis can't match the logs once AzDO cancels the build. We're tracking that issue in #97044, but it won't get automatically flagged.
`NativeRuntimeEventSource.Log` can be null if it's being constructed on the same thread earlier in the stack. Added null checks.

Fixes #94728