-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLRJit Access Violation on when attempting to run Windows ARM64 app #69222
Comments
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsDescriptionAlso tracked: dotnet/performance#2421
Looking deeper into this error and testing our built CoreRun.exe and the hello world console app locally, we were able to get the following error and stack trace:
Run with this error can be found here: https://dev.azure.com/dnceng/internal/_build/results?buildId=1765074&view=results, on either of the windows runs. Reproduction StepsUsing the latest dotnet SDK on a Windows ARM64 machine:
Expected behaviorApp should print hello world. Actual behaviorFailure occurs, printing:
Regression?This last worked with runtime commit 3e5517b and first failed with f8fa9f6, this was made up of these PR's from what I could tell: #67771, #69013, and #68748. Known WorkaroundsNo response ConfigurationThis occurs on Windows ARM64 machines when running with the latest dotnet version. This does not seem to be an issue on non-Windows ARM64 configurations. Other informationNo response
|
This is likely a regression from ff23735 |
cc @kunalspathak, looks like AV during LSRA |
Will take a look. |
Few questions:
|
I am actually able to repro this. I will investigate and submit a fix. |
I am not able to repro this on my local checked/debug build. Wondering if this is release only failure. Building release version now. |
yep, its checked vs. release difference. Will investigate further. |
It seems that something is off with new C++ in In Release:
In Debug:
We use the return value to access an element from an array which is of size |
TLDRBecause of C++ optimization, the code in unsigned lo32 = ulo32(value); // == 0 for input 0x100000000
unsigned hi32 = uhi32(value); // == 1 for input 0x100000000
if (lo32 != 0) {
return genLog2(lo32);
} else {
return genLog2(hi32) + 32; // returns 32
} into lo32 = ... ; // == 0 for input 0x100000000
hi32 = ... ; // == 1 for input 0x100000000
if (hi32 == 0) {
return genLog32(hi32) + 32;
} else {
return genLog32(lo32); // returns 0xFF
} DetailsConsider the runtime/src/coreclr/jit/compiler.hpp Lines 521 to 534 in 88fb966
Pasting here for quick reference: genLog2() and surrounding code// Given an unsigned 64-bit value, returns the lower 32-bits in unsigned format
//
inline unsigned ulo32(unsigned __int64 value)
{
return static_cast<unsigned>(value);
}
// Given an unsigned 64-bit value, returns the upper 32-bits in unsigned format
//
inline unsigned uhi32(unsigned __int64 value)
{
return static_cast<unsigned>(value >> 32);
}
/*****************************************************************************
*
* Given a value that has exactly one bit set, return the position of that
* bit, in other words return the logarithm in base 2 of the given value.
*/
inline unsigned genLog2(unsigned __int64 value)
{
unsigned lo32 = ulo32(value);
unsigned hi32 = uhi32(value);
if (lo32 != 0)
{
assert(hi32 == 0);
return genLog2(lo32);
}
else
{
return genLog2(hi32) + 32;
}
} Here is the assembly code of the relevant portion: ...
; This code loads the magic number for % operation. See https://godbolt.org/z/o7hEq4feK
00007ffe`8998da70 529229a2 mov w2,#0x914D
00007ffe`8998da74 72b759e2 movk w2,#0xBACF,lsl #0x10
00007ffe`8998da78 b9407b26 ldr w6,[x25,#0x78]
00007ffe`8998da7c b9405325 ldr w5,[x25,#0x50]
00007ffe`8998da80 f9402724 ldr x4,[x25,#0x48]
00007ffe`8998da84 f94017fb ldr x27,[sp,#0x28]
; coversCandidates &= ~coversCandidateBit;
00007ffe`8998da88 cb0f03e8 neg x8,x15
00007ffe`8998da8c ea0f010d ands x13,x8,x15
00007ffe`8998da90 aa2d03e9 mvn x9,x13 ; x13 = 0x100000000
00007ffe`8998da94 8a0f012f and x15,x9,x15
; ------ START OF genLog2(). x13= 0x100000000 (value of coversCandidateBit)
; It directly does `value >> 32` operation which is part of uhi32() method
; and uses its result to check if it is 0 or not.
00007ffe`8998da98 d360fdae lsr x14,x13,#0x20 ; x14 becomes 1
00007ffe`8998da9c 54000140 beq clrjit!LinearScan::allocateReg+0x88c (00007ffe`8998dac4)
00007ffe`8998daa0 9ba27da8 umull x8,w13,w2
00007ffe`8998daa4 d360fd09 lsr x9,x8,#0x20
00007ffe`8998daa8 4b0901aa sub w10,w13,w9
00007ffe`8998daac 0b4a0528 add w8,w9,w10,lsr #1
00007ffe`8998dab0 53057d09 lsr w9,w8,#5
00007ffe`8998dab4 528004a8 mov w8,#0x25
00007ffe`8998dab8 1b08b52c msub w12,w9,w8,w13 ; <--- w12= 0
; Below instruction loads the element hashTable[hash], where hash's value is in w12==0
; x21 is the address of hashTable (see contents below)
; Here we load hashTable[0] which is 0xff.
00007ffe`8998dabc 38ec4aa8 ldrsb w8,[x21,w12 uxtw #0] ; x21= 00007ffe89a1e5a8
00007ffe`8998dac0 1400000a b clrjit!LinearScan::allocateReg+0x8b0 (00007ffe`8998dae8)
00007ffe`8998dac4 9ba27dc8 umull x8,w14,w2 ; <--- else part starts
00007ffe`8998dac8 d360fd09 lsr x9,x8,#0x20
00007ffe`8998dacc 4b0901ca sub w10,w14,w9
00007ffe`8998dad0 0b4a0528 add w8,w9,w10,lsr #1
00007ffe`8998dad4 53057d09 lsr w9,w8,#5
00007ffe`8998dad8 528004a8 mov w8,#0x25
00007ffe`8998dadc 1b08b92c msub w12,w9,w8,w14
00007ffe`8998dae0 38ec4aa8 ldrsb w8,[x21,w12 uxtw #0]
00007ffe`8998dae4 11008108 add w8,w8,#0x20 ; + 32 present in expr `genLog2(hi32) + 32`.
; ------ END OF genLog2()
00007ffe`8998dae8 d3407d0c uxtw x12,w8
... Contents of hashTable runtime/src/coreclr/inc/bitposition.h Lines 34 to 40 in 88fb966
Next, we try to access |
Nice investigation, sounds like a tricky one. Can we switch to an intrinsic in |
Thanks for the suggestion, I will try it out. |
Yes, it has been a long outstanding problem that we are not running tests against bits that we are shipping. If we are not running tests against Windows Release arm64 in the twice a day rolling builds that are expected to have full coverage, it is clearly a problem that we need to fix. cc @jakobbotsch For PR CI coverage, we are running a subset of architecture/OS/flavor combinations by design. I do not see a problem with Windows arm64 Release missing there. |
We don't - coreclr seems to be testing only checked in the rolling (just the same as PR) and libraries tests which would also have caught this run
runtime tests are never run on release then, and release coreclr is never tested then on: linux/musl arm64, linux/musl arm, win-arm64, osx-arm64 note: This is only talking about the main runtime pipeline - I didn't check any of the outer/stress pipelines. |
I'll look into expanding the matrix for rolling runs on Monday. |
I looked into this a bit more and I can see that we do have the full coverage in Process terminated. Infinite recursion during resource lookup within System.Private.CoreLib. This may be a bug in System.Private.CoreLib, or potentially in certain extensibility points such as assembly resolve events or CultureInfo names. Resource name: Arg_AccessViolationException
at System.Environment.FailFast(System.String)
at System.SR.InternalGetResourceString(System.String)
at System.SR.GetResourceString(System.String)
at System.AccessViolationException..ctor()
at System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
at System.String.Ctor(Char*)
at System.Globalization.CultureData.GetLocaleInfoEx(System.String, UInt32)
at System.Globalization.CultureInfo.GetUserDefaultLocaleName()
at System.Globalization.CultureInfo..cctor()
at System.Globalization.CultureInfo.get_CachedCulturesByName()
at System.Globalization.CultureInfo.GetCultureInfo(System.String)
at System.Resources.ManifestBasedResourceGroveler.GetNeutralResourcesLanguage(System.Reflection.Assembly, System.Resources.UltimateResourceFallbackLocation ByRef)
at System.Resources.ResourceManager.CommonAssemblyInit()
at System.SR.get_ResourceManager()
at System.SR.InternalGetResourceString(System.String)
at System.SR.GetResourceString(System.String)
at System.AccessViolationException..ctor()
at System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
at System.String.Ctor(Char*)
at System.AppContext.Setup(Char**, Char**, Int32) until the latest run that incorporates the fix from #69333. This issue was actually pointed out in #67771 before it was merged, but it looks like it might have been misdiagnosed as another issue. win-x64 is also failing in runtime-extra-platforms right now which looks to be #67728. The pipeline unfortunately has a lot of failures so keeping tabs on what is known and what isn't is challenging, but we should try to be more aware of this in the future. |
The problem is extra-platforms is not followed as closely - but it does need to be as you say. Is there any coreclr testing on release at all? |
Yes, we do run a number of legs with release coreclr build. |
I meant the ones for src/test. Or is that not relevant enough to catch further issues on release? |
For PR CI coverage, we do run libraries tests on release coreclr release that ensure that it is not completely broken. So testing on checked coreclr should be sufficient there. For the rolling builds, I agree we should be running the runtime tests on release coreclr and libraries. It sounds like something that needs to be fixed. |
We have started seeing this bug again. It looks like it happened between the commits here, dd919f1...b12e288. |
Strangely we have runs from yesterday that are green: |
@DrewScoggins - I downloaded the artifacts from the failing job and it seems to work fine for me. Is the machine in bad state or something? |
It looks like the most recent run was successful. I will keep watching this today over the next handful of runs to see if we keep seeing this and close this if it is fixed. |
Do you see any more failures or should we go ahead and close it? |
We should go ahead and close this, the errors haven't been showing up and things have been running smoothly. I will go ahead and close it. |
The fix needs to be ported to 6.0 |
Seems reasonable to port it, but I would also hope the VC++ team would prioritize fixing this so that newer compilers will not hit this problem in all old source code we compile. @kunalspathak did you hear back from them? |
They have already fixed the bug and it is set to "Pending Release". https://developercommunity.visualstudio.com/t/Bad-codegen-for-Arm64-when-upgraded-to-1/10054888 I will start the process to backport it because I am not sure when the release will be out and we will update VC++ at that time. |
Note, in the issue, they did suggest a workaround of disabling the newer optimization by using |
Description
Also tracked: dotnet/performance#2421
In performance runs for Windows ARM64 we are failing due to run benchmarks due to this error:
Looking deeper into this error and testing our built CoreRun.exe and the hello world console app locally, we were able to get the following error and stack trace:
Run with this error can be found here: https://dev.azure.com/dnceng/internal/_build/results?buildId=1765074&view=results, on either of the windows runs.
Reproduction Steps
Using the latest dotnet SDK on a Windows ARM64 machine:
Expected behavior
App should print hello world.
Actual behavior
Failure occurs, printing:
Regression?
This last worked with runtime commit 3e5517b and first failed with f8fa9f6, this was made up of these PR's from what I could tell: #67771, #69013, and #68748.
Known Workarounds
No response
Configuration
This occurs on Windows ARM64 machines when running with the latest dotnet version. This does not seem to be an issue on non-Windows ARM64 configurations.
Other information
No response
The text was updated successfully, but these errors were encountered: