Post-2.1 plan of Intel hardware intrinsic #10260
cc @CarolEidt @AndyAyersMS @jkotas @tannergooding @mikedn @redknightlois Please feel free to improve the above list. |
The goal should be to have HW intrinsics be in shape to be fully supported. Seems like the overall priorities are good:
IIRC there were some not fully resolved design aspects; they should be on the list too. |
There are a few pain points in the interface to deal with HW intrinsics, mostly on the |
@AndyAyersMS Thank you so much for the suggestions, updated the list. |
Good point. But I am afraid that "implicit conversions" may confuse the front-end compiler (Roslyn) and the IDE when determining which overload users actually want. |
They will have to be done with care; that is, either implicit conversions and/or overloads could be the way to go. Currently |
Although the first item captures the broad issue of API design and usability, it would be worth calling out that, beyond getting CQ data, we also want to use scenarios to drive usability. @redknightlois points out one concern, but we should also seek usability feedback more broadly. |
The PR adding support for the x86 FMA intrinsics is up: dotnet/coreclr#18105 |
One of the proposals to handle this is in https://github.com/dotnet/corefx/issues/27911#issuecomment-372684004 There are many other possible solutions which could be based on |
@fiigii @tannergooding Which ISA could I start working on now, as I have some free cycles? |
IIRC, SSE2 still has a few issues (e.g. CQ, tests, etc.), so it would be better to refine the SSE2 code first. |
@fiigii Could you please also add PAUSE as an intrinsic? The PAUSE instruction targets high-performance / low-latency intra-thread communication; lots of energy could be saved at the cost of a small latency increase if these communicating tight loops could use a single PAUSE instruction instead of busy spinning. This is critical as we have more and more cores and multiple NUMA nodes within a server. Could you please also check the explicit cache-control intrinsics (CLFLUSH, CLFLUSHOPT, CLWB, CLZERO(?), SFENCE) as persistent memory becomes mainstream. Original issue: |
As far as I can tell, calling |
@svick Unfortunately not. And almost all other intrinsics are available, so why not PAUSE? |
@svick For very high throughput low latency operations nothing beats calling PAUSE. Not on that level yet, but it is important to have it when dealing with that kind of stuff. |
Cc @kouvel for spin wait feedback. |
@zpodlovics PAUSE is quite different from other instructions in that it has different performance (throughput/latency) on different Intel microarchitectures. For example, on Skylake the latency of the PAUSE instruction was extended to 140 cycles, which is much longer than on previous microarchitectures. So, in my experience, using bare PAUSE makes it very difficult to get an optimal, portable spin. Meanwhile, I believe Microsoft and Intel have spent, and will continue to spend, a lot of effort tuning SpinWait (e.g., https://github.com/dotnet/coreclr/issues/13388). So, I think it needs more discussion, data, and applications to justify exposing PAUSE as an intrinsic.
Yes, we have them. Please look at the Sse and Sse2 classes. |
@fiigii Yes, I know. There are some cases where latency (and performance) is everything. Trading applications, especially high-frequency trading, have these requirements. Skylake's horrible PAUSE latency is unfortunate, but on that microarchitecture there is still no alternative to busy spinning for latency-critical applications. Please just do not try to make "smart" decisions for the developers; they will almost always be wrong. For these use cases, leave the decisions to the application developers. There is no alternative to a PAUSE intrinsic for this use case: "To ensure the proper order of outstanding memory operations, the processor incurs a severe penalty. The penalty from memory order violations can be reduced significantly by inserting a PAUSE instruction in the loop. This eliminates multiple loop iterations in the pipeline." Source:

It's not accidental that Java added this as the JEP 285 hints [1]: "As a practical example and use case, current x86 processors support a PAUSE instruction that can be used to indicate spinning behavior. Using a PAUSE instruction demonstrably reduces thread-to-thread round trips. Due to its benefits and widely recommended use, the x86 PAUSE instruction is commonly used in kernel spinlocks, in POSIX libraries that perform heuristic spins prior to blocking, and even by the JVM itself. However, due to the inability to hint that a Java loop is spinning, its benefits are not available to regular Java code."

Please note that almost all performance-critical applications have moved, or are trying to move, in non-blocking directions; this intra-thread communication pattern is, and soon will be, everywhere. I am afraid these performance- and latency-critical application developers will avoid .NET Core, as other platforms can provide better solutions for this use case. [2] [3] [4] [5]

[1] http://openjdk.java.net/jeps/285 |
How about exposing these ones:
The list is incomplete, but from my experience the instructions above are needed quite often if the app has to process data on bitstream level or do the bit-twiddling for any other purposes. |
In that case, wouldn't a better solution be to improve the performance of

To make it concrete, when I call

```
00007FF7DBCB1490 sub rsp,28h
00007FF7DBCB1494 mov ecx,1
00007FF7DBCB1499 call ThreadNative::SpinWait (07FF83B754D00h)
00007FF7DBCB149E nop
00007FF7DBCB149F add rsp,28h
00007FF7DBCB14A3 ret
--- e:\a\_work\5\s\src\vm\comsynchronizable.cpp --------------------------------
if (iterations <= 0)
00007FF83B754D00 test ecx,ecx
00007FF83B754D02 jle ThreadNative::SpinWait+93h (07FF83B754D93h)
FCIMPL1(void, ThreadNative::SpinWait, int iterations)
00007FF83B754D08 mov rax,rsp
00007FF83B754D0B mov dword ptr [rax+8],ecx
00007FF83B754D0E push r12
00007FF83B754D10 push r14
00007FF83B754D12 push r15
00007FF83B754D14 sub rsp,170h
00007FF83B754D1B mov qword ptr [rsp+58h],0FFFFFFFFFFFFFFFEh
00007FF83B754D24 mov qword ptr [rax+10h],rsi
00007FF83B754D28 mov qword ptr [rax+18h],rdi
00007FF83B754D2C mov rax,qword ptr [__security_cookie (07FF83BBA0008h)]
00007FF83B754D33 xor rax,rsp
00007FF83B754D36 mov qword ptr [rsp+160h],rax
00007FF83B754D3E mov esi,ecx
00007FF83B754D40 lea r14,[ThreadNative::SpinWait (07FF83B754D00h)]
00007FF83B754D47 mov qword ptr [__me],r14
if (iterations <= 100000)
00007FF83B754D4C cmp esi,186A0h
00007FF83B754D52 jg WinMDInternalImportRO::Release+0B194Ah (07FF83B83DF7Ah)
YieldProcessorNormalized(YieldProcessorNormalizationInfo(), iterations);
00007FF83B754D58 mov edx,dword ptr [g_yieldsPerNormalizedYield (07FF83BBA027Ch)]
00007FF83B754D5E imul rdx,rsi
00007FF83B754D62 pause
00007FF83B754D64 sub rdx,1
00007FF83B754D68 jne ThreadNative::SpinWait+62h (07FF83B754D62h)
00007FF83B754D6A mov rcx,qword ptr [rsp+160h]
00007FF83B754D72 xor rcx,rsp
00007FF83B754D75 call __security_check_cookie (07FF83B780510h)
00007FF83B754D7A lea r11,[rsp+170h]
00007FF83B754D82 mov rsi,qword ptr [r11+28h]
00007FF83B754D86 mov rdi,qword ptr [r11+30h]
00007FF83B754D8A mov rsp,r11
00007FF83B754D8D pop r15
00007FF83B754D8F pop r14
00007FF83B754D91 pop r12
00007FF83B754D93 ret
```

That is indeed a fair amount of extraneous code, but I think it could be improved:
I think that with those two changes, you would be left with very little extra code.
How is the Disruptor code you linked to better than .Net Core |
@svick Please feel free to improve SpinWait as you want; I would be happy to have an improved SpinWait, but that's a completely different problem. I am not interested in replacing the PAUSE intrinsic with anything else (and please do not waste your time and energy trying to convince me otherwise). What I would like to have is a single PAUSE intrinsic, nothing less, nothing more. Please read the manual description carefully, especially the bold part.

Yes, other platforms/languages already support this: for example, thanks to the JEP 285 hints, the Java JIT can emit PAUSE as an intrinsic in Java; "YieldProcessor" in Visual Studio C++ also means PAUSE as an intrinsic [2]; and the rep; nop instruction sequence is actually equal to the PAUSE instruction (opcode: F3 90) [3].

From the Intel instruction manual entry for PAUSE: "Improves the performance of spin-wait loops. When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops."

You can even try it out yourself, using the code examples from document [3]: "These measurements were taken on a 6-core, 12-thread, Intel® Core™ i7 processor 990X equivalent system. The observed performance gains were quite impressive. Up to 4x gains were seen when using eight threads, and even at thirty-two threads, the performance numbers were approximately 3x over just using Sleep(0)."
Like it or not, .NET Core is and will be competing with other platforms and languages, and leaving more than 3x-4x improvements on the table is a dumb idea (especially as we already have near-full intrinsics for everything else). Your competitor could just choose Java and reimplement everything there... and you are out of business really soon. Take a look at Kestrel as a good example of cross-pollination: its requirements and ideas significantly improved the C# language, the CoreFX base library, and the CoreCLR runtime too. [1] http://cr.openjdk.java.net/~ikrylov/8147844.hs.03/hotspot.patch |
The perf issue mentioned by the Intel docs, where issuing a pause could yield a 4x improvement, would be fixed by Thread.SpinWait(1), so an issue of that magnitude shouldn't be there. In .NET Core 2.1, Thread.SpinWait tries to normalize the delay such that spin-wait loops work more similarly across different processors and do not have to be separately tuned based on which processor they run on. The goal was to make it a somewhat fixed amount of relatively short delay. The delay for SpinWait(1) is greater than a single pause instruction, especially on pre-Skylake processors. I agree it would be nice to have something like a Thread.Pause() that is treated as an intrinsic and is specified to issue a hardware-specific delay for latency-critical scenarios. These are probably less common, but it would allow fine-tuning of spin-wait loops where necessary. |
I would like to see intrinsics for some instructions that are not behind a CPUID flag, specifically for
|
@pentp, it might be useful to open up a CoreCLR issue asking if |
At first look |
@pentp, right. There is some platform specific functionality that might be useful as intrinsics, but just fixing the existing APIs to do what we want might also be a good start. |
Having them as intrinsics behind |
Three suggestions:
|
@Jorenkv, for 1/2, unfortunately It additionally enforces a requirement that
For 3, do you have a real world example where having ref would help? For the most part, it is expected that you will be doing unsafe conversions, and the option is to either use actual unsafe code or to use |
|
Some unsafe is probably a given, but with

```csharp
unsafe static float HorizontalAdd(ReadOnlySpan<float> span)
{
    Debug.Assert(span.Length == 4);
    var v = Sse.LoadVector128((float*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(span))); // oops, need a pointer
    var t = Sse.Shuffle(v, v, 0b10_11_00_01);
    var s = Sse.Add(v, t);
    t = Sse.MoveHighToLow(t, s);
    s = Sse.AddScalar(s, t);
    return Sse.ConvertToSingle(s);
}
```
|
Right. But it would also be good to understand whether or not that represents the normal type of code that would be executed. Generally speaking, you shouldn't be

It also isn't like using

That isn't to say that also providing |
@tannergooding My mistake, I thought the pointer would be possible. Seems strange that a hypothetical

For the |
I actually did do some benchmarking, and you're absolutely correct. I always pin in the outer loop and pass pointers around today. I was hoping I could use

That said, I think there will be cases where, at least for Load/Store, |
@Jorenkv, it is on the list of things to discuss :) (whenever we have the next design meeting). But, like with everything, If it is a hot path, benchmark and profile to find out what works best for your app. |
The example you have posted above will lead to intermittent crashes and data corruptions unless the memory is pinned in some other way. |
One more question, to get an aligned buffer and avoid having to deal with pinning I was thinking I would Marshal.AllocHGlobal a block of memory, skip a few bytes at the start to get to a 32-byte boundary, and work with that. Any particular reason why this would be a bad idea? (Assuming we're doing this in moderation.) |
That was meant to be an example where the

I did find it interesting, though, that the JIT always marks that particular method Partially Interruptible in my testing. If there were some way to guarantee that behavior, it could also reduce the need for pinning in short-running methods. |
Does point 4 (containment for imm, 1-arg, 3-arg forms) cover VEX encoded SSE2?
|
Yes, it does. |
Slightly an aside, but it's been discussed earlier here:

Given that, even with

Edit: to be clear, that does work, but I'm not sure if there's a better way 🙂 |
@Porges I've been using this (with my input starting out as a byte buffer)

```csharp
Span<Vector128<uint>> vecBuff = MemoryMarshal.Cast<byte, Vector128<uint>>(byteSpan);
for (int i = 0; i < vecBuff.Length; i++)
{
    Vector128<uint> input = vecBuff[i];
    // ...
}
```

The read from the Span is compiling down to

```
movsxd rcx,eax
shl rcx,4
vmovupd xmm3,xmmword ptr [r10+rcx]
```

Not a whole lot of room for improvement there. Although... This shaves off two whole instructions:

```csharp
var uintBuff = MemoryMarshal.Cast<Vector128<uint>, uint>(vecBuff);
fixed (uint* buffPos = uintBuff)
{
    uint* currPos = buffPos;
    uint* end = (uint*)Unsafe.Add<uint>(currPos, uintBuff.Length);
    while (currPos < end)
    {
        var input = Sse2.LoadVector128(currPos);
        currPos = (uint*)Unsafe.Add<uint>(currPos, 4);
        // ...
    }
}
```

I'm looking forward to finding out how both approaches are wrong 😁 my buffer is unmanaged, the GC ain't moving it anywhere! Check. Mate.

Edit: benchmarks say shaving off those two instructions doesn't matter (which isn't too surprising, since it should be a pretty pipeline/speculative-execution-friendly pattern) |
@tannergooding @eerhardt |
From my understanding, there have been minimal changes in coreclr for .NET Core 2.2. You can see the changes here: dotnet/coreclr@release/2.1...release/2.2 It doesn't look like any of them are HW intrinsics related. |
Bmi1, Bmi2, Aes, and Pclmulqdq)