Post-2.1 plan of Intel hardware intrinsic #10260
cc @CarolEidt @AndyAyersMS @jkotas @tannergooding @mikedn @redknightlois Please feel free to improve the above list. |
The goal should be to have HW intrinsics be in shape to be fully supported. Seems like the overall priorities are good:
IIRC there were some not fully resolved design aspects; they should be on the list too. |
There are a few pain points in the interface to deal with HW intrinsics, mostly on the |
@AndyAyersMS Thank you so much for the suggestions, updated the list. |
Good point. But I am afraid that "implicit conversions" may confuse the front-end compiler (Roslyn) and the IDE when determining which overload users actually want. |
They will have to be done with care; that is, either implicit conversions and/or overloads could be the way to go. Currently |
Although the first item captures the broad issue of API design and usability, it would be worth calling out that, beyond getting CQ data, we also want to use scenarios to drive usability. @redknightlois points out one concern, but we should also seek usability feedback more broadly. |
The PR adding support for the x86 FMA intrinsics is up: dotnet/coreclr#18105 |
One of the proposals to handle this is in https://github.com/dotnet/corefx/issues/27911#issuecomment-372684004 There are many other possible solutions which could be based on |
@fiigii @tannergooding Which ISA could I start working on now, as I have some free cycles? |
IIRC, SSE2 still has a few issues (e.g. CQ, tests, etc.), so it would be better to refine the SSE2 code first. |
@fiigii Could you please also add PAUSE as an intrinsic? The PAUSE instruction targets high-performance / low-latency intra-thread communication; lots of energy could be saved at the cost of a small latency increase if these communicating tight loops could use a single PAUSE instruction instead of busy spinning. This is critical as we have more and more cores and multiple NUMA nodes within a server. Could you please also check the explicit cache-control intrinsics (CLFLUSH, CLFLUSHOPT, CLWB, CLZERO(?), SFENCE) as persistent memory becomes mainstream. Original issue: |
As far as I can tell, calling |
@svick Unfortunately not. And almost all other intrinsics are available, so why not PAUSE? |
@svick For very high throughput low latency operations nothing beats calling PAUSE. Not on that level yet, but it is important to have it when dealing with that kind of stuff. |
Cc @kouvel for spin wait feedback. |
@zpodlovics PAUSE is quite different from other instructions in that it has different performance (throughput/latency) on different Intel microarchitectures. For example, on Skylake the latency of the PAUSE instruction was extended to 140 cycles, which is much longer than on previous microarchitectures. So, in my experience, using bare PAUSE makes it very difficult to get an optimal, portable spin. Meanwhile, I believe Microsoft and Intel have spent, and will continue to spend, a lot of effort tuning SpinWait (e.g., https://github.com/dotnet/coreclr/issues/13388). So, I think it needs more discussion, data, and applications to justify exposing PAUSE as an intrinsic.
Yes, we have them. Please look at the Sse and Sse2 classes. |
@fiigii Yes, I know. There are some cases where latency (and performance) is everything. Trading applications, especially high-frequency trading, have these requirements. Skylake's horrible PAUSE latency is unfortunate, but on that microarchitecture there is still no alternative to busy spinning for latency-critical applications. Please just do not try to make "smart" decisions for the developers; they will almost always be wrong. For these use cases, leave the decisions to the application developers. There is no alternative to a PAUSE intrinsic for this use case: "To ensure the proper order of outstanding memory operations, the processor incurs a severe penalty. The penalty from memory order violations can be reduced significantly by inserting a PAUSE instruction in the loop. This eliminates multiple loop iterations in the pipeline." Source:

It's not accidental that Java added this as the JEP 285 hints [1]: "As a practical example and use case, current x86 processors support a PAUSE instruction that can be used to indicate spinning behavior. Using a PAUSE instruction demonstrably reduces thread-to-thread round trips. Due to its benefits and widely recommended use, the x86 PAUSE instruction is commonly used in kernel spinlocks, in POSIX libraries that perform heuristic spins prior to blocking, and even by the JVM itself. However, due to the inability to hint that a Java loop is spinning, its benefits are not available to regular Java code."

Please note that almost all performance-critical applications have moved, or are trying to move, in non-blocking directions; this intra-thread communication pattern is, and soon will be, everywhere. I am afraid these performance- and latency-critical application developers will avoid .NET Core, as other platforms can provide better solutions for this use case. [2] [3] [4] [5]

[1] http://openjdk.java.net/jeps/285 |
How about exposing these ones:
The list is incomplete, but from my experience the instructions above are needed quite often if the app has to process data on bitstream level or do the bit-twiddling for any other purposes. |
In that case, wouldn't a better solution be to improve the performance of

To make it concrete, when I call

```
00007FF7DBCB1490 sub rsp,28h
00007FF7DBCB1494 mov ecx,1
00007FF7DBCB1499 call ThreadNative::SpinWait (07FF83B754D00h)
00007FF7DBCB149E nop
00007FF7DBCB149F add rsp,28h
00007FF7DBCB14A3 ret
--- e:\a\_work\5\s\src\vm\comsynchronizable.cpp --------------------------------
if (iterations <= 0)
00007FF83B754D00 test ecx,ecx
00007FF83B754D02 jle ThreadNative::SpinWait+93h (07FF83B754D93h)
FCIMPL1(void, ThreadNative::SpinWait, int iterations)
00007FF83B754D08 mov rax,rsp
00007FF83B754D0B mov dword ptr [rax+8],ecx
00007FF83B754D0E push r12
00007FF83B754D10 push r14
00007FF83B754D12 push r15
00007FF83B754D14 sub rsp,170h
00007FF83B754D1B mov qword ptr [rsp+58h],0FFFFFFFFFFFFFFFEh
00007FF83B754D24 mov qword ptr [rax+10h],rsi
00007FF83B754D28 mov qword ptr [rax+18h],rdi
00007FF83B754D2C mov rax,qword ptr [__security_cookie (07FF83BBA0008h)]
00007FF83B754D33 xor rax,rsp
00007FF83B754D36 mov qword ptr [rsp+160h],rax
00007FF83B754D3E mov esi,ecx
00007FF83B754D40 lea r14,[ThreadNative::SpinWait (07FF83B754D00h)]
00007FF83B754D47 mov qword ptr [__me],r14
if (iterations <= 100000)
00007FF83B754D4C cmp esi,186A0h
00007FF83B754D52 jg WinMDInternalImportRO::Release+0B194Ah (07FF83B83DF7Ah)
YieldProcessorNormalized(YieldProcessorNormalizationInfo(), iterations);
00007FF83B754D58 mov edx,dword ptr [g_yieldsPerNormalizedYield (07FF83BBA027Ch)]
00007FF83B754D5E imul rdx,rsi
00007FF83B754D62 pause
00007FF83B754D64 sub rdx,1
00007FF83B754D68 jne ThreadNative::SpinWait+62h (07FF83B754D62h)
00007FF83B754D6A mov rcx,qword ptr [rsp+160h]
00007FF83B754D72 xor rcx,rsp
00007FF83B754D75 call __security_check_cookie (07FF83B780510h)
00007FF83B754D7A lea r11,[rsp+170h]
00007FF83B754D82 mov rsi,qword ptr [r11+28h]
00007FF83B754D86 mov rdi,qword ptr [r11+30h]
00007FF83B754D8A mov rsp,r11
00007FF83B754D8D pop r15
00007FF83B754D8F pop r14
00007FF83B754D91 pop r12
00007FF83B754D93 ret
```

That is indeed a fair amount of extraneous code, but I think it could be improved:
I think that with those two changes, you would be left with very little extra code.
How is the Disruptor code you linked to better than .Net Core |
@svick Please feel free to improve SpinWait as you want; I would be happy to have an improved SpinWait, but that's a completely different problem. I am not interested in replacing the PAUSE intrinsic with anything else (and please do not waste your time and energy trying to convince me otherwise). What I would like to have is a single PAUSE intrinsic, nothing less, nothing more. Please read the manual description carefully, especially the bold part.

Yes, other platforms/languages already support this: for example, thanks to the JEP 285 hints, the Java JIT can emit PAUSE as an intrinsic in Java; "YieldProcessor" in Visual Studio C++ also means PAUSE as an intrinsic [2]; and the rep; nop instruction sequence is actually equal to the PAUSE instruction (opcode: F3 90) [3].

From the Intel instruction manual entry for PAUSE: "Improves the performance of spin-wait loops. When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops."

You can even try it out yourself, using the code examples from document [3]: "These measurements were taken on a 6-core, 12-thread, Intel® Core™ i7 processor 990X equivalent system. The observed performance gains were quite impressive. Up to 4x gains were seen when using eight threads, and even at thirty-two threads, the performance numbers were approximately 3x over just using Sleep(0)."
Like it or not, .NET Core is and will be competing with other platforms and languages, and leaving more than 3x-4x improvements on the table is a dumb idea (especially as we already have near-full intrinsics for everything else). Your competitor could just choose Java and reimplement everything there... and you are out of business really soon. Take a look at Kestrel as a good example of cross-pollination: its requirements and ideas significantly improved the C# language, the CoreFX base library, and the CoreCLR runtime too. [1] http://cr.openjdk.java.net/~ikrylov/8147844.hs.03/hotspot.patch |
The perf issue mentioned by the Intel docs, where issuing a pause could yield a 4x improvement, would be fixed by Thread.SpinWait(1), so an issue of that magnitude shouldn't be there. In .NET Core 2.1, Thread.SpinWait tries to normalize the delay such that spin-wait loops work more similarly across different processors and do not have to be separately tuned based on which processor they run on. The goal was to make it a somewhat fixed amount of relatively short delay. The delay for SpinWait(1) is greater than a single pause instruction, especially on pre-Skylake processors. I agree it would be nice to have something like a Thread.Pause() that is treated as an intrinsic and is specified to issue a hardware-specific delay for latency-critical scenarios. These are probably less common, but it would allow fine-tuning of spin-wait loops where necessary. |
I would like to see intrinsics for some instructions that are not behind a CPUID flag, specifically for
|
@pentp, it might be useful to open up a CoreCLR issue asking if |
At first look |
@pentp, right. There is some platform specific functionality that might be useful as intrinsics, but just fixing the existing APIs to do what we want might also be a good start. |
Having them as intrinsics behind |
Three suggestions:
|
@Jorenkv, for 1/2, unfortunately It additionally enforces a requirement that
For 3, do you have a real world example where having ref would help? For the most part, it is expected that you will be doing unsafe conversions, and the option is to either use actual unsafe code or to use |
|
Some unsafe is probably a given, but with

```csharp
unsafe static float HorizontalAdd(ReadOnlySpan<float> span)
{
    Debug.Assert(span.Length == 4);
    var v = Sse.LoadVector128((float*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(span))); // oops, need a pointer
    var t = Sse.Shuffle(v, v, 0b10_11_00_01);
    var s = Sse.Add(v, t);
    t = Sse.MoveHighToLow(t, s);
    s = Sse.AddScalar(s, t);
    return Sse.ConvertToSingle(s);
}
```
|
Right. But it would also be good to understand whether or not that represents the normal type of code that would be executed. Generally speaking, you shouldn't be

It also isn't like using

That isn't to say that also providing |
@tannergooding My mistake, I thought the pointer would be possible. Seems strange that a hypothetical

For the |
I actually did do some benchmarking, and you're absolutely correct. I always pin in the outer loop and pass pointers around today. I was hoping I could use

That said, I think there will be cases where, at least for Load/Store, |
@Jorenkv, it is on the list of things to discuss :) (whenever we have the next design meeting). But, like with everything, If it is a hot path, benchmark and profile to find out what works best for your app. |
The example you have posted above will lead to intermittent crashes and data corruptions unless the memory is pinned in some other way. |
One more question, to get an aligned buffer and avoid having to deal with pinning I was thinking I would Marshal.AllocHGlobal a block of memory, skip a few bytes at the start to get to a 32-byte boundary, and work with that. Any particular reason why this would be a bad idea? (Assuming we're doing this in moderation.) |
That was meant to be an example where the

I did find it interesting, though, that the JIT always marks that particular method Partially Interruptible in my testing. If there were some way to guarantee that behavior, it could also reduce the need for pinning in short-running methods. |
Does point 4 (containment for imm, 1-arg, 3-arg forms) cover VEX encoded SSE2?
|
Yes, it does. |
Slightly an aside, but it's been discussed earlier here:

Given that, even with

Edit: to be clear, that does work, but I'm not sure if there's a better way 🙂 |
@Porges I've been using this (with my input starting out as a byte buffer)

```csharp
Span<Vector128<uint>> vecBuff = MemoryMarshal.Cast<byte, Vector128<uint>>(byteSpan);
for (int i = 0; i < vecBuff.Length; i++)
{
    Vector128<uint> input = vecBuff[i];
    // ...
}
```

The read from the Span is compiling down to

```
movsxd rcx,eax
shl rcx,4
vmovupd xmm3,xmmword ptr [r10+rcx]
```

Not a whole lot of room for improvement there. Although... This shaves off two whole instructions:

```csharp
var uintBuff = MemoryMarshal.Cast<Vector128<uint>, uint>(vecBuff);
fixed (uint* buffPos = uintBuff)
{
    uint* currPos = buffPos;
    uint* end = (uint*)Unsafe.Add<uint>(currPos, uintBuff.Length);
    while (currPos < end)
    {
        var input = Sse2.LoadVector128(currPos);
        currPos = (uint*)Unsafe.Add<uint>(currPos, 4);
        // ...
    }
}
```

I'm looking forward to finding out how both approaches are wrong 😁 my buffer is unmanaged, the GC ain't moving it anywhere! Check. Mate.

Edit: benchmarks say shaving off those two instructions doesn't matter (which isn't too surprising, since it should be a pretty pipeline/speculative-execution-friendly pattern) |
@tannergooding @eerhardt |
From my understanding, there have been minimal changes in coreclr for .NET Core 2.2. You can see the changes here: dotnet/coreclr@release/2.1...release/2.2 It doesn't look like any of them are HW intrinsics related. |
Bmi1, Bmi2, Aes, and Pclmulqdq)