Implement the SSE hardware intrinsics. #15538
Conversation
src/jit/emitxarch.cpp
Outdated
@@ -107,7 +108,7 @@ bool emitter::IsDstSrcSrcAVXInstruction(instruction ins)
{
    return IsAVXInstruction(ins) &&
           (ins == INS_movlpd || ins == INS_movlps || ins == INS_movhpd || ins == INS_movhps || ins == INS_movss ||
            ins == INS_movlhps || ins == INS_sqrtss || ins == INS_sqrtsd || ins == INS_cvtss2sd || ins == INS_cvtsd2ss);
INS_movlhps was incorrectly listed as DstSrcSrc, which will cause it to generate a different result on AVX vs non-AVX machines.
src/jit/emitxarch.cpp
Outdated
void emitter::emitIns_SIMD_R_R_R_I(instruction ins, regNumber reg, regNumber reg1, regNumber reg2, int ival, var_types simdtype)
{
    if (UseVEXEncoding() && reg1 != reg)
For both this and emitIns_SIMD_R_R_R above: do we have a work item tracking support for reg2 being a memory operand when it is properly aligned?
Currently we are generating things like:

C4E1791000 vmovupd xmm0, xmmword ptr [rax]
C4E14858F0 vaddps xmm6, xmm0

while in some cases we could generate vaddps xmm6, [rax].
reg2 being a memory operand when it is properly aligned?
And in what circumstances will the memory operand be aligned? Only for local variables I guess.
Yeah. It definitely can't be used everywhere, and isn't something I think needs to happen immediately.
But for the cases where the memory will always be aligned (locals, for example), it would be useful to generate the more efficient code.
Only for local variables I guess

That is not true. In general, devs writing vectorized code peel off the start of a loop to align the data, then do all the remaining calculations on aligned data until the end of the loop, where processing may require a change of logic.
This means that all data accessed in the main loop will be aligned, and doing anything but vaddps xmm6, [rax] would be an unacceptable waste of CPU cycles.
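For readers unfamiliar with the pattern being described, here is a minimal C# sketch of peeling the head of a loop until the pointer is 16-byte aligned. It is written against the shipped System.Runtime.Intrinsics surface (Sse.LoadAlignedVector128, Vector128<T>.Zero, GetElement), which differs slightly from the names used in this PR, and it omits the Sse.IsSupported check for brevity.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AlignedSum
{
    public static unsafe float Sum(float* p, int length)
    {
        int i = 0;
        float sum = 0.0f;

        // Peel: consume leading elements one at a time until p + i is 16-byte aligned.
        while (i < length && ((ulong)(p + i) & 0xF) != 0)
        {
            sum += p[i++];
        }

        // Main loop: every access here is 16-byte aligned, so an aligned load
        // (and, ideally, a folded memory operand) is safe.
        Vector128<float> acc = Vector128<float>.Zero;
        for (; i + 4 <= length; i += 4)
        {
            acc = Sse.Add(acc, Sse.LoadAlignedVector128(p + i));
        }
        sum += acc.GetElement(0) + acc.GetElement(1) + acc.GetElement(2) + acc.GetElement(3);

        // Tail: handle any leftover elements scalar.
        for (; i < length; i++)
        {
            sum += p[i];
        }

        return sum;
    }
}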
In general devs writing vectorized code would peel off start of loops to align data

The key here is "in general". The JIT has no way to know whether the address is aligned, and it cannot make such assumptions simply because "in general" developers do that.
I suspect that it will work if the developer uses LoadAligned to indicate that the memory address is aligned.
I agree that, "in general", users writing performant code will try to keep accesses aligned to avoid reads/writes that cross cache-line or page boundaries.
However, as @mikedn points out, assuming this is incorrect, and detecting it is likely hard.
It's probably worth a separate discussion (will log a bug in a bit), but I believe it is something we should try to enable (maybe a special intrinsic, compiler hint, or attribute to tell the JIT that memory accesses will be aligned for a given method).
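Purely to illustrate the kind of per-method hint being floated here, a hypothetical sketch; nothing below exists in the libraries, and both the attribute name and its parameter are invented for illustration.

using System;

// Hypothetical, illustration only: a hint telling the JIT that vector memory accesses
// in the annotated method are aligned to the given boundary, so memory operands could
// be folded into SSE instructions safely.
[AttributeUsage(AttributeTargets.Method)]
sealed class AssumeAlignedAttribute : Attribute
{
    public AssumeAlignedAttribute(int byteAlignment)
    {
        ByteAlignment = byteAlignment;
    }

    public int ByteAlignment { get; }
}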
It's probably worth a separate discussion (will log a bug in a bit)

I think that is the right approach, as this is not that easy to get right.
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
@@ -103,6 +103,133 @@ void CodeGen::genSSEIntrinsic(GenTreeHWIntrinsic* node)
    op2Reg = op2->gtRegNum;
    emit->emitIns_SIMD_R_R_R(INS_addps, targetReg, op1Reg, op2Reg, TYP_SIMD16);
    break;

case NI_SSE_And:
All the cases (so far) are essentially the same, with only the instruction (or immediate value) differing.
It might be worthwhile including this info in the hwintrinsiclist and having them be entries in an array for constant-time lookup...
Yes, unlike "classic" instructions, SSE/AVX instructions have a more consistent format, so it should be possible to drive most codegen from tables. Basically you need to add to the list a "form" value that can be used to select between R_R_R and R_R_R_I, and of course the instruction itself.
Well, there's also the problem of memory operands, but that's something that the IR node will indicate. So R_R_R in the hw intrinsic list really stands for R_R_RM.
I was actually just thinking of having a single array, something where the entries are {instruction, immediate}. Helper functions (like get_IsSupported) would have INS_INVALID, and instructions without an immediate would probably have -1.
There can then be two general groupings: one for instructions without an immediate (those that use SIMD_R_R_R) and one for instructions with an immediate (those that use SIMD_R_R_R_I).
Other instructions (such as Load) which have different semantics can be in their own case grouping.
I don't think we necessarily need a form value, unless you think that would be more efficient.
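To make the shape of that proposal concrete, here is a rough sketch of the lookup being discussed, written as C# purely for illustration; the real table would live in the JIT's C++ sources, and the instruction and intrinsic names below are stand-ins.

// Illustration only: one entry per intrinsic id, giving the instruction to emit and a
// fixed immediate where one applies (e.g., the cmpps comparison predicate).
enum Ins { Invalid, Addps, Andps, Cmpps /* ... */ }

readonly struct HWIntrinsicEntry
{
    public readonly Ins Instruction;  // Ins.Invalid for helpers such as get_IsSupported
    public readonly int Immediate;    // -1 when the instruction takes no immediate

    public HWIntrinsicEntry(Ins instruction, int immediate = -1)
    {
        Instruction = instruction;
        Immediate   = immediate;
    }
}

static class HWIntrinsicTable
{
    // Indexed by a hypothetical NI_SSE_* intrinsic id.
    public static readonly HWIntrinsicEntry[] Entries =
    {
        new HWIntrinsicEntry(Ins.Invalid),   // NI_SSE_IsSupported      -> no codegen
        new HWIntrinsicEntry(Ins.Addps),     // NI_SSE_Add              -> emitIns_SIMD_R_R_R
        new HWIntrinsicEntry(Ins.Andps),     // NI_SSE_And              -> emitIns_SIMD_R_R_R
        new HWIntrinsicEntry(Ins.Cmpps, 0),  // NI_SSE_CompareEqual     -> emitIns_SIMD_R_R_R_I, imm8 = 0 (EQ)
        new HWIntrinsicEntry(Ins.Cmpps, 1),  // NI_SSE_CompareLessThan  -> emitIns_SIMD_R_R_R_I, imm8 = 1 (LT)
    };
}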
I do not want to tabulate the codegen of the hw intrinsics, which won't always be this simple.
For example, intrinsicID + baseType is not enough to determine codegen for some intrinsics; I would add a new field to GenTreeHWIntrinsic to represent overloads, which would be used here.
I don't think we necessarily need a form value, unless you think that would be more efficient.

The form thing was just a suggestion. Any solution that cuts down the amount of repeated code should be good enough, provided that it doesn't cause other problems.

I would add a new field into GenTreeHWIntrinsic to represent overloads that would be used here.

Hmm? It's not clear why that would be better.
I don't think we should do it for all instructions, but a number of them (especially the binary arithmetic and comparison operators) will all be fairly common and follow roughly the same code path.
Cutting down duplicate code will likely be worthwhile for these operations.
Hmm? It's not clear why that would be better.

@mikedn For example, we need an additional field to distinguish these two intrinsics (intrinsicID + baseType is not enough):

Vector128<float> GatherVector128(float* baseAddress, Vector128<long> index, byte scale);
Vector128<float> GatherVector128(float* baseAddress, Vector128<int> index, byte scale);

My current plan is to add an overloads field to GenTreeHWIntrinsic to bring that information from the importer to codegen.
Here is the complete list of GatherVector128 overloads to show the necessity:
/// <summary>
/// __m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<int> GatherVector128(int* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<uint> GatherVector128(uint* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<long> GatherVector128(long* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<ulong> GatherVector128(ulong* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128 _mm_i32gather_ps (float const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<float> GatherVector128(float* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<double> GatherVector128(double* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<int> GatherVector128(int* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i64gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<uint> GatherVector128(uint* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<long> GatherVector128(long* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128i _mm_i64gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<ulong> GatherVector128(ulong* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<float> GatherVector128(float* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
/// <summary>
/// __m128d _mm_i64gather_pd (double const* base_addr, __m128i vindex, const int scale)
/// </summary>
public static unsafe Vector128<double> GatherVector128(double* baseAddress, Vector128<long> index, byte scale) => GatherVector128(baseAddress, index, scale);
intrinsicID + baseType is not enough
Hmm, can't we use a different intrinsic id for each overload?
Hmm, can't we use a different intrinsic id for each overload?

That will break the intrinsic searching, because a method name has to map to one intrinsic id in the current [Intrinsic] system.
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE_Multiply,        "Multiply",             SSE)
HARDWARE_INTRINSIC(SSE_Or,              "Or",                   SSE)
HARDWARE_INTRINSIC(SSE_Reciprocal,      "Reciprocal",           SSE)
HARDWARE_INTRINSIC(SSE_ReciprocalSqrt,  "ReciprocalSquareRoot", SSE)
This name is getting fixed in #15471, will update after it is merged.
I split and categorized the hardware intrinsic tests to make it easier to tell what coverage exists for each ISA and each instruction. A good portion of the code could be shared, and it might be worth refactoring sooner rather than later.
For the … intrinsics: using the inverse operations is preferred, correct? I was trying to think of reasons why software emulation would be suggested, but came up drawing a blank...
FYI @fiigii, I figured I would start on this while waiting for the scalar intrinsic PR to get merged. I believe you are going to be focusing on AVX/AVX2, so hopefully I am not duplicating work you've already done here 😄
src/jit/emitxarch.cpp
Outdated
void emitter::emitIns_SIMD_R_R_R_I(
    instruction ins, regNumber reg, regNumber reg1, regNumber reg2, int ival, var_types simdtype)
{
    if (UseVEXEncoding() && reg1 != reg)
Parenthesize reg1 != reg
@tannergooding Thank you so much for the work!
My next step is to implement … I recommend you start with the simple intrinsics that generate a single instruction (e.g., arithmetic, bit manipulation, comparison, etc.). Then certain helper intrinsics can be expanded to combinations of these simple intrinsics.
Yes, this feature is in my plan; we have to enable containment analysis on hardware intrinsics in the future. In C++, users can indicate this codegen by dereferencing pointers to vectors, but C# has no pointers to managed types, so this feature would considerably impact CQ.
Hmm, it does have that: …
To be clear, var res = Avx.Add(vec1, Avx.Load(address)); should generate vaddps ymm1, ymm2, [address].
Yes. Though I suppose it should also work with …
But that requires overloading intrinsics with …
Only the load/store intrinsics would need that. I think that was discussed, but I don't remember how exactly we ended up with only the pointer versions. Anyway, it doesn't matter too much; they can be added if deemed useful at some point.
I meant that C++ can compile:

// vec1 is a __m256 and addr is a float*
auto a = _mm256_add_ps(vec1, *(__m256*)addr);

to vaddps ymm0, ymm0, ymmword ptr [addr]. But in C#, we have to rely on folding the load intrinsic into its consumer.
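For comparison, a hedged sketch of the C# shape that would need this folding; Avx.Add follows the naming used in this thread, the load is spelled Avx.LoadVector256 as in the shipped API, and the desired codegen is the single memory-operand form shown above.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FoldExample
{
    // The JIT would need to recognize that the load feeds only the add and contain it,
    // emitting "vaddps ymm, ymm, ymmword ptr [addr]" instead of a separate
    // vmovups + vaddps pair.
    public static unsafe Vector256<float> AddFromMemory(Vector256<float> vec1, float* addr)
    {
        return Avx.Add(vec1, Avx.LoadVector256(addr));
    }
}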
Most of the aligned and unaligned instructions have no difference in performance or behavior on modern Intel CPUs when the load/store falls within one cache line. Users should be responsible for the alignment of pointers.
Yes, and similar code will work in C#, except that you can't generate …
For SSE2, yes.
@mikedn
@fiigii, there are some instructions, like addsubpd, where the manual explicitly indicates (on the actual instruction page) that the source operand must be aligned or #GP will be generated. It would be useful to determine which section of the manual is accurate.
Hmm, too lazy to dig through the docs at this hour. Let's test:

__m128 a[42];
__asm addps xmm0, xmmword ptr [a+1]

Crashes with "read access violation".

__m128 a[42];
__asm vaddps xmm0, xmm0, xmmword ptr [a+1]

Runs fine. So yes, classic SSE2 (non-VEX-encoded) instructions require 16-byte alignment.
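A hedged C# illustration of what that result means for the folding question above; Sse.Add follows the thread's naming, the loads are spelled as in the shipped API (Sse.LoadVector128 / Sse.LoadAlignedVector128), and the codegen described in the comments is the expected behavior, not something this PR implements.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FoldingSafety
{
    public static unsafe Vector128<float> Example(Vector128<float> acc, float* p)
    {
        // Unaligned load: on a non-VEX target this must stay "movups + addps", because
        // folding it into "addps xmm, [p]" would fault when p is not 16-byte aligned.
        // With the VEX encoding ("vaddps xmm, xmm, [p]") folding is safe either way.
        Vector128<float> a = Sse.Add(acc, Sse.LoadVector128(p));

        // Aligned load: the user asserts 16-byte alignment, so folding into the legacy
        // encoded addps does not change behavior (a misaligned address faults anyway).
        Vector128<float> b = Sse.Add(a, Sse.LoadAlignedVector128(p));

        return b;
    }
}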
@mikedn @tannergooding Sorry, I was wrong.
In summary, we should fold …
See the various … In short: …
Note: CI results are inaccurate until https://github.com/dotnet/coreclr/issues/15618 is resolved (as the …). I am testing locally to ensure Windows x64 is working, but that is all the validation I am doing at this time.
@tannergooding I will change the importer and codegen of the HW intrinsics to a table-driven implementation, so @mikedn's solution ("add an integer data member to GenTreeHWIntrinsic and store the immediate in it") may be better to unify the exception code.
Implemented … All that's left now is …
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
argList = argList->Rest();
op3     = argList->Current();
ival    = op3->AsIntConCommon()->IconValue();
This fails for the Shuffle_r test, but passes for the Shuffle_ro test. Seems to be because op3 is a GT_CAST when optimizations are not enabled.
👍
Looks like this is because we aren't forcing expansion of the intrinsics (on first pass), which causes other downstream processing on the args to occur.
src/jit/hwintrinsicxarch.cpp
Outdated
assert(sig->numArgs == 1);
op1     = impSIMDPopStack(TYP_FLOAT);
retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, intrinsic, TYP_FLOAT, 16);
break;
I think all the helper intrinsics should be expanded to other "real" intrinsics in the importer. That would simplify the subsequent optimization work.
Do you mean expand to the existing SIMD intrinsics used by System.Numerics.Vector<T>?
No, I mean the helper intrinsics should be converted to other HW intrinsics. For example, Sse.SetAll(v) may be expanded to Sse.Shuffle(aNewLocalVar(v), 0), and Sse.SetZero<float> may be expanded to Sse.Xor(v, v).
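To make the intended equivalences concrete, here is a hedged C# sketch. SetAll and SetZero are the helper names used in this thread; the bodies use shipped API names (Vector128.CreateScalarUnsafe, Sse.Shuffle, Sse.Xor) as stand-ins for the importer-level expansion being described.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HelperExpansion
{
    // SetAll(v): broadcast v to all four lanes by loading it into one lane and then
    // shuffling lane 0 into every destination lane (shuffle control 0).
    public static Vector128<float> SetAll(float v)
    {
        Vector128<float> t = Vector128.CreateScalarUnsafe(v); // stand-in for "aNewLocalVar(v)"
        return Sse.Shuffle(t, t, 0);
    }

    // SetZero<float>(): xorps of a register with itself always yields all zeros.
    public static Vector128<float> SetZero()
    {
        Vector128<float> x = default;
        return Sse.Xor(x, x);
    }
}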
… Subtract scalar intrinsics
…nord scalar intrinsics
…4WithTruncation, Single, and Vector128Single scalar intrinsics
…d scalar intrinsics for op: eq, gt, ge, lt, le, and ne
…LoadScalar intrinsics
Had to resolve a merge conflict with the System.Math.Round intrinsic PR (#14736). Went ahead and added the function header comments and changed … Will merge once tests are green.
Do not want to bother you, but the frankenjit build is broken again: … So new functions should be under …
No worries. I should have remembered I needed to … The fix should be #15900. I tested locally against the TFS sources and it looks like the JIT builds successfully.
This implements basic codegen support for all currently declared SSE intrinsics except for the Store intrinsics.