
Fixing some inconsistencies in the x86 HWIntrinsic APIs #15923

Merged · 23 commits merged on Jan 30, 2018

Conversation

@tannergooding (Member):

  1. This resolves https://github.com/dotnet/corefx/issues/26433 by adding the missing Sse2.MoveMask API.

  2. This resolves a few issues in the Sse2 and Sse41 intrinsics where the signatures differed from the C++ signatures. This also makes the corresponding Sse intrinsics deviate from their C++ signatures, to be in line with the rest of the scalar intrinsics.

  3. This fixes the naming of arguments in LoadHigh and LoadLow to be consistent with the naming of other parameters where the operand supplies the upper or lower bits of the returned result.

@tannergooding (Member Author):

I discussed item 2 with @fiigii at some length.

My reasoning for the change comes down to consistency, clarity, and simplicity:

  1. It makes the APIs consistent

We basically have two types of scalar intrinsics: those that zero the upper bits, and those that set them to the value of one of the source registers. This matches the behavior of the underlying hardware instructions (some zero the upper bits and some set them from one of the sources).
All of the binary scalar intrinsics (such as addss, divss, mulss, and subss) set the value of the upper bits to the first source (left).
The unary scalar intrinsics have a split:

  • Those that take a primitive scalar will zero the upper bits (this includes intrinsics like LoadScalar, SetScalar, ConvertToVector128Int32Scalar, etc.)
  • Those that take a SIMD scalar will set the upper bits to the upper bits of one of the source registers

When matching the signature of the C++ intrinsics, all scalar intrinsics except for sqrtss, rsqrtss, and rcpss explicitly take two parameters (a and b), set the upper bits of the result to the upper bits of a, and perform the scalar operation on the lower bits of b. Leaving the three mentioned intrinsics mismatched meant the user had to remember that, for these in particular, the upper bits are not set to zero.
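To make the two behaviors concrete, here is a small illustrative sketch (not from the PR; it uses the shipped System.Runtime.Intrinsics surface, including the later Vector128.Create helper, purely to build inputs):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ScalarUpperBitsDemo
{
    static void Main()
    {
        var a = Vector128.Create(1f, 10f, 20f, 30f);
        var b = Vector128.Create(4f, 40f, 50f, 60f);

        // Binary scalar (addss): the lower element is a[0] + b[0];
        // the upper elements come from the first source, a.
        var sum = Sse.AddScalar(a, b);   // { 5, 10, 20, 30 }

        // Unary scalar with an explicit upper operand (sqrtss, as proposed
        // here): the lower element is sqrt(b[0]); the upper elements come from a.
        var root = Sse.SqrtScalar(a, b); // { 2, 10, 20, 30 }
    }
}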

  2. It makes the intent of the API clear

For the scalar intrinsics which do set the upper bits, having an explicit upper value makes the result of the operation clear. That is, Sqrt(value) doesn't clearly tell you what the value of the upper bits will be, and the user has to read the documentation to determine the precise behavior. Sqrt(upper, value), on the other hand, makes it very clear what the upper bits will be and which value will be operated on.

  3. It makes the JIT support for these simpler

Not having an explicit parameter also means that the handling for those intrinsics needs its own code path for codegen (they are 1-argument intrinsics that call SIMD_R_R_R) and they cannot easily undergo containment (only the second operand can be contained, but both src1 and src2 are op1).

Additionally, it means the user has to manually write additional code to handle the scenario where they want the upper bits to differ from the existing upper bits of value, as sketched below. The JIT would need to recognize and support this pattern, or the user would be stuck with larger (by number of bytes) and potentially slower code (more instructions filling the pipeline).
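A sketch of that two-step pattern (method names as discussed in this thread; assumes the same usings as the example above):

// With only a 1-arg SqrtScalar, merging in different upper bits takes an extra movss:
static Vector128<float> SqrtScalarTwoStep(Vector128<float> upper, Vector128<float> value)
{
    var merged = Sse.MoveScalar(upper, value); // movss: lower from value, upper from upper
    return Sse.SqrtScalar(merged);             // sqrtss on the merged vector
}

// The explicit 2-arg form collapses this to a single instruction:
//   Sse.SqrtScalar(upper, value)  =>  vsqrtss target, upper, value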

@tannergooding (Member Author):

FYI. @CarolEidt, @fiigii, @eerhardt

@@ -1101,7 +1105,7 @@ public static class Sse2
/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);
public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);
fiigii:

From the performance perspective, we always want to duplicate the source register for some of these instructions with VEX encoding. For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1. So SqrtScalar(upper, value) may introduce performance issues that are hard to detect.

From the user experience perspective, SqrtScalar(value) usually is more straightforward and useful than SqrtScalar(upper, value).

In summary, I think we can have SqrtScalar(value) and SqrtScalar(upper, value) both if we really need the 2-arg version.

@tannergooding (Member Author):

we always want to duplicate the source register for some of these instructions with VEX-encoding.

I think this should say: The recommendation is to duplicate the source register for some of these instructions.

I think it's one of those things where an analyzer comes in handy (just like the user could do SetAll(0.0f) instead of SetZero(), even though the latter is more performant).

For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1

However, vsqrtss target, upper, value (when upperReg != valueReg) will produce smaller codegen than movss value, upper, value; vsqrtss target, value, value, which is what users would need to write otherwise (via value = MoveScalar(upper, value); target = SqrtScalar(value)). It is also (theoretically) faster since there are fewer instructions filling the pipeline (the equivalent non-VEX instructions behave similarly).

In summary, I think we can have SqrtScalar(value) and SqrtScalar(upper, value) both if we really need the 2-arg version.

I was in agreement that this could/should be left open as an option. It is not possible to do for the binary scalar SIMD instructions (like AddScalar), but it does allow easy exposure of the "recommended" use for the unary scalar SIMD instructions (with the caveat that it "may" reduce the consistency/clarity of the instructions' behavior).

@4creators (Jan 19, 2018):

I agree with @fiigii that a good solution would be to have two scalar overloads: (i) a two-arg version and (ii) a one-arg version.

fiigii:

It is not possible to do for the binary scalar simd instructions (like AddScalar),

Absolutely, I meant just for certain 1-arg SIMD scalar intrinsics, such as SqrtScalar, RoundCurrentDirectionScalar, ReciprocalScalar, etc.


I'm fine with the two overloads, though I begin to be a bit concerned about even further exploding the API surface.

@tannergooding (Member Author):

though I begin to be a bit concerned about even further exploding the API surface.

That is one of my concerns as well.

These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...
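A minimal sketch of that forwarding shape (hypothetical; MethodImpl comes from System.Runtime.CompilerServices, and, as the next comment notes, this doesn't fit the implicit [Intrinsic] model):

using System.Runtime.CompilerServices;

// Hypothetical forwarding overload; not what the PR ended up doing.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<double> SqrtScalar(Vector128<double> value)
    => SqrtScalar(value, value);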

@fiigii (Jan 19, 2018):

These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...

That will make the JIT/VM more complex, because all the methods in this namespace are automatically labeled with [Intrinsic]. If we want both versions, implementing them in the JIT via the table-driven framework is really easy.


a bit concerned about even further exploding the API surface

We are much more concise than the C++ implementation. The proposed overloads are the closest match to the original C++ intrinsics, and there will be only a few of them. This should not have any significant impact on the number of methods, which would be around 1,300 for the currently accepted Intel ISAs. I have not counted ARM, but it will be significantly more concise, as more methods can be expressed as generics under our API design rules.

@tannergooding (Member Author):

Ah, I forgot about the implicit [Intrinsic] handling for all methods here.

@@ -344,7 +344,7 @@ public static class Sse
/// <summary>
/// __m128 _mm_rcp_ss (__m128 a)
/// </summary>
public static Vector128<float> ReciprocalScalar(Vector128<float> value) => ReciprocalScalar(value);
public static Vector128<float> ReciprocalScalar(Vector128<float> upper, Vector128<float> value) => ReciprocalScalar(upper, value);


This leads to confusion where the C++ intrinsic doesn't include the additional argument, and makes me wonder whether we should include both the C++ intrinsic as well as the actual opcode, and use the parameter names, e.g. for this case we would have:

         /// <summary>
         /// __m128 _mm_rcp_ss (__m128 value)
         /// rcpss upper, value
         /// </summary>

Thoughts?

@tannergooding (Member Author):

There are only 3 intrinsics which differ (rcp_ss, rsqrt_ss, and sqrt_ss); we could probably just add an explicit comment:

/// <summary>
/// __m128 _mm_rcp_ss (__m128 value)
/// We deviate from the native signature and allow the user to specify upper, rather than having it implicitly come from value. This is for consistency with the other scalar intrinsics.
/// </summary>


@tannergooding (Member Author):

Addressed PR feedback.

For the SSE2 and SSE4.1 intrinsics, I added an overload which only takes value. These also have a comment stating: "The above native signature does not exist. We provide this additional overload for the recommended use case of this intrinsic."

For the SSE intrinsics, I added an overload which takes upper and value. These also have a comment stating: "The above native signature does not exist. We provide this additional overload for consistency with the other scalar APIs."

This adds a total of 14 new APIs.
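For example, the Sse2 side gains an overload shaped like this (reconstructed from the description above, in the file's recursive-placeholder style):

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// The above native signature does not exist. We provide this additional overload for the recommended use case of this intrinsic.
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);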

@4creators:

Actually there are a couple of other missing APIs in Sse2 - see https://github.com/dotnet/corefx/issues/25926#issuecomment-359447469. These are:

// __m128i _mm_move_epi64 (__m128i a)
public static Vector128<long> MoveScalar(Vector128<long> upper, Vector128<long> value) => MoveScalar(upper, value);
public static Vector128<ulong> MoveScalar(Vector128<ulong> upper, Vector128<ulong> value) => MoveScalar(upper, value);

// __m64 _mm_movepi64_pi64 (__m128i a)
public static long* StoreScalar(Vector128<long> value);
public static ulong* StoreScalar(Vector128<ulong> value);

// void _mm_clflush (void const* p)
public static void FlushCache(void* p);

// void _mm_mfence (void)
public static void MemoryFence();

@tannergooding (Member Author):

FlushCache is not controlled by the same CPUID flag as the SSE2 ISA, so if it were exposed, it would need to happen in a new ISA class (FlushCache is behind the CLFSH flag).

As for MemoryFence, I think we had explicitly not exposed some of the fencing instructions at this time.

_mm_movepi64_pi64 is not a Store instruction; it is an MMX interop instruction and doesn't expose any non-register-to-register encodings.

The _mm_move_epi64 overloads do look to be missing. Will update when I get a chance.

@4creators:

You are right about memory instructions and _mm_movepi64_pi64.

@fiigii (Jan 22, 2018):

There is one more inconsistency:

// SSE4.1
public static unsafe Vector128<sbyte> LoadAlignedNonTemporal(sbyte* address) {}

// AVX2
public static unsafe Vector256<sbyte> LoadAlignedVector256NonTemporal(sbyte* address) {}

@tannergooding could you help me change LoadAlignedNonTemporal to LoadAlignedVector128NonTemporal in this PR?

Sorry for the inconvenience.

@tannergooding (Member Author, Jan 24, 2018):

Updated. Added documentation on what instruction an intrinsic maps to.

The below list shows the intrinsics that are not exposed.

  • MMX instructions aren't exposed because they generally aren't useful and cause a transition penalty

  • Explicitly not exposed instructions are generally special memory fencing instructions, which we decided to review at a later date

  • For partially exposed instructions, the missing encodings are just the reg/reg forms of the move instructions.

  • SSE

    • MMX Extensions (cause transition from x87 FPU to MMX)
      • CVTPI2PS
      • CVTPS2PI
      • CVTTPS2PI
      • MASKMOVQ
      • MOVNTQ
      • PAVGB
      • PAVGW
      • PEXTRW
      • PINSRW
      • PMAXUB
      • PMAXSW
      • PMINUB
      • PMINSW
      • PMOVMSKB
      • PMULHUW
      • PSADBW
      • PSHUFW
    • Not Exposed (explicit)
      • LDMXCSR
      • PREFETCHh (since exposed)
      • SFENCE (since exposed)
      • STMXCSR
    • Partially Exposed (missing encodings listed)
      • MOVAPS xmm, xmm not exposed, handled by block copy
      • MOVUPS xmm, xmm not exposed, handled by block copy
  • SSE2

    • MMX Extensions (cause transition from x87 FPU to MMX)
      • CVTPI2PD
      • CVTPD2PI
      • CVTTPD2PI
      • MOVQ2DQ
      • MOVDQ2Q
    • Not Exposed (explicit)
      • CLFLUSH (CLFSH CPUID flag)
      • LFENCE (since exposed)
      • MFENCE (since exposed)
      • PAUSE
    • Not Exposed (unknown)
      • MOVNTI (since exposed)
    • Partially Exposed (missing encodings listed)
      • MOVAPD xmm, xmm not exposed, handled by block copy
      • MOVUPD xmm, xmm not exposed, handled by block copy
  • SSE3

    • x87 FPU Extensions
      • FISTTP
    • Not Exposed (explicit)
      • MONITOR (MONITOR CPUID flag)
      • MWAIT (MONITOR CPUID flag)
  • SSE4.1

    • Not Exposed (unknown)
      • PEXTRW not exposed, handled by containment
  • AVX

    • Partially Exposed (missing encodings listed)
      • VMOVDQU ymm, ymm not exposed, handled by block copy
      • VMOVUPS ymm, ymm not exposed, handled by block copy
      • VMOVUPD ymm, ymm not exposed, handled by block copy
      • VMOVDQA ymm, ymm not exposed, handled by block copy
      • VMOVAPS ymm, ymm not exposed, handled by block copy
      • VMOVAPD ymm, ymm not exposed, handled by block copy
  • AVX2

    • Partially Exposed (missing encodings listed)
      • VPBROADCASTB xmm, m8 not exposed, handled by containment
      • VPBROADCASTB ymm, m8 not exposed, handled by containment
      • VPBROADCASTW xmm, m16 not exposed, handled by containment
      • VPBROADCASTW ymm, m16 not exposed, handled by containment
      • VPBROADCASTD xmm, m32 not exposed, handled by containment
      • VPBROADCASTD ymm, m32 not exposed, handled by containment
      • VPBROADCASTQ xmm, m64 not exposed, handled by containment
      • VPBROADCASTQ ymm, m64 not exposed, handled by containment

@4creators (Jan 24, 2018):

SSE4.1 PEXTRW reg/m16, xmm, imm8 allows extracting directly to memory, which is not possible with the SSE2 instructions. All the other extracts (PEXTRB/D/Q) allow extracting to memory or to a register, so this last instruction would unify the extract API. Perhaps we should include it as well.

@tannergooding (Member Author):

@fiigii, thoughts?

/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128

Maybe putting the instruction on a new line looks better?

/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// VFMADDPD xmm, xmm, xmm/m128
/// </summary>

@tannergooding (Member Author):

Possibly. It doesn't look as good for the intrinsics with multiple mappings, however.

The current format will work fairly well for a search-and-replace to improve the documentation, however.


@eerhardt could you suggest the comment format?

CarolEidt:

FWIW, my preference would also be to have the instruction(s) on a separate line.


I agree that separate lines are more readable.

@tannergooding (Member Author):

Fixed.

/// </summary>
public static Vector256<double> ConvertToVector256Double(Vector256<float> value) => ConvertToVector256Double(value);
public static Vector256<double> ConvertToVector256Double(Vector128<float> value) => ConvertToVector256Double(value);
@tannergooding (Member Author):

The parameter was incorrect; changed from Vector256 to Vector128.


Thank you!!

/// </summary>
public static Vector256<double> ConvertToVector256Double(Vector256<int> value) => ConvertToVector256Double(value);
public static Vector256<double> ConvertToVector256Double(Vector128<int> value) => ConvertToVector256Double(value);
@tannergooding (Member Author):

The parameter was incorrect; changed from Vector256 to Vector128.

/// </summary>
public static Vector256<byte> Max(Vector256<byte> left, Vector256<short> right) => Max(left, right);
@tannergooding (Member Author):

Second parameter's base type was incorrect; changed from short to byte.

@fiigii (Jan 24, 2018):

@fiigii, thoughts?

Do you mean the unexposed instructions? I think we talked about most of them previously.
For SSE4.1 PEXTRW reg/m16, xmm, imm8, I think we should follow the current design and not expose this encoding form as an intrinsic. This instruction should be generated by folding Store*, as sketched below.
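A hedged sketch of the folding described (the Extract call and method shape are illustrative):

static unsafe void StoreElement(Vector128<ushort> vector, ushort* address)
{
    // Written as extract-then-store; a folding JIT could recognize this
    // pattern and emit the single SSE4.1 form "pextrw m16, xmm, imm8"
    // instead of pextrw reg, xmm, imm8 followed by a mov to memory.
    *address = Sse2.Extract(vector, 1);
}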

/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);
@tannergooding (Member Author):

Fixed name to be LoadScalarVector128


Not necessary, we do not have LoadScalarVector256, and Scalar already implies Vector128.

@tannergooding (Member Author):

I changed it here for consistency with the other APIs' naming convention. I figured it was better to be consistent.


Fair enough.

CarolEidt:

I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since when encountered in code it may not always be clear that it is returning a vector.

@tannergooding (Member Author):

Fixed on the remaining ones.

/// </summary>
public static unsafe Vector128<float> LoadHigh(Vector128<float> value, float* address) => LoadHigh(value, address);
@tannergooding (Member Author):

Name of first parameter changed to lower.

/// </summary>
public static unsafe Vector128<float> LoadLow(Vector128<float> value, float* address) => LoadLow(value, address);
@tannergooding (Member Author):

Name of first parameter changed to upper.

/// </summary>
public static Vector128<float> SetScalar(float value) => SetScalar(value);
@tannergooding (Member Author):

Name changed to SetScalarVector128.


Not necessary.

/// </summary>
public static unsafe Vector128<double> LoadHigh(Vector128<double> value, double* address) => LoadHigh(value, address);
@tannergooding (Member Author):

Name of first parameter changed to lower.

/// </summary>
public static unsafe Vector128<double> LoadLow(Vector128<double> value, double* address) => LoadLow(value, address);
@tannergooding (Member Author):

Name of first parameter changed to upper.

/// </summary>
public static Vector128<double> SetScalar(double value) => SetScalar(value);
@tannergooding (Member Author):

Name changed to SetScalarVector128.


IMO, from the user perspective it is better to have names with Vector128/Vector256 even when Vector128 is implied.

@fiigii (Jan 25, 2018):

Thank you so much for the work!

/// </summary
public static Vector128<ushort> UnpackHigh(Vector128<ushort> left, Vector128<ushort> right) => UnpackHigh(left, right);
/// <summary>
/// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b)
/// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b); PUNPCKHDQ xmm, xmm/m128

Could you fix the above /// </summary?

/// </summary>
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }


And here.

/// </summary>
public static Vector256<long> Insert(Vector256<long> value, Vector128<long> data, byte index) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
/// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8); VINSERTI128 ymm, ymm, m128, imm8
/// </summary>
public static unsafe Vector256<long> Insert(Vector256<long> value, long* address,

Thanks!

@CarolEidt left a comment:

One question and one suggestion

/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128


FWIW, my preference would also be to have the instruction(s) on a separate line.

/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);


I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since when encountered in code it may not always be clear that it is returning a vector.

/// __m128i _mm_mul_epu32 (__m128i a, __m128i b)
/// __m128i _mm_move_epi64 (__m128i a); MOVQ xmm, xmm
/// </summary>
public static Vector128<long> MoveScalar(Vector128<long> value) => MoveScalar(value);


What is the reasoning for this being MoveScalar, but changing SetScalar to SetScalarVector128, when I believe in both cases the entire 128-bit vector is defined?

@tannergooding (Member Author):

The former (MoveScalar) takes and returns Vector128, so it shouldn't cause overload resolution problems. The latter (SetScalar) takes a primitive and returns a Vector128, so it could potentially conflict.

I can change it if you think it's clearer to provide it here as well.

CarolEidt:

I think it would be more consistent, but I don't feel strongly about it.

/// </summary>
public static Vector128<double> BlendVariable(Vector128<double> left, Vector128<double> right, Vector128<double> mask) { throw new PlatformNotSupportedException(); }

/// <summary>
/// __m128 _mm_ceil_ps (__m128 a)
/// __m128 _mm_ceil_ps (__m128 a); ROUNDPS xmm, xmm/m128, imm8(10)

CarolEidt:

Would it be clearer to write the implicit immediate as imm8=0b10?
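That is, the comment would hypothetically become:

/// <summary>
/// __m128 _mm_ceil_ps (__m128 a)
/// ROUNDPS xmm, xmm/m128, imm8=0b10
/// </summary>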

@tannergooding (Member Author):

Probably. 👍

@fiigii (Jan 25, 2018):

@tannergooding @CarolEidt @mikedn Shall we change the LoadScalar APIs to take a ref parameter instead of a pointer (sketched below)?
That would keep them consistent with BroadcastElementToVector128/256, so that loads/stores from/to a scalar value go via ref, since they do not need alignment.
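A sketch of the suggested shape (hypothetical signature, in the file's recursive-placeholder style; the pointer-based form is what the PR currently has):

// Current pointer-based form:
public static unsafe Vector128<float> LoadScalarVector128(float* address) => LoadScalarVector128(address);

// Suggested ref-based form (no alignment requirement, and no unsafe
// context needed at the call site):
public static Vector128<float> LoadScalarVector128(ref float source) => LoadScalarVector128(ref source);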

/// </summary>
public static Vector128<short> PackSignedSaturate(Vector128<int> left, Vector128<int> right) => PackSignedSaturate(left, right);

/// <summary>
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b)
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b);
/// PACKUSWB xmm, xmm/m128
/// </summary>
public static Vector128<byte> PackUnsignedSaturate(Vector128<short> left, Vector128<short> right) => PackUnsignedSaturate(left, right);


It seems that the void _mm_pause (void) / PAUSE intrinsic is missing.

@tannergooding (Member Author):

Explicitly avoiding this one as it could have undesired impacts on threading/GC, etc.

@tannergooding (Member Author):

It needs a separate, in-depth discussion to determine if it is okay to expose.

@jkotas (Member, Jan 30, 2018):

pause is already exposed as Thread.SpinWait(). It is not inlined by the JIT today, but that can be fixed.

It is incredibly hard to use this instruction directly because it has very different performance characteristics on different processor models (https://github.com/dotnet/coreclr/issues/13388).


@tannergooding @fiigii - void _mm_pause (void) presence is not controlled by any CPUID flag and it has been available since the Pentium 4, so there is no reason to exclude it. Anyway, I remember that in the corefx repo there were several issues with correctly tuning spin locks, and this instruction could be a perfect solution to those problems.

@tannergooding (Member Author, Jan 30, 2018):

@jkotas, if it is already "exposed" and shouldn't cause issues with threading or the GC, would it be good to expose it here anyway, for completeness?

Edit: Page didn't refresh before I commented, sounds like exposing won't buy us much. We can always come back to this later if enough people request it.

@jkotas (Member):

I do not think so.


I think in such a case the problem is already "solved" by the instruction being hard to use.


Seems that GitHub is working really slowly today and I got out of the context very fast.


Thread.SpinWait() that gives you some consistency to get your spin loop tuned properly
has very different performance characteristics on different processor models

Right, we already did lots of work to tune the spin.

/// </summary>
public static Vector128<double> SetScalar(double value) => SetScalar(value);
public static Vector128<double> SetScalarVector128(double value) => SetScalarVector128(value);


Naming: I will stop commenting on the order in the naming convention.

/// </summary>
public static Vector128<short> ShuffleLow(Vector128<short> value, byte control) => ShuffleLow(value, control);
/// <summary>
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control)
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control);
/// PSHUFLW xmm, xmm/m128, imm8
/// </summary>
public static Vector128<ushort> ShuffleLow(Vector128<ushort> value, byte control) => ShuffleLow(value, control);


Functions are not in alphabetical order; perhaps it would be easier to work with them if they were sorted.

@tannergooding (Member Author):

That can be dealt with by a separate PR.

/// void _mm_prefetch(char* p, int i)
/// PREFETCHT1 m8
/// </summary>
public static unsafe void Prefetch1(void* address) => Prefetch1(address);
@fiigii (Jan 30, 2018):

Shall we update certain Load*/Store* intrinsics to void* as well?
For example,

// Avx
        public static unsafe Vector256<sbyte> LoadVector256(sbyte* address) => LoadVector256(address);
        public static unsafe Vector256<byte> LoadVector256(byte* address) => LoadVector256(address);
        public static unsafe Vector256<short> LoadVector256(short* address) => LoadVector256(address);
        public static unsafe Vector256<ushort> LoadVector256(ushort* address) => LoadVector256(address);
        public static unsafe Vector256<int> LoadVector256(int* address) => LoadVector256(address);
        public static unsafe Vector256<uint> LoadVector256(uint* address) => LoadVector256(address);
        public static unsafe Vector256<long> LoadVector256(long* address) => LoadVector256(address);
        public static unsafe Vector256<ulong> LoadVector256(ulong* address) => LoadVector256(address);
        public static unsafe Vector256<float> LoadVector256(float* address) => LoadVector256(address);
        public static unsafe Vector256<double> LoadVector256(double* address) => LoadVector256(address);

=>

public static unsafe Vector256<T> LoadVector256<T>(void* address) => LoadVector256<T>(address);

cc @jkotas @CarolEidt


In this case we load into a vector of a predefined type, and enforcing the pointer to be of the same type seems correct.


Agree, thanks!

/// </summary>
public static Vector128<float> BroadcastElementToVector128(ref float source) { throw new PlatformNotSupportedException(); }
public static unsafe Vector128<float> BroadcastElementToVector128(float* source) { throw new PlatformNotSupportedException(); }
@tannergooding (Member Author):

@fiigii, should these be called BroadcastScalarToVector128?


(For consistency with the other operations that take a scalar.)


IMO in this case we have prior art in the form of the established Intel docs, which use the term element, and IMHO programmers have probably gotten used to reading those instructions in this way - at least that is my experience.


That would be more consistent, I think

@tannergooding (Member Author):

Will update.


Agree.


And the Avx2 counterparts should be updated as well.

@tannergooding (Member Author):

Put up the CoreFX side PR: dotnet/corefx#26668

It is much easier to see just the API changes via it (whereas this one also contains the doc comment updates).

@tannergooding (Member Author):

@4creators, looking at the names, it actually looks to be fairly consistent already.

For example, LoadAlignedVector128NonTemporal:

  • Load is the action
  • Aligned Vector128 is the type
  • Non Temporal is the modifier

For LoadScalarVector128

  • Load is the action
  • Scalar Vector128 is the type

@4creators:

looking at the names, it actually looks to be fairly consistent already

The point is that, from my perspective and experience, Vector128 mostly represents an xmm register, and the action is load aligned or load unaligned (which we don't use). It is a personal perspective on both sides, so perhaps we should ask the API designers?

@tannergooding (Member Author):

perspective and experience Vector128 represents mostly xmm register and action is load aligned or load unaligned

I think it is also consistent from that perspective, except for a couple (like NonTemporal, where it would then be LoadAlignedNonTemporalVector128).

@tannergooding (Member Author):

@CarolEidt, @fiigii: do you have a preference on whether it should be LoadAlignedNonTemporalVector128 or LoadAlignedVector128NonTemporal (we currently use the latter)?

@fiigii (Jan 30, 2018):

I prefer LoadAlignedVector128NonTemporal.

/// </summary>
public static float ConvertToSingle(Vector128<float> value) => ConvertToSingle(value);

/// <summary>
/// __m128 _mm_cvtsi32_ss (__m128 a, int b)
/// CVTSI2SS xmm, reg/m32
/// </summary>
public static Vector128<float> ConvertToVector128SingleScalar(Vector128<float> upper, int value) => ConvertToVector128SingleScalar(upper, value);
@tannergooding (Member Author):

ConvertScalarToVector128Single might be a better name. Thoughts?


Agree.

@tannergooding (Member Author):

OK. I think all feedback and questions have once again been answered.

Both this and the CoreFX PR (dotnet/corefx#26668) have been updated to be in sync.

I think all missing APIs that we want to expose have been exposed. Any that aren't exposed have been discussed and agreed upon.

I think all wanted renames have now happened (except potentially #15923 (comment), on which I would like to hear from @CarolEidt and @eerhardt).

@CarolEidt:

I think it is also consistent from that perspective, except for a couple (like NonTemporal, where it should be LoadAlignedNonTemporalVector128, from that perspective)

I'm not sure I understand what perspective you're talking about. I thought that we were moving toward "Aligned Vector128" as the type and "NonTemporal" as the modifier; with action + type + modifier it would be LoadAlignedVector128NonTemporal, which I think also reads better than LoadAlignedNonTemporalVector128.

@tannergooding (Member Author):

@CarolEidt - @4creators was reading it differently, in that LoadAligned was the action and Vector128 was the type, whereas I was viewing Load as the action and AlignedVector128 as the type (his perspective vs. mine).

For the most part, our two perspectives actually line up, regardless of how you view the action vs. type. The place they don't is NonTemporal, where you could view LoadAlignedNonTemporal as the action and Vector128 as the type (from @4creators' perspective), or you could view it as Load (action), AlignedVector128 (type), and NonTemporal (modifier).

This is also, actually, the only instruction I saw that has a "modifier", so it might make sense to change it to be LoadAlignedNonTemporalVector128 (we originally called it just LoadAlignedNonTemporal, before we added the type in to disambiguate).

@CarolEidt:

@tannergooding - I see now, thanks. I don't have a strong opinion either way, but if I were to be pedantic about it, I would say that alignment is very much a property of the object being loaded, while NonTemporal is the way in which the same object may or may not be loaded. So it seems to me that the Aligned and Vector128 should bind more closely than the NonTemporal. If anything the NonTemporal would be part of the action (if not a modifier) and therefore: LoadNonTemporalAlignedVector128. That is unfortunately a bit odd because if I'm looking for aligned loads I might specifically be looking for LoadAligned, which leads me again to the "modifier" approach.

But, as I say, I don't have a strong opinion.

@tannergooding (Member Author):

That is unfortunately a bit odd because if I'm looking for aligned loads I might specifically be looking for LoadAligned, which leads me again to the "modifier" approach.

I agree. I'll stick with LoadAlignedVector128NonTemporal, which is what @fiigii voted for as well.

That should be all the issues and this should be good for final review (assuming no other issues crop up 😄).

@eerhardt (Member) left a comment:

The API changes look good to me.

Thanks @tannergooding for pushing this forward.

@benaadams (Member):

@tannergooding, why is PAUSE "Not Exposed (explicit)"?

Thread.SpinWait(1) is a bit of a heavyweight way to get there

@tannergooding (Member Author):

@benaadams, I believe most of the conversation was captured here: https://github.com/dotnet/coreclr/issues/17818#issuecomment-392617183. Also see the comment chain here: #15923 (comment)

Successfully merging this pull request may close these issues:

Add Sse2.MoveMask that takes Vector128<byte>