
Fixing some inconsistencies in the x86 HWIntrinsic APIs #15923

Merged · 23 commits merged on Jan 30, 2018

Conversation

@tannergooding (Member):

  1. This resolves https://github.com/dotnet/corefx/issues/26433 by adding the missing Sse2.MoveMask API.

  2. This resolves a few issues in the Sse2 and Sse41 intrinsics where the signatures differed from the C++ signatures. This also makes the corresponding Sse intrinsics deviate from their C++ signatures, to be in line with the rest of the scalar intrinsics.

  3. This fixes the naming of arguments in LoadHigh and LoadLow to be consistent with the naming of other parameters where the operand supplies the upper or lower bits of the returned result.

@tannergooding (Member Author):

I discussed item 2 with @fiigii at some length.

My reasoning for the change comes down to consistency, clarity, and simplicity:

  1. It makes the APIs consistent

We basically have two types of scalar intrinsics: those that zero the upper bits, and those that set them to the value of one of the source registers. This matches the behavior of the underlying hardware instructions (some zero the upper bits and some set them from one of the sources).
All of the binary scalar intrinsics (such as addss, divss, mulss, and subss) set the value of the upper bits to the first source (left).
The unary scalar intrinsics have a split:

  • Those that take a primitive scalar will zero the upper bits (this includes intrinsics like LoadScalar, SetScalar, ConvertToVector128Int32Scalar, etc.)
  • Those that take a SIMD scalar will set the upper bits to the upper bits of one of the source registers

When matching the signature of the C++ intrinsics, all scalar intrinsics except for sqrtss, rsqrtss, and rcpss explicitly take two parameters (a and b), set the upper bits of the result to the upper bits of a, and perform the scalar operation on the lower bits of b. Leaving the three mentioned intrinsics mismatched meant the user had to remember that, for these in particular, the upper bits are not set to zero.
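To make the two behaviors concrete, here is a small illustrative sketch (not from the PR; it uses the shipped System.Runtime.Intrinsics surface, including the later Vector128.Create helper, purely to build inputs):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ScalarUpperBitsDemo
{
    static void Main()
    {
        var a = Vector128.Create(1f, 10f, 20f, 30f);
        var b = Vector128.Create(4f, 40f, 50f, 60f);

        // Binary scalar (addss): the lower element is a[0] + b[0];
        // the upper elements come from the first source, a.
        var sum = Sse.AddScalar(a, b);   // { 5, 10, 20, 30 }

        // Unary scalar with an explicit upper operand (sqrtss, as proposed
        // here): the lower element is sqrt(b[0]); the upper elements come from a.
        var root = Sse.SqrtScalar(a, b); // { 2, 10, 20, 30 }
    }
}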

  2. It makes the intent of the API clear

For the scalar intrinsics which do set the upper bits, having an explicit upper value makes the result of the operation clear. That is, Sqrt(value) doesn't clearly tell you what the value of the upper bits will be, and the user has to read the documentation to determine the precise behavior. Sqrt(upper, value), on the other hand, makes it very clear what the upper bits will be and which value will be operated on.

  3. It makes the JIT support for these simpler

Not having an explicit parameter also means that the handling for those intrinsics needs its own code path for codegen (they are 1-argument intrinsics that call SIMD_R_R_R) and they cannot easily undergo containment (only the second operand can be contained, but both src1 and src2 are op1).

Additionally, it means the user has to manually write additional code to handle the scenario where they want the upper bits to differ from the existing upper bits of value, as sketched below. The JIT would need to recognize and support this pattern, or the user would be stuck with larger (by number of bytes) and potentially slower code (more instructions filling the pipeline).
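A sketch of that two-step pattern (method names as discussed in this thread; assumes the same usings as the example above):

// With only a 1-arg SqrtScalar, merging in different upper bits takes an extra movss:
static Vector128<float> SqrtScalarTwoStep(Vector128<float> upper, Vector128<float> value)
{
    var merged = Sse.MoveScalar(upper, value); // movss: lower from value, upper from upper
    return Sse.SqrtScalar(merged);             // sqrtss on the merged vector
}

// The explicit 2-arg form collapses this to a single instruction:
//   Sse.SqrtScalar(upper, value)  =>  vsqrtss target, upper, value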

@tannergooding (Member Author):

FYI. @CarolEidt, @fiigii, @eerhardt

@@ -1101,7 +1105,7 @@ public static class Sse2
/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);
public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);
fiigii:

From the performance perspective, we always want to duplicate the source register for some of these instructions with VEX encoding. For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1. So SqrtScalar(upper, value) may introduce performance issues that are hard to detect.

From the user experience perspective, SqrtScalar(value) usually is more straightforward and useful than SqrtScalar(upper, value).

In summary, I think we can have SqrtScalar(value) and SqrtScalar(upper, value) both if we really need the 2-arg version.

@tannergooding (Member Author):

we always want to duplicate the source register for some of these instructions with VEX-encoding.

I think this should say: The recommendation is to duplicate the source register for some of these instructions.

I think it's one of those things where an analyzer comes in handy (just like the user could do SetAll(0.0f) instead of SetZero(), even though the latter is more performant).

For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1

However, vsqrtss target, upper, value (when upperReg != valueReg) will produce smaller codegen than movss value, upper, value; vsqrtss target, value, value, which is what users would need to write otherwise (via value = MoveScalar(upper, value); target = SqrtScalar(value)). It is also (theoretically) faster since there are fewer instructions filling the pipeline (the equivalent non-VEX instructions behave similarly).

In summary, I think we can have SqrtScalar(value) and SqrtScalar(upper, value) both if we really need the 2-arg version.

I was in agreement that this could/should be left open as an option. It is not possible to do for the binary scalar SIMD instructions (like AddScalar), but it does allow easy exposure of the "recommended" use for the unary scalar SIMD instructions (with the caveat that it "may" reduce the consistency/clarity of the instructions' behavior).

@4creators (Jan 19, 2018):

I agree with @fiigii that a good solution would be to have two scalar overloads: (i) a two-arg version and (ii) a one-arg version.

fiigii:

It is not possible to do for the binary scalar simd instructions (like AddScalar),

Absolutely, I meant just for certain 1-arg SIMD scalar intrinsics, such as SqrtScalar, RoundCurrentDirectionScalar, ReciprocalScalar, etc.


I'm fine with the two overloads, though I begin to be a bit concerned about even further exploding the API surface.

@tannergooding (Member Author):

though I begin to be a bit concerned about even further exploding the API surface.

That is one of my concerns as well.

These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...
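A minimal sketch of that forwarding shape (hypothetical; MethodImpl comes from System.Runtime.CompilerServices, and, as the next comment notes, this doesn't fit the implicit [Intrinsic] model):

using System.Runtime.CompilerServices;

// Hypothetical forwarding overload; not what the PR ended up doing.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<double> SqrtScalar(Vector128<double> value)
    => SqrtScalar(value, value);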

@fiigii (Jan 19, 2018):

These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...

That will make the JIT/VM more complex, because all the methods in this namespace are automatically labeled with [Intrinsic]. If we want both versions, implementing them in the JIT via the table-driven framework is really easy.


a bit concerned about even further exploding the API surface

We are much more concise than the C++ implementation. The proposed overloads are the closest match to the original C++ intrinsics, and there will be only a few of them. This should not have any significant impact on the number of methods, which would be around 1,300 for the currently accepted Intel ISAs. I have not counted ARM, but it will be significantly more concise, as more methods can be expressed as generics under our API design rules.

@tannergooding (Member Author):

Ah, I forgot about the implicit [Intrinsic] handling for all methods here.

@@ -344,7 +344,7 @@ public static class Sse
/// <summary>
/// __m128 _mm_rcp_ss (__m128 a)
/// </summary>
public static Vector128<float> ReciprocalScalar(Vector128<float> value) => ReciprocalScalar(value);
public static Vector128<float> ReciprocalScalar(Vector128<float> upper, Vector128<float> value) => ReciprocalScalar(upper, value);


This leads to confusion where the C++ intrinsic doesn't include the additional argument, and makes me wonder whether we should include both the C++ intrinsic as well as the actual opcode, and use the parameter names, e.g. for this case we would have:

         /// <summary>
         /// __m128 _mm_rcp_ss (__m128 value)
         /// rcpss upper, value
         /// </summary>

Thoughts?

@tannergooding (Member Author):

There are only 3 intrinsics which differ (rcp_ss, rsqrt_ss, and sqrt_ss); we could probably just add an explicit comment:

/// <summary>
/// __m128 _mm_rcp_ss (__m128 value)
/// We deviate from the native signature and allow the user to specify upper, rather than having it implicitly come from value. This is for consistency with the other scalar intrinsics.
/// </summary>


@tannergooding (Member Author):

Addressed PR feedback.

For the SSE2 and SSE4.1 intrinsics, I added an overload which only takes value. These also have a comment stating: "The above native signature does not exist. We provide this additional overload for the recommended use case of this intrinsic."

For the SSE intrinsics, I added an overload which takes upper and value. These also have a comment stating: "The above native signature does not exist. We provide this additional overload for consistency with the other scalar APIs."

This adds a total of 14 new APIs.
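For example, the Sse2 side gains an overload shaped like this (reconstructed from the description above, in the file's recursive-placeholder style):

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// The above native signature does not exist. We provide this additional overload for the recommended use case of this intrinsic.
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);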

@4creators:

Actually there are a couple of other missing APIs in Sse2 - see https://github.com/dotnet/corefx/issues/25926#issuecomment-359447469. These are:

// __m128i _mm_move_epi64 (__m128i a)
public static Vector128<long> MoveScalar(Vector128<long> upper, Vector128<long> value) => MoveScalar(upper, value);
public static Vector128<ulong> MoveScalar(Vector128<ulong> upper, Vector128<ulong> value) => MoveScalar(upper, value);

// __m64 _mm_movepi64_pi64 (__m128i a)
public static long* StoreScalar(Vector128<long> value);
public static ulong* StoreScalar(Vector128<ulong> value);

// void _mm_clflush (void const* p)
public static void FlushCache(void* p);

// void _mm_mfence (void)
public static void MemoryFence();

@tannergooding (Member Author):

FlushCache is not controlled by the same CPUID flag as the SSE2 ISA, so if it were exposed, it would need to happen in a new ISA class (FlushCache is behind the CLFSH flag).

As for MemoryFence, I think we had explicitly not exposed some of the fencing instructions at this time.

_mm_movepi64_pi64 is not a Store instruction; it is an MMX interop instruction and doesn't expose any non-register-to-register encodings.

The _mm_move_epi64 overloads do look to be missing. Will update when I get a chance.

@4creators:

You are right about memory instructions and _mm_movepi64_pi64.

@fiigii (Jan 22, 2018):

There is one more inconsistency:

// SSE4.1
public static unsafe Vector128<sbyte> LoadAlignedNonTemporal(sbyte* address) {}

// AVX2
public static unsafe Vector256<sbyte> LoadAlignedVector256NonTemporal(sbyte* address) {}

@tannergooding could you help me change LoadAlignedNonTemporal to LoadAlignedVector128NonTemporal in this PR?

Sorry for the inconvenience.

@tannergooding (Member Author, Jan 24, 2018):

Updated. Added documentation on what instruction an intrinsic maps to.

The below list shows the intrinsics that are not exposed.

  • MMX instructions aren't exposed because they generally aren't useful and cause a transition penalty

  • Explicitly not exposed instructions are generally special memory fencing instructions, which we decided to review at a later date

  • For partially exposed instructions, the missing encodings are just the reg/reg forms of the move instructions.

  • SSE

    • MMX Extensions (cause transition from x87 FPU to MMX)
      • CVTPI2PS
      • CVTPS2PI
      • CVTTPS2PI
      • MASKMOVQ
      • MOVNTQ
      • PAVGB
      • PAVGW
      • PEXTRW
      • PINSRW
      • PMAXUB
      • PMAXSW
      • PMINUB
      • PMINSW
      • PMOVMSKB
      • PMULHUW
      • PSADBW
      • PSHUFW
    • Not Exposed (explicit)
      • LDMXCSR
      • PREFETCHh (since exposed)
      • SFENCE (since exposed)
      • STMXCSR
    • Partially Exposed (missing encodings listed)
      • MOVAPS xmm, xmm not exposed, handled by block copy
      • MOVUPS xmm, xmm not exposed, handled by block copy
  • SSE2

    • MMX Extensions (cause transition from x87 FPU to MMX)
      • CVTPI2PD
      • CVTPD2PI
      • CVTTPD2PI
      • MOVQ2DQ
      • MOVDQ2Q
    • Not Exposed (explicit)
      • CLFLUSH (CLFSH CPUID flag)
      • LFENCE (since exposed)
      • MFENCE (since exposed)
      • PAUSE
    • Not Exposed (unknown)
      • MOVNTI (since exposed)
    • Partially Exposed (missing encodings listed)
      • MOVAPD xmm, xmm not exposed, handled by block copy
      • MOVUPD xmm, xmm not exposed, handled by block copy
  • SSE3

    • x87 FPU Extensions
      • FISTTP
    • Not Exposed (explicit)
      • MONITOR (MONITOR CPUID flag)
      • MWAIT (MONITOR CPUID flag)
  • SSE4.1

    • Not Exposed (unknown)
      • PEXTRW not exposed, handled by containment
  • AVX

    • Partially Exposed (missing encodings listed)
      • VMOVDQU ymm, ymm not exposed, handled by block copy
      • VMOVUPS ymm, ymm not exposed, handled by block copy
      • VMOVUPD ymm, ymm not exposed, handled by block copy
      • VMOVDQA ymm, ymm not exposed, handled by block copy
      • VMOVAPS ymm, ymm not exposed, handled by block copy
      • VMOVAPD ymm, ymm not exposed, handled by block copy
  • AVX2

    • Partially Exposed (missing encodings listed)
      • VPBROADCASTB xmm, m8 not exposed, handled by containment
      • VPBROADCASTB ymm, m8 not exposed, handled by containment
      • VPBROADCASTW xmm, m16 not exposed, handled by containment
      • VPBROADCASTW ymm, m16 not exposed, handled by containment
      • VPBROADCASTD xmm, m32 not exposed, handled by containment
      • VPBROADCASTD ymm, m32 not exposed, handled by containment
      • VPBROADCASTQ xmm, m64 not exposed, handled by containment
      • VPBROADCASTQ ymm, m64 not exposed, handled by containment

@4creators (Jan 24, 2018):

SSE4.1 PEXTRW reg/m16, xmm, imm8 allows extracting directly to memory, which is not possible with the SSE2 instructions. All the other extracts (PEXTRB/D/Q) allow extracting to memory or to a register, so this last instruction would unify the extract API. Perhaps we should include it as well.

@tannergooding (Member Author):

@fiigii, thoughts?

/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128

Maybe putting the instruction on a new line looks better?

/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// VFMADDPD xmm, xmm, xmm/m128
/// </summary>

@tannergooding (Member Author):

Possibly. It doesn't look as good for the intrinsics with multiple mappings, however.

The current format will work fairly well for a search-and-replace to improve the documentation, however.


@eerhardt could you suggest the comment format?

CarolEidt:

FWIW, my preference would also be to have the instruction(s) on a separate line.


I agree that separate lines are more readable.

@tannergooding (Member Author):

Fixed.

/// </summary>
public static Vector256<double> ConvertToVector256Double(Vector256<float> value) => ConvertToVector256Double(value);
public static Vector256<double> ConvertToVector256Double(Vector128<float> value) => ConvertToVector256Double(value);
@tannergooding (Member Author):

The parameter was incorrect; changed from Vector256 to Vector128.


Thank you!!

/// </summary>
public static Vector256<double> ConvertToVector256Double(Vector256<int> value) => ConvertToVector256Double(value);
public static Vector256<double> ConvertToVector256Double(Vector128<int> value) => ConvertToVector256Double(value);
@tannergooding (Member Author):

The parameter was incorrect; changed from Vector256 to Vector128.

/// </summary>
public static Vector256<byte> Max(Vector256<byte> left, Vector256<short> right) => Max(left, right);
@tannergooding (Member Author):

Second parameter's base type was incorrect; changed from short to byte.

@fiigii (Jan 24, 2018):

@fiigii, thoughts?

Do you mean the unexposed instructions? I think we talked about most of them previously.
For SSE4.1 PEXTRW reg/m16, xmm, imm8, I think we should follow the current design and not expose this encoding form as an intrinsic. This instruction should be generated by folding Store*, as sketched below.
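A hedged sketch of the folding described (the Extract call and method shape are illustrative):

static unsafe void StoreElement(Vector128<ushort> vector, ushort* address)
{
    // Written as extract-then-store; a folding JIT could recognize this
    // pattern and emit the single SSE4.1 form "pextrw m16, xmm, imm8"
    // instead of pextrw reg, xmm, imm8 followed by a mov to memory.
    *address = Sse2.Extract(vector, 1);
}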

/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);
@tannergooding (Member Author):

Fixed name to be LoadScalarVector128


Not necessary, we do not have LoadScalarVector256, and Scalar already implies Vector128.

@tannergooding (Member Author):

I changed it here for consistency with the other APIs' naming convention. I figured it was better to be consistent.


Fair enough.

CarolEidt:

I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since when encountered in code it may not always be clear that it is returning a vector.

@tannergooding (Member Author):

Fixed on the remaining ones.

/// </summary>
public static unsafe Vector128<float> LoadHigh(Vector128<float> value, float* address) => LoadHigh(value, address);
@tannergooding (Member Author):

Name of first parameter changed to lower.

/// </summary>
public static unsafe Vector128<float> LoadLow(Vector128<float> value, float* address) => LoadLow(value, address);
@tannergooding (Member Author):

Name of first parameter changed to upper.

/// </summary>
public static Vector128<float> SetScalar(float value) => SetScalar(value);
@tannergooding (Member Author):

Name changed to SetScalarVector128.


Not necessary.

/// </summary>
public static unsafe Vector128<double> LoadHigh(Vector128<double> value, double* address) => LoadHigh(value, address);
@tannergooding (Member Author):

Name of first parameter changed to lower.

/// </summary>
public static unsafe Vector128<double> LoadLow(Vector128<double> value, double* address) => LoadLow(value, address);
@tannergooding (Member Author):

Name of first parameter changed to upper.

/// </summary>
public static Vector128<double> SetScalar(double value) => SetScalar(value);
@tannergooding (Member Author):

Name changed to SetScalarVector128.


IMO, from the user perspective it is better to have names with Vector128/Vector256 even when Vector128 is implied.

@fiigii (Jan 25, 2018):

Thank you so much for the work!

/// </summary
public static Vector128<ushort> UnpackHigh(Vector128<ushort> left, Vector128<ushort> right) => UnpackHigh(left, right);
/// <summary>
/// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b)
/// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b); PUNPCKHDQ xmm, xmm/m128

Could you fix the above /// </summary?

/// </summary>
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }


And here.

/// </summary>
public static Vector256<long> Insert(Vector256<long> value, Vector128<long> data, byte index) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
/// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8); VINSERTI128 ymm, ymm, m128, imm8
/// </summary>
public static unsafe Vector256<long> Insert(Vector256<long> value, long* address,

Thanks!

@CarolEidt left a comment:

One question and one suggestion

/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128


FWIW, my preference would also be to have the instruction(s) on a separate line.

/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);


I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since when encountered in code it may not always be clear that it is returning a vector.

/// __m128i _mm_mul_epu32 (__m128i a, __m128i b)
/// __m128i _mm_move_epi64 (__m128i a); MOVQ xmm, xmm
/// </summary>
public static Vector128<long> MoveScalar(Vector128<long> value) => MoveScalar(value);


What is the reasoning for this being MoveScalar, but changing SetScalar to SetScalarVector128, when I believe in both cases the entire 128-bit vector is defined?

@tannergooding (Member Author):

The former (MoveScalar) takes and returns Vector128, so it shouldn't cause overload resolution problems. The latter (SetScalar) takes a primitive and returns a Vector128, so it could potentially conflict.

I can change it if you think it's clearer to provide it here as well.

CarolEidt:

I think it would be more consistent, but I don't feel strongly about it.

/// </summary>
public static Vector128<double> BlendVariable(Vector128<double> left, Vector128<double> right, Vector128<double> mask) { throw new PlatformNotSupportedException(); }

/// <summary>
/// __m128 _mm_ceil_ps (__m128 a)
/// __m128 _mm_ceil_ps (__m128 a); ROUNDPS xmm, xmm/m128, imm8(10)

CarolEidt:

Would it be clearer to write the implicit immediate as imm8=0b10?
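That is, the comment would hypothetically become:

/// <summary>
/// __m128 _mm_ceil_ps (__m128 a)
/// ROUNDPS xmm, xmm/m128, imm8=0b10
/// </summary>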

@tannergooding (Member Author):

Probably. 👍

@fiigii (Jan 25, 2018):

@tannergooding @CarolEidt @mikedn Shall we change the LoadScalar APIs to take a ref parameter instead of a pointer (sketched below)?
That would keep them consistent with BroadcastElementToVector128/256, so that loads/stores from/to a scalar value go via ref, since they do not need alignment.
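A sketch of the suggested shape (hypothetical signature, in the file's recursive-placeholder style; the pointer-based form is what the PR currently has):

// Current pointer-based form:
public static unsafe Vector128<float> LoadScalarVector128(float* address) => LoadScalarVector128(address);

// Suggested ref-based form (no alignment requirement, and no unsafe
// context needed at the call site):
public static Vector128<float> LoadScalarVector128(ref float source) => LoadScalarVector128(ref source);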

/// </summary>
public static Vector128<short> PackSignedSaturate(Vector128<int> left, Vector128<int> right) => PackSignedSaturate(left, right);

/// <summary>
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b)
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b);
/// PACKUSWB xmm, xmm/m128
/// </summary>
public static Vector128<byte> PackUnsignedSaturate(Vector128<short> left, Vector128<short> right) => PackUnsignedSaturate(left, right);


It seems that the void _mm_pause (void) / PAUSE intrinsic is missing.

@tannergooding (Member Author):

Explicitly avoiding this one as it could have undesired impacts on threading/GC, etc.

@tannergooding (Member Author):

It needs a separate, in-depth discussion to determine if it is okay to expose.

@jkotas (Member, Jan 30, 2018):

pause is already exposed as Thread.SpinWait(). It is not inlined by the JIT today, but that can be fixed.

It is incredibly hard to use this instruction directly because it has very different performance characteristics on different processor models (https://github.com/dotnet/coreclr/issues/13388).


@tannergooding @fiigii - void _mm_pause (void) presence is not controlled by any CPUID flag and it has been available since the Pentium 4, so there is no reason to exclude it. Anyway, I remember that in the corefx repo there were several issues with correctly tuning spin locks, and this instruction could be a perfect solution to those problems.

@tannergooding (Member Author, Jan 30, 2018):

@jkotas, if it is already "exposed" and shouldn't cause issues with threading or the GC, would it be good to expose it here anyway, for completeness?

Edit: Page didn't refresh before I commented, sounds like exposing won't buy us much. We can always come back to this later if enough people request it.

@jkotas (Member):

I do not think so.


I think in such a case the problem is already "solved" by the instruction being hard to use.


Seems that GitHub is working really slowly today and I got out of the context very fast.


Thread.SpinWait() that gives you some consistency to get your spin loop tuned properly
has very different performance characteristics on different processor models

Right, we already did lots of work to tune the spin.

/// </summary>
public static Vector128<double> SetScalar(double value) => SetScalar(value);
public static Vector128<double> SetScalarVector128(double value) => SetScalarVector128(value);


Naming: I will stop commenting on the order in the naming convention.

/// </summary>
public static Vector128<short> ShuffleLow(Vector128<short> value, byte control) => ShuffleLow(value, control);
/// <summary>
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control)
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control);
/// PSHUFLW xmm, xmm/m128, imm8
/// </summary>
public static Vector128<ushort> ShuffleLow(Vector128<ushort> value, byte control) => ShuffleLow(value, control);


Functions are not in alphabetical order; perhaps it would be easier to work with them if they were sorted.

@tannergooding (Member Author):

That can be dealt with by a separate PR.

/// void _mm_prefetch(char* p, int i)
/// PREFETCHT1 m8
/// </summary>
public static unsafe void Prefetch1(void* address) => Prefetch1(address);
@fiigii (Jan 30, 2018):

Shall we update certain Load*/Store* intrinsics to void* as well?
For example,

// Avx
        public static unsafe Vector256<sbyte> LoadVector256(sbyte* address) => LoadVector256(address);
        public static unsafe Vector256<byte> LoadVector256(byte* address) => LoadVector256(address);
        public static unsafe Vector256<short> LoadVector256(short* address) => LoadVector256(address);
        public static unsafe Vector256<ushort> LoadVector256(ushort* address) => LoadVector256(address);
        public static unsafe Vector256<int> LoadVector256(int* address) => LoadVector256(address);
        public static unsafe Vector256<uint> LoadVector256(uint* address) => LoadVector256(address);
        public static unsafe Vector256<long> LoadVector256(long* address) => LoadVector256(address);
        public static unsafe Vector256<ulong> LoadVector256(ulong* address) => LoadVector256(address);
        public static unsafe Vector256<float> LoadVector256(float* address) => LoadVector256(address);
        public static unsafe Vector256<double> LoadVector256(double* address) => LoadVector256(address);

=>

public static unsafe Vector256<T> LoadVector256<T>(void* address) => LoadVector256<T>(address);

cc @jkotas @CarolEidt


In this case we load into a vector of a predefined type, and enforcing the pointer to be of the same type seems correct.


Agree, thanks!

/// </summary>
public static Vector128<float> BroadcastElementToVector128(ref float source) { throw new PlatformNotSupportedException(); }
public static unsafe Vector128<float> BroadcastElementToVector128(float* source) { throw new PlatformNotSupportedException(); }
@tannergooding (Member Author):

@fiigii, should these be called BroadcastScalarToVector128?


(For consistency with the other operations that take a scalar.)


IMO in this case we have prior art in the form of the established Intel docs, which use the term element, and IMHO programmers have probably gotten used to reading those instructions in this way - at least that is my experience.


That would be more consistent, I think

@tannergooding (Member Author):

Will update.


Agree.


And the Avx2 counterparts should be updated as well.

@tannergooding (Member Author):

Put up the CoreFX side PR: dotnet/corefx#26668

It is much easier to see just the API changes via it (whereas this one also contains the doc comment updates).

@tannergooding (Member Author):

@4creators, looking at the names, it actually looks to be fairly consistent already.

For example, LoadAlignedVector128NonTemporal:

  • Load is the action
  • Aligned Vector128 is the type
  • Non Temporal is the modifier

For LoadScalarVector128

  • Load is the action
  • Scalar Vector128 is the type

@4creators:

looking at the names, it actually looks to be fairly consistent already

The point is that, from my perspective and experience, Vector128 mostly represents an xmm register, and the action is load aligned or load unaligned (which we don't use). It is a personal perspective on both sides, so perhaps we should ask the API designers?

@tannergooding (Member Author):

perspective and experience Vector128 represents mostly xmm register and action is load aligned or load unaligned

I think it is also consistent from that perspective, except for a couple (like NonTemporal, where it would then be LoadAlignedNonTemporalVector128).

@tannergooding (Member Author):

@CarolEidt, @fiigii: do you have a preference on whether it should be LoadAlignedNonTemporalVector128 or LoadAlignedVector128NonTemporal (we currently use the latter)?

@fiigii (Jan 30, 2018):

I prefer LoadAlignedVector128NonTemporal.

/// </summary>
public static float ConvertToSingle(Vector128<float> value) => ConvertToSingle(value);

/// <summary>
/// __m128 _mm_cvtsi32_ss (__m128 a, int b)
/// CVTSI2SS xmm, reg/m32
/// </summary>
public static Vector128<float> ConvertToVector128SingleScalar(Vector128<float> upper, int value) => ConvertToVector128SingleScalar(upper, value);
@tannergooding (Member Author):

ConvertScalarToVector128Single might be a better name. Thoughts?


Agree.

@tannergooding (Member Author):

OK. I think all feedback and questions have once again been answered.

Both this and the CoreFX PR (dotnet/corefx#26668) have been updated to be in sync.

I think all missing APIs that we want to expose have been exposed. Any that aren't exposed have been discussed and agreed upon.

I think all wanted renames have now happened (except potentially #15923 (comment), on which I would like to hear from @CarolEidt and @eerhardt).

@CarolEidt:

I think it is also consistent from that perspective, except for a couple (like NonTemporal, where it should be LoadAlignedNonTemporalVector128, from that perspective)

I'm not sure I understand what perspective you're talking about. I thought that we were moving toward "Aligned Vector128" as the type and "NonTemporal" as the modifier; with action + type + modifier it would be LoadAlignedVector128NonTemporal, which I think also reads better than LoadAlignedNonTemporalVector128.

@tannergooding (Member Author):

@CarolEidt - @4creators was reading it differently, in that LoadAligned was the action and Vector128 was the type, whereas I was viewing Load as the action and AlignedVector128 as the type (his perspective vs. mine).

For the most part, our two perspectives actually line up, regardless of how you view the action vs. type. The place they don't is NonTemporal, where you could view LoadAlignedNonTemporal as the action and Vector128 as the type (from @4creators' perspective), or you could view it as Load (action), AlignedVector128 (type), and NonTemporal (modifier).

This is also, actually, the only instruction I saw that has a "modifier", so it might make sense to change it to be LoadAlignedNonTemporalVector128 (we originally called it just LoadAlignedNonTemporal, before we added the type in to disambiguate).

@CarolEidt:

@tannergooding - I see now, thanks. I don't have a strong opinion either way, but if I were to be pedantic about it, I would say that alignment is very much a property of the object being loaded, while NonTemporal is the way in which the same object may or may not be loaded. So it seems to me that the Aligned and Vector128 should bind more closely than the NonTemporal. If anything the NonTemporal would be part of the action (if not a modifier) and therefore: LoadNonTemporalAlignedVector128. That is unfortunately a bit odd because if I'm looking for aligned loads I might specifically be looking for LoadAligned, which leads me again to the "modifier" approach.

But, as I say, I don't have a strong opinion.

@tannergooding (Member Author):

That is unfortunately a bit odd because if I'm looking for aligned loads I might specifically be looking for LoadAligned, which leads me again to the "modifier" approach.

I agree. I'll stick with LoadAlignedVector128NonTemporal, which is what @fiigii voted for as well.

That should be all the issues and this should be good for final review (assuming no other issues crop up 😄).

@eerhardt (Member) left a comment:

The API changes look good to me.

Thanks @tannergooding for pushing this forward.

@benaadams (Member):

@tannergooding, why is PAUSE "Not Exposed (explicit)"?

Thread.SpinWait(1) is a bit of a heavyweight way to get there

@tannergooding (Member Author):

@benaadams, I believe most of the conversation was captured here: https://github.com/dotnet/coreclr/issues/17818#issuecomment-392617183. Also see the comment chain here: #15923 (comment)

Successfully merging this pull request may close these issues:

Add Sse2.MoveMask that takes Vector128<byte>