Fixing some inconsistencies in the x86 HWIntrinsic APIs #15923
Conversation
I discussed (2) with @fiigii at some length. My reasoning for the change is basically consistency, clarity, and simplicity:
We basically have two types of scalar intrinsics: those that zero the upper bits, and those that set them to the value of one of the source registers. The behavior of these intrinsics matches the behavior of the underlying hardware instructions (some zero the upper bits and some set them to the value of one of the sources).
When matching the signature of the C++ intrinsics, all scalar intrinsics except for sqrtss, rsqrtss, and rcpss explicitly take two parameters (a and b), set the upper bits of the result to the upper bits of a, and perform the scalar operation on the lower bits of b. Not having the intrinsics match for the three mentioned meant the user had to remember that the upper bits aren't set to zero for these three in particular.
For the scalar intrinsics which do set the upper bits, having an explicit upper value makes the result of the operation clear.
Not having an explicit parameter also means that the handling for those intrinsics needs its own code path for codegen (they are 1-argument intrinsics). Additionally, it means the user has to manually write additional code to handle the scenario where they want the upper bits to differ from the existing upper bits of the source.
FYI. @CarolEidt, @fiigii, @eerhardt
@@ -1101,7 +1105,7 @@ public static class Sse2
/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// </summary>
- public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);
+ public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);
From the performance perspective, we always want to duplicate the source register for some of these instructions with VEX-encoding. For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1. So SqrtScalar(upper, value) may cause performance issues that are hard to detect.
From the user experience perspective, SqrtScalar(value) is usually more straightforward and useful than SqrtScalar(upper, value).
In summary, I think we can have both SqrtScalar(value) and SqrtScalar(upper, value) if we really need the 2-arg version.
we always want to duplicate the source register for some of these instructions with VEX-encoding.
I think this should say: the recommendation is to duplicate the source register for some of these instructions. I think it's one of those things where an analyzer comes in handy (just like the user could do SetAll(0.0f) instead of SetZero(), even though the latter is more performant).
For example, vsqrtss xmm0, xmm1, xmm1 is better than vsqrtss xmm0, xmm0, xmm1 and vsqrtss xmm0, xmm2, xmm1
However, vsqrtss target, upper, value (when upperReg != valueReg) will produce smaller codegen than movss value, upper, value; vsqrtss target, value, value, which is what users would need to code otherwise (via value = MoveScalar(upper, value); target = SqrtScalar(value)). It is also (theoretically) faster since there are fewer instructions filling the pipeline (the equivalent non-VEX instructions behave similarly).
In summary, I think we can have SqrtScalar(value) and SqrtScalar(upper, value) both if we really need the 2-arg version.
I was in agreement that this could/should be left open as an option. It is not possible to do for the binary scalar SIMD instructions (like AddScalar), but it does allow easy exposure of the "recommended" use for the unary scalar SIMD instructions (with the caveat that it "may" reduce the consistency/clarity of the instructions' behavior).
I agree with @fiigii that a good solution would be to have two scalar overloads: (i) a two-arg version and (ii) a one-arg version.
It is not possible to do for the binary scalar simd instructions (like AddScalar),
Absolutely, I meant just for certain 1-arg SIMD scalar intrinsics, such as SqrtScalar, RoundCurrentDirectionScalar, ReciprocalScalar, etc.
I'm fine with the two overloads, though I begin to be a bit concerned about even further exploding the API surface.
though I begin to be a bit concerned about even further exploding the API surface.
That is one of my concerns as well.
These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute, or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...
These could possibly be implemented in terms of SqrtScalar(value) => SqrtScalar(value, value), with an aggressive inlining attribute or as a general helper library (CoreFXExtensions) for commonly used algorithms/helpers...
That will make the JIT/VM more complex because all the methods in this namespace are automatically labeled with [Intrinsic]. If we want both versions, implementing them in the JIT via the table-driven framework is really easy.
a bit concerned about even further exploding the API surface
We are much more concise than the C++ implementation. The proposed overloads are the closest match to the original C++ intrinsics and there will be only a few of them. This should not have any significant impact on the number of methods, which would be around 1,300 for the currently accepted Intel ISAs. I have not counted Arm, but it will be significantly more concise, as there are more methods which can be expressed as generics according to our API design rules.
Ah, I forgot about the implicit [Intrinsic] handling for all methods here.
@@ -344,7 +344,7 @@ public static class Sse
/// <summary>
/// __m128 _mm_rcp_ss (__m128 a)
/// </summary>
- public static Vector128<float> ReciprocalScalar(Vector128<float> value) => ReciprocalScalar(value);
+ public static Vector128<float> ReciprocalScalar(Vector128<float> upper, Vector128<float> value) => ReciprocalScalar(upper, value);
This leads to confusion where the C++ intrinsic doesn't include the additional argument, and makes me wonder whether we should include both the C++ intrinsic as well as the actual opcode, and use the parameter names. E.g. for this case we would have:
/// <summary>
/// __m128 _mm_rcp_ss (__m128 value)
/// rcpss upper, value
/// </summary>
Thoughts?
There are only 3 intrinsics which differ (rcp_ss, rsqrt_ss, and sqrt_ss); we could probably just add an explicit comment.
/// <summary>
/// __m128 _mm_rcp_ss (__m128 value)
/// We deviate from the native signature and allow the user to specify upper, rather than having it implicitly come from value. This is for consistency with the other scalar intrinsics.
/// </summary>
@@ -1101,7 +1105,7 @@ public static class Sse2
/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a)
/// </summary>
- public static Vector128<double> SqrtScalar(Vector128<double> value) => SqrtScalar(value);
+ public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);
I'm fine with the two overloads, though I begin to be a bit concerned about even further exploding the API surface.
Addressed PR feedback. For the SSE2 and SSE4.1 intrinsics, I added an overload which only takes value. For the SSE intrinsics, I added an overload which takes upper and value. Both sets also carry an explanatory comment. This adds a total of 14 new APIs.
Actually there are a couple of other missing APIs in Sse2 - see https://github.com/dotnet/corefx/issues/25926#issuecomment-359447469. These are:
// __m128i _mm_move_epi64 (__m128i a)
public static Vector128<long> MoveScalar(Vector128<long> upper, Vector128<long> value) => MoveScalar(upper, value);
public static Vector128<ulong> MoveScalar(Vector128<ulong> upper, Vector128<ulong> value) => MoveScalar(upper, value);
// __m64 _mm_movepi64_pi64 (__m128i a)
public static long* StoreScalar(Vector128<long> value);
public static ulong* StoreScalar(Vector128<ulong> value);
// void _mm_clflush (void const* p)
public static void FlushCache(void* p);
// void _mm_mfence (void)
public static void MemoryFence();
You are right about memory instructions and _mm_movepi64_pi64.
There is one more inconsistency:
// SSE4.1
public static unsafe Vector128<sbyte> LoadAlignedNonTemporal(sbyte* address) {}
// AVX2
public static unsafe Vector256<sbyte> LoadAlignedVector256NonTemporal(sbyte* address) {}
@tannergooding could you help me to change these for consistency? Sorry for the inconvenience.
Updated. Added documentation on what instruction an intrinsic maps to. The below list shows the intrinsics that are not exposed.
SSE4.1
@fiigii, thoughts?
/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
- /// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
+ /// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128
Putting the instruction on a new line maybe looks better?
/// <summary>
/// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
/// VFMADDPD xmm, xmm, xmm/m128
/// </summary>
Possibly. It doesn't look as good for the intrinsics with multiple mappings, however.
The current format will work fairly well for a search and replace to improve the documentation later, though.
@eerhardt could you suggest the comment format?
FWIW, my preference would also be to have the instruction(s) on a separate line.
I agree that separate lines are more readable.
Fixed
/// </summary>
- public static Vector256<double> ConvertToVector256Double(Vector256<float> value) => ConvertToVector256Double(value);
+ public static Vector256<double> ConvertToVector256Double(Vector128<float> value) => ConvertToVector256Double(value);
The parameter was incorrect, changed from Vector256 to Vector128.
Thank you!!
/// </summary>
- public static Vector256<double> ConvertToVector256Double(Vector256<int> value) => ConvertToVector256Double(value);
+ public static Vector256<double> ConvertToVector256Double(Vector128<int> value) => ConvertToVector256Double(value);
The parameter was incorrect, changed from Vector256 to Vector128.
/// </summary>
public static Vector256<byte> Max(Vector256<byte> left, Vector256<short> right) => Max(left, right);
The second parameter's base type was incorrect. Changed from short to byte.
Do you mean the unexposed instructions? I think we talked about most of them previously.
/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);
Fixed name to be LoadScalarVector128
Not necessary, we do not have LoadScalarVector256, and Scalar already implies Vector128.
I changed it here for consistency with the other APIs naming convention. I figured it was better to be consistent.
Fair enough.
I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since, when encountered in code, it may not always be clear that it is returning a vector.
Fixed on the remaining.
/// </summary>
public static unsafe Vector128<float> LoadHigh(Vector128<float> value, float* address) => LoadHigh(value, address);
Name of the first parameter changed to lower.
/// </summary>
public static unsafe Vector128<float> LoadLow(Vector128<float> value, float* address) => LoadLow(value, address);
Name of the first parameter changed to upper.
/// </summary>
public static Vector128<float> SetScalar(float value) => SetScalar(value);
Name changed to SetScalarVector128.
Not necessary.
/// </summary>
public static unsafe Vector128<double> LoadHigh(Vector128<double> value, double* address) => LoadHigh(value, address);
Name of the first parameter changed to lower.
/// </summary>
public static unsafe Vector128<double> LoadLow(Vector128<double> value, double* address) => LoadLow(value, address);
Name of the first parameter changed to upper.
/// </summary>
public static Vector128<double> SetScalar(double value) => SetScalar(value);
Name changed to SetScalarVector128.
IMO, from the user perspective it is better to have names with Vector128/Vector256 even when Vector128 is implied.
Thank you so much for the work!
/// </summary
public static Vector128<ushort> UnpackHigh(Vector128<ushort> left, Vector128<ushort> right) => UnpackHigh(left, right);
/// <summary>
- /// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b)
+ /// __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b); PUNPCKHDQ xmm, xmm/m128
Could you fix the above /// </summary?
/// </summary>
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }
public static Vector128<byte> Or(Vector128<byte> left, Vector128<byte> right) { throw new PlatformNotSupportedException(); }
And here.
/// </summary>
public static Vector256<long> Insert(Vector256<long> value, Vector128<long> data, byte index) { throw new PlatformNotSupportedException(); }
/// <summary>
- /// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)
+ /// __m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8); VINSERTI128 ymm, ymm, m128, imm8
/// </summary>
public static unsafe Vector256<long> Insert(Vector256<long> value, long* address,
Thanks!
One question and one suggestion
/// </summary>
public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c) { throw new PlatformNotSupportedException(); }
/// <summary>
- /// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)
+ /// __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c); VFMADDPD xmm, xmm, xmm/m128
FWIW, my preference would also be to have the instruction(s) on a separate line.
/// </summary>
public static unsafe Vector128<float> LoadScalar(float* address) => LoadScalar(address);
I also prefer the addition of Vector128 to the name; I think it makes it clearer, especially since, when encountered in code, it may not always be clear that it is returning a vector.
/// __m128i _mm_mul_epu32 (__m128i a, __m128i b)
/// __m128i _mm_move_epi64 (__m128i a); MOVQ xmm, xmm
/// </summary>
public static Vector128<long> MoveScalar(Vector128<long> value) => MoveScalar(value);
What is the reasoning for this being MoveScalar, but changing SetScalar to SetScalarVector128, when I believe in both cases the entire 128-bit vector is defined?
The former (MoveScalar) takes and returns Vector128, so it shouldn't cause overload resolution problems. The latter (SetScalar) takes a primitive and returns a Vector128, so it could potentially conflict.
I can change it if you think it's clearer to provide it here as well.
I think it would be more consistent, but I don't feel strongly about it.
/// </summary>
public static Vector128<double> BlendVariable(Vector128<double> left, Vector128<double> right, Vector128<double> mask) { throw new PlatformNotSupportedException(); }
/// <summary>
- /// __m128 _mm_ceil_ps (__m128 a)
+ /// __m128 _mm_ceil_ps (__m128 a); ROUNDPS xmm, xmm/m128, imm8(10)
Would it be clearer to write the implicit immediate as imm8=0b10?
Probably. 👍
@tannergooding @CarolEidt @mikedn Shall we change the APIs of
/// </summary>
public static Vector128<short> PackSignedSaturate(Vector128<int> left, Vector128<int> right) => PackSignedSaturate(left, right);
/// <summary>
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b)
/// __m128i _mm_packus_epi16 (__m128i a, __m128i b);
/// PACKUSWB xmm, xmm/m128
/// </summary>
public static Vector128<byte> PackUnsignedSaturate(Vector128<short> left, Vector128<short> right) => PackUnsignedSaturate(left, right);
It seems that the void _mm_pause (void) or PAUSE intrinsic is missing.
Explicitly avoiding this one as it could have undesired impacts on threading/GC, etc.
Needs a separate, in depth discussion, to determine if it is okay to expose.
pause is already exposed as Thread.SpinWait(). It is not inlined by the JIT today, but that can be fixed...
It is incredibly hard to use this instruction directly because it has very different performance characteristics on different processor models (https://github.com/dotnet/coreclr/issues/13388).
@tannergooding @fiigii - void _mm_pause (void) presence is not controlled by any CPUID flag and it has been available since the Pentium 4, so there is no reason to exclude it. Anyway, I remember that in the corefx repo there were several issues with correctly tuning spin locks, and this instruction could be a perfect solution to those problems.
@jkotas, if it is already "exposed" and shouldn't cause issues with threading or the GC, would it be good to expose here anyways, for completeness?
Edit: Page didn't refresh before I commented, sounds like exposing won't buy us much. We can always come back to this later if enough people request it.
I do not think so.
I think in such a case the problem is already "solved" by the instruction being hard to use.
Seems that GitHub is working really slowly today and I got out of the context very fast.
Thread.SpinWait() gives you some consistency to get your spin loop tuned properly.
has very different performance characteristics on different processor models
Right, we already did lots of work to tune the spin.
/// </summary>
- public static Vector128<double> SetScalar(double value) => SetScalar(value);
+ public static Vector128<double> SetScalarVector128(double value) => SetScalarVector128(value);
naming: I will stop commenting on the ordering in the naming convention.
/// </summary>
public static Vector128<short> ShuffleLow(Vector128<short> value, byte control) => ShuffleLow(value, control);
/// <summary>
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control)
/// __m128i _mm_shufflelo_epi16 (__m128i a, int control);
/// PSHUFLW xmm, xmm/m128, imm8
/// </summary>
public static Vector128<ushort> ShuffleLow(Vector128<ushort> value, byte control) => ShuffleLow(value, control);
Functions are not in alphabetical order - perhaps it would be easier to work with them if they were sorted.
That can be dealt with by a separate PR.
/// void _mm_prefetch(char* p, int i)
/// PREFETCHT1 m8
/// </summary>
public static unsafe void Prefetch1(void* address) => Prefetch1(address);
Shall we update certain Load*/Store* intrinsics to void* as well?
For example,
//Avx
public static unsafe Vector256<sbyte> LoadVector256(sbyte* address) => LoadVector256(address);
public static unsafe Vector256<byte> LoadVector256(byte* address) => LoadVector256(address);
public static unsafe Vector256<short> LoadVector256(short* address) => LoadVector256(address);
public static unsafe Vector256<ushort> LoadVector256(ushort* address) => LoadVector256(address);
public static unsafe Vector256<int> LoadVector256(int* address) => LoadVector256(address);
public static unsafe Vector256<uint> LoadVector256(uint* address) => LoadVector256(address);
public static unsafe Vector256<long> LoadVector256(long* address) => LoadVector256(address);
public static unsafe Vector256<ulong> LoadVector256(ulong* address) => LoadVector256(address);
public static unsafe Vector256<float> LoadVector256(float* address) => LoadVector256(address);
public static unsafe Vector256<double> LoadVector256(double* address) => LoadVector256(address);
=>
public static unsafe Vector256<T> LoadVector256<T>(void* address) => LoadVector256<T>(address);
In this case we load into a vector of a predefined type, and enforcing the pointer to be of the same type seems correct.
Agree, thanks!
/// </summary>
public static Vector128<float> BroadcastElementToVector128(ref float source) { throw new PlatformNotSupportedException(); }
public static unsafe Vector128<float> BroadcastElementToVector128(float* source) { throw new PlatformNotSupportedException(); }
@fiigii, should these be called BroadcastScalarToVector128?
(For consistency with the other operations that take a scalar.)
IMO in this case we have prior art in the form of established Intel docs, which use the term element, and programmers probably got used to reading these instructions that way; at least this is my experience.
That would be more consistent, I think
Will update.
Agree.
And Avx2 counterparts should be updated as well.
Put up the CoreFX side PR: dotnet/corefx#26668. It is much easier to see just the API changes via it (whereas this one also contains the doc comment updates).
@4creators, looking at the names, it actually looks to be fairly consistent already.
The point is that from my perspective and experience Vector128 represents mostly
I think it is also consistent from that perspective, except for a couple (like
@CarolEidt, @fiigii. Do you have a preference on whether it should be
I prefer
/// </summary>
public static float ConvertToSingle(Vector128<float> value) => ConvertToSingle(value);
/// <summary>
/// __m128 _mm_cvtsi32_ss (__m128 a, int b)
/// CVTSI2SS xmm, reg/m32
/// </summary>
public static Vector128<float> ConvertToVector128SingleScalar(Vector128<float> upper, int value) => ConvertToVector128SingleScalar(upper, value);
ConvertScalarToVector128Single might be a better name. Thoughts?
Agree.
Ok. I think all feedback and questions have once again been answered. Both this and the CoreFX PR (dotnet/corefx#26668) have been updated to be in sync. I think all missing APIs that we want to expose have been exposed. Any that aren't exposed have been discussed and agreed upon. I think all wanted renames have now happened (except for potentially #15923 (comment), which I would like to hear from @CarolEidt and @eerhardt on).
I'm not sure I understand what perspective you're talking about. I thought that we were moving toward "Aligned Vector128" as the type and "NonTemporal" as the modifier, and with action + type + modifier it would be
@CarolEidt, @4creators was reading it differently. For the most part, our two perspectives actually line up, regardless of how you view the action vs the type. This is also, actually, the only instruction I saw that has a "modifier", so it might make sense to change it.
@tannergooding - I see now, thanks. I don't have a strong opinion either way, but if I were to be pedantic about it, I would say that alignment is very much a property of the object being loaded. But, as I say, I don't have a strong opinion.
I agree. I'll stick with that. That should be all the issues and this should be good for final review (assuming no other issues crop up 😄).
The API changes look good to me.
Thanks @tannergooding for pushing this forward.
@tannergooding why is
@benaadams, I believe most of the conversation was captured here: https://github.com/dotnet/coreclr/issues/17818#issuecomment-392617183. Also see the comment chain here: #15923 (comment) |
This resolves https://github.com/dotnet/corefx/issues/26433 by adding the missing Sse2.MoveMask API.
This resolves a few issues in the Sse2 and Sse41 intrinsics where the signatures differed from the C++ signatures. This also makes the corresponding Sse intrinsics deviate from their C++ signatures to be in line with the rest of the scalar intrinsics.
This fixes naming of arguments in LoadHigh and LoadLow to be consistent with the naming of other parameters where the operand is used for the upper or lower bits of the returned result.