# Vectorize `IEnumerable<T>.Sum` where possible #84519

## Conversation
Tagging subscribers to this area: @dotnet/area-system-linq
@brantburnett according to your benchmarks table, the vectorized version is slower for all cases; this might be a mistake, I suppose?

Yes, silly typo when I was replacing the `long`
It would be interesting if you could share any benchmark results showcasing the improvements of the change.
You can find details on how to benchmark private builds in this doc.
There are benchmark results for both x64 and ARM in the main PR description; are there other benchmarks you'd like to see?

D'oh! Completely skipped past those :-)
```csharp
Vector<T> accumulator = Vector<T>.Zero;

// Build a test vector with only the sign bit set in each element. JIT will fold this into a constant.
Vector<T> overflowTestVector = new(T.RotateRight(T.MultiplicativeIdentity, 1));
```
Given that we know `T` works with `Vector<T>`, we know it must be a primitive value which is two's complement. So we can add `where T : IMinMaxValue<T>` and simply use `T.MinValue` instead to get a mask of the sign bit.
Done
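(For context, the check this simplifies is the standard two's-complement sign test. A rough sketch of one loop iteration and the final test, not the PR's exact code:)

```csharp
// Overflow in a + b occurred iff a and b share a sign that differs from the
// sign of the result; the AND below leaves the sign bit set in any lane
// where that happened.
Vector<T> sum = accumulator + data;
Vector<T> overflowTracking = (sum ^ accumulator) & (sum ^ data);

// With `where T : IMinMaxValue<T>`, T.MinValue is a ready-made sign-bit mask.
Vector<T> overflowTestVector = new(T.MinValue);
if ((overflowTracking & overflowTestVector) != Vector<T>.Zero)
{
    throw new OverflowException();
}
```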
```csharp
ptr = ref Unsafe.Add(ref ptr, Vector<T>.Count * 4);
length -= Vector<T>.Count * 4;
```
Similar feedback has been given on other PRs, but we'd prefer to avoid mutating the byref where possible in favor of standard indexing, using `LoadUnsafe(ref baseAddress, index)` instead. You can then increment `index` by `Vector<T>.Count` in the loop.

This can result in slightly worse codegen in some cases, but it makes the code simpler and less prone to GC holes and similar bugs.
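(A minimal sketch of the suggested shape; the method name is illustrative and overflow checks are omitted for brevity:)

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static T SumSketch<T>(ReadOnlySpan<T> span) where T : struct, INumber<T>
{
    ref T baseAddress = ref MemoryMarshal.GetReference(span);
    T result = T.Zero;
    nuint index = 0;

    if (Vector.IsHardwareAccelerated && span.Length >= Vector<T>.Count)
    {
        Vector<T> accumulator = Vector<T>.Zero;
        nuint lastBlockStart = (nuint)(span.Length - Vector<T>.Count);

        // Index-based loads; the byref itself is never mutated.
        while (index <= lastBlockStart)
        {
            accumulator += Vector.LoadUnsafe(ref baseAddress, index);
            index += (nuint)Vector<T>.Count;
        }

        result = Vector.Sum(accumulator);
    }

    // Scalar tail for the remaining elements.
    for (; index < (nuint)span.Length; index++)
    {
        result += Unsafe.Add(ref baseAddress, (nint)index);
    }

    return result;
}
```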
Done, and the JIT output still looks decent. It was able to encode the offset arithmetic into the operation on x64:

```asm
lea ebx, [eax+edi+0x40]
```

Unfortunately, it's an extra `lea` operation. While this did impact the benchmarks, they are still an improvement over the unvectorized version.
```csharp
// Add any remaining elements
for (int i = 0; i < length; i++)
{
    checked { result += Unsafe.Add(ref ptr, i); }
}
```
This could be handled by doing a backtrack and masking off already processed data.
Do you have any suggestions on how to build a mask like that in a `Vector<T>`? So far the only way I've found is looping and calling `WithElement`, but looking at the resulting JIT of that loop I feel like just adding the elements from the span will be way better.
```asm
L0126: vxorps ymm3, ymm3, ymm3
L012a: vmovupd [ebp-0x2c], ymm3
L012f: cmp eax, 8
L0132: jae L020a
L0138: vmovupd [ebp-0x2c], ymm2
L013d: lea ecx, [ebp-0x2c]
L0140: xor esi, esi
L0142: mov [ecx+eax*4], esi
L0145: vmovupd ymm2, [ebp-0x2c]
L014a: inc eax
L014b: cmp edx, eax
L014d: jg short L0126
```
I considered a static ROS of zero elements followed by all-bits-set elements that could be indexed to find the mask. However, this is problematic because we're dealing with generic integers.
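(For a concrete element type, the table idea might look like the hypothetical sketch below, using `int` and assuming at most 8 lanes; the generic `T` is what makes a single static table awkward:)

```csharp
using System.Numerics;

static class BacktrackMask
{
    // Zero entries followed by all-bits-set entries, sized for up to 8 int
    // lanes. Slicing at (Count - alreadyProcessed) yields a vector whose
    // first `alreadyProcessed` lanes are zero and whose rest are all ones.
    private static readonly int[] s_maskSource =
    {
        0, 0, 0, 0, 0, 0, 0, 0,
        -1, -1, -1, -1, -1, -1, -1, -1,
    };

    public static Vector<int> Get(int alreadyProcessed) =>
        new Vector<int>(s_maskSource, Vector<int>.Count - alreadyProcessed);
}
```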
Okay, I think I have a partial solution. Using left byte-shift operations on vectors will work to fill with zero elements, since we don't care if we move the elements being summed into different lanes. The problem is that this only works with `Vector128<T>` on Intel. So to use this solution we'd need to either:

A) Only implement as `Vector128<T>`, losing `long` support and running slower on Intel.

B) Implement split `Vector128<T>` and `Vector256<T>` implementations, with lots of code duplication. We'd then only gain the backtrack advantages on short `int` arrays that fall back to `Vector128<T>`, on old Intel processors, and on ARM.

I don't think A is a good option. And I'm not sure B is worth it? Let me know what you think.
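(A hypothetical sketch of the byte-shift idea for `int` lanes on x64; whether the overlapping lanes sit at the low or high end of the reloaded block determines the shift direction, shown here discarding the low lanes with `Sse2.ShiftRightLogical128BitLane`:)

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch: after backtracking so the load covers the last full
// Vector128 of the span, the overlapping (already-summed) lanes are zeroed
// by shifting the whole 128-bit register by that many bytes. Lanes change
// position, but that's irrelevant when they're about to be summed.
static Vector128<int> MaskOverlap(Vector128<int> lastBlock, byte overlapElements)
{
    // pslldq/psrldq take an immediate, so ideally the count is a JIT-time
    // constant; each int lane is sizeof(int) == 4 bytes.
    return Sse2.ShiftRightLogical128BitLane(
        lastBlock.AsByte(),
        (byte)(overlapElements * sizeof(int))).AsInt32();
}
```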
Breaking change doc: dotnet/docs#37734
When performing `Sum()` on an `IEnumerable<T>` of type `int` or `long`, and when the `IEnumerable<T>` is representable as a `ReadOnlySpan<T>` (such as arrays and `List<T>`), it is possible to vectorize the implementation for improved performance. The only added cost for the slow-path fallback is a length check to be sure the `ReadOnlySpan<T>` is at least 4 `Vector<T>` long.

Note that basic vectorized addition doesn't perform overflow checks, so the checks must be implemented in code. Despite this extra cost, vectorization appears to be a net gain. The exception is 128-bit vectors holding 64-bit `long` integers, so they are excluded from this optimization.

This should also have knock-on improvements for `Average()` in the `long` case on Intel. The `int` case of `Average()` is already vectorized using a specialized approach.

Benchmarks below are on an Intel Core i7 for both `int` and `long`, and on an ARM AWS Graviton2 for `int`. `long` was not tested on ARM because ARM uses 128-bit vectors.

```
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1485)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host]     : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
  Job-HVWQID : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-GLTODN : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
```

```
BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
  [Host]     : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
  Job-MEXHPT : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-VGHVOM : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
```
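(For illustration, the "representable as a `ReadOnlySpan<T>`" test described above might be shaped like the following sketch; the name `TryGetSpanSketch` is hypothetical, as Linq's actual helper is internal:)

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical sketch: arrays and List<T> expose contiguous storage, so the
// sum can run over a span; anything else falls back to the enumerator path.
static bool TryGetSpanSketch<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
{
    if (source is T[] array)
    {
        span = array;
        return true;
    }

    if (source is List<T> list)
    {
        // CollectionsMarshal.AsSpan exposes the list's backing array.
        span = CollectionsMarshal.AsSpan(list);
        return true;
    }

    span = default;
    return false;
}
```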