Vectorize IEnumerable<T>.Sum where possible #84519

brantburnett · 2023-04-08T18:15:52Z

When performing Sum() on IEnumerable<T> of type int or long and when IEnumerable<T> is representable as a ReadOnlySpan<T> (such as arrays and List<T>) it is possible to vectorize the implementation for improved performance. The only added cost for the slow-path fallback is a length check to be sure the ReadOnlySpan<T> is at least 4 Vector<T> long.

Note that basic vectorized addition doesn't perform overflow checks, so the checks must be implemented in code. Despite this extra cost vectorization appears to be a net gain. The exception is 128-bit vectors holding 64-bit long integers, so they are excluded from this optimization.

This should also have knock-on improvements for Average() in the long case on Intel. The int case of Average() is already vectorized using a specialized approach.

Benchmarks below are on an Intel Core i7 for both int and long and on an ARM AWS Graviton2 for int. long was not tested on ARM because ARM uses 128-bit vectors.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1485)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.2.23157.25
[Host] : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
Job-HVWQID : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-GLTODN : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type	Job	Length	Mean	Error	StdDev	Median	Min	Max	Ratio
Int	Current	32	14.34 ns	0.172 ns	0.152 ns	14.32 ns	14.09 ns	14.67 ns	1.00
Int	Vectorized	32	10.34 ns	0.152 ns	0.142 ns	10.35 ns	10.16 ns	10.56 ns	0.72

Int	Current	128	48.33 ns	0.132 ns	0.110 ns	48.34 ns	47.99 ns	48.42 ns	1.00
Int	Vectorized	128	19.28 ns	0.686 ns	0.790 ns	19.32 ns	18.19 ns	20.75 ns	0.41

Int	Current	1024	341.92 ns	6.583 ns	6.465 ns	345.28 ns	329.47 ns	348.16 ns	1.00
Int	Vectorized	1024	83.02 ns	0.090 ns	0.080 ns	83.01 ns	82.91 ns	83.18 ns	0.24

Long	Current	32	13.93 ns	0.093 ns	0.082 ns	13.93 ns	13.68 ns	14.01 ns	1.00
Long	Vectorized	32	11.22 ns	0.078 ns	0.069 ns	11.24 ns	10.99 ns	11.27 ns	0.81

Long	Current	128	47.72 ns	0.142 ns	0.133 ns	47.77 ns	47.30 ns	47.82 ns	1.00
Long	Vectorized	128	24.58 ns	0.165 ns	0.146 ns	24.62 ns	24.08 ns	24.67 ns	0.52

Long	Current	1024	333.73 ns	0.799 ns	0.748 ns	333.88 ns	331.41 ns	334.44 ns	1.00
Long	Vectorized	1024	150.21 ns	0.237 ns	0.222 ns	150.13 ns	149.89 ns	150.70 ns	0.45

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-MEXHPT : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-VGHVOM : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type	Job	Length	Mean	Error	StdDev	Median	Min	Max	Ratio
Int	Current	32	28.12 ns	0.043 ns	0.041 ns	28.12 ns	28.08 ns	28.19 ns	1.00
Int	Vectorized	32	16.93 ns	0.031 ns	0.026 ns	16.93 ns	16.89 ns	16.98 ns	0.60

Int	Current	128	103.80 ns	0.023 ns	0.021 ns	103.80 ns	103.75 ns	103.83 ns	1.00
Int	Vectorized	128	56.63 ns	0.072 ns	0.064 ns	56.62 ns	56.53 ns	56.77 ns	0.55

Int	Current	1024	832.41 ns	0.108 ns	0.090 ns	832.37 ns	832.28 ns	832.58 ns	1.00
Int	Vectorized	1024	432.59 ns	0.087 ns	0.081 ns	432.58 ns	432.46 ns	432.74 ns	0.52

ghost · 2023-04-08T18:16:04Z

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

When performing Sum() on IEnumerable<T> of type int or longand whenIEnumerableis representable as aReadOnlySpan(such as arrays andList) it is possible to vectorize the implementation for improved performance. The only added cost for the slow-path fallback is a length check to be sure the ReadOnlySpanis at least 4Vector` long.

Note that basic vectorized addition doesn't perform overflow checks, so the checks must be implemented in code. Despite this extra cost vectorization appears to be a net gain. The exception is 128-bit vectors holding 64-bit long integers, so they are excluded from this optimization.

This should also have knock-on improvements for Average() in the long case on Intel. The int case of Average() is already vectorized using a specialized approach.

Benchmarks below are on an Intel Core i7 for both int and long and on an ARM AWS Graviton2 for int. long was not tested on ARM because ARM uses 128-bit vectors.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1485)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.2.23157.25
[Host] : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
Job-HVWQID : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-GLTODN : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type	Job	Length	Mean	Error	StdDev	Median	Min	Max	Ratio
Int	Vectorized	32	14.34 ns	0.172 ns	0.152 ns	14.32 ns	14.09 ns	14.67 ns	1.00
Int	Current	32	10.34 ns	0.152 ns	0.142 ns	10.35 ns	10.16 ns	10.56 ns	0.72

Int	Vectorized	128	48.33 ns	0.132 ns	0.110 ns	48.34 ns	47.99 ns	48.42 ns	1.00
Int	Current	128	19.28 ns	0.686 ns	0.790 ns	19.32 ns	18.19 ns	20.75 ns	0.41

Int	Vectorized	1024	341.92 ns	6.583 ns	6.465 ns	345.28 ns	329.47 ns	348.16 ns	1.00
Int	Current	1024	83.02 ns	0.090 ns	0.080 ns	83.01 ns	82.91 ns	83.18 ns	0.24

Long	Vectorized	32	13.93 ns	0.093 ns	0.082 ns	13.93 ns	13.68 ns	14.01 ns	1.00
Long	Current	32	11.22 ns	0.078 ns	0.069 ns	11.24 ns	10.99 ns	11.27 ns	0.81

Long	Vectorized	128	47.72 ns	0.142 ns	0.133 ns	47.77 ns	47.30 ns	47.82 ns	1.00
Long	Current	128	24.58 ns	0.165 ns	0.146 ns	24.62 ns	24.08 ns	24.67 ns	0.52

Long	Vectorized	1024	333.73 ns	0.799 ns	0.748 ns	333.88 ns	331.41 ns	334.44 ns	1.00
Long	Current	1024	150.21 ns	0.237 ns	0.222 ns	150.13 ns	149.89 ns	150.70 ns	0.45

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-MEXHPT : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-VGHVOM : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type	Job	Length	Mean	Error	StdDev	Median	Min	Max	Ratio
Int	Vectorized	32	28.12 ns	0.043 ns	0.041 ns	28.12 ns	28.08 ns	28.19 ns	1.00
Int	Current	32	16.93 ns	0.031 ns	0.026 ns	16.93 ns	16.89 ns	16.98 ns	0.60

Int	Vectorized	128	103.80 ns	0.023 ns	0.021 ns	103.80 ns	103.75 ns	103.83 ns	1.00
Int	Current	128	56.63 ns	0.072 ns	0.064 ns	56.62 ns	56.53 ns	56.77 ns	0.55

Int	Vectorized	1024	832.41 ns	0.108 ns	0.090 ns	832.37 ns	832.28 ns	832.58 ns	1.00
Int	Current	1024	432.59 ns	0.087 ns	0.081 ns	432.58 ns	432.46 ns	432.74 ns	0.52

Author:	brantburnett
Assignees:	-
Labels:	`area-System.Linq`, `community-contribution`
Milestone:	-

KeterSCP · 2023-04-09T19:24:50Z

@brantburnett according to your benchmarks table, the vectorized version is slower for all cases, this might be a mistake I suppose?

brantburnett · 2023-04-09T21:34:53Z

@brantburnett according to your benchmarks table, the vectorized version is slower for all cases, this might be a mistake I suppose?

Yes, silly typo when I was replacing the long --corerun paths with something meaningful. Thanks for pointing it out, I've corrected it.

src/libraries/System.Linq/src/System/Linq/Sum.cs

eiriktsarpalis

It would be interesting if you could share any benchmark results showcasing the improvements of the change.

You can find details on how to benchmark private builds in this doc.

brantburnett · 2023-04-21T12:16:47Z

It would be interesting if you could share any benchmark results showcasing the improvements of the change.

You can find details on how to benchmark private builds in this doc.

There are benchmark results for both x64 and ARM in the main PR description, are there other benchmarks you'd like to see?

eiriktsarpalis · 2023-04-21T12:27:42Z

There are benchmark results for both x64 and ARM in the main PR description, are there other benchmarks you'd like to see?

D'oh! Completely skipped past those :-)

tannergooding · 2023-04-21T13:33:05Z

src/libraries/System.Linq/src/System/Linq/Sum.cs

+            Vector<T> accumulator = Vector<T>.Zero;
+
+            // Build a test vector with only the sign bit set in each element. JIT will fold this into a constant.
+            Vector<T> overflowTestVector = new(T.RotateRight(T.MultiplicativeIdentity, 1));


Given that we know T works with Vector, we know it must be a primitive value which is 2's complement. So we can add where T : IMinMaxValue<T> and simply use T.MinValue instead to get a mask of the sign.

tannergooding · 2023-04-21T13:39:27Z

src/libraries/System.Linq/src/System/Linq/Sum.cs

+                ptr = ref Unsafe.Add(ref ptr, Vector<T>.Count * 4);
+                length -= Vector<T>.Count * 4;


Similar feedback has been given on other PRs, but we'd prefer to avoid mutating the byref where possible in favor of standard indexing and using LoadUnsafe(ref baseAddress, index) instead. You can then increment index by Vector<T>.Count in the loop.

This can result in slightly worse codegen in some cases, but it makes the code simpler and less prone to GC holes and similar bugs.

Done, and the JIT output still looks decent. It was able to encode the offset arithmetic into the operation on x64:

lea ebx, [eax+edi+0x40]

Unfortunately, it's an extra lea operation. While this did impact the benchmarks, they are still an improvement over unvectorized.

src/libraries/System.Linq/src/System/Linq/Sum.cs

tannergooding · 2023-04-21T13:52:04Z

src/libraries/System.Linq/src/System/Linq/Sum.cs

+            // Add any remaining elements
+            for (int i = 0; i < length; i++)
+            {
+                checked { result += Unsafe.Add(ref ptr, i); }
+            }


This could be handled by doing a backtrack and masking off already processed data.

Do you have any suggestions on how to build a mask like that in a Vector<T>? So far the only way I've found is looping and calling WithElement, but looking at the resulting JIT of that loop I feel like just adding the elements from the span will be way better.

L0126: vxorps ymm3, ymm3, ymm3 L012a: vmovupd [ebp-0x2c], ymm3 L012f: cmp eax, 8 L0132: jae L020a L0138: vmovupd [ebp-0x2c], ymm2 L013d: lea ecx, [ebp-0x2c] L0140: xor esi, esi L0142: mov [ecx+eax*4], esi L0145: vmovupd ymm2, [ebp-0x2c] L014a: inc eax L014b: cmp edx, eax L014d: jg short L0126

I considered a static ROS of zero elements followed by all bits set elements that could be indexed to find the mask. However, this is problematic because we're dealing with generic integers.

Okay, I think I have a partial solution. Using left byte-shift operations on vectors will work to fill with zero elements, since we don't care if we move the elements being summed into different lanes. The problem is that this only works with Vector128<T> on Intel. So to use this solution we'd need to either:

A) Only implement as Vector128<T>, losing long support and running slower on Intel.
or B) Implement split Vector128<T> and Vector256<T> implementations, with lots of code duplication. We'd then only gain the backtrack advantages on short int arrays that fallback to Vector128<T>, old Intel processors, and ARM.

I don't think A is a good option. And I'm not sure B is worth it? Let me know what you think.

eiriktsarpalis · 2023-10-25T17:53:51Z

Breaking change doc: dotnet/docs#37734

Vectorize IEnumerable<T>.Sum where possible

2b1bd70

ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 8, 2023

dotnet-issue-labeler bot added the area-System.Linq label Apr 8, 2023

Remove unnecessary CreateChecked and fix long test

dff12c5

build-analysis bot mentioned this pull request Apr 9, 2023

Tracking issue for CI build timeouts #76454

Closed

brantburnett marked this pull request as ready for review April 9, 2023 00:50

eiriktsarpalis reviewed Apr 20, 2023

View reviewed changes