-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement NarrowUtf16ToAscii for AArch64 #70080
Implement NarrowUtf16ToAscii for AArch64 #70080
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsFixes #41292 partially.
|
Hi @kunalspathak, you might want to take a look at this PR. |
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
@dotnet/jit-contrib |
Thanks @SwapnilGaikwad for your contribution. Could you also share some performance numbers? You can start with https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs. |
Hi Kunal, did a quick test on A72 for the |
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
@SwapnilGaikwad - Did you get chance to make any progress? |
Hi @kunalspathak, we made progress. Currently benchmarking the version that combines SSE2 and ASIMD implementations along with the above comments. I hope to get it ready by tomorrow. |
ef51903
to
daf52d6
Compare
Hi @kunalspathak, now the patch makes use of the vector API more. For ASCII strings of 512 size, it executes in about 0.92x on AArch64 and 0.86x on x86 compared to the execution time for the HEAD. |
That sounds great. Do you mind posting the actual numbers like done in #70654 (comment)?
I think it has a typo. Can you try with |
Hi @kunalspathak, here are the numbers.
On x86 (Xeon Gold 5120T)
The |
You can build everything release - coreclr/libraries and then drop a checked clrjit to make that environment variable work. |
Thanks for sharing the numbers. It seems the ascii-16 bytes is slightly slower. Do you know why? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looks good overall. I would like to see the disasm code difference of SSE2 as well as the disasm for AdvSimd. @tannergooding - do you mind taking a look as well?
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
Sse2.StoreScalar((ulong*)pAsciiBuffer, asciiVector.AsUInt64()); // ulong* calculated here is UNALIGNED | ||
|
||
Vector128<byte> asciiVector = ExtractAsciiVector(utf16VectorFirst); | ||
asciiVector.GetLower().StoreUnsafe(ref asciiBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be curious to see the assembly difference for SSE2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍. Vector64<T>
is a bit odd on x86/x64 since its treated as a regular struct { ulong _value; }
, so I'd expect promotion/etc to do the right thing here
but it might not be as efficient as extracting the lowest UInt64
scalar (it probably could be with some tweaks however if it does differ)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that by switching to StoreUnsafe
introduced a bigger sequence of assembly with a few unnecessary movs.
On HEAD, we get the following assembly for extract and store.
vpackuswb xmm0, xmm0, xmm0
vmovq qword ptr [r15], xmm0
mov r12d, 8
test r15b, 8
With the above change, extract and store looks as following.
mov r12, rbx
vpackuswb xmm0, xmm0, xmm0
vmovapd xmmword ptr [rbp-80H], xmm0
mov rax, qword ptr [rbp-80H]
mov qword ptr [rbp-58H], rax
mov rax, qword ptr [rbp-58H]
mov qword ptr [r12], rax
mov r13d, 8
test bl, 8
Would you suggest to stick to Vector128 api in this case?
Full assembly dumps are available here - HEAD, PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant - can you share the arm64 versions? You can also turn off the address displays using set COMPlus_JitDiffableDasm=1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the updated one. For the PR, assembly for NarrowUtf16ToAscii_Intrinsified
should have been dumped instead of NarrowUtf16ToAscii
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this one uses StoreUnsafe
because I don't see AV related code generated as you pointed out in #70080 (comment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this one uses StoreUnsafe
. The aligned stores are only performed in a loop, now changing them to StoreUnsafe
.
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
It seems the loop peeling logic to get the destination pointing to 8-byte aligned address adds a slight overhead for smaller strings. [1] Microbenchmark results after removing the aligned write logic:
|
How can I dump the assembly for
Also, re-building with @kunalspathak @tannergooding Do you guys have any better ways to extract the assembly reliably? 🤔 |
Follow #70080 (comment). To remove unnecessary asserts, you will need release version of SPC. Did you try #70080 (comment)? |
daf52d6
to
a5bb5cf
Compare
Thanks! Now can extract the assembly. The missing piece was that the build wasn't checked. |
a5bb5cf
to
6968765
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
9fc032e
to
5ae8594
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
5ae8594
to
3b775f3
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
3b775f3
to
e6f6cb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to update a comment before we can go ahead with the merge.
Also, I see that when the review comments are addressed, they are squashed in previous commits. It's not something that is common on this repo. As a reviewer, I have to review entire changes instead of just the updates that were done as part of review comment. A better approach would be to not squash the commit and so those can be reviewed as a standalone change, and it would make my role as reviewer a lot easier. Is there some benefit to the approach you are using?
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
Sure. Apologies for inconvenience. I'll use add separate commits going forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for your contribution!
Fixes #41292 partially.