Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Slightly relax "const vector propagation" heuristics #76788

Closed
wants to merge 5 commits into from

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Oct 9, 2022

Fixes #76781

By changing too conservative heuristic I introduced in #70378, also changed SPMI to print PerfScore (will revert before merge)

Snippet from #76781

static void Foo(ref byte src, nuint length, ref byte dest)
{
    nuint idx = 0;
    Vector128<byte> mask = Vector128.Create((byte)0x7F);
    if (length >= (uint)Vector128<byte>.Count)
    {
        nuint vecEnd = length - (uint)Vector128<byte>.Count;
        do
        {
            Vector128<byte> vec = Vector128.LoadUnsafe(ref src, idx);
            vec &= mask;
            vec.StoreUnsafe(ref dest, idx);
            idx += (uint)Vector128<byte>.Count;
        } while (idx <= vecEnd);
    }
}
; Method Program:Foo(byref,ulong,byref)
G_M5292_IG01:              
       C5F877               vzeroupper 
G_M5292_IG02:              
       33C0                 xor      eax, eax
+      C4E179100532000000   vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       4883FA10             cmp      rdx, 16
       7223                 jb       SHORT G_M5292_IG05
G_M5292_IG03:              
       4883C2F0             add      rdx, -16
-      90                   align    [1 bytes for IG04]
+      0F1F840000000000     align    [8 bytes for IG04]
G_M5292_IG04:              
-      C5FA6F0401           vmovdqu  xmm0, xmmword ptr [rcx+rax]
-      C5F9DB0513000000     vpand    xmm0, xmm0, xmmword ptr [reloc @RWD00]
-      C4C17A7F0400         vmovdqu  xmmword ptr [r8+rax], xmm0
+      C5F9DB0C01           vpand    xmm1, xmm0, xmmword ptr [rcx+rax]
+      C4C17A7F0C00         vmovdqu  xmmword ptr [r8+rax], xmm1
       4883C010             add      rax, 16
       483BC2               cmp      rax, rdx
       76E4                 jbe      SHORT G_M5292_IG04
G_M5292_IG05:              
       C3                   ret      
RWD00  	dq	7F7F7F7F7F7F7F7Fh, 7F7F7F7F7F7F7F7Fh
-; Total bytes of code: 45
+; Total bytes of code: 53  ;; but better overall perfscore!

@ghost ghost assigned EgorBo Oct 9, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 9, 2022
@ghost
Copy link

ghost commented Oct 9, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #76781

By changing too conservative heuristic I introduced in #70378, also changed SPMI to print PerfScore (will revert before merge)

Snippet from #76781

static void Foo(ref byte src, nuint length, ref byte dest)
{
    nuint idx = 0;
    Vector128<byte> mask = Vector128.Create((byte)0x7F);
    if (length >= (uint)Vector128<byte>.Count)
    {
        nuint vecEnd = length - (uint)Vector128<byte>.Count;
        do
        {
            Vector128<byte> vec = Vector128.LoadUnsafe(ref src, idx);
            vec &= mask;     // use of hoisted constant

            vec.StoreUnsafe(ref dest, idx);

            idx += (uint)Vector128<byte>.Count;
        } while (idx <= vecEnd);
    }
}
; Method Program:Foo(byref,ulong,byref)
G_M5292_IG01:              
       C5F877               vzeroupper 
G_M5292_IG02:              
       33C0                 xor      eax, eax
+      C4E179100532000000   vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       4883FA10             cmp      rdx, 16
       7223                 jb       SHORT G_M5292_IG05
G_M5292_IG03:              
       4883C2F0             add      rdx, -16
-      90                   align    [1 bytes for IG04]
+      0F1F840000000000     align    [8 bytes for IG04]
G_M5292_IG04:              
-      C5FA6F0401           vmovdqu  xmm0, xmmword ptr [rcx+rax]
-      C5F9DB0513000000     vpand    xmm0, xmm0, xmmword ptr [reloc @RWD00]
-      C4C17A7F0400         vmovdqu  xmmword ptr [r8+rax], xmm0
+      C5F9DB0C01           vpand    xmm1, xmm0, xmmword ptr [rcx+rax]
+      C4C17A7F0C00         vmovdqu  xmmword ptr [r8+rax], xmm1
       4883C010             add      rax, 16
       483BC2               cmp      rax, rdx
       76E4                 jbe      SHORT G_M5292_IG04
G_M5292_IG05:              
       C3                   ret      
RWD00  	dq	7F7F7F7F7F7F7F7Fh, 7F7F7F7F7F7F7F7Fh
-; Total bytes of code: 45
+; Total bytes of code: 53
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Oct 9, 2022

@jakobbotsch is this a correct way to switch to PerfScore for superpmi-diff job?

@jakobbotsch
Copy link
Member

@jakobbotsch is this a correct way to switch to PerfScore for superpmi-diff job?

I think it should work depending on the size of the diffs. If there are massive number of diffs then you will hit the problem we always had before #76238, but if it manages to complete I think the jit-analyze output should show perfscore analysis.

@EgorBo
Copy link
Member Author

EgorBo commented Oct 14, 2022

This need more thinking judging by regressions 🙁

@EgorBo EgorBo closed this Oct 14, 2022
@EgorBo EgorBo reopened this Nov 6, 2022
@EgorBo EgorBo force-pushed the fix-cns-vec-prop2 branch from 9b3ae74 to c1af69a Compare November 6, 2022 01:36
@EgorBo EgorBo closed this Nov 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CQ: Manually hoisted Vector{256,128} constant is reloaded inside the loop (sometimes)
2 participants