
Add ARM64 encodings for groups IF_SVE_CE,CF #98409

Merged
merged 11 commits into from
Feb 29, 2024

Conversation

snickolls-arm
Contributor

Adds ARM64 encodings for pmov variants. The output matches capstone:

pmov p2.b, z12
pmov p15.d, z7[7]
pmov p7.d, z16[0]
pmov p0.h, z31[1]
pmov p1.h, z1[0]
pmov p3.s, z9[3]
pmov p10.s, z4[0]
pmov p11.b, z12
pmov p6.d, z8[7]
pmov p9.d, z7[0]
pmov p8.h, z4[1]
pmov p5.h, z9[0]
pmov p14.s, z2[3]
pmov p3.s, z15[0]

Contributing towards #94549.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 14, 2024
@ghost

ghost commented Feb 14, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@snickolls-arm
Contributor Author

@a74nh @kunalspathak @dotnet/arm64-contrib

@ryujit-bot

Diff results for #98409

Throughput diffs

Throughput diffs for osx/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
realworld.run.osx.arm64.checked.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Details here


@ryujit-bot

Diff results for #98409

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%
realworld.run.windows.arm64.checked.mch -0.01%

Details here



@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Feb 14, 2024
@kunalspathak kunalspathak requested a review from TIHan February 14, 2024 18:57
assert(isVectorRegister(reg));
printf("%s", emitSveRegName(reg));

printf("[%d]", static_cast<int>(index));
Contributor

nit: we have emitDispElementIndex which will do this.

So something like this:

//------------------------------------------------------------------------
// emitDispSveRegIndex: Display a scalable vector register with indexed element
//
void emitter::emitDispSveRegIndex(regNumber reg, ssize_t index, bool addComma)
{
    assert(isVectorRegister(reg));
    printf("%s", emitSveRegName(reg));
    emitDispElementIndex(index, addComma);
}

// Encodes an immediate value in consecutive bits from most significant position 'hi' to least significant
// position 'lo'.
template <const size_t hi, const size_t lo>
static code_t insEncodeUimm(size_t imm)
Contributor

We tend to write explicit insEncodeUimm*, but I'm not opposed to having something more generalized.
@BruceForstall @kunalspathak what are the guidelines for writing generalized functions like this?

Member

I like this idea. Thanks @snickolls-arm. Will it be possible to replace existing methods with this one and make sure you don't hit any asserts?

Contributor Author

Yes I'll take a look at replacing these. Would you prefer this in a separate patch or on this one?

Member

separate patch is fine.

// value between them. The bit ranges are from hi1-lo1, and hi2-lo2 where the second range is at a less
// significant position relative to the first.
template <const size_t hi1, const size_t lo1, const size_t hi2, const size_t lo2>
static code_t insEncodeSplitUimm(size_t imm)
Contributor

Same as the comment for insEncodeUimm

@@ -701,6 +737,13 @@ static bool isValidSimm4_MultipleOf32(ssize_t value)
return (-256 <= value) && (value <= 224) && (value % 32 == 0);
};

template <const size_t bits>
Contributor

Same as the comment for insEncodeUimm

Contributor

@TIHan TIHan left a comment

LGTM, just a small nit.

I like the generalized functions being introduced, but we should wait for @kunalspathak or @BruceForstall to weigh in on it.

switch (opt)
{
case INS_OPTS_SCALABLE_D:
assert(isValidUimm3(imm));
Member

please use the new methods that you have, here and elsewhere.

@@ -9203,6 +9264,48 @@ void emitter::emitIns_R_R_I(instruction ins,
fmt = IF_SVE_AM_2A;
break;

case INS_sve_pmov:
switch (opt)
Member

might be better to rewrite this switch/case something like this:

if (isPredicateRegister(reg1) && isVectorRegister(reg2))
{
    switch (opt)
    {
        case INS_OPTS_SCALABLE_D:
            assert(isValidUimm3(imm));
            fmt = IF_SVE_CE_2B;
            break;
        case INS_OPTS_SCALABLE_S:
            assert(isValidUimm2(imm));
            fmt = IF_SVE_CE_2D;
            break;
        case INS_OPTS_SCALABLE_H:
            assert(isValidImm1(imm));
            fmt = IF_SVE_CE_2C;
            break;
        default:
            unreached();
    }
}
else if (isVectorRegister(reg1) && isPredicateRegister(reg2))
{
    switch (opt)
    {
        case INS_OPTS_SCALABLE_D:
            assert(isValidUimm3(imm));
            fmt = IF_SVE_CF_2B;
            break;
        case INS_OPTS_SCALABLE_S:
            assert(isValidUimm2(imm));
            fmt = IF_SVE_CF_2D;
            break;
        case INS_OPTS_SCALABLE_H:
            assert(isValidImm1(imm));
            fmt = IF_SVE_CF_2C;
            break;
        default:
            unreached();
    }
}
else
{
  unreached();
}

void emitter::emitDispSveReg(regNumber reg, bool addComma)
{
assert(isVectorRegister(reg));
printf("%s", emitSveRegName(reg));
Member

Suggested change
printf("%s", emitSveRegName(reg));
printf(emitSveRegName(reg));

void emitter::emitDispSveRegIndex(regNumber reg, ssize_t index, bool addComma)
{
assert(isVectorRegister(reg));
printf("%s", emitSveRegName(reg));
Member

Suggested change
printf("%s", emitSveRegName(reg));
printf(emitSveRegName(reg));

assert(isVectorRegister(reg));
printf("%s", emitSveRegName(reg));

printf("[%d]", static_cast<int>(index));
Member

Suggested change
printf("[%d]", static_cast<int>(index));
printf("[%d]", (int)index);

@@ -576,6 +578,40 @@ static code_t insEncodeSveElemsize_dtype(instruction ins, emitAttr size, code_t
// for the 'dtype' field.
static code_t insEncodeSveElemsize_dtype_ld1w(instruction ins, insFormat fmt, emitAttr size, code_t code);

// Encodes an immediate value in consecutive bits from most significant position 'hi' to least significant
Member

nit

Suggested change
// Encodes an immediate value in consecutive bits from most signficant position 'hi' to least significant
// Encodes an immediate value in consecutive bits from most significant position 'hi' to least significant

template <const size_t hi, const size_t lo>
static code_t insEncodeUimm(size_t imm)
{
static_assert(hi >= lo && hi < sizeof(code_t) * 8);
Member

nit

Suggested change
static_assert(hi >= lo && hi < sizeof(code_t) * 8);
static_assert((hi >= lo) && (hi < sizeof(code_t) * BITS_PER_BYTE));

{
static_assert(hi >= lo && hi < sizeof(code_t) * 8);
const size_t imm_bits = hi - lo + 1;
const size_t imm_max = 1 << imm_bits;
Member

We'll never use this function as insEncodeUimm<63,0>(imm) (for a 64-bit number), but this expression will overflow in that case. Maybe add an assert(imm_bits < sizeof(code_t) * BITS_PER_BYTE)?

template <const size_t hi1, const size_t lo1, const size_t hi2, const size_t lo2>
static code_t insEncodeSplitUimm(size_t imm)
{
static_assert(hi1 >= lo1 && lo1 > hi2 && hi2 >= lo2);
Member

nit

Suggested change
static_assert(hi1 >= lo1 && lo1 > hi2 && hi2 >= lo2);
static_assert((hi1 >= lo1) && (lo1 > hi2) && (hi2 >= lo2));

const size_t imm_bits = hi - lo + 1;
const size_t imm_max = 1 << imm_bits;
assert(imm < imm_max);
return static_cast<code_t>(imm << lo);
Member

Should we have an assert that we're not losing any high-order bits? E.g.,

code_t result = static_cast<code_t>(imm << lo);
assert((result >> lo) == imm);
return result;

?

Contributor Author

I think it's worth adding this as it can tighten any portability assumptions that come with code_t and size_t.

Member

@kunalspathak kunalspathak left a comment

LGTM. Thanks, please fix the formatting errors.

@ryujit-bot

Diff results for #98409

Throughput diffs

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Details here



@ryujit-bot

Diff results for #98409

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for osx/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.osx.arm64.checked.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Details here


@a74nh
Contributor

a74nh commented Feb 22, 2024

Looking at the output:

pmov p2.b, z12
pmov p15.d, z7[7]
pmov p7.d, z16[0]
pmov p0.h, z31[1]
pmov p1.h, z1[0]
pmov p3.s, z9[3]
pmov p10.s, z4[0]
pmov p11.b, z12
pmov p6.d, z8[7]
pmov p9.d, z7[0]
pmov p8.h, z4[1]
pmov p5.h, z9[0]
pmov p14.s, z2[3]
pmov p3.s, z15[0]

The last bunch of instructions should have a destination Z and a source P. Looks like the encodings got mixed up somewhere.

Needs fixing before merge.

@TIHan
Contributor

TIHan commented Feb 22, 2024

Looks like the display code for the vector and predicate registers just needs to be flipped. The encodings look right, though.

@snickolls-arm
Contributor Author

Looking at the output:

pmov p2.b, z12
pmov p15.d, z7[7]
pmov p7.d, z16[0]
pmov p0.h, z31[1]
pmov p1.h, z1[0]
pmov p3.s, z9[3]
pmov p10.s, z4[0]
pmov p11.b, z12
pmov p6.d, z8[7]
pmov p9.d, z7[0]
pmov p8.h, z4[1]
pmov p5.h, z9[0]
pmov p14.s, z2[3]
pmov p3.s, z15[0]

The last bunch of instructions should have a destination Z and a source P. Looks like the encodings got mixed up somewhere.

Needs fixing before merge.

We've identified that this comes from my implementation relying on isPredicateRegister. Because predicate registers are currently aliased to vector registers, isPredicateRegister returns true for vector registers as well, so the code always takes the paths where isPredicateRegister(reg1) && isVectorRegister(reg2) == true and only ever emits the pmov (to predicate) variant. I think the logic itself is written correctly.

@a74nh has suggested adding a new scalable option for this which would fix it in the meantime until the predicates are distinctly defined.

@@ -8629,6 +8675,25 @@ void emitter::emitIns_R_R(instruction ins,
}
break;

case INS_sve_pmov:
if (sopt == INS_SCALABLE_OPTS_TO_PREDICATE)
Contributor

Could you add a check so that INS_OPTS_SCALABLE_B is required for opt (here and in all the other places IF_SVE_CE_2A and IF_SVE_CF_2A are checked)?

This ensures the calling code can't pass a different sized scalable. E.g. we want theEmitter->emitIns_R_R(INS_sve_pmov, EA_SCALABLE, REG_P2, REG_V12, INS_OPTS_SCALABLE_H, INS_SCALABLE_OPTS_TO_PREDICATE); to fail

Contributor

Actually, what would be even more useful is if opt is INS_OPTS_SCALABLE_H/S/D then it should just do:
return emitIns_R_R_I(INS_sve_pmov, attr, reg1, reg2, 0, opt, sopt, INS_SCALABLE_OPTS_TO_PREDICATE);

Why? Because it's a useful shortcut. Most of the time the code calling all of this will be wanting to do a pmov with immediate 0, and it's a pain to have to write:
if (size == B) then emitIns_R_R(pmov,....) else emitIns_R_R_I(pmov,....,0);

Member

@kunalspathak kunalspathak left a comment

LGTM

isPredicateRegister is returning true for vector registers as well.

Yes, unfortunately we will have to wait for this to get added. There are just one or two places where we rely on this method, and hopefully it doesn't cause any problems until the support to handle them lands.

@kunalspathak
Member

@a74nh - do you have any additional comments?

Contributor

@a74nh a74nh left a comment

LGTM

@kunalspathak kunalspathak merged commit 8dcca1c into dotnet:main Feb 29, 2024
127 of 129 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2024
@snickolls-arm snickolls-arm deleted the github-IF_SVE_CE,CF branch January 17, 2025 16:07