Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT ARM64/SVE2: Add increment/decrement from predicate register instructions #94811

Merged
merged 28 commits into from
Dec 18, 2023

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Nov 15, 2023

This implements the following instructions:

  • incp/decp <Xdn>, <Pm>.<T>: increment/decrement general register by predicate count
  • incp/decp <Zdn>.<T>, <Pm>.<T>: increment/decrement vector register by predicate count
  • sqdecp/sqincp <Xdn>, <Pm>.<T>, <Wdn>: signed saturating increment/decrement general register by predicate count
  • sqdecp/sqincp <Zdn>.<T>, <Pm>.<T>: signed saturating increment/decrement vector register by predicate count
  • uqdecp/uqincp <Wdn>, <Pm>.<T>: unsigned saturating increment/decrement general register by predicate count
  • uqdecp/uqincp <Zdn>.<T>, <Pm>.<T>: unsigned saturating increment/decrement vector register by predicate count

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 15, 2023
@ghost ghost assigned amanasifkhalid Nov 15, 2023
@ghost
Copy link

ghost commented Nov 15, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

These changes are modeled on Kunal's work in #94529; once that work is merged in, I'll rebase these changes on top of it. I've implemented the following instructions:

  • incp/decp <Xdn>, <Pm>.<T>: increment/decrement general register by predicate count
  • incp/decp <Zdn>.<T>, <Pm>.<T>: increment/decrement vector register by predicate count
  • sqdecp/sqincp <Xdn>, <Pm>.<T>, <Wdn>: signed saturating increment/decrement general register by predicate count
  • sqdecp/sqincp <Zdn>.<T>, <Pm>.<T>: signed saturating increment/decrement vector register by predicate count
  • uqdecp/uqincp <Wdn>, <Pm>.<T>: unsigned saturating increment/decrement general register by predicate count
  • uqdecp/uqincp <Zdn>.<T>, <Pm>.<T>: unsigned saturating increment/decrement vector register by predicate count
Author: amanasifkhalid
Assignees: amanasifkhalid
Labels:

area-CodeGen-coreclr

Milestone: -

@amanasifkhalid amanasifkhalid added the arm-sve Work related to arm64 SVE/SVE2 support label Nov 15, 2023
Comment on lines 958 to 959
case IF_SVE_DN_2A: // ........xx...... .......MMMMddddd -- SVE inc/dec vector by predicate count
case IF_SVE_DP_2A: // ........xx...... .......MMMMddddd -- SVE saturating inc/dec vector by predicate count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these forms be merged?

(Or should we wait to decide on merging until after many/all instructions have been implemented?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with either doing it now or waiting until after we've implemented all of them. @kunalspathak @TIHan do you have any preference for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting until after we've implemented all of them

Will be good to wait so we can track the coverage of them as listed in #94549. We can then have one pass to combine all the duplicate ones.

{
// TODO-SVE: Fix once we add predicate registers
assert(isPredicateRegister(reg));
printf(emitVectorRegName(reg));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we need a emitPredicateRegName than handles the "fake" predicate registers but still outputs p0 through p15

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kunal has a temporary implementation for this in #95016. As an aside, there seems to be potential for circular dependencies between these two PRs. Should I wait for Kunal's to be merged in first? I can then update my implementation with the new methods #95016 adds in (like emitPredicateRegName) and also update the predicate register TODOs to use the aliases you suggested above.

inline static bool isPredicateRegister(regNumber reg)
{
// TODO-SVE: Fix once we add predicate registers
return (reg >= REG_FP_FIRST && reg <= REG_FP_LAST);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use the new REG_P0 to REG_P15. And/or define REG_P_FIRST and REG_P_LAST?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in #95016

@kunalspathak
Copy link
Member

@amanasifkhalid - I will update #95016 to add the following:

  • Add emitDispSveRegister() to display "Z" registers
  • Add fake predicate registers and corresponding methods (thanks @BruceForstall for the suggestion)

Once #95016 along with these changes are merged, it will become easier to base the changes on top of it. We still won't be able to merge it because of encoding validation.

@amanasifkhalid
Copy link
Member Author

I merged in Kunal's changes from #95016 and fixed the unit test additions. I'm waiting for #95105 to be merged in before using the updated machine-generated files, since the new emitOutputInstr code relies on the insEncodeReg_* methods.

We still won't be able to merge it because of encoding validation.

Would it make sense for me to open more (draft) PRs like this one if we will have to wait to merge them? I guess this is a good way to find issues with the tool-generated code.

@kunalspathak
Copy link
Member

@amanasifkhalid - I have shared few steps offline on validating the encoding, see if you can check them. Please hold off on doing more draft PRs as I am updating the tool to fix few things that came up in Alan's #95127 . You might want to consume the new files.

@amanasifkhalid
Copy link
Member Author

@kunalspathak thank you for sharing the steps! The jitdump of the new instructions prints the predicate registers as vector registers for now:

decp    x0, v0.b, 
decp    v0.b, v0.b, 
incp    x0, v0.b, 
incp    v0.b, v0.b, 
sqdecp  x0, v0.b, 
sqdecp  v0.b, v0.b, 
sqincp  x0, v0.b, 
sqincp  v0.b, v0.b, 
uqdecp  x0, v0.b, 
uqdecp  v0.b, v0.b, 
uqincp  x0, v0.b, 
uqincp  v0.b, v0.b,

I think I'll have to wait to merge changes in from #95127 to fix this (along with the EA_SCALABLE-related changes). cstool.exe doesn't seem to be able to parse sqdecp/sqincp/uqdecp/uqincp correctly. The cstool output only includes decp and incp:

fc  00 88 ed 25  decp   x0, p0.d
100  00 80 ed 25  decp  z0.d, p0.d
104  00 88 ec 25  incp  x0, p0.d
108  00 80 ec 25  incp  z0.d, p0.d

Here's the hex I'm passing to cstool.exe (comments mine):

0088ED25 // decp
0080ED25 // decp
0088EC25 // incp
0080EC25 // incp
0088EAE5 // sqdecp
0080EA25 // sqdecp
0088E8E5 // sqincp
0080E825 // sqincp
0088EBE5 // uqdecp
0080EB25 // uqdecp
0088E9E5 // uqincp
0080E925 // uqincp

@amanasifkhalid amanasifkhalid marked this pull request as ready for review December 11, 2023 16:11
@amanasifkhalid
Copy link
Member Author

@dotnet/arm64-contrib, sorry for the delay; please take a look. Here's the JIT disasm output:

            decp    x0, p0.b
            decp    x1, p1.h
            decp    x2, p2.s
            decp    x3, p3.d
            incp    x4, p4.b
            incp    x5, p5.h
            incp    x6, p6.s
            incp    x7, p7.d
            decp    z0.h, p0.h
            decp    z1.s, p1.s
            decp    z2.d, p2.d
            incp    z3.h, p3.h
            incp    z4.s, p4.s
            incp    z5.d, p5.d
            sqdecp  x0, p0.b, w0
            sqdecp  x1, p1.h, w1
            sqdecp  x2, p2.s, w2
            sqdecp  x3, p3.d, w3
            sqdecp  x4, p4.b
            sqdecp  x5, p5.h
            sqdecp  x6, p6.s
            sqdecp  x7, p7.d
            sqincp  x0, p0.h, w0
            sqincp  x1, p1.s, w1
            sqincp  x2, p2.b, w2
            sqincp  x3, p3.d, w3
            sqincp  x4, p4.b
            sqincp  x5, p5.h
            sqincp  x6, p6.s
            sqincp  x7, p7.d
            uqdecp  w0, p0.b
            uqdecp  w1, p1.h
            uqdecp  w2, p2.s
            uqdecp  w3, p3.d
            uqdecp  x4, p4.b
            uqdecp  x5, p5.h
            uqdecp  x6, p6.s
            uqdecp  x7, p7.d
            uqincp  w0, p0.b
            uqincp  w1, p1.h
            uqincp  w2, p2.s
            uqincp  w3, p3.d
            uqincp  x4, p4.b
            uqincp  x5, p5.h
            uqincp  x6, p6.s
            uqincp  x7, p7.d
            sqdecp  z0.h, p0.h
            sqdecp  z1.s, p1.s
            sqdecp  z2.d, p2.d
            sqincp  z3.h, p3.h
            sqincp  z4.s, p4.s
            sqincp  z5.d, p5.d
            uqdecp  z6.h, p6.h
            uqdecp  z7.s, p7.s
            uqdecp  z8.d, p0.d
            uqincp  z9.h, p1.h
            uqincp  z10.s, p2.s
            uqincp  z11.d, p3.d

and the cstool output:

00882D25  decp  x0, p0.b
21886D25  decp  x1, p1.h
4288AD25  decp  x2, p2.s
6388ED25  decp  x3, p3.d
84882C25  incp  x4, p4.b
A5886C25  incp  x5, p5.h
C688AC25  incp  x6, p6.s
E788EC25  incp  x7, p7.d
00806D25  decp  z0.h, p0.h
2180AD25  decp  z1.s, p1.s
4280ED25  decp  z2.d, p2.d
63806C25  incp  z3.h, p3.h
8480AC25  incp  z4.s, p4.s
A580EC25  incp  z5.d, p5.d
00882A25  sqdecp        x0, p0.b, w0
21886A25  sqdecp        x1, p1.h, w1
4288AA25  sqdecp        x2, p2.s, w2
6388EA25  sqdecp        x3, p3.d, w3
848C2A25  sqdecp        x4, p4.b
A58C6A25  sqdecp        x5, p5.h
C68CAA25  sqdecp        x6, p6.s
E78CEA25  sqdecp        x7, p7.d
00886825  sqincp        x0, p0.h, w0
2188A825  sqincp        x1, p1.s, w1
42882825  sqincp        x2, p2.b, w2
6388E825  sqincp        x3, p3.d, w3
848C2825  sqincp        x4, p4.b
A58C6825  sqincp        x5, p5.h
C68CA825  sqincp        x6, p6.s
E78CE825  sqincp        x7, p7.d
00882B25  uqdecp        w0, p0.b
21886B25  uqdecp        w1, p1.h
4288AB25  uqdecp        w2, p2.s
6388EB25  uqdecp        w3, p3.d
848C2B25  uqdecp        x4, p4.b
A58C6B25  uqdecp        x5, p5.h
C68CAB25  uqdecp        x6, p6.s
E78CEB25  uqdecp        x7, p7.d
00882925  uqincp        w0, p0.b
21886925  uqincp        w1, p1.h
4288A925  uqincp        w2, p2.s
6388E925  uqincp        w3, p3.d
848C2925  uqincp        x4, p4.b
A58C6925  uqincp        x5, p5.h
C68CA925  uqincp        x6, p6.s
E78CE925  uqincp        x7, p7.d
00806A25  sqdecp        z0.h, p0.h
2180AA25  sqdecp        z1.s, p1.s
4280EA25  sqdecp        z2.d, p2.d
63806825  sqincp        z3.h, p3.h
8480A825  sqincp        z4.s, p4.s
A580E825  sqincp        z5.d, p5.d
C6806B25  uqdecp        z6.h, p6.h
E780AB25  uqdecp        z7.s, p7.s
0880EB25  uqdecp        z8.d, p0.d
29806925  uqincp        z9.h, p1.h
4A80A925  uqincp        z10.s, p2.s
6B80E925  uqincp        z11.d, p3.d

@amanasifkhalid
Copy link
Member Author

cc @kunalspathak @TIHan

@BruceForstall
Copy link
Member

Can you update the comment at the top? (If nothing else, the "once that work is merged in, I'll rebase these changes on top of it" comment is obsolete)

@BruceForstall
Copy link
Member

cc @a74nh

Copy link
Contributor

@a74nh a74nh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small things

@@ -5098,6 +5098,106 @@ void CodeGen::genArm64EmitterUnitTestsSve()
theEmitter->emitIns_R_R_I(INS_sve_uqrshrn, EA_SCALABLE, REG_V15, REG_V12, 1,
INS_OPTS_SCALABLE_H); // UQRSHRN <Zd>.H, {<Zn1>.S-<Zn2>.S }, #<const>

// IF_SVE_DM_2A
theEmitter->emitIns_R_R(INS_sve_decp, EA_8BYTE, REG_R0, REG_P0, INS_OPTS_SCALABLE_B); // DECP <Xdn>, <Pm>.<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with instructions elsewhere, INS_OPTS_SCALABLE_B should be INS_OPTS_SCALABLE_B_WITH_SCALAR. Same for anything else where EA_ is a value instead of EA_SCALABLE.

@@ -1259,6 +1276,8 @@ bool emitter::emitInsMayWriteToGCReg(instrDesc* id)
case IF_DV_2B: // DV_2B .Q.........iiiii ......nnnnnddddd Rd Vn[] (umov - to general)
case IF_DV_2H: // DV_2H X........X...... ......nnnnnddddd Rd Vn (fmov - to general)

case IF_SVE_DO_2A: // ........xx...... .....X.MMMMddddd -- SVE saturating inc/dec register by predicate count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why might these two groups write to GC Reg? (It's possible I've been missing some from my PRs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, didn't mean to push that in the first place. I don't think we should worry about touching emitInsMayWriteToGCReg for now. Once we start emitting these SVE instructions for actual codegen, we might need to come back to this method.

// <Xdn>, <Pm>.<T>
case IF_SVE_DM_2A: // ........xx...... .......MMMMddddd -- SVE inc/dec register by predicate count
emitDispReg(id->idReg1(), id->idOpSize(), true); // ddddd
emitDispPredicateReg(id->idReg2(), PREDICATE_NONE, false); // MMMM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once #95918 goes in then this should be
emitDispPredicateReg(id->idReg2(), PREDICATE_SIZED, false); and then don't call emitDispArrangement()

(This code can be fixed up later if it goes in before 95918)

@@ -15364,7 +15452,7 @@ void emitter::emitDispLowPredicateReg(regNumber reg, PredicateType ptype, bool a
//------------------------------------------------------------------------
// emitDispArrangement: Display a SIMD vector arrangement suffix
//
void emitter::emitDispArrangement(insOpts opt)
void emitter::emitDispArrangement(insOpts opt, bool addComma /* = false */)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any use of calling emitDispArrangement() with addComma set to true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. (But still can be removed if end up using PREDICATE_SIZED)

@amanasifkhalid
Copy link
Member Author

Thank you for the review @a74nh! I've applied your feedback, and the unit test outputs remain the same.

JitDisasm:

decp    x0, p0.b
decp    x1, p1.h
decp    x2, p2.s
decp    x3, p3.d
incp    x4, p4.b
incp    x5, p5.h
incp    x6, p6.s
incp    x7, p7.d
decp    z0.h, p0.h
decp    z1.s, p1.s
decp    z2.d, p2.d
incp    z3.h, p3.h
incp    z4.s, p4.s
incp    z5.d, p5.d
sqdecp  x0, p0.b, w0
sqdecp  x1, p1.h, w1
sqdecp  x2, p2.s, w2
sqdecp  x3, p3.d, w3
sqdecp  x4, p4.b
sqdecp  x5, p5.h
sqdecp  x6, p6.s
sqdecp  x7, p7.d
sqincp  x0, p0.h, w0
sqincp  x1, p1.s, w1
sqincp  x2, p2.b, w2
sqincp  x3, p3.d, w3
sqincp  x4, p4.b
sqincp  x5, p5.h
sqincp  x6, p6.s
sqincp  x7, p7.d
uqdecp  w0, p0.b
uqdecp  w1, p1.h
uqdecp  w2, p2.s
uqdecp  w3, p3.d
uqdecp  x4, p4.b
uqdecp  x5, p5.h
uqdecp  x6, p6.s
uqdecp  x7, p7.d
uqincp  w0, p0.b
uqincp  w1, p1.h
uqincp  w2, p2.s
uqincp  w3, p3.d
uqincp  x4, p4.b
uqincp  x5, p5.h
uqincp  x6, p6.s
uqincp  x7, p7.d
sqdecp  z0.h, p0.h
sqdecp  z1.s, p1.s
sqdecp  z2.d, p2.d
sqincp  z3.h, p3.h
sqincp  z4.s, p4.s
sqincp  z5.d, p5.d
uqdecp  z6.h, p6.h
uqdecp  z7.s, p7.s
uqdecp  z8.d, p0.d
uqincp  z9.h, p1.h
uqincp  z10.s, p2.s
uqincp  z11.d, p3.d

cstool:

00882D25  decp  x0, p0.b
21886D25  decp  x1, p1.h
4288AD25  decp  x2, p2.s
6388ED25  decp  x3, p3.d
84882C25  incp  x4, p4.b
A5886C25  incp  x5, p5.h
C688AC25  incp  x6, p6.s
E788EC25  incp  x7, p7.d
00806D25  decp  z0.h, p0.h
2180AD25  decp  z1.s, p1.s
4280ED25  decp  z2.d, p2.d
63806C25  incp  z3.h, p3.h
8480AC25  incp  z4.s, p4.s
A580EC25  incp  z5.d, p5.d
00882A25  sqdecp        x0, p0.b, w0
21886A25  sqdecp        x1, p1.h, w1
4288AA25  sqdecp        x2, p2.s, w2
6388EA25  sqdecp        x3, p3.d, w3
848C2A25  sqdecp        x4, p4.b
A58C6A25  sqdecp        x5, p5.h
C68CAA25  sqdecp        x6, p6.s
E78CEA25  sqdecp        x7, p7.d
00886825  sqincp        x0, p0.h, w0
2188A825  sqincp        x1, p1.s, w1
42882825  sqincp        x2, p2.b, w2
6388E825  sqincp        x3, p3.d, w3
848C2825  sqincp        x4, p4.b
A58C6825  sqincp        x5, p5.h
C68CA825  sqincp        x6, p6.s
E78CE825  sqincp        x7, p7.d
00882B25  uqdecp        w0, p0.b
21886B25  uqdecp        w1, p1.h
4288AB25  uqdecp        w2, p2.s
6388EB25  uqdecp        w3, p3.d
848C2B25  uqdecp        x4, p4.b
A58C6B25  uqdecp        x5, p5.h
C68CAB25  uqdecp        x6, p6.s
E78CEB25  uqdecp        x7, p7.d
00882925  uqincp        w0, p0.b
21886925  uqincp        w1, p1.h
4288A925  uqincp        w2, p2.s
6388E925  uqincp        w3, p3.d
848C2925  uqincp        x4, p4.b
A58C6925  uqincp        x5, p5.h
C68CA925  uqincp        x6, p6.s
E78CE925  uqincp        x7, p7.d
00806A25  sqdecp        z0.h, p0.h
2180AA25  sqdecp        z1.s, p1.s
4280EA25  sqdecp        z2.d, p2.d
63806825  sqincp        z3.h, p3.h
8480A825  sqincp        z4.s, p4.s
A580E825  sqincp        z5.d, p5.d
C6806B25  uqdecp        z6.h, p6.h
E780AB25  uqdecp        z7.s, p7.s
0880EB25  uqdecp        z8.d, p0.d
29806925  uqincp        z9.h, p1.h
4A80A925  uqincp        z10.s, p2.s
6B80E925  uqincp        z11.d, p3.d

code = emitInsCodeSve(ins, fmt);
code |= insEncodeReg_R_4_to_0(id->idReg1()); // ddddd
code |= insEncodeReg_P_8_to_5(id->idReg2()); // MMMM
code |= insEncodeVLSElemsize(id->idOpSize()); // X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, you could combine this with IF_SVE_DM_2A, but lets do it in cleanup phase.

case IF_SVE_DO_2A:
  code |= insEncodeVLSElemsize(id->idOpSize());                    // X
  FALLTHROUGH;

case IF_SVE_DM_2A:
  ...

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@amanasifkhalid
Copy link
Member Author

Merged changes from #95996.

@amanasifkhalid amanasifkhalid merged commit 31825ff into dotnet:main Dec 18, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants