-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD #14736
Conversation
This is still a WIP, but the first pass is done. |
src/jit/importer.cpp
Outdated
case CORINFO_INTRINSIC_Round: | ||
case CORINFO_INTRINSIC_Ceiling: | ||
case CORINFO_INTRINSIC_Floor: | ||
return JitTls::GetCompiler()->compSupports(InstructionSet_SSE41); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to test/check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's JitTls::GetCompiler
for?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be the way other methods are getting an instance of the compiler. I expect that this is not correct, which is why I explicitly called it out 😄
Round
, Ceiling
, and Floor
are only intrinsic if SSE4.1 is supported, so the check needs to be somewhere, just not sure where the best place is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that code is in a Compiler
member function, you already have an instance of the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a static member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just be able to return true
here and check in impMathIntrinsic
instead, which is an instance method... Still investigating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems like I can either:
- Return
true
inIsTargetIntrinsic
and then have a sub-check inimpMathIntrinsic
forcompSupports
-or- - Change
IsTargetIntrinsic
(and the other three methods) to be instance methods
The latter (convert to instance methods) seems like the better option since the compSupports
check won't have to be replicated everywhere required, and the IsTargetIntrinsic
method will be self-consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it makes sense for these to be instance methods.
src/jit/codegenxarch.cpp
Outdated
regNumber srcReg = srcNode->gtRegNum; | ||
|
||
// iii) tree type is floating point type | ||
assert(varTypeIsFloating(srcNode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks should be at the start of the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have it here so that they srcNode
variable, which is used elsewhere, can be reused.
The asserts can either:
- all be at the beginning, but duplicate some code (it would be
assert(varTypeIsFloating(treeNode->gtGetOp1()));
)
-or- - just after the variables are initialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess the actual problem is that ins
is computed too early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved ins
and other variables after the assert.
src/jit/codegenxarch.cpp
Outdated
|
||
GenTree* srcNode = treeNode->gtGetOp1(); | ||
|
||
instruction ins = ins_FloatRound(treeNode->TypeGet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a function it's too much ceremony for something that's simply (type == TYP_FLOAT) ? INS_roundss : INS_roundsd
. There are already too many cases where codegen uses such functions for no good reason and that results in wasted cycles and wasted time for those who read the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. The CORINFO_INTRINSIC_Sqrt function should probably be updated as well (in a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/jit/codegenxarch.cpp
Outdated
assert(varTypeIsFloating(srcNode)); | ||
assert(srcNode->TypeGet() == treeNode->TypeGet()); | ||
|
||
genConsumeOperands(treeNode->AsOp()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The treeNode
parameter should be GenTreeOp*
to begin with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. The CORINFO_INTRINSIC_Sqrt
function should probably be updated as well (in a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems like most functions take a GenTreePtr
and cast to GenTreeOp
via ->AsOp()
only when required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is looking in the codegen layer, and at similar things like genSSE2BitwiseOp
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems like most functions take a GenTreePtr and cast to GenTreeOp via ->AsOp() only when required.
Yes, that's unfortunately the case in existing code. Code that has been modified or added recently tends not to use this "pattern".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/jit/codegenxarch.cpp
Outdated
switch (treeNode->gtIntrinsic.gtIntrinsicId) | ||
{ | ||
case CORINFO_INTRINSIC_Round: | ||
inst_RV_RV_IV(ins, size, dstReg, srcReg, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be better to sink the inst_RV_RV_IV
call out of the switch and use the switch only to select the rounding mode value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought about doing that as well. Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
I don't see any changes in lsraxarch.cpp, you'll probably need to adjust |
src/jit/emitxarch.cpp
Outdated
@@ -95,7 +95,8 @@ bool emitter::IsDstDstSrcAVXInstruction(instruction ins) | |||
ins == INS_pmaxuw || ins == INS_pmaxud || ins == INS_vinserti128 || ins == INS_punpckhbw || | |||
ins == INS_punpcklbw || ins == INS_punpckhqdq || ins == INS_punpcklqdq || ins == INS_punpckhwd || | |||
ins == INS_punpcklwd || ins == INS_punpckhdq || ins == INS_packssdw || ins == INS_packsswb || | |||
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128); | |||
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128 || ins == INS_roundps || | |||
ins == INS_roundss || ins == INS_roundpd || ins == INS_roundsd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roundss
androundsd
should beDstSrcSrc
instructions to avoid the similar register dependency issue to [RyuJIT] Fix bad VEX encoding to avoid false register dependency #14225.roundps
androundpd
are two-register instructions in AVX, so you should remove here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roundps and roundpd are two-register instructions in AVX, so you should remove here.
What do you mean? The manual indicates they are 3 operand RMI
instructions:
- ROUNDPD xmm1, xmm2/m128, imm8
- VROUNDPD xmm1, xmm2/m128, imm8
- VROUNDPD ymm1, ymm2/m256, imm8
- ROUNDPS xmm1, xmm2/m128, imm8
- VROUNDPS xmm1, xmm2/m128, imm8
- VROUNDPS ymm1, ymm2/m256, imm8
This is slightly different from teh scalar versions which are 3 operand RMI
for SSE41 and 4 operand RVMI
for AVX:
- ROUNDSD xmm1, xmm2/m64, imm8
- VROUNDSD xmm1, xmm2, xmm3/m64, imm8
- ROUNDSS xmm1, xmm2/m32, imm8
- VROUNDSS xmm1, xmm2, xmm3/m32, imm8
roundss and roundsd should be DstSrcSrc instructions to avoid the similar register dependency issue to #14225.
Isn't DPPS
and DPPD
also in that category? They are RIM
instructions: DPPS xmm1, xmm2/m128, imm8
and DPPD xmm1, xmm2/m128, imm8
.
It seems like there are some inefficiencies in the INST table in general, such as it not having entries for RMI
or RVMI
encoded instructions. This leads to these instructions being listed as RM
encoded, and having special checks/handling elsewhere (as per the comment on this method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two functions IsDstDstSrcAVXInstruction
and IsDstSrcSrcAVXInstruction
are only used to automatically convert SSE instructions to the corresponding VEX.NDS instructions that use VEX.vvvv
to encode the additional register. For example, roundss xmm0, xmm1, imm8
-> vroundss xmm0, xmm1, xmm2, imm8
where xmm1 in VEX.NDS version is encoded by VEX.vvvv
. But roundps/d
has the same registers as its VEX.NDS version.
Isn't DPPS and DPPD also in that category? They are RIM instructions: DPPS xmm1, xmm2/m128, imm8 and DPPD xmm1, xmm2/m128, imm8.
No, this issue is just from scalar instructions because certain scalar floating-point instructions need an additional register to specify the upper section of xmm.
Round the low packed single precision floating-point value in xmm3/m32 and place the result in xmm1. The rounding mode is determined by imm8. Also, upper packed single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
It seems like there are some inefficiencies in the INST table in general
Yes, so we are considering to refactor it #14523
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cc @dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me overall; the failures seem to be due to some missing modifications for the static/instance change, and some issue with the case statement in TreeNodeInfoInitIntrinsic()
- it's not obvious to me why/how the assert for the default case is getting hit.
src/jit/lsraxarch.cpp
Outdated
@@ -1978,13 +1978,15 @@ void LinearScan::TreeNodeInfoInitIntrinsic(GenTree* tree) | |||
} | |||
break; | |||
|
|||
#ifdef _TARGET_X86_ | |||
#if defined(_TARGET_AMD64_) || defined(_TARGET_X86_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed here; this file is only for these two targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/jit/lsraxarch.cpp
Outdated
case CORINFO_INTRINSIC_Cos: | ||
case CORINFO_INTRINSIC_Sin: | ||
case CORINFO_INTRINSIC_Round: | ||
#if defined(LEGACY_BACKEND) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also not needed, since NYI_X86
is a no-op for non-X86 targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to be a no-op for AMD64, but not for X86
, so I think I still need this particular #ifdef
: https://github.com/dotnet/coreclr/blob/master/src/jit/error.h#L180
The round intrinsic is a little special right now and I probably missed a case somewhere. I am investigating. For reference, |
src/jit/lsraxarch.cpp
Outdated
@@ -1978,13 +1978,15 @@ void LinearScan::TreeNodeInfoInitIntrinsic(GenTree* tree) | |||
} | |||
break; | |||
|
|||
#ifdef _TARGET_X86_ | |||
#if defined(_TARGET_AMD64_) || defined(_TARGET_X86_) | |||
case CORINFO_INTRINSIC_Cos: | |||
case CORINFO_INTRINSIC_Sin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found the issue. I forgot to add cases for Ceiling
and Floor
here.
Cos
and Sin
should still be under _TARGET_X86_
and calling NYI_X86
, as before.
While Round
, Ceiling
, and Floor
should exist on all targets (AMD64 and X86) and should only call NYI_X86 on the legacy backend.
Tests are failing because I'm not handling the case where This appears to be handled in various different ways throughout the system, depending on the exact instruction being generated (a large number of instructions seem to go through What is the appropriate way to handle it here? I'd like to avoid duplicating code if possible, but may just be missing the appropriate helper function to use. |
For now the simplest solution is to revert the changes in lowerxarch so that the round operand does not end up in memory. It's not ideal but these intrinsics aren't used often enough for this to have any meaningful code size impact. Normally you'd use |
I've removed the handling of I plan on taking a closer look at this sometime this weekend to see if I can do this properly. |
Updated |
src/jit/codegenxarch.cpp
Outdated
} | ||
break; | ||
|
||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emitInsBinary
handles address mode
instructions here. I'm not sure if I need to cover that for roundsd
, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need to. Try a Math.Round(a[i])
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, will test locally and update if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated so that address mode instructions work -- this was actually a bit more 'involved' then I initially though, as emitOutputAM
didn't have Is4ByteAVXInstruction
support like emitOutputCV
and emitOutputSV
.
@@ -122,6 +122,7 @@ IF_DEF(MRD_OFF, IS_GM_RD, DSP) // offset mem | |||
IF_DEF(RRD_MRD, IS_GM_RD|IS_R1_RD, DSP) // read reg , read [mem] | |||
IF_DEF(RWR_MRD, IS_GM_RD|IS_R1_WR, DSP) // write reg , read [mem] | |||
IF_DEF(RRW_MRD, IS_GM_RD|IS_R1_RW, DSP) // r/w reg , read [mem] | |||
IF_DEF(RRW_MRD_CNS, IS_GM_RD|IS_R1_RW, DSP_CNS) // r/w reg , read [mem], const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The R_R_I
instructions define and use RRW_RRW_CNS
and I mirrored that here (but for mem and stk).
I actually expected the first register to be RWR
(write, rather than read/write). Does anyone know why it was set as read/write? (maybe to do with AVX support...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me either why this is this way.
src/jit/codegenxarch.cpp
Outdated
break; | ||
} | ||
|
||
if (srcNode->isContained()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a side note. I mostly modeled this after the code paths in emitInsBinary
and noticed that the current code (in emitInsBinary
) ends up duplicating several checks multiple times. For example, it calls isContainedIndir
, then isContainedIntFltOrDblImmed
, then isLclFldUsedFromMemory
, all of which repeat the isContained
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to eliminate those duplicate checks here and would imagine emitInsBinary
would benefit from similar "cleanup"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emitInsBinary
is a bloody mess. And it doesn't belong in the emitter, it should be in CodeGen. I actually refactored much of that stuff a while ago but then I found more interesting things to do and didn't finish it.
Independent of this change, it looks like named intrinsics are only working for @AndyAyersMS, do you have any idea here? Does the |
The VM only looks for named intrinsics in the system assembly. |
Looks to be not a difference in whether or not Those projects appear to all be |
That would make sense, these intrinsics are optional expands. The math part of System.Runtime.Extensions is just a forwarder to corelib so the jit would not behave differently based on the assembly name the app has for the method. |
I believe I have handled all the codegen scenarios now. The biggest change was in |
We already have a number of tests which validate that Do we have any existing mechanism for validating codegen? That is, is there a good way to validate that the intrinsic is actually hit and |
You can look at the SIMD tests, e.g. https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/SIMD/VectorAdd.cs and https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/SIMD/VectorUtil.cs. It uses |
Thanks @CarolEidt. Unfortunately, those are partially dependent on the I opted to just add some new tests that explicitly validate different access types work correctly (constants, static and instance fields, locals, arrays, and accessing them via unsafe code). |
@dotnet-bot test Windows_NT x64 perf |
The initial benchmark results show clear wins across the board, with a max improvement of 23.1x in x86 and 8.8x in x64 Windows x86
Windows x64
Ubuntu 16.04 x64
|
This should be ready for final review, as far as I am aware. |
cc @dotnet/jit-contrib This should be ready for review. |
@CarolEidt, @AndyAyersMS. Could you give another review pass at this. The changes since last time include:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one optional suggestion (happy to defer to future refactoring/cleanup)
@@ -122,6 +122,7 @@ IF_DEF(MRD_OFF, IS_GM_RD, DSP) // offset mem | |||
IF_DEF(RRD_MRD, IS_GM_RD|IS_R1_RD, DSP) // read reg , read [mem] | |||
IF_DEF(RWR_MRD, IS_GM_RD|IS_R1_WR, DSP) // write reg , read [mem] | |||
IF_DEF(RRW_MRD, IS_GM_RD|IS_R1_RW, DSP) // r/w reg , read [mem] | |||
IF_DEF(RRW_MRD_CNS, IS_GM_RD|IS_R1_RW, DSP_CNS) // r/w reg , read [mem], const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me either why this is this way.
void CodeGen::genSSE41RoundOp(GenTreeOp* treeNode) | ||
{ | ||
// i) SSE4.1 is supported by the underlying hardware | ||
assert(compiler->compSupports(InstructionSet_SSE41)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like the way the asserts are documented here.
dst += emitOutputWord(dst, code | 0x4500); | ||
dst += emitOutputByte(dst, dsp); | ||
// Does the offset fit in a byte? | ||
if (dspInByte) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sequence is duplicated enough that it would be beneficial to pull it out into a method, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This resolves https://github.com/dotnet/coreclr/issues/14134