Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD #14736

Merged
merged 2 commits into from
Jan 17, 2018
Merged

Conversation

tannergooding
Copy link
Member

@tannergooding
Copy link
Member Author

This is still a WIP, but the first pass is done.

case CORINFO_INTRINSIC_Round:
case CORINFO_INTRINSIC_Ceiling:
case CORINFO_INTRINSIC_Floor:
return JitTls::GetCompiler()->compSupports(InstructionSet_SSE41);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to test/check this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's JitTls::GetCompiler for?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be the way other methods are getting an instance of the compiler. I expect that this is not correct, which is why I explicitly called it out 😄

Round, Ceiling, and Floor are only intrinsic if SSE4.1 is supported, so the check needs to be somewhere, just not sure where the best place is.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that code is in a Compiler member function, you already have an instance of the compiler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a static member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just be able to return true here and check in impMathIntrinsic instead, which is an instance method... Still investigating.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems like I can either:

  • Return true in IsTargetIntrinsic and then have a sub-check in impMathIntrinsic for compSupports
    -or-
  • Change IsTargetIntrinsic (and the other three methods) to be instance methods

The latter (convert to instance methods) seems like the better option since the compSupports check won't have to be replicated everywhere required, and the IsTargetIntrinsic method will be self-consistent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it makes sense for these to be instance methods.

regNumber srcReg = srcNode->gtRegNum;

// iii) tree type is floating point type
assert(varTypeIsFloating(srcNode));
Copy link

@mikedn mikedn Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks should be at the start of the function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have it here so that they srcNode variable, which is used elsewhere, can be reused.

The asserts can either:

  • all be at the beginning, but duplicate some code (it would be assert(varTypeIsFloating(treeNode->gtGetOp1()));)
    -or-
  • just after the variables are initialized

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the actual problem is that ins is computed too early.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved ins and other variables after the assert.


GenTree* srcNode = treeNode->gtGetOp1();

instruction ins = ins_FloatRound(treeNode->TypeGet());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a function it's too much ceremony for something that's simply (type == TYP_FLOAT) ? INS_roundss : INS_roundsd. There are already too many cases where codegen uses such functions for no good reason and that results in wasted cycles and wasted time for those who read the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. The CORINFO_INTRINSIC_Sqrt function should probably be updated as well (in a separate PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

assert(varTypeIsFloating(srcNode));
assert(srcNode->TypeGet() == treeNode->TypeGet());

genConsumeOperands(treeNode->AsOp());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The treeNode parameter should be GenTreeOp* to begin with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. The CORINFO_INTRINSIC_Sqrt function should probably be updated as well (in a separate PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems like most functions take a GenTreePtr and cast to GenTreeOp via ->AsOp() only when required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is looking in the codegen layer, and at similar things like genSSE2BitwiseOp)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems like most functions take a GenTreePtr and cast to GenTreeOp via ->AsOp() only when required.

Yes, that's unfortunately the case in existing code. Code that has been modified or added recently tends not to use this "pattern".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

switch (treeNode->gtIntrinsic.gtIntrinsicId)
{
case CORINFO_INTRINSIC_Round:
inst_RV_RV_IV(ins, size, dstReg, srcReg, 4);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be better to sink the inst_RV_RV_IV call out of the switch and use the switch only to select the rounding mode value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about doing that as well. Will fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@mikedn
Copy link

mikedn commented Oct 28, 2017

I don't see any changes in lsraxarch.cpp, you'll probably need to adjust TreeNodeInfoInitIntrinsic. And maybe ContainCheckIntrinsic in lowerxarch.cpp as well.

@@ -95,7 +95,8 @@ bool emitter::IsDstDstSrcAVXInstruction(instruction ins)
ins == INS_pmaxuw || ins == INS_pmaxud || ins == INS_vinserti128 || ins == INS_punpckhbw ||
ins == INS_punpcklbw || ins == INS_punpckhqdq || ins == INS_punpcklqdq || ins == INS_punpckhwd ||
ins == INS_punpcklwd || ins == INS_punpckhdq || ins == INS_packssdw || ins == INS_packsswb ||
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128);
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128 || ins == INS_roundps ||
ins == INS_roundss || ins == INS_roundpd || ins == INS_roundsd);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roundps and roundpd are two-register instructions in AVX, so you should remove here.

What do you mean? The manual indicates they are 3 operand RMI instructions:

  • ROUNDPD xmm1, xmm2/m128, imm8
  • VROUNDPD xmm1, xmm2/m128, imm8
  • VROUNDPD ymm1, ymm2/m256, imm8
  • ROUNDPS xmm1, xmm2/m128, imm8
  • VROUNDPS xmm1, xmm2/m128, imm8
  • VROUNDPS ymm1, ymm2/m256, imm8

This is slightly different from teh scalar versions which are 3 operand RMI for SSE41 and 4 operand RVMI for AVX:

  • ROUNDSD xmm1, xmm2/m64, imm8
  • VROUNDSD xmm1, xmm2, xmm3/m64, imm8
  • ROUNDSS xmm1, xmm2/m32, imm8
  • VROUNDSS xmm1, xmm2, xmm3/m32, imm8

roundss and roundsd should be DstSrcSrc instructions to avoid the similar register dependency issue to #14225.

Isn't DPPS and DPPD also in that category? They are RIM instructions: DPPS xmm1, xmm2/m128, imm8 and DPPD xmm1, xmm2/m128, imm8.

It seems like there are some inefficiencies in the INST table in general, such as it not having entries for RMI or RVMI encoded instructions. This leads to these instructions being listed as RM encoded, and having special checks/handling elsewhere (as per the comment on this method)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two functions IsDstDstSrcAVXInstruction and IsDstSrcSrcAVXInstruction are only used to automatically convert SSE instructions to the corresponding VEX.NDS instructions that use VEX.vvvv to encode the additional register. For example, roundss xmm0, xmm1, imm8 -> vroundss xmm0, xmm1, xmm2, imm8 where xmm1 in VEX.NDS version is encoded by VEX.vvvv. But roundps/d has the same registers as its VEX.NDS version.

Isn't DPPS and DPPD also in that category? They are RIM instructions: DPPS xmm1, xmm2/m128, imm8 and DPPD xmm1, xmm2/m128, imm8.

No, this issue is just from scalar instructions because certain scalar floating-point instructions need an additional register to specify the upper section of xmm.

Round the low packed single precision floating-point value in xmm3/m32 and place the result in xmm1. The rounding mode is determined by imm8. Also, upper packed single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].

It seems like there are some inefficiencies in the INST table in general

Yes, so we are considering to refactor it #14523

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@BruceForstall
Copy link
Member

cc @dotnet/jit-contrib

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me overall; the failures seem to be due to some missing modifications for the static/instance change, and some issue with the case statement in TreeNodeInfoInitIntrinsic() - it's not obvious to me why/how the assert for the default case is getting hit.

@@ -1978,13 +1978,15 @@ void LinearScan::TreeNodeInfoInitIntrinsic(GenTree* tree)
}
break;

#ifdef _TARGET_X86_
#if defined(_TARGET_AMD64_) || defined(_TARGET_X86_)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed here; this file is only for these two targets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

case CORINFO_INTRINSIC_Cos:
case CORINFO_INTRINSIC_Sin:
case CORINFO_INTRINSIC_Round:
#if defined(LEGACY_BACKEND)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not needed, since NYI_X86 is a no-op for non-X86 targets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to be a no-op for AMD64, but not for X86, so I think I still need this particular #ifdef: https://github.com/dotnet/coreclr/blob/master/src/jit/error.h#L180

@tannergooding
Copy link
Member Author

it's not obvious to me why/how the assert for the default case is getting hit.

The round intrinsic is a little special right now and I probably missed a case somewhere. I am investigating.

For reference, Math.Round was moved to be a new style intrinsic in this PR: #14119. However, it was moved by mapping the new intrinsic name to the old CORINFO_INTRINSIC_ enum value (https://github.com/dotnet/coreclr/pull/14119/files#diff-edc46b80f57431489a82311948a8234dR3771).

@@ -1978,13 +1978,15 @@ void LinearScan::TreeNodeInfoInitIntrinsic(GenTree* tree)
}
break;

#ifdef _TARGET_X86_
#if defined(_TARGET_AMD64_) || defined(_TARGET_X86_)
case CORINFO_INTRINSIC_Cos:
case CORINFO_INTRINSIC_Sin:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the issue. I forgot to add cases for Ceiling and Floor here.

Cos and Sin should still be under _TARGET_X86_ and calling NYI_X86, as before.

While Round, Ceiling, and Floor should exist on all targets (AMD64 and X86) and should only call NYI_X86 on the legacy backend.

@tannergooding
Copy link
Member Author

Tests are failing because I'm not handling the case where srcNode->isUsedFromMemory() == true.

This appears to be handled in various different ways throughout the system, depending on the exact instruction being generated (a large number of instructions seem to go through emitInsBinary).

What is the appropriate way to handle it here? I'd like to avoid duplicating code if possible, but may just be missing the appropriate helper function to use.

@mikedn
Copy link

mikedn commented Nov 7, 2017

What is the appropriate way to handle it here? I'd like to avoid duplicating code if possible, but may just be missing the appropriate helper function to use.

For now the simplest solution is to revert the changes in lowerxarch so that the round operand does not end up in memory. It's not ideal but these intrinsics aren't used often enough for this to have any meaningful code size impact.

Normally you'd use emitInsBinary but that can't handle the extra immediate operand. I was looking recently at refactoring that, maybe I can come up with something in the future. But it will take some time...

@tannergooding
Copy link
Member Author

I've removed the handling of _Round, _Ceiling, and _Floor intrinsics from lowerxarch for the time being.

I plan on taking a closer look at this sometime this weekend to see if I can do this properly.

@tannergooding
Copy link
Member Author

Updated genSSE41RoundOp to handle contained nodes. Codegen looks to be correct, and tests are passing.

}
break;

default:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emitInsBinary handles address mode instructions here. I'm not sure if I need to cover that for roundsd, however.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to. Try a Math.Round(a[i]) or something like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, will test locally and update if needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated so that address mode instructions work -- this was actually a bit more 'involved' then I initially though, as emitOutputAM didn't have Is4ByteAVXInstruction support like emitOutputCV and emitOutputSV.

@@ -122,6 +122,7 @@ IF_DEF(MRD_OFF, IS_GM_RD, DSP) // offset mem
IF_DEF(RRD_MRD, IS_GM_RD|IS_R1_RD, DSP) // read reg , read [mem]
IF_DEF(RWR_MRD, IS_GM_RD|IS_R1_WR, DSP) // write reg , read [mem]
IF_DEF(RRW_MRD, IS_GM_RD|IS_R1_RW, DSP) // r/w reg , read [mem]
IF_DEF(RRW_MRD_CNS, IS_GM_RD|IS_R1_RW, DSP_CNS) // r/w reg , read [mem], const
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The R_R_I instructions define and use RRW_RRW_CNS and I mirrored that here (but for mem and stk).

I actually expected the first register to be RWR (write, rather than read/write). Does anyone know why it was set as read/write? (maybe to do with AVX support...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me either why this is this way.

break;
}

if (srcNode->isContained())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a side note. I mostly modeled this after the code paths in emitInsBinary and noticed that the current code (in emitInsBinary) ends up duplicating several checks multiple times. For example, it calls isContainedIndir, then isContainedIntFltOrDblImmed, then isLclFldUsedFromMemory, all of which repeat the isContained check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to eliminate those duplicate checks here and would imagine emitInsBinary would benefit from similar "cleanup"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emitInsBinary is a bloody mess. And it doesn't belong in the emitter, it should be in CodeGen. I actually refactored much of that stuff a while ago but then I found more interesting things to do and didn't finish it.

@tannergooding
Copy link
Member Author

tannergooding commented Nov 10, 2017

Independent of this change, it looks like named intrinsics are only working for [mscorlib]System.Math.Round.... If you are using [System.Runtime.Extensions]System.Math.Round, it isn't picked up.

@AndyAyersMS, do you have any idea here? Does the Intrinsic attribute need to be specified on the reference assembly as well?

@AndyAyersMS
Copy link
Member

The VM only looks for named intrinsics in the system assembly.

@tannergooding
Copy link
Member Author

Looks to be not a difference in whether or not [mscorlib] or [System.Runtime.Extensions] is used and is instead because the assemblies I checked that were using [mscorlib] were also being compiled without the DebuggableAttribute.DebuggingModes.DisableOptimizations flag, so opts.compDbgCode was returning false rather than true.

Those projects appear to all be ilproj (although I didn't check them all).

@AndyAyersMS
Copy link
Member

That would make sense, these intrinsics are optional expands.

The math part of System.Runtime.Extensions is just a forwarder to corelib so the jit would not behave differently based on the assembly name the app has for the method.

@tannergooding tannergooding changed the title [WIP] Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD Nov 11, 2017
@tannergooding
Copy link
Member Author

I believe I have handled all the codegen scenarios now. The biggest change was in emitOutputAM which didn't yet support 4 byte AVX instructions.

@tannergooding
Copy link
Member Author

We already have a number of tests which validate that Math.Round produces the correct result (although I can add more, and explicitly test Math.Round on arrays, fields, locals, etc, if required).

Do we have any existing mechanism for validating codegen? That is, is there a good way to validate that the intrinsic is actually hit and roundsd/rounds is emitted on hardware with SSE4.1 support (other than manual validation, which I have already done).

@CarolEidt
Copy link

You can look at the SIMD tests, e.g. https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/SIMD/VectorAdd.cs and https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/SIMD/VectorUtil.cs. It uses COMPlus_JitFuncInfoLogFile. It doesn't actually validate what is generated, but it validates that the intrinsic method isn't jitted.

@tannergooding
Copy link
Member Author

Thanks @CarolEidt.

Unfortunately, those are partially dependent on the IsHardwareAccelerated property to determine whether or not the intrinsic path should be taken. We don't have a corresponding property for SSE4.1 (at least not one that is fully hooked up today).

I opted to just add some new tests that explicitly validate different access types work correctly (constants, static and instance fields, locals, arrays, and accessing them via unsafe code).

@tannergooding
Copy link
Member Author

@dotnet-bot test Windows_NT x64 perf
@dotnet-bot test Windows_NT x86 perf
@dotnet-bot test linux perf flow

@tannergooding
Copy link
Member Author

tannergooding commented Dec 6, 2017

The initial benchmark results show clear wins across the board, with a max improvement of 23.1x in x86 and 8.8x in x64

Windows x86

Function Baseline PR Improvement
Math.Ceiling 2544.91 445.94 5.707x
Math.Floor 1803.65 445.97 4.044x
Math.Round 4052.71 195.10 20.772x
MathF.Ceiling 3719.83 445.88 8.343x
MathF.Floor 3670.35 445.94 8.231x
MathF.Round 4514.50 195.08 23.142x

Windows x64

Function Baseline PR Improvement
Math.Ceiling 781.95 445.89 1.754x
Math.Floor 893.39 445.89 2.004x
Math.Round 1367.02 195.08 7.008x
MathF.Ceiling 782.07 445.91 1.754x
MathF.Floor 781.70 445.90 1.753x
MathF.Round 1268.49 195.08 6.503x

Ubuntu 16.04 x64

Function Baseline PR Improvement
Math.Ceiling 1266.09 446.20 2.837x
Math.Floor 1228.06 446.29 2.752x
Math.Round 1553.90 195.20 7.961x
MathF.Ceiling 1261.85 446.16 2.828x
MathF.Floor 1138.74 446.82 2.549x
MathF.Round 1718.49 195.19 8.804x

@tannergooding
Copy link
Member Author

This should be ready for final review, as far as I am aware.

@tannergooding
Copy link
Member Author

cc @dotnet/jit-contrib

This should be ready for review.

@tannergooding
Copy link
Member Author

Fixed the containment checks based on my findings from #15836 and #15804

@tannergooding
Copy link
Member Author

@CarolEidt, @AndyAyersMS. Could you give another review pass at this.

The changes since last time include:

  • Adding additional tests to cover the containment checks
  • Updating the containment checks to match the expected behavior

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one optional suggestion (happy to defer to future refactoring/cleanup)

@@ -122,6 +122,7 @@ IF_DEF(MRD_OFF, IS_GM_RD, DSP) // offset mem
IF_DEF(RRD_MRD, IS_GM_RD|IS_R1_RD, DSP) // read reg , read [mem]
IF_DEF(RWR_MRD, IS_GM_RD|IS_R1_WR, DSP) // write reg , read [mem]
IF_DEF(RRW_MRD, IS_GM_RD|IS_R1_RW, DSP) // r/w reg , read [mem]
IF_DEF(RRW_MRD_CNS, IS_GM_RD|IS_R1_RW, DSP_CNS) // r/w reg , read [mem], const

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me either why this is this way.

void CodeGen::genSSE41RoundOp(GenTreeOp* treeNode)
{
// i) SSE4.1 is supported by the underlying hardware
assert(compiler->compSupports(InstructionSet_SSE41));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like the way the asserts are documented here.

dst += emitOutputWord(dst, code | 0x4500);
dst += emitOutputByte(dst, dsp);
// Does the offset fit in a byte?
if (dspInByte)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sequence is duplicated enough that it would be beneficial to pull it out into a method, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RyuJIT] Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD
8 participants