
Consider mitigating the JCC erratum in the JIT (Nov. 2019) #13795

Closed
damageboy opened this issue Nov 14, 2019 · 15 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

@damageboy (Contributor)

What

In continuation of #13794, it might be worthwhile to consider applying the fixes suggested by Intel in their erratum to the code generated by the JIT:
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

It is hard for me, personally, to evaluate the impact on the performance of code generated by the JIT, but my own small, anecdotal experience shows anywhere from a 1-2% degradation for typical code to a potential 20% degradation in very specific instances.

This is in accordance with Intel's own guidance (Section 2.2):

Intel has observed performance effects associated with the workaround ranging from 0-4% on many industry-standard benchmarks. In subcomponents of these benchmarks, Intel has observed outliers higher than the 0-4% range.

Affected opcodes

The update affects all forms of jumps, direct and indirect, when the instruction either ends on or crosses a 32-byte boundary (Section 2.1):

The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macro-fused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.
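Expressed as code, the boundary condition quoted above could be checked roughly like this (a minimal sketch of my own; the function name and interface are hypothetical, not anything from CoreCLR):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when an instruction occupying [start, start + len) is affected by
 * the erratum: it either crosses a 32-byte boundary, or its last byte is
 * the byte immediately before one (i.e. it "ends on" the boundary). */
static bool hits_jcc_erratum(uint64_t start, unsigned len)
{
    uint64_t end = start + len;              /* first byte after the instruction */
    return (start / 32) != ((end - 1) / 32)  /* crosses a boundary */
        || (end % 32) == 0;                  /* ends on a boundary */
}
```

For example, a 2-byte Jcc starting at offset 31 crosses the boundary at 32, while one starting at offset 30 ends exactly on it; both placements are penalized, while the same instruction at offset 0 is fine.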


Proposed mitigation for code-generation

The fix proposed by Intel is to pad the affected opcodes:

  • jcc
  • fused
  • jmp
  • call
  • ret

with the benign 0x2e prefix (end of Section 2.4):

The advice to software developers is to align the jae instruction so that it does not
cross a 32-byte boundary. In the example, this is done by adding the benign prefix
0x2e four times before the first push %rbp instruction so that the cmp instruction,
which started at offset 1c, will instead start at offset 20. Hence the macro-fused cmp+jae instruction will not cross a 32-byte boundary.
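The quoted advice amounts to computing, for a given jump (or macro-fused pair), the smallest forward shift that places it entirely inside one 32-byte window, then realizing that shift with benign 0x2e (CS segment override) prefixes on an earlier instruction. A sketch of that computation, with a hypothetical name and interface of my own:

```c
/* Smallest number of bytes to shift a jump at [off, off + len) forward so
 * that it neither crosses nor ends on a 32-byte boundary. In Intel's
 * example the shift is realized by prepending benign 0x2e prefixes to an
 * earlier instruction rather than by inserting standalone padding. */
static unsigned jcc_padding_bytes(unsigned off, unsigned len)
{
    for (unsigned pad = 0; pad < 32; pad++) {
        unsigned s = off + pad, e = s + len;
        if ((s / 32) == ((e - 1) / 32) && (e % 32) != 0)
            return pad;   /* whole jump now inside one 32-byte window */
    }
    return 0;             /* unreachable for any real encoding (len <= 31) */
}
```

For instance, a hypothetical 6-byte macro-fused pair at offset 28 would cross the boundary at 32 and needs a 4-byte shift to start at 32, mirroring the cmp-at-0x1c-to-0x20 move in Intel's example.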

Affected processors

Section 4.0 specifically lists no fewer than 40 Intel CPUs, covering the breadth of their product line, that are affected by this update:

| Family | Stepping | Series |
|--------|----------|--------|
| 06_8EH | 9 | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Amber Lake Y |
| 06_8EH | C | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Amber Lake Y |
| 06_55 | 7 | 2nd Generation Intel® Xeon® Scalable Processors based on microarchitecture code name Cascade Lake (server) |
| 06_9EH | A | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake H |
| 06_9EH | A | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S |
| 06_8EH | A | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake U43e |
| 06_9EH | B | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (47) |
| 06_9EH | B | Intel® Celeron® Processor G Series based on microarchitecture code name Coffee Lake S (48) |
| 06_9EH | A | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (69) x/KBP |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (610) |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (611) |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (612) |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (413) |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (414) |
| 06_9EH | A | Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (415) |
| 06_9EH | D | 9th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake H (816) |
| 06_9EH | D | 9th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (817) |
| 06_8EH | C | 10th Generation Intel® Core™ Processor Family based on microarchitecture code name Comet Lake U42 |
| 06_A6H | 0 | 10th Generation Intel® Core™ Processor Family based on microarchitecture code name Comet Lake U62 |
| 06_9EH | 9 | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake G |
| 06_9EH | 9 | 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake H |
| 06_AEH | A | 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake Refresh U (422) |
| 06_9EH | 9 | 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake S |
| 06_8EH | 9 | 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake U |
| 06_8EH | 9 | 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake U23e |
| 06_9EH | 9 | Intel® Core™ X-series Processors based on microarchitecture code name Kaby Lake X |
| 06_9EH | 9 | Intel® Xeon® Processor E3 v6 Family Kaby Lake Xeon E3 |
| 06_8EH | 9 | 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake Y |
| 06_55H | 4 | Intel® Xeon® Processor D Family based on microarchitecture code name Skylake D Bakerville |
| 06_5E | 3 | 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake H |
| 06_5E | 3 | 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake S |
| 06_55H | 4 | Intel® Xeon® Scalable Processors based on microarchitecture code name Skylake Server |
| 06_4E | 3 | 6th Generation Intel® Core™ Processors based on microarchitecture code name Skylake U |
| 06_4E | 3 | 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake U23e |
| 06_55H | 4 | Intel® Xeon® Processor W Family based on microarchitecture code name Skylake W |
| 06_55H | 4 | Intel® Core™ X-series Processors based on microarchitecture code name Skylake X |
| 06_55H | 4 | Intel® Xeon® Processor E3 v5 Family based on microarchitecture code name Skylake Xeon E3 |
| 06_4E | 3 | 6th Generation Intel® Core™ Processors based on microarchitecture code name Skylake Y |
| 06_8EH | B | 8th Generation Intel® Core™ Processors based on microarchitecture code name Whiskey Lake U |
| 06_8EH | C | 8th Generation Intel® Core™ Processors based on microarchitecture code name Whiskey Lake U |

Proposed fix to the JIT

I personally hold the position that this isn't a small issue (even if we take Intel's claim of 0-4% at face value) and that it is not going away, either in terms of performance impact or in terms of the breadth of the product line affected.
I think the desired behavior, from the standpoint of the JIT, would be to detect whether this microcode update is applied, or whether the CPU model is from an affected family, and generate code that does not trigger the degradation.
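As a rough sketch of the detection half of that idea: the affected parts in the table above all report CPUID display-family 6 with one of a handful of display-model values (0x4E, 0x55, 0x5E, 0x8E, 0x9E, 0xA6). Decoding CPUID leaf 1 EAX and matching against that list might look like this (my own sketch, not CoreCLR code; real detection would also need to check the stepping and whether the microcode update is actually loaded):

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode CPUID leaf 1 EAX into display family/model and match against the
 * 06_xx families listed in the erratum. Sketch only: stepping and the
 * presence of the microcode update are deliberately ignored here. */
static bool jcc_affected_model(uint32_t cpuid_eax)
{
    uint32_t family = (cpuid_eax >> 8) & 0xF;
    /* For family 6, display model = (extended_model << 4) + model. */
    uint32_t model = (((cpuid_eax >> 16) & 0xF) << 4) | ((cpuid_eax >> 4) & 0xF);

    if (family != 6)
        return false;
    switch (model) {
    case 0x4E: case 0x55: case 0x5E:   /* Skylake client/server */
    case 0x8E: case 0x9E:              /* Kaby/Coffee/Whiskey/Amber Lake */
    case 0xA6:                         /* Comet Lake */
        return true;
    default:
        return false;
    }
}
```

For example, a Skylake-S part reporting EAX = 0x000506E3 (family 6, model 0x5E) matches, while a Haswell part reporting 0x000306C3 (model 0x3C) does not.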

On the face of it, while AMD processors are completely unaffected by this erratum, they also seem not to suffer adverse effects when running code generated by patched-up assemblers:

https://www.phoronix.com/scan.php?page=news_item&px=AMD-With-Intel-JCC-Assembler

category:cq
theme:alignment
skill-level:intermediate
cost:medium

@BruceForstall (Member)

cc @AndyAyersMS

@AndyAyersMS (Member)

Would be helpful to collect more data on the observed perf impact, so we can decide how to prioritize this.

Perhaps extra motivation to 32-byte align the code buffer (see #9912 and #8108) for x86/x64; without doing that, we can't possibly fix this. Could possibly do it just for Tier1 methods to try and minimize the fragmentation.

Implementing a fix for this in the jit may be non-trivial. Adding nops to fix one branch alignment may alter span-dependent instruction lengths and so push other branches into bad alignments. Handling macro fused ops adds a bit of complexity too. Probably needs to be handled as part of branch tensioning.
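The knock-on effect described above can be illustrated with a toy example (offsets are made up; the predicate below just restates the erratum condition): padding one branch shifts every later offset, which can push a previously fine branch into a bad position, so the fix has to iterate to a fixed point.

```c
#include <stdbool.h>

/* The erratum condition: the instruction at [start, start + len) crosses
 * or ends on a 32-byte boundary. */
static bool bad_placement(unsigned start, unsigned len)
{
    unsigned end = start + len;
    return (start / 32) != ((end - 1) / 32) || (end % 32) == 0;
}
```

Take two 2-byte branches at offsets 30 and 60: the first ends on the boundary at 32 (bad), the second is fine. Shifting everything after the first branch by 2 bytes fixes it (it now starts at 32), but moves the second branch to 62, where it ends on the boundary at 64, so a second pass is needed.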

@damageboy (Contributor, Author)

damageboy commented Nov 20, 2019 via email

@damageboy (Contributor, Author)

@AndyAyersMS @adamsitnik

Seems like someone over at the GraalVM project re-ran their benchmark suite, and I think their results pretty much reflect what I've experienced sporadically, namely that Intel is seriously understating the perf drop relative to its quoted 0-4% range (very impressive and concise visualization of 1000 benchmark results, regardless):

[screenshot: GraalVM benchmark results visualization]

oracle/graal#1829

I think it's clear, given that EVERY result in the screenshot is degraded by more than 5%, that this is not something that will go unnoticed...
Other compilers and JITs will eventually provide a mitigation for this (I'm assuming that Rust, Go, and other LLVM-based compilers will have an easier/faster response time; on that issue it is clear Graal already has something in the works), and there will be a clear and substantial drop in perceived/actual performance for CoreCLR / .NET Framework compared to patched languages/runtimes after this microcode update.

I'd love to get the chance to run a bunch of benchmarks for JIT'd code that are acceptable for the CoreCLR team and provide some results.

I can start looking for large suites of .NET BenchmarkDotNet projects, but I'm assuming there is already something that all of the JIT devs are using as some sort of reference much like the GraalVM people?

@filipnavara (Member)

I can start looking for large suites of .NET BenchmarkDotNet projects, but I'm assuming there is already something that all of the JIT devs are using as some sort of reference much like the GraalVM people?

https://github.com/dotnet/performance ... it is run roughly every 4 hours for .NET Core and I run it once a day for Mono.

@emaste

emaste commented Dec 6, 2019

Note that the image quoted above has a slider set to "Ignore deviations below 5%" - perhaps that explains why every result on the screenshot is degraded by more than 5%? I do believe it's very important that this issue gets addressed, but I want to make sure we're interpreting the results correctly.

@damageboy (Contributor, Author)

damageboy commented Dec 6, 2019 via email

@brianrob (Member)

brianrob commented Dec 6, 2019

Thanks folks for continuing to engage on this. I've been looking into the performance differences here as well, but don't yet have something worth sharing. I'm aiming for a mix of micro-benchmarks and some that look a bit more realistic. My goal is to understand the general performance cost, as well as which code is responsible for that cost (native vs. managed). From there we can start looking at mitigations.

With respect to the data above, the summary looks very similar to what I'm seeing in the microbenchmarks.

@damageboy (Contributor, Author)

Hi, I was wondering whether there has been any formal decision on how/if to address this, now that it's been about a month since the last update on this issue and 18 days since @BruceForstall added the 5.0 milestone.

I realize that the 5.0 milestone doesn't necessarily mean that any mitigation will be added at any level (managed/unmanaged code). If no decision has been made yet, I'd like to understand if the research that was performed by @brianrob was concluded, and if so, would it be possible to share the results of the impact that was measured?

@brianrob (Member)

@damageboy thanks for reaching out on this. I was out for the holidays and have picked this work back up this week. I'm not done yet, but I plan to share methodology and results once they're ready. I also expect that we'll be interested to hear from folks like yourself about how your apps are impacted, to help guide the decision-making process. All that to say that so far things have seemed fairly closed, but I expect to open it up more soon; I just don't yet have data in a form that is worth sharing and that actually backs up a conclusion. Thanks for your patience.

@damageboy (Contributor, Author)

Sure thing. I will point out, from the get-go, that for us the only sensible decision was to make sure this microcode update never gets applied.

So in effect we are experiencing no impact whatsoever, since at this point we have basically given up any security expectations from Intel processors.
At the same time, it is important to note that the systems I am discussing here are not internet-facing machines, which completely changes the decision-making process in favor of taking such an extreme measure.

I will point out that in our testing the impact was considerable; while it was definitely far from the worst case I reported, a 20% drop in perf for Array.Sort(), we saw a more moderate 4-5% drop in perf for our computationally intensive workloads on Haswell and Broadwell processors (no Skylake processors in our computational cluster so far).

@damageboy (Contributor, Author)

@brianrob, if there are new findings/conclusions about this issue, I would be happy if you could share them here...

I personally think it's also completely fine to mark this as wontfix, if that is what the group and the JIT gurus think is the right answer going forward.
It's been about two and a half months now, holidays and all, and I would really like to get some visibility or finality about where this is going.

Maybe you could share what MSFT is doing internally about this?

We've all heard previously how Bing, to name one example, is a big consumer of CoreCLR and contributes fixes/proposals that push GC and JIT performance forward in small, marginal improvements.

How are they being affected by this supposed 5%+ perf hit (that is, if you take Intel's word as-is)?
Are they seeing a much smaller perf hit?
Are they perhaps also not deploying the fix?

@AndyAyersMS (Member)

@brianrob any update on this?

@brianrob (Member)

@AndyAyersMS, thanks for pinging this. I am putting together the results to post here soon.

@brianrob (Member)

brianrob commented May 1, 2020

FYI, I just posted #35730 with the results of the investigation.
