Add documents about JIT optimization planning #12956

JIT Optimizer Planning Guide
============================

The goal of this document is to capture some thinking about the process used to
prioritize and validate optimizer investments. The overriding goal of such
investments is to help ensure that the dotnet platform satisfies developers'
performance needs.

Benchmarking
------------

There are a number of public benchmarks which evaluate different platforms'
relative performance, so naturally dotnet's scores on such benchmarks give
some indication of how well it satisfies developers' performance needs. The JIT
team has used some of these benchmarks, particularly [TechEmpower](https://www.techempower.com/benchmarks/)
and [Benchmarks Game](http://benchmarksgame.alioth.debian.org/), for scouting
out optimization opportunities and prioritizing optimization improvements.
While it is important to track scores on such benchmarks to validate performance
changes in the dotnet platform as a whole, when it comes to planning and
prioritizing JIT optimization improvements specifically, they aren't sufficient,
due to a few well-known issues:
- For macro-benchmarks, such as TechEmpower, compiler optimization is often not
  the dominant factor in performance. The effects of individual optimizer
  changes are most often in the sub-percent range, well below the noise level
  of the measurements, which will usually be at least 3% or so even for the
  most well-behaved macro-benchmarks.
- Source-level changes can be made much more rapidly than compiler optimization
  changes. This means that for anything we're trying to track where the whole
  team is effecting changes in source, runtime, etc., any particular code
  sequence we may target with optimization improvements may well be targeted
  with source changes in the interim, nullifying the measured benefit of the
  optimization change when it is eventually merged. Source/library/runtime
  changes are in play for both TechEmpower and Benchmarks Game.
Compiler micro-benchmarks (like those in our [test tree](https://github.com/dotnet/coreclr/tree/master/tests/src/JIT/Performance/CodeQuality))
don't share these issues, and adding them as optimizations are implemented is
critical for validation and regression prevention; however, micro-benchmarks
often aren't as representative of real-world code, and therefore not as
reflective of developers' performance needs, so they aren't well suited for
scouting out and prioritizing opportunities.

Benefits of JIT Optimization
----------------------------

While source changes can more rapidly and dramatically effect changes to
targeted hot code sequences in macro-benchmarks, compiler changes have the
advantage that they apply broadly to all compiled code. One of the best reasons
to invest in compiler optimization improvements is to capitalize on this. A few
specific benefits:
- Optimizer changes can effect "peanut-butter" improvements; by making an
  improvement which is small in any particular instance to a code sequence that
  is repeated thousands of times across a codebase, they can produce substantial
  cumulative wins. These should accrue toward the standard metrics (benchmark
  scores and code size), but identifying the most profitable "peanut-butter"
  opportunities is difficult. Improving our methodology for identifying such
  opportunities would be helpful; some ideas are below.
- Optimizer changes can unblock coding patterns that performance-sensitive
  developers want to employ but consider prohibitively expensive. They may
  have inelegant workarounds in their code, such as gotos for loop-exiting
  returns to work around poor block layout, manually scalarized structs to
  work around poor struct promotion, manually unrolled loops to work around
  lack of loop unrolling, limited use of lambdas to work around inefficient
  access to heap-allocated closures, etc. (a sketch of one such workaround
  follows this list). The more the optimizer can improve such situations, the
  better, as it both increases developer productivity and increases the
  usefulness of abstractions provided by the language and libraries. Finding
  a measurable metric to track this type of improvement poses a challenge,
  but would be a big help toward prioritizing and validating optimization
  improvements; again, some ideas are below.
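
To make the second bullet concrete, here is a minimal sketch of the "manually
scalarized structs" workaround. All type and method names are invented for
illustration, and the codegen behavior described in the comments is the
motivation developers cite, not measured output:

```csharp
// Hypothetical example: accumulating into a struct local vs. manually
// scalarized fields.
struct Point
{
    public double X;
    public double Y;
}

static class ScalarizationExample
{
    // Idiomatic version: accumulate into a struct local and rely on the JIT
    // to promote 'acc' so its fields live in registers.
    static Point Centroid(Point[] points)   // assumes a non-empty array
    {
        Point acc = new Point();
        for (int i = 0; i < points.Length; i++)
        {
            acc.X += points[i].X;
            acc.Y += points[i].Y;
        }
        acc.X /= points.Length;
        acc.Y /= points.Length;
        return acc;
    }

    // Manually scalarized workaround: each field gets its own local so the
    // JIT can enregister them even if it declines to promote the struct,
    // trading readability for the codegen the optimizer didn't deliver.
    static Point CentroidScalarized(Point[] points)
    {
        double x = 0, y = 0;
        for (int i = 0; i < points.Length; i++)
        {
            x += points[i].X;
            y += points[i].Y;
        }
        return new Point { X = x / points.Length, Y = y / points.Length };
    }
}
```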

Brainstorm
----------

Listed here are several ideas for undertakings we might pursue to improve our
ability to identify opportunities and validate/track improvements that mesh
with the benefits discussed above. Thinking here is in the early stages, but
the hope is that with some thought/discussion some of these will surface as
worth investing in.
- Is there telemetry we can implement/analyze to identify "peanut-butter"
  opportunities, or target "coding patterns"? It's probably easier to use this
  to evaluate/prioritize patterns we're considering targeting than to identify
  the patterns in the first place.
- Can we construct some sort of "peanut-butter profiler"? The idea would
  roughly be to aggregate samples/counters under particular input constructs
  rather than aggregate them under callstack. Might it be interesting to
  group by MSIL opcode, or opcode pair, or opcode triplet... ?
- It might behoove us to build up some SPMI traces that could be data-mined
  for any of these experiments.
- We should make it easy to view machine code emitted by the jit, and to
  collect profiles and correlate them with that machine code. This could
  benefit any developers doing performance analysis of their own code.
  The JIT team has discussed this; options include building something on top
  of the profiler APIs, enabling COMPlus_JitDisasm in release builds, and
  shipping with (or making easily available) an alt jit that supports
  JitDisasm.

  > **Review comment:** Having JitDisasm in release builds would certainly be
  > nice, but it may also be limiting (e.g. right now it outputs to the
  > console, so it can interfere with the application's own output). The
  > current disassembler output is also a bit inaccurate at times; not a big
  > problem usually, but it can be confusing. Another interesting option might
  > be for the runtime to expose a managed API that offers information (e.g.
  > code ranges) about JITed functions. That would allow people to use a
  > 3rd-party disassembler, or perhaps find more creative uses.

  > **Reply:** Yes, we'd need to have a way to send the disasm somewhere other
  > than stdout. I believe there's some functionality to send jit output to a
  > logfile already; if we do this, we'd need to make sure it's working, and
  > working well with JitDisasm. To my mind, the appeal of making JitDisasm
  > available (over disassembling the emitted code) is that it would make it
  > easy to bring along all the annotations we put in the disasm (method name,
  > optimization flags, symbols and helper call names, annotated GC/EH tables,
  > etc.), as well as things like DiffableDisasm.

  > **Review comment:** There is CLR MD, which for example SharpLab is using
  > for in-proc disassembly with a 3rd-party disassembler.

  > **Review comment:** It would be very helpful to have a "side by side" very
  > high resolution profiler. My suggestion would be to include, as one of the
  > available profiling options, the approach described in the paper "Computer
  > performance microscopy with Shim" (X. Yang, S. M. Blackburn, K. S.
  > McKinley - ACM SIGARCH Computer Architecture News, 2016). This profiler
  > allows 15-processor-cycle resolution with overhead around 68%, and
  > 1000-processor-cycle resolution with overhead at 2%, with no or very small
  > observer effects. AFAIR, the currently used code (thread cycle
  > measurements in utilities) has significant overhead - in the range of 200
  > processor cycles for a single measurement, or 400 cycles for the two-point
  > measurement needed to determine a time interval (cpuid + rdtsc or similar
  > serializing time stamp counter reads). The last author, Kathryn S.
  > McKinley, is at Microsoft Research, and the code is available at
  > https://github.com/ShimProfiler/SHIM under GPLv2 (the work was funded by
  > NSF). Quite often I would like to know how long a performance-critical
  > method executes in a real application, and yet it is often called only
  > once during a typical application life cycle - e.g. image decompression, a
  > coding algorithm for short data sequences, or some parts of a multi-stage
  > / multi-algorithm pipeline. If typical benchmarks are used, the method is
  > isolated from its usual context, and execution time can be very different
  > from when the method executes once in application context. In my
  > experiments in managed code on .NET 4.6-4.7, the difference could be as
  > large as 3-5 times.
- Hardware companies maintain optimization/performance guides for their ISAs.
  Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a
  thing somewhere publicly votable, we could track which anti-patterns people
  find most frustrating to avoid, and the subsequent removal of them. Does
  such a guide already exist somewhere that we could use as a starting point?
  Should we collate GitHub issues or Stack Overflow issues to create such a
  thing?

  > **Review comment:** ISAs are far more complicated than MSIL in this
  > regard, so it makes sense that there are such guides. I don't think
  > there's a lot that can be done here, but here are a few ideas:

- Maybe we should expand our labels on GitHub so that there are sub-areas
  within "optimization"? It could help prioritize by letting us compare the
  relative sizes of those buckets.
- Can we more effectively leverage the legacy JIT codebases for comparative
  analysis? We've compared micro-benchmark performance against Jit64 and
  manually compared disassembly of hot code; what else can we do? One concrete
  idea: run over some large corpus of code (SPMI?), and do a path-length
  comparison, e.g. by looking at each sequence of k MSIL instructions (for
  some small k), and for each combination of k opcodes collect statistics on
  the size of generated machine code (maybe using debug line number info to do
  the correlation?), then look for common sequences which are much longer with
  RyuJIT (a sketch of this bookkeeping follows this list).
- Maybe hook RyuJIT up to some sort of superoptimizer to identify opportunities?
- Microsoft Research has done some experimenting that involved converting RyuJIT
  IR to LLVM IR; perhaps we could use this to identify common expressions that
  could be much better optimized.
- What's a practical way to establish a metric of "unblocked coding patterns"?
- How developers give feedback about patterns/performance could use some
  thought; the GitHub issue list is open, but does it need to be publicized
  somehow? We perhaps should have some regular process where we pull issues
  over from other places where people report/discuss dotnet performance
  issues, like [Stack Overflow](https://stackoverflow.com/questions/tagged/performance+.net).
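
Expanding on the path-length comparison idea above, a minimal sketch of the
k-gram bookkeeping it describes. It assumes we already have, per method, the
MSIL opcode stream with machine-code bytes attributed to each instruction
(the attribution via debug line info is the speculative part); all names here
are invented for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

static class PathLengthComparison
{
    // For one jitted method, accumulate, for each window of k consecutive
    // MSIL opcodes, the total machine-code bytes attributed to that window
    // and the number of times the window occurred.
    static void CollectKGrams(
        IReadOnlyList<(string Opcode, int CodeBytes)> method,
        int k,
        Dictionary<string, (long Bytes, long Count)> stats)
    {
        for (int i = 0; i + k <= method.Count; i++)
        {
            var window = method.Skip(i).Take(k).ToArray();
            string gram = string.Join(" ", window.Select(t => t.Opcode));
            int bytes = window.Sum(t => t.CodeBytes);

            stats.TryGetValue(gram, out var s);   // missing key => (0, 0)
            stats[gram] = (s.Bytes + bytes, s.Count + 1);
        }
    }

    // Running this over the same corpus compiled with RyuJIT and with the
    // legacy jit, then ranking grams by the ratio of per-occurrence bytes,
    // would surface the common sequences where RyuJIT's code is much longer.
}
```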

Optimizer Codebase Status/Investments
=====================================

There are a number of areas in the optimizer that we know we would invest in
improving if resources were unlimited. This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.

> **Review comment:** It would be very useful to have a description of
> existing optimizations, with info on implemented algorithms and links to the
> code - an "Optimizer Codebase and Status". This would help in understanding
> the existing RyuJIT implementation.

> **Reply:** This is more or less available in the existing documentation:
> https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md

> **Review comment:** I know that document - I've read it twice already, and
> to my taste I would like to go deeper, with more detailed links to the code.
> My intention is to indicate that documentation on the jit, vm, and gc should
> let an experienced developer understand the implementation well enough that
> the so-called time-to-first-commit is as short as possible. The usual
> problem with documentation for developers is that it is best written by the
> code's authors, who have to write the code in the first place and do not
> have much time for documenting their work. Another aspect of the same
> problem is the barrier to contributing, which has a major impact on the size
> of the community and the dynamics of open source development. I would treat
> investments in documentation as an investment in the community supporting
> the project.

Improved Struct Handling
------------------------

This is an area that has received recent attention, with the [first-class structs](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md)
work and the struct promotion improvements that went in for `Span<T>`. Work here
is expected to continue and can happen incrementally. Possible next steps:

- Struct promotion stress mode (test mode to improve robustness/reliability)
- Promotion of more structs; relax limits on e.g. field count (should generally
  help performance-sensitive code where structs are increasingly used to avoid
  heap allocations)
- Improve handling of System V struct passing (I think we currently insert
  some unnecessary round-trips through memory at call boundaries due to
  internal representation issues)
- Implicit byref parameter promotion w/o shadow copy

We don't have specific benchmarks that we know would jump in response to any of
these. We may well be able to find some with some looking, though this may be
an area where current performance-sensitive code avoids structs.
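
For readers unfamiliar with the terminology: struct promotion ("scalar
replacement") breaks a struct local into one local per field so each can be
enregistered independently. A minimal sketch of code whose quality hinges on
promotion; the names are hypothetical, and the field-count threshold mentioned
in the comment is an internal JIT heuristic, not something this doc fixes:

```csharp
struct Extent
{
    public int Start;
    public int Length;
}

static class PromotionExample
{
    // With promotion, 'e' never lives in memory: Start and Length stay in
    // registers, and End() inlines down to a single add. Without promotion
    // (e.g. if the struct had too many fields for the JIT's heuristic),
    // each field access would go through a stack slot instead.
    static int End(Extent e) => e.Start + e.Length;

    static int MaxEnd(Extent[] extents)
    {
        int max = 0;
        for (int i = 0; i < extents.Length; i++)
        {
            int end = End(extents[i]);
            if (end > max) max = end;
        }
        return max;
    }
}
```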

Exception Handling
------------------

This is increasingly important as C# language constructs like async/await and
certain `foreach` incantations are implemented with EH constructs, making them
difficult to avoid at source level. The recent work on finally cloning, empty
finally removal, and empty try removal targeted this. [Writethrough](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md)
is another key optimization enabler here, and we are actively pursuing it. Other
things we've discussed include inlining methods with EH and computing funclet
callee-save register usage independently of main function callee-save register
usage, but I don't think we have any particular data pointing to either as a
high priority.
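
As a reminder of why EH shows up even in innocent-looking code, here is a
rough sketch (a hand-written approximation, not the compiler's exact output)
of how the C# compiler lowers a `foreach` over an `IEnumerable<T>` into a
try/finally:

```csharp
using System.Collections.Generic;

static class ForeachLowering
{
    // What the developer writes.
    static int Sum(IEnumerable<int> xs)
    {
        int sum = 0;
        foreach (int x in xs)
            sum += x;
        return sum;
    }

    // Approximately what the compiler emits: the finally block that disposes
    // the enumerator is what puts an EH construct on this hot path.
    static int SumLowered(IEnumerable<int> xs)
    {
        int sum = 0;
        IEnumerator<int> e = xs.GetEnumerator();
        try
        {
            while (e.MoveNext())
                sum += e.Current;
        }
        finally
        {
            e.Dispose();
        }
        return sum;
    }
}
```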

Loop Optimizations
------------------

We haven't been targeting benchmarks that spend a lot of time doing
computations in an inner loop, and pursuing loop optimizations for the
peanut-butter effect would seem odd. So this simply hasn't bubbled up in
priority yet, though it's bound to eventually.

More Expression Optimizations
-----------------------------

We again don't have particular benchmarks pointing to key missing cases, and
balancing code quality (CQ) against JIT throughput (TP) will be delicate here,
so it would really help to have an appropriate benchmark suite to evaluate
this work against.

Forward Substitution
--------------------

This too needs an appropriate benchmark suite that I don't think we have at
this time. The tradeoffs against register pressure increase and throughput
need to be evaluated. This also might make more sense to do if/when we can
handle SSA renames.
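
To make the transformation concrete, a source-level sketch of what forward
substitution does; the JIT would perform this on its IR, and the
register-pressure concern comes from the substituted expression's operands
staying live longer:

```csharp
static class ForwardSubstitutionExample
{
    // Before: the single-use temp 't' is a separate definition in the IR.
    static int Before(int a, int b, int c)
    {
        int t = a * b;
        return t + c;
    }

    // After forward substitution: the definition is folded into its use,
    // forming one larger expression tree the optimizer can see whole - at
    // the cost that 'a' and 'b' stay live up to the use site (the register
    // pressure tradeoff mentioned above).
    static int After(int a, int b, int c)
    {
        return a * b + c;
    }
}
```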

Value Number Conservatism
-------------------------

We have some frustrating phase-ordering issues resulting from this, but the
opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now. Also, using the SSA def as the proxy for
value number would require handling SSA renaming, so there's a big dependency
chained to this.

> **Review comment:** I'm not sure what you mean by ...

> **Reply:** I mean eagerly replacing redundant expressions, and thus being
> able to approximate "has same value" with "is use of same SSA def" (and
> re-casting the heap VN stuff as memory SSA), rather than dragging around
> side tables of value numbers in a separate expression language.

Maybe it's worth reconsidering the priority based on throughput?
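
To illustrate the kind of redundancy at stake (a source-level analogy; the
JIT reasons about this on its IR via value numbers):

```csharp
static class ValueNumberingExample
{
    static int Sum2(int[] a, int i)
    {
        // 'a[i]' appears twice. Value numbering proves both loads produce
        // the same value (no intervening store can change a[i] here),
        // letting the JIT reuse the first load instead of emitting a second
        // one. The "conservatism" discussed above is about how often the
        // analysis fails to prove such equivalences due to phase ordering
        // or memory modeling.
        return a[i] + a[i] * 2;
    }
}
```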

High Tier Optimizations
-----------------------

We don't have that many knobs we can "crank up" (though we do have the tracked
assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting. We should get that benchmarking story sorted out and at least hook
up those two knobs.

> **Review comment:** Again, not really an optimization issue, but it's pretty
> clear that existing issues with register allocation (and in particular,
> issues with spill placement) are a current inhibitor to more aggressive
> optimization.

> **Reply:** Could you elaborate? Are you saying we'd do more aggressive
> post-RA optimization with better-placed spills, or do more aggressive pre-RA
> optimization if we had better spill placement in the RA to rely on, or
> both/neither? And specifically, is there something you think the doc should
> say about this under "High Tier Optimizations" (like that we could use a
> different RA algorithm)?

> **Review comment:** I was saying the latter, and I think that all the doc
> really needs to say is that, until the RA issues are mitigated, aggressive
> optimizations are likely to be pessimized by RA issues and/or potentially
> make performance worse. Whether or not we need a different RA algorithm, I
> think, remains to be seen, but I think there's a lot of potential
> improvement with the existing RA algorithm that has not yet been achieved.

> **Reply:** Makes sense. Added a note to that effect.

Low Tier Back-Off
-----------------

We have some changes we know we want to make here: morph does more than it
needs to in minopts, and tier 0 should be doing throughput-improving inlines,
as opposed to minopts, which does no inlining. It would be nice to have the
benchmarking story set up to measure the effect of such changes when they go
in; we should do that.

Async
-----

We've made note of the prevalence of async/await in modern code (and
particularly in web server code such as TechEmpower), and have some
opportunities listed in [#7914](https://github.com/dotnet/coreclr/issues/7914).
Some sort of study of async peanut butter to find more opportunities is
probably in order, but what would that look like?
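
For context on where the async "peanut butter" comes from: even a trivial
async method is rewritten by the C# compiler into a state machine. A heavily
simplified sketch (the comment describes the rough shape of the expansion,
not the exact generated code):

```csharp
using System.Threading.Tasks;

static class AsyncExample
{
    // One small async method...
    static async Task<int> ReadValueAsync(Task<int> pending)
    {
        int v = await pending;   // the compiler splits the method here
        return v + 1;
    }

    // ...is rewritten by the C# compiler into (approximately) a struct state
    // machine with a MoveNext method, an AsyncTaskMethodBuilder field, and
    // hoisted locals. Each of those pieces (builder calls, state dispatch,
    // awaiter checks) is a small per-method cost repeated across every async
    // method in a codebase - classic peanut butter.
}
```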

Address Mode Building
---------------------

One opportunity that's frequently visible in asm dumps is that more address
expressions could be folded into memory operands' address expressions. This
would likely give a measurable code size win. Needs some thought about where
to run in the phase list and how aggressive to be about e.g. analyzing across
statements.

> **Review comment:** Isn't this related to forward substitution?

> **Reply:** Yes, certainly. I suppose I mentioned it here simply thinking
> that if we tackle the address mode thing, it might be worthwhile to add some
> simple forward propagation as part of that, which could then be
> refactored/subsumed if we add more general forward substitution
> subsequently.
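
A sketch of the kind of folding meant here. The assembly in the comments is
illustrative x64 under assumed register assignments and array layout, not
quoted compiler output:

```csharp
static class AddressModeExample
{
    static int Get(int[] a, int i)
    {
        // The address of a[i + 4] can be computed with a separate add
        // feeding a simple memory operand:
        //
        //     lea  rcx, [rdx + 4]          ; i + 4
        //     mov  eax, [r8 + rcx*4 + 16]  ; load element (16 assumes a
        //                                  ; particular array data offset)
        //
        // or folded entirely into one scaled-index memory operand:
        //
        //     mov  eax, [r8 + rdx*4 + 32]  ; base + i*4 + (4*4 + header)
        //
        // Folding saves an instruction, which is exactly the code size win
        // described above.
        return a[i + 4];
    }
}
```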

If-Conversion (cmov formation)
------------------------------

This hits big in the microbenchmarks where it applies. There's some work in
flight on this (see #7447 and #10861).
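
A sketch of the pattern if-conversion targets. The branchy source and the
`cmov` sequence in the comment are semantically equivalent; the assembly is
the expected x64 lowering under assumed register assignments, not quoted
compiler output:

```csharp
static class IfConversionExample
{
    // Branchy form: compiled naively, this is a compare, a conditional jump,
    // and two join paths - costly when the condition is unpredictable.
    static int MaxBranchy(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    // The if-converted form the JIT would ideally produce from the above
    // (with a in ecx, b in edx):
    //
    //     cmp   ecx, edx
    //     mov   eax, edx
    //     cmovg eax, ecx
    //
    // i.e. a conditional move with no control flow, immune to branch
    // misprediction.
}
```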

Mulshift
--------

Replacing multiplication by constants with shift/add/lea sequences is a
classic optimization that keeps coming up in planning. An [analysis](https://gist.github.com/JosephTremoulet/c1246b17ea2803e93e203b9969ee5a25#file-mulshift-md)
indicates that RyuJIT is already capitalizing on most of the opportunity here.

> **Review comment:** Eh, the JIT does some of this already, and I suspect it
> wouldn't be much trouble to make it do more. That's a way of saying "do we
> really need a planning document to make it happen"? :)

> **Reply:** Pull requests welcome :) No, I'm not trying to modify our
> workflow, impose heavier process, or demand that changes get added to this
> document before (or after) getting implemented, or anything like that - I'm
> just capturing a list of items we keep discussing in planning, to avoid
> having to re-create the discussion.

> **Review comment:** Not anytime soon; I don't think it's a very useful
> optimization (beyond what we already have now).

> **Review comment:** From the perspective of a developer working on the core
> team, saying "not anytime soon" is entirely understandable, since he/she is
> on the bleeding edge of project development, but from the perspective of
> potential community members it would be very helpful and welcome.

> **Reply:** I'm not quite sure what you are trying to say. Just to be clear,
> I'm a community member, not a core team member :)

> **Review comment:** From my perspective, your knowledge of dotnet and your
> contributions tell that you are a core team member :) - git blame does not
> lie, does it?

> **Reply:** I ran some stats, and yeah, it looks like we're already getting
> nearly everything that makes sense; will put together a PR for the few
> stragglers that seem worthwhile.

> **Review comment:** Wow, that's a bit of work. It would be nice to know how
> you instrumented the JIT - or, to be more precise, how you got the numbers
> out of the JIT. Files, I presume? I didn't know that morph does this; I only
> knew about codegen. Now that I see this, I'm not so sure it's a good idea to
> have this in morph. For one thing, it increases IR size, and it's not likely
> to enable additional optimizations - quite the contrary. But more
> importantly, this really belongs in codegen, as it is a very target-specific
> optimization. IMUL is quite fast these days - latency 3 and throughput 1.
> Replacing it with a single LEA or SHR is pretty much always a win, but the
> moment you replace it with 2 LEA/SHR instructions things become complicated:
> those 2 instructions will have at least 2 cycles of latency, so in the best
> case you're saving 1 cycle at the cost of adding an instruction.

> **Reply:** I added instance fields to [...]. I agree it seems like something
> that should live in the backend. cc @russellhadley, who had some reasons to
> prefer Lower to CodeGen. I'm not planning to stop and migrate it now (bigger
> fish to fry), but would be happy to see that happen.

> **Review comment:** Yeah, I prefer Lower too. Doing this kind of stuff in
> CodeGen sometimes also requires adding logic in Lower or TreeNodeInfoInit,
> and that logic needs to be kept in sync, otherwise bugs or CQ issues show
> up. But if we do it in Lower, we also need to add a new lclvar, because the
> non-constant operand of MUL has multiple uses. I might take a look once I
> finish my pesky cast optimization attempt.
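
For reference, the classic strength reductions in question, written as
C#-level identities; whether each is profitable depends on the target, per
the latency discussion in the thread above:

```csharp
static class MulshiftExample
{
    // x * 8  ==>  x << 3         (a single shift: always a win over imul)
    static int Times8(int x) => x << 3;

    // x * 9  ==>  x + (x << 3)   (on x64, one 'lea eax, [rcx + rcx*8]')
    static int Times9(int x) => x + (x << 3);

    // x * 10 ==>  (x << 1) + (x << 3), i.e. two instructions - this is
    // where the tradeoff gets murky, since imul's ~3-cycle latency is hard
    // to beat with a 2-instruction dependent sequence.
    static int Times10(int x) => (x << 1) + (x << 3);
}
```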

> **Review comment:** We use all of those, including also: [...]
>
> Examples of the result: https://ayende.com/blog/177569/why-we-arent-publishing-benchmarks-for-ravendb-4-0-yet

> **Reply:** Thanks for the examples; this is exactly the sort of list I'm
> hoping we can build/prioritize/address.