JIT: make suitably optimistic prejit inline assessments #14850
Conversation
When prejitting, the jit assesses whether each root method is a potential inline candidate for any possible caller. Methods deemed un-inlinable in any caller are marked in the runtime as "noinline" to save the jit some work later on when it sees calls to these methods.

This assessment was too conservative and led to prejit-ordering dependencies for inlines. It also meant that prejitting was missing some inlines that would have happened had we not done the prejit root assessment.

This change removes some of the prejit observation blockers. These mostly will enable more prejit methods to become candidates. We also now track when a method argument reaches a test. When we are assessing profitability for a prejit root, we assume the best possible call site.

Also, update the inline XML to capture the prejit assessments.

This increases the number of inlines considered and performed when prejitting, and will also lead to slightly more inlining when jitting. However, we do not expect a large throughput impact when jitting -- the main impact of this change is that inlining when prejitting is now just as aggressive as inlining when jitting, and the decisions no longer depend on the order in which methods are prejitted.

Closes #14441.
Closes #3482.
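The policy change described above can be sketched as a toy model (all types and names here are invented for illustration; this is not the actual jit code): the old assessment marked a method noinline if any observation *might* block an inline at some call site, while the new one does so only for observations that block inlining at every possible call site.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a prejit root assessment. Each observation about a
// method either blocks inlining everywhere (e.g. the method uses localloc)
// or only blocks it at some call sites (e.g. an argument feeds a test that
// a constant argument would fold away).
struct Observation
{
    bool blocksAlways;    // definitely un-inlinable in any caller
    bool blocksSometimes; // depends on the call site
};

// Old, conservative behavior: any doubt marks the method "noinline".
static bool MarkNoinlineConservative(const std::vector<Observation>& obs)
{
    for (const Observation& o : obs)
    {
        if (o.blocksAlways || o.blocksSometimes)
        {
            return true;
        }
    }
    return false;
}

// New, optimistic behavior: assume the best possible call site, so only
// unconditional blockers cause the "noinline" mark.
static bool MarkNoinlineOptimistic(const std::vector<Observation>& obs)
{
    for (const Observation& o : obs)
    {
        if (o.blocksAlways)
        {
            return true;
        }
    }
    return false;
}
```

Under this sketch, a method whose only blocker is call-site dependent stays a candidate, which is why more methods are considered when prejitting.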
@briansull PTAL. Verified that if I disable the propagation of the noinline bit into the runtime, we get the same inlines in corelib as we do when we propagate. Size impact from jit-diffs. Note the size increase is expected, as this makes the prejit inliner as aggressive as the jit inliner. I plan to follow up with some commentary on the largest regressions.
Some data on the impact to inlining in corelib below. We see more candidates, and so more inline attempts and successes. Fewer failures from "noinline" and more from "unprofitable". Unprofitable rejects are still somewhat cheap (they require an IL scan but no importation). About 10% more methods in corelib are now considered inline candidates.
Success Breakdown
Failure Breakdown
Extra Breakdown
"Extras" are methods inlined because some inline parent is an aggressive inline, e.g. we have a call chain A->B->C where B is an aggressive inline. Once the jit inlines B into A, it may now be required to do even more inlines. So think of it as an assessment of the knock-on effect from an aggressive non-leaf inline.
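A minimal sketch of that knock-on accounting (the call graph, names, and worklist loop are invented for illustration; the real jit works over its IR, not a map of strings): inlining an aggressive callee exposes the callee's own call sites as new candidates in the root.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical call graph: method name -> direct callees.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Count how many inlines happen into 'root' if every exposed candidate is
// inlined, where only aggressive inlines expose their own callees as
// further ("extra") candidates.
static int CountInlinesWithExtras(const CallGraph& calls, const std::string& root,
                                  const std::map<std::string, bool>& aggressive)
{
    int inlined = 0;
    std::vector<std::string> worklist = calls.at(root); // root's direct call sites
    while (!worklist.empty())
    {
        std::string callee = worklist.back();
        worklist.pop_back();
        inlined++;
        if (aggressive.at(callee))
        {
            // Knock-on effect: the aggressive inline's body brings its own
            // call sites into the root, so they become candidates too.
            auto it = calls.find(callee);
            if (it != calls.end())
            {
                worklist.insert(worklist.end(), it->second.begin(), it->second.end());
            }
        }
    }
    return inlined;
}
```

For the A->B->C chain above with B aggressive, this counts two inlines: B into A, and then C (exposed by B's body) into A as the "extra".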
```diff
@@ -474,6 +474,12 @@ void InlineContext::DumpXml(FILE* file, unsigned indent)
         m_Sibling->DumpXml(file, indent);
     }

+    // Optionally suppress failing inline records
+    if ((JitConfig.JitInlineDumpXml() == 3) && !m_Success)
+    {
```
Should we add an enum for the three values used by `JitInlineDumpXml()`?
There are a bunch of enhancements I have in mind for the inline XML, though as far as I know nobody but me has ever used it... Also, I have some analysis tooling lying about that I should publish somewhere, maybe over in jitutils.
I should really change this into a flags-style thing... you should be able to choose:
- get XML for all methods or only those with inlines or inline attempts
- get XML for all inline attempts or only those with successful inlines
- get XML with full details or just summary details
- specify filename to contain the results (currently always goes to stderr)
So maybe this is best done as a follow-up change.
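A flags-style encoding along those lines might look something like the sketch below (the names and bit assignments are hypothetical; the current config just compares against raw integer values, e.g. the `== 3` check in the diff above):

```cpp
#include <cassert>

// Hypothetical flags for a reworked JitInlineDumpXml config. These names
// and bits are invented for illustration; they do not exist in the jit.
enum JitInlineDumpXmlFlags : unsigned
{
    JIDX_None           = 0x0,
    JIDX_AllMethods     = 0x1, // else only methods with inline attempts
    JIDX_FailedAttempts = 0x2, // else only successful inlines
    JIDX_FullDetails    = 0x4, // else summary details only
};

// With flags, the "suppress failing inline records" check becomes a simple
// bit test instead of a magic-number comparison.
static bool ShouldDumpAttempt(unsigned flags, bool success)
{
    return success || ((flags & JIDX_FailedAttempts) != 0);
}
```

A separate filename knob (rather than always writing to stderr) would likely stay a string config value, independent of these bits.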
Looks Good
Looking at the larger regressions, so far they fall into two related patterns:
These cases highlight the fact that in the current inline heuristics the profitability model is not dependent on the overall set of inline candidates and/or the history of inlines done so far. Each inline is assessed independently of all others. So if a method presents the inliner with a large number of marginal candidates, the inliner can grow size fairly aggressively, as each candidate passes the bar when considered on its own. There is a limiter that can kick in if things get too far out of hand (the "inline exceeds budget" rejection case), and we do see an uptick in this in the data, but it is mainly there to prevent pathological behavior.

As I have stated elsewhere, the current profitability heuristics are not well founded, and it is likely they give improper weight to the various facts available to the inliner. The upshot is that many marginal candidates appear more attractive than they ought to be. While some tweaking is possible, the biggest missing factor is some kind of call site frequency estimate, which is entirely lacking*. For example, in one such case:

```asm
;;; Before
mov      rcx, rsi
mov      edx, 2
call     [System.Text.RegularExpressions.RegexInterpreter:Advance(int):this]

;;; After
add      dword ptr [rsi+124], 3         // rsi == 'this'; need it in rcx below
mov      gword ptr [rsp+68H], rsi       // questionable spill
mov      rcx, gword ptr [rsi+104]
mov      rcx, gword ptr [rcx+8]
mov      edx, dword ptr [rsi+124]
cmp      edx, dword ptr [rcx+8]
jae      G_M25344_IG182
movsxd   rdx, edx
mov      ecx, dword ptr [rcx+4*rdx+16]  // would like to see this go into rdx
mov      dword ptr [rsp+64H], ecx       // to avoid this spill...
mov      rcx, gword ptr [rsp+68H]       // rsi still has this value!
mov      edx, dword ptr [rsp+64H]       // ... and reload
call     [System.Text.RegularExpressions.RegexInterpreter:SetOperator(int):this]
```

Finally, the profitability model must take into account the capabilities of the rest of the optimization pipeline, so what looks like a good inline may in fact produce suboptimal code.

At any rate, the behavior seen here is "by design", as things are working the way they are supposed to work.

*Such data is available with IBC (profile feedback), but the inliner doesn't handle this very well either.
@dotnet-bot retest Perf Build and Test

@jorive This leg is now 3h30m or longer. Can we split it into multiple legs (say, one for x64 and one for x86) to try and cut down the latency?
I don't think it does this, and it would likely be a large undertaking if not. However, if the same method is inlined multiple times and only the parameters are changing, could it be inlined once, with the setup register values changed prior to each use? The downside would be reduced optimization from dead code elimination (as branches may be active on multiple paths, so indeterminate).
@AndyAyersMS All of the builds/tests run in parallel, as it's a pipeline job. So breaking it into different jobs will not make a difference in terms of time spent. It is running long because we have some serious backlog on the linux jobs, so the machines are slammed. Strangely, it still shows as pending, but the Perf leg has actually completed and failed due to not being able to restore the opt data packages.
Could you wait until the build break job is checked in: |
@benaadams you mean treat repeated inlines like some kind of local procedure call? That idea has floated about at times, but I've never seen it pursued seriously. Doing this presents modelling challenges: simple flow graph representations now contain many "infeasible" flow paths, which can seriously degrade the quality and running time of analyses.
@briansull sure, I'm in no hurry to merge, just hate rerunning failing jobs at random thinking that for some reason they might now pass.
I think you can ignore the failure, but if you're interested in the perf implications of the change, you should rerun.
I merged the build break fix, so you are all clear from my point of view |
Yes; using an in-method call. It may allow for deeper call-stack inlining, as the repetition will be less (e.g. inlines into the functions being inlined, etc.). You'd probably only want to do it for methods above the always-inline size (e.g. wrappers and properties would likely be worse off not directly inlined).
As for running time: it could be crossgen-only, for example. However, it may also reduce CQ by keeping multiple paths alive rather than branch-eliminating them (i.e. constants becoming variables), so it may not be worth it. Just musing...
I do care about perf; in particular, this is going to look like it causes TP regressions. The last 30 or so runs of "perf build and test" have all failed, so it seems unlikely triggering a rerun will do anything useful.
@benaadams the quality of the solutions can degrade pretty badly if you just wire everything up like it is normal dataflow. To get good results you typically need techniques like those developed for interprocedural dataflow, and that gets tricky with things like SSA. Large fan-ins (say, at the "entry" to the local procedure) are troublemakers, since they commonly represent places where facts must merge, and it requires thoughtful engineering to avoid quadratically bad behavior in dataflow models. This fan-in issue also plagues SSA, which otherwise copes very nicely with large methods. So there are lots of places where things can go off the rails.

A compromise is to "fix" an ABI and decouple some of what is going on in the local procedure from the main method, giving special treatment to things that are live across the boundaries. This is more or less what we do for catches and finallys that are compiled as funclets and (for finallys, anyways) invoked as local procedures. Or you can just let the compiler do its thing and then very late try to pattern match and combine the disparate copies back into common code.

However, all of these sorts of things are pretty far down there in priority. In the case of the Regex interpreter, it is pretty common that the call to …
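The fan-in problem can be seen in a toy constant-propagation merge (invented for illustration): each call site's path carries a known constant, but once the flow edges join at the shared local-procedure entry, the facts must be merged, and disagreeing constants degrade to "unknown", even though each individual (feasible) path had a known value.

```cpp
#include <optional>

// Toy dataflow fact: either a known constant (has_value) or "unknown"
// (nullopt). At a merge point, facts from all incoming edges are combined.
static std::optional<int> MergeFacts(std::optional<int> a, std::optional<int> b)
{
    if (a.has_value() && b.has_value() && *a == *b)
    {
        return a; // both edges agree: the constant survives the merge
    }
    return std::nullopt; // disagreement or unknown input: fact is lost
}
```

With one shared copy of the body, a call site passing 2 and another passing 3 leave the entry with no constant at all; with separate inlined copies, each body would have kept its constant and the corresponding branches could have been eliminated.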
@dotnet-bot test Perf Build and Test |
That is probably true 😁
@AndyAyersMS just had a look at …
Added PR for …
Over 6 hr 40 min for "Perf Build and Test" so far; might need some more machines?
Perf build and test actually had a few passing runs recently, but some legs are now at 16hrs+ and still running. Seems like we are having some challenges figuring out adequate provisioning for some of the tests we'd really like to run here. Perf runs are a bit more complex to set up since we run on dedicated HW and not VMs. @jorive suggested to me privately that perhaps we move the Linux perf to a "rolling" basis -- triggered only by commits instead of PRs -- which I think is sensible if we can't handle the per-PR traffic load yet. @RussKeldorph we should really settle on a policy here, as having long-running CI legs that are likely to fail is not helpful to anyone. I would like to see us be more proactive and nimble with legs that start failing with any frequency.
I can definitely move the linux perf out of the innerloop/per PR test loop. That's an easy change. |
Odds of a successful perf run still not good. I am going to merge and let the rolling CI perf runs pick this up. |
Bit of an aside; but interestingly .NET Core runs happily on a 272 thread Xeon Phi https://twitter.com/damageboy/status/928208914759462912 might be an interesting testbed and highlight contention issues well |