Reduce calls to Crystal::Type#remove_indirection
in module dispatch
#14992
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
tl;dr Module types are expanded into unions during Crystal's codegen phase when the receiver of a call has a module type. This PR makes this expansion occur once for the entire call, instead of once for each including type.
It is a well-known issue that dynamic dispatch on a call can lead to very slow build times when the receiver is a module type with many includers. Here is a minimal example that demonstrates this:
We benchmark the compiler using
hyperfine --warmup 5 --prepare 'crystal clear_cache' -P N 100 2000 -D 100 'N={N} crystal build --prelude=empty test.cr'
:Results
N=100 crystal build --prelude=empty test.cr
N=200 crystal build --prelude=empty test.cr
N=300 crystal build --prelude=empty test.cr
N=400 crystal build --prelude=empty test.cr
N=500 crystal build --prelude=empty test.cr
N=600 crystal build --prelude=empty test.cr
N=700 crystal build --prelude=empty test.cr
N=800 crystal build --prelude=empty test.cr
N=900 crystal build --prelude=empty test.cr
N=1000 crystal build --prelude=empty test.cr
N=1100 crystal build --prelude=empty test.cr
N=1200 crystal build --prelude=empty test.cr
N=1300 crystal build --prelude=empty test.cr
N=1400 crystal build --prelude=empty test.cr
N=1500 crystal build --prelude=empty test.cr
N=1600 crystal build --prelude=empty test.cr
N=1700 crystal build --prelude=empty test.cr
N=1800 crystal build --prelude=empty test.cr
N=1900 crystal build --prelude=empty test.cr
N=2000 crystal build --prelude=empty test.cr
If we plot the mean times we could observe quadratic growth (R2 = 0.997), even though nothing in the code suggests there would be quadratic behavior. Passing
--stats
reveals that most of the time is spent on Crystal's codegen phase, rather than LLVM's, so this has nothing to do with large structs (e.g. #12801).There are at least two sources contributing equally to this accidentally quadratic behavior. One is the following fragment of
Crystal::CodeGenVisitor#codegen_dispatch
:crystal/src/compiler/crystal/codegen/call.cr
Lines 367 to 373 in ce76bf5
There are
N
target defs becauseFoo
hasN
includers. For each target def, the compiler calls#match_type_id
:crystal/src/compiler/crystal/codegen/match.cr
Lines 4 to 6 in ce76bf5
Therefore,
Crystal::Type#remove_indirection
is also calledN
times for this entire dispatch. Normally this method is idempotent, but if the receiver represents a module type, then the method converts that module into a union of its including types, i.e. the typeBar1 | Bar2 | ... | Bar{{ N }}
:crystal/src/compiler/crystal/types.cr
Lines 1172 to 1178 in ce76bf5
As each call produces an array with all
N
including types, the whole dispatch creates a total ofN * N
such elements, hence the quadratic behavior.Back to
Crystal::CodeGenVisitor#codegen_dispatch
, knowing thatowner = owner.remove_indirection
is idempotent and thatowner
is used only by that onematch_type_id
, we simply call#remove_indirection
before the loop. The new benchmark results with a local release build of the compiler are:Results
N=100 bin/crystal build --prelude=empty test.cr
N=200 bin/crystal build --prelude=empty test.cr
N=300 bin/crystal build --prelude=empty test.cr
N=400 bin/crystal build --prelude=empty test.cr
N=500 bin/crystal build --prelude=empty test.cr
N=600 bin/crystal build --prelude=empty test.cr
N=700 bin/crystal build --prelude=empty test.cr
N=800 bin/crystal build --prelude=empty test.cr
N=900 bin/crystal build --prelude=empty test.cr
N=1000 bin/crystal build --prelude=empty test.cr
N=1100 bin/crystal build --prelude=empty test.cr
N=1200 bin/crystal build --prelude=empty test.cr
N=1300 bin/crystal build --prelude=empty test.cr
N=1400 bin/crystal build --prelude=empty test.cr
N=1500 bin/crystal build --prelude=empty test.cr
N=1600 bin/crystal build --prelude=empty test.cr
N=1700 bin/crystal build --prelude=empty test.cr
N=1800 bin/crystal build --prelude=empty test.cr
N=1900 bin/crystal build --prelude=empty test.cr
N=2000 bin/crystal build --prelude=empty test.cr
As expected, this roughly halves the compilation times.
The other half comes from downcasting the receiver to each of the
N
including types:crystal/src/compiler/crystal/codegen/call.cr
Line 405 in ce76bf5
crystal/src/compiler/crystal/codegen/codegen.cr
Lines 1353 to 1359 in ce76bf5
This
var
is a temporaryCrystal::LLVMVar
that is created once outside the loop:crystal/src/compiler/crystal/codegen/call.cr
Lines 342 to 344 in ce76bf5
crystal/src/compiler/crystal/codegen/call.cr
Lines 362 to 363 in ce76bf5
This
%self
's type is used only at thedowncast
call, which likewise uses#remove_indirection
. We can therefore precompute all including types right when theLLVMVar
is constructed, and this time, all the quadratic slowdown is practically gone:Results
N=100 bin/crystal build --prelude=empty usr/small.cr
N=200 bin/crystal build --prelude=empty usr/small.cr
N=300 bin/crystal build --prelude=empty usr/small.cr
N=400 bin/crystal build --prelude=empty usr/small.cr
N=500 bin/crystal build --prelude=empty usr/small.cr
N=600 bin/crystal build --prelude=empty usr/small.cr
N=700 bin/crystal build --prelude=empty usr/small.cr
N=800 bin/crystal build --prelude=empty usr/small.cr
N=900 bin/crystal build --prelude=empty usr/small.cr
N=1000 bin/crystal build --prelude=empty usr/small.cr
N=1100 bin/crystal build --prelude=empty usr/small.cr
N=1200 bin/crystal build --prelude=empty usr/small.cr
N=1300 bin/crystal build --prelude=empty usr/small.cr
N=1400 bin/crystal build --prelude=empty usr/small.cr
N=1500 bin/crystal build --prelude=empty usr/small.cr
N=1600 bin/crystal build --prelude=empty usr/small.cr
N=1700 bin/crystal build --prelude=empty usr/small.cr
N=1800 bin/crystal build --prelude=empty usr/small.cr
N=1900 bin/crystal build --prelude=empty usr/small.cr
N=2000 bin/crystal build --prelude=empty usr/small.cr
There is still quadratic growth, but the compilation time remains well below 1 second until
N
is around 9500. (This growth is dominated byCrystal::TypeDeclarationProcessor
rather than codegen.)