Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce calls to Crystal::Type#remove_indirection in module dispatch #14992

Conversation

HertzDevil
Copy link
Contributor

@HertzDevil HertzDevil commented Sep 10, 2024

tl;dr Module types are expanded into unions during Crystal's codegen phase when the receiver of a call has a module type. This PR makes this expansion occur once for the entire call, instead of once for each including type.

It is a well-known issue that dynamic dispatch on a call can lead to very slow build times when the receiver is a module type with many includers. Here is a minimal example that demonstrates this:

{% begin %}
  N = {{ env("N").to_i }}
{% end %}

module Foo
end

{% for i in 1..N %}
  struct Bar{{ i }}
    include Foo

    def foo
    end
  end
{% end %}

Bar1.new.as(Foo).foo

We benchmark the compiler using hyperfine --warmup 5 --prepare 'crystal clear_cache' -P N 100 2000 -D 100 'N={N} crystal build --prelude=empty test.cr':

Results
Command Mean [ms] Min [ms] Max [ms] Relative
N=100 crystal build --prelude=empty test.cr 101.8 ± 1.7 99.5 105.6 1.00
N=200 crystal build --prelude=empty test.cr 110.0 ± 1.7 107.9 114.4 1.08 ± 0.02
N=300 crystal build --prelude=empty test.cr 129.3 ± 1.8 127.0 133.4 1.27 ± 0.03
N=400 crystal build --prelude=empty test.cr 161.1 ± 1.1 159.5 163.4 1.58 ± 0.03
N=500 crystal build --prelude=empty test.cr 208.8 ± 1.5 206.5 211.0 2.05 ± 0.04
N=600 crystal build --prelude=empty test.cr 280.5 ± 2.6 276.6 286.2 2.76 ± 0.05
N=700 crystal build --prelude=empty test.cr 371.4 ± 2.6 368.5 376.1 3.65 ± 0.06
N=800 crystal build --prelude=empty test.cr 489.1 ± 2.2 485.3 491.9 4.81 ± 0.08
N=900 crystal build --prelude=empty test.cr 639.0 ± 12.4 629.8 672.2 6.28 ± 0.16
N=1000 crystal build --prelude=empty test.cr 834.1 ± 21.1 811.6 875.3 8.19 ± 0.25
N=1100 crystal build --prelude=empty test.cr 1041.8 ± 10.8 1032.1 1070.5 10.23 ± 0.20
N=1200 crystal build --prelude=empty test.cr 1316.9 ± 29.0 1291.2 1376.2 12.94 ± 0.35
N=1300 crystal build --prelude=empty test.cr 1635.2 ± 40.3 1595.8 1700.6 16.06 ± 0.47
N=1400 crystal build --prelude=empty test.cr 1969.3 ± 16.8 1951.6 2007.4 19.35 ± 0.36
N=1500 crystal build --prelude=empty test.cr 2385.6 ± 39.6 2351.1 2463.1 23.44 ± 0.54
N=1600 crystal build --prelude=empty test.cr 2844.8 ± 13.6 2827.7 2865.2 27.95 ± 0.47
N=1700 crystal build --prelude=empty test.cr 3362.4 ± 41.4 3327.6 3438.0 33.03 ± 0.67
N=1800 crystal build --prelude=empty test.cr 3942.2 ± 52.5 3894.6 4034.2 38.73 ± 0.81
N=1900 crystal build --prelude=empty test.cr 4594.1 ± 61.5 4536.5 4720.2 45.13 ± 0.95
N=2000 crystal build --prelude=empty test.cr 5299.2 ± 53.3 5239.4 5405.6 52.06 ± 1.00

If we plot the mean times we could observe quadratic growth (R2 = 0.997), even though nothing in the code suggests there would be quadratic behavior. Passing --stats reveals that most of the time is spent on Crystal's codegen phase, rather than LLVM's, so this has nothing to do with large structs (e.g. #12801).


There are at least two sources contributing equally to this accidentally quadratic behavior. One is the following fragment of Crystal::CodeGenVisitor#codegen_dispatch:

target_defs.each do |a_def|
if is_super
# A super call always matches the obj type
result = int1(1)
else
result = match_type_id(owner, a_def.owner, obj_type_id)
end

There are N target defs because Foo has N includers. For each target def, the compiler calls #match_type_id:

def match_type_id(type, restriction, type_id)
match_type_id_impl(type.remove_indirection, restriction.remove_indirection, type_id)
end

Therefore, Crystal::Type#remove_indirection is also called N times for this entire dispatch. Normally this method is idempotent, but if the receiver represents a module type, then the method converts that module into a union of its including types, i.e. the type Bar1 | Bar2 | ... | Bar{{ N }}:

def remove_indirection
if including_types = self.including_types
including_types.remove_indirection
else
self
end
end

As each call produces an array with all N including types, the whole dispatch creates a total of N * N such elements, hence the quadratic behavior.

Back to Crystal::CodeGenVisitor#codegen_dispatch, knowing that owner = owner.remove_indirection is idempotent and that owner is used only by that one match_type_id, we simply call #remove_indirection before the loop. The new benchmark results with a local release build of the compiler are:

Results
Command Mean [ms] Min [ms] Max [ms] Relative
N=100 bin/crystal build --prelude=empty test.cr 112.7 ± 6.9 100.6 121.4 1.00
N=200 bin/crystal build --prelude=empty test.cr 117.9 ± 6.8 105.9 126.8 1.05 ± 0.09
N=300 bin/crystal build --prelude=empty test.cr 131.9 ± 14.9 118.8 182.0 1.17 ± 0.15
N=400 bin/crystal build --prelude=empty test.cr 145.2 ± 9.2 135.8 159.7 1.29 ± 0.11
N=500 bin/crystal build --prelude=empty test.cr 171.5 ± 9.2 163.3 187.5 1.52 ± 0.12
N=600 bin/crystal build --prelude=empty test.cr 210.5 ± 10.9 198.3 226.5 1.87 ± 0.15
N=700 bin/crystal build --prelude=empty test.cr 266.8 ± 17.9 250.2 307.7 2.37 ± 0.21
N=800 bin/crystal build --prelude=empty test.cr 327.5 ± 11.9 311.1 341.0 2.91 ± 0.21
N=900 bin/crystal build --prelude=empty test.cr 406.1 ± 13.2 389.6 425.1 3.60 ± 0.25
N=1000 bin/crystal build --prelude=empty test.cr 501.3 ± 13.9 485.8 520.5 4.45 ± 0.30
N=1100 bin/crystal build --prelude=empty test.cr 626.8 ± 22.2 605.1 669.3 5.56 ± 0.39
N=1200 bin/crystal build --prelude=empty test.cr 757.7 ± 17.7 725.3 782.9 6.73 ± 0.44
N=1300 bin/crystal build --prelude=empty test.cr 936.2 ± 30.3 896.8 1000.6 8.31 ± 0.57
N=1400 bin/crystal build --prelude=empty test.cr 1117.6 ± 26.7 1087.8 1157.9 9.92 ± 0.65
N=1500 bin/crystal build --prelude=empty test.cr 1331.4 ± 21.3 1300.4 1357.9 11.82 ± 0.74
N=1600 bin/crystal build --prelude=empty test.cr 1587.4 ± 39.8 1518.7 1654.5 14.09 ± 0.93
N=1700 bin/crystal build --prelude=empty test.cr 1855.4 ± 31.6 1811.1 1898.2 16.47 ± 1.04
N=1800 bin/crystal build --prelude=empty test.cr 2163.1 ± 14.5 2135.2 2185.9 19.20 ± 1.18
N=1900 bin/crystal build --prelude=empty test.cr 2504.5 ± 33.2 2458.3 2562.6 22.23 ± 1.38
N=2000 bin/crystal build --prelude=empty test.cr 2890.7 ± 25.6 2832.5 2920.7 25.66 ± 1.58

As expected, this roughly halves the compilation times.


The other half comes from downcasting the receiver to each of the N including types:

var = context.vars[node.name]?
if var
return unreachable if var.type.no_return?
# Special variables always have an extra pointer
already_loaded = (node.special_var? ? false : var.already_loaded)
@last = downcast var.pointer, node.type, var.type, already_loaded, extern: false

This var is a temporary Crystal::LLVMVar that is created once outside the loop:

if node_obj
new_vars["%self"] = LLVMVar.new(@last, node_obj.type, true)
end

with_cloned_context do
context.vars = new_vars

This %self's type is used only at the downcast call, which likewise uses #remove_indirection. We can therefore precompute all including types right when the LLVMVar is constructed, and this time, all the quadratic slowdown is practically gone:

Results
Command Mean [ms] Min [ms] Max [ms] Relative
N=100 bin/crystal build --prelude=empty usr/small.cr 100.2 ± 1.6 97.1 103.7 1.00
N=200 bin/crystal build --prelude=empty usr/small.cr 102.0 ± 1.2 99.4 104.6 1.02 ± 0.02
N=300 bin/crystal build --prelude=empty usr/small.cr 113.9 ± 7.1 102.2 123.9 1.14 ± 0.07
N=400 bin/crystal build --prelude=empty usr/small.cr 115.0 ± 6.9 103.2 123.7 1.15 ± 0.07
N=500 bin/crystal build --prelude=empty usr/small.cr 109.5 ± 2.2 107.1 118.0 1.09 ± 0.03
N=600 bin/crystal build --prelude=empty usr/small.cr 111.9 ± 1.1 109.9 114.1 1.12 ± 0.02
N=700 bin/crystal build --prelude=empty usr/small.cr 116.9 ± 1.2 113.7 118.4 1.17 ± 0.02
N=800 bin/crystal build --prelude=empty usr/small.cr 119.6 ± 1.0 117.5 122.1 1.19 ± 0.02
N=900 bin/crystal build --prelude=empty usr/small.cr 122.2 ± 0.9 121.1 124.3 1.22 ± 0.02
N=1000 bin/crystal build --prelude=empty usr/small.cr 125.7 ± 2.6 122.7 135.2 1.25 ± 0.03
N=1100 bin/crystal build --prelude=empty usr/small.cr 128.2 ± 1.1 125.6 129.6 1.28 ± 0.02
N=1200 bin/crystal build --prelude=empty usr/small.cr 131.1 ± 1.3 128.8 132.8 1.31 ± 0.02
N=1300 bin/crystal build --prelude=empty usr/small.cr 137.8 ± 10.5 133.2 176.9 1.38 ± 0.11
N=1400 bin/crystal build --prelude=empty usr/small.cr 139.1 ± 1.2 137.1 141.4 1.39 ± 0.03
N=1500 bin/crystal build --prelude=empty usr/small.cr 143.7 ± 2.4 141.6 148.5 1.43 ± 0.03
N=1600 bin/crystal build --prelude=empty usr/small.cr 147.2 ± 1.9 144.2 151.7 1.47 ± 0.03
N=1700 bin/crystal build --prelude=empty usr/small.cr 152.0 ± 2.1 149.3 155.0 1.52 ± 0.03
N=1800 bin/crystal build --prelude=empty usr/small.cr 156.0 ± 1.8 154.4 159.1 1.56 ± 0.03
N=1900 bin/crystal build --prelude=empty usr/small.cr 167.8 ± 11.3 163.2 207.0 1.68 ± 0.12
N=2000 bin/crystal build --prelude=empty usr/small.cr 183.4 ± 7.3 168.4 191.5 1.83 ± 0.08

There is still quadratic growth, but the compilation time remains well below 1 second until N is around 9500. (This growth is dominated by Crystal::TypeDeclarationProcessor rather than codegen.)

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a tiny little change and auch a big impact!
The explanation of whats going on is great. This is an explemplary pull requests. Thank you @HertzDevil!

@straight-shoota straight-shoota added this to the 1.14.0 milestone Sep 19, 2024
@straight-shoota straight-shoota merged commit 5f018fc into crystal-lang:master Sep 21, 2024
64 of 66 checks passed
@HertzDevil HertzDevil deleted the perf/module-dispatch-remove-indirection branch September 29, 2024 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants