Reduce calls to `Crystal::Type#remove_indirection` in module dispatch #14992

HertzDevil · 2024-09-10T16:13:46Z

tl;dr Module types are expanded into unions during Crystal's codegen phase when the receiver of a call has a module type. This PR makes this expansion occur once for the entire call, instead of once for each including type.

It is a well-known issue that dynamic dispatch on a call can lead to very slow build times when the receiver is a module type with many includers. Here is a minimal example that demonstrates this:

{% begin %}
  N = {{ env("N").to_i }}
{% end %}

module Foo
end

{% for i in 1..N %}
  struct Bar{{ i }}
    include Foo

    def foo
    end
  end
{% end %}

Bar1.new.as(Foo).foo

We benchmark the compiler using hyperfine --warmup 5 --prepare 'crystal clear_cache' -P N 100 2000 -D 100 'N={N} crystal build --prelude=empty test.cr':

Results

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`N=100 crystal build --prelude=empty test.cr`	101.8 ± 1.7	99.5	105.6	1.00
`N=200 crystal build --prelude=empty test.cr`	110.0 ± 1.7	107.9	114.4	1.08 ± 0.02
`N=300 crystal build --prelude=empty test.cr`	129.3 ± 1.8	127.0	133.4	1.27 ± 0.03
`N=400 crystal build --prelude=empty test.cr`	161.1 ± 1.1	159.5	163.4	1.58 ± 0.03
`N=500 crystal build --prelude=empty test.cr`	208.8 ± 1.5	206.5	211.0	2.05 ± 0.04
`N=600 crystal build --prelude=empty test.cr`	280.5 ± 2.6	276.6	286.2	2.76 ± 0.05
`N=700 crystal build --prelude=empty test.cr`	371.4 ± 2.6	368.5	376.1	3.65 ± 0.06
`N=800 crystal build --prelude=empty test.cr`	489.1 ± 2.2	485.3	491.9	4.81 ± 0.08
`N=900 crystal build --prelude=empty test.cr`	639.0 ± 12.4	629.8	672.2	6.28 ± 0.16
`N=1000 crystal build --prelude=empty test.cr`	834.1 ± 21.1	811.6	875.3	8.19 ± 0.25
`N=1100 crystal build --prelude=empty test.cr`	1041.8 ± 10.8	1032.1	1070.5	10.23 ± 0.20
`N=1200 crystal build --prelude=empty test.cr`	1316.9 ± 29.0	1291.2	1376.2	12.94 ± 0.35
`N=1300 crystal build --prelude=empty test.cr`	1635.2 ± 40.3	1595.8	1700.6	16.06 ± 0.47
`N=1400 crystal build --prelude=empty test.cr`	1969.3 ± 16.8	1951.6	2007.4	19.35 ± 0.36
`N=1500 crystal build --prelude=empty test.cr`	2385.6 ± 39.6	2351.1	2463.1	23.44 ± 0.54
`N=1600 crystal build --prelude=empty test.cr`	2844.8 ± 13.6	2827.7	2865.2	27.95 ± 0.47
`N=1700 crystal build --prelude=empty test.cr`	3362.4 ± 41.4	3327.6	3438.0	33.03 ± 0.67
`N=1800 crystal build --prelude=empty test.cr`	3942.2 ± 52.5	3894.6	4034.2	38.73 ± 0.81
`N=1900 crystal build --prelude=empty test.cr`	4594.1 ± 61.5	4536.5	4720.2	45.13 ± 0.95
`N=2000 crystal build --prelude=empty test.cr`	5299.2 ± 53.3	5239.4	5405.6	52.06 ± 1.00

If we plot the mean times we could observe quadratic growth (R² = 0.997), even though nothing in the code suggests there would be quadratic behavior. Passing --stats reveals that most of the time is spent on Crystal's codegen phase, rather than LLVM's, so this has nothing to do with large structs (e.g. #12801).

There are at least two sources contributing equally to this accidentally quadratic behavior. One is the following fragment of Crystal::CodeGenVisitor#codegen_dispatch:

crystal/src/compiler/crystal/codegen/call.cr

Lines 367 to 373 in ce76bf5

    
           target_defs.each do |a_def| 
        
             if is_super 
        
               # A super call always matches the obj type 
        
               result = int1(1) 
        
             else 
        
               result = match_type_id(owner, a_def.owner, obj_type_id) 
        
             end

There are N target defs because Foo has N includers. For each target def, the compiler calls #match_type_id:

crystal/src/compiler/crystal/codegen/match.cr

Lines 4 to 6 in ce76bf5

    
           def match_type_id(type, restriction, type_id) 
        
             match_type_id_impl(type.remove_indirection, restriction.remove_indirection, type_id) 
        
           end

Therefore, Crystal::Type#remove_indirection is also called N times for this entire dispatch. Normally this method is idempotent, but if the receiver represents a module type, then the method converts that module into a union of its including types, i.e. the type Bar1 | Bar2 | ... | Bar{{ N }}:

crystal/src/compiler/crystal/types.cr

Lines 1172 to 1178 in ce76bf5

    
           def remove_indirection 
        
             if including_types = self.including_types 
        
               including_types.remove_indirection 
        
             else 
        
               self 
        
             end 
        
           end

As each call produces an array with all N including types, the whole dispatch creates a total of N * N such elements, hence the quadratic behavior.

Back to Crystal::CodeGenVisitor#codegen_dispatch, knowing that owner = owner.remove_indirection is idempotent and that owner is used only by that one match_type_id, we simply call #remove_indirection before the loop. The new benchmark results with a local release build of the compiler are:

Results

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`N=100 bin/crystal build --prelude=empty test.cr`	112.7 ± 6.9	100.6	121.4	1.00
`N=200 bin/crystal build --prelude=empty test.cr`	117.9 ± 6.8	105.9	126.8	1.05 ± 0.09
`N=300 bin/crystal build --prelude=empty test.cr`	131.9 ± 14.9	118.8	182.0	1.17 ± 0.15
`N=400 bin/crystal build --prelude=empty test.cr`	145.2 ± 9.2	135.8	159.7	1.29 ± 0.11
`N=500 bin/crystal build --prelude=empty test.cr`	171.5 ± 9.2	163.3	187.5	1.52 ± 0.12
`N=600 bin/crystal build --prelude=empty test.cr`	210.5 ± 10.9	198.3	226.5	1.87 ± 0.15
`N=700 bin/crystal build --prelude=empty test.cr`	266.8 ± 17.9	250.2	307.7	2.37 ± 0.21
`N=800 bin/crystal build --prelude=empty test.cr`	327.5 ± 11.9	311.1	341.0	2.91 ± 0.21
`N=900 bin/crystal build --prelude=empty test.cr`	406.1 ± 13.2	389.6	425.1	3.60 ± 0.25
`N=1000 bin/crystal build --prelude=empty test.cr`	501.3 ± 13.9	485.8	520.5	4.45 ± 0.30
`N=1100 bin/crystal build --prelude=empty test.cr`	626.8 ± 22.2	605.1	669.3	5.56 ± 0.39
`N=1200 bin/crystal build --prelude=empty test.cr`	757.7 ± 17.7	725.3	782.9	6.73 ± 0.44
`N=1300 bin/crystal build --prelude=empty test.cr`	936.2 ± 30.3	896.8	1000.6	8.31 ± 0.57
`N=1400 bin/crystal build --prelude=empty test.cr`	1117.6 ± 26.7	1087.8	1157.9	9.92 ± 0.65
`N=1500 bin/crystal build --prelude=empty test.cr`	1331.4 ± 21.3	1300.4	1357.9	11.82 ± 0.74
`N=1600 bin/crystal build --prelude=empty test.cr`	1587.4 ± 39.8	1518.7	1654.5	14.09 ± 0.93
`N=1700 bin/crystal build --prelude=empty test.cr`	1855.4 ± 31.6	1811.1	1898.2	16.47 ± 1.04
`N=1800 bin/crystal build --prelude=empty test.cr`	2163.1 ± 14.5	2135.2	2185.9	19.20 ± 1.18
`N=1900 bin/crystal build --prelude=empty test.cr`	2504.5 ± 33.2	2458.3	2562.6	22.23 ± 1.38
`N=2000 bin/crystal build --prelude=empty test.cr`	2890.7 ± 25.6	2832.5	2920.7	25.66 ± 1.58

As expected, this roughly halves the compilation times.

The other half comes from downcasting the receiver to each of the N including types:

crystal/src/compiler/crystal/codegen/call.cr

Line 405 in ce76bf5

accept call

crystal/src/compiler/crystal/codegen/codegen.cr

Lines 1353 to 1359 in ce76bf5

    
           var = context.vars[node.name]? 
        
           if var 
        
             return unreachable if var.type.no_return? 
        
             # Special variables always have an extra pointer 
        
             already_loaded = (node.special_var? ? false : var.already_loaded) 
        
             @last = downcast var.pointer, node.type, var.type, already_loaded, extern: false

This var is a temporary Crystal::LLVMVar that is created once outside the loop:

crystal/src/compiler/crystal/codegen/call.cr

Lines 342 to 344 in ce76bf5

    
           if node_obj 
        
             new_vars["%self"] = LLVMVar.new(@last, node_obj.type, true) 
        
           end

crystal/src/compiler/crystal/codegen/call.cr

Lines 362 to 363 in ce76bf5

    
           with_cloned_context do 
        
             context.vars = new_vars

This %self's type is used only at the downcast call, which likewise uses #remove_indirection. We can therefore precompute all including types right when the LLVMVar is constructed, and this time, all the quadratic slowdown is practically gone:

Results

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`N=100 bin/crystal build --prelude=empty usr/small.cr`	100.2 ± 1.6	97.1	103.7	1.00
`N=200 bin/crystal build --prelude=empty usr/small.cr`	102.0 ± 1.2	99.4	104.6	1.02 ± 0.02
`N=300 bin/crystal build --prelude=empty usr/small.cr`	113.9 ± 7.1	102.2	123.9	1.14 ± 0.07
`N=400 bin/crystal build --prelude=empty usr/small.cr`	115.0 ± 6.9	103.2	123.7	1.15 ± 0.07
`N=500 bin/crystal build --prelude=empty usr/small.cr`	109.5 ± 2.2	107.1	118.0	1.09 ± 0.03
`N=600 bin/crystal build --prelude=empty usr/small.cr`	111.9 ± 1.1	109.9	114.1	1.12 ± 0.02
`N=700 bin/crystal build --prelude=empty usr/small.cr`	116.9 ± 1.2	113.7	118.4	1.17 ± 0.02
`N=800 bin/crystal build --prelude=empty usr/small.cr`	119.6 ± 1.0	117.5	122.1	1.19 ± 0.02
`N=900 bin/crystal build --prelude=empty usr/small.cr`	122.2 ± 0.9	121.1	124.3	1.22 ± 0.02
`N=1000 bin/crystal build --prelude=empty usr/small.cr`	125.7 ± 2.6	122.7	135.2	1.25 ± 0.03
`N=1100 bin/crystal build --prelude=empty usr/small.cr`	128.2 ± 1.1	125.6	129.6	1.28 ± 0.02
`N=1200 bin/crystal build --prelude=empty usr/small.cr`	131.1 ± 1.3	128.8	132.8	1.31 ± 0.02
`N=1300 bin/crystal build --prelude=empty usr/small.cr`	137.8 ± 10.5	133.2	176.9	1.38 ± 0.11
`N=1400 bin/crystal build --prelude=empty usr/small.cr`	139.1 ± 1.2	137.1	141.4	1.39 ± 0.03
`N=1500 bin/crystal build --prelude=empty usr/small.cr`	143.7 ± 2.4	141.6	148.5	1.43 ± 0.03
`N=1600 bin/crystal build --prelude=empty usr/small.cr`	147.2 ± 1.9	144.2	151.7	1.47 ± 0.03
`N=1700 bin/crystal build --prelude=empty usr/small.cr`	152.0 ± 2.1	149.3	155.0	1.52 ± 0.03
`N=1800 bin/crystal build --prelude=empty usr/small.cr`	156.0 ± 1.8	154.4	159.1	1.56 ± 0.03
`N=1900 bin/crystal build --prelude=empty usr/small.cr`	167.8 ± 11.3	163.2	207.0	1.68 ± 0.12
`N=2000 bin/crystal build --prelude=empty usr/small.cr`	183.4 ± 7.3	168.4	191.5	1.83 ± 0.08

There is still quadratic growth, but the compilation time remains well below 1 second until N is around 9500. (This growth is dominated by Crystal::TypeDeclarationProcessor rather than codegen.)

straight-shoota

What a tiny little change and auch a big impact!
The explanation of whats going on is great. This is an explemplary pull requests. Thank you @HertzDevil!

src/compiler/crystal/codegen/call.cr

Reduce calls to Crystal::Type#remove_indirection in module dispatch

d15d9ba

HertzDevil added performance topic:compiler:codegen labels Sep 10, 2024

straight-shoota approved these changes Sep 10, 2024

View reviewed changes

hoist #remove_indirection for downcast too

94134ff

Sija reviewed Sep 10, 2024

View reviewed changes

src/compiler/crystal/codegen/call.cr Show resolved Hide resolved

fixup

a8153a9

straight-shoota approved these changes Sep 10, 2024

View reviewed changes

straight-shoota added this to the 1.14.0 milestone Sep 19, 2024

straight-shoota merged commit 5f018fc into crystal-lang:master Sep 21, 2024
64 of 66 checks passed

HertzDevil deleted the perf/module-dispatch-remove-indirection branch September 29, 2024 20:42

BrewTestBot mentioned this pull request Oct 9, 2024

crystal 1.14.0 Homebrew/homebrew-core#193336

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce calls to `Crystal::Type#remove_indirection` in module dispatch #14992

Reduce calls to `Crystal::Type#remove_indirection` in module dispatch #14992

HertzDevil commented Sep 10, 2024 •

edited

Loading

straight-shoota left a comment

	target_defs.each do \|a_def\|
	if is_super
	# A super call always matches the obj type
	result = int1(1)
	else
	result = match_type_id(owner, a_def.owner, obj_type_id)
	end

	def match_type_id(type, restriction, type_id)
	match_type_id_impl(type.remove_indirection, restriction.remove_indirection, type_id)
	end

	def remove_indirection
	if including_types = self.including_types
	including_types.remove_indirection
	else
	self
	end
	end

	var = context.vars[node.name]?
	if var
	return unreachable if var.type.no_return?

	# Special variables always have an extra pointer
	already_loaded = (node.special_var? ? false : var.already_loaded)
	@last = downcast var.pointer, node.type, var.type, already_loaded, extern: false

	if node_obj
	new_vars["%self"] = LLVMVar.new(@last, node_obj.type, true)
	end

Reduce calls to Crystal::Type#remove_indirection in module dispatch #14992

Reduce calls to Crystal::Type#remove_indirection in module dispatch #14992

Conversation

HertzDevil commented Sep 10, 2024 • edited Loading

straight-shoota left a comment

Choose a reason for hiding this comment

Reduce calls to `Crystal::Type#remove_indirection` in module dispatch #14992

Reduce calls to `Crystal::Type#remove_indirection` in module dispatch #14992

HertzDevil commented Sep 10, 2024 •

edited

Loading