
GH-45611: [C++][Acero] Improve Swiss join build performance by partitioning batches ahead to reduce contention #45612

Conversation

@zanmato1984 (Contributor) commented Feb 24, 2025

Rationale for this change

High contention is observed in the Swiss join build phase, as shown in #45611.

A little background on the contention. To build the hash table in parallel, we first build N partitioned hash tables (the "build" stage), then merge them into the final hash table (the "merge" stage, less interesting for this PR). In the build stage, each of the exec batches from the build side table is distributed to one of the M threads. Each thread processes each of its assigned batches by:

  1. Partition the batch based on the hash of the join key into N partitions;
  2. Insert the rows of each of the N partitions into the corresponding one of the N partitioned hash tables.

Because each batch contains arbitrary data, all M threads may write to all N partitioned hash tables simultaneously. So we guard each partitioned hash table with a (spin) lock, and that is where the contention comes from.
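
A minimal, self-contained sketch of this flow (toy types and names of my own, not the actual Arrow code; the real implementation uses Swiss tables and a spin lock per partition):

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

constexpr int kNumPartitions = 8;  // N

struct PartitionedTable {
  std::mutex lock;  // stand-in for Arrow's spin lock
  std::unordered_multimap<uint64_t, int64_t> rows;  // key hash -> row id
};

std::vector<PartitionedTable> tables(kNumPartitions);

// Old flow: each of the M worker threads runs this for every batch assigned
// to it, so all M threads contend on the same N locks.
void BuildBatch(const std::vector<uint64_t>& key_hashes, int64_t base_row_id) {
  // 1. Partition row indices by the hash of the join key.
  std::vector<std::vector<int64_t>> prtn_rows(kNumPartitions);
  for (size_t i = 0; i < key_hashes.size(); ++i) {
    prtn_rows[key_hashes[i] % kNumPartitions].push_back(int64_t(i));
  }
  // 2. Insert each partition's rows into the shared partitioned tables.
  for (int p = 0; p < kNumPartitions; ++p) {
    std::lock_guard<std::mutex> guard(tables[p].lock);  // the contention point
    for (int64_t i : prtn_rows[p]) {
      tables[p].rows.emplace(key_hashes[i], base_row_id + i);
    }
  }
}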

What changes are included in this PR?

Instead of all M threads writing to all N partitioned hash tables simultaneously, we can further split the build stage into two:

  1. Partition stage: M threads; each partitions its assigned batches and preserves the partition info of each batch.
  2. (New) Build stage: N threads; each builds one of the N partitioned hash tables by iterating over all the batches and inserting only the rows that belong to its partition.
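
A sketch of the two stages under the same toy assumptions as above (the real BatchState uses narrower types, e.g. uint32_t hashes and uint16_t row ids, as discussed in the review below):

// Per-batch partition info preserved between the two stages.
struct BatchState {
  std::vector<uint64_t> hashes;       // one entry per row
  std::vector<size_t> prtn_ranges;    // one entry per partition, plus one
  std::vector<int64_t> prtn_row_ids;  // row indices grouped by partition
};

// Stage 1 (M threads): partition a batch and preserve the partition info.
BatchState PartitionBatch(const std::vector<uint64_t>& key_hashes) {
  BatchState state;
  state.hashes = key_hashes;
  std::vector<std::vector<int64_t>> buckets(kNumPartitions);
  for (size_t i = 0; i < key_hashes.size(); ++i) {
    buckets[key_hashes[i] % kNumPartitions].push_back(int64_t(i));
  }
  state.prtn_ranges.push_back(0);
  for (const auto& bucket : buckets) {
    state.prtn_row_ids.insert(state.prtn_row_ids.end(), bucket.begin(),
                              bucket.end());
    state.prtn_ranges.push_back(state.prtn_row_ids.size());
  }
  return state;
}

// Stage 2 (N threads): thread p scans every batch and inserts only the rows
// of its own partition, so table p is written by exactly one thread and
// needs no lock at all.
void BuildPartition(int p, const std::vector<BatchState>& batches) {
  for (const BatchState& b : batches) {
    for (size_t i = b.prtn_ranges[p]; i < b.prtn_ranges[p + 1]; ++i) {
      int64_t row = b.prtn_row_ids[i];
      tables[p].rows.emplace(b.hashes[row], row);  // no lock required
    }
  }
}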

Performance

Take this benchmark, which is dedicated to the performance of the parallel build. The results show that by eliminating the contention, we achieve up to a 10x (on Arm) and 5x (on Intel) performance boost for the Swiss join build. I picked krows=64 and krows=512 and made a chart.

[Chart: Swiss join build throughput vs. threads, Arm]

[Chart: Swiss join build throughput vs. threads, Intel]

Note that single-thread performance is actually down a little (reasons detailed in the Overhead section below). But IMO this is trivial compared to the total win in the multi-threaded cases.

Detailed benchmark numbers (on Arm) follow.

Benchmark Before
Run on (10 X 24.1216 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 3.47, 2.76, 2.54
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:1/process_time          53315 ns        53284 ns        12295 rows/sec=19.2179M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:1/process_time          73001 ns        80862 ns         8606 rows/sec=12.6636M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:1/process_time          88003 ns        95127 ns         7429 rows/sec=10.7645M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:1/process_time          93248 ns       120317 ns         5135 rows/sec=8.51085M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:1/process_time         109931 ns       140384 ns         4527 rows/sec=7.29427M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:1/process_time         127997 ns       180633 ns         3546 rows/sec=5.66897M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:1/process_time         125138 ns       185416 ns         3267 rows/sec=5.52271M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:1/process_time         142611 ns       236355 ns         3613 rows/sec=4.33247M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:1/process_time         169663 ns       336376 ns         2158 rows/sec=3.04421M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:1/process_time        174708 ns       362630 ns         1943 rows/sec=2.82381M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:1/process_time        186939 ns       409803 ns         1693 rows/sec=2.49876M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:1/process_time        196817 ns       451213 ns         1542 rows/sec=2.26944M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:1/process_time        209194 ns       501488 ns         1407 rows/sec=2.04192M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:1/process_time        218517 ns       544590 ns         1299 rows/sec=1.88031M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:1/process_time        224407 ns       579947 ns         1206 rows/sec=1.76568M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:1/process_time        236201 ns       630016 ns         1134 rows/sec=1.62536M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:8/process_time         213061 ns       213082 ns         3276 rows/sec=38.4453M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:8/process_time         260230 ns       374124 ns         1900 rows/sec=21.8965M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:8/process_time         275723 ns       483754 ns         1331 rows/sec=16.9342M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:8/process_time         326784 ns       711857 ns          974 rows/sec=11.5079M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:8/process_time         351987 ns       861883 ns          798 rows/sec=9.50477M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:8/process_time         370956 ns      1000389 ns          683 rows/sec=8.18881M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:8/process_time         384963 ns      1064672 ns          646 rows/sec=7.69439M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:8/process_time         406914 ns      1172464 ns          606 rows/sec=6.987M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:8/process_time         425632 ns      1252871 ns          567 rows/sec=6.53858M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:8/process_time        433262 ns      1287050 ns          524 rows/sec=6.36494M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:8/process_time        443328 ns      1329822 ns          528 rows/sec=6.16022M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:8/process_time        450736 ns      1383203 ns          508 rows/sec=5.92249M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:8/process_time        465523 ns      1425956 ns          495 rows/sec=5.74492M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:8/process_time        471723 ns      1462440 ns          475 rows/sec=5.6016M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:8/process_time        484823 ns      1524638 ns          464 rows/sec=5.37308M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:8/process_time        485260 ns      1541146 ns          453 rows/sec=5.31553M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:64/process_time       1716517 ns      1716522 ns          404 rows/sec=38.1795M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:64/process_time       1762125 ns      2982570 ns          235 rows/sec=21.973M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:64/process_time       1826549 ns      4331683 ns          161 rows/sec=15.1295M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:64/process_time       2032670 ns      6228081 ns          111 rows/sec=10.5227M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:64/process_time       2008129 ns      7401860 ns           93 rows/sec=8.85399M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:64/process_time       2022595 ns      8733805 ns           77 rows/sec=7.50372M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:64/process_time       2084620 ns     10333721 ns           68 rows/sec=6.34196M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:64/process_time       2186912 ns     12275696 ns           56 rows/sec=5.33868M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:64/process_time       3061302 ns     20949833 ns           24 rows/sec=3.12823M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:64/process_time      4241129 ns     34483810 ns           21 rows/sec=1.90049M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:64/process_time      4123000 ns     33438545 ns           22 rows/sec=1.95989M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:64/process_time      5282983 ns     44385773 ns           22 rows/sec=1.47651M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:64/process_time      4214940 ns     33978250 ns           16 rows/sec=1.92876M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:64/process_time      9775500 ns     85277400 ns           10 rows/sec=768.504k/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:64/process_time      8448605 ns     40459190 ns           21 rows/sec=1.61981M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:64/process_time      8311054 ns     74384765 ns           17 rows/sec=881.041k/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:512/process_time     15124972 ns     15124152 ns           46 rows/sec=34.6656M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:512/process_time      9977718 ns     19336583 ns           36 rows/sec=27.1138M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:512/process_time      8751039 ns     23240667 ns           30 rows/sec=22.5591M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:512/process_time      9839327 ns     33597150 ns           20 rows/sec=15.6051M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:512/process_time     10058853 ns     41758118 ns           17 rows/sec=12.5554M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:512/process_time     10139465 ns     49509846 ns           13 rows/sec=10.5896M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:512/process_time     10311708 ns     58393545 ns           11 rows/sec=8.97853M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:512/process_time     10327653 ns     65427667 ns            9 rows/sec=8.01325M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:512/process_time     13476536 ns     99947571 ns            7 rows/sec=5.24563M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:512/process_time    17290050 ns    143569000 ns            5 rows/sec=3.65182M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:512/process_time    20576010 ns    176557250 ns            4 rows/sec=2.96951M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:512/process_time    24393117 ns    205985600 ns            5 rows/sec=2.54527M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:512/process_time    21039639 ns    168724000 ns            3 rows/sec=3.10737M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:512/process_time    38604708 ns    333330667 ns            3 rows/sec=1.57288M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:512/process_time    63189833 ns    502763000 ns            1 rows/sec=1042.81k/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:512/process_time    91289749 ns    731794000 ns            1 rows/sec=716.442k/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:4096/process_time   164686385 ns    164197000 ns            4 rows/sec=25.5443M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:4096/process_time   112767458 ns    217052333 ns            3 rows/sec=19.3239M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:4096/process_time   100643792 ns    245290000 ns            3 rows/sec=17.0994M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:4096/process_time    74837889 ns    268070667 ns            3 rows/sec=15.6463M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:4096/process_time    63174056 ns    269879667 ns            3 rows/sec=15.5414M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:4096/process_time    59140353 ns    294662000 ns            2 rows/sec=14.2343M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:4096/process_time    64158124 ns    354435000 ns            2 rows/sec=11.8338M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:4096/process_time    70799208 ns    465744500 ns            2 rows/sec=9.00559M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:4096/process_time   118786833 ns    730395500 ns            2 rows/sec=5.74251M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:4096/process_time  158779374 ns   1254764000 ns            1 rows/sec=3.3427M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:4096/process_time  124160834 ns    985925000 ns            1 rows/sec=4.25418M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:4096/process_time  261909918 ns   1956600000 ns            1 rows/sec=2.14367M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:4096/process_time  437582374 ns   3326539000 ns            1 rows/sec=1.26086M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:4096/process_time  225402042 ns   1756542000 ns            1 rows/sec=2.38782M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:4096/process_time  284178668 ns   2485382000 ns            1 rows/sec=1.68759M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:4096/process_time  198744084 ns   1697137000 ns            1 rows/sec=2.4714M/s
Benchmark After
Running ./arrow-acero-hash-join-benchmark
Run on (10 X 24.1886 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 3.72, 3.38, 3.20
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:1/process_time          64162 ns        60216 ns        11306 rows/sec=17.0054M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:1/process_time          73712 ns        85168 ns         8287 rows/sec=12.0233M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:1/process_time          81532 ns       108468 ns         6563 rows/sec=9.44057M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:1/process_time          90389 ns       125957 ns         5590 rows/sec=8.12979M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:1/process_time          98131 ns       144575 ns         3912 rows/sec=7.08281M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:1/process_time         112269 ns       171638 ns         3551 rows/sec=5.96605M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:1/process_time         127481 ns       207426 ns         3053 rows/sec=4.93669M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:1/process_time         135240 ns       221817 ns         3337 rows/sec=4.61641M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:1/process_time         167247 ns       323541 ns         2152 rows/sec=3.16497M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:1/process_time        173753 ns       363113 ns         1913 rows/sec=2.82006M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:1/process_time        182739 ns       404210 ns         1717 rows/sec=2.53334M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:1/process_time        194151 ns       451175 ns         1542 rows/sec=2.26963M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:1/process_time        205538 ns       496195 ns         1423 rows/sec=2.06371M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:1/process_time        217099 ns       540857 ns         1259 rows/sec=1.89329M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:1/process_time        228487 ns       591203 ns         1274 rows/sec=1.73206M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:1/process_time        240082 ns       642682 ns         1087 rows/sec=1.59332M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:8/process_time         218917 ns       218912 ns         3219 rows/sec=37.4214M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:8/process_time         239310 ns       338138 ns         2066 rows/sec=24.2268M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:8/process_time         284833 ns       411252 ns         1570 rows/sec=19.9197M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:8/process_time         315525 ns       496170 ns         1437 rows/sec=16.5105M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:8/process_time         329116 ns       557150 ns         1246 rows/sec=14.7034M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:8/process_time         339415 ns       612913 ns         1123 rows/sec=13.3657M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:8/process_time         354355 ns       673437 ns         1040 rows/sec=12.1645M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:8/process_time         371602 ns       736217 ns          948 rows/sec=11.1271M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:8/process_time         388963 ns       788646 ns          870 rows/sec=10.3874M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:8/process_time        398060 ns       838691 ns          850 rows/sec=9.76761M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:8/process_time        403233 ns       875477 ns          789 rows/sec=9.35719M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:8/process_time        410908 ns       917480 ns          748 rows/sec=8.92881M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:8/process_time        425442 ns       971118 ns          702 rows/sec=8.43564M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:8/process_time        427492 ns      1002726 ns          718 rows/sec=8.16973M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:8/process_time        442728 ns      1057910 ns          653 rows/sec=7.74357M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:8/process_time        455481 ns      1115695 ns          642 rows/sec=7.34251M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:64/process_time       1731379 ns      1731375 ns          403 rows/sec=37.852M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:64/process_time       1179658 ns      2152165 ns          328 rows/sec=30.4512M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:64/process_time       1116942 ns      2232095 ns          316 rows/sec=29.3608M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:64/process_time        814811 ns      2498054 ns          276 rows/sec=26.2348M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:64/process_time        900296 ns      2959111 ns          235 rows/sec=22.1472M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:64/process_time        917596 ns      3253949 ns          215 rows/sec=20.1405M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:64/process_time        920826 ns      3526660 ns          197 rows/sec=18.583M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:64/process_time        811062 ns      3789065 ns          184 rows/sec=17.2961M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:64/process_time       1031480 ns      5637721 ns          122 rows/sec=11.6246M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:64/process_time      1072212 ns      6040280 ns          118 rows/sec=10.8498M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:64/process_time      1088001 ns      6204862 ns          116 rows/sec=10.562M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:64/process_time      1119427 ns      6356310 ns          113 rows/sec=10.3104M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:64/process_time      1128651 ns      6542557 ns          115 rows/sec=10.0169M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:64/process_time      1152430 ns      6731112 ns          107 rows/sec=9.73628M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:64/process_time      1161581 ns      6772318 ns          107 rows/sec=9.67704M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:64/process_time      1171040 ns      6748462 ns          106 rows/sec=9.71125M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:512/process_time     16584785 ns     16419156 ns           45 rows/sec=31.9315M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:512/process_time      9782162 ns     18750500 ns           36 rows/sec=27.9613M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:512/process_time      9204909 ns     18933861 ns           36 rows/sec=27.6905M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:512/process_time      5665851 ns     20187600 ns           35 rows/sec=25.9708M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:512/process_time      6824165 ns     24445690 ns           29 rows/sec=21.4471M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:512/process_time      6476403 ns     25448704 ns           27 rows/sec=20.6018M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:512/process_time      6380011 ns     26670808 ns           26 rows/sec=19.6577M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:512/process_time      4994868 ns     29002792 ns           24 rows/sec=18.0772M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:512/process_time      6097037 ns     37510263 ns           19 rows/sec=13.9772M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:512/process_time     6024000 ns     40356889 ns           18 rows/sec=12.9913M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:512/process_time     6167103 ns     41287529 ns           17 rows/sec=12.6985M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:512/process_time     6087725 ns     40475722 ns           18 rows/sec=12.9531M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:512/process_time     6163463 ns     41720647 ns           17 rows/sec=12.5666M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:512/process_time     6056402 ns     40388529 ns           17 rows/sec=12.9811M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:512/process_time     5972958 ns     40973824 ns           17 rows/sec=12.7957M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:512/process_time     6593174 ns     40719647 ns           17 rows/sec=12.8756M/s
BM_HashJoinBasic_BuildParallelism/Threads:1/HashTable krows:4096/process_time   174475083 ns    174058000 ns            3 rows/sec=24.0972M/s
BM_HashJoinBasic_BuildParallelism/Threads:2/HashTable krows:4096/process_time   109935347 ns    200222667 ns            3 rows/sec=20.9482M/s
BM_HashJoinBasic_BuildParallelism/Threads:3/HashTable krows:4096/process_time    89852042 ns    187011000 ns            3 rows/sec=22.4281M/s
BM_HashJoinBasic_BuildParallelism/Threads:4/HashTable krows:4096/process_time    57974139 ns    202076667 ns            3 rows/sec=20.756M/s
BM_HashJoinBasic_BuildParallelism/Threads:5/HashTable krows:4096/process_time    57160194 ns    210744667 ns            3 rows/sec=19.9023M/s
BM_HashJoinBasic_BuildParallelism/Threads:6/HashTable krows:4096/process_time    56770167 ns    221233000 ns            3 rows/sec=18.9588M/s
BM_HashJoinBasic_BuildParallelism/Threads:7/HashTable krows:4096/process_time    59031097 ns    241927000 ns            3 rows/sec=17.3371M/s
BM_HashJoinBasic_BuildParallelism/Threads:8/HashTable krows:4096/process_time    46069291 ns    263787667 ns            3 rows/sec=15.9003M/s
BM_HashJoinBasic_BuildParallelism/Threads:9/HashTable krows:4096/process_time    51498374 ns    310020500 ns            2 rows/sec=13.5291M/s
BM_HashJoinBasic_BuildParallelism/Threads:10/HashTable krows:4096/process_time   52055417 ns    319261500 ns            2 rows/sec=13.1375M/s
BM_HashJoinBasic_BuildParallelism/Threads:11/HashTable krows:4096/process_time   49418250 ns    331526500 ns            2 rows/sec=12.6515M/s
BM_HashJoinBasic_BuildParallelism/Threads:12/HashTable krows:4096/process_time   53305833 ns    332126000 ns            2 rows/sec=12.6287M/s
BM_HashJoinBasic_BuildParallelism/Threads:13/HashTable krows:4096/process_time   48910062 ns    325631500 ns            2 rows/sec=12.8805M/s
BM_HashJoinBasic_BuildParallelism/Threads:14/HashTable krows:4096/process_time   52218458 ns    312798500 ns            2 rows/sec=13.409M/s
BM_HashJoinBasic_BuildParallelism/Threads:15/HashTable krows:4096/process_time   51131709 ns    344045500 ns            2 rows/sec=12.1911M/s
BM_HashJoinBasic_BuildParallelism/Threads:16/HashTable krows:4096/process_time   55233376 ns    338843500 ns            2 rows/sec=12.3783M/s

Overhead

This change does introduce some overhead. First, in the old implementation the partition info is used right away after partitioning the batch, whereas the new implementation preserves the partition info and uses it in the next stage (potentially on another thread). This may be less cache friendly. Second, preserving the partition info requires more memory: the increased allocation may hurt performance a bit, and it worsens the memory profile by 6 bytes per row (4 bytes for the hash and 2 bytes for the row id within the partition).

But as mentioned above, almost all multi-threaded cases win. Even nicer, the increased memory footprint spans only a short period and doesn't really increase the peak memory: the peak always comes in the merge stage, and by then the preserved partition info for all batches has already been released. I verified this locally by printing the memory pool stats while benchmarking.

Are these changes tested?

Yes. Existing tests suffice.

Are there any user-facing changes?

None.


⚠️ GitHub issue #45611 has no components, please add labels for components.

@zanmato1984 force-pushed the reduce-swiss-join-build-contention branch from 281ca6d to 30f6adf on February 24, 2025 14:15
@zanmato1984 marked this pull request as ready for review on February 24, 2025 15:51
@zanmato1984 (Contributor, Author) commented:

I'm pretty excited about this optimization. @pitrou @westonpace would you mind taking a look? Thanks.

@pitrou (Member) commented Feb 24, 2025:

Does this also remove the spin locks?

Comment on lines +617 to +619
std::vector<uint32_t> hashes;        // one value per row (4 bytes each)
std::vector<uint16_t> prtn_ranges;   // one element per partition
std::vector<uint16_t> prtn_row_ids;  // one value per row: row id in partition (2 bytes each)
@pitrou (Member):
Are these one value per row?

@zanmato1984 (Contributor, Author):

Yes, this is what I meant in the Overhead section of the PR description (quoted below).

... and worsen the memory profile by 6 bytes per row (4 bytes for hash and 2 bytes for row id in partition).

Some more details you may also want to know:

  • prtn_ranges has one element per partition.
  • The BatchState struct is one per batch.

Both have lower space complexity than the per-row data.
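
For illustration (my own numbers, assuming a 32,768-row batch and N = 16 partitions): hashes takes 32,768 × 4 B = 128 KiB and prtn_row_ids takes 32,768 × 2 B = 64 KiB while the batch's partition info is alive, whereas prtn_ranges needs only about (16 + 1) × 2 B = 34 B.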

@pitrou (Member) commented Feb 24, 2025:

Besides the two comments above, I don't feel competent to review this, sorry.

@zanmato1984 (Contributor, Author) commented:

Does this also remove the spin locks?

This removes the spin lock usage in Swiss join build, but the spin lock itself, namely PartitionLocks::AcquirePartitionLock, is not removed because it is also used elsewhere.

Besides the two comments above, I don't feel competent to review this, sorry.

No problem, I appreciate the comments anyway!

@@ -1112,7 +1113,7 @@ Status SwissTableForJoinBuild::Init(SwissTableForJoin* target, int dop, int64_t

   // Make sure that we do not use many partitions if there are not enough rows.
   //
-  constexpr int64_t min_num_rows_per_prtn = 1 << 18;
+  constexpr int64_t min_num_rows_per_prtn = 1 << 12;
@zanmato1984 (Contributor, Author):
Since the contention is eliminated, we can be a little more aggressive on the parallelism.
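
For intuition, a hypothetical sketch (my own, not the actual Init logic) of how such a per-partition minimum caps the partition count; lowering the constant from 1 << 18 to 1 << 12 lets smaller build sides fan out to more partitions:

#include <algorithm>
#include <cstdint>

int ChoosePartitionCount(int64_t num_rows, int dop) {
  constexpr int64_t min_num_rows_per_prtn = int64_t{1} << 12;
  // Do not create partitions that would each hold fewer than
  // min_num_rows_per_prtn rows, and never exceed the degree of parallelism.
  int64_t max_by_rows = std::max<int64_t>(1, num_rows / min_num_rows_per_prtn);
  return static_cast<int>(std::min<int64_t>(dop, max_by_rows));
}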
