[Discussion] Hybrid SIMD+scalar #630

easyaspi314 · 2021-11-30T21:48:46Z

As I noticed when I got distracted in #607, a new experiment worth trying would be to mix SIMD and scalar code.

This would use both the integer and FP/SIMD pipelines at the same time, at the cost of some complexity.

This seems to be beneficial for AArch64 for both XXH3 and possibly even XXH64.

Notes:

Is this worth the complexity of the implementation?
- At least Clang understands two consecutive for loops.
- Should be toggleable for size optimization.
What is the correct ratio (SIMD lanes : scalar lanes)?
- AArch64 NEON is definitely 6:2 for XXH3 (8.8 GB/s → 10.2 GB/s on Clang 13); however, current versions of GCC do not interleave properly due to a half-baked arm_neon.h.
- I presume SSE2 will be similar.
- Consider 32-bit vs 64-bit in the cost
- AVX2 could be 4:4 or 4:2:2 using a 128-bit register?
Should SSE2/NEON XXH64 be considered?
- Both 32-bit x86 and ARM are known to greatly improve from full SSE2/NEON, but the complexity and lack of benefit on 64-bit make it not worth implementing.
  - NEON and 64-bit scalar take roughly the same time.
- If a hybrid implementation benefits 64-bit enough, it can be easily adapted to use the full path on 32-bit for free.
- 2:2 is already slightly beneficial for AArch64 NEON (2.8 GB/s → 3.1 GB/s) on Clang
  - Still worse than XXH32 (4.1 GB/s)
  - May be a dead-end optimization, since most, if not all, AArch64 CPUs internally use 32-bit multipliers.
  - Could be tweaked, though, as Clang likes to interfere with it. However, even with inline assembly I can only get up to 3204 MB/s (about 3.2 GB/s).
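
For illustration, here is a minimal sketch of what a 6:2 split of the XXH3 accumulate step could look like as two consecutive loops. It is not the code from #607 or #632. The names `SIMD_LANES`, `simd_pair`, `scalar_lane`, `accumulate_hybrid` and `accumulate_scalar` are hypothetical, the NEON path assumes little-endian AArch64, and on other targets the "SIMD" loop simply falls back to scalar so the sketch still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#  include <arm_neon.h>
#  define HAVE_NEON 1
#else
#  define HAVE_NEON 0
#endif

#define ACC_LANES  8   /* XXH3 keeps 8 x 64-bit accumulators per 64-byte stripe */
#define SIMD_LANES 6   /* hypothetical knob: lanes given to SIMD, must be even */

/* Unaligned native-endian 64-bit load. */
static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* One scalar lane: add the input to the neighbouring lane and
 * add the 32x32->64 product of the keyed input halves to this lane. */
static void scalar_lane(uint64_t *acc, const uint8_t *in,
                        const uint8_t *secret, size_t lane)
{
    uint64_t data = read64(in + 8 * lane);
    uint64_t key  = data ^ read64(secret + 8 * lane);
    acc[lane ^ 1] += data;
    acc[lane]     += (key & 0xFFFFFFFFu) * (key >> 32);
}

/* Two adjacent lanes at once (lanes 2*pair and 2*pair+1). */
static void simd_pair(uint64_t *acc, const uint8_t *in,
                      const uint8_t *secret, size_t pair)
{
#if HAVE_NEON
    uint64x2_t data = vreinterpretq_u64_u8(vld1q_u8(in + 16 * pair));
    uint64x2_t skey = vreinterpretq_u64_u8(vld1q_u8(secret + 16 * pair));
    uint64x2_t key  = veorq_u64(data, skey);
    uint32x2_t lo   = vmovn_u64(key);               /* low 32 bits of each lane  */
    uint32x2_t hi   = vshrn_n_u64(key, 32);         /* high 32 bits of each lane */
    uint64x2_t a    = vld1q_u64(acc + 2 * pair);
    a = vaddq_u64(a, vextq_u64(data, data, 1));     /* add swapped input */
    a = vmlal_u32(a, lo, hi);                       /* a += lo * hi (widening) */
    vst1q_u64(acc + 2 * pair, a);
#else
    /* Portable fallback so the sketch builds everywhere. */
    scalar_lane(acc, in, secret, 2 * pair);
    scalar_lane(acc, in, secret, 2 * pair + 1);
#endif
}

/* Hybrid stripe: the first SIMD_LANES lanes go through the vector path,
 * the rest through the integer path. The two loops touch disjoint
 * accumulators, so the compiler is free to interleave them. */
void accumulate_hybrid(uint64_t *acc, const uint8_t *in, const uint8_t *secret)
{
    size_t i;
    for (i = 0; i < SIMD_LANES / 2; i++)
        simd_pair(acc, in, secret, i);
    for (i = SIMD_LANES; i < ACC_LANES; i++)
        scalar_lane(acc, in, secret, i);
}

/* All-scalar reference, useful for checking the hybrid split. */
void accumulate_scalar(uint64_t *acc, const uint8_t *in, const uint8_t *secret)
{
    size_t i;
    for (i = 0; i < ACC_LANES; i++)
        scalar_lane(acc, in, secret, i);
}
```

Because `lane ^ 1` always stays inside a pair, any even `SIMD_LANES` keeps the two loops independent. Setting it to 8 or 0 would give the pure SIMD or pure scalar variant, which is one way the size-optimization toggle could be expressed.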


easyaspi314 · 2021-12-01T01:15:10Z

This is only beneficial on AArch64, so I am sticking to #632 only.

easyaspi314 closed this as completed Dec 1, 2021
