[Discussion] Hybrid SIMD+scalar #630

Closed
easyaspi314 opened this issue Nov 30, 2021 · 1 comment
Comments

@easyaspi314

easyaspi314 commented Nov 30, 2021

As I noticed when I got distracted in #607, it seems that a new experiment to try would be to mix SIMD and scalar code.

This would utilize both the integer and FPU pipelines, at the cost of some complexity.

This seems to be beneficial for AArch64 for both XXH3 and possibly even XXH64.

Notes:

  1. Is this worth the complexity of the implementation?
    • At least Clang understands two consecutive for loops.
    • Should be toggleable for size optimization.
  2. What is the correct SIMD:scalar ratio?
    • AArch64 NEON is definitely 6:2 for XXH3 (8.8 GB/s → 10.2 GB/s on Clang 13). However, current versions of GCC do not interleave properly due to a half-baked arm_neon.h.
    • I presume SSE2 will be similar.
    • Consider 32-bit vs 64-bit in the cost
    • AVX2 could be 4:4 or 4:2:2 using a 128-bit register?
  3. Should SSE2/NEON XXH64 be considered?
    • Both 32-bit x86 and ARM are known to greatly improve from full SSE2/NEON, but the complexity and lack of benefit on 64-bit make it not worth implementing.
      • NEON and 64-bit scalar take roughly the same time.
    • If a hybrid implementation benefits 64-bit enough, it can be easily adapted to use the full path on 32-bit for free.
    • 2:2 is already slightly beneficial for AArch64 NEON (2.8 GB/s → 3.1 GB/s) on Clang
      • Still worse than XXH32 (4.1 GB/s)
      • May be a dead-end optimization, since most, if not all, AArch64 CPUs internally use 32-bit multipliers.
      • Could be tweaked though, as Clang likes to interfere with it. However, using inline assembly I can only get up to 3204 MB/s.
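
To make the 6:2 idea concrete, here is a rough sketch of what a hybrid stripe accumulate could look like, written as two consecutive loops. It is modeled on the XXH3 scalar accumulate step but is not the actual xxHash code: `HYBRID_SIMD_LANES`, `accumulate_hybrid`, `accumulate_scalar`, and `hybrid_self_check` are made-up names for illustration, and it assumes a little-endian host. On anything other than little-endian AArch64 the first loop falls back to plain C, so the split is visible but there is no pipeline benefit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) && defined(__AARCH64EL__)
#  include <arm_neon.h>
#  define HYBRID_HAVE_NEON 1
#else
#  define HYBRID_HAVE_NEON 0
#endif

#define STRIPE_LANES 8            /* one 64-byte stripe = 8 x 64-bit lanes */
#ifndef HYBRID_SIMD_LANES         /* hypothetical knob, not an xxHash macro */
#  define HYBRID_SIMD_LANES 6     /* 6 SIMD + 2 scalar; 8 = all SIMD, 0 = all scalar */
#endif
#if HYBRID_SIMD_LANES % 2 != 0 || HYBRID_SIMD_LANES < 0 || HYBRID_SIMD_LANES > STRIPE_LANES
#  error "HYBRID_SIMD_LANES must be an even number from 0 to 8"
#endif

/* Unaligned 64-bit load; assumes a little-endian host. */
static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* One lane of the accumulate step on the integer pipeline. */
static void scalar_lane(uint64_t *acc, const uint8_t *in, const uint8_t *secret, int i)
{
    uint64_t data = read64(in + 8 * i);
    uint64_t key  = data ^ read64(secret + 8 * i);
    acc[i ^ 1] += data;                              /* swapped add */
    acc[i]     += (uint64_t)(uint32_t)key * (key >> 32); /* 32x32 -> 64 multiply */
}

/* Reference: all 8 lanes scalar. */
void accumulate_scalar(uint64_t *acc, const uint8_t *in, const uint8_t *secret)
{
    for (int i = 0; i < STRIPE_LANES; i++)
        scalar_lane(acc, in, secret, i);
}

/* Hybrid: the first HYBRID_SIMD_LANES lanes go to NEON, the rest stay scalar. */
void accumulate_hybrid(uint64_t *acc, const uint8_t *in, const uint8_t *secret)
{
    int i = 0;
    /* Loop 1: SIMD lanes, two 64-bit lanes per 128-bit vector. */
    for (; i < HYBRID_SIMD_LANES; i += 2) {
#if HYBRID_HAVE_NEON
        uint64x2_t data = vreinterpretq_u64_u8(vld1q_u8(in + 8 * i));
        uint64x2_t key  = veorq_u64(data, vreinterpretq_u64_u8(vld1q_u8(secret + 8 * i)));
        uint64x2_t sum  = vaddq_u64(vld1q_u64(acc + i), vextq_u64(data, data, 1));
        /* low 32 bits * high 32 bits of each key lane, added to the sum */
        sum = vmlal_u32(sum, vmovn_u64(key), vshrn_n_u64(key, 32));
        vst1q_u64(acc + i, sum);
#else
        scalar_lane(acc, in, secret, i);     /* portable stand-in for the SIMD path */
        scalar_lane(acc, in, secret, i + 1);
#endif
    }
    /* Loop 2: remaining lanes on the integer pipeline. */
    for (; i < STRIPE_LANES; i++)
        scalar_lane(acc, in, secret, i);
}

/* Checks that the hybrid split produces the same accumulators as all-scalar. */
void hybrid_self_check(void)
{
    uint8_t in[64], secret[64];
    uint64_t a[STRIPE_LANES], b[STRIPE_LANES];
    for (int i = 0; i < 64; i++) {
        in[i]     = (uint8_t)(i * 37 + 11);
        secret[i] = (uint8_t)(i * 101 + 7);
    }
    for (int i = 0; i < STRIPE_LANES; i++)
        a[i] = b[i] = 0x9E3779B97F4A7C15ULL * (uint64_t)(i + 1);
    accumulate_scalar(a, in, secret);
    accumulate_hybrid(b, in, secret);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Building with `-DHYBRID_SIMD_LANES=8` or `-DHYBRID_SIMD_LANES=0` turns the split off, which is one way the size-optimization toggle could work. Whether the compiler actually interleaves the two loops is up to the compiler, so any ratio would need to be benchmarked per target.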
@easyaspi314

This is only beneficial on AArch64, so I am sticking to #632 only.
