I've been digging into c2-chacha to figure out why its performance is consistently better than chacha20. I think I understand the main difference now: c2-chacha has a "wide" mode where it processes four ChaCha blocks at a time. This requires two sets of registers per state word (since a single 256-bit register only has room for two blocks in parallel), but I guess on modern CPUs there are enough registers that it can handle the increased number of temporaries, and the resulting interleaved instructions seem to parallelise well.
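To make the idea concrete, here is a minimal sketch of four-block interleaving in portable Rust. It is not c2-chacha's actual code: the `Lanes` type, the word-sliced layout, and all function names are assumptions made for illustration, and it uses plain arrays instead of SIMD intrinsics, so whether it vectorises is up to the compiler. Each state word is held as four lanes, one per block, so every add/xor/rotate step advances four blocks whose counters differ by one. A scalar single-block function is included only as a reference to compare against.

```rust
use std::array::from_fn;

/// Four independent 32-bit lanes, one per ChaCha block (hypothetical layout).
type Lanes = [u32; 4];

/// Column rounds followed by diagonal rounds: together one ChaCha double round.
const ROUND_INDICES: [[usize; 4]; 8] = [
    [0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15],
    [0, 5, 10, 15], [1, 6, 11, 12], [2, 7, 8, 13], [3, 4, 9, 14],
];

fn initial_state(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u32; 16] {
    let mut s = [0u32; 16];
    // "expand 32-byte k" constants.
    s[..4].copy_from_slice(&[0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574]);
    s[4..12].copy_from_slice(key);
    s[12] = counter;
    s[13..].copy_from_slice(nonce);
    s
}

/// Scalar quarter round, used by the one-block reference below.
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]); s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]); s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Reference: one ChaCha20 block, returned as 16 words.
fn block_scalar(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u32; 16] {
    let init = initial_state(key, counter, nonce);
    let mut s = init;
    for _ in 0..10 {
        for [a, b, c, d] in ROUND_INDICES {
            quarter_round(&mut s, a, b, c, d);
        }
    }
    for i in 0..16 {
        s[i] = s[i].wrapping_add(init[i]);
    }
    s
}

fn add(a: Lanes, b: Lanes) -> Lanes {
    from_fn(|i| a[i].wrapping_add(b[i]))
}

fn xor_rotl(a: Lanes, b: Lanes, n: u32) -> Lanes {
    from_fn(|i| (a[i] ^ b[i]).rotate_left(n))
}

/// The same quarter round, applied to four blocks at once, lane by lane.
fn quarter_round_x4(s: &mut [Lanes; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = add(s[a], s[b]); s[d] = xor_rotl(s[d], s[a], 16);
    s[c] = add(s[c], s[d]); s[b] = xor_rotl(s[b], s[c], 12);
    s[a] = add(s[a], s[b]); s[d] = xor_rotl(s[d], s[a], 8);
    s[c] = add(s[c], s[d]); s[b] = xor_rotl(s[b], s[c], 7);
}

/// "Wide" sketch: blocks for counter, counter + 1, counter + 2, counter + 3.
fn blocks_wide(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [[u32; 16]; 4] {
    let base = initial_state(key, counter, nonce);
    // Broadcast every state word to all four lanes, then give each lane its own counter.
    let mut init: [Lanes; 16] = from_fn(|w| [base[w]; 4]);
    init[12] = from_fn(|lane| counter.wrapping_add(lane as u32));
    let mut s = init;
    for _ in 0..10 {
        for [a, b, c, d] in ROUND_INDICES {
            quarter_round_x4(&mut s, a, b, c, d);
        }
    }
    // Add the initial state back in and transpose to one 16-word block per lane.
    from_fn(|lane| from_fn(|w| s[w][lane].wrapping_add(init[w][lane])))
}
```

Calling `blocks_wide(&key, n, &nonce)` should give the same four blocks as calling `block_scalar` with counters `n` through `n + 3`. The point of the layout is that the four lanes never depend on each other inside a round, which is the property that lets the interleaved instructions overlap.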