Speed up SHA1 computation #939
Conversation
Pushed a new version addressing the review. Having the function loop over a complete 20-element array allowed the compiler to further optimize the code and gave a free ~2% speedup over my initial submission. I'm not exactly sure why, but presumably it wasn't able to fully remove the bounds checks on W before. Added another commit that avoids needlessly initializing W with zeros for each block. That gives a boost of 0.8-1.3%. For some reason sha1::block_len is the only benchmark which regresses, though (~-1.8%). If anyone can explain that, I'd be all ears. In total, the code is now 52-64% faster depending on which benchmark you look at.
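To make the bounds-check remark above concrete, here is a hedged illustration (the helper names are hypothetical and do not appear in ring): indexing an array with a runtime-dependent index forces a per-access bounds check, while iterating over a fixed-size array lets the compiler prove every access is in range and elide the checks.

```rust
// Hypothetical helpers illustrating the bounds-check remark above;
// these names are illustrative only, not ring's code.

// `start` is a runtime value, so each `w[t]` access is bounds-checked
// (the compiler cannot prove `start + 20 <= 80` in general).
fn sum_window(w: &[u32; 80], start: usize) -> u32 {
    let mut acc = 0u32;
    for t in start..start + 20 {
        acc = acc.wrapping_add(w[t]);
    }
    acc
}

// Here the length is part of the type, so iteration over the whole
// array compiles without any bounds checks.
fn sum_fixed(window: &[u32; 20]) -> u32 {
    let mut acc = 0u32;
    for &x in window {
        acc = acc.wrapping_add(x);
    }
    acc
}
```

Both functions compute the same sum over 20 elements; only the second gives the optimizer a compile-time guarantee about the index range.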
Thanks @LingMan. Please squash all your commits into one commit and append the contributor agreement from https://github.com/briansmith/ring#contributing to the end of the commit message.
We should choose the version that is faster for HMAC and in particular PBKDF2, instead of choosing the version that would be faster for longer inputs.
In the current implementation, a significant portion of SHA1 computation time is spent matching on the loop variable t to determine the function and constant to be used for each round. Split the loop into four - one for each constant/function combination - and eliminate the match statement.

Performance measured with crypto-bench on a stock i5-3450 running Linux 5.4.12-1-MANJARO.

Before:
test digest::sha1::_1000 ... bench: 4,471 ns/iter (+/- 39) = 223 MB/s
test digest::sha1::_16 ... bench: 315 ns/iter (+/- 2) = 50 MB/s
test digest::sha1::_2000 ... bench: 8,903 ns/iter (+/- 208) = 224 MB/s
test digest::sha1::_256 ... bench: 1,452 ns/iter (+/- 14) = 176 MB/s
test digest::sha1::_8192 ... bench: 35,563 ns/iter (+/- 348) = 230 MB/s
test digest::sha1::block_len ... bench: 622 ns/iter (+/- 16) = 102 MB/s

After:
test digest::sha1::_1000 ... bench: 2,770 ns/iter (+/- 23) = 361 MB/s
test digest::sha1::_16 ... bench: 211 ns/iter (+/- 11) = 75 MB/s
test digest::sha1::_2000 ... bench: 5,488 ns/iter (+/- 42) = 364 MB/s
test digest::sha1::_256 ... bench: 910 ns/iter (+/- 4) = 281 MB/s
test digest::sha1::_8192 ... bench: 21,931 ns/iter (+/- 629) = 373 MB/s
test digest::sha1::block_len ... bench: 395 ns/iter (+/- 7) = 162 MB/s

Thus, the performance gain is 50-62%.

I agree to license my contributions to each file under the terms given at the top of each file I changed.
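To make the described change concrete, here is a minimal, self-contained SHA-1 sketch in the shape the commit message describes: the 80-round compression loop is split into four 20-round loops, one per (f, K) pair, so no per-round match on t is needed. This is an illustrative reimplementation, not ring's actual code; the `sha1` function and the `round!` macro are assumptions of this sketch.

```rust
// Illustrative SHA-1 with the compression loop split into four 20-round
// loops (one per f/K pair), as described in the commit message above.
// NOT ring's code; do not use hand-rolled crypto in production.
fn sha1(msg: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x6745_2301, 0xEFCD_AB89, 0x98BA_DCFE, 0x1032_5476, 0xC3D2_E1F0];

    // Standard SHA-1 padding: 0x80, zeros, then the 64-bit big-endian bit length.
    let bit_len = (msg.len() as u64) * 8;
    let mut data = msg.to_vec();
    data.push(0x80);
    while data.len() % 64 != 56 {
        data.push(0);
    }
    data.extend_from_slice(&bit_len.to_be_bytes());

    for block in data.chunks_exact(64) {
        // Message schedule W[0..80].
        let mut w = [0u32; 80];
        for t in 0..16 {
            w[t] = u32::from_be_bytes([block[4 * t], block[4 * t + 1], block[4 * t + 2], block[4 * t + 3]]);
        }
        for t in 16..80 {
            w[t] = (w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16]).rotate_left(1);
        }

        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);

        // One round, parameterized only by f and K. The four loops below
        // replace a single 0..80 loop that matched on t every iteration.
        macro_rules! round {
            ($f:expr, $k:expr, $t:expr) => {{
                let tmp = a
                    .rotate_left(5)
                    .wrapping_add($f)
                    .wrapping_add(e)
                    .wrapping_add($k)
                    .wrapping_add(w[$t]);
                e = d;
                d = c;
                c = b.rotate_left(30);
                b = a;
                a = tmp;
            }};
        }

        for t in 0..20 {
            round!((b & c) | (!b & d), 0x5A82_7999, t);
        }
        for t in 20..40 {
            round!(b ^ c ^ d, 0x6ED9_EBA1, t);
        }
        for t in 40..60 {
            round!((b & c) | (b & d) | (c & d), 0x8F1B_BCDC, t);
        }
        for t in 60..80 {
            round!(b ^ c ^ d, 0xCA62_C1D6, t);
        }

        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}
```

Because each of the four loops fixes f and K at compile time, the branch-per-round cost the commit message describes disappears entirely.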
Done and dropped the commit with mixed results.
Thanks for doing this! Are you interested in doing the same for the SHA-2 implementation?
What exactly do you have in mind for SHA2? From a quick glance it doesn't appear to suffer from the same problem. |
The first commit is just a minor style fix that doesn't really warrant its own PR.
The second commit speeds up SHA1 computation by ~50-60%. See the commit message for details. Unfortunately it makes the code a bit uglier than I initially expected, because Rust doesn't support destructuring assignments (yet?). Still, even for an implementation like ring's, which favors simplicity over speed, the trade-off is worth it in my opinion.
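For context on the destructuring-assignment wart: at the time of this PR, Rust rejected assigning to an existing tuple of variables (destructuring assignment was only stabilized later, in Rust 1.59), so the per-round state rotation had to be spelled out one assignment at a time. A small sketch of the workaround (the function name and values are illustrative, not from the PR):

```rust
// Illustrative only: the manual state rotation that destructuring
// assignment would have made a one-liner.
fn rotate_without_destructuring(
    mut a: u32,
    mut b: u32,
    mut c: u32,
    mut d: u32,
    mut e: u32,
    tmp: u32,
) -> (u32, u32, u32, u32, u32) {
    // Desired one-liner, rejected by the compiler at the time
    // (destructuring assignment stabilized in Rust 1.59):
    //     (a, b, c, d, e) = (tmp, a, b.rotate_left(30), c, d);

    // Workaround: one assignment per variable, ordered so every value
    // is read before it is overwritten.
    e = d;
    d = c;
    c = b.rotate_left(30);
    b = a;
    a = tmp;
    (a, b, c, d, e)
}
```

Five such assignments per round, across four loops, is what makes the merged code noticeably uglier than the single-tuple form.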
There's an alternative with much nicer code and the same speedup in this branch. However, that turned out to increase the binary size by almost 390 KiB. Since ring targets small devices, that seemed unacceptable. Therefore I'm PRing this solution instead.
(Edit 2020-02-23: Deleted the alternative branch linked above. It used the unroll! macro from the crunchy crate. The merged solution is superior, so no need to keep it around.)

Before:
test digest::sha1::_1000 ... bench: 4,471 ns/iter (+/- 39) = 223 MB/s
test digest::sha1::_16 ... bench: 315 ns/iter (+/- 2) = 50 MB/s
test digest::sha1::_2000 ... bench: 8,903 ns/iter (+/- 208) = 224 MB/s
test digest::sha1::_256 ... bench: 1,452 ns/iter (+/- 14) = 176 MB/s
test digest::sha1::_8192 ... bench: 35,563 ns/iter (+/- 348) = 230 MB/s
test digest::sha1::block_len ... bench: 622 ns/iter (+/- 16) = 102 MB/s
After:
test digest::sha1::_1000 ... bench: 2,801 ns/iter (+/- 22) = 357 MB/s
test digest::sha1::_16 ... bench: 215 ns/iter (+/- 20) = 74 MB/s
test digest::sha1::_2000 ... bench: 5,556 ns/iter (+/- 86) = 359 MB/s
test digest::sha1::_256 ... bench: 922 ns/iter (+/- 35) = 277 MB/s
test digest::sha1::_8192 ... bench: 22,141 ns/iter (+/- 270) = 369 MB/s
test digest::sha1::block_len ... bench: 403 ns/iter (+/- 10) = 158 MB/s