Speed up SHA1 computation #939
Conversation
Pushed a new version addressing the review. Having the function loop over a complete 20-element array allowed the compiler to further optimize the code and gave a free ~2% speedup over my initial submission. I'm not exactly sure why, but presumably it wasn't able to fully remove the bounds checks on W before. Added another commit that avoids needlessly initializing W with zeros for each block. That gives a boost of 0.8-1.3%. For some reason sha1::block_len is the only benchmark which regresses, though (~-1.8%). If anyone can explain that, I'd be all ears. In total, the code is now 52-64% faster depending on which benchmark you look at.
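To make the bounds-check remark above concrete, here is a hedged illustration (the helper names are hypothetical and do not appear in ring): indexing an array with a runtime-dependent index forces a per-access bounds check, while iterating over a fixed-size array lets the compiler prove every access is in range and elide the checks.

```rust
// Hypothetical helpers illustrating the bounds-check remark above;
// these names are illustrative only, not ring's code.

// `start` is a runtime value, so each `w[t]` access is bounds-checked
// (the compiler cannot prove `start + 20 <= 80` in general).
fn sum_window(w: &[u32; 80], start: usize) -> u32 {
    let mut acc = 0u32;
    for t in start..start + 20 {
        acc = acc.wrapping_add(w[t]);
    }
    acc
}

// Here the length is part of the type, so iteration over the whole
// array compiles without any bounds checks.
fn sum_fixed(window: &[u32; 20]) -> u32 {
    let mut acc = 0u32;
    for &x in window {
        acc = acc.wrapping_add(x);
    }
    acc
}
```

Both functions compute the same sum over 20 elements; only the second gives the optimizer a compile-time guarantee about the index range.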
Thanks @LingMan. Please squash all your commits into one commit and append the contributor agreement from https://github.com/briansmith/ring#contributing to the end of the commit message.
We should choose the version that is faster for HMAC and in particular PBKDF2, instead of choosing the version that would be faster for longer inputs.
In the current implementation, a significant portion of SHA1 computation time is spent matching on the loop variable t to determine the function and constant to be used for each round. Split the loop into four - one for each constant/function combination - and eliminate the match statement.

Performance measured with crypto-bench on a stock i5-3450 running Linux 5.4.12-1-MANJARO.

Before:
test digest::sha1::_1000 ... bench: 4,471 ns/iter (+/- 39) = 223 MB/s
test digest::sha1::_16 ... bench: 315 ns/iter (+/- 2) = 50 MB/s
test digest::sha1::_2000 ... bench: 8,903 ns/iter (+/- 208) = 224 MB/s
test digest::sha1::_256 ... bench: 1,452 ns/iter (+/- 14) = 176 MB/s
test digest::sha1::_8192 ... bench: 35,563 ns/iter (+/- 348) = 230 MB/s
test digest::sha1::block_len ... bench: 622 ns/iter (+/- 16) = 102 MB/s

After:
test digest::sha1::_1000 ... bench: 2,770 ns/iter (+/- 23) = 361 MB/s
test digest::sha1::_16 ... bench: 211 ns/iter (+/- 11) = 75 MB/s
test digest::sha1::_2000 ... bench: 5,488 ns/iter (+/- 42) = 364 MB/s
test digest::sha1::_256 ... bench: 910 ns/iter (+/- 4) = 281 MB/s
test digest::sha1::_8192 ... bench: 21,931 ns/iter (+/- 629) = 373 MB/s
test digest::sha1::block_len ... bench: 395 ns/iter (+/- 7) = 162 MB/s

Thus, the performance gain is 50-62%.

I agree to license my contributions to each file under the terms given at the top of each file I changed.
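To make the described change concrete, here is a minimal, self-contained SHA-1 sketch in the shape the commit message describes: the 80-round compression loop is split into four 20-round loops, one per (f, K) pair, so no per-round match on t is needed. This is an illustrative reimplementation, not ring's actual code; the `sha1` function and the `round!` macro are assumptions of this sketch.

```rust
// Illustrative SHA-1 with the compression loop split into four 20-round
// loops (one per f/K pair), as described in the commit message above.
// NOT ring's code; do not use hand-rolled crypto in production.
fn sha1(msg: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x6745_2301, 0xEFCD_AB89, 0x98BA_DCFE, 0x1032_5476, 0xC3D2_E1F0];

    // Standard SHA-1 padding: 0x80, zeros, then the 64-bit big-endian bit length.
    let bit_len = (msg.len() as u64) * 8;
    let mut data = msg.to_vec();
    data.push(0x80);
    while data.len() % 64 != 56 {
        data.push(0);
    }
    data.extend_from_slice(&bit_len.to_be_bytes());

    for block in data.chunks_exact(64) {
        // Message schedule W[0..80].
        let mut w = [0u32; 80];
        for t in 0..16 {
            w[t] = u32::from_be_bytes([block[4 * t], block[4 * t + 1], block[4 * t + 2], block[4 * t + 3]]);
        }
        for t in 16..80 {
            w[t] = (w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16]).rotate_left(1);
        }

        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);

        // One round, parameterized only by f and K. The four loops below
        // replace a single 0..80 loop that matched on t every iteration.
        macro_rules! round {
            ($f:expr, $k:expr, $t:expr) => {{
                let tmp = a
                    .rotate_left(5)
                    .wrapping_add($f)
                    .wrapping_add(e)
                    .wrapping_add($k)
                    .wrapping_add(w[$t]);
                e = d;
                d = c;
                c = b.rotate_left(30);
                b = a;
                a = tmp;
            }};
        }

        for t in 0..20 {
            round!((b & c) | (!b & d), 0x5A82_7999, t);
        }
        for t in 20..40 {
            round!(b ^ c ^ d, 0x6ED9_EBA1, t);
        }
        for t in 40..60 {
            round!((b & c) | (b & d) | (c & d), 0x8F1B_BCDC, t);
        }
        for t in 60..80 {
            round!(b ^ c ^ d, 0xCA62_C1D6, t);
        }

        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}
```

Because each of the four loops fixes f and K at compile time, the branch-per-round cost the commit message describes disappears entirely.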
Done and dropped the commit with mixed results.
Thanks for doing this! Are you interested in doing the same for the SHA-2 implementation?
What exactly do you have in mind for SHA2? From a quick glance it doesn't appear to suffer from the same problem. |
The first commit is just a minor style fix that doesn't really warrant its own PR.
The second commit speeds up SHA1 computation by ~50-60%. See the commit message for details. Unfortunately it makes the code a bit uglier than I initially expected, because Rust doesn't support destructuring assignments (yet?). Still, even for an implementation like ring's, which favors simplicity over speed, the trade-off is worth it in my opinion.
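For context on the destructuring-assignment wart: at the time of this PR, Rust rejected assigning to an existing tuple of variables (destructuring assignment was only stabilized later, in Rust 1.59), so the per-round state rotation had to be spelled out one assignment at a time. A small sketch of the workaround (the function name and values are illustrative, not from the PR):

```rust
// Illustrative only: the manual state rotation that destructuring
// assignment would have made a one-liner.
fn rotate_without_destructuring(
    mut a: u32,
    mut b: u32,
    mut c: u32,
    mut d: u32,
    mut e: u32,
    tmp: u32,
) -> (u32, u32, u32, u32, u32) {
    // Desired one-liner, rejected by the compiler at the time
    // (destructuring assignment stabilized in Rust 1.59):
    //     (a, b, c, d, e) = (tmp, a, b.rotate_left(30), c, d);

    // Workaround: one assignment per variable, ordered so every value
    // is read before it is overwritten.
    e = d;
    d = c;
    c = b.rotate_left(30);
    b = a;
    a = tmp;
    (a, b, c, d, e)
}
```

Five such assignments per round, across four loops, is what makes the merged code noticeably uglier than the single-tuple form.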
There's an alternative with much nicer code and the same speedup in this branch. However, that turned out to increase the binary size by almost 390 KiB. Since ring targets small devices, that seemed unacceptable. Therefore I'm PRing this solution instead.
(Edit 2020-02-23: Deleted the alternative branch linked above. It used the unroll! macro from the crunchy crate. The merged solution is superior, so no need to keep it around.)

Before:
test digest::sha1::_1000 ... bench: 4,471 ns/iter (+/- 39) = 223 MB/s
test digest::sha1::_16 ... bench: 315 ns/iter (+/- 2) = 50 MB/s
test digest::sha1::_2000 ... bench: 8,903 ns/iter (+/- 208) = 224 MB/s
test digest::sha1::_256 ... bench: 1,452 ns/iter (+/- 14) = 176 MB/s
test digest::sha1::_8192 ... bench: 35,563 ns/iter (+/- 348) = 230 MB/s
test digest::sha1::block_len ... bench: 622 ns/iter (+/- 16) = 102 MB/s
After:
test digest::sha1::_1000 ... bench: 2,801 ns/iter (+/- 22) = 357 MB/s
test digest::sha1::_16 ... bench: 215 ns/iter (+/- 20) = 74 MB/s
test digest::sha1::_2000 ... bench: 5,556 ns/iter (+/- 86) = 359 MB/s
test digest::sha1::_256 ... bench: 922 ns/iter (+/- 35) = 277 MB/s
test digest::sha1::_8192 ... bench: 22,141 ns/iter (+/- 270) = 369 MB/s
test digest::sha1::block_len ... bench: 403 ns/iter (+/- 10) = 158 MB/s