# ACP: Expose algebraic floating point intrinsics #532
## Comments
An alternative is a wrapper type (in the style of `Wrapping`). Does this extend to `f16` and `f128` as well?
Yes please! Any new API should be consistent across all four formats.
Separately, my impression is that the Wrapping/Saturating types actually aren’t all that popular compared to using the wrapping/saturating methods. If they were proposed as new API today, I’d ask “why can’t this start out as a third party crate that wraps the primitive types’ methods to prove the concept?” - I think the same question applies to a wrapper type here.
Agreed. For instance, for strict arithmetic (panic on overflow), we added the methods and didn't add a wrapper type. The docs for these methods should make it clear that we are not fixing the exact set of optimizations that can be performed. Basically, the semantic guarantee for these operations is just that they return some float value, and even with the same inputs the resulting float value can be different even within a single program run. Unsafe code cannot rely on any property of the return value for soundness. However, implementations will generally do their best to pick a reasonable tradeoff between performance and accuracy for the result.
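As a sketch of what that non-guarantee means for callers (using a hypothetical `algebraic_add` method on `f32` for illustration):

```rust
// `algebraic_add` is a hypothetical method name; the ACP only fixes the
// intrinsics, not the exact surface API.
fn same_inputs_may_differ(x: f32, y: f32) -> bool {
    // Each call site may be optimized differently (e.g. one contracted
    // into an FMA, the other not), so this comparison is NOT guaranteed
    // to be true -- and unsafe code must not assume it is.
    x.algebraic_add(y) == x.algebraic_add(y)
}
```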
Yeah, that's fair enough. The difference I see is that an accidental vanilla `+` where a wrapping operation was intended at least has a chance of being caught by overflow checks, whereas an accidental vanilla float op where an algebraic one was intended only shows up as a missed optimization. However, no disagreement that the surface area may not be worth it for a relatively niche purpose.
On that note, if this goes forward it would be good to come up with a loose idea of what other optimization-related API we may want and how it all interacts. E.g. if we actually want almost all of fast-math, then per-operation algebraic methods may not be the right building block.
Only if you happen to run into the overflowing case in a debug build. But yeah it's easier to test for than a performance cliff.
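For reference, a minimal example of the debug-build behavior being discussed (standard Rust overflow semantics, independent of this proposal):

```rust
use std::hint::black_box;

fn main() {
    let x: u8 = black_box(255); // black_box prevents compile-time folding
    // Debug builds panic here ("attempt to add with overflow"), so an
    // accidental vanilla `+` where wrapping was intended is often caught
    // in testing; release builds wrap silently unless overflow checks
    // are enabled.
    let y = x + 1;
    // The explicit method wraps in every build profile:
    let z = x.wrapping_add(1);
    println!("{y} {z}");
}
```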
FTZ/DAZ is subnormal flushing, right? These acronyms are terribly inaccessible. ^^ I consider subnormal flushing to be a legal compilation scheme for these algebraic operations. But they are safe, so the backend can't make "no-NaN" or "is-finite" assumptions. Currently LLVM doesn't offer a flag that represents "if the input is a NaN, just produce some arbitrary well-defined output"; if LLVM gains such a flag in the future, I would say we could also use it for these operations.
I’d rather not expand the scope of these operations to cover every fast-math-ish thing that doesn’t have to be `unsafe`. I’m not saying we shouldn’t reserve the right to change what exactly these methods do, but as a QoI matter, it would probably be a disservice to users who just want a summation loop to get auto-vectorized if the only option for that also turned on a lot of other rewrites that are more likely to cause them problems than to help. While this goes in the direction of having multiple combinations of LLVM “fast math flags” exposed, I don’t think that there’s more than two or three sets that are broadly useful enough and well-behaved enough to be worth exposing. And one of those is just “let the hardware do whatever is fast w.r.t. subnormals”, which is something that really wants to be applied to entire regions of code and not to individual operations, so it may need an entirely different design. (It’s also a function attribute rather than an instruction flag in LLVM.)
Nope, because the associativity is still `((sum + a[0]*b[0]) + a[1]*b[1]) + a[2]*b[2] + …`, which the compiler isn't allowed to rewrite to `(a[0]*b[0] + a[2]*b[2] + …) + (a[1]*b[1] + a[3]*b[3] + …)` or anything else that would enable vectorization. Same performance as all the other stable Rust implementations on an i7-10875H:
https://github.com/calder/dot-bench/blob/main/mul_add.rs
How about naming these `f*_algebraic` rather than `fast_*`?
Gotcha, updated!
Exactly. In the benchmarking I've done, algebraic is sufficient in all cases to unlock vectorization. And for most floating point ops where it applies, there's nothing inherently more "correct" about one association than another; both are finite precision approximations of real numbers, and anyone writing floating point code already has to deal with that. The biggest difference is that the exact sequence of ops, and therefore the bit-exact result, is no longer guaranteed.

But for something like a dot product, where a mix of small terms and giant cancelling terms will cause precision issues regardless of association, this trade makes sense.
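For a concrete feel for why no association is privileged, a small illustrative example where regrouping a sum with large cancelling terms changes the computed result:

```rust
fn main() {
    let xs = [1.0e8f32, 1.0, -1.0e8, 1.0];
    // Left-to-right: (1e8 + 1.0) rounds back to 1e8 (the ulp at 1e8 is
    // 8 for f32), so one of the 1.0 terms is lost entirely.
    let serial: f32 = xs.iter().copied().fold(0.0, |s, x| s + x);
    // A different association cancels the large terms first and keeps
    // both 1.0 terms.
    let regrouped = (xs[0] + xs[2]) + (xs[1] + xs[3]);
    println!("{serial} vs {regrouped}"); // prints "1 vs 2"
}
```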
I completely agree they're very bad names. Yeah: flush-to-zero (FTZ) makes operations return 0 when the result would be subnormal, and denormals-are-zero (DAZ) makes operations treat subnormal inputs as zero, so they tend to go together.
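As a quick illustration of the values these modes act on, in today's default (non-flushing) semantics:

```rust
fn main() {
    // Halving the smallest positive *normal* f32 yields a subnormal.
    let tiny = f32::MIN_POSITIVE / 2.0;
    assert!(tiny > 0.0 && !tiny.is_normal());
    // Under DAZ, an operation reading `tiny` would treat it as 0.0;
    // under FTZ, an operation producing a result this small would
    // return 0.0 instead. By default neither happens, so this prints
    // a nonzero subnormal (~5.9e-39).
    println!("{tiny:e}");
}
```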
Agreeing with Hanna and Calder: logically, I don't think subnormal flushing should be grouped in with algebraic optimizations. The operations currently proposed here basically give the compiler freedom to increase your rounding error, but the subnormal flags reduce the amount of available range and are more hardware-dependent. That is why I was thinking these may fit into tiers.

We don't need to figure that all out now, but before stabilization (assuming this gets to that point) we should probably have a general idea of where other fast-math optimizations fit into the picture, if at all.

Edit: to clarify the above, the following would be roughly how the tiers expand in LLVM:

* No signed zeros (`nsz`).
* Subnormal flushing. This may not currently be feasible per-operation, since it is a function attribute rather than an instruction flag.
* The flags that poison NaN and infinity (`nnan` and `ninf`).

To be clear, I'm not suggesting we add all of these, but this may be a reasonable way to split things up if we do.
Thanks for the input everyone! What are the next steps here? I'll try to join the Libs API meeting tomorrow in case anyone has questions for me.
Nothing to do right now; just keep the top post up to date with new clarification/reasoning as needed.
We discussed this in the libs-api meeting today and accepted this as proposed, with the exact method names to be worked out in the implementation PRs. Similarly, we plan to work out the exact guarantees to be documented in the PRs. We discussed, and liked, @RalfJung's proposal above:

> Basically, the semantic guarantee for these operations is just that they return some float value, and even with the same inputs the resulting float value can be different even within a single program run. Unsafe code cannot rely on any property of the return value for soundness.
(Please include lang on the tracking issue and on stabilization PRs, as these will be exposing intrinsics. cc @rust-lang/lang)
Tracking issue: rust-lang/rust#136469
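For anyone following along, a sketch of how a dot product might look with the methods tracked there — assuming the `algebraic_*` method names under the nightly `float_algebraic` feature; see the tracking issue for the authoritative API:

```rust
#![feature(float_algebraic)]

fn dot(a: &[f32], b: &[f32]) -> f32 {
    // Algebraic ops license reassociation and contraction, letting the
    // compiler vectorize this reduction like the intrinsics version.
    let mut sum = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        sum = sum.algebraic_add(x.algebraic_mul(y));
    }
    sum
}
```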
# Proposal

## Problem statement

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

## Motivating examples or use cases

https://github.com/calder/dot-bench contains benchmarks for different dot product implementations, but all stable Rust implementations suffer from the same inability to vectorize and fuse instructions because the compiler needs to preserve the order of individual floating point operations.

Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:

```cc
float dot(float *a, float *b, size_t len) {
#pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

(Assembly: https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68)

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37f9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
```

(Assembly: https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9)

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

(Assembly: https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca)

## Solution sketch

Expose the `core::intrinsics::f*_algebraic` intrinsics as methods on `f16`, `f32`, `f64`, and `f128`, gated on a new `float_algebraic` feature.

## Alternatives

rust-lang/rust#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

## Open questions

* Should this cover `f16` and `f128` as well? --> Yes
* Should these be named `fast_*()`, since we're unlikely to ever want to expose the current (unsafe) `f*_fast()` intrinsics? --> No, "algebraic" does a better job capturing the intent guiding which optimizations we will enable

## Links and related work

* rust-lang/rust#21690
* rust-lang/libs-team#532
* rust-lang/rust#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
## What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.

### Possible responses

The libs team may respond in various ways. First, the team will consider the problem itself (this doesn't require any concrete solution or alternatives to have been proposed). Second, if there's a concrete solution, the team will consider the solution as proposed.