
ACP: Expose algebraic floating point intrinsics #532

Closed
4 tasks done
calder opened this issue Feb 3, 2025 · 12 comments
Labels
ACP-accepted (API Change Proposal is accepted, seconded with no objections); api-change-proposal (A proposal to add or alter unstable APIs in the standard libraries); T-libs-api

Comments

@calder

calder commented Feb 3, 2025

Proposal

Problem statement

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

Motivating examples or use cases

https://github.com/calder/dot-bench contains benchmarks for different dot product implementations, but all stable Rust implementations suffer from the same inability to vectorize and fuse instructions because the compiler needs to preserve the order of individual floating point operations.

Measurements below were performed on an i7-10875H:

C++: 10us ✅

With Clang 18.1.3 and -O2 -march=haswell:

C++ (assembly screenshot omitted):
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37f9) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust (assembly screenshot omitted):
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}

Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9a9) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust (assembly screenshot omitted):
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

Solution sketch

Expose core::intrinsics::f*_algebraic intrinsics as:

// core::num::f16

impl f16 {
    pub fn algebraic_add(self, rhs: f16) -> f16;
    pub fn algebraic_sub(self, rhs: f16) -> f16;
    pub fn algebraic_mul(self, rhs: f16) -> f16;
    pub fn algebraic_div(self, rhs: f16) -> f16;
    pub fn algebraic_rem(self, rhs: f16) -> f16;
}
// core::num::f32

impl f32 {
    pub fn algebraic_add(self, rhs: f32) -> f32;
    pub fn algebraic_sub(self, rhs: f32) -> f32;
    pub fn algebraic_mul(self, rhs: f32) -> f32;
    pub fn algebraic_div(self, rhs: f32) -> f32;
    pub fn algebraic_rem(self, rhs: f32) -> f32;
}
// core::num::f64

impl f64 {
    pub fn algebraic_add(self, rhs: f64) -> f64;
    pub fn algebraic_sub(self, rhs: f64) -> f64;
    pub fn algebraic_mul(self, rhs: f64) -> f64;
    pub fn algebraic_div(self, rhs: f64) -> f64;
    pub fn algebraic_rem(self, rhs: f64) -> f64;
}
// core::num::f128

impl f128 {
    pub fn algebraic_add(self, rhs: f128) -> f128;
    pub fn algebraic_sub(self, rhs: f128) -> f128;
    pub fn algebraic_mul(self, rhs: f128) -> f128;
    pub fn algebraic_div(self, rhs: f128) -> f128;
    pub fn algebraic_rem(self, rhs: f128) -> f128;
}
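To make the intent concrete, here is a sketch of the motivating dot product written against the proposed API. Since the methods are not yet available on stable, `fadd_algebraic` and `fmul_algebraic` below are stand-in free functions with plain float semantics; with the proposed methods the loop body would read `sum = sum.algebraic_add(a[i].algebraic_mul(b[i]))`.

```rust
// Stand-ins for the proposed operations (plain semantics here; the real
// versions would permit the compiler to reassociate and fuse).
fn fadd_algebraic(a: f32, b: f32) -> f32 { a + b }
fn fmul_algebraic(a: f32, b: f32) -> f32 { a * b }

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    // Iterate over the shorter of the two slices to avoid out-of-bounds panics.
    for i in 0..a.len().min(b.len()) {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    assert_eq!(dot(&a, &b), 32.0);
}
```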

Alternatives

rust-lang/rust#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

Open questions

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, so it is approved; you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@calder added the api-change-proposal and T-libs-api labels Feb 3, 2025
@tgross35

tgross35 commented Feb 3, 2025

An alternative is a struct Algebraic<F>(F) wrapper that implements Add, Mul, etc. so the usual operators can be used without remembering to use the methods for everything (akin to Wrapping and Saturating). I think that could be preferable because if you intend to opt in to imprecise operations, you probably want it for every operation on the value rather than piecemeal (especially here since this doesn't do much in isolation). If we stick with the associated functions however, naming should likely be reversed to match integer checked_add and similar.
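A minimal sketch of what such a wrapper could look like. The type and its impls here are hypothetical, and the arithmetic forwards to plain operations as stand-ins; the real type would forward to the algebraic intrinsics.

```rust
use std::ops::{Add, Mul};

// Hypothetical newtype in the spirit of `Wrapping` and `Saturating`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Algebraic<F>(F);

impl Add for Algebraic<f32> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Algebraic(self.0 + rhs.0) // stand-in for fadd_algebraic
    }
}

impl Mul for Algebraic<f32> {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        Algebraic(self.0 * rhs.0) // stand-in for fmul_algebraic
    }
}

fn main() {
    let x = Algebraic(2.0f32);
    let y = Algebraic(3.0f32);
    // Ordinary operator syntax, no per-call method names to remember.
    assert_eq!((x + y * y).0, 11.0);
}
```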

Does using mul_add instead get similar optimizations in the demo case?

Should we add these methods to f16 and f128 as well?

Yes please! Any new API should be consistent across all four formats.

@hanna-kruppe

I think struct Algebraic<F> opens a big can of worms because we currently only have a small subset of “algebraic” operations. Any other operations on floats would either have to be omitted (making the wrapper much less useful) or they’d be ambiguous in the sense that they currently don’t opt into any special semantics compared to the underlying float type, but might in the future if a use case arises. This is different from Wrapping/Saturating where it’s uniquely determined which operations can wrap/saturate and which ones can’t and are identical to the operation on the underlying type.

Separately, my impression is that the Wrapping/Saturating types actually aren’t all that popular compared to using the wrapping/saturating methods. If they were proposed as new API today, I’d ask “why can’t this start out as a third party crate that wraps the primitive types’ methods to prove the concept?” - I think the same question applies to Algebraic<F>. I think adding only the newtype and not the primitive methods would be a mistake because there will be cases where people would be forced to write (Algebraic(lhs) + Algebraic(rhs)).0 to get what they want.

@RalfJung
Member

RalfJung commented Feb 3, 2025

Separately, my impression is that the Wrapping/Saturating types actually aren’t all that popular compared to using the wrapping/saturating methods. If they were proposed as new API today, I’d ask “why can’t this start out as a third party crate that wraps the primitive types’ methods to prove the concept?”

Agreed. For instance, for strict arithmetic (panic on overflow), we added the methods and didn't add a wrapper type.


The docs for these methods should make it clear that we are not fixing the exact set of optimizations that can be performed. Basically, the semantic guarantee for these operations is just that they return some float value, and even with the same inputs the resulting float value can be different even within a single program run. Unsafe code cannot rely on any property of the return value for soundness. However, implementations will generally do their best to pick a reasonable tradeoff between performance and accuracy for the result.

@tgross35

tgross35 commented Feb 3, 2025

Yeah, that's fair enough. The difference I see is that an accidental vanilla * among checked_* or similar calls is something reasonably likely to be identified during testing, whereas accidentally using * among other *_algebraic calls will quietly remove any benefit from this optimization (seems easy to do at function boundaries).

However no disagreement that the surface area may not be worth it for a relatively niche purpose.

The docs for these methods should make it clear that we are not fixing the exact set of optimizations that can be performed. Basically, the semantic guarantee for these operations is just that they return some float value, and even with the same inputs the resulting float value can be different even within a single program run. Unsafe code cannot rely on any property of the return value for soundness. However, implementations will generally do their best to pick a reasonable tradeoff between performance and accuracy for the result.

On that note, if this goes forward it would be good to come up with a loose idea of what other optimization-related API we may want and how it all interacts. E.g. if we actually want almost all of fast-math then algebraic_* optimizations could be considered a subset of something providing FTZ/DAZ, and that a subset of unsafe finite-math-only + no-signed-zeros, in order to avoid the combinatorial explosion.

@RalfJung
Member

RalfJung commented Feb 3, 2025

Yeah, that's fair enough. The difference I see is that an accidental vanilla * among checked_* or similar calls is something reasonably likely to be identified during testing,

Only if you happen to run into the overflowing case in a debug build.

But yeah it's easier to test for than a performance cliff.

On that note, if this goes forward it would be good to come up with a loose idea of what other optimization-related API we may want and how it all interacts. E.g. if we actually want almost all of fast-math then algebraic_* optimizations could be considered a subset of something providing FTZ/DAZ, and that a subset of unsafe finite-math-only + no-signed-zeros, in order to avoid the combinatorial explosion.

FTZ/DAZ is subnormal flushing, right? These acronyms are terribly inaccessible.^^

I consider subnormal flushing to be a legal compilation scheme for these algebraic operations. But they are safe, so the backend can't make "no-nan" or "is-finite" assumptions. Currently LLVM doesn't offer a flag that represents "if the input is a nan, just produce some arbitrary well-defined output"; if LLVM gains such a flag in the future we could also use it for these operations I would say.

algebraic might not be the best name...

@hanna-kruppe

hanna-kruppe commented Feb 3, 2025

I’d rather not expand the scope of these operations to cover “every funsafe-math-ish thing that doesn’t have to be unsafe”. There is a clear benefit in many kinds of code to allowing reassociation and mul-add contraction, and their effects are usually relatively tame in practice, while e.g. returning a safe garbage value on NaN inputs is much more dubious.

I’m not saying we shouldn’t reserve the right to change what exactly these methods do, but as a QoI matter, it would probably be a disservice to users who just want a summation loop to get auto-vectorized if the only option for that also turned on a lot of other rewrites that are more likely to cause them problems than to help.

While this goes in the direction of having multiple combinations of LLVM “fast math flags” exposed, I don’t think that there’s more than two or three sets that are broadly useful enough and well-behaved enough to be worth exposing. And one of those is just “let the hardware do whatever is fast w.r.t. subnormals” which is something that really wants to be applied to entire regions of code and not to individual operations, so it may need an entirely different design. (It’s also a function attribute rather than an instruction flag in LLVM.)

@calder
Author

calder commented Feb 3, 2025

Does using mul_add instead get similar optimizations in the demo case?

Nope, because the associativity is still:

(((a0 * b0) + (a1 * b1)) + (a2 * b2)) + ...

which the compiler isn't allowed to rewrite to:

((a0 * b0 + a1 * b1 + ...) + (a8 * b8 + a9 * b9 + ...)) + ...

or anything else for vectorization. Same performance as all the other stable Rust implementations on an i7-10875H:

Rust (assembly screenshot omitted):
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = a[i].mul_add(b[i], sum);
    }
    sum
}

https://github.com/calder/dot-bench/blob/main/mul_add.rs
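For comparison, the reassociation that the algebraic operations would let LLVM perform automatically can be imitated by hand on stable Rust with multiple accumulators. This is only an illustrative sketch (the chunk width of 4 is arbitrary, and a real kernel would tune it to the target's vector width):

```rust
// Manual reassociation: accumulate into several independent lanes so the
// additions form parallel chains instead of one serial dependency chain.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let n = a.len().min(b.len());
    let chunks = n / 4 * 4;
    for i in (0..chunks).step_by(4) {
        for lane in 0..4 {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }
    // Reduce the lanes, then handle the remainder serially.
    let mut sum = acc.iter().sum::<f32>();
    for i in chunks..n {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [1.0; 5];
    assert_eq!(dot_chunked(&a, &b), 15.0);
}
```

Note that this changes the rounding behavior in exactly the way the algebraic ops would permit, which is why the compiler cannot do it for plain `+` and `*`.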

If we stick with the associated functions however, naming should likely be reversed to match integer checked_add and similar.

How about alg_[add|sub|mul|div|rem]?

Yes please! Any new API should be consistent across all four formats.

Gotcha, updated!

I’d rather not expand the scope of these operations to cover "every funsafe-math-ish thing that doesn’t have to be unsafe". There is a clear benefit in many kinds of code to allowing reassociation and mul-add contraction, and their effects are usually relatively tame in practice, while e.g. returning a safe garbage value on NaN inputs is much more dubious.

Exactly. In the benchmarking I've done, algebraic is sufficient in all cases to unlock vectorization. And for most floating point ops where it applies there's nothing inherently more "correct" about one association than another; both are finite precision approximations of real numbers and anyone writing floating point code already has to deal with that. The biggest difference is that the exact sequence of ops:

  1. Isn't predictable from looking at the code
  2. May change depending on which architecture you're compiling for

But for something like a dot product where a mix of small terms and giant cancelling terms will cause precision issues regardless of association, this trade makes sense.
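The non-associativity being traded away here is easy to demonstrate: in IEEE 754 arithmetic, regrouping a sum changes which intermediate results get rounded, so the two groupings below produce different `f64` values.

```rust
// Floating point addition is not associative: regrouping changes rounding.
fn main() {
    let left = (0.1_f64 + 0.2) + 0.3;
    let right = 0.1_f64 + (0.2 + 0.3);
    assert_ne!(left, right); // 0.6000000000000001 vs 0.6
    println!("{left:.17} vs {right:.17}");
}
```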

@tgross35

tgross35 commented Feb 3, 2025

On that note, if this goes forward it would be good to come up with a loose idea of what other optimization-related API we may want and how it all interacts. E.g. if we actually want almost all of fast-math then algebraic_* optimizations could be considered a subset of something providing FTZ/DAZ, and that a subset of unsafe finite-math-only + no-signed-zeros, in order to avoid the combinatorial explosion.

FTZ/DAZ is subnormal flushing, right? These acronyms are terribly inaccessible.^^

I completely agree they're very bad names. Yeah; flush-to-zero makes the output of operations return 0 if subnormal, denormals-are-zero makes the operations treat subnormal inputs as zero, so they tend to go together.
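For reference, a subnormal value is easy to produce: halving the smallest positive normal `f32` yields one. Under hardware FTZ/DAZ modes such a value would be flushed to zero on output or treated as zero on input; the snippet below just shows the value exists and is nonzero under default IEEE semantics.

```rust
// Halving the smallest positive *normal* f32 produces a subnormal value.
fn main() {
    let tiny = f32::MIN_POSITIVE / 2.0;
    assert!(tiny.is_subnormal());
    assert!(tiny > 0.0); // with FTZ/DAZ in effect this would behave as zero
    println!("{tiny:e}");
}
```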

I consider subnormal flushing to be a legal compilation scheme for these algebraic operations. But they are safe, so the backend can't make "no-nan" or "is-finite" assumptions. Currently LLVM doesn't offer a flag that represents "if the input is a nan, just produce some arbitrary well-defined output"; if LLVM gains such a flag in the future we could also use it for these operations I would say.

algebraic might not be the best name...

Agreeing with Hanna and Calder, logically I don't think subnormal flushing should be grouped in with algebraic optimizations. The operations currently proposed here basically give the compiler freedom to increase your rounding error, but the subnormal flags reduce the amount of available range and are more hardware-dependent. That is why I was thinking these may fit into tiers:

  1. Optimizations that reduce precision but do not significantly affect the result: the algebraic optimizations here, any sort of sqrt/cbrt+pow, exp+log, or trig reduction if LLVM ever uses its intrinsics for that, possibly calling less precise versions of math functions.
  2. The above, plus hardware-dependent optimizations that trade precision or range for throughput: FTZ/DAZ
  3. The above, plus anything you may be able to gain by making the operations unsafe (though I am unsure how well this fits with FTZ/DAZ, it's tougher to tell if your operations will overflow if you don't know your available range)

We don't need to figure that all out now but before stabilization (assuming this gets to that point), we should probably have a general idea of where other fast-math optimizations fit into the picture, if at all.


Edit: to clarify the above, the following would be about how this expands in LLVM:

  1. Optimizations that reduce precision but do not significantly affect the result: the algebraic optimizations here, any sort of sqrt/cbrt+pow, exp+log, or trig reduction if LLVM ever uses its intrinsics for that, possibly calling less precise versions of math functions.

No signed zeros (nsz), which does not poison -0, plus all the rewrite flags: reciprocal division (arcp), operation fusing (contract), and reassociation (reassoc). This is what we currently have: https://github.com/rust-lang/rust/blob/8a8b4641754e9ce8a31b272dda6567727452df9e/compiler/rustc_llvm/llvm-wrapper/RustWrapper.cpp#L605-L612. If we add similar API for other math functions, this should include approximate functions (afn).

  2. The above, plus hardware-dependent optimizations that trade precision or range for throughput: FTZ/DAZ

This may not currently be feasible; denormal-fp-math seems to be a function attribute rather than an operation flag. Setting it to "positive-zero,positive-zero" would probably be the closest thing.

  3. The above, plus anything you may be able to gain by making the operations unsafe (though I am unsure how well this fits with FTZ/DAZ, it's tougher to tell if your operations will overflow if you don't know your available range)

This would be the flags that poison NaN and infinity, nnan and ninf.

To be clear, I'm not suggesting we add all of these but this may be a reasonable way to split things up if we do.

@calder
Author

calder commented Feb 4, 2025

Thanks for the input everyone! What are the next steps here?

I'll try to join the Libs API meeting tomorrow in case anyone has questions for me.

@tgross35

tgross35 commented Feb 4, 2025

Nothing to do, just keep the top post up to date with new clarification/reasoning as needed, then wait for libs-api to get a chance to discuss it 👍

@traviscross

traviscross commented Feb 4, 2025

We discussed this in the libs-api meeting today and accepted this as proposed, under the algebraic_* name. We also, in this acceptance, accept any similar algebraic_* methods that might make sense (e.g. maybe algebraic_mul_add), with those to be worked out in PRs.

Similarly, we plan to work out the exact guarantees to be documented in the PRs. We discussed and did like @RalfJung's proposal above:

The docs for these methods should make it clear that we are not fixing the exact set of optimizations that can be performed. Basically, the semantic guarantee for these operations is just that they return some float value, and even with the same inputs the resulting float value can be different even within a single program run. Unsafe code cannot rely on any property of the return value for soundness. However, implementations will generally do their best to pick a reasonable tradeoff between performance and accuracy for the result.

(Please include lang on the tracking issue and on stabilization PRs, as these will be exposing intrinsics. cc @rust-lang/lang)

@the8472 closed this as completed Feb 4, 2025
@the8472 added the ACP-accepted label Feb 4, 2025
@RalfJung
Member

Tracking issue: rust-lang/rust#136469

Zalathar added a commit to Zalathar/rust that referenced this issue Feb 18, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 19, 2025
Expose algebraic floating point intrinsics


try-job: x86_64-gnu-nopt