Add option to compile directly to llvm intrinsics #97

Closed

Finomnis opened this issue May 31, 2023 · 13 comments


@Finomnis

Finomnis commented May 31, 2023

Many targets, like thumbv7em-none-eabihf, have a hardware floating-point unit that supports operations like f32::abs() or f32::sqrt(). Using those would improve performance and reduce code footprint.

Sadly, in no-std, those are currently only reachable through the unstable core::intrinsics API. A discussion about stabilizing them exists, but has been stale for quite a while now. Still, for maximum performance, it would be worth adding a feature flag (named unstable, experimental_intrinsics, or similar) that compiles directly to intrinsics.

This would work as follows. LLVM can compile an intrinsic in three different ways, depending on the circumstances. For example, sqrtf32() compiles:

  • to a constant, if it is evaluable at compile time
  • to a hardware instruction, if the target supports one (like thumbv7em-none-eabihf)
  • to a call to a libm-style fallback function (sqrtf for f32) if no hardware support exists. This function has to be provided by us, via extern "C" and #[no_mangle]; see the sketch below.
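As a rough sketch of the mechanism (nightly-only; the function name fast_sqrt is illustrative, and the exact fallback symbol LLVM emits is target-dependent):

#![feature(core_intrinsics)] // nightly-only

// Illustrative fallback: `sqrtf` is the C libm symbol LLVM typically
// calls for f32 square roots when it cannot lower the intrinsic to a
// constant or a hardware instruction. Whether defining it here
// collides with an existing libm symbol depends on the target.
#[no_mangle]
extern "C" fn sqrtf(x: f32) -> f32 {
    micromath::F32(x).sqrt().0 // pure-software fallback
}

pub fn fast_sqrt(x: f32) -> f32 {
    // LLVM lowers this to a constant, a hardware instruction, or a
    // call to `sqrtf`, depending on the target and the call site.
    unsafe { core::intrinsics::sqrtf32(x) }
}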

Advantages:

  • Out-of-the-box compile-time evaluation where possible (built into LLVM)
  • Much smaller code footprint
  • Much better performance

Disadvantages:

  • Depending on the target, the provided fallback functions might collide with symbols from the standard library; this is still to be determined. That's why I would definitely keep this behind a non-default feature gate.
  • Requires a nightly compiler. This is another reason why I would keep it behind a feature gate.

If this sparks interest from the crate developer, I might start an implementation and open a PR. If not, I'm fine with having this issue closed.

@tarcieri
Owner

We can use unstable core intrinsics (I'd prefer a nightly feature for that, though keeping up with nightly changes is always a bit painful), but if the goal is primarily optimizing thumbv7em-none-eabihf, it seems like inline assembly would be a stable alternative?
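Roughly something like this, as a sketch (hw_sqrt is a hypothetical name; vsqrt.f32 is the single-precision square-root instruction on Cortex-M4F/M7 parts):

// Hypothetical helper: only builds for ARM targets, and assumes an
// FPU-enabled target such as thumbv7em-none-eabihf.
#[cfg(target_arch = "arm")]
pub fn hw_sqrt(x: f32) -> f32 {
    let out: f32;
    // SAFETY: vsqrt.f32 only reads `x` and writes the result register;
    // it has no memory accesses or other side effects.
    unsafe {
        core::arch::asm!(
            "vsqrt.f32 {o}, {x}",
            o = out(sreg) out,
            x = in(sreg) x,
            options(pure, nomem, nostack),
        );
    }
    out
}

With options(pure, nomem), LLVM can still deduplicate or eliminate unused calls, even though it can't see through the asm block itself.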

@Finomnis
Author

Finomnis commented May 31, 2023

if the goal is primarily optimizing thumbv7em-none-eabihf, it seems like inline assembly would be a stable alternative?

Well yes, that is true, but implementing it via intrinsics would make this compatible with all boards, not just this one. And inline assembly is opaque to LLVM's optimizer, so things like compile-time evaluation wouldn't happen.

This is my current solution in a project:

// Requires a nightly compiler and #![feature(core_intrinsics)] at the crate root.

pub trait F32Ext {
    /// Sine
    fn sin(self) -> Self;
    /// Cosine
    fn cos(self) -> Self;
}

// Fallbacks for targets without hardware support; LLVM calls these
// libm-style symbols when it cannot lower the intrinsics otherwise.
#[no_mangle]
extern "C" fn sinf(val: f32) -> f32 {
    micromath::F32(val).sin().0
}

#[no_mangle]
extern "C" fn cosf(val: f32) -> f32 {
    micromath::F32(val).cos().0
}

impl F32Ext for f32 {
    fn sin(self) -> f32 {
        // Lowered to a constant, a hardware instruction, or a call
        // to `sinf`, depending on the target and the call site.
        unsafe { core::intrinsics::sinf32(self) }
    }

    fn cos(self) -> f32 {
        unsafe { core::intrinsics::cosf32(self) }
    }
}
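With the trait in scope, call sites then stay ordinary method calls, e.g. (hypothetical example):

fn wave(t: f32) -> f32 {
    t.sin() + t.cos()
}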

Surely I'm not the only person who has come across this issue, so I thought maybe it's worth including something like this in the library :)

But if you think it's too much effort to maintain, I'll just keep it in my local project.

@tarcieri
Owner

Well yes, that is true, but implementing it via intrinsics would make this compatible with all boards, not just this one

Shouldn't the optimizations work on all boards with a thumbv7em-none-eabihf MCU? I don't see why the assembly would vary by board.

And inline assembly isn't optimizable via LLVM, so things like compile time evaluation wouldn't happen.

While that's a minor drawback, I think it's outweighed by the nightly-vs-stable question. Nightly is a big maintenance burden because it's a moving target, which means things may break from one nightly release to the next.

@Finomnis
Author

Shouldn't the optimizations work on all boards with a thumbv7em-none-eabihf MCU? I don't see why the assembly would vary by board.

I might have misworded it; I mean it would be compatible with all targets, not just thumbv7em-none-eabihf. Every other target would then also automatically choose either a hardware implementation or micromath as a fallback, instead of always using the micromath implementation. Even micromath functions that call each other would benefit if some of those calls were hardware-accelerated.

While that's a minor drawback, I think it's outweighed by the nightly-vs-stable question. Nightly is a big maintenance burden because it's a moving target, which means things may break from one nightly release to the next.

And what if we add a note to the flag's documentation stating that we give zero stability guarantees? The rest of the project (if the feature isn't enabled) would of course stay compatible with stable.

@Finomnis
Author

Of course, if you say it's too much of a hassle, I might release it as a separate package instead.

@tarcieri
Owner

I have nightly features on other projects and they can be quite annoying. Even if the nightly is pinned, there can be sporadic breakages (e.g. an ICE, internal compiler error).

This crate is low-maintenance enough, however, that I'd be OK with it, so long as the nightly is pinned and we don't get frequent requests to bump the nightly version and cut releases due to breakages, like I experience on other projects.

@Finomnis
Author

Finomnis commented May 31, 2023

I started by doing some benchmarks, and it seems that only abs and sqrt are available as hardware instructions on thumbv7em-none-eabihf. And sadly, it barely makes a difference.

(on Teensy 4.0)

===== Micromath Benchmark =====
Git Version: 9a7492b-modified

All values in ns/iter.

            micromath       libm intrinsics
abs              16.6       16.6       16.7
acos            178.0      120.3
asin            110.1      139.6
atan             85.0       78.5
atan_norm        76.8
ceil             48.4       58.1
cos             105.0     1668.0
exp             233.5      116.8
floor            38.5       61.4
fract            21.6
inv              16.6
invsqrt          16.6
ln              161.8      157.4
log2            166.9      177.6
log10           166.8      174.2
round            36.6
sin             110.0     1707.4
sqrt             53.4      491.8       38.4
tan             160.1     2560.6
trunc            26.3       45.0

There would still be the advantages of compile-time evaluation and better accuracy. But it seems micromath is already really well optimized. Kudos.

@Finomnis
Author

Finomnis commented May 31, 2023

@tarcieri Should I create a PR with those benchmarks? It only includes the one-parameter functions so far, and I probably won't add more, since I will no longer use intrinsics myself now that I know how small the gain is.

Either way, here they are in case you are curious: https://github.com/Finomnis/micromath/tree/teensy40_benchmark/examples/benchmark_teensy40

@tarcieri
Owner

Yes, sure, that'd be great.

@Finomnis
Author

Finomnis commented Jun 1, 2023

@tarcieri #98

@Finomnis
Author

Finomnis commented Jun 1, 2023

@tarcieri Btw, it seems that some of the libm implementations are faster than ours, at least on the Teensy 4.0. Might be worth investigating. The biggest difference seems to be acos, although I'm only running it repeatedly on a single value, so I'm not sure how input-dependent the timing is.

@Finomnis
Author

Finomnis commented Jun 1, 2023

Will close this for now as I won't continue working on it.

Finomnis closed this as completed Jun 1, 2023
@tarcieri
Owner

tarcieri commented Jun 1, 2023

it seems that some of the libm implementations are faster than ours

Yeah, that's an interesting data point, although this library optimizes for code size over performance.

I can investigate whether there are implementations that are both faster and similar in code size, and potentially adopt those.
