Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ARM hardware intrinsics #63

Open
Shnatsel opened this issue Dec 31, 2022 · 15 comments
Open

Add support for ARM hardware intrinsics #63

Shnatsel opened this issue Dec 31, 2022 · 15 comments

Comments

@Shnatsel
Copy link
Contributor

ARM provides intrinsics to convert from f32 to f16 since ARMv8, see e.g. VCVT-F16-F32

Unfortunately the Rust standard library does not implement this intrinsic yet, even though it does implement lots of similar ones - e.g. vcvt_f64_f32.

Adding support for this intrinsic in the standard library should be fairly trivial, since all the groundwork is already laid out.

@starkat99
Copy link
Owner

Ya, been waiting for these to show up, but probably just going to have to PR them myself like I'm doing for the F16C stabilization.

@starkat99
Copy link
Owner

Did some digging. It looks like the holdup for ARM conversions is that the ARM code all relies on simd_cast LLVM intrinsic rather than the ARM intrinsics directly, and the LLVM simd_cast relies on proper typing to use the right hardware intrinsics. Without a builtin f16 type, float16x4_t doesn't have a way to indicate the right cast type for simd_cast to implement the ARM intrinsics.

So it would be a completely different implementation that accessed the LLVM hardware intrinsics directly without going through the easier simd_cast. It looks like relevant LLVM intrinsics are llvm.arm.vcvt[b,t].f32.f16 and llvm.arm.vcvt[b,t].f16.f32. Not sure the difference between b/t variants. But it's definitely not as straightforward as just doing same as other conversions.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jan 2, 2023

I think the only reasonable option in the short term is to bypass the intrinsics and use inline assembly to get to these instructions.

Inline assembly on ARM has been stable since Rust 1.59, so this will not even require a nightly compiler.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jan 6, 2023

is_aarch64_feature_detected has been stable since 1.60, so looks like with inline assembly this could be implemented in stable today, without waiting on any std or compiler features.

@starkat99
Copy link
Owner

Interesting, did not realize that. Though 32-bit ARM detection is NOT stable (so could not detect "neon" feature set for this) so there would be some disparity but definitely worth getting this working.

@starkat99
Copy link
Owner

I've implemented the assembly conversions in the aarch64-intrinsics branch, and it's passing tests on CI, although I don't have access to an ARM machine to benchmark it myself. Currently still nightly toolchain only, I'll try to get it working on stable rust before merging the branch.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jan 9, 2023

Amazing! That was quick!

As for benchmarking, a number of public clouds provide ARM machines and free compute credits upon signup.

For example, Google Cloud (full disclosure: they're my employer, I may be biased) provides $300 in free credit, and they have ARM machines. That should be more than enough to test and benchmark this on real hardware.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jan 9, 2023

Also, I have access to an ARM machine, and I could run the benchmarks you specify.

Note that the criterion benchmarks in the repository lack black-boxing, which may make them not representative of the real-world performance. Both the input and output should be wrapped in std::hint::black_box. You can find more info here.

@starkat99
Copy link
Owner

starkat99 commented Jan 23, 2023

Merged the aarch64 branch, so main branch now has AArch64 hardware support on Stable Rust now. Also includes arithmetic hardware operations on f16 now too. Because it requires a MSRV bump, it'll be a bit before crates gets a new release, given the MSRV policy for this crate requires a major version bump. Will probably wait until the 1.69 when x86 F16C support should be stabilized.

Leaving issue open for future possible 32-bit ARM support

@Shnatsel
Copy link
Contributor Author

I assume the upcoming release also enabled the use-intrinsics feature by default, or get rid of it entirely?

@starkat99
Copy link
Owner

Feature is still there until the F16C stuff gets stabilized. Will probably remove after yes

@Shnatsel
Copy link
Contributor Author

the MSRV policy for this crate requires a major version bump.

This is unfortunate because all dependents that re-export the type or expose it in the public API at all, such as the exr crate, will also have to make a breaking change to get these improvements. So a major version bump in half will force many other dependents to make a major version bump themselves, which is undesirable.

Worse, the major version bumps would also have to be carried out in sync across all dependents, or they will lose interoperability with each other. The ecosystem had it already with futures crate versions 0.1 and 0.3, and it was Not Fun.

Since no breaking changes are actually made other than MSRV, may I instead suggest keeping the use-intrinsics feature off by default for a while longer, and turning it on by default a few releases after stabilization with a minor version bump? That gives control over the MSRV to the end user who's actually tuning these features, instead of forcing crate maintainers to pick either the version with or without intrinstics and splitting the ecosystem.

@starkat99
Copy link
Owner

The MSRV policy is due to a previous minor version bump with MSRV bump that caused compile failures in downstream creates. It's kind of tough either way, with either choice causing downstream issues. I'll see what I can do about phasing out use-intrinsics in the way you mentioned, that may work in this case. I also want to see if I can finangle a deprecation warning on the use-intrinsics feature somehow, which could help with this issue.

@starkat99
Copy link
Owner

Now that 1.70 is out with the x86 f16c intrinsics stabilized, I think I'll just do the minor release with a full MSRV bump and all the hardware support automatically enabled, deprecating use-intrinsics, and just pull the bandage off all at once.

@utkarshgupta137
Copy link

Not sure if this is part of this issue, but aarch64 has bf16/svebf16 features, which are currently not used by this crate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants