Implement RVV backend #372

silvanshade · 2024-01-18T06:47:24Z

No description provided.

BurningEnlightenment · 2024-01-18T09:51:28Z

c/CMakeLists.txt

+set(CMAKE_GENERATOR Ninja)
+set(CMAKE_BUILD_TYPE Release)
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_CROSSCOMPILING_EMULATOR qemu-riscv64-static)
+set(CMAKE_ASM_COMPILER clang-17)
+set(CMAKE_ASM_COMPILER_TARGET riscv64-unknown-linux-gnu)
+set(CMAKE_ASM_FLAGS_INIT "-march=rv64gcv1p0")
+set(CMAKE_C_COMPILER clang-17)
+set(CMAKE_C_COMPILER_TARGET riscv64-unknown-linux-gnu)
+set(CMAKE_C_FLAGS_INIT "-march=rv64gcv1p0")
+set(CMAKE_CXX_COMPILER clang++-17)
+set(CMAKE_CXX_COMPILER_TARGET riscv64-unknown-linux-gnu)
+set(CMAKE_CXX_FLAGS_INIT "-flto=thin -march=rv64gcv1p0")
+set(CMAKE_EXE_LINKER_FLAGS "-fuse-ld=lld-17")


I recommend using CMakePresets as they are quite a bit more ergonomic.

oconnor663 · 2024-01-18T20:18:34Z

I have less free time for code reviews than I used to, so apologies in advance for taking a while to get to this. You might be interested in an RVV assembly implementation that I've been working on here: https://github.com/BLAKE3-team/BLAKE3/blob/guts_api/rust/guts/src/riscv_rva23u64.S. Unfortunately that branch is tied to a large refactoring, which makes it hard for me to land it in master.

silvanshade · 2024-01-18T21:01:47Z

@oconnor663 Oh cool, I didn't realize there was already some implementation work for RVV.

I'll probably give it a closer look soon but just out of curiosity, what state is it in? Any idea about the performance characteristics of it or anything else interesting to note?

Also, have you done any work on any SVE backend?

oconnor663 · 2024-01-18T22:45:59Z

(I just pushed a commit to clean up some function names, so you might need to refresh the page if you still have that .S file open.)

My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under QEMU, so I've never done any real benchmarks, but it is passing tests. The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2. There was also a minor perf regression in AVX-512 that I'll need to track down. Then there are loose ends to tie up around e.g. MSVC-flavored assembly.
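For context on why Zbb and Zvbb matter here: BLAKE3's mixing function rotates 32-bit words right by 16, 12, 8, and 7 bits, and those extensions provide scalar and vector rotate instructions respectively. The following is a minimal portable C sketch of the quarter-round as defined in the BLAKE3 specification, with the rotate written as two shifts and an OR, which is roughly what a target without a rotate instruction has to do. The names `rotr32` and `g` are illustrative and are not taken from the branch discussed above.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32) using two shifts and an OR.
   With Zbb (scalar) or Zvbb (vector) this can map to a single rotate instruction. */
static inline uint32_t rotr32(uint32_t w, unsigned n) {
  return (w >> n) | (w << (32 - n));
}

/* BLAKE3 quarter-round: mixes state words a, b, c, d with message words mx, my.
   Additions wrap modulo 2^32, which unsigned arithmetic gives for free. */
static inline void g(uint32_t state[16], int a, int b, int c, int d,
                     uint32_t mx, uint32_t my) {
  state[a] = state[a] + state[b] + mx;
  state[d] = rotr32(state[d] ^ state[a], 16);
  state[c] = state[c] + state[d];
  state[b] = rotr32(state[b] ^ state[c], 12);
  state[a] = state[a] + state[b] + my;
  state[d] = rotr32(state[d] ^ state[a], 8);
  state[c] = state[c] + state[d];
  state[b] = rotr32(state[b] ^ state[c], 7);
}
```

Each quarter-round performs four rotates, so on hardware without Zvbb every vector rotate would presumably become a shift-shift-OR sequence like the one in `rotr32`.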

Most of the heavy lifting in the parallel implementation (which is what really matters for performance) is in blake3_guts_riscv_rva23u64_kernel, but that code is pretty straightforward without any significant open questions. There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.
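To illustrate the two options being weighed, here is a scalar C model, not RVV code. A strided segment load such as `vlsseg8e32.v` places field k of the i-th stride-separated segment into lane i of the k-th destination register; the alternative is to load each input contiguously and transpose afterwards. The lane count, the array layout, and both function names below are assumptions made for the sketch, and nothing here is taken from `blake3_guts_riscv_rva23u64_hash_blocks`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FIELDS matches the 8 fields of an 8-field segment load.
   LANES is a hypothetical vector length chosen for illustration. */
enum { FIELDS = 8, LANES = 4 };

/* Model of a strided segment load: out[k][i] receives 32-bit field k of
   input i, where consecutive inputs are stride_bytes apart.
   memcpy reads words in native byte order (little-endian on RISC-V). */
static void load_strided_segments(const uint8_t *base, size_t stride_bytes,
                                  uint32_t out[FIELDS][LANES]) {
  for (size_t i = 0; i < LANES; i++)
    for (size_t k = 0; k < FIELDS; k++)
      memcpy(&out[k][i], base + i * stride_bytes + 4 * k, sizeof(uint32_t));
}

/* Alternative: one contiguous load per input, then a separate transpose.
   On real hardware the transpose would be done in vector registers. */
static void load_then_transpose(const uint8_t *base, size_t stride_bytes,
                                uint32_t out[FIELDS][LANES]) {
  uint32_t rows[LANES][FIELDS];
  for (size_t i = 0; i < LANES; i++)
    memcpy(rows[i], base + i * stride_bytes, sizeof rows[i]);
  for (size_t i = 0; i < LANES; i++)
    for (size_t k = 0; k < FIELDS; k++)
      out[k][i] = rows[i][k];
}
```

Both functions produce the same layout; the open question in the comment above is only which memory access pattern real hardware handles faster.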

I haven't tried ARM SVE yet, no. (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

silvanshade · 2024-01-18T23:39:21Z

> My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under QEMU, so I've never done any real benchmarks, but it is passing tests.

Interesting. Thanks for the information.

I've also been doing most of my experimentation under QEMU. I recently got hold of a Pioneer (SG2042), but it only supports RVV 0.7.1 and I haven't tried to get tooling working with that yet (in fact I've only just gotten it to boot, heh). But it might be interesting to try to adapt what you have (sans the Zbb/Zvbb and whatever else is missing).

> The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2.

I'd be interested in helping with that effort if you'd like. If you could give me some pointers on where to start or whatever, I'd certainly take a look.

> There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.

Yeah, I noticed that. Seemed interesting. I'm also wondering how that will work out.

> I haven't tried ARM SVE yet, no.

I was really kind of looking for an interesting project to try something VLA related but since it seems like you've mostly solved the RVV side, maybe I will give SVE a try instead.

> (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

I actually made an attempt to finish the missing parts of the NEON implementation in #369. I'm certainly not an expert though, and this was my first real attempt at using NEON for anything.

Like you suggested though, implementing compress didn't make any practical difference. I tried a few different approaches there but overall nothing seemed to help. I'm guessing it will be hard to get better performance without some sort of more fundamental redesign of the algorithm but I don't even know what that would look like. I suspect all the shuffling in particular is hard to make efficient for NEON.

One thing I was thinking about though, for better performance on Apple Silicon at least, is to try an implementation using Metal, making use of the unified memory modes to try to avoid the latency issues that made the Vulkan version (and a SYCL version I saw elsewhere) not very usable.

Another thing I've been wondering about is whether it might be possible to use the AMX coprocessor for some parts of the algorithm, perhaps genlut in particular.

Anyway, interesting stuff. Let me know if there's some way I can help with that branch or maybe if you have some suggestions for other ideas worth exploring.

BurningEnlightenment reviewed Jan 18, 2024

silvanshade force-pushed the feature/rvv branch from af4a32f to 99db257 Compare January 18, 2024 13:05

Add temporary config files and settings for RISC-V development

d60b753

silvanshade force-pushed the feature/rvv branch from 99db257 to 5445a52 Compare January 18, 2024 18:03

Implement RVV backend

9c46892

silvanshade force-pushed the feature/rvv branch from 5445a52 to 9c46892 Compare January 18, 2024 18:33

silvanshade closed this Feb 11, 2024
