
port the memchr implementations to "generic SIMD" code #87

Closed
BurntSushi opened this issue Jul 29, 2021 · 3 comments

@BurntSushi
Owner

When I wrote the new memmem implementation earlier this year, one thing I did was write the implementation as something that was generic over the vector type:

/// # Safety
///
/// Since this is meant to be used with vector functions, callers need to
/// specialize this inside of a function with a `target_feature` attribute.
/// Therefore, callers must ensure that whatever target feature is being used
/// supports the vector functions that this function is specialized for. (For
/// the specific vector functions used, see the Vector trait implementations.)
#[inline(always)]
pub(crate) unsafe fn fwd_find<V: Vector>(
    fwd: &Forward,
    haystack: &[u8],
    needle: &[u8],
) -> Option<usize> {

and here's an example of the generic code being used, e.g. for AVX2:

genericsimd::Forward::new(ninfo, needle).map(Forward)

So basically, the idea here is, you write the nasty SIMD code once, and then write some trivial shims for each target feature you want to support.

The actual use of SIMD in this crate is reasonably simple, so it turns out that the trait defining the API of a vector is quite small:

pub(crate) trait Vector: Copy + core::fmt::Debug {
    /// _mm_set1_epi8 or _mm256_set1_epi8
    unsafe fn splat(byte: u8) -> Self;
    /// _mm_loadu_si128 or _mm256_loadu_si256
    unsafe fn load_unaligned(data: *const u8) -> Self;
    /// _mm_movemask_epi8 or _mm256_movemask_epi8
    unsafe fn movemask(self) -> u32;
    /// _mm_cmpeq_epi8 or _mm256_cmpeq_epi8
    unsafe fn cmpeq(self, vector2: Self) -> Self;
    /// _mm_and_si128 or _mm256_and_si256
    unsafe fn and(self, vector2: Self) -> Self;
}
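To make the pattern concrete, here's a hedged sketch of all three pieces together: an impl of the trait for SSE2's __m128i (following the intrinsic mapping in the doc comments above), a simplified generic kernel, and a trivial #[target_feature] shim. The names find_generic and find_sse2 are made up for illustration; this is not the crate's actual source, and the real kernel handles alignment and unrolling that are omitted here.

```rust
// Sketch only: trait impl for SSE2, a generic kernel written once, and a
// thin per-target-feature shim. Assumes an x86_64 target.
use core::arch::x86_64::*;

trait Vector: Copy + core::fmt::Debug {
    unsafe fn splat(byte: u8) -> Self;
    unsafe fn load_unaligned(data: *const u8) -> Self;
    unsafe fn movemask(self) -> u32;
    unsafe fn cmpeq(self, vector2: Self) -> Self;
    unsafe fn and(self, vector2: Self) -> Self;
}

impl Vector for __m128i {
    #[inline(always)]
    unsafe fn splat(byte: u8) -> Self { _mm_set1_epi8(byte as i8) }
    #[inline(always)]
    unsafe fn load_unaligned(data: *const u8) -> Self {
        _mm_loadu_si128(data as *const __m128i)
    }
    #[inline(always)]
    unsafe fn movemask(self) -> u32 { _mm_movemask_epi8(self) as u32 }
    #[inline(always)]
    unsafe fn cmpeq(self, vector2: Self) -> Self { _mm_cmpeq_epi8(self, vector2) }
    #[inline(always)]
    unsafe fn and(self, vector2: Self) -> Self { _mm_and_si128(self, vector2) }
}

// The generic kernel, written once. `#[inline(always)]` so it inlines into
// the `#[target_feature]` shim. Simplified: whole chunks plus a scalar tail.
#[inline(always)]
unsafe fn find_generic<V: Vector>(needle: u8, haystack: &[u8]) -> Option<usize> {
    let vn = V::splat(needle);
    let step = core::mem::size_of::<V>();
    let mut i = 0;
    while i + step <= haystack.len() {
        let chunk = V::load_unaligned(haystack.as_ptr().add(i));
        let mask = chunk.cmpeq(vn).movemask();
        if mask != 0 {
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += step;
    }
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}

// The trivial per-target-feature shim. SSE2 is part of the x86_64 baseline,
// so calling this is sound on any x86_64 target.
#[target_feature(enable = "sse2")]
unsafe fn find_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_generic::<__m128i>(needle, haystack)
}

fn main() {
    let hay = b"the quick brown fox jumps over the lazy dog";
    assert_eq!(unsafe { find_sse2(b'q', hay) }, Some(4));
    assert_eq!(unsafe { find_sse2(b'z', hay) }, Some(37));
    assert_eq!(unsafe { find_sse2(b'!', hay) }, None);
}
```

An AVX2 shim would look identical except for `#[target_feature(enable = "avx2")]` and `find_generic::<__m256i>`, which is the whole point: the searcher logic lives in one place.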

OK, so what's this issue about? I think ideally, we would push the Vector trait up a level in the module hierarchy, port the existing x86 SIMD memchr implementation to a "generic" version, and then replace the existing implementations with shims that call out to the generic version.

This will hopefully let us easily add a WASM implementation of memchr, but adding other implementations in the future would be good too once more intrinsics (e.g., for ARM) are added to std.

(One wonders whether we should just wait for portable SIMD to land in std, but I don't know when that will happen.)

@alexcrichton
Contributor

Ok I got a bit overeager and started implementing this. I basically just copied sse2.rs to a genericsimd.rs and then started thwacking things until all the SSE-specific bits were gone and everything used a V: Vector. Some things I've seen:

  • I haven't benchmarked this yet. Codegen looks correct, though: there are no function calls.
  • This does not implement the avx2.rs optimization for the memchr* functions where matched is outlined as a cold/inline(never) function. (That wasn't in the initial sse2 translation, and I figured it'd show up eventually in benchmarks.)
  • The #[inline(always)] annotations on functions in genericsimd.rs are surprisingly required for correctness. Initially I omitted them or just used #[inline], but that actually led to test failures which went away when I added println! calls, making me suspect memory corruption and/or a compiler bug. I haven't narrowed this down, but I suspect there's a compiler bug lurking somewhere. The structure of the code differs from before in that none of the generic code has #[target_feature], but it's sandwiched between two layers that do (the top-level caller and the leaf intrinsic functions). The theory is that everything gets inlined into the top-level #[target_feature] caller, but this shouldn't be necessary in debug mode.

I'm not super happy with how the dispatch currently works in src/memchr/mod.rs. I suspect that a scheme like the one memmem uses would probably be better. This also doesn't implement the optimization where avx2 falls back to sse2 for short enough searches.

Porting this to wasm should be quite simple on top of #84, and in theory porting to AArch64 would also be super simple since now it's basically just an impl of Vector plus some boilerplate.
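To illustrate the "just an impl of Vector plus some boilerplate" point, here's a toy, purely illustrative backend: a u64 treated as eight one-byte lanes (SWAR). The Swar type is hypothetical and not real SIMD or anything in this crate; it only shows how small the surface area of a new backend is. (The trait is repeated so the sketch compiles on its own.)

```rust
// Illustrative only: a toy portable backend over a plain u64, showing that
// a new target needs nothing but another `Vector` impl. Not the crate's code.
trait Vector: Copy + core::fmt::Debug {
    unsafe fn splat(byte: u8) -> Self;
    unsafe fn load_unaligned(data: *const u8) -> Self;
    unsafe fn movemask(self) -> u32;
    unsafe fn cmpeq(self, vector2: Self) -> Self;
    unsafe fn and(self, vector2: Self) -> Self;
}

#[derive(Clone, Copy, Debug)]
struct Swar(u64); // eight one-byte lanes

impl Vector for Swar {
    unsafe fn splat(byte: u8) -> Self {
        Swar(u64::from_ne_bytes([byte; 8]))
    }
    unsafe fn load_unaligned(data: *const u8) -> Self {
        let mut buf = [0u8; 8];
        core::ptr::copy_nonoverlapping(data, buf.as_mut_ptr(), 8);
        // lane 0 = lowest byte, regardless of host endianness
        Swar(u64::from_le_bytes(buf))
    }
    unsafe fn movemask(self) -> u32 {
        // collect the high bit of each lane into the low 8 bits
        let b = self.0.to_le_bytes();
        (0..8).fold(0u32, |m, i| m | (((b[i] >> 7) as u32) << i))
    }
    unsafe fn cmpeq(self, vector2: Self) -> Self {
        // naive lane-by-lane compare; real SIMD does this in one instruction
        let (a, b) = (self.0.to_le_bytes(), vector2.0.to_le_bytes());
        let mut out = [0u8; 8];
        for i in 0..8 {
            out[i] = if a[i] == b[i] { 0xFF } else { 0x00 };
        }
        Swar(u64::from_le_bytes(out))
    }
    unsafe fn and(self, vector2: Self) -> Self {
        Swar(self.0 & vector2.0)
    }
}

fn main() {
    unsafe {
        let hay = *b"abcdefgh";
        let v = Swar::load_unaligned(hay.as_ptr());
        let mask = v.cmpeq(Swar::splat(b'e')).movemask();
        // 'e' is in lane 4, so bit 4 of the mask is set
        assert_eq!(mask.trailing_zeros(), 4);
    }
}
```

A wasm simd128 or AArch64 NEON impl would have the same five-method shape, with movemask being the only method that typically needs real thought on non-x86 targets (neither has a native movemask instruction).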

@alexcrichton
Contributor

Ok benchmarks. I did one run where I forcibly disabled avx, and this is the result:

$ critcmp {old,new}-sse -t 5 --filter krate
group                           new-sse                                old-sse
-----                           -------                                -------
memchr1/krate/empty/never       1.00      0.2±0.01ns        0 B/sec    1.98      0.5±0.01ns        0 B/sec
memchr1/krate/huge/never        1.05     16.5±0.28µs    33.6 GB/sec    1.00     15.7±0.57µs    35.3 GB/sec
memchr1/krate/tiny/never        1.00      5.1±0.12ns    12.5 GB/sec    1.11      5.7±0.11ns    11.3 GB/sec
memchr1/krate/tiny/rare         1.00      7.7±0.19ns     8.4 GB/sec    1.05      8.1±0.20ns     8.0 GB/sec
memchr1/krate/tiny/uncommon     1.00     16.8±0.35ns     3.8 GB/sec    1.14     19.1±0.33ns     3.4 GB/sec
memchr3/krate/huge/common       1.00   565.1±14.57µs  1004.0 MB/sec    1.05   594.9±18.69µs   953.8 MB/sec
memchr3/krate/small/uncommon    1.00    152.5±2.30ns     4.1 GB/sec    1.13   172.8±20.97ns     3.6 GB/sec
memchr3/krate/tiny/never        1.21      8.5±0.61ns     7.5 GB/sec    1.00      7.1±0.16ns     9.1 GB/sec

and then when I enabled avx:

$ critcmp {old,new}-avx -t 5
group                           new-avx                                old-avx
-----                           -------                                -------
memchr1/krate/empty/never       1.00      0.2±0.01ns        0 B/sec    2.01      0.5±0.02ns        0 B/sec
memchr1/krate/huge/common       1.05    227.6±5.76µs     2.4 GB/sec    1.00    216.3±6.52µs     2.6 GB/sec
memchr1/krate/huge/never        1.05     10.1±0.91µs    54.7 GB/sec    1.00      9.6±0.58µs    57.7 GB/sec
memchr1/krate/huge/uncommon     1.00     80.1±2.59µs     6.9 GB/sec    1.08     86.8±4.75µs     6.4 GB/sec
memchr1/krate/small/never       1.05      9.2±0.27ns    67.0 GB/sec    1.00      8.8±0.78ns    70.4 GB/sec
memchr1/krate/small/rare        1.06     13.0±0.24ns    47.7 GB/sec    1.00     12.2±0.19ns    50.7 GB/sec
memchr1/krate/small/uncommon    1.07     50.8±1.16ns    12.2 GB/sec    1.00     47.7±1.14ns    13.0 GB/sec
memchr1/krate/tiny/never        1.18      4.5±0.19ns    14.2 GB/sec    1.00      3.8±0.10ns    16.9 GB/sec
memchr1/krate/tiny/rare         1.26      7.4±0.19ns     8.7 GB/sec    1.00      5.9±0.18ns    11.0 GB/sec
memchr1/krate/tiny/uncommon     1.76     31.6±0.71ns     2.0 GB/sec    1.00     18.0±0.42ns     3.6 GB/sec
memchr2/krate/small/never       1.07     17.4±0.27ns    35.6 GB/sec    1.00     16.2±0.19ns    38.1 GB/sec
memchr2/krate/small/rare        1.00     24.6±0.65ns    25.1 GB/sec    1.10     27.2±0.88ns    22.8 GB/sec
memchr2/krate/tiny/never        1.00      5.0±0.07ns    12.9 GB/sec    1.13      5.6±0.28ns    11.4 GB/sec
memchr2/krate/tiny/rare         1.19     13.0±0.25ns     4.9 GB/sec    1.00     10.9±0.23ns     5.9 GB/sec
memchr2/krate/tiny/uncommon     1.11     61.4±1.18ns  1071.4 MB/sec    1.00     55.1±1.02ns  1193.9 MB/sec
memchr3/krate/huge/uncommon     1.06    228.8±5.87µs     2.4 GB/sec    1.00    216.4±5.42µs     2.6 GB/sec
memchr3/krate/small/never       1.08     18.6±0.37ns    33.3 GB/sec    1.00     17.2±0.31ns    36.1 GB/sec
memchr3/krate/small/rare        1.06     33.2±0.61ns    18.6 GB/sec    1.00     31.4±0.70ns    19.7 GB/sec
memchr3/krate/tiny/never        1.22      6.3±0.22ns    10.1 GB/sec    1.00      5.2±0.08ns    12.3 GB/sec
memchr3/krate/tiny/rare         1.18     19.6±0.47ns     3.3 GB/sec    1.00     16.7±0.36ns     3.9 GB/sec

I think the sse bits have transitioned well, but it looks like some tuning is needed for avx.

@BurntSushi
Owner Author

Closed by #129
