add aarch64 SIMD implementations of memchr and memmem (and other goodies) #129

Merged
merged 1 commit into from
Aug 28, 2023

@BurntSushi BurntSushi commented Aug 26, 2023

This PR doesn't just add aarch64-specific code; it also refactors pretty much everything about how the code is organized. There are big perf wins for aarch64 (see benchmark results below), and also latency improvements across the board. A brief summary of the changes in this PR:

  • I've added aarch64 NEON vector implementations for memchr, memrchr, memchr2, memrchr2, memchr3, memrchr3 and memmem. This should lead to massive speed improvements on an increasingly popular target, due in large part to Apple silicon.
  • I've added wasm32 simd128 vector implementations for memchr, memrchr, memchr2, memrchr2, memchr3 and memrchr3. (@alexcrichton previously contributed a vector implementation for memmem and that remains.)
  • x86_64 has no real additions other than the memchr_iter(needle, haystack).count() specialization. It already has SSE2 and AVX2 implementations of memchr (and friends) and memmem. It uses AVX2 automatically via runtime inspection of what the current CPU supports. There is no need to compile with the avx2 feature enabled.
  • I've replaced the benchmark suite using Criterion with a benchmark suite using rebar. While I designed rebar to be used for regex engines, it can be used for any substring or multi-substring search task.
  • I've added a new arch sub-module that exposes a lot of the internal routines (including target specific routines) used to implement memchr and memmem. This module is part of a major refactoring of how this crate is organized and it seemed prudent to expose the internals as their APIs are pretty straightforward. That is, there isn't a huge API design space IMO. This module includes scalar substring search implementations of Shift-Or, Rabin-Karp and Two-Way.
  • As a result of the refactoring mentioned above, most of the conditional compilation stuff has been pushed down and mostly abstracted away. Moreover, since each implementation now has its own proper API surface that is uniform across other implementations, each one can easily be tested independently. Because of this, I was able to remove the reliance on the variety of custom cfg knobs that the previous version of memchr set up in its build script. This in turn allowed me to remove the build script entirely. Given the ubiquity of this crate, this may lead to compile time improvements downstream. (Likely small in each individual case but perhaps large in aggregate.) I can't promise that a build script will never re-appear, but I'll try to resist adding one in the future if possible.
  • Despite the above, compile times for this crate have sadly seemed to increase slightly. Namely, a fresh time rebar build -e '^rust/memchrold/memmem/prebuilt$' reports 0.944 seconds on my system while a fresh time rebar build -e '^rust/memchr/memmem/prebuilt$' reports 1.164 seconds. This is on x86_64 where no real additional code was added. This could be because of the "nicer" abstractions now present in the arch sub-module or perhaps how the internals are structured. (Previously there were multiple monomorphic implementations of memchr for example and now there is a single generic implementation that is monomorphized automatically by the compiler via generics. Perhaps that is more expensive?)
  • I've specialized memchr_iter(needle, haystack).count() to use a different vector implementation that specifically only counts matches instead of reporting the offsets of each match. This can make huge (potentially over an order of magnitude) differences when counting the number of matches of a frequently (even semi-frequently) occurring byte in a large haystack. This is effectively what the bytecount crate does (which is what ripgrep currently uses to compute line numbers for matches), but the marginal cost of adding it to the memchr crate was very low. So I did. And I plan to move ripgrep to using memchr_iter(needle, haystack).count(). (Also, the benchmarks below suggest that the counting implementation I wrote is faster than the one in bytecount in some cases which look like they'll be relevant for ripgrep. This was surprising to me.)
  • I've added an alloc feature which permits compiling this crate without the standard library but with the alloc crate. This crate is designed through-and-through to work in a core-only context, so this doesn't unlock much compared to just disabling the std feature. It adds a couple of APIs requiring allocation (like memmem::Finder::into_owned) and other things like arch::all::shiftor, which really wants an allocation to store its bit-parallel state machine.
  • The libc feature is DEPRECATED and is now a no-op. I don't think there is any real benefit to it any more.
  • A new disabled-by-default logging feature has been added. When enabled, this crate will emit a smattering of log messages. Usually these messages are used to indicate what kind of strategy is selected. For example, whether a vector or scalar algorithm is used for substring search.
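The runtime AVX2 selection mentioned in the x86_64 bullet can be sketched roughly as follows. This is a minimal illustration that uses only the standard library and is not the crate's actual dispatch code: `pick_strategy` and the strategy names it returns are hypothetical, and the sketch assumes SSE2 is the x86_64 baseline and NEON is available on aarch64 targets.

```rust
// Hypothetical helper: names the vector strategy a search routine might pick.
// On x86_64 the AVX2 check happens at runtime, so the binary does not need to
// be compiled with the avx2 target feature enabled.
#[cfg(target_arch = "x86_64")]
fn pick_strategy() -> &'static str {
    if std::is_x86_feature_detected!("avx2") {
        "avx2"
    } else {
        // Assumption: SSE2 is part of the x86_64 baseline.
        "sse2"
    }
}

// Assumption: NEON is treated as always available on aarch64.
#[cfg(target_arch = "aarch64")]
fn pick_strategy() -> &'static str {
    "neon"
}

// Everything else falls back to a scalar implementation in this sketch.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn pick_strategy() -> &'static str {
    "scalar"
}

fn main() {
    println!("selected strategy: {}", pick_strategy());
}
```

A real implementation would likely cache the result of the check rather than repeat it on every call; that detail is left out here.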

Selected benchmark differences for x86_64

Differences across the board from the status quo. Showing only measurements with a 1.2x (or greater) difference.

$ rebar diff tmp/old.csv tmp/new.csv -t 1.2 -e memmem -E oneshot
benchmark                                         engine                       tmp/old.csv          tmp/new.csv
---------                                         ------                       -----------          -----------
memmem/code/rust-library-never-fn-strength        rust/memchr/memmem/prebuilt  42.8 GB/s (1.25x)    53.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  rust/memchr/memmem/prebuilt  40.8 GB/s (1.32x)    53.8 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux            rust/memchr/memmem/prebuilt  40.5 GB/s (1.37x)    55.6 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         rust/memchr/memmem/prebuilt  39.3 GB/s (1.37x)    53.8 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty       rust/memchr/memmem/prebuilt  40.5 GB/s (1.30x)    52.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                rust/memchr/memmem/prebuilt  21.6 GB/s (1.27x)    27.5 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     rust/memchr/memmem/prebuilt  40.9 GB/s (1.55x)    63.4 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match     rust/memchr/memmem/prebuilt  1468.7 MB/s (1.23x)  1811.4 MB/s (1.00x)
memmem/sliceslice/short                           rust/memchr/memmem/prebuilt  14.74ms (2.08x)      7.08ms (1.00x)
memmem/sliceslice/seemingly-random                rust/memchr/memmem/prebuilt  9.1 MB/s (1.23x)     11.2 MB/s (1.00x)
memmem/sliceslice/i386                            rust/memchr/memmem/prebuilt  41.4 MB/s (1.35x)    55.8 MB/s (1.00x)
memmem/subtitles/common/huge-en-you               rust/memchr/memmem/prebuilt  10.7 GB/s (1.26x)    13.5 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that              rust/memchr/memmem/prebuilt  25.2 GB/s (1.49x)    37.5 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson        rust/memchr/memmem/prebuilt  42.9 GB/s (1.48x)    63.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   rust/memchr/memmem/prebuilt  41.9 GB/s (1.26x)    52.7 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes  rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes   rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space         rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson        rust/memchr/memmem/prebuilt  40.6 GB/s (1.56x)    63.5 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson       rust/memchr/memmem/prebuilt  1741.5 MB/s (1.44x)  2.4 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson        rust/memchr/memmem/prebuilt  41.1 GB/s (1.46x)    59.9 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson       rust/memchr/memmem/prebuilt  1285.4 MB/s (1.53x)  1970.9 MB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes     rust/memchr/memmem/prebuilt  41.9 GB/s (1.52x)    63.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            rust/memchr/memmem/prebuilt  41.9 GB/s (1.46x)    61.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle       rust/memchr/memmem/prebuilt  38.3 GB/s (1.46x)    55.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle         rust/memchr/memmem/prebuilt  2.5 GB/s (17.34x)    44.0 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle         rust/memchr/memmem/prebuilt  2.3 GB/s (20.24x)    45.7 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    rust/memchr/memmem/prebuilt  1068.1 MB/s (1.47x)  1570.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock           rust/memchr/memmem/prebuilt  953.7 MB/s (1.27x)   1213.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    rust/memchr/memmem/prebuilt  1430.5 MB/s (1.47x)  2.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock           rust/memchr/memmem/prebuilt  1213.8 MB/s (1.32x)  1602.2 MB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes     rust/memchr/memmem/prebuilt  41.8 GB/s (1.33x)    55.5 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            rust/memchr/memmem/prebuilt  43.0 GB/s (1.38x)    59.4 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock           rust/memchr/memmem/prebuilt  895.9 MB/s (1.27x)   1137.1 MB/s (1.00x)
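For readers unfamiliar with the table format, the parenthesized ratios appear to express how far a measurement is from the best one in its row, with 1.00x marking the best. The sketch below recomputes two ratios from the rounded values printed above. The function names are hypothetical, and since rebar presumably works from unrounded measurements, recomputing from printed values only approximates what it reports.

```rust
/// Throughput: higher is better, so the ratio is best / other.
fn throughput_ratio(best: f64, other: f64) -> f64 {
    best / other
}

/// Duration: lower is better, so the ratio is other / best.
fn duration_ratio(best: f64, other: f64) -> f64 {
    other / best
}

fn main() {
    // Mirrors the `-t 1.2` flag: rows below this ratio are not shown.
    let threshold = 1.2;

    // memmem/code/rust-library-never-fn-strength: 42.8 GB/s vs 53.6 GB/s.
    let strength = throughput_ratio(53.6, 42.8);
    // memmem/sliceslice/short: 14.74ms vs 7.08ms.
    let short = duration_ratio(7.08, 14.74);

    for (name, ratio) in [("never-fn-strength", strength), ("sliceslice/short", short)] {
        println!("{name}: {ratio:.2}x (shown: {})", ratio >= threshold);
    }
}
```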

A comparison with the sliceslice crate for just substring search. We only include measurements with a 1.2x difference or greater.

$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e sliceslice/memmem/prebuilt -e rust/memchr/memmem/prebuilt -t 1.2
benchmark                                                   rust/memchr/memmem/prebuilt  rust/sliceslice/memmem/prebuilt
---------                                                   ---------------------------  -------------------------------
memmem/byterank/binary                                      4.4 GB/s (1.32x)             5.8 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  53.6 GB/s (1.00x)            39.8 GB/s (1.35x)
memmem/code/rust-library-never-fn-strength-paren            53.8 GB/s (1.00x)            39.7 GB/s (1.35x)
memmem/code/rust-library-never-fn-quux                      55.6 GB/s (1.00x)            38.7 GB/s (1.44x)
memmem/code/rust-library-rare-fn-from-str                   53.8 GB/s (2.65x)            142.7 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        50.1 GB/s (1.00x)            25.7 GB/s (1.95x)
memmem/pathological/md5-huge-last-hash                      47.6 GB/s (1.00x)            27.7 GB/s (1.72x)
memmem/pathological/rare-repeated-huge-tricky               63.4 GB/s (1.00x)            41.9 GB/s (1.51x)
memmem/pathological/rare-repeated-small-tricky              25.2 GB/s (1.32x)            33.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           4.1 GB/s (1.65x)             6.7 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.00x)            2.6 GB/s (7.33x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  1234.5 MB/s (1.00x)          508.7 MB/s (2.43x)
memmem/sliceslice/short                                     7.08ms (1.00x)               14.10ms (1.99x)
memmem/sliceslice/i386                                      55.8 MB/s (1.00x)            39.6 MB/s (1.41x)
memmem/subtitles/never/huge-en-john-watson                  63.6 GB/s (1.00x)            41.7 GB/s (1.53x)
memmem/subtitles/never/huge-en-all-common-bytes             52.7 GB/s (1.00x)            42.6 GB/s (1.24x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (2.17x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  63.5 GB/s (1.00x)            12.7 GB/s (4.99x)
memmem/subtitles/never/teeny-ru-john-watson                 2.4 GB/s (1.23x)             3.0 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  59.9 GB/s (1.00x)            41.1 GB/s (1.46x)
memmem/subtitles/never/teeny-zh-john-watson                 1970.9 MB/s (1.25x)          2.4 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               63.5 GB/s (1.00x)            41.6 GB/s (1.53x)
memmem/subtitles/rare/huge-en-sherlock                      61.3 GB/s (1.00x)            43.0 GB/s (1.42x)
memmem/subtitles/rare/huge-en-medium-needle                 55.9 GB/s (1.00x)            25.7 GB/s (2.17x)
memmem/subtitles/rare/huge-en-long-needle                   44.0 GB/s (1.00x)            25.9 GB/s (1.70x)
memmem/subtitles/rare/huge-en-huge-needle                   45.7 GB/s (1.00x)            29.3 GB/s (1.56x)
memmem/subtitles/rare/teeny-en-sherlock                     1213.8 MB/s (1.37x)          1668.9 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.7 GB/s (1.00x)            15.2 GB/s (2.67x)
memmem/subtitles/rare/teeny-ru-sherlock                     1602.2 MB/s (1.56x)          2.4 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               55.5 GB/s (1.00x)            26.6 GB/s (2.09x)
memmem/subtitles/rare/huge-zh-sherlock                      59.4 GB/s (1.00x)            42.4 GB/s (1.40x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1055.9 MB/s (1.87x)          1970.9 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1137.1 MB/s (1.86x)          2.1 GB/s (1.00x)

Differences between this crate's substring search implementation and memmem as provided by GNU libc. Showing only measurements with a 2x difference or greater.

$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e libc/memmem/oneshot -e rust/memchr/memmem/oneshot -t 2
benchmark                                         libc/memmem/oneshot  rust/memchr/memmem/oneshot
---------                                         -------------------  --------------------------
memmem/code/rust-library-never-fn-strength        11.4 GB/s (4.75x)    54.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  12.4 GB/s (4.36x)    54.0 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux            8.1 GB/s (6.91x)     55.8 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         15.0 GB/s (3.59x)    53.8 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty       12.5 GB/s (4.16x)    51.9 GB/s (1.00x)
memmem/code/rust-library-common-fn                2.2 GB/s (5.89x)     13.0 GB/s (1.00x)
memmem/code/rust-library-common-let               3.2 GB/s (2.65x)     8.5 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     17.8 GB/s (3.56x)    63.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match      718.0 MB/s (1.00x)   289.1 MB/s (2.48x)
memmem/pathological/rare-repeated-small-match     707.1 MB/s (1.00x)   303.1 MB/s (2.33x)
memmem/subtitles/common/huge-en-that              3.7 GB/s (4.22x)     15.7 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space         1543.9 MB/s (1.00x)  541.6 MB/s (2.85x)
memmem/subtitles/common/huge-ru-that              2.7 GB/s (4.22x)     11.6 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not               2.0 GB/s (2.47x)     5.0 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space         2.9 GB/s (1.00x)     1081.0 MB/s (2.71x)
memmem/subtitles/common/huge-zh-that              4.2 GB/s (3.20x)     13.4 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not            2.6 GB/s (2.40x)     6.3 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space         5.7 GB/s (1.00x)     2.4 GB/s (2.38x)
memmem/subtitles/never/huge-en-john-watson        15.4 GB/s (4.12x)    63.3 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   11.9 GB/s (4.41x)    52.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes    11.0 GB/s (5.77x)    63.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space          2.3 GB/s (27.77x)    63.5 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson        5.2 GB/s (11.56x)    59.9 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson        20.7 GB/s (2.86x)    59.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes     17.0 GB/s (3.71x)    63.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            11.8 GB/s (5.18x)    60.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle         19.3 GB/s (2.02x)    38.9 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes     6.5 GB/s (9.47x)     61.5 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock            3.8 GB/s (16.23x)    61.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            10.8 GB/s (5.48x)    59.1 GB/s (1.00x)

Differences with the bytecount crate as memchr_iter(needle, haystack).count() is now specialized to its own vector implementation just for counting the number of matches (instead of reporting the offset of each match). The throughput improvements as compared to bytecount on large haystacks are most interesting IMO. (I was somewhat surprised by this, as bytecount seems to do something clever while memchr_iter(needle, haystack).count() is basically just memchr but with the branching for reporting matches removed.) Either way, I expect this to translate directly to improvements in ripgrep, although I haven't measured that yet.

$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e '^rust/bytecount/memchr/oneshot$' -e '^rust/memchr/memchr/onlycount$'
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       28.5 GB/s (1.94x)              55.3 GB/s (1.00x)
memchr/sherlock/common/small1      17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/common/tiny1       4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/huge1        28.4 GB/s (2.09x)              59.3 GB/s (1.00x)
memchr/sherlock/never/small1       17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/never/tiny1        4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/empty1       11.00ns (1.00x)                11.00ns (1.00x)
memchr/sherlock/rare/huge1         28.5 GB/s (1.94x)              55.2 GB/s (1.00x)
memchr/sherlock/rare/small1        17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/rare/tiny1         4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/uncommon/huge1     26.9 GB/s (2.20x)              59.3 GB/s (1.00x)
memchr/sherlock/uncommon/small1    17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/verycommon/huge1   28.4 GB/s (2.09x)              59.3 GB/s (1.00x)
memchr/sherlock/verycommon/small1  17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
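To illustrate the idea of "memchr but with the branching for reporting matches removed", here is a scalar sketch using only the standard library. It is not the crate's vector implementation: `find_offsets` and `count_only` are hypothetical names, and the 16-byte chunk size is an assumption chosen to resemble a 128-bit vector.

```rust
/// Reports the offset of every match. Each match is an event that has to be
/// materialized and handed back to the caller.
fn find_offsets(needle: u8, haystack: &[u8]) -> Vec<usize> {
    haystack
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}

/// Counts matches without ever producing an offset. The inner loop has no
/// match-dependent branch, so a compiler is free to vectorize it.
fn count_only(needle: u8, haystack: &[u8]) -> usize {
    let mut total = 0usize;
    let mut chunks = haystack.chunks_exact(16);
    for chunk in &mut chunks {
        // At most 16 matches per chunk, so a u8 accumulator cannot overflow.
        let mut chunk_sum = 0u8;
        for &b in chunk {
            chunk_sum += (b == needle) as u8;
        }
        total += chunk_sum as usize;
    }
    // Handle the tail that does not fill a whole chunk.
    for &b in chunks.remainder() {
        total += (b == needle) as usize;
    }
    total
}

fn main() {
    let haystack = b"the quick brown fox jumps over the lazy dog";
    println!("offsets: {:?}", find_offsets(b'o', haystack));
    println!("count:   {}", count_only(b'o', haystack));
}
```

Both functions agree on the number of matches; the counting variant simply never does the per-match work that a caller interested only in the total would throw away.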

Selected benchmark differences for aarch64

Differences across the board from the status quo. Note that here, I've only included measurements with a 4x difference from the old memchr crate. Otherwise, pretty much every benchmark has a pretty sizeable improvement from the old version. (Because previously, aarch64 had no vector implementations at all.)

$ rebar diff tmp/old-aarch64.csv tmp/new-aarch64.csv -t 4 -E oneshot
benchmark                                         engine                       tmp/old-aarch64.csv   tmp/new-aarch64.csv
---------                                         ------                       -------------------   -------------------
memchr/sherlock/never/huge2                       rust/memchr/memchr2          10.8 GB/s (4.27x)     46.3 GB/s (1.00x)
memchr/sherlock/never/small1                      rust/memchr/memchr/prebuilt  15.1 GB/s (41.00x)    618.4 GB/s (1.00x)
memchr/sherlock/never/small1                      rust/memchr/memrchr          14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/never/small2                      rust/memchr/memchr2          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small2                      rust/memchr/memrchr2         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small3                      rust/memchr/memchr3          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small3                      rust/memchr/memrchr3         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/rare/small1                       rust/memchr/memchr/prebuilt  14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/rare/small1                       rust/memchr/memrchr          14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/rare/small2                       rust/memchr/memchr2          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/rare/small2                       rust/memchr/memrchr2         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1                    rust/memchr/memchr/prebuilt  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1                    rust/memchr/memrchr          1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength        rust/memchr/memmem/prebuilt  7.1 GB/s (4.17x)      29.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  rust/memchr/memmem/prebuilt  6.9 GB/s (4.19x)      29.0 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         rust/memchr/memmem/prebuilt  6.5 GB/s (4.42x)      28.7 GB/s (1.00x)
memmem/code/rust-library-common-fn                rust/memchr/memmem/prebuilt  3.2 GB/s (5.58x)      18.0 GB/s (1.00x)
memmem/code/rust-library-common-let               rust/memchr/memmem/prebuilt  2012.9 MB/s (6.45x)   12.7 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash              rust/memchr/memmem/prebuilt  1070.2 MB/s (24.69x)  25.8 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash            rust/memchr/memmem/prebuilt  1148.2 MB/s (22.85x)  25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     rust/memchr/memmem/prebuilt  1299.3 MB/s (23.87x)  30.3 GB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky    rust/memchr/memmem/prebuilt  1146.0 MB/s (19.83x)  22.2 GB/s (1.00x)
memmem/sliceslice/seemingly-random                rust/memchr/memmem/prebuilt  1485.7 KB/s (4.13x)   6.0 MB/s (1.00x)
memmem/sliceslice/i386                            rust/memchr/memmem/prebuilt  6.0 MB/s (5.07x)      30.3 MB/s (1.00x)
memmem/subtitles/common/huge-en-that              rust/memchr/memmem/prebuilt  1418.2 MB/s (11.50x)  15.9 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that              rust/memchr/memmem/prebuilt  1389.1 MB/s (13.44x)  18.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not               rust/memchr/memmem/prebuilt  1482.7 MB/s (7.06x)   10.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   rust/memchr/memmem/prebuilt  1813.7 MB/s (12.81x)  22.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space          rust/memchr/memmem/prebuilt  1370.2 MB/s (25.23x)  33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space         rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            rust/memchr/memmem/prebuilt  7.0 GB/s (4.40x)      30.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle       rust/memchr/memmem/prebuilt  6.4 GB/s (4.43x)      28.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle         rust/memchr/memmem/prebuilt  7.1 GB/s (4.64x)      32.8 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock           rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    rust/memchr/memmem/prebuilt  953.7 MB/s (42.00x)   39.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock           rust/memchr/memmem/prebuilt  976.9 MB/s (41.00x)   39.1 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes     rust/memchr/memmem/prebuilt  4.1 GB/s (7.06x)      28.8 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            rust/memchr/memmem/prebuilt  6.1 GB/s (4.81x)      29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    rust/memchr/memmem/prebuilt  721.1 MB/s (41.00x)   28.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock           rust/memchr/memmem/prebuilt  721.1 MB/s (41.00x)   28.9 GB/s (1.00x)

A comparison with the sliceslice crate, which has its own custom aarch64 vector implementation of substring search. We only show measurements with 1.2x or greater difference.

$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e sliceslice/memmem/prebuilt -e rust/memchr/memmem/prebuilt -t 1.2
benchmark                                                   rust/memchr/memmem/prebuilt  rust/sliceslice/memmem/prebuilt
---------                                                   ---------------------------  -------------------------------
memmem/byterank/binary                                      3.1 GB/s (1.00x)             1586.4 MB/s (2.01x)
memmem/code/rust-library-never-fn-strength                  29.6 GB/s (1.00x)            16.1 GB/s (1.84x)
memmem/code/rust-library-never-fn-strength-paren            29.0 GB/s (1.00x)            15.6 GB/s (1.86x)
memmem/code/rust-library-never-fn-quux                      30.2 GB/s (1.00x)            15.1 GB/s (2.00x)
memmem/code/rust-library-rare-fn-from-str                   28.7 GB/s (1.93x)            55.5 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        25.8 GB/s (1.00x)            13.6 GB/s (1.89x)
memmem/pathological/md5-huge-last-hash                      25.6 GB/s (1.00x)            13.5 GB/s (1.90x)
memmem/pathological/rare-repeated-huge-tricky               30.3 GB/s (1.00x)            16.6 GB/s (1.83x)
memmem/pathological/rare-repeated-small-tricky              22.2 GB/s (1.00x)            11.2 GB/s (1.98x)
memmem/pathological/defeat-simple-vector-alphabet           3.0 GB/s (1.00x)             1114.1 MB/s (2.77x)
memmem/pathological/defeat-simple-vector-freq-alphabet      14.8 GB/s (1.00x)            2.2 GB/s (6.72x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  835.1 MB/s (1.00x)           173.8 MB/s (4.80x)
memmem/sliceslice/short                                     7.33ms (1.00x)               36.55ms (4.99x)
memmem/sliceslice/seemingly-random                          6.0 MB/s (1.00x)             3.6 MB/s (1.67x)
memmem/sliceslice/i386                                      30.3 MB/s (1.00x)            15.1 MB/s (2.00x)
memmem/subtitles/never/huge-en-john-watson                  30.9 GB/s (1.00x)            16.6 GB/s (1.86x)
memmem/subtitles/never/huge-en-all-common-bytes             22.7 GB/s (1.00x)            13.8 GB/s (1.64x)
memmem/subtitles/never/huge-en-some-rare-bytes              30.9 GB/s (1.00x)            16.6 GB/s (1.86x)
memmem/subtitles/never/huge-en-two-space                    33.8 GB/s (1.00x)            16.6 GB/s (2.03x)
memmem/subtitles/never/huge-ru-john-watson                  30.3 GB/s (1.00x)            7.1 GB/s (4.25x)
memmem/subtitles/never/huge-zh-john-watson                  29.2 GB/s (1.00x)            16.0 GB/s (1.83x)
memmem/subtitles/rare/huge-en-sherlock-holmes               30.3 GB/s (1.00x)            16.3 GB/s (1.86x)
memmem/subtitles/rare/huge-en-sherlock                      30.6 GB/s (1.00x)            16.6 GB/s (1.85x)
memmem/subtitles/rare/huge-en-medium-needle                 28.3 GB/s (1.00x)            12.4 GB/s (2.28x)
memmem/subtitles/rare/huge-en-long-needle                   32.8 GB/s (1.00x)            15.7 GB/s (2.08x)
memmem/subtitles/rare/huge-en-huge-needle                   32.9 GB/s (1.00x)            16.1 GB/s (2.05x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               30.3 GB/s (1.00x)            8.0 GB/s (3.80x)
memmem/subtitles/rare/huge-ru-sherlock                      30.2 GB/s (1.00x)            10.1 GB/s (3.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               28.8 GB/s (1.00x)            14.7 GB/s (1.95x)
memmem/subtitles/rare/huge-zh-sherlock                      29.6 GB/s (1.00x)            14.0 GB/s (2.12x)

Differences between this crate's substring search implementation and memmem as provided by macOS's libc. Showing only measurements with a 2x difference or greater. This is what utter destruction looks like. (I'm not sure what's going on in benchmarks like memmem/subtitles/rare/teeny-en-sherlock-holmes. It's a tiny haystack and macOS seems to either measure 1ns or 41ns. I wonder if there's something odd about time precision on macOS? You can see the reverse happen in memmem/subtitles/rare/teeny-zh-sherlock.)

$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e libc/memmem/oneshot -e rust/memchr/memmem/oneshot -t 2
benchmark                                                   libc/memmem/oneshot   rust/memchr/memmem/oneshot
---------                                                   -------------------   --------------------------
memmem/byterank/binary                                      626.1 MB/s (5.11x)    3.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  1320.8 MB/s (22.98x)  29.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            1320.8 MB/s (22.49x)  29.0 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      1332.0 MB/s (23.25x)  30.2 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   1442.0 MB/s (20.37x)  28.7 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 1320.8 MB/s (22.02x)  28.4 GB/s (1.00x)
memmem/code/rust-library-common-fn                          1320.8 MB/s (11.44x)  14.8 GB/s (1.00x)
memmem/code/rust-library-common-let                         1114.7 MB/s (8.59x)   9.4 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        994.0 MB/s (26.39x)   25.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      994.3 MB/s (26.39x)   25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               1670.8 MB/s (18.56x)  30.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1353.0 MB/s (1.00x)   378.5 MB/s (3.57x)
memmem/pathological/rare-repeated-small-tricky              1637.4 MB/s (13.88x)  22.2 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1348.3 MB/s (1.00x)   394.5 MB/s (3.42x)
memmem/pathological/defeat-simple-vector-alphabet           568.1 MB/s (5.43x)    3.0 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      1027.2 MB/s (14.55x)  14.6 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  173.8 MB/s (4.80x)    834.2 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        841.6 MB/s (13.19x)   10.8 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         1161.7 MB/s (4.00x)   4.5 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        590.9 MB/s (19.48x)   11.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         334.3 MB/s (18.62x)   6.1 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        1340.1 MB/s (11.49x)  15.0 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      858.5 MB/s (9.15x)    7.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  1648.3 MB/s (19.14x)  30.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             1075.4 MB/s (21.65x)  22.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              1655.7 MB/s (19.10x)  30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    541.6 MB/s (63.83x)   33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  427.0 MB/s (72.56x)   30.3 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  1155.4 MB/s (25.81x)  29.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               1577.4 MB/s (19.60x)  30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      1577.4 MB/s (19.78x)  30.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 1155.6 MB/s (24.95x)  28.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   1488.8 MB/s (20.77x)  30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   1609.5 MB/s (17.27x)  27.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              26.1 GB/s (1.00x)     651.3 MB/s (41.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               427.0 MB/s (72.41x)   30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      348.2 MB/s (91.21x)   31.0 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               955.8 MB/s (31.66x)   29.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      853.4 MB/s (35.46x)   29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              28.9 GB/s (1.00x)     721.1 MB/s (41.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     721.1 MB/s (41.00x)   28.9 GB/s (1.00x)

Differences with the bytecount crate, since memchr_iter(needle, haystack).count() is now specialized to its own vector implementation that only counts the number of matches (instead of reporting the offset of each match).

$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e '^rust/bytecount/memchr/oneshot$' -e '^rust/memchr/memchr/onlycount$'
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/common/small1      618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/common/tiny1       64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/never/huge1        29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/never/small1       618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/never/tiny1        64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/never/empty1       1.00ns (1.00x)                 1.00ns (1.00x)
memchr/sherlock/rare/huge1         29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/rare/small1        618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/rare/tiny1         64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/uncommon/small1    618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   28.7 GB/s (1.44x)              41.4 GB/s (1.00x)
memchr/sherlock/verycommon/small1  618.4 GB/s (1.00x)             618.4 GB/s (1.00x)

Selected Benchmark differences for regex on aarch64

This shows benchmark results before and after this change for the regex crate. We only show results with a difference of 1.2x or greater.

$ rebar diff tmp/before.csv tmp/after.csv -t 1.2
benchmark                                     engine      tmp/before.csv       tmp/after.csv
---------                                     ------      --------------       -------------
curated/01-literal/sherlock-en                rust/regex  9.4 GB/s (2.61x)     24.5 GB/s (1.00x)
curated/01-literal/sherlock-ru                rust/regex  15.9 GB/s (1.57x)    25.0 GB/s (1.00x)
curated/01-literal/sherlock-zh                rust/regex  3.3 GB/s (8.69x)     28.9 GB/s (1.00x)
curated/04-ruff-noqa/real                     rust/regex  994.2 MB/s (1.37x)   1365.4 MB/s (1.00x)
curated/04-ruff-noqa/tweaked                  rust/regex  941.9 MB/s (1.39x)   1313.4 MB/s (1.00x)
curated/06-cloud-flare-redos/simplified-long  rust/regex  20.3 GB/s (2.75x)    55.8 GB/s (1.00x)
curated/09-aws-keys/quick                     rust/regex  1111.4 MB/s (1.36x)  1511.9 MB/s (1.00x)

Improvements to ripgrep on aarch64

In short, simple ripgrep searches (likely the most common kind) get about twice as fast on Apple silicon now.

$ hyperfine --warmup 5 "rg-memchr-2.5.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en" "rg-memchr-2.6.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en"

Benchmark 1: rg-memchr-2.5.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en
  Time (mean ± σ):      1.251 s ±  0.001 s    [User: 0.840 s, System: 0.410 s]
  Range (min … max):    1.248 s …  1.253 s    10 runs

Benchmark 2: rg-memchr-2.6.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en
  Time (mean ± σ):     655.7 ms ±   1.4 ms    [User: 259.3 ms, System: 396.0 ms]
  Range (min … max):   654.7 ms … 659.5 ms    10 runs

Summary
  rg-memchr-2.6.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en ran
    1.91 ± 0.00 times faster than rg-memchr-2.5.0 -c 'Sherlock Holmes' OpenSubtitles2018.half.en

@BurntSushi BurntSushi changed the title from "add aarch64 SIMD implementations of memchr and memmem" to "add aarch64 SIMD implementations of memchr and memmem (and other goodies)" Aug 27, 2023
@BurntSushi BurntSushi force-pushed the ag/refactor branch 5 times, most recently from cfee283 to 49c143f August 28, 2023 11:23

This PR doesn't just add `aarch64`-specific code, but it refactors pretty
much everything about how the code is organized. There are big perf wins for
`aarch64` (see benchmark results below), and also latency improvements across
the board. A brief summary of the changes in this PR:

* I've added `aarch64` NEON vector implementations for `memchr`,
`memrchr`, `memchr2`, `memrchr2`, `memchr3`, `memrchr3` and `memmem`.
This should lead to massive speed improvements on an increasingly popular
target, due in large part to Apple silicon.
* I've added `wasm32` simd128 vector implementations for `memchr`,
`memrchr`, `memchr2`, `memrchr2`, `memchr3` and `memrchr3`.
(alexcrichton previously contributed a vector implementation for
`memmem` and that remains.)
* `x86_64` has no real additions other than the `memchr_iter(needle,
haystack).count()` specialization. It already has SSE2 and AVX2
implementations of `memchr` (and friends) and `memmem`. It uses AVX2
automatically via runtime inspection of what the current CPU supports.
There is no need to compile with the `avx2` feature enabled.
* I've replaced the benchmark suite using Criterion with a
benchmark suite using [rebar](https://github.com/BurntSushi/rebar).
While I designed rebar to be used for regex engines, it
can be used for [any substring or multi-substring search
task](https://github.com/BurntSushi/rebar/blob/45afe89f437173d2dd970fee7d7f1db5d0e05588/BYOB.md).
* I've added a new `arch` sub-module that exposes a lot of the internal
routines (including target-specific routines) used to implement
`memchr` and `memmem`. This module is part of a major refactoring
of how this crate is organized and it seemed prudent to expose the
internals as their APIs are pretty straightforward. That is, there
isn't a huge API design space IMO. This module includes scalar
substring search implementations of Shift-Or, Rabin-Karp and Two-Way.
* As a result of the refactoring mentioned above, most of the
conditional compilation stuff has been pushed down and mostly
abstracted away. Moreover, since each implementation now has its own
proper API surface that is uniform across other implementations, each
thing can be easily independently tested. Because of this, I was able
to remove a reliance on the variety of custom `cfg` knobs that the
previous version of `memchr` set up in its build script. This in turn
**allowed me to remove the build script entirely.** Given the ubiquity
of this crate, this may lead to compile time improvements downstream.
(Likely small in each individual case but perhaps large in aggregate.)
I can't promise that a build script will never re-appear, but I'll try
to resist adding one in the future if possible.
* Despite the above, compile times for this crate have sadly
seemed to increase slightly. Namely, a fresh `time rebar build -e
'^rust/memchrold/memmem/prebuilt$'` reports 0.944 seconds on my system
while a fresh `time rebar build -e '^rust/memchr/memmem/prebuilt$'`
reports 1.164 seconds. This is on `x86_64` where no real additional
code was added. This could be because of the "nicer" abstractions
now present in the `arch` sub-module or perhaps how the internals
are structured. (Previously there were multiple monomorphic
implementations of `memchr` for example and now there is a single
generic implementation that is monomorphized automatically by the
compiler via generics. Perhaps that is more expensive?)
* I've specialized `memchr_iter(needle, haystack).count()` to use
a different vector implementation that specifically only counts
matches instead of reporting the offsets of each match. This can make
*huge* (potentially over an order of magnitude) differences when
counting the number of matches of a frequently (even semi-frequently)
occurring byte in a large haystack. This is effectively what the
[`bytecount`](https://crates.io/crates/bytecount) crate does (which
is what ripgrep currently uses to compute line numbers for matches),
but the marginal cost of adding it to the `memchr` crate was very low.
So I did. And I plan to move ripgrep to using `memchr_iter(needle,
haystack).count()`. (Also, the benchmarks below suggest that the
counting implementation I wrote is faster than the one in `bytecount`
in some cases which look like they'll be relevant for ripgrep. This was
surprising to me.)
* I've added an `alloc` feature which permits compiling this
crate without the standard library but with the `alloc` crate.
This crate is designed through-and-through to work in a core-only
context, so this doesn't unlock much compared to just disabling
the `std` feature. It adds a couple of APIs requiring allocation
(like `memmem::Finder::into_owned`) and other things like
`arch::all::shiftor` which really want an allocation to store its
bit-parallel state machine.
* The `libc` feature is **DEPRECATED** and is now a no-op. I don't
think there is any real benefit to it any more.
* A new disabled-by-default `logging` feature has been added. When
enabled, this crate will emit a smattering of log messages. Usually
these messages are used to indicate what kind of strategy is selected.
For example, whether a vector or scalar algorithm is used for substring
search.
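
The runtime CPU inspection mentioned in the `x86_64` bullet above can be sketched with the standard library's feature-detection macros. This is only an illustration and not the crate's actual dispatch code: `select_strategy` and the strategy names it returns are hypothetical.

```rust
// Hypothetical helper: reports which search strategy a dispatcher might
// pick on the current CPU. Only the detection macros are real std APIs.

#[cfg(target_arch = "x86_64")]
fn select_strategy() -> &'static str {
    // AVX2 is optional on x86_64, so it has to be detected at runtime.
    if std::arch::is_x86_feature_detected!("avx2") {
        "avx2"
    } else {
        // SSE2 is part of the x86_64 baseline, so it is always available.
        "sse2"
    }
}

#[cfg(target_arch = "aarch64")]
fn select_strategy() -> &'static str {
    if std::arch::is_aarch64_feature_detected!("neon") {
        "neon"
    } else {
        "scalar"
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn select_strategy() -> &'static str {
    // Assumption for this sketch: other targets fall back to scalar code.
    "scalar"
}
```

Because the check happens when the program runs, a binary built without the `avx2` target feature can still take the AVX2 path on CPUs that support it.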

Differences across the board from the status quo. Showing only
measurements with a 1.2x (or greater) difference.

```
$ rebar diff tmp/old.csv tmp/new.csv -t 1.2 -e memmem -E oneshot
benchmark                                         engine                       tmp/old.csv          tmp/new.csv
---------                                         ------                       -----------          -----------
memmem/code/rust-library-never-fn-strength        rust/memchr/memmem/prebuilt  42.8 GB/s (1.25x)    53.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  rust/memchr/memmem/prebuilt  40.8 GB/s (1.32x)    53.8 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux            rust/memchr/memmem/prebuilt  40.5 GB/s (1.37x)    55.6 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         rust/memchr/memmem/prebuilt  39.3 GB/s (1.37x)    53.8 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty       rust/memchr/memmem/prebuilt  40.5 GB/s (1.30x)    52.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                rust/memchr/memmem/prebuilt  21.6 GB/s (1.27x)    27.5 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     rust/memchr/memmem/prebuilt  40.9 GB/s (1.55x)    63.4 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match     rust/memchr/memmem/prebuilt  1468.7 MB/s (1.23x)  1811.4 MB/s (1.00x)
memmem/sliceslice/short                           rust/memchr/memmem/prebuilt  14.74ms (2.08x)      7.08ms (1.00x)
memmem/sliceslice/seemingly-random                rust/memchr/memmem/prebuilt  9.1 MB/s (1.23x)     11.2 MB/s (1.00x)
memmem/sliceslice/i386                            rust/memchr/memmem/prebuilt  41.4 MB/s (1.35x)    55.8 MB/s (1.00x)
memmem/subtitles/common/huge-en-you               rust/memchr/memmem/prebuilt  10.7 GB/s (1.26x)    13.5 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that              rust/memchr/memmem/prebuilt  25.2 GB/s (1.49x)    37.5 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson        rust/memchr/memmem/prebuilt  42.9 GB/s (1.48x)    63.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   rust/memchr/memmem/prebuilt  41.9 GB/s (1.26x)    52.7 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes  rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes   rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space         rust/memchr/memmem/prebuilt  1161.0 MB/s (1.53x)  1780.2 MB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson        rust/memchr/memmem/prebuilt  40.6 GB/s (1.56x)    63.5 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson       rust/memchr/memmem/prebuilt  1741.5 MB/s (1.44x)  2.4 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson        rust/memchr/memmem/prebuilt  41.1 GB/s (1.46x)    59.9 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson       rust/memchr/memmem/prebuilt  1285.4 MB/s (1.53x)  1970.9 MB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes     rust/memchr/memmem/prebuilt  41.9 GB/s (1.52x)    63.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            rust/memchr/memmem/prebuilt  41.9 GB/s (1.46x)    61.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle       rust/memchr/memmem/prebuilt  38.3 GB/s (1.46x)    55.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle         rust/memchr/memmem/prebuilt  2.5 GB/s (17.34x)    44.0 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle         rust/memchr/memmem/prebuilt  2.3 GB/s (20.24x)    45.7 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    rust/memchr/memmem/prebuilt  1068.1 MB/s (1.47x)  1570.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock           rust/memchr/memmem/prebuilt  953.7 MB/s (1.27x)   1213.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    rust/memchr/memmem/prebuilt  1430.5 MB/s (1.47x)  2.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock           rust/memchr/memmem/prebuilt  1213.8 MB/s (1.32x)  1602.2 MB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes     rust/memchr/memmem/prebuilt  41.8 GB/s (1.33x)    55.5 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            rust/memchr/memmem/prebuilt  43.0 GB/s (1.38x)    59.4 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock           rust/memchr/memmem/prebuilt  895.9 MB/s (1.27x)   1137.1 MB/s (1.00x)
```
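
The parenthesized figures in the table above can be read with a small sketch like the following. It assumes each figure is the best measurement in the row divided by that column's measurement, which agrees with the rows shown (for example, 53.6 / 42.8 ≈ 1.25) but is inferred from the output rather than taken from rebar's source. `ratio` and `exceeds_threshold` are hypothetical helper names, and the `>=` comparison is likewise an assumption about how `-t` filters rows.

```rust
/// Ratio of the best throughput in a row to this column's throughput.
/// A value of 1.00 means this column is the best one in its row.
fn ratio(best: f64, this: f64) -> f64 {
    best / this
}

/// Whether a row with these two throughputs would be kept by a threshold
/// such as `-t 1.2`, under the assumptions stated above.
fn exceeds_threshold(old: f64, new: f64, threshold: f64) -> bool {
    let (best, worst) = if old > new { (old, new) } else { (new, old) };
    ratio(best, worst) >= threshold
}
```

For rows reported as times instead of throughput (such as `memmem/sliceslice/short`), the same reading applies with the roles reversed: 14.74ms / 7.08ms ≈ 2.08.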

A comparison with the
[`sliceslice`](https://crates.io/crates/sliceslice) crate for just
substring search. We only include measurements with a 1.2x difference
or greater.

```
$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e sliceslice/memmem/prebuilt -e rust/memchr/memmem/prebuilt -t 1.2
benchmark                                                   rust/memchr/memmem/prebuilt  rust/sliceslice/memmem/prebuilt
---------                                                   ---------------------------  -------------------------------
memmem/byterank/binary                                      4.4 GB/s (1.32x)             5.8 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  53.6 GB/s (1.00x)            39.8 GB/s (1.35x)
memmem/code/rust-library-never-fn-strength-paren            53.8 GB/s (1.00x)            39.7 GB/s (1.35x)
memmem/code/rust-library-never-fn-quux                      55.6 GB/s (1.00x)            38.7 GB/s (1.44x)
memmem/code/rust-library-rare-fn-from-str                   53.8 GB/s (2.65x)            142.7 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        50.1 GB/s (1.00x)            25.7 GB/s (1.95x)
memmem/pathological/md5-huge-last-hash                      47.6 GB/s (1.00x)            27.7 GB/s (1.72x)
memmem/pathological/rare-repeated-huge-tricky               63.4 GB/s (1.00x)            41.9 GB/s (1.51x)
memmem/pathological/rare-repeated-small-tricky              25.2 GB/s (1.32x)            33.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           4.1 GB/s (1.65x)             6.7 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.00x)            2.6 GB/s (7.33x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  1234.5 MB/s (1.00x)          508.7 MB/s (2.43x)
memmem/sliceslice/short                                     7.08ms (1.00x)               14.10ms (1.99x)
memmem/sliceslice/i386                                      55.8 MB/s (1.00x)            39.6 MB/s (1.41x)
memmem/subtitles/never/huge-en-john-watson                  63.6 GB/s (1.00x)            41.7 GB/s (1.53x)
memmem/subtitles/never/huge-en-all-common-bytes             52.7 GB/s (1.00x)            42.6 GB/s (1.24x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (2.17x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   1780.2 MB/s (1.25x)          2.2 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  63.5 GB/s (1.00x)            12.7 GB/s (4.99x)
memmem/subtitles/never/teeny-ru-john-watson                 2.4 GB/s (1.23x)             3.0 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  59.9 GB/s (1.00x)            41.1 GB/s (1.46x)
memmem/subtitles/never/teeny-zh-john-watson                 1970.9 MB/s (1.25x)          2.4 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               63.5 GB/s (1.00x)            41.6 GB/s (1.53x)
memmem/subtitles/rare/huge-en-sherlock                      61.3 GB/s (1.00x)            43.0 GB/s (1.42x)
memmem/subtitles/rare/huge-en-medium-needle                 55.9 GB/s (1.00x)            25.7 GB/s (2.17x)
memmem/subtitles/rare/huge-en-long-needle                   44.0 GB/s (1.00x)            25.9 GB/s (1.70x)
memmem/subtitles/rare/huge-en-huge-needle                   45.7 GB/s (1.00x)            29.3 GB/s (1.56x)
memmem/subtitles/rare/teeny-en-sherlock                     1213.8 MB/s (1.37x)          1668.9 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.7 GB/s (1.00x)            15.2 GB/s (2.67x)
memmem/subtitles/rare/teeny-ru-sherlock                     1602.2 MB/s (1.56x)          2.4 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               55.5 GB/s (1.00x)            26.6 GB/s (2.09x)
memmem/subtitles/rare/huge-zh-sherlock                      59.4 GB/s (1.00x)            42.4 GB/s (1.40x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1055.9 MB/s (1.87x)          1970.9 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1137.1 MB/s (1.86x)          2.1 GB/s (1.00x)
```

Differences between this crate's substring search implementation and
`memmem` as provided by GNU libc. Showing only measurements with a 2x
difference or greater.

```
$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e libc/memmem/oneshot -e rust/memchr/memmem/oneshot -t 2
benchmark                                         libc/memmem/oneshot  rust/memchr/memmem/oneshot
---------                                         -------------------  --------------------------
memmem/code/rust-library-never-fn-strength        11.4 GB/s (4.75x)    54.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  12.4 GB/s (4.36x)    54.0 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux            8.1 GB/s (6.91x)     55.8 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         15.0 GB/s (3.59x)    53.8 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty       12.5 GB/s (4.16x)    51.9 GB/s (1.00x)
memmem/code/rust-library-common-fn                2.2 GB/s (5.89x)     13.0 GB/s (1.00x)
memmem/code/rust-library-common-let               3.2 GB/s (2.65x)     8.5 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     17.8 GB/s (3.56x)    63.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match      718.0 MB/s (1.00x)   289.1 MB/s (2.48x)
memmem/pathological/rare-repeated-small-match     707.1 MB/s (1.00x)   303.1 MB/s (2.33x)
memmem/subtitles/common/huge-en-that              3.7 GB/s (4.22x)     15.7 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space         1543.9 MB/s (1.00x)  541.6 MB/s (2.85x)
memmem/subtitles/common/huge-ru-that              2.7 GB/s (4.22x)     11.6 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not               2.0 GB/s (2.47x)     5.0 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space         2.9 GB/s (1.00x)     1081.0 MB/s (2.71x)
memmem/subtitles/common/huge-zh-that              4.2 GB/s (3.20x)     13.4 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not            2.6 GB/s (2.40x)     6.3 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space         5.7 GB/s (1.00x)     2.4 GB/s (2.38x)
memmem/subtitles/never/huge-en-john-watson        15.4 GB/s (4.12x)    63.3 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   11.9 GB/s (4.41x)    52.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes    11.0 GB/s (5.77x)    63.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space          2.3 GB/s (27.77x)    63.5 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson        5.2 GB/s (11.56x)    59.9 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson        20.7 GB/s (2.86x)    59.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes     17.0 GB/s (3.71x)    63.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            11.8 GB/s (5.18x)    60.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle         19.3 GB/s (2.02x)    38.9 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes     6.5 GB/s (9.47x)     61.5 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock            3.8 GB/s (16.23x)    61.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            10.8 GB/s (5.48x)    59.1 GB/s (1.00x)
```

Differences with the [`bytecount`](https://crates.io/crates/bytecount)
crate as `memchr_iter(needle, haystack).count()` is now specialized
to its own vector implementation just for counting the number
of matches (instead of reporting the offset of each match). The
throughput improvements as compared to `bytecount` on large haystacks
are most interesting IMO. (I was somewhat surprised by this, as
`bytecount` seems to do something clever while `memchr_iter(needle,
haystack).count()` is basically just `memchr` but with the branching
for reporting matches removed.) Either way, I expect this to translate
directly to improvements in ripgrep, although I haven't measured that
yet.

```
$ rebar cmp benchmarks/record/x86_64/2023-08-26.csv -e '^rust/bytecount/memchr/oneshot$' -e '^rust/memchr/memchr/onlycount$'
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       28.5 GB/s (1.94x)              55.3 GB/s (1.00x)
memchr/sherlock/common/small1      17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/common/tiny1       4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/huge1        28.4 GB/s (2.09x)              59.3 GB/s (1.00x)
memchr/sherlock/never/small1       17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/never/tiny1        4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/empty1       11.00ns (1.00x)                11.00ns (1.00x)
memchr/sherlock/rare/huge1         28.5 GB/s (1.94x)              55.2 GB/s (1.00x)
memchr/sherlock/rare/small1        17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/rare/tiny1         4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/uncommon/huge1     26.9 GB/s (2.20x)              59.3 GB/s (1.00x)
memchr/sherlock/uncommon/small1    17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/verycommon/huge1   28.4 GB/s (2.09x)              59.3 GB/s (1.00x)
memchr/sherlock/verycommon/small1  17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
```
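
As a rough illustration of the difference between reporting offsets and only counting, here is a scalar sketch. It is not the vector implementation from this PR, and `count_via_offsets` and `count_only` are hypothetical names. The first function stops at every match to produce an offset, while the second never materializes an offset and has no per-match branch.

```rust
/// Counts matches by repeatedly searching for the next offset, the way a
/// caller would if only an offset-reporting search were available.
fn count_via_offsets(needle: u8, haystack: &[u8]) -> usize {
    let mut count = 0;
    let mut start = 0;
    while let Some(i) = haystack[start..].iter().position(|&b| b == needle) {
        count += 1;
        // Resume the search just past the match that was found.
        start += i + 1;
    }
    count
}

/// Counts matches in a single pass, adding 0 or 1 for every byte without
/// stopping at each match.
fn count_only(needle: u8, haystack: &[u8]) -> usize {
    haystack.iter().map(|&b| usize::from(b == needle)).sum()
}
```

Both functions return the same count for any input; the gap between them grows as matches become more frequent, since the first one restarts its search once per match.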

Differences across the board from the status quo. Note that here, I've
only included measurements with a 4x difference from the old memchr
crate. Otherwise, pretty much every benchmark has a pretty sizeable
improvement from the old version. (Because previously, `aarch64` had no
vector implementations at all.)

```
$ rebar diff tmp/old-aarch64.csv tmp/new-aarch64.csv -t 4 -E oneshot
benchmark                                         engine                       tmp/old-aarch64.csv   tmp/new-aarch64.csv
---------                                         ------                       -------------------   -------------------
memchr/sherlock/never/huge2                       rust/memchr/memchr2          10.8 GB/s (4.27x)     46.3 GB/s (1.00x)
memchr/sherlock/never/small1                      rust/memchr/memchr/prebuilt  15.1 GB/s (41.00x)    618.4 GB/s (1.00x)
memchr/sherlock/never/small1                      rust/memchr/memrchr          14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/never/small2                      rust/memchr/memchr2          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small2                      rust/memchr/memrchr2         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small3                      rust/memchr/memchr3          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/never/small3                      rust/memchr/memrchr3         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/rare/small1                       rust/memchr/memchr/prebuilt  14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/rare/small1                       rust/memchr/memrchr          14.7 GB/s (42.00x)    618.4 GB/s (1.00x)
memchr/sherlock/rare/small2                       rust/memchr/memchr2          7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/rare/small2                       rust/memchr/memrchr2         7.5 GB/s (83.00x)     618.4 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1                    rust/memchr/memchr/prebuilt  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1                    rust/memchr/memrchr          1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength        rust/memchr/memmem/prebuilt  7.1 GB/s (4.17x)      29.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren  rust/memchr/memmem/prebuilt  6.9 GB/s (4.19x)      29.0 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str         rust/memchr/memmem/prebuilt  6.5 GB/s (4.42x)      28.7 GB/s (1.00x)
memmem/code/rust-library-common-fn                rust/memchr/memmem/prebuilt  3.2 GB/s (5.58x)      18.0 GB/s (1.00x)
memmem/code/rust-library-common-let               rust/memchr/memmem/prebuilt  2012.9 MB/s (6.45x)   12.7 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash              rust/memchr/memmem/prebuilt  1070.2 MB/s (24.69x)  25.8 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash            rust/memchr/memmem/prebuilt  1148.2 MB/s (22.85x)  25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky     rust/memchr/memmem/prebuilt  1299.3 MB/s (23.87x)  30.3 GB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky    rust/memchr/memmem/prebuilt  1146.0 MB/s (19.83x)  22.2 GB/s (1.00x)
memmem/sliceslice/seemingly-random                rust/memchr/memmem/prebuilt  1485.7 KB/s (4.13x)   6.0 MB/s (1.00x)
memmem/sliceslice/i386                            rust/memchr/memmem/prebuilt  6.0 MB/s (5.07x)      30.3 MB/s (1.00x)
memmem/subtitles/common/huge-en-that              rust/memchr/memmem/prebuilt  1418.2 MB/s (11.50x)  15.9 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that              rust/memchr/memmem/prebuilt  1389.1 MB/s (13.44x)  18.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not               rust/memchr/memmem/prebuilt  1482.7 MB/s (7.06x)   10.2 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes   rust/memchr/memmem/prebuilt  1813.7 MB/s (12.81x)  22.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space          rust/memchr/memmem/prebuilt  1370.2 MB/s (25.23x)  33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space         rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock            rust/memchr/memmem/prebuilt  7.0 GB/s (4.40x)      30.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle       rust/memchr/memmem/prebuilt  6.4 GB/s (4.43x)      28.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle         rust/memchr/memmem/prebuilt  7.1 GB/s (4.64x)      32.8 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock           rust/memchr/memmem/prebuilt  651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    rust/memchr/memmem/prebuilt  953.7 MB/s (42.00x)   39.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock           rust/memchr/memmem/prebuilt  976.9 MB/s (41.00x)   39.1 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes     rust/memchr/memmem/prebuilt  4.1 GB/s (7.06x)      28.8 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock            rust/memchr/memmem/prebuilt  6.1 GB/s (4.81x)      29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    rust/memchr/memmem/prebuilt  721.1 MB/s (41.00x)   28.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock           rust/memchr/memmem/prebuilt  721.1 MB/s (41.00x)   28.9 GB/s (1.00x)
```

A comparison with the
[`sliceslice`](https://crates.io/crates/sliceslice) crate, which has
its own custom `aarch64` vector implementation of substring search. We
only show measurements with 1.2x or greater difference.

```
$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e sliceslice/memmem/prebuilt -e rust/memchr/memmem/prebuilt -t 1.2
benchmark                                                   rust/memchr/memmem/prebuilt  rust/sliceslice/memmem/prebuilt
---------                                                   ---------------------------  -------------------------------
memmem/byterank/binary                                      3.1 GB/s (1.00x)             1586.4 MB/s (2.01x)
memmem/code/rust-library-never-fn-strength                  29.6 GB/s (1.00x)            16.1 GB/s (1.84x)
memmem/code/rust-library-never-fn-strength-paren            29.0 GB/s (1.00x)            15.6 GB/s (1.86x)
memmem/code/rust-library-never-fn-quux                      30.2 GB/s (1.00x)            15.1 GB/s (2.00x)
memmem/code/rust-library-rare-fn-from-str                   28.7 GB/s (1.93x)            55.5 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        25.8 GB/s (1.00x)            13.6 GB/s (1.89x)
memmem/pathological/md5-huge-last-hash                      25.6 GB/s (1.00x)            13.5 GB/s (1.90x)
memmem/pathological/rare-repeated-huge-tricky               30.3 GB/s (1.00x)            16.6 GB/s (1.83x)
memmem/pathological/rare-repeated-small-tricky              22.2 GB/s (1.00x)            11.2 GB/s (1.98x)
memmem/pathological/defeat-simple-vector-alphabet           3.0 GB/s (1.00x)             1114.1 MB/s (2.77x)
memmem/pathological/defeat-simple-vector-freq-alphabet      14.8 GB/s (1.00x)            2.2 GB/s (6.72x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  835.1 MB/s (1.00x)           173.8 MB/s (4.80x)
memmem/sliceslice/short                                     7.33ms (1.00x)               36.55ms (4.99x)
memmem/sliceslice/seemingly-random                          6.0 MB/s (1.00x)             3.6 MB/s (1.67x)
memmem/sliceslice/i386                                      30.3 MB/s (1.00x)            15.1 MB/s (2.00x)
memmem/subtitles/never/huge-en-john-watson                  30.9 GB/s (1.00x)            16.6 GB/s (1.86x)
memmem/subtitles/never/huge-en-all-common-bytes             22.7 GB/s (1.00x)            13.8 GB/s (1.64x)
memmem/subtitles/never/huge-en-some-rare-bytes              30.9 GB/s (1.00x)            16.6 GB/s (1.86x)
memmem/subtitles/never/huge-en-two-space                    33.8 GB/s (1.00x)            16.6 GB/s (2.03x)
memmem/subtitles/never/huge-ru-john-watson                  30.3 GB/s (1.00x)            7.1 GB/s (4.25x)
memmem/subtitles/never/huge-zh-john-watson                  29.2 GB/s (1.00x)            16.0 GB/s (1.83x)
memmem/subtitles/rare/huge-en-sherlock-holmes               30.3 GB/s (1.00x)            16.3 GB/s (1.86x)
memmem/subtitles/rare/huge-en-sherlock                      30.6 GB/s (1.00x)            16.6 GB/s (1.85x)
memmem/subtitles/rare/huge-en-medium-needle                 28.3 GB/s (1.00x)            12.4 GB/s (2.28x)
memmem/subtitles/rare/huge-en-long-needle                   32.8 GB/s (1.00x)            15.7 GB/s (2.08x)
memmem/subtitles/rare/huge-en-huge-needle                   32.9 GB/s (1.00x)            16.1 GB/s (2.05x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               30.3 GB/s (1.00x)            8.0 GB/s (3.80x)
memmem/subtitles/rare/huge-ru-sherlock                      30.2 GB/s (1.00x)            10.1 GB/s (3.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               28.8 GB/s (1.00x)            14.7 GB/s (1.95x)
memmem/subtitles/rare/huge-zh-sherlock                      29.6 GB/s (1.00x)            14.0 GB/s (2.12x)
```

Differences between this crate's substring search implementation and
`memmem` as provided by macOS's libc. We only show measurements
with a 2x or greater difference. This is what utter destruction
looks like. (I'm not sure what's going on in benchmarks like
`memmem/subtitles/rare/teeny-en-sherlock-holmes`. It's a tiny haystack
and each measurement on macOS seems to come out as either 1ns or 41ns.
I wonder if there's something odd about time precision on macOS? You
can see the reverse happen in `memmem/subtitles/rare/teeny-zh-sherlock`.)

```
$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e libc/memmem/oneshot -e rust/memchr/memmem/oneshot -t 2
benchmark                                                   libc/memmem/oneshot   rust/memchr/memmem/oneshot
---------                                                   -------------------   --------------------------
memmem/byterank/binary                                      626.1 MB/s (5.11x)    3.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  1320.8 MB/s (22.98x)  29.6 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            1320.8 MB/s (22.49x)  29.0 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      1332.0 MB/s (23.25x)  30.2 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   1442.0 MB/s (20.37x)  28.7 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 1320.8 MB/s (22.02x)  28.4 GB/s (1.00x)
memmem/code/rust-library-common-fn                          1320.8 MB/s (11.44x)  14.8 GB/s (1.00x)
memmem/code/rust-library-common-let                         1114.7 MB/s (8.59x)   9.4 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        994.0 MB/s (26.39x)   25.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      994.3 MB/s (26.39x)   25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               1670.8 MB/s (18.56x)  30.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1353.0 MB/s (1.00x)   378.5 MB/s (3.57x)
memmem/pathological/rare-repeated-small-tricky              1637.4 MB/s (13.88x)  22.2 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1348.3 MB/s (1.00x)   394.5 MB/s (3.42x)
memmem/pathological/defeat-simple-vector-alphabet           568.1 MB/s (5.43x)    3.0 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      1027.2 MB/s (14.55x)  14.6 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  173.8 MB/s (4.80x)    834.2 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        841.6 MB/s (13.19x)   10.8 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         1161.7 MB/s (4.00x)   4.5 GB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        590.9 MB/s (19.48x)   11.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         334.3 MB/s (18.62x)   6.1 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        1340.1 MB/s (11.49x)  15.0 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      858.5 MB/s (9.15x)    7.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  1648.3 MB/s (19.14x)  30.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             1075.4 MB/s (21.65x)  22.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              1655.7 MB/s (19.10x)  30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    541.6 MB/s (63.83x)   33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   651.3 MB/s (41.00x)   26.1 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  427.0 MB/s (72.56x)   30.3 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  1155.4 MB/s (25.81x)  29.1 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               1577.4 MB/s (19.60x)  30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      1577.4 MB/s (19.78x)  30.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 1155.6 MB/s (24.95x)  28.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   1488.8 MB/s (20.77x)  30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   1609.5 MB/s (17.27x)  27.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              26.1 GB/s (1.00x)     651.3 MB/s (41.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               427.0 MB/s (72.41x)   30.2 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      348.2 MB/s (91.21x)   31.0 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               955.8 MB/s (31.66x)   29.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      853.4 MB/s (35.46x)   29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              28.9 GB/s (1.00x)     721.1 MB/s (41.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     721.1 MB/s (41.00x)   28.9 GB/s (1.00x)
```
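The 1ns versus 41ns observation can be checked against the reported throughputs with a little arithmetic. The following is a minimal sketch, not part of rebar: the function name `throughput` is hypothetical, the 28-byte haystack length is an assumption inferred from the figures (it is not stated above), and the units are assumed to be binary (1 GB = 2^30 bytes, 1 MB = 2^20 bytes).

```rust
/// Throughput in bytes per second for `len` bytes searched in `nanos` nanoseconds.
/// (Hypothetical helper for illustration only.)
fn throughput(len: u64, nanos: u64) -> f64 {
    len as f64 * 1e9 / nanos as f64
}

// Assumption: the tables use binary units.
const MIB: f64 = 1024.0 * 1024.0;
const GIB: f64 = 1024.0 * MIB;

fn main() {
    // Assumed haystack length, inferred from the reported figures.
    let len = 28;
    let fast = throughput(len, 1); // a measurement that rounds to 1ns
    let slow = throughput(len, 41); // a measurement that rounds to 41ns

    println!("{:.1} GB/s", fast / GIB);
    println!("{:.1} MB/s", slow / MIB);
    println!("{:.2}x", fast / slow);
}
```

Under these assumptions the sketch reproduces the 26.1 GB/s, 651.3 MB/s and 41.00x figures from the `teeny-en` rows, which suggests those rows reflect the granularity of the timer rather than a real 41x difference between implementations.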

Differences with the [`bytecount`](https://crates.io/crates/bytecount)
crate, since `memchr_iter(needle, haystack).count()` is now specialized
to its own vector implementation that only counts the number of matches
(instead of reporting the offset of each match).

```
$ rebar cmp benchmarks/record/aarch64/2023-08-26.csv -e '^rust/bytecount/memchr/oneshot$' -e '^rust/memchr/memchr/onlycount$'
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/common/small1      618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/common/tiny1       64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/never/huge1        29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/never/small1       618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/never/tiny1        64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/never/empty1       1.00ns (1.00x)                 1.00ns (1.00x)
memchr/sherlock/rare/huge1         29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/rare/small1        618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/rare/tiny1         64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     29.5 GB/s (1.40x)              41.4 GB/s (1.00x)
memchr/sherlock/uncommon/small1    618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     64.3 GB/s (1.00x)              64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   28.7 GB/s (1.44x)              41.4 GB/s (1.00x)
memchr/sherlock/verycommon/small1  618.4 GB/s (1.00x)             618.4 GB/s (1.00x)
```
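To illustrate why counting can be its own routine, here is a scalar sketch using only the standard library. It is not the crate's vector implementation, and the function names `offsets` and `count` are hypothetical. It only shows that a count-only search never needs to compute or store the position of each match.

```rust
/// Reports the offset of every occurrence of `needle` in `haystack`.
/// (Scalar stand-in for an offset-reporting iterator.)
fn offsets(needle: u8, haystack: &[u8]) -> Vec<usize> {
    haystack
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}

/// Counts occurrences of `needle` without tracking where they are.
fn count(needle: u8, haystack: &[u8]) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

fn main() {
    let haystack = b"banana";
    // Both approaches agree on the number of matches; only one pays for offsets.
    assert_eq!(offsets(b'a', haystack).len(), count(b'a', haystack));
    println!("offsets: {:?}", offsets(b'a', haystack));
    println!("count:   {}", count(b'a', haystack));
}
```

With the `memchr` crate itself, the equivalent call is the one named above: `memchr_iter(needle, haystack).count()`.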
@BurntSushi BurntSushi merged commit 93662e7 into master Aug 28, 2023
16 checks passed
@BurntSushi BurntSushi deleted the ag/refactor branch August 28, 2023 15:37

llogiq commented Aug 28, 2023

Cool! Before replacing bytecount with `memchr_iter(..).count()`, please note that bytecount doesn't yet have any aarch64 optimization, so it's using usize-width code, no NEON intrinsics for now. I'm having a hard time testing and benchmarking this stuff (I think my current draft PR fails on endianness or some such), as the only ARM CPU I have is in my phone, but I'll get an M2 on Friday, so stay tuned.

osiewicz added a commit to zed-industries/zed that referenced this pull request Aug 28, 2023
Fresh off the press, memchr 2.6.0 adds vector search routines for aarch64. That directly improves our search performance for both text and regex searches.
Per BurntSushi's claims, the simple string searches in ripgrep got ~2 times faster (more details available in BurntSushi/memchr#129).

@BurntSushi
Owner Author

@llogiq Note that it's not just for aarch64. There are improvements on big haystacks for x86_64 as well.

@BurntSushi
Owner Author

@llogiq If you want to run memchr's benchmarks with a specific focus on `bytecount`, this should do the trick (from the root of this repo). The `--test` flag just runs a single iteration of each benchmark and is a good way to ensure everything is set up properly before collecting measurements.

```
$ cargo install --git https://github.com/BurntSushi/rebar rebar

$ rebar build -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$'
rust/bytecount/memchr/oneshot: running: cd "benchmarks/./engines/rust-bytecount" && "cargo" "build" "--release"
rust/bytecount/memchr/oneshot: build complete for version 0.5.3
rust/memchr/memchr/onlycount: running: cd "benchmarks/./engines/rust-memchr" && "cargo" "build" "--release"
rust/memchr/memchr/onlycount: build complete for version 2.6.0

$ rebar measure -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$' --test
[... snip ...]

$ rebar measure -e '^rust/memchr/memchr/onlycount$' -e '^rust/bytecount/memchr/oneshot$' | tee results.csv
[... snip ...]

$ rebar cmp results.csv
benchmark                          rust/bytecount/memchr/oneshot  rust/memchr/memchr/onlycount
---------                          -----------------------------  ----------------------------
memchr/sherlock/common/huge1       28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/common/small1      17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/common/tiny1       4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/huge1        28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/never/small1       17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/never/tiny1        4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/never/empty1       11.00ns (1.00x)                11.00ns (1.00x)
memchr/sherlock/rare/huge1         28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/rare/small1        17.7 GB/s (1.21x)              21.3 GB/s (1.00x)
memchr/sherlock/rare/tiny1         4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/uncommon/huge1     28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/uncommon/small1    17.7 GB/s (1.25x)              22.1 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     4.3 GB/s (1.00x)               3.8 GB/s (1.13x)
memchr/sherlock/verycommon/huge1   28.5 GB/s (2.08x)              59.3 GB/s (1.00x)
memchr/sherlock/verycommon/small1  17.2 GB/s (1.29x)              22.1 GB/s (1.00x)
```

BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request Aug 29, 2023
This in particular brings in a PR[1] that provides huge speedups on
aarch64 (e.g., Apple silicon).

[1]: BurntSushi/memchr#129
tomtau added a commit to tomtau/pest that referenced this pull request Aug 29, 2023
memchr now supports aarch64: BurntSushi/memchr#129
cargo lib (not-bootstrap-in-src) did not respect umask: GHSA-j3xp-wfr4-hx87
(probably not an issue here)
BurntSushi added a commit to rust-lang/regex that referenced this pull request Aug 29, 2023
This bumps the minimum memchr version to 2.6, which brings in
massive improvements to aarch64 for single substring search. We also can
now enable the new `alloc` feature in `memchr` when `alloc` is enabled
for `regex` and `regex-automata`.

We also squash some warnings.

[1]: BurntSushi/memchr#129