Vectorize basic_string::rfind (the single character overload) #5087

Merged
merged 6 commits into from
Nov 19, 2024


AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Nov 14, 2024

Towards #5036

⏱️ Benchmark results

| Benchmark | main | this PR | comment |
|---|---|---|---|
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 37.2 ns | 45.8 ns | random variation |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 3.86 ns | 3.30 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 6.91 ns | 8.26 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 6.05 ns | 6.15 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 2.76 ns | 3.05 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindSized>/8021/3056` | 44.8 ns | 46.6 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindSized>/63/62` | 3.14 ns | 3.27 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindSized>/31/30` | 7.56 ns | 7.86 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindSized>/15/14` | 5.96 ns | 6.39 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindSized>/7/6` | 3.04 ns | 2.85 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindUnsized>/8021/3056` | 80.5 ns | 80.5 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindUnsized>/63/62` | 3.90 ns | 3.90 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindUnsized>/31/30` | 3.31 ns | 3.27 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindUnsized>/15/14` | 3.04 ns | 3.14 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::FindUnsized>/7/6` | 2.29 ns | 2.42 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindUnsized>/8021/3056` | 77.3 ns | 69.6 ns | random variation |
| `bm<uint8_t, highly_aligned_allocator, Op::FindUnsized>/63/62` | 2.11 ns | 2.13 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindUnsized>/31/30` | 1.53 ns | 1.54 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindUnsized>/15/14` | 1.29 ns | 1.30 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::FindUnsized>/7/6` | 1.29 ns | 1.29 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::Count>/8021/3056` | 87.7 ns | 90.4 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::Count>/63/62` | 4.50 ns | 4.52 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::Count>/31/30` | 7.49 ns | 7.98 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::Count>/15/14` | 6.20 ns | 5.20 ns | |
| `bm<uint8_t, not_highly_aligned_allocator, Op::Count>/7/6` | 3.57 ns | 3.29 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::Count>/8021/3056` | 75.1 ns | 75.4 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::Count>/63/62` | 4.48 ns | 4.49 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::Count>/31/30` | 7.60 ns | 7.90 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::Count>/15/14` | 5.89 ns | 5.17 ns | |
| `bm<uint8_t, highly_aligned_allocator, Op::Count>/7/6` | 4.01 ns | 3.31 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringFind>/8021/3056` | 81.0 ns | 79.8 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringFind>/63/62` | 14.1 ns | 15.2 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringFind>/31/30` | 13.3 ns | 12.7 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringFind>/15/14` | 4.95 ns | 5.03 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringFind>/7/6` | 2.71 ns | 2.71 ns | |
| `bm<char, highly_aligned_allocator, Op::StringFind>/8021/3056` | 71.6 ns | 70.5 ns | |
| `bm<char, highly_aligned_allocator, Op::StringFind>/63/62` | 13.0 ns | 12.9 ns | |
| `bm<char, highly_aligned_allocator, Op::StringFind>/31/30` | 12.5 ns | 12.1 ns | |
| `bm<char, highly_aligned_allocator, Op::StringFind>/15/14` | 4.85 ns | 4.86 ns | |
| `bm<char, highly_aligned_allocator, Op::StringFind>/7/6` | 2.72 ns | 2.72 ns | |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 733 ns | 43.4 ns | vectorized |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 17.1 ns | 3.77 ns | vectorized |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 9.13 ns | 5.81 ns | vectorized |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 6.18 ns | 5.07 ns | vectorized |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 3.00 ns | 3.10 ns | vectorized |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/8021/3056` | 731 ns | 45.1 ns | vectorized |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/63/62` | 16.7 ns | 3.78 ns | vectorized |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/31/30` | 9.15 ns | 7.25 ns | vectorized |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/15/14` | 6.15 ns | 6.53 ns | vectorized |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/7/6` | 3.11 ns | 3.14 ns | vectorized |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 81.5 ns | 82.5 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 3.19 ns | 3.28 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 2.67 ns | 2.70 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 3.05 ns | 3.09 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 2.58 ns | 2.59 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::Count>/8021/3056` | 164 ns | 164 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::Count>/63/62` | 4.71 ns | 4.72 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::Count>/31/30` | 4.48 ns | 4.49 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::Count>/15/14` | 4.69 ns | 4.69 ns | |
| `bm<uint16_t, not_highly_aligned_allocator, Op::Count>/7/6` | 3.52 ns | 3.54 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056` | 731 ns | 734 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/63/62` | 17.3 ns | 17.4 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/31/30` | 9.35 ns | 9.32 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/15/14` | 6.69 ns | 5.86 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/7/6` | 2.69 ns | 2.75 ns | |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 732 ns | 82.6 ns | vectorized |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 16.7 ns | 4.01 ns | vectorized |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 8.73 ns | 3.42 ns | vectorized |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 4.41 ns | 3.76 ns | vectorized |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 2.47 ns | 3.05 ns | vectorized |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 154 ns | 152 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 3.99 ns | 3.97 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 2.85 ns | 3.00 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 2.63 ns | 2.59 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 2.56 ns | 2.38 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::Count>/8021/3056` | 324 ns | 321 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::Count>/63/62` | 4.79 ns | 4.70 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::Count>/31/30` | 4.23 ns | 4.22 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::Count>/15/14` | 3.85 ns | 3.83 ns | |
| `bm<uint32_t, not_highly_aligned_allocator, Op::Count>/7/6` | 3.75 ns | 3.77 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056` | 729 ns | 733 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/63/62` | 16.9 ns | 30.2 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/31/30` | 9.46 ns | 15.7 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/15/14` | 6.76 ns | 4.46 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/7/6` | 2.88 ns | 2.37 ns | |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 731 ns | 155 ns | vectorized |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 16.6 ns | 5.46 ns | vectorized |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 8.93 ns | 3.89 ns | vectorized |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 5.07 ns | 3.51 ns | vectorized |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 2.58 ns | 3.05 ns | vectorized |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 288 ns | 287 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 6.73 ns | 6.80 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 3.97 ns | 3.99 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 2.84 ns | 2.82 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 2.65 ns | 2.64 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::Count>/8021/3056` | 922 ns | 930 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::Count>/63/62` | 6.41 ns | 9.30 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::Count>/31/30` | 4.19 ns | 4.17 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::Count>/15/14` | 3.51 ns | 3.54 ns | |
| `bm<uint64_t, not_highly_aligned_allocator, Op::Count>/7/6` | 3.16 ns | 3.20 ns | |

🥇 Results observation

  • The StringRFind cases improve greatly
  • The StringFind cases may need improvement
  • FindUnsized shows some surprisingly good results with the highly aligned allocator
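
For readers wondering what "vectorized" means for the StringRFind rows, here is a minimal sketch of a backward single-character search over 8-bit elements. It is not the code from this PR: `rfind_char_sketch` and `SKETCH_HAS_SSE2` are hypothetical names, the sketch assumes SSE2 where available (the real implementation may use wider vectors and different tail handling), and it falls back to a plain scalar loop elsewhere.

```cpp
#include <cstddef>
#include <string>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper, not an STL internal: returns the index of the last
// occurrence of `ch` in [data, data + size), or std::string::npos if absent.
inline std::size_t rfind_char_sketch(const char* data, std::size_t size, char ch) {
    std::size_t pos = size; // one past the last element not yet examined
#if SKETCH_HAS_SSE2
    const __m128i needle = _mm_set1_epi8(ch);
    while (pos >= 16) {
        // Unaligned load of the 16 bytes ending at `pos`; stays inside the range.
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + pos - 16));
        const unsigned mask =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)));
        if (mask != 0) {
            // Bit i of the mask corresponds to data[pos - 16 + i], so the
            // highest set bit marks the last match within this chunk.
            unsigned bit = 15;
            while ((mask & (1u << bit)) == 0) {
                --bit;
            }
            return pos - 16 + bit;
        }
        pos -= 16;
    }
#endif
    // Scalar tail for the remaining front elements (and the full fallback
    // on targets without SSE2).
    while (pos != 0) {
        --pos;
        if (data[pos] == ch) {
            return pos;
        }
    }
    return std::string::npos;
}
```

The sketch only ever loads bytes inside the given range, which is why it does not depend on allocator alignment the way the FindUnsized results below do.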

♾️ FindUnsized results explanation

TL;DR: The results are interesting, but unfortunately not useful.

FindUnsized with highly_aligned_allocator shows surprisingly small timings. Apparently memchr contains an optimization, similar to the reverted unsized find vectorization, that reads beyond the valid range.

It looks like it only reads past the end of the valid range and does not do an aligning read before the start of it, so it requires some alignment for the optimization to fully engage. The required alignment is 16 bytes, which implies SSE inside rather than AVX. It should work well with the default 16-byte malloc alignment.

This doesn't seem to work as well for small ranges. StringFind currently uses memchr, and it doesn't show results that good.
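
One plausible reason alignment matters for reads past the end (an assumption on my part, not something measured or stated in this PR) is that a 16-byte block starting at a 16-byte-aligned address can never straddle a page boundary, so such a load cannot fault if its first byte is readable. The sketch below checks that arithmetic; `sketch_page_size`, `sketch_block_size`, and `sketch_block_stays_in_one_page` are hypothetical names, and 4 KiB pages are assumed.

```cpp
#include <cstdint>

constexpr std::uintptr_t sketch_page_size  = 4096; // assumption: 4 KiB pages
constexpr std::uintptr_t sketch_block_size = 16;   // one SSE register

// True when the 16 bytes starting at `addr` all lie within the same page.
constexpr bool sketch_block_stays_in_one_page(std::uintptr_t addr) {
    return addr / sketch_page_size
        == (addr + sketch_block_size - 1) / sketch_page_size;
}

// Aligned blocks never cross a page; an unaligned one near the page end does.
static_assert(sketch_block_stays_in_one_page(sketch_page_size - 16));
static_assert(!sketch_block_stays_in_one_page(sketch_page_size - 15));
```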

🔜 Further steps

I want to raise the question of whether we also want to vectorize basic_string::find. Here are some points:

  • For 8-bit elements, FindSized is about twice as fast as StringFind. It looks like memchr doesn't use AVX, so we could stop calling memchr and use our vectorization instead. The counterpoint is that the C runtime should be optimized instead.
  • For 16-bit elements, which are rarer, the gain would be more significant, because wmemchr is slow. However, wmemchr is expected to improve in a new Windows Kit.
  • For bigger characters, there are no C runtime functions, so vectorization would give a huge benefit, but such characters are rarer still.
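
To make the 8-bit comparison concrete, here is a small sketch of the two routes a forward single-character search could take: calling the C runtime's memchr, or going through std::find (which an implementation is free to vectorize itself). The function names are hypothetical, and the mapping of the benchmark labels to these exact calls is my assumption rather than something taken from the benchmark source.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Route 1: delegate to the C runtime (what StringFind is said to use today).
inline std::size_t find_via_memchr_sketch(const std::string& s, char ch) {
    const void* const p =
        std::memchr(s.data(), static_cast<unsigned char>(ch), s.size());
    return p ? static_cast<std::size_t>(static_cast<const char*>(p) - s.data())
             : std::string::npos;
}

// Route 2: use the standard algorithm, so the library's own implementation
// (scalar or vectorized) does the search.
inline std::size_t find_via_std_find_sketch(const std::string& s, char ch) {
    const auto it = std::find(s.begin(), s.end(), ch);
    return it == s.end() ? std::string::npos
                         : static_cast<std::size_t>(it - s.begin());
}
```

Both routes return the same position; the question raised above is only which one is faster for a given element width.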

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner November 14, 2024 22:33
@StephanTLavavej StephanTLavavej added the performance Must go faster label Nov 14, 2024
@StephanTLavavej StephanTLavavej self-assigned this Nov 14, 2024
@StephanTLavavej
Member

5950X results:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 656 ns | 49.1 ns | 13.36 |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 16.5 ns | 5.33 ns | 3.10 |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 9.69 ns | 8.72 ns | 1.11 |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 5.50 ns | 8.25 ns | 0.67 |
| `bm<char, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 3.01 ns | 5.68 ns | 0.53 |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/8021/3056` | 652 ns | 48.9 ns | 13.33 |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/63/62` | 16.4 ns | 5.29 ns | 3.10 |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/31/30` | 9.65 ns | 8.51 ns | 1.13 |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/15/14` | 5.49 ns | 8.33 ns | 0.66 |
| `bm<char, highly_aligned_allocator, Op::StringRFind>/7/6` | 3.04 ns | 5.68 ns | 0.54 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 688 ns | 87.0 ns | 7.91 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 16.2 ns | 5.75 ns | 2.82 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 9.23 ns | 5.32 ns | 1.73 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 5.48 ns | 7.88 ns | 0.70 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 2.93 ns | 7.62 ns | 0.38 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/8021/3056` | 663 ns | 172 ns | 3.85 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/63/62` | 15.7 ns | 6.61 ns | 2.38 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/31/30` | 8.94 ns | 5.14 ns | 1.74 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/15/14` | 5.39 ns | 4.69 ns | 1.15 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringRFind>/7/6` | 2.92 ns | 4.89 ns | 0.60 |

Looks good; the only slowdowns are where the character is found immediately and the call is super fast anyway.
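
For anyone reading the Speedup column: the figures are consistent with Before divided by After, so values above 1 are improvements and values below 1 are slowdowns. A tiny sketch (the function name is hypothetical) that reproduces two of the rows:

```cpp
// Speedup as the ratio of the old time to the new time, both in nanoseconds.
constexpr double speedup_sketch(double before_ns, double after_ns) {
    return before_ns / after_ns;
}

// 656 ns -> 49.1 ns is roughly a 13.36x speedup (first row of the table).
static_assert(speedup_sketch(656.0, 49.1) > 13.355 && speedup_sketch(656.0, 49.1) < 13.365);
// 3.01 ns -> 5.68 ns is roughly 0.53, i.e. a slowdown on a tiny input.
static_assert(speedup_sketch(3.01, 5.68) > 0.525 && speedup_sketch(3.01, 5.68) < 0.535);
```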

@StephanTLavavej StephanTLavavej removed their assignment Nov 19, 2024
@StephanTLavavej StephanTLavavej self-assigned this Nov 19, 2024
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 1711bc3 into microsoft:main Nov 19, 2024
39 checks passed
@StephanTLavavej
Member

'!', 's', 'k', 'n', 'a', 'h', 'T' 😹 ⏪ 🤪

@AlexGuteniev AlexGuteniev deleted the strrchr branch November 19, 2024 09:51