I observed that assigning or copying integer vector elements via an STL algorithm with a change of bit width does not engage vectorization, whereas a manually written index-based loop is vectorized in all reasonable conversion cases.
The question is what to do about it.
While it may not be worth pursuing optimization of every algorithm where the input and output element sizes differ, plain assignment/copying is common and probably deserves optimization.
Benchmark results overview
The following cases are tested:
vector::assign using pair of iterators (assign)
std::copy of vector iterators (copy alg)
for loop copying element-wise by vector index (copy raw)
They are tested on x64 with the default architecture option, and also with /d2archSSE42 and with /arch:AVX2.
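For reference, the three tested cases can be sketched roughly as below. This is not the benchmark program from the issue; the function names `copy_assign`, `copy_alg`, and `copy_raw` are hypothetical labels chosen to match the case names above, and the element types are left as template parameters so that any width change can be tried.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Case "assign": vector::assign with a pair of iterators.
template <class Dst, class Src>
void copy_assign(std::vector<Dst>& dst, const std::vector<Src>& src) {
    dst.assign(src.begin(), src.end());
}

// Case "copy alg": std::copy over vector iterators.
template <class Dst, class Src>
void copy_alg(std::vector<Dst>& dst, const std::vector<Src>& src) {
    assert(dst.size() == src.size()); // destination must already be sized
    std::copy(src.begin(), src.end(), dst.begin());
}

// Case "copy raw": index-based element-wise loop.
template <class Dst, class Src>
void copy_raw(std::vector<Dst>& dst, const std::vector<Src>& src) {
    assert(dst.size() == src.size()); // destination must already be sized
    for (std::size_t i = 0; i != src.size(); ++i) {
        dst[i] = static_cast<Dst>(src[i]);
    }
}
```

All three produce the same destination contents; what differs is the code the compiler generates, which has to be checked in the disassembly or with a benchmark.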
The findings are as follows:
assign is vectorized only for equal-sized elements, via memcpy
copy alg is vectorized for memcpy cases, and also in some other cases
copy raw is vectorized for all reasonable conversion cases, but fails for memcpy cases
Benchmark results
Bold means noticeably better than nothing or noticeably better than previous arch level for SSE42 and AVX2.
assign does not vary between arch levels.
Explanation
_Uninitialized_copy[_meow], used in vector::assign, and _Copy_meow, used in std::copy, use metaprogramming to call memmove/memcpy, but otherwise have simple loops. The compiler is somehow confused by these loops and doesn't always vectorize them.
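The dispatch described above can be illustrated with a simplified sketch. This is not the actual STL source: `copy_sketch` is a hypothetical helper, and the condition used here (same type and trivially copyable) is an assumed stand-in for the real, more elaborate trait checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical simplified helper; the real STL internals differ.
template <class Src, class Dst>
Dst* copy_sketch(const Src* first, const Src* last, Dst* dest) {
    assert(first <= last); // precondition: valid range
    if constexpr (std::is_same_v<Src, Dst> && std::is_trivially_copyable_v<Src>) {
        // Equal-sized, bitwise-copyable elements: hand the whole range to memmove.
        const std::size_t count = static_cast<std::size_t>(last - first);
        if (count != 0) {
            std::memmove(dest, first, count * sizeof(Src));
        }
        return dest + count;
    } else {
        // Any other case, including a change of integer width: a plain loop
        // that the compiler may or may not auto-vectorize.
        for (; first != last; ++first, ++dest) {
            *dest = static_cast<Dst>(*first);
        }
        return dest;
    }
}
```

With this shape, a conversion such as `unsigned char` to `unsigned int` always takes the loop branch, so its performance depends entirely on whether the compiler vectorizes that loop.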
Possible solutions
The following make sense to me:
Report missed optimization bugs
Rewrite these loops so that the compiler can optimize them
I don't think manually vectorizing every conversion is a good idea, as there are too many of them, though the advantage would be runtime CPU detection.
We talked about this at the weekly maintainer meeting and we agree:
Missed optimizations should be reported to the compiler back-end team
Following the usual pattern to detect when auto-vectorization would be possible, and adding STL code that's auto-vectorization friendly (e.g. raw pointers with indices instead of iterator loops) isn't too burdensome, and we can typically respond faster than compiler optimizations can be enhanced
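A minimal sketch of the "raw pointers with indices" shape mentioned above might look like this. The names `convert_n` and `convert_vector` are hypothetical and not part of the STL, and whether a given compiler vectorizes the loop still has to be verified for each conversion and architecture option.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts `count` elements through raw pointers and an index.
template <class Dst, class Src>
void convert_n(const Src* src, std::size_t count, Dst* dst) {
    assert((src != nullptr && dst != nullptr) || count == 0);
    for (std::size_t i = 0; i != count; ++i) {
        dst[i] = static_cast<Dst>(src[i]);
    }
}

// Hypothetical wrapper: unwrap the contiguous storage once, then run the indexed loop.
// Not applicable to std::vector<bool>, which has no data() member.
template <class Dst, class Src>
void convert_vector(const std::vector<Src>& src, std::vector<Dst>& dst) {
    dst.resize(src.size());
    convert_n(src.data(), src.size(), dst.data());
}
```

The idea is that the iterator-based interface stays unchanged for callers, while the inner loop is written in the form the issue reports as being vectorized for the hand-written case.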
There isn't much hope for the compiler to improve. When I reported one of the issues, a similar problem was found, DevCom-1262302, and it is Closed - Lower Priority.