Suboptimal codegen of mixed bit width vector copying #4992

Open
AlexGuteniev opened this issue Sep 30, 2024 · 2 comments
Labels: compiler (Compiler work involved), performance (Must go faster)

AlexGuteniev (Contributor) commented Sep 30, 2024

Summary

I observed that assigning or copying vector integer elements via an STL algorithm with a change of bit width does not engage vectorization, whereas a manually written index-based loop is vectorized in all reasonable conversion cases.

The question is what to do about it.

While it may not be worth pursuing optimization of every algorithm where input and output element sizes differ, plain assignment/copying is common and probably deserves optimization.
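For concreteness, here is a condensed sketch of the two forms being compared, using uint32_t to uint8_t as one example pair (the function names are illustrative; the full benchmark appears below):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// STL algorithm form: MSVC does not vectorize this when element widths differ.
void copy_alg(const std::vector<std::uint32_t>& src, std::vector<std::uint8_t>& dst) {
    std::copy(src.begin(), src.end(), dst.begin());
}

// Index-based form: MSVC vectorizes this for all reasonable conversions.
void copy_raw(const std::vector<std::uint32_t>& src, std::vector<std::uint8_t>& dst) {
    for (std::size_t i = 0, n = src.size(); i != n; ++i) {
        dst[i] = static_cast<std::uint8_t>(src[i]);
    }
}
```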

Benchmark results overview

The following cases are tested:

  • vector::assign using a pair of iterators (assign)
  • std::copy between vector iterators (copy alg)
  • a for loop copying vector elements index by index (copy raw)

They are tested on x64 with the default architecture option, with /d2archSSE42, and with /arch:AVX2.

The observations are as follows:

  • assign is vectorized only for equal-sized elements, via memcpy
  • copy alg is vectorized for the memcpy cases, and also in some other cases
  • copy raw is vectorized for all reasonable conversion cases, but fails for the memcpy cases

Benchmark results

Bold marked results that are noticeably better than the scalar baseline, or noticeably better than the previous arch level (for SSE42 and AVX2).
assign does not vary between arch levels.

Unsigned:

| To | From | Count | assign | alg SSE2 | alg SSE42 | alg AVX2 | raw SSE2 | raw SSE42 | raw AVX2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uint8_t | uint8_t | 2024 | 16.9 ns | 24.3 ns | 24.0 ns | 23.6 ns | 649 ns | 632 ns | 607 ns |
| uint8_t | uint16_t | 2024 | 586 ns | 490 ns | 462 ns | 461 ns | 70.2 ns | 71.4 ns | 55.2 ns |
| uint8_t | uint32_t | 2024 | 508 ns | 458 ns | 475 ns | 473 ns | 252 ns | 254 ns | 105 ns |
| uint8_t | uint64_t | 2024 | 495 ns | 471 ns | 452 ns | 452 ns | 731 ns | 739 ns | 196 ns |
| uint16_t | uint8_t | 2024 | 504 ns | 63.7 ns | 58.2 ns | 50.4 ns | 63.2 ns | 59.7 ns | 47.3 ns |
| uint16_t | uint16_t | 2024 | 24.5 ns | 37.3 ns | 37.5 ns | 38.2 ns | 635 ns | 632 ns | 630 ns |
| uint16_t | uint32_t | 2024 | 501 ns | 499 ns | 453 ns | 466 ns | 199 ns | 200 ns | 87.7 ns |
| uint16_t | uint64_t | 2024 | 506 ns | 496 ns | 460 ns | 455 ns | 259 ns | 257 ns | 204 ns |
| uint32_t | uint8_t | 2024 | 500 ns | 152 ns | 114 ns | 92.7 ns | 155 ns | 118 ns | 92.9 ns |
| uint32_t | uint16_t | 2024 | 507 ns | 130 ns | 118 ns | 91.6 ns | 133 ns | 115 ns | 92.2 ns |
| uint32_t | uint32_t | 2024 | 43.0 ns | 70.6 ns | 71.7 ns | 71.0 ns | 628 ns | 636 ns | 641 ns |
| uint32_t | uint64_t | 2024 | 507 ns | 461 ns | 541 ns | 535 ns | 204 ns | 201 ns | 170 ns |
| uint64_t | uint8_t | 2024 | 508 ns | 725 ns | 503 ns | 501 ns | 663 ns | 668 ns | 672 ns |
| uint64_t | uint16_t | 2024 | 499 ns | 302 ns | 219 ns | 186 ns | 306 ns | 221 ns | 183 ns |
| uint64_t | uint32_t | 2024 | 501 ns | 250 ns | 229 ns | 185 ns | 248 ns | 207 ns | 183 ns |
| uint64_t | uint64_t | 2024 | 110 ns | 168 ns | 133 ns | 132 ns | 665 ns | 751 ns | 681 ns |

Signed:

| To | From | Count | assign | alg SSE2 | alg SSE42 | alg AVX2 | raw SSE2 | raw SSE42 | raw AVX2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| int8_t | int8_t | 2024 | 16.7 ns | 24.1 ns | 24.3 ns | 23.9 ns | 613 ns | 618 ns | 608 ns |
| int8_t | int16_t | 2024 | 503 ns | 512 ns | 452 ns | 448 ns | 70.1 ns | 70.1 ns | 56.8 ns |
| int8_t | int32_t | 2024 | 499 ns | 500 ns | 471 ns | 474 ns | 254 ns | 249 ns | 105 ns |
| int8_t | int64_t | 2024 | 500 ns | 522 ns | 453 ns | 447 ns | 738 ns | 730 ns | 195 ns |
| int16_t | int8_t | 2024 | 486 ns | 67.8 ns | 58.7 ns | 64.0 ns | 70.8 ns | 58.0 ns | 47.7 ns |
| int16_t | int16_t | 2024 | 24.0 ns | 37.5 ns | 38.2 ns | 37.1 ns | 633 ns | 628 ns | 635 ns |
| int16_t | int32_t | 2024 | 494 ns | 467 ns | 456 ns | 451 ns | 198 ns | 197 ns | 87.6 ns |
| int16_t | int64_t | 2024 | 496 ns | 497 ns | 460 ns | 456 ns | 258 ns | 259 ns | 202 ns |
| int32_t | int8_t | 2024 | 493 ns | 164 ns | 122 ns | 92.1 ns | 166 ns | 111 ns | 93.6 ns |
| int32_t | int16_t | 2024 | 492 ns | 140 ns | 115 ns | 92.3 ns | 143 ns | 112 ns | 92.2 ns |
| int32_t | int32_t | 2024 | 42.4 ns | 73.0 ns | 71.4 ns | 70.3 ns | 628 ns | 630 ns | 637 ns |
| int32_t | int64_t | 2024 | 493 ns | 478 ns | 478 ns | 464 ns | 201 ns | 208 ns | 171 ns |
| int64_t | int8_t | 2024 | 502 ns | 501 ns | 509 ns | 504 ns | 661 ns | 665 ns | 672 ns |
| int64_t | int16_t | 2024 | 487 ns | 224 ns | 219 ns | 183 ns | 220 ns | 228 ns | 183 ns |
| int64_t | int32_t | 2024 | 490 ns | 220 ns | 211 ns | 182 ns | 217 ns | 227 ns | 185 ns |
| int64_t | int64_t | 2024 | 108 ns | 168 ns | 133 ns | 132 ns | 675 ns | 747 ns | 669 ns |
Benchmark program

```cpp
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#pragma warning(disable : 4244) // conversion from 'From' to 'To', possible loss of data

#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using namespace std;

template <class To, class From, From Val = 0x5A>
void vector_assign(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b;
    b.reserve(a.size());

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        b.assign(a.begin(), a.end());
        benchmark::DoNotOptimize(b);
        b.clear();
    }
}

template <class To, class From, From Val = 0x5A>
void vector_copy_alg(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b(size);

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        copy(a.begin(), a.end(), b.begin());
        benchmark::DoNotOptimize(b);
        fill(b.begin(), b.end(), 0);
    }
}

template <class To, class From, From Val = 0x5A>
void vector_copy_raw(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b(size);

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        for (size_t i = 0, m = a.size(); i != m; ++i) {
            b[i] = a[i];
        }
        benchmark::DoNotOptimize(b);
        fill(b.begin(), b.end(), 0);
    }
}


void common_args(auto bm) {
    bm->Arg(2024);
}

BENCHMARK(vector_assign<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int64_t>)->Apply(common_args);


BENCHMARK_MAIN();
```

Explanation

_Uninitialized_copy[_meow], used in vector::assign, and _Copy_meow, used in std::copy, use metaprogramming to call memmove/memcpy where possible, but otherwise run simple loops. The compiler is somehow confused by these loops and doesn't always vectorize them.
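In shape, the dispatch looks roughly like this (a simplified sketch, not the actual <xmemory> internals; copy_sketch is a hypothetical name):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Simplified sketch of the dispatch: bit-identical trivially copyable
// elements go through memmove; everything else takes a plain loop, which
// is the loop MSVC doesn't always vectorize.
template <class From, class To>
To* copy_sketch(const From* first, const From* last, To* dest) {
    if constexpr (std::is_same_v<From, To> && std::is_trivially_copyable_v<To>) {
        const std::size_t count = static_cast<std::size_t>(last - first);
        std::memmove(dest, first, count * sizeof(To));
        return dest + count;
    } else {
        for (; first != last; ++first, ++dest) {
            *dest = static_cast<To>(*first);
        }
        return dest;
    }
}
```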

Possible solutions

The following makes sense to me:

  • Report missed-optimization bugs
  • Rewrite these loops so that the compiler can optimize them (see the sketch below)

I don't think manually vectorizing every conversion is a good idea, as there are too many of them, though the advantage of that would be runtime CPU detection.
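A sketch of the second option, assuming contiguous storage (the helper name is hypothetical): rewrite the fallback as a counted raw-pointer loop, the same shape as copy raw above, which the benchmark shows MSVC does vectorize.

```cpp
#include <cstddef>

// Hypothetical rewrite of the fallback loop into the form MSVC's
// auto-vectorizer handles: raw pointers indexed by a counted loop.
template <class From, class To>
To* copy_convert_vectorizable(const From* first, const From* last, To* dest) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    for (std::size_t i = 0; i != count; ++i) {
        dest[i] = static_cast<To>(first[i]);
    }
    return dest + count;
}
```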

AlexGuteniev added the question (Further information is requested) label on Sep 30, 2024
StephanTLavavej added the performance (Must go faster) and compiler (Compiler work involved) labels and removed the question (Further information is requested) label on Oct 2, 2024
StephanTLavavej (Member) commented

We talked about this at the weekly maintainer meeting and we agree:

  • Missed optimizations should be reported to the compiler back-end team
  • Following the usual pattern to detect when auto-vectorization would be possible, and adding STL code that's auto-vectorization friendly (e.g. raw pointers with indices instead of iterator loops) isn't too burdensome, and we can typically respond faster than compiler optimizations can be enhanced
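A minimal sketch of what such a compile-time gate might look like, assuming C++20 (the trait name is hypothetical, and the actual STL checks are more involved):

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical compile-time gate: take the vectorization-friendly raw-pointer
// path only for contiguous iterators over integral element types.
template <class FromIt, class ToIt>
inline constexpr bool can_use_raw_copy_path =
    std::contiguous_iterator<FromIt> && std::contiguous_iterator<ToIt>
    && std::is_integral_v<std::iter_value_t<FromIt>>
    && std::is_integral_v<std::iter_value_t<ToIt>>;
```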

AlexGuteniev (Contributor, Author) commented
I've reported the compiler issues:

There isn't much hope for the compiler to improve: while reporting one of the issues, a similar existing problem was found, DevCom-1262302, and it is Closed - Lower Priority.
