Suboptimal codegen of mixed bit width vector copying #4992

Open
AlexGuteniev opened this issue Sep 30, 2024 · 2 comments
Labels: compiler (Compiler work involved), performance (Must go faster)

AlexGuteniev (Contributor) commented Sep 30, 2024

Summary

I observed that assigning or copying vector integer elements via an STL algorithm with a change of bit width does not engage vectorization, whereas a manually written index-based loop is vectorized in all reasonable conversion cases.

The question is what to do about it.

While it may not be worth pursuing optimization of every algorithm where input and output element sizes differ, plain assignment/copying is common and probably deserves optimization.
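For concreteness, here is a condensed sketch of the two forms being compared, using uint32_t to uint8_t as one example pair (the function names are illustrative; the full benchmark appears below):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// STL algorithm form: MSVC does not vectorize this when element widths differ.
void copy_alg(const std::vector<std::uint32_t>& src, std::vector<std::uint8_t>& dst) {
    std::copy(src.begin(), src.end(), dst.begin());
}

// Index-based form: MSVC vectorizes this for all reasonable conversions.
void copy_raw(const std::vector<std::uint32_t>& src, std::vector<std::uint8_t>& dst) {
    for (std::size_t i = 0, n = src.size(); i != n; ++i) {
        dst[i] = static_cast<std::uint8_t>(src[i]);
    }
}
```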

Benchmark results overview

The following cases are tested:

  • vector::assign using a pair of iterators (assign)
  • std::copy between vector iterators (copy alg)
  • a for loop copying vector elements index by index (copy raw)

They are tested on x64 with the default architecture option, with /d2archSSE42, and with /arch:AVX2.

The observations are as follows:

  • assign is vectorized only for equal-sized elements, via memcpy
  • copy alg is vectorized for the memcpy cases, and also in some other cases
  • copy raw is vectorized for all reasonable conversion cases, but fails for the memcpy cases

Benchmark results

Bold marked results that are noticeably better than the scalar baseline, or noticeably better than the previous arch level (for SSE42 and AVX2).
assign does not vary between arch levels.

Unsigned:

| To | From | Count | assign | alg SSE2 | alg SSE42 | alg AVX2 | raw SSE2 | raw SSE42 | raw AVX2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uint8_t | uint8_t | 2024 | 16.9 ns | 24.3 ns | 24.0 ns | 23.6 ns | 649 ns | 632 ns | 607 ns |
| uint8_t | uint16_t | 2024 | 586 ns | 490 ns | 462 ns | 461 ns | 70.2 ns | 71.4 ns | 55.2 ns |
| uint8_t | uint32_t | 2024 | 508 ns | 458 ns | 475 ns | 473 ns | 252 ns | 254 ns | 105 ns |
| uint8_t | uint64_t | 2024 | 495 ns | 471 ns | 452 ns | 452 ns | 731 ns | 739 ns | 196 ns |
| uint16_t | uint8_t | 2024 | 504 ns | 63.7 ns | 58.2 ns | 50.4 ns | 63.2 ns | 59.7 ns | 47.3 ns |
| uint16_t | uint16_t | 2024 | 24.5 ns | 37.3 ns | 37.5 ns | 38.2 ns | 635 ns | 632 ns | 630 ns |
| uint16_t | uint32_t | 2024 | 501 ns | 499 ns | 453 ns | 466 ns | 199 ns | 200 ns | 87.7 ns |
| uint16_t | uint64_t | 2024 | 506 ns | 496 ns | 460 ns | 455 ns | 259 ns | 257 ns | 204 ns |
| uint32_t | uint8_t | 2024 | 500 ns | 152 ns | 114 ns | 92.7 ns | 155 ns | 118 ns | 92.9 ns |
| uint32_t | uint16_t | 2024 | 507 ns | 130 ns | 118 ns | 91.6 ns | 133 ns | 115 ns | 92.2 ns |
| uint32_t | uint32_t | 2024 | 43.0 ns | 70.6 ns | 71.7 ns | 71.0 ns | 628 ns | 636 ns | 641 ns |
| uint32_t | uint64_t | 2024 | 507 ns | 461 ns | 541 ns | 535 ns | 204 ns | 201 ns | 170 ns |
| uint64_t | uint8_t | 2024 | 508 ns | 725 ns | 503 ns | 501 ns | 663 ns | 668 ns | 672 ns |
| uint64_t | uint16_t | 2024 | 499 ns | 302 ns | 219 ns | 186 ns | 306 ns | 221 ns | 183 ns |
| uint64_t | uint32_t | 2024 | 501 ns | 250 ns | 229 ns | 185 ns | 248 ns | 207 ns | 183 ns |
| uint64_t | uint64_t | 2024 | 110 ns | 168 ns | 133 ns | 132 ns | 665 ns | 751 ns | 681 ns |

Signed:

| To | From | Count | assign | alg SSE2 | alg SSE42 | alg AVX2 | raw SSE2 | raw SSE42 | raw AVX2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| int8_t | int8_t | 2024 | 16.7 ns | 24.1 ns | 24.3 ns | 23.9 ns | 613 ns | 618 ns | 608 ns |
| int8_t | int16_t | 2024 | 503 ns | 512 ns | 452 ns | 448 ns | 70.1 ns | 70.1 ns | 56.8 ns |
| int8_t | int32_t | 2024 | 499 ns | 500 ns | 471 ns | 474 ns | 254 ns | 249 ns | 105 ns |
| int8_t | int64_t | 2024 | 500 ns | 522 ns | 453 ns | 447 ns | 738 ns | 730 ns | 195 ns |
| int16_t | int8_t | 2024 | 486 ns | 67.8 ns | 58.7 ns | 64.0 ns | 70.8 ns | 58.0 ns | 47.7 ns |
| int16_t | int16_t | 2024 | 24.0 ns | 37.5 ns | 38.2 ns | 37.1 ns | 633 ns | 628 ns | 635 ns |
| int16_t | int32_t | 2024 | 494 ns | 467 ns | 456 ns | 451 ns | 198 ns | 197 ns | 87.6 ns |
| int16_t | int64_t | 2024 | 496 ns | 497 ns | 460 ns | 456 ns | 258 ns | 259 ns | 202 ns |
| int32_t | int8_t | 2024 | 493 ns | 164 ns | 122 ns | 92.1 ns | 166 ns | 111 ns | 93.6 ns |
| int32_t | int16_t | 2024 | 492 ns | 140 ns | 115 ns | 92.3 ns | 143 ns | 112 ns | 92.2 ns |
| int32_t | int32_t | 2024 | 42.4 ns | 73.0 ns | 71.4 ns | 70.3 ns | 628 ns | 630 ns | 637 ns |
| int32_t | int64_t | 2024 | 493 ns | 478 ns | 478 ns | 464 ns | 201 ns | 208 ns | 171 ns |
| int64_t | int8_t | 2024 | 502 ns | 501 ns | 509 ns | 504 ns | 661 ns | 665 ns | 672 ns |
| int64_t | int16_t | 2024 | 487 ns | 224 ns | 219 ns | 183 ns | 220 ns | 228 ns | 183 ns |
| int64_t | int32_t | 2024 | 490 ns | 220 ns | 211 ns | 182 ns | 217 ns | 227 ns | 185 ns |
| int64_t | int64_t | 2024 | 108 ns | 168 ns | 133 ns | 132 ns | 675 ns | 747 ns | 669 ns |
Benchmark program

```cpp
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#pragma warning(disable : 4244) // conversion from 'From' to 'To', possible loss of data

#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using namespace std;

template <class To, class From, From Val = 0x5A>
void vector_assign(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b;
    b.reserve(a.size());

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        b.assign(a.begin(), a.end());
        benchmark::DoNotOptimize(b);
        b.clear();
    }
}

template <class To, class From, From Val = 0x5A>
void vector_copy_alg(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b(size);

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        copy(a.begin(), a.end(), b.begin());
        benchmark::DoNotOptimize(b);
        fill(b.begin(), b.end(), 0);
    }
}

template <class To, class From, From Val = 0x5A>
void vector_copy_raw(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    std::vector<From> a(size, Val);
    std::vector<To> b(size);

    for (auto _ : state) {
        benchmark::DoNotOptimize(a);
        for (size_t i = 0, m = a.size(); i != m; ++i) {
            b[i] = a[i];
        }
        benchmark::DoNotOptimize(b);
        fill(b.begin(), b.end(), 0);
    }
}


void common_args(auto bm) {
    bm->Arg(2024);
}

BENCHMARK(vector_assign<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_assign<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_assign<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_assign<int64_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_alg<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_alg<int64_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint8_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint8_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint16_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint16_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint32_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint32_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<uint64_t, uint8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<uint64_t, uint64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int8_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int8_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int16_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int16_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int32_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int32_t, int64_t>)->Apply(common_args);

BENCHMARK(vector_copy_raw<int64_t, int8_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int16_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int32_t>)->Apply(common_args);
BENCHMARK(vector_copy_raw<int64_t, int64_t>)->Apply(common_args);


BENCHMARK_MAIN();
```

Explanation

_Uninitialized_copy[_meow], used in vector::assign, and _Copy_meow, used in std::copy, use metaprogramming to call memmove/memcpy where possible, but otherwise run simple loops. The compiler is somehow confused by these loops and doesn't always vectorize them.
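In shape, the dispatch looks roughly like this (a simplified sketch, not the actual <xmemory> internals; copy_sketch is a hypothetical name):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Simplified sketch of the dispatch: bit-identical trivially copyable
// elements go through memmove; everything else takes a plain loop, which
// is the loop MSVC doesn't always vectorize.
template <class From, class To>
To* copy_sketch(const From* first, const From* last, To* dest) {
    if constexpr (std::is_same_v<From, To> && std::is_trivially_copyable_v<To>) {
        const std::size_t count = static_cast<std::size_t>(last - first);
        std::memmove(dest, first, count * sizeof(To));
        return dest + count;
    } else {
        for (; first != last; ++first, ++dest) {
            *dest = static_cast<To>(*first);
        }
        return dest;
    }
}
```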

Possible solutions

The following makes sense to me:

  • Report missed-optimization bugs
  • Rewrite these loops so that the compiler can optimize them (see the sketch below)

I don't think manually vectorizing every conversion is a good idea, as there are too many of them, though the advantage of that would be runtime CPU detection.
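A sketch of the second option, assuming contiguous storage (the helper name is hypothetical): rewrite the fallback as a counted raw-pointer loop, the same shape as copy raw above, which the benchmark shows MSVC does vectorize.

```cpp
#include <cstddef>

// Hypothetical rewrite of the fallback loop into the form MSVC's
// auto-vectorizer handles: raw pointers indexed by a counted loop.
template <class From, class To>
To* copy_convert_vectorizable(const From* first, const From* last, To* dest) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    for (std::size_t i = 0; i != count; ++i) {
        dest[i] = static_cast<To>(first[i]);
    }
    return dest + count;
}
```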

AlexGuteniev added the question (Further information is requested) label on Sep 30, 2024
StephanTLavavej added the performance (Must go faster) and compiler (Compiler work involved) labels and removed the question (Further information is requested) label on Oct 2, 2024
StephanTLavavej (Member) commented

We talked about this at the weekly maintainer meeting and we agree:

  • Missed optimizations should be reported to the compiler back-end team
  • Following the usual pattern to detect when auto-vectorization would be possible, and adding STL code that's auto-vectorization friendly (e.g. raw pointers with indices instead of iterator loops) isn't too burdensome, and we can typically respond faster than compiler optimizations can be enhanced
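A minimal sketch of what such a compile-time gate might look like, assuming C++20 (the trait name is hypothetical, and the actual STL checks are more involved):

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical compile-time gate: take the vectorization-friendly raw-pointer
// path only for contiguous iterators over integral element types.
template <class FromIt, class ToIt>
inline constexpr bool can_use_raw_copy_path =
    std::contiguous_iterator<FromIt> && std::contiguous_iterator<ToIt>
    && std::is_integral_v<std::iter_value_t<FromIt>>
    && std::is_integral_v<std::iter_value_t<ToIt>>;
```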

AlexGuteniev (Contributor, Author) commented
I've reported the compiler issues:

There isn't much hope for the compiler to improve: while reporting one of the issues, a similar existing problem was found, DevCom-1262302, and it is Closed - Lower Priority.
