major performance regression between Rust 1.50 and beta when using target-cpu=native #83027

Closed
BurntSushi opened this issue Mar 11, 2021 · 21 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

@BurntSushi
Member

BurntSushi commented Mar 11, 2021

I'll just start with some reproduction steps that I'm hoping someone else will be able to reproduce. This assumes you've compiled ripgrep with Rust 1.50 to a binary named rg-stable_1.50 and also compiled ripgrep with Rust nightly 2021-03-09 to a binary named rg-nightly_2021-03-09 (alternatively, compile with the beta release, as I've reproduced the problem there in a subsequent comment):

$ curl -LO 'https://burntsushi.net/stuff/subtitles2016-sample.en.gz'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  265M  100  265M    0     0  32.1M      0  0:00:08  0:00:08 --:--:-- 33.4M

$ gunzip subtitles2016-sample.en.gz

$ time rg-stable_1.50 -c --no-mmap -a '[a-z]' subtitles2016-sample.en
31813587

real    1.601
user    1.467
sys     0.133
maxmem  7 MB
faults  0

$ time rg-nightly_2021-03-09 -c --no-mmap -a '[a-z]' subtitles2016-sample.en
31813587

real    3.973
user    3.837
sys     0.133
maxmem  7 MB
faults  0

Here is the relevant part of the profile I extracted by running the ripgrep compiled with nightly under perf:

[screenshot: simd-funs-not-inlined]

The key difference between Rust nightly and stable is the fact that it looks like i8x32::new isn't being inlined. But it's not the only one. There are other functions showing up in the profile, like core::core_arch::x86::m256iExt::as_i32x8, that aren't being inlined either. These are trivial cast functions, and them not being inlined is likely a bug. (So an alternative title for this issue might be, "some trivial functions aren't getting inlined in hot code paths." But I figured I'd start with the actual problem I'm seeing in case my analysis is wrong.)

Initially I assumed that maybe something had changed in stdarch recently related to these code paths, but I don't see anything. So I'm a bit worried that perhaps something else changed that impacted inlining decisions, and this is an indirect effect. Alas, I'm stuck at this point and would love some help getting to the bottom of it.

It's possible, perhaps even likely, that this is related to #60637. I note that it is used to justify some inline(always) annotations, but fn new is left at just #[inline].

Perhaps there is a quick fix where we need to go over some of the lower level SIMD routines and make sure they're tagged with inline(always). But really, it seems to me like these functions really should be inlined automatically. I note that this doesn't look like a cross crate problem that might typically be a reason for preventing inlining. In particular, _mm256_setr_epi8 is being inlined (as one would expect), but the call to i8x32 in its implementation is the thing not being inlined. So this seems pretty suspicious to me.
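To make the distinction between the two annotations concrete, here is a minimal sketch. `I8x4` is a hypothetical stand-in for a stdarch-style wrapper type, not the real `i8x32`, and the method names are only illustrative. `#[inline]` is a hint that the optimizer is free to ignore, while `#[inline(always)]` is a much stronger request.

```rust
// Hypothetical stand-in for a stdarch-style vector wrapper (not the real i8x32).
#[derive(Clone, Copy, Debug, PartialEq)]
struct I8x4(i8, i8, i8, i8);

impl I8x4 {
    // `#[inline]` is only a hint: the optimizer may still emit a real call.
    #[inline]
    fn new(a: i8, b: i8, c: i8, d: i8) -> Self {
        I8x4(a, b, c, d)
    }

    // `#[inline(always)]` asks much more strongly for inlining at every call site.
    #[inline(always)]
    fn splat(x: i8) -> Self {
        I8x4(x, x, x, x)
    }
}

fn main() {
    let a = I8x4::new(1, 2, 3, 4);
    let b = I8x4::splat(7);
    println!("{:?} {:?}", a, b);
}
```

Whether either function actually ends up inlined can only be confirmed by looking at the generated code, for example with a profiler or a disassembler.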

Apologies for not narrowing this down more. A good next step might be to find the specific version of nightly that introduced this problem.

@jonas-schievink jonas-schievink added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Mar 11, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Mar 11, 2021
@BurntSushi
Member Author

Oh, also, I did try to find a smaller reproduction. Since the regression is ultimately rooted in the SIMD implementation found in the memchr crate, I tried compiling this program with stable vs Rust nightly:

use memchr::memchr;

fn main() {
    let haystack = "abcdefghijklmnopqrstuvwxyz".repeat(15);

    for _ in 0..100_000_000 {
        assert_eq!(None, memchr(b'@', haystack.as_bytes()));
    }
}

But both versions of the program inlined all the routines I would expect.
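For measuring a loop like this in-process rather than with `time`, a rough sketch follows. It is an assumption-laden stand-in: `find_byte` uses only the standard library in place of `memchr::memchr` so the example has no dependencies, and `time_searches` is a hypothetical helper name. It will not exercise the same SIMD code paths as the memchr crate.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Std-only stand-in for `memchr::memchr` (an assumption, to avoid a dependency).
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

// Runs `iters` searches and returns the elapsed wall-clock time.
fn time_searches(haystack: &[u8], needle: u8, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box discourages the optimizer from hoisting or deleting the search.
        let hit = find_byte(black_box(needle), black_box(haystack));
        black_box(hit);
    }
    start.elapsed()
}

fn main() {
    let haystack = "abcdefghijklmnopqrstuvwxyz".repeat(15);
    let elapsed = time_searches(haystack.as_bytes(), b'@', 1_000_000);
    println!("1_000_000 searches took {:?}", elapsed);
}
```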

@Mark-Simulacrum
Member

Can you check if beta (1.51) reproduces this regression? My immediate guess is that it's caused by the LLVM 12 upgrade, which landed in #81451. cc @rust-lang/wg-llvm

@BurntSushi
Member Author

Yes, I am able to reproduce on beta too:

$ time rg-beta_1.51 -c --no-mmap -a '[a-z]' subtitles2016-sample.en
31813587

real    3.921
user    3.802
sys     0.117
maxmem  7 MB
faults  0

@BurntSushi BurntSushi changed the title major performance regression between Rust 1.50 and nightly major performance regression between Rust 1.50 and beta Mar 11, 2021
@Mark-Simulacrum Mark-Simulacrum added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Mar 11, 2021
@Mark-Simulacrum Mark-Simulacrum added this to the 1.51.0 milestone Mar 11, 2021
@Mark-Simulacrum Mark-Simulacrum added the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Mar 11, 2021
@camelid camelid added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Mar 11, 2021
@camelid
Member

camelid commented Mar 11, 2021

Needs MCVE because OP said #83027 (comment) did not reproduce the bug.

@tmiasko
Contributor

tmiasko commented Mar 12, 2021

Could you describe full reproduction steps, including any custom options and features used when building ripgrep? Do you use target-cpu=native? What is your CPU, as shown by rustc --print target-cpus? Changes from #80749 could also be relevant.

I couldn't reproduce the issue.

@spastorino
Member

Also, are you compiling ripgrep master branch or something else?

@BurntSushi
Member Author

Ah!!! Thank you so much for mentioning RUSTFLAGS. My script for compiling ripgrep does indeed have target-cpu=native set. The TL;DR from below is that this appears necessary in order to witness the regression. (IMO, this makes this issue a bit lower in priority since compiling with target-cpu=native is a bit more rare.) The really good news is that I was able to come up with a much smaller reproduction. Although, not quite minimal. Read on.

Some preliminaries for checking my environment:

$ uname -a
Linux frink 5.11.4-arch1-1 #1 SMP PREEMPT Sun, 07 Mar 2021 18:00:49 +0000 x86_64 GNU/Linux
$ lscpu | rg Model
Model:                           79
Model name:                      Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
$ rustc --print target-cpus | rg native
    native         - Select the CPU of the current host (currently broadwell).
$ rustc +stable --version
rustc 1.50.0 (cb75ad5db 2021-02-10)
$ rustc +beta --version
rustc 1.51.0-beta.4 (4d25f4607 2021-03-05)
$ cd /tmp
$ git clone https://github.com/BurntSushi/ripgrep
$ cd ripgrep
$ git rev-parse HEAD
c7730d1f3a366e42fdd497a1e0db4bf090de415c

Compile four different binaries: stable, stable + target-cpu=native, beta, and beta + target-cpu=native. Only beta+native has the performance regression.

$ cargo clean && cargo +stable build --release && cp ./target/release/rg ./rg-stable
$ cargo clean && RUSTFLAGS="-C target-cpu=native" cargo +stable build --release && cp ./target/release/rg ./rg-stable-native
$ cargo clean && cargo +beta build --release && cp ./target/release/rg ./rg-beta
$ cargo clean && RUSTFLAGS="-C target-cpu=native" cargo +beta build --release && cp ./target/release/rg ./rg-beta-native

And to show that only beta+native has the issue (the curl command for getting the subtitles is in my OP):

$ time ./rg-stable -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    1.477
user    1.352
sys     0.123
maxmem  7 MB
faults  0

$ time ./rg-stable-native -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    1.568
user    1.417
sys     0.150
maxmem  7 MB
faults  0

$ time ./rg-beta -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    1.557
user    1.416
sys     0.140
maxmem  7 MB
faults  0

$ time ./rg-beta-native -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    3.916
user    3.807
sys     0.107
maxmem  7 MB
faults  0

So given the new focus on target-cpu=native, I tried my smaller program above that tried to reproduce this with the memchr crate directly, and that did work:

$ cat Cargo.toml
[package]
name = "memchr-perf-regression"
version = "0.1.0"
authors = ["Andrew Gallant <[email protected]>"]
edition = "2018"

[dependencies]
memchr = "2"

$ cat src/main.rs
use memchr::memchr;

fn main() {
    let haystack = "abcdefghijklmnopqrstuvwxyz".repeat(15);

    for _ in 0..100_000_000 {
        assert_eq!(None, memchr(b'@', haystack.as_bytes()));
    }
}

Now compile two binaries: one with beta and one with beta and target-cpu=native:

$ cargo +beta build --release && cp target/release/memchr-perf-regression ./regress-beta
$ RUSTFLAGS="-C target-cpu=native" cargo +beta build --release && cp target/release/memchr-perf-regression ./regress-beta-native

And now run them:

$ time ./regress-beta

real    0.676
user    0.672
sys     0.003
maxmem  7 MB
faults  0

$ time ./regress-beta-native

real    12.773
user    12.768
sys     0.000
maxmem  7 MB
faults  0

I've run perf on the latter command and attached a screenshot of the results. As with the bigger ripgrep example, neither core::core_arch::x86::m256iExt::as_i32x8 nor core::core_arch::simd::i8x32::new are inlined:

[screenshot: memchr-perf-regress-smaller-program]

@BurntSushi BurntSushi changed the title major performance regression between Rust 1.50 and beta major performance regression between Rust 1.50 and beta when using target-cpu=native Mar 12, 2021
@nagisa
Member

nagisa commented Mar 12, 2021

What is your native? Does it work if you specify that explicitly? I can guess that your regression might be caused because it is unsound to inline between functions that use different feature sets, and libstd/core will be using generic cpu when they are compiled in CI.

Possible cause: #80749

Does this go away with -Zbuild-std?

@BurntSushi
Member Author

@nagisa Broadwell:

$ lscpu | rg Model
Model:                           79
Model name:                      Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
$ rustc --print target-cpus | rg native
    native         - Select the CPU of the current host (currently broadwell).
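As a complement to the commands above, here is a small sketch for comparing which features were enabled at compile time (which is what `-C target-cpu` influences) with what the host CPU reports at run time. The function names are hypothetical and only three features are checked for brevity.

```rust
// Features enabled for this compilation, e.g. via -C target-cpu or -C target-feature.
fn compile_time_features() -> Vec<&'static str> {
    let mut on = Vec::new();
    if cfg!(target_feature = "sse2") { on.push("sse2"); }
    if cfg!(target_feature = "avx") { on.push("avx"); }
    if cfg!(target_feature = "avx2") { on.push("avx2"); }
    on
}

// Features the running CPU reports; empty on non-x86 targets.
fn runtime_features() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut on = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse2") { on.push("sse2"); }
        if is_x86_feature_detected!("avx") { on.push("avx"); }
        if is_x86_feature_detected!("avx2") { on.push("avx2"); }
    }
    on
}

fn main() {
    println!("compile time: {:?}", compile_time_features());
    println!("run time:     {:?}", runtime_features());
}
```

Building this once with and once without `RUSTFLAGS="-C target-cpu=native"` should show the compile-time list changing while the run-time list stays the same.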

I can guess that your regression might be caused because it is unsound to inline between functions that use different feature sets, and libstd/core will be using generic cpu when they are compiled in CI.

What is the best way to fix it?

Does this go away with -Zbuild-std?

I've never tried using -Zbuild-std before, so I'm not sure if I'm doing something wrong, but it doesn't seem to work:

$ RUSTFLAGS="-C target-cpu=native -Zbuild-std" cargo +nightly build --release && cp target/release/memchr-perf-regression ./regress-nightly_2021-03-10-native-buildstd
error: failed to run `rustc` to learn about target-specific information

Caused by:
  process didn't exit successfully: `rustc - --crate-name ___ --print=file-names -C target-cpu=native -Zbuild-std --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=cfg` (exit code: 1)
  --- stderr
  error: unknown debugging option: `build-std`

@nagisa
Member

nagisa commented Mar 12, 2021

-Zbuild-std is a cargo flag. Documentation available here.

What is the best way to fix it?

Perhaps using a -Ctarget-cpu=broadwell helps? If my hypothesis of the cause is correct, I'd say that this is something that people working on SIMD support need to figure out how to support ergonomically (maybe by adding #[inline(always)] to everything? not clear to me if that'd be sound, but it's the only thing I'm coming up with on the spot)

@BurntSushi
Member Author

Ah thanks for the link. I ran this:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly build -Zbuild-std --target x86_64-unknown-linux-gnu --release && cp target/x86_64-unknown-linux-gnu/release/memchr-perf-regression ./regress-nightly_2021-03-10-native-buildstd

But the regression remains:

$ time ./regress-nightly_2021-03-10

real    0.724
user    0.720
sys     0.003
maxmem  7 MB
faults  0

$ time ./regress-nightly_2021-03-10-native

real    12.158
user    12.150
sys     0.003
maxmem  7 MB
faults  0

$ time ./regress-nightly_2021-03-10-native-buildstd

real    12.271
user    12.263
sys     0.003
maxmem  7 MB
faults  0

Perhaps using a -Ctarget-cpu=broadwell helps?

I guess it helps in the strictest sense that it doesn't have the performance regression:

$ RUSTFLAGS="-C target-cpu=broadwell" cargo +nightly build --release && cp target/release/memchr-perf-regression ./regress-nightly_2021-03-10-broadwell
   Compiling memchr v2.3.4
   Compiling memchr-perf-regression v0.1.0 (/tmp/memchr-perf-regression)
    Finished release [optimized] target(s) in 1.11s

$ time ./regress-nightly_2021-03-10-broadwell

real    0.767
user    0.763
sys     0.003
maxmem  7 MB
faults  0

But I think what I meant was, "how do we not get a performance regression when using target-cpu=native"?

I'd say that this is something that people working on SIMD support need to figure out how to support ergonomically (maybe by adding #[inline(always)] to everything? not clear to me if that'd be sound, but it's the only thing I'm coming up with on the spot)

Hmmm... Okay. cc @Amanieu

Is there a more succinct/higher-level description of why #80749 is possibly the cause here? I guess what I mean to say is, what changed that stopped the inlining from happening here?

@nagisa
Member

nagisa commented Mar 12, 2021

Is there a more succinct/higher-level description of why #80749 is possibly the cause here? I guess what I mean to say is, what changed that stopped the inlining from happening here?

I'm happy to try and explain it here; I don't recall there being a good description of this elsewhere:

It is not valid for a function to be inlined into another if the feature sets differ between them. On x86_64 in particular this is exemplified by potentially differing ABIs and registers when a feature is available and when it isn't. As the features are tracked at a per-function level, LLVM is forced to disable inlining of such differing functions so that their features don't get lost. The linked PR specifies an exact list of features that shall be applied to all functions that don't specify anything otherwise, so I suspect conflicts in memchr code occur quite naturally when there's interaction between SIMD and regular code.
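To illustrate the kind of interaction being described, here is a hedged sketch of SIMD code calling "regular" code. `sum_lanes` carries no `#[target_feature]` attribute, so it is compiled with the crate-wide defaults, while `add_and_sum` enables AVX2 for itself only. All function names are hypothetical, and whether `sum_lanes` is inlined into `add_and_sum` is up to LLVM's compatibility check, not something this code controls.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Plain helper: no `#[target_feature]`, so it uses the crate-wide feature set.
#[inline]
fn sum_lanes(v: [i32; 8]) -> i32 {
    v.iter().sum()
}

// SIMD caller: AVX2 is enabled for this one function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_and_sum(a: [i32; 8], b: [i32; 8]) -> i32 {
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let vr = _mm256_add_epi32(va, vb);
        let mut out = [0i32; 8];
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, vr);
        // Call from a function with "+avx2" into one with only the defaults.
        sum_lanes(out)
    }
}

// Safe wrapper: uses the AVX2 path only when the CPU reports support.
fn add_and_sum_checked(a: [i32; 8], b: [i32; 8]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { add_and_sum(a, b) };
        }
    }
    // Scalar fallback with the same wrapping semantics as _mm256_add_epi32.
    let mut out = [0i32; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    sum_lanes(out)
}

fn main() {
    println!("{}", add_and_sum_checked([1; 8], [2; 8]));
}
```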

With that in mind I would've expected -Zbuild-std to help with this, because that way all the code in the binary is compiled with the same feature-set everywhere (again), but it seems like there's still something missing in the equation here and a more minimal example would still be very helpful.

@lqd
Member

lqd commented Mar 12, 2021

@BurntSushi I can reproduce on skylake, including with the following:

#[cfg(target_arch = "x86")]
use std::arch::x86::*;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

use std::intrinsics::transmute;

fn main() {
    #[target_feature(enable = "avx2")]
    unsafe fn test() {
        let a = _mm256_set_epi32(1, 1, 1, 1, 1, 1, 1, 1);
        let b = _mm256_set_epi32(2, 2, 2, 2, 2, 2, 2, 2);
        
        let e = _mm256_set_epi32(3, 3, 3, 3, 3, 3, 3, 3);
        let r = _mm256_add_epi32(a, b);

        assert_eq_m256i(e, r);
    }

    if is_x86_feature_detected!("avx2") {
        unsafe { test() }
    } else {
        panic!("avx2 feature not detected");
    }
}

#[target_feature(enable = "avx")]
pub unsafe fn assert_eq_m256i(a: __m256i, b: __m256i) {
    assert_eq!(transmute::<_, [u64; 4]>(a), transmute::<_, [u64; 4]>(b))
}

Building without -C target-cpu or with -C target-cpu=skylake will inline the as_i32x8 functions, but not with -C target-cpu=native for me:

$ objdump -d regress| grep as_i32x8
$ objdump -d regress-skylake| grep as_i32x8
$ objdump -d regress-native| grep as_i32x8
0000000000006960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>:
    6ade:       e8 7d fe ff ff          callq  6960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>
    6b0a:       e8 51 fe ff ff          callq  6960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>

@BurntSushi
Member Author

@lqd Thanks! Hopefully that helps dig into this a bit more.

@nagisa

It is not valid for a function to be inlined into another if the feature sets differ between them.

So just to be super precise, did you mean "differ" literally? As in, if I have a function compiled with just the sse2 feature but the caller is compiled with sse2,avx, then I would assume that said function could be inlined even though the feature sets are technically distinct.

I'm assuming that you mean, "if the caller's feature set is not a superset of the function, then the function cannot be inlined." If that assumption is wrong, then I think my mental model is broken.
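The mental model in question can be written down as a tiny sketch. This is a simplified model under stated assumptions, not LLVM's actual check: the feature bits are hypothetical constants, and the rule is just "every feature the callee was compiled with must also be enabled in the caller".

```rust
// Hypothetical feature bits, for illustration only.
const SSE2: u32 = 1 << 0;
const AVX: u32 = 1 << 1;
const AVX2: u32 = 1 << 2;

// Simplified model: the callee may be inlined when its feature set is a
// subset of the caller's feature set.
fn inline_compatible(caller: u32, callee: u32) -> bool {
    (caller & callee) == callee
}

fn main() {
    // Callee built with sse2 only, caller with sse2+avx: allowed under this model.
    println!("{}", inline_compatible(SSE2 | AVX, SSE2));
    // Callee needs avx2 but the caller lacks it: not allowed.
    println!("{}", inline_compatible(SSE2 | AVX, SSE2 | AVX2));
}
```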

The linked PR specifies an exact list of features that shall be applied to all functions that don't specify anything otherwise, so I suspect conflicts in memchr code occur quite naturally when there's interaction between SIMD and regular code.

Hmmm okay. So let me try to play this back to you in my own words to make sure I grok this. So let's pick a function that isn't getting inlined, say, as_i32x8. It has no target_feature attribute and is only #[inline]. Since it's part of std, it's compiled with the lowest common denominator on x86_64, so its actual features when I compile the repro above don't include avx. And that prevents inlining? But it seems like it should be allowed to be inlined because the calling code is a superset?

I think the key here is that functions like as_i32x8 are being used as a sort of internal platform independent vector type that isn't necessarily tied to AVX. (Although it has been a while since I've touched stdarch.) So they end up getting used in the implementation of AVX specific intrinsics, and we generally expect them to get inlined.

So I guess what I don't quite grok is what it is about target-cpu=native specifically that is preventing inlining here, whereas other settings work. And yeah, I also don't understand why build-std doesn't fix this, which I think makes me at least as confused as you. (But very likely more so.)

I think your point above about these sorts of functions being tagged with inline(always) as being unsound also, unfortunately, sounds right to me. Unless rustc can guarantee that a function tagged inline(always) won't get inlined into calling code that can't handle that function's ABI. I had thought rustc had some logic to handle that case that maybe @alexcrichton added, but my memory is super hazy. But in this case, it seems like inlining is safe and okay here, so that's why I'm confused.

@alexcrichton
Member

LLVM should inline based on subsets, not exact matches. If it's not then that's a bug.

I can't reproduce with -Ctarget-cpu=native myself unfortunately. @BurntSushi can you gist the LLVM IR for this file when the inlining doesn't happen? That will help illuminate why LLVM isn't inlining

#[cfg(target_arch = "x86")]
use std::arch::x86::*;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

extern "C" {
    fn black_box(a: *const u8);
}

pub fn foo() {
    #[target_feature(enable = "avx2")]
    unsafe fn test() {
        let a = _mm256_set_epi32(1, 1, 1, 1, 1, 1, 1, 1);
        let b = _mm256_set_epi32(2, 2, 2, 2, 2, 2, 2, 2);

        let e = _mm256_set_epi32(3, 3, 3, 3, 3, 3, 3, 3);
        let r = _mm256_add_epi32(a, b);

        assert_eq_m256i(e, r);
    }

    if is_x86_feature_detected!("avx2") {
        unsafe { test() }
    } else {
        loop {}
    }
}

#[target_feature(enable = "avx")]
pub unsafe fn assert_eq_m256i(a: __m256i, b: __m256i) {
    black_box(&a as *const _ as *const _);
    black_box(&b as *const _ as *const _);
}

@lqd
Member

lqd commented Mar 12, 2021

@nagisa
Member

nagisa commented Mar 12, 2021

I'm assuming that you mean, "if the caller's feature set is not a superset of the function, then the function cannot be inlined." If that assumption is wrong, then I think my mental model is broken.

I'm sorry for my confusing wording. It's not exactly a superset relationship that matters, but whether the features are compatible. A subset-superset relationship does not always imply compatibility, though it usually does, and for x86_64, as far as I can tell, if the callee has a subset of the caller's features, it is compatible for inlining.


So let me try to play this back to you in my own words to make sure I grok this. So let's pick a function that isn't getting inlined, say, as_i32x8. It has no target_feature attribute and is only #[inline]. Since it's part of std, it's compiled with the lowest common denominator on x86_64, so its actual features when I compile the repro above don't include avx.

After some thinking I think what may be happening here is somewhat different. I'll output some LLVM-IR in the further explanation as well as some rust code. Everything (MCVE) together is in this godbolt.

So… when a #[target_feature(enable="avx2")] is specified on top of a function, as such:

#[target_feature(enable = "avx2")]
pub unsafe fn _mm256_add_epi32(a: __m256i, b: __m256i) -> __m256i { ... }

It will translate to a function that looks a lot like this:

define void @_mm256_add_epi32(%__m256i* %0, %__m256i* %1, %__m256i* %2) unnamed_addr #0 { ... }

attributes #0 = { ... "target-cpu"="skylake-avx512" "target-features"="+avx2" }

Similarly, when a function as such is compiled:

pub(crate) trait m256iExt: Sized {
    // ...
    // #[target_feature(default)]
    fn as_i32x8(self) -> i32x8 {
        unsafe { transmute(self.as_m256i()) }
    }
}

It will become a:

define internal fastcc void @as_i32x8(<8 x i32>* %0, %__m256i* %1) unnamed_addr #0 { ... }

attributes #0 = { ... "target-cpu"="skylake-avx512" } ; uses global default target-features!

Now, AFAICT LLVM will not "combine" the per-function target features with the list of global features, but rather overwrite that list. And so what presumably happens here is that we have a _mm256_add_epi32 with a single target feature (avx2), and an as_i32x8 with whatever the globally set features are (with -Ctarget-cpu=native, maybe the entire list of all the features your CPU supports)?
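The hypothesis can be modelled in a few lines. This is only a sketch of the theory, not of rustc or LLVM internals: the feature names in `global` are a hypothetical stand-in for whatever `native` expands to, and "compatible" is approximated as "callee features are a subset of caller features".

```rust
use std::collections::BTreeSet;

type Features = BTreeSet<&'static str>;

fn set(names: &[&'static str]) -> Features {
    names.iter().copied().collect()
}

// Hypothesised behaviour: the per-function list replaces the global list.
fn overwrite(_global: &Features, per_fn: &Features) -> Features {
    per_fn.clone()
}

// Alternative: the per-function list is added on top of the global list.
fn merge(global: &Features, per_fn: &Features) -> Features {
    global.union(per_fn).copied().collect()
}

// Approximation of the inlining rule: callee features must all be in the caller.
fn callee_fits(caller: &Features, callee: &Features) -> bool {
    callee.is_subset(caller)
}

fn main() {
    // Hypothetical expansion of -Ctarget-cpu=native.
    let global = set(&["sse2", "avx", "avx2", "bmi2"]);
    // #[target_feature(enable = "avx2")] on the caller.
    let annotated = set(&["avx2"]);
    // An unannotated helper such as as_i32x8 gets the global set.
    let helper = global.clone();

    let caller_overwrite = overwrite(&global, &annotated);
    let caller_merge = merge(&global, &annotated);

    println!("overwrite: helper inlinable = {}", callee_fits(&caller_overwrite, &helper));
    println!("merge:     helper inlinable = {}", callee_fits(&caller_merge, &helper));
}
```

Under the overwrite model the annotated caller ends up with fewer features than the unannotated helper, so the helper no longer fits; under the merge model it does.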


Now, some further exploration with the godbolt example has shown some pretty weird behaviours, so I'm not exactly sure if what I'm saying is entirely correct.

So I think my theory may be plausible to some extent, but also probably incorrect given the two weird behaviours above...

@nagisa
Member

nagisa commented Mar 12, 2021

In short, there's at least one bug with -Ctarget-cpu=native handling – we should be prepending the features that we set globally to the target-features set of every function as well. Whether that will help with this bug or not, I'm not sure.

@nagisa nagisa removed the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Mar 13, 2021
@lqd
Member

lqd commented Mar 14, 2021

Not that there were many doubts left, but I've indeed bisected this to c87ef0a. That is #80749 as expected.

@lqd lqd removed the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Mar 14, 2021
@JohnTitor JohnTitor added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Mar 17, 2021
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 17, 2021
…ochenkov

Adjust `-Ctarget-cpu=native` handling in cg_llvm

When cg_llvm encounters the `-Ctarget-cpu=native` it computes an
explicit set of features that applies to the target in order to
correctly compile code for the host CPU (because e.g. `skylake` alone is
not sufficient to tell if some of the instructions are available or
not).

However there were a couple of issues with how we did this. Firstly, the
order in which features were overridden wasn't quite right – conceptually
you'd expect `-Ctarget-cpu=native` option to override the features that
are implicitly set by the target definition. However due to how other
`-Ctarget-cpu` values are handled we must adopt the following order
of priority:

* Features from -Ctarget-cpu=*; are overridden by
* Features implied by --target; are overridden by
* Features from -Ctarget-feature; are overridden by
* function specific features.
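
As an illustration of the priority order listed above, here is a minimal sketch that applies layers of `+feature` / `-feature` entries from lowest to highest priority. It is a toy model with hypothetical feature lists and a hypothetical `resolve` helper, not the code in cg_llvm.

```rust
use std::collections::BTreeMap;

// Applies each layer in order; a later layer overrides an earlier one.
// Entries are "+name" to enable and "-name" to disable a feature.
fn resolve(layers: &[&[&str]]) -> Vec<String> {
    let mut state: BTreeMap<&str, bool> = BTreeMap::new();
    for layer in layers {
        for &entry in layer.iter() {
            if let Some(name) = entry.strip_prefix('+') {
                state.insert(name, true);
            } else if let Some(name) = entry.strip_prefix('-') {
                state.insert(name, false);
            }
        }
    }
    // Keep only the features that ended up enabled, sorted by name.
    state
        .into_iter()
        .filter(|(_, on)| *on)
        .map(|(name, _)| name.to_string())
        .collect()
}

fn main() {
    // Hypothetical inputs, lowest priority first.
    let from_target_cpu: &[&str] = &["+sse2", "+avx", "+avx2"];
    let from_target: &[&str] = &["-avx2"];
    let from_target_feature: &[&str] = &["+bmi2"];
    let function_specific: &[&str] = &["+avx2"];

    let enabled = resolve(&[
        from_target_cpu,
        from_target,
        from_target_feature,
        function_specific,
    ]);
    println!("{:?}", enabled);
}
```

The point of the model is that the function-level entry only flips `avx2` back on; everything the earlier layers enabled is preserved rather than discarded.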

Another problem was that the function-level `target-features`
attribute would overwrite the entire set of the globally enabled
features, rather than just the features the
`#[target_feature(enable/disable)]` specified. With something like
`-Ctarget-cpu=native` we'd end up in a situation wherein a function
without `#[target_feature(enable)]` annotation would have a broader
set of features compared to a function with one such attribute. This
turned out to be a cause of heavy run-time regressions in some code
using these function-level attributes in conjunction with
`-Ctarget-cpu=native`, for example.

With this PR rustc is more careful about specifying the entire set of
features for functions that use `#[target_feature(enable/disable)]` or
`#[instruction_set]` attributes.

Sadly testing the original reproducer for this behaviour is quite
impossible – we cannot rely on `-Ctarget-cpu=native` to be anything in
particular on developer or CI machines.

cc rust-lang#83027 `@BurntSushi`
@apiraino
Contributor

Assigning P-high as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 17, 2021
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Mar 17, 2021
@BurntSushi
Member Author

This does appear fixed by #83084!

$ time ./rg-nightly-2021-03-10 -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    3.945
user    3.812
sys     0.130
maxmem  7 MB
faults  0

$ time ./rg-nightly-2021-03-17 -c --no-mmap -a '[a-z]' /tmp/subtitles2016-sample.en
31813587

real    1.507
user    1.372
sys     0.133
maxmem  7 MB
faults  0

Thanks again @nagisa and everyone who helped diagnose this problem. :-)
