
Conditions using bitwise OR on booleans may produce de-optimized code #32414

Closed
mzabaluev opened this issue Mar 21, 2016 · 4 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@mzabaluev (Contributor)

When writing optimization-friendly code, it can seem like a good idea to unroll branches by hand: perform a multi-step computation optimistically, record possible failures of the intermediate steps as booleans, and combine those failure flags with bitwise logic when the validity of the result is finally checked. Here's an example of how this might work with checked arithmetic, inspired by some code in std::hash:

fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
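For comparison, the referenced follow-up commit replaced this hand-unrolled style with idiomatic Option chaining. A minimal sketch of that rewrite, using the standard checked_mul/checked_add methods:

```rust
// Idiomatic alternative: chain Option-returning checked_* methods
// instead of tracking overflow flags by hand.
fn calculate_size(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    elem_size
        .checked_mul(length)              // None on multiplication overflow
        .and_then(|acc| acc.checked_add(offset)) // None on addition overflow
}

fn main() {
    assert_eq!(calculate_size(2, 3, 4), Some(10));
    assert_eq!(calculate_size(usize::MAX, 2, 0), None);
    println!("ok");
}
```

This version short-circuits on the first overflow rather than evaluating both flags, which trades the branchless intent for clarity.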

However, in optimized code generation at least on x86-64, the bitwise OR for booleans is sometimes decomposed into a series of checks and branches, defeating the whole purpose. Here's a condensed benchmark comparing boolean OR with integer bitwise OR, where the results of both are used as the condition for branching.

#![feature(test)]

extern crate test;

use test::Bencher;

#[inline(never)]
fn or_bools(a: bool, b: bool, c: bool) -> Option<u64> {
    if a | b | c { Some(1) } else { None }
}

#[inline(never)]
fn or_bytes(a: u8, b: u8, c: u8) -> Option<u64> {
    if (a | b | c) != 0 { Some(1) } else { None }
}

#[bench]
fn bench_or_bools(b: &mut Bencher) {
    const DATA: [(bool, bool, bool); 4]
              = [(false, false, false),
                 (true , false, false),
                 (false, true , false),
                 (false, false, true)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bools(a, b, c));
        }
    })
}

#[bench]
fn bench_or_bytes(b: &mut Bencher) {
    const DATA: [(u8, u8, u8); 4]
              = [(0u8, 0u8, 0u8),
                 (1u8, 0u8, 0u8),
                 (0u8, 1u8, 0u8),
                 (0u8, 0u8, 1u8)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bytes(a, b, c));
        }
    })
}
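One workaround sketch (my own suggestion, not from the thread): widening the booleans to u8 before OR-ing them makes the function equivalent to or_bytes, so the optimizer sees an integer bitwise OR rather than a chain of i1 values it may decompose into branches.

```rust
// Hypothetical workaround: cast each bool to u8 so the OR is performed
// on integers, then compare the combined byte against zero once.
#[inline(never)]
fn or_bools_as_bytes(a: bool, b: bool, c: bool) -> Option<u64> {
    if (a as u8 | b as u8 | c as u8) != 0 { Some(1) } else { None }
}

fn main() {
    assert_eq!(or_bools_as_bytes(false, false, false), None);
    assert_eq!(or_bools_as_bytes(true, false, false), Some(1));
    println!("ok");
}
```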

The de-optimization looks like the work of LLVM, as the IR for or_bools preserves the original intent:

; Function Attrs: noinline norecurse nounwind uwtable
define internal fastcc void @_ZN8or_bools20h51bacbaed15b22f4gaaE(%"2.core::option::Option<u64>"* noalias nocapture dereferenceable(16), i1 zeroext, i1 zeroext, i1 zeroext) unnamed_addr #0 {
entry-block:
  %4 = or i1 %1, %2
  %5 = or i1 %4, %3
  %6 = bitcast %"2.core::option::Option<u64>"* %0 to i8*
  br i1 %5, label %then-block-26-, label %else-block

then-block-26-:                                   ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, i64, [0 x i8] }* @const5784 to i8*), i64 16, i32 8, i1 false)
  br label %join

else-block:                                       ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, [8 x i8] }* @const5785 to i8*), i64 16, i32 8, i1 false)
  br label %join

join:                                             ; preds = %else-block, %then-block-26-
  ret void
}
@mzabaluev (Contributor, Author)

This probably should have been filed straight for LLVM, but I'm intimidated by their Bugzilla.

mzabaluev added a commit to mzabaluev/indexed-collections that referenced this issue Mar 22, 2016
Rewrote the hand-unrolled code that used .overflowing_* arithmetic
methods, into idiomatic Option chaining with .checked_*.

Premature optimization is the root of all evil. Especially if it does not
end up actually optimizing:
rust-lang/rust#32414
@eefriedman (Contributor)

Long discussion on LLVM bugtracker: https://llvm.org/bugs/show_bug.cgi?id=23827 .

@mzabaluev (Contributor, Author)

@eefriedman, thanks. As I understand it, the relative performance of branched vs. branchless evaluation of a boolean-ish expression tree depends on the predictability of the branch predicates. Branched code tends to win when: 1) the branches are highly predictable; 2) the earlier ones are taken often, short-circuiting the evaluation. In the specific case of checked arithmetic, the branches are very predictable (no overflows occur normally), but an evaluation normally has to check all predicates, so short-circuiting buys nothing. It seems this is really a matter of giving the compiler hints about branch probabilities, support for which is yet to be implemented (#26179).
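As an illustration of such a hint on stable Rust (an assumption on my part; the unstable likely/unlikely intrinsics tracked in #26179 are the more direct mechanism), marking the overflow path as a call to a #[cold] function tells the optimizer that branch is unlikely:

```rust
// Hypothetical sketch: a #[cold] function signals to the optimizer
// that any branch calling it is unlikely to be taken.
#[cold]
fn overflowed() -> Option<usize> {
    None
}

fn calculate_size_hinted(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    // The combined flag is checked once; the overflow arm is cold.
    if oflo1 | oflo2 { overflowed() } else { Some(acc) }
}

fn main() {
    assert_eq!(calculate_size_hinted(2, 3, 4), Some(10));
    assert_eq!(calculate_size_hinted(usize::MAX, 2, 0), None);
    println!("ok");
}
```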

@steveklabnik steveklabnik added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Mar 24, 2016
@Mark-Simulacrum (Member)

The LLVM bug was closed as invalid, and judging by the tracking issue, the unstable expect/likely intrinsics are now implemented. I'm going to close this; please ask us to reopen if you disagree.
