
Conditions using bitwise OR on booleans may produce de-optimized code #32414

Closed
mzabaluev opened this issue Mar 21, 2016 · 4 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@mzabaluev (Contributor)

When writing optimization-friendly code, it can seem like a good idea to unroll branches by hand: perform a multi-step computation optimistically, record possible failures of the intermediate steps as booleans, and combine those failure flags with bitwise logic when the validity of the result is finally checked. Here's an example of how this might work with checked arithmetic, inspired by some code in std::hash:

fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
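For comparison, the referenced follow-up commit replaced this hand-unrolled style with idiomatic Option chaining. A minimal sketch of that rewrite, using the standard checked_mul/checked_add methods:

```rust
// Idiomatic alternative: chain Option-returning checked_* methods
// instead of tracking overflow flags by hand.
fn calculate_size(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    elem_size
        .checked_mul(length)              // None on multiplication overflow
        .and_then(|acc| acc.checked_add(offset)) // None on addition overflow
}

fn main() {
    assert_eq!(calculate_size(2, 3, 4), Some(10));
    assert_eq!(calculate_size(usize::MAX, 2, 0), None);
    println!("ok");
}
```

This version short-circuits on the first overflow rather than evaluating both flags, which trades the branchless intent for clarity.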

However, in optimized code generation at least on x86-64, the bitwise OR for booleans is sometimes decomposed into a series of checks and branches, defeating the whole purpose. Here's a condensed benchmark comparing boolean OR with integer bitwise OR, where the results of both are used as the condition for branching.

#![feature(test)]

extern crate test;

use test::Bencher;

#[inline(never)]
fn or_bools(a: bool, b: bool, c: bool) -> Option<u64> {
    if a | b | c { Some(1) } else { None }
}

#[inline(never)]
fn or_bytes(a: u8, b: u8, c: u8) -> Option<u64> {
    if (a | b | c) != 0 { Some(1) } else { None }
}

#[bench]
fn bench_or_bools(b: &mut Bencher) {
    const DATA: [(bool, bool, bool); 4]
              = [(false, false, false),
                 (true , false, false),
                 (false, true , false),
                 (false, false, true)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bools(a, b, c));
        }
    })
}

#[bench]
fn bench_or_bytes(b: &mut Bencher) {
    const DATA: [(u8, u8, u8); 4]
              = [(0u8, 0u8, 0u8),
                 (1u8, 0u8, 0u8),
                 (0u8, 1u8, 0u8),
                 (0u8, 0u8, 1u8)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bytes(a, b, c));
        }
    })
}
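One workaround sketch (my own suggestion, not from the thread): widening the booleans to u8 before OR-ing them makes the function equivalent to or_bytes, so the optimizer sees an integer bitwise OR rather than a chain of i1 values it may decompose into branches.

```rust
// Hypothetical workaround: cast each bool to u8 so the OR is performed
// on integers, then compare the combined byte against zero once.
#[inline(never)]
fn or_bools_as_bytes(a: bool, b: bool, c: bool) -> Option<u64> {
    if (a as u8 | b as u8 | c as u8) != 0 { Some(1) } else { None }
}

fn main() {
    assert_eq!(or_bools_as_bytes(false, false, false), None);
    assert_eq!(or_bools_as_bytes(true, false, false), Some(1));
    println!("ok");
}
```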

The de-optimization looks like the work of LLVM, as the IR for or_bools preserves the original intent:

; Function Attrs: noinline norecurse nounwind uwtable
define internal fastcc void @_ZN8or_bools20h51bacbaed15b22f4gaaE(%"2.core::option::Option<u64>"* noalias nocapture dereferenceable(16), i1 zeroext, i1 zeroext, i1 zeroext) unnamed_addr #0 {
entry-block:
  %4 = or i1 %1, %2
  %5 = or i1 %4, %3
  %6 = bitcast %"2.core::option::Option<u64>"* %0 to i8*
  br i1 %5, label %then-block-26-, label %else-block

then-block-26-:                                   ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, i64, [0 x i8] }* @const5784 to i8*), i64 16, i32 8, i1 false)
  br label %join

else-block:                                       ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, [8 x i8] }* @const5785 to i8*), i64 16, i32 8, i1 false)
  br label %join

join:                                             ; preds = %else-block, %then-block-26-
  ret void
}
@mzabaluev (Contributor, Author)

This probably should have been filed straight for LLVM, but I'm intimidated by their Bugzilla.

mzabaluev added a commit to mzabaluev/indexed-collections that referenced this issue Mar 22, 2016
Rewrote the hand-unrolled code that used .overflowing_* arithmetic
methods, into idiomatic Option chaining with .checked_*.

Premature optimization is the root of all evil. Especially if it does not
end up actually optimizing:
rust-lang/rust#32414
@eefriedman (Contributor)

Long discussion on LLVM bugtracker: https://llvm.org/bugs/show_bug.cgi?id=23827 .

@mzabaluev (Contributor, Author)

@eefriedman, thanks. As I understand it, the relative performance of branched vs. branchless evaluation of a boolean-ish expression tree depends on the predictability of the branch predicates. Branched code tends to win when: 1) the branches are highly predictable; 2) the earlier ones are taken often, short-circuiting the evaluation. In the specific case of checked arithmetic, the branches are very predictable (no overflows occur normally), but an evaluation normally has to check all predicates, so short-circuiting buys nothing. It seems this is really a matter of giving the compiler hints about branch probabilities, support for which is yet to be implemented (#26179).
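As an illustration of such a hint on stable Rust (an assumption on my part; the unstable likely/unlikely intrinsics tracked in #26179 are the more direct mechanism), marking the overflow path as a call to a #[cold] function tells the optimizer that branch is unlikely:

```rust
// Hypothetical sketch: a #[cold] function signals to the optimizer
// that any branch calling it is unlikely to be taken.
#[cold]
fn overflowed() -> Option<usize> {
    None
}

fn calculate_size_hinted(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    // The combined flag is checked once; the overflow arm is cold.
    if oflo1 | oflo2 { overflowed() } else { Some(acc) }
}

fn main() {
    assert_eq!(calculate_size_hinted(2, 3, 4), Some(10));
    assert_eq!(calculate_size_hinted(usize::MAX, 2, 0), None);
    println!("ok");
}
```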

@steveklabnik steveklabnik added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Mar 24, 2016
@Mark-Simulacrum (Member)

The LLVM bug was closed as invalid, and judging by the tracking issue, the unstable expect/likely intrinsics are now implemented. I'm going to close this; please ask us to reopen if you disagree.
