Extremely bad codegen of `leading_zeros` on several architectures #85879
Labels: A-codegen (Area: Code generation), A-LLVM (Area: Code generation parts specific to LLVM; both correctness bugs and optimization-related issues), C-bug (Category: this is a bug), I-slow (Issue: problems and improvements with respect to performance of generated code), T-compiler (relevant to the compiler team, which will review and decide on the PR/issue)
A few months ago, I noticed some problems with the codegen of `leading_zeros` for RISC-V and some other architectures, and tried to fix it in a `compiler-builtins` PR. Today, I was looking at the function for another reason and decided to see what the codegen currently looks like. I assumed that LLVM used `__clzsi2` for `usize::leading_zeros` and then used a recursive wrapper to extend it to larger integers. It seems that it sometimes uses its own method, and other times uses `__clzsi2`. I double-checked that the codegen in `compiler-builtins` is still correct. To see what I am talking about, run this code in an empty library crate with `cargo rustc --release --target=[target triple] -- --emit=asm` and check the assembly file under `target/[target triple]/release/deps/`:
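The snippet itself did not survive in this copy of the issue. A minimal sketch of what the three functions described next could look like (the function names come from the issue text; the bodies are my reconstruction of a `compiler-builtins`-style software clz, not the author's exact code):

```rust
// Hedged reconstruction: `expected_lz` mirrors the branchless binary-search
// clz that `compiler-builtins` uses when there is no hardware instruction;
// `actual_lz` and `lz_u128` are the plain standard-library calls whose
// codegen is being inspected.
pub fn expected_lz(mut x: u32) -> u32 {
    if x == 0 {
        return 32;
    }
    let mut z = 0;
    // Binary search: whenever the upper part of the remaining window is
    // zero, add its width to the count and shift the lower part up.
    if x >> 16 == 0 { z += 16; x <<= 16; }
    if x >> 24 == 0 { z += 8; x <<= 8; }
    if x >> 28 == 0 { z += 4; x <<= 4; }
    if x >> 30 == 0 { z += 2; x <<= 2; }
    if x >> 31 == 0 { z += 1; }
    z
}

pub fn actual_lz(x: u32) -> u32 {
    x.leading_zeros()
}

pub fn lz_u128(x: u128) -> u32 {
    x.leading_zeros()
}
```

On a target with a hardware clz all three compile to a couple of instructions; the problem below is what LLVM emits on targets where it has to emulate clz in software.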
The `expected_lz` function is equivalent to what `compiler-builtins` does, while `actual_lz` shows what a plain `leading_zeros` compiles down to. `lz_u128` demonstrates another issue that I have known about for some time, but it never got fixed. I think I remember nagisa opening an LLVM issue for it, but I can't find the link.

The worst behavior is on
`riscv32i-unknown-none-elf`, where `expected_lz` compiles down to the expected 27 branchless assembly instructions, but `actual_lz` has more assembly instructions and includes a call to `__mulsi3`, which probably makes it about as expensive to compute as a full integer division. `lz_u128` is truly horrifying.

`riscv64gc-unknown-none-elf` does roughly the same thing, and while the multiplication could be expected to run much faster there, `expected_lz` is still much better. Whatever inlining led to `lz_u128` is not good. The correct `lz_u128` looks like what is produced by:
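The snippet that originally followed here is also missing from this copy. A hedged sketch of an `lz_u128_expected` along the lines described (my reconstruction: split the `u128` into two 64-bit halves and do a single 64-bit clz plus a select):

```rust
// Hedged sketch: a sane u128 clz lowering in terms of two u64 halves.
// Only one 64-bit clz is ever executed, plus a conditional add of 64.
pub fn lz_u128_expected(x: u128) -> u32 {
    let hi = (x >> 64) as u64;
    let lo = x as u64;
    if hi != 0 {
        hi.leading_zeros()
    } else {
        64 + lo.leading_zeros()
    }
}
```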
`thumbv8m.base-none-eabi` (I decided to use this as a non-RISC-V example that has no hardware clz) uses calls to `__clzsi2` instead of rolling its own thing. Take a look at the number of `__clzsi2` calls and extra instructions `lz_u128` makes versus what the `lz_u128_expected` I coded above shows should be possible.

I suspect that instead of two separate bugs, there is one bug where LLVM defines larger-bitwidth clz's in terms of smaller clz's in such a way that it generates horrible code when the smaller clz's get inlined upwards into the larger ones. Whenever LLVM decides to roll its own code instead of calling `__clzsi2`, it starts with some small clz on an integer like `u8`, and by the time `u64` is reached the code is extremely garbled.

`aarch64-pc-windows-msvc` and x86_64 call the hardware clz and do the correct extension for `lz_u128`. `expected_lz` can be ignored there, since `__clzsi2` is only used when a hardware clz cannot be used.

It is important for the embedded architectures to have a better clz, because the performance of that function is the base of many other kinds of arithmetic emulation, such as soft floats.
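As a concrete illustration of that last point (my example, not from the issue): soft-float routines lean on clz to renormalize significands, so every soft-float add or multiply on these targets pays for a clz.

```rust
// Illustration: soft-float-style normalization. The leading-zero count
// gives, in one step, the shift needed to put the top bit of a nonzero
// significand at the MSB, with a matching exponent adjustment.
fn normalize(sig: u32, exp: i32) -> (u32, i32) {
    debug_assert!(sig != 0, "a zero significand has no normal form");
    let shift = sig.leading_zeros();
    (sig << shift, exp - shift as i32)
}
```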