
Extremely bad codegen of leading_zeros on several architectures #85879

Open
AaronKutch opened this issue Jun 1, 2021 · 3 comments
Labels
A-codegen  Area: Code generation
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug  Category: This is a bug.
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@AaronKutch
Contributor

AaronKutch commented Jun 1, 2021

A few months ago, I noticed some problems with the codegen of leading_zeros for RISC-V and some other architectures, and tried to fix it in a compiler-builtins PR. Today, I was looking at the function for another reason and decided to see what the codegen currently looks like. I assumed that LLVM used __clzsi2 for usize::leading_zeros and then used a recursive wrapper to extend it to larger integers. It turns out that it sometimes uses its own method and other times uses __clzsi2. I double-checked that the codegen in compiler-builtins is still correct. To see what I am talking about, put this code in an empty library crate, run cargo rustc --release --target=[target triple] -- --emit=asm, and check the assembly file under target/[target triple]/release/deps/:

#![no_std]

pub fn lz_u128(x: u128) -> usize {
    x.leading_zeros() as usize
}

pub fn actual_lz(x: usize) -> usize {
    x.leading_zeros() as usize
}

fn usize_leading_zeros_riscv(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        // If the upper 32 bits of `x` are not all 0, `t` is set to `1 << 5`, otherwise
        // `t` is set to 0.
        t = ((x >= (1 << 32)) as usize) << 5;
        // If `t` was set to `1 << 5`, then the upper 32 bits are shifted down for the
        // next step to process.
        x >>= t;
        // If `t` was set to `1 << 5`, then we subtract 32 from the number of potential
        // leading zeros
        z -= t;
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = ((x >= (1 << 16)) as usize) << 4;
        x >>= t;
        z -= t;
    }
    t = ((x >= (1 << 8)) as usize) << 3;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 4)) as usize) << 2;
    x >>= t;
    z -= t;
    t = ((x >= (1 << 2)) as usize) << 1;
    x >>= t;
    z -= t;
    t = (x >= (1 << 1)) as usize;
    x >>= t;
    z -= t;
    // All bits except the LSB are guaranteed to be zero for this final bisection step.
    // If `x != 0` then `x == 1` and subtracts one potential zero from `z`.
    z - x
}

fn usize_leading_zeros_default(x: usize) -> usize {
    let mut x = x;
    // the number of potential leading zeros
    let mut z = usize::MAX.count_ones() as usize;
    // a temporary
    let mut t: usize;
    #[cfg(target_pointer_width = "64")]
    {
        t = x >> 32;
        if t != 0 {
            z -= 32;
            x = t;
        }
    }
    #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
    {
        t = x >> 16;
        if t != 0 {
            z -= 16;
            x = t;
        }
    }
    t = x >> 8;
    if t != 0 {
        z -= 8;
        x = t;
    }
    t = x >> 4;
    if t != 0 {
        z -= 4;
        x = t;
    }
    t = x >> 2;
    if t != 0 {
        z -= 2;
        x = t;
    }
    // the last two bisections are combined into one conditional
    t = x >> 1;
    if t != 0 {
        z - 2
    } else {
        z - x
    }
}

pub fn expected_lz(x: usize) -> usize {
    if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
        usize_leading_zeros_riscv(x)
    } else {
        usize_leading_zeros_default(x)
    }
}

The expected_lz function is equivalent to what compiler-builtins does, while actual_lz shows what a plain leading_zeros compiles down to. lz_u128 demonstrates another issue that I have known about for some time but never got fixed. I think I remember nagisa opening an LLVM issue for it, but I can't find the link.
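As a quick sanity check (a standalone sketch of my own, with the bisection hardcoded for 64 bits instead of using target_pointer_width cfgs), the branchy routine agrees with the library's leading_zeros:

```rust
// Standalone 64-bit version of the branchy bisection in
// usize_leading_zeros_default, checked against the intrinsic.
fn lz_bisect_u64(mut x: u64) -> u32 {
    let mut z = 64u32;
    let mut t;
    t = x >> 32; if t != 0 { z -= 32; x = t; }
    t = x >> 16; if t != 0 { z -= 16; x = t; }
    t = x >> 8;  if t != 0 { z -= 8;  x = t; }
    t = x >> 4;  if t != 0 { z -= 4;  x = t; }
    t = x >> 2;  if t != 0 { z -= 2;  x = t; }
    // The last two bisections combine into one conditional.
    t = x >> 1;
    if t != 0 { z - 2 } else { z - x as u32 }
}

fn main() {
    for &x in &[0u64, 1, 2, 3, u64::MAX, 1 << 31, (1 << 63) | 5] {
        assert_eq!(lz_bisect_u64(x), x.leading_zeros());
    }
}
```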

The worst behavior is on riscv32i-unknown-none-elf, where expected_lz compiles down to the expected branchless 27 assembly instructions, but actual_lz emits more instructions and includes a call to __mulsi3, which probably makes it about as expensive to compute as a full integer division. lz_u128 is truly horrifying.

riscv64gc-unknown-none-elf does roughly the same thing, and while the multiplication could be expected to run much faster there, expected_lz is still much better. Whatever inlining led to lz_u128 is not good. The correct lz_u128 looks like what is produced by:

pub fn lz_u128_expected(x: u128) -> usize {
    let mut x = x;
    let mut z = 0;
    if ((x >> 64) as u64) == 0 {
        // high half is zero: 64 leading zeros plus the low half's
        z += 64;
    } else {
        // high half is nonzero: count its leading zeros
        x >>= 64;
    }
    // note: `usize` needs to have bitwidth 64 here
    z + expected_lz(x as usize) as usize
}
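The widening step can be sanity-checked against the intrinsic with a minimal standalone sketch (using u64::leading_zeros for the halves rather than expected_lz, so it compiles on its own):

```rust
// clz of a u128 in terms of clz of its 64-bit halves: take the high
// half's count if it is nonzero, otherwise 64 plus the low half's.
fn lz_u128_check(x: u128) -> u32 {
    let hi = (x >> 64) as u64;
    if hi != 0 {
        hi.leading_zeros()
    } else {
        64 + (x as u64).leading_zeros()
    }
}

fn main() {
    for &x in &[0u128, 1, u128::MAX, 1 << 64, (1 << 100) | 7, u64::MAX as u128] {
        assert_eq!(lz_u128_check(x), x.leading_zeros());
    }
}
```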

thumbv8m.base-none-eabi (which I chose as a non-RISC-V example with no hardware clz) calls __clzsi2 instead of rolling its own sequence. Compare the number of __clzsi2 calls and extra instructions in lz_u128 against what lz_u128_expected above shows should be possible. I suspect that instead of two separate bugs, there is one bug: LLVM defines larger-bitwidth clz's in terms of smaller clz's in such a way that it generates horrible code when the smaller clz's get inlined upwards into the larger ones. Whenever LLVM decides to roll its own code instead of calling __clzsi2, it starts with a small clz on an integer like u8, and by the time u64 is reached the code is extremely garbled.

aarch64-pc-windows-msvc and x86_64 use the hardware clz instruction and do the correct extension for lz_u128. expected_lz can be ignored there, since __clzsi2 is only relevant when no hardware clz is available.

It is important for the embedded architectures to have a good clz, because that function underlies many other kinds of arithmetic emulation, such as soft floats.
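As an illustration of that last point (a sketch of my own, not code from compiler-builtins), a truncating u32-to-f32 conversion shows why clz sits on the hot path of soft-float emulation:

```rust
// Illustrative sketch: building an f32 bit pattern from a nonzero u32
// uses leading_zeros to normalize the significand, so a slow clz slows
// every such conversion. Rounding is truncation for brevity, so the
// result is exact only when the value fits in 24 bits.
fn u32_to_f32_bits_truncated(x: u32) -> u32 {
    assert!(x != 0);
    let lz = x.leading_zeros();
    // Shift so the most significant set bit lands at bit 31.
    let normalized = x << lz;
    // Biased exponent: 127 plus the position of the top set bit.
    let exponent = 127 + (31 - lz);
    // Keep the top 23 fraction bits below the implicit leading 1.
    (exponent << 23) | ((normalized >> 8) & 0x7f_ffff)
}

fn main() {
    assert_eq!(u32_to_f32_bits_truncated(1), 1.0f32.to_bits());
    assert_eq!(u32_to_f32_bits_truncated(5), 5.0f32.to_bits());
    assert_eq!(u32_to_f32_bits_truncated(0x00ff_ffff), 16777215.0f32.to_bits());
}
```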

@inquisitivecrystal
Contributor

@rustbot label +I-slow +A-codegen +C-bug +T-libs-impl +T-compiler

@rustbot rustbot added A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 5, 2021
@scottmcm
Member

scottmcm commented Jun 6, 2021

What's the LLVM-IR being produced here? If it's calling llvm.ctlz and it's having target-specific problems, then it seems to me like it should be an LLVM bug, not something that makes sense for rust itself to try to mitigate. (If it was a bad implementation in compiler-builtins then that would be something rust could fix, but it sounds like you've already done that.)

@AaronKutch
Contributor Author

On all of the architectures it is stuff like:

; testlib::actual_lz
; Function Attrs: nounwind readnone willreturn
define dso_local i32 @_ZN7testlib9actual_lz17hff1ee67f9ad9af94E(i32 %x) unnamed_addr #0 {
start:
  %0 = tail call i32 @llvm.ctlz.i32(i32 %x, i1 false) #3, !range !1
  ret i32 %0
}

and on 64-bit targets:

; testlib::actual_lz
; Function Attrs: nounwind readnone uwtable willreturn
define i64 @_ZN7testlib9actual_lz17h9c7db4bb56dbe73dE(i64 %x) unnamed_addr #0 {
start:
  %0 = tail call i64 @llvm.ctlz.i64(i64 %x, i1 false) #3, !range !2
  ret i64 %0
}

So I guess it is an LLVM backend lowering issue.

@Amanieu Amanieu added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 26, 2022