Extremely bad codegen of `leading_zeros` on several architectures #85879
Labels: A-codegen (Area: Code generation), A-LLVM (Area: Code generation parts specific to LLVM; both correctness bugs and optimization-related issues), C-bug (Category: this is a bug), I-slow (Issue: problems and improvements with respect to performance of generated code), T-compiler (relevant to the compiler team, which will review and decide on the PR/issue)
A few months ago, I noticed some problems with the codegen of `leading_zeros` for RISC-V and some other architectures, and tried to fix it in a `compiler-builtins` PR. Today, I was looking at the function for another reason and decided to see what the codegen currently looks like. I assumed that LLVM used `__clzsi2` for `usize::leading_zeros` and then used a recursive wrapper to extend it to larger integers. It seems that it sometimes uses its own method, and other times uses `__clzsi2`. I double-checked that the codegen in `compiler-builtins` is still correct. To see what I am talking about, run this code in an empty library crate with `cargo rustc --release --target=[target triple] -- --emit=asm` and check the assembly file under `target/[target triple]/release/deps/`:
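The snippet itself did not survive in this copy of the issue. A minimal sketch of what the three functions described next could look like (the function names come from the issue text; the bodies are my reconstruction of a `compiler-builtins`-style software clz, not the author's exact code):

```rust
// Hedged reconstruction: `expected_lz` mirrors the branchless binary-search
// clz that `compiler-builtins` uses when there is no hardware instruction;
// `actual_lz` and `lz_u128` are the plain standard-library calls whose
// codegen is being inspected.
pub fn expected_lz(mut x: u32) -> u32 {
    if x == 0 {
        return 32;
    }
    let mut z = 0;
    // Binary search: whenever the upper part of the remaining window is
    // zero, add its width to the count and shift the lower part up.
    if x >> 16 == 0 { z += 16; x <<= 16; }
    if x >> 24 == 0 { z += 8; x <<= 8; }
    if x >> 28 == 0 { z += 4; x <<= 4; }
    if x >> 30 == 0 { z += 2; x <<= 2; }
    if x >> 31 == 0 { z += 1; }
    z
}

pub fn actual_lz(x: u32) -> u32 {
    x.leading_zeros()
}

pub fn lz_u128(x: u128) -> u32 {
    x.leading_zeros()
}
```

On a target with a hardware clz all three compile to a couple of instructions; the problem below is what LLVM emits on targets where it has to emulate clz in software.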
The `expected_lz` function is equivalent to what `compiler-builtins` does, while `actual_lz` shows what a plain `leading_zeros` compiles down to. `lz_u128` demonstrates another issue that I have known about for some time, but it never got fixed. I think I remember nagisa opening an LLVM issue for it, but I can't find the link.

The worst behavior is on
`riscv32i-unknown-none-elf`, where `expected_lz` compiles down to the expected 27 branchless assembly instructions, but `actual_lz` has more assembly instructions and includes a call to `__mulsi3`, which probably makes it about as expensive to compute as a full integer division. `lz_u128` is truly horrifying.

`riscv64gc-unknown-none-elf` does roughly the same thing, and while the multiplication could be expected to run much faster there, `expected_lz` is still much better. Whatever inlining led to `lz_u128` is not good. The correct `lz_u128` looks like what is produced by:
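The snippet that originally followed here is also missing from this copy. A hedged sketch of an `lz_u128_expected` along the lines described (my reconstruction: split the `u128` into two 64-bit halves and do a single 64-bit clz plus a select):

```rust
// Hedged sketch: a sane u128 clz lowering in terms of two u64 halves.
// Only one 64-bit clz is ever executed, plus a conditional add of 64.
pub fn lz_u128_expected(x: u128) -> u32 {
    let hi = (x >> 64) as u64;
    let lo = x as u64;
    if hi != 0 {
        hi.leading_zeros()
    } else {
        64 + lo.leading_zeros()
    }
}
```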
`thumbv8m.base-none-eabi` (I decided to use this as a non-RISC-V example that has no hardware clz) uses calls to `__clzsi2` instead of rolling its own thing. Take a look at the number of `__clzsi2` calls and extra instructions `lz_u128` makes versus what the `lz_u128_expected` I coded above shows should be possible.

I suspect that instead of two separate bugs, there is one bug where LLVM defines larger-bitwidth clz's in terms of smaller clz's in such a way that it generates horrible code when the smaller clz's get inlined upwards into the larger ones. Whenever LLVM decides to roll its own code instead of calling `__clzsi2`, it starts with some small clz on an integer like `u8`, and by the time `u64` is reached the code is extremely garbled.

`aarch64-pc-windows-msvc` and x86_64 call the hardware clz and do the correct extension for `lz_u128`. `expected_lz` can be ignored there, since `__clzsi2` is only used when a hardware clz cannot be used.

It is important for the embedded architectures to have a better clz, because the performance of that function is the base of many other kinds of arithmetic emulation, such as soft floats.
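As a concrete illustration of that last point (my example, not from the issue): soft-float routines lean on clz to renormalize significands, so every soft-float add or multiply on these targets pays for a clz.

```rust
// Illustration: soft-float-style normalization. The leading-zero count
// gives, in one step, the shift needed to put the top bit of a nonzero
// significand at the MSB, with a matching exponent adjustment.
fn normalize(sig: u32, exp: i32) -> (u32, i32) {
    debug_assert!(sig != 0, "a zero significand has no normal form");
    let shift = sig.leading_zeros();
    (sig << shift, exp - shift as i32)
}
```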