Very slow decoding of paletted PNG images #393
Comments
Measured on …
The version in this crate handles paletted images with 1/2/4-bit indices, whereas …
Based on this comment, I understand that it does handle lower bit depth indices correctly, but they're expanded to 8 bits before being used for palette lookups.
I thought maybe the … My guess is that the bottleneck is the access pattern of using the same buffer for holding the indices and expanding the colors in back-to-front order. If a separate buffer with the indices were passed in, the signature could be:

```rust
pub fn unpack_bits<F>(buf: &mut [u8], palette_indices: &[u8], channels: usize, bit_depth: u8, func: F)
```

The possibility of vectorization, or at least fewer bounds checks, seems a likely benefit. L36 can be simplified to remove a shift, but it won't make a difference. (Lines 27 to 38 in 2f53fc4)
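A minimal sketch of what that proposed signature could look like. The loop structure and the MSB-first shift handling here are my assumptions for illustration, not code from the crate:

```rust
// Hypothetical sketch of the proposed out-of-place unpack_bits: palette
// indices come from a separate read-only buffer, and `buf` is written
// strictly front to back, avoiding the back-to-front in-place pattern.
pub fn unpack_bits<F>(
    buf: &mut [u8],
    palette_indices: &[u8],
    channels: usize,
    bit_depth: u8,
    func: F,
) where
    F: Fn(u8, &mut [u8]),
{
    // PNG packs low-bit-depth indices MSB-first within each byte.
    let mask = ((1u16 << bit_depth) - 1) as u8;
    let mut pixels = buf.chunks_exact_mut(channels);
    'outer: for &byte in palette_indices {
        let mut shift = 8 - bit_depth as usize;
        loop {
            match pixels.next() {
                Some(chunk) => func((byte >> shift) & mask, chunk),
                None => break 'outer, // output buffer full
            }
            if shift < bit_depth as usize {
                break; // this byte is exhausted
            }
            shift -= bit_depth as usize;
        }
    }
}

fn main() {
    // 2-bit indices 0,1,2,3 packed into one byte, expanded via a 3-channel palette.
    let palette = [[255u8, 0, 0], [0, 255, 0], [0, 0, 255], [9, 9, 9]];
    let mut out = vec![0u8; 12];
    unpack_bits(&mut out, &[0b00_01_10_11], 3, 2, |idx, px| {
        px.copy_from_slice(&palette[idx as usize])
    });
    assert_eq!(out, vec![255, 0, 0, 0, 255, 0, 0, 0, 255, 9, 9, 9]);
}
```

With the indices in a read-only input and the output walked via `chunks_exact_mut`, there is no index arithmetic on `buf` at all, which is the property that could help the optimizer.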
The hot loop is calling the closure here. The indices in the closures can be replaced with … (Lines 856 to 863 in 2f53fc4)
I also tried unpacking the iterators to understand the logic and maybe get better performance. It's definitely clumsy trying to convert that iterator chain to some loop format. I converted it to this to see if there was a better way of special-casing the different depths, but 8-bit will always only have a 0 for the shift and shouldn't ever have a …

```rust
let entry_indices = (0..entries).rev().skip(usize::from(skip != 0));
let mut j = buf.len() - channels;
let mut curr = buf[entries - 1];
if skip != 0 {
    for shift in (0..8).step_by(bit_depth as usize).skip(skip) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}
for idx in entry_indices {
    curr = buf[idx];
    for shift in (0..8).step_by(bit_depth as usize) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}
```
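For what it's worth, one way to special-case the depths without duplicating the loop body would be a const-generic helper monomorphized per depth, so the shift and mask become compile-time constants. This is only an illustrative sketch with hypothetical names, not code from the crate, and it walks pixels front to back rather than in the crate's back-to-front order:

```rust
// Illustrative sketch: monomorphize the unpacking loop per bit depth so the
// compiler sees constant shifts and masks.
fn unpack_depth<const BITS: usize>(bytes: &[u8], mut emit: impl FnMut(u8)) {
    let mask = ((1u16 << BITS) - 1) as u8;
    for &b in bytes {
        let mut shift = 8 - BITS; // MSB-first, per the PNG format
        loop {
            emit((b >> shift) & mask);
            if shift < BITS {
                break; // for BITS == 8 this exits after a single iteration
            }
            shift -= BITS;
        }
    }
}

fn unpack(bytes: &[u8], bit_depth: u8, emit: impl FnMut(u8)) {
    match bit_depth {
        1 => unpack_depth::<1>(bytes, emit),
        2 => unpack_depth::<2>(bytes, emit),
        4 => unpack_depth::<4>(bytes, emit),
        8 => unpack_depth::<8>(bytes, emit),
        _ => unreachable!("PNG palette bit depths are 1/2/4/8"),
    }
}

fn main() {
    let mut out = Vec::new();
    unpack(&[0xAB], 4, |v| out.push(v));
    assert_eq!(out, vec![0xA, 0xB]);
}
```

The dispatch happens once per call, not per pixel, so the 8-bit case in particular compiles down to a plain byte copy loop with shift 0 and mask 0xFF.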
This reminded me that I started working on more refactoring, which included changing the transformations to not operate in place. Just created #396 with the current state.
Looking at the top 500 websites, paletted images account for 35% of PNG images. Source: …
OTOH, so far I have only had negative results when trying to improve paletted images: #416 (comment)
It's the bit unpacking hot loop that's slowing things down. It should be fairly straightforward to integrate into the …
That code was changed in #405 and looks like this now. The distinction I'd make is that it's not the bit-unpacking so much as retrieving colors from the lookup table and filling those into the buffer. Trying to do it in 2 passes ended up slower than the current behavior of unpacking and then doing the lookup in 1 pass. The code didn't seem to get autovectorized even using …

I'm not sure that it's straightforward or easy to get similar results without larger architectural changes, as mentioned in #416 (comment), where several different methods were tried unsuccessfully.
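As an illustration of the lookup-and-fill step with 8-bit indices, here is my own sketch; the full 256-entry RGBA table is an assumption for the example, not the crate's internal representation:

```rust
// Sketch: resolve 8-bit palette indices to RGBA in one pass. A full 256-entry
// table means `lut[i as usize]` can never be out of bounds for a u8 index,
// and iterating with chunks_exact_mut/zip avoids indexing into `out`.
fn indices_to_rgba(indices: &[u8], lut: &[[u8; 4]; 256], out: &mut [u8]) {
    for (px, &i) in out.chunks_exact_mut(4).zip(indices) {
        px.copy_from_slice(&lut[i as usize]);
    }
}

fn main() {
    let mut lut = [[0u8; 4]; 256];
    lut[7] = [1, 2, 3, 255];
    let mut out = vec![0u8; 8];
    indices_to_rgba(&[7, 7], &lut, &mut out);
    assert_eq!(out, vec![1, 2, 3, 255, 1, 2, 3, 255]);
}
```

Even in this shape the gather through `lut` is data-dependent, which is one plausible reason the loop resists autovectorization regardless of how the bounds checks are arranged.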
It is clearly possible to do better here still, as evidenced by …
FWIW, I've tried running …

Maybe this means that the observed runtime delta (compared to …) …

For the record, here are the …
You need to call … If you set that decoder option, then 40% of the total decoding time is spent in …

It is measured on this image, meant to be a more realistic use case for indexed PNG: Stadt_Onex_2021_indexed_oxipng, originally sourced from Wikimedia and converted to a paletted image by myself.

It is possible that Chromium prefers to resolve the indices themselves using the palette chunk instead of getting RGB or RGBA output from the decoder, in which case this bottleneck is not relevant to Chromium.

This hot loop is where 40% of the decoding time for this image is spent: (Lines 872 to 879 in b00fb53)
Code used for measurement: https://gist.github.com/Shnatsel/ec75b05205ef55c96b9d256fdc36c2ec
Oops. You're right, my bad. I knew about that but forgot :-(.
Thanks for sharing.
Ack.
No, Chromium also wants to expand into RGBA (at a high level; there are some other unimportant details like BGRA and/or alpha premultiplication).
I've tried to replicate your results with the code at https://github.com/anforowicz/image-png/tree/palette-measurements, but I got different results. I am very curious what may be causing the difference. Repro steps: …
In the profile at https://share.firefox.dev/3tOf0ZM that I got, I can see the …
If this is a benchmark, you need to set …
Speaking of tools, samply is amazing. Try this: …
Its processed results are more accurate than …
Ah, I've found your problem: there are two different …
Ugh. You're right. Thank you for pointing this out. That's rather embarrassing (since I've abandoned my earlier attempts at improving …).

For now I've opened #453 to cover the …

I'll try to find some time next week to also work on:
Probably because of inlining. More on inlining in Rust: … You can use https://github.com/pacak/cargo-show-asm to inspect the generated assembly, or just …
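A small, hypothetical illustration of how inlining interacts with profiling: marking a helper #[inline(never)] keeps it as a distinct symbol, so its cost shows up on its own line in a profiler and it remains findable in cargo-show-asm output (the function and names below are made up for the example):

```rust
// Hypothetical helper: without the attribute the compiler would likely inline
// this into its caller, and its time would be attributed to the caller's frame.
#[inline(never)]
fn palette_lookup(palette: &[[u8; 3]], idx: u8) -> [u8; 3] {
    palette[idx as usize]
}

fn main() {
    let palette = [[10u8, 20, 30], [40, 50, 60]];
    assert_eq!(palette_lookup(&palette, 1), [40, 50, 60]);
}
```

The flip side is that forcing a call boundary in a per-pixel hot loop can itself cost performance, so this is a measurement aid rather than an optimization.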
Fixed by #462 |
zune-png benchmarks show that the png crate is much slower than other decoders on indexed images: a whopping 3x slower than zune-png.

CPU profile shows that 71% of the time is spent in png::utils::unpack_bits, specifically this hot loop. The code is full of indexing and doesn't look amenable to autovectorization. I think the entire function will have to be rewritten; it's probably a good idea to copy zune-png here.