Function multiversioning for a performance boost #515
Conversation
Adds some transitive […]. Entirely opt-in for now.
I'm not seeing end-to-end decoding speed improvements. I think the cost of running runtime feature detection for every row may outweigh the time savings from faster filtering: decode reprocessed qoi-bench images […]
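The per-row detection cost described above can be avoided by hoisting the feature check out of the hot loop and caching it. A minimal sketch of that idea, not the PR's actual implementation; the `unfilter_row*` names and their placeholder bodies are hypothetical:

```rust
use std::sync::OnceLock;

// Detect CPU features once per process instead of once per row.
fn has_sse41() -> bool {
    static CACHED: OnceLock<bool> = OnceLock::new();
    *CACHED.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            is_x86_feature_detected!("sse4.1")
        }
        #[cfg(not(target_arch = "x86_64"))]
        {
            false
        }
    })
}

// Variant compiled with SSE 4.1 enabled; the compiler may auto-vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn unfilter_row_sse41(row: &mut [u8]) {
    for byte in row.iter_mut() {
        *byte = byte.wrapping_add(1); // placeholder for real unfiltering work
    }
}

fn unfilter_row_scalar(row: &mut [u8]) {
    for byte in row.iter_mut() {
        *byte = byte.wrapping_add(1); // placeholder for real unfiltering work
    }
}

// Dispatch once per call using the cached detection result.
fn unfilter_row(row: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if has_sse41() {
            // SAFETY: SSE 4.1 availability was checked via has_sse41().
            unsafe { unfilter_row_sse41(row) };
            return;
        }
    }
    unfilter_row_scalar(row);
}
```

Dispatching even higher up (once per image rather than once per row) would shrink the remaining overhead further, at the cost of plumbing the chosen variant through the call sites.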
Encoding shows only a very slight speedup...
Edit: I should add that […]
Regarding decoding: in my profile, a large RGB or RGBA image with mostly Paeth filtering spends 30% of the decoding time in Paeth unfiltering, so if multiversioning introduces no overhead, the end-to-end speedup would be 4%. I am seeing exactly that speedup on a 4K RGBA image. I'm also seeing a 3% improvement on a 1024x1024 RGBA image, so the multiversioning overhead is not dramatic. Those x86-64-v3 gains are very tantalizing! I wonder what gets vectorized, and where, to give us these results. How are you measuring these? I'd be very interested in reproducing them.
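The 30%-to-4% arithmetic above is Amdahl's law: if a fraction `f` of the runtime is accelerated by a factor `s`, the overall speedup is `1 / ((1 - f) + f / s)`. A quick sanity check, where the ~1.15x Paeth speedup is an illustrative assumption back-solved from the figures above:

```rust
// Amdahl's law: overall speedup when a fraction `f` of total runtime
// is accelerated by a factor `s`.
fn overall_speedup(f: f64, s: f64) -> f64 {
    1.0 / ((1.0 - f) + f / s)
}

// Assumption: Paeth unfiltering is 30% of decode time and gets ~15% faster.
// overall_speedup(0.30, 1.15) comes out to roughly 1.04, i.e. the ~4%
// end-to-end gain mentioned above.
```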
I cannot measure any difference in end-to-end decoding between SSE 4.1 multiversioning and […]. On your machine, are you seeing any improvement from replacing […]?
I've been using my own benchmark harness for measurements. So far it hasn't been intended for public consumption, so I probably need to polish the interface and expand the readme before it makes sense to anyone other than me 😄 The test data was re-encoded versions of the qoi benchmark suite images, so many/most were far smaller than 1024x1024. Haven't looked at the full distribution, but at first glance I see a bunch of 64x64 icons included.
Nope, I see the same performance for both.
Interesting! It seems that the performance boost on your machine comes from somewhere other than filtering. It would be interesting to profile the process with samply at a high sampling rate, say […]. As for this PR, in its current form it's probably not worth it, and without access to a machine that benefits from v3 optimizations I won't be able to work on this any further. Since there's nothing more I can do, I'm going to go ahead and close this PR.
Implements #514 and adds a benchmarking harness for filtering.
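A filtering micro-benchmark can take a very simple shape. The sketch below is hypothetical (it is not the harness this PR adds) and uses only `std` timing; the `bench` name and its signature are made up for illustration:

```rust
use std::time::Instant;

// Hypothetical micro-benchmark shape: run a filter routine over many rows
// and report throughput in MiB/s.
fn bench<F: FnMut(&mut [u8])>(name: &str, rows: usize, row_len: usize, mut filter: F) -> f64 {
    let mut row = vec![0u8; row_len];
    let start = Instant::now();
    for _ in 0..rows {
        filter(&mut row);
    }
    // Guard against a zero reading from very coarse clocks.
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    let mib = (rows * row_len) as f64 / (1024.0 * 1024.0);
    let throughput = mib / secs;
    println!("{name}: {throughput:.1} MiB/s");
    throughput
}
```

A real harness would also pin the input data, repeat runs to reduce noise, and prevent the compiler from optimizing the filtered rows away.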
Unfiltering (decoding) only benefits from SSE 4.1, and even then only in Paeth. I think it's still worthwhile because unfiltering is responsible for a large share of decoding time and the Paeth predictor is very common. Detailed benchmarks can be found in #514.
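For context, the Paeth predictor is the one defined in the PNG specification: for each byte it picks whichever of the left (`a`), above (`b`), or upper-left (`c`) neighbor is closest to the linear estimate `a + b - c`. A straightforward scalar version looks like this:

```rust
// Paeth predictor as defined in the PNG specification.
// a = left neighbor, b = above, c = upper-left.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (a, b, c) = (a as i16, b as i16, c as i16);
    let p = a + b - c; // initial linear estimate
    let pa = (p - a).abs();
    let pb = (p - b).abs();
    let pc = (p - c).abs();
    // Return the neighbor nearest to the estimate, breaking ties in order a, b, c.
    if pa <= pb && pa <= pc {
        a as u8
    } else if pb <= pc {
        b as u8
    } else {
        c as u8
    }
}
```

The data-dependent branching in the tie-breaking is part of what makes this hard for compilers to auto-vectorize without explicit SIMD or multiversioning.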
Filtering (encoding) is already fast and is not the bottleneck in encoding. It sees minor boosts to Paeth from SSE 4.1 and AVX, each on the order of 10% to 15%, but not enough to justify the binary size increase. However, AVX2 significantly improves performance across the board, with a 2.5x throughput improvement for Paeth. This is welcome now that we're looking into making adaptive filtering the default.
Detailed filtering benchmarks