Function multiversioning for a performance boost #515

Shnatsel · 2024-09-29T13:22:22Z

Implements #514 and adds a benchmarking harness for filtering.

Unfiltering (decoding) benefits only from SSE 4.1, and even then only in Paeth. I think it's still worthwhile because unfiltering is responsible for a large share of decoding time and the Paeth predictor is very common. Detailed benchmarks can be found in #514.

Filtering (encoding) is already fast and is not the bottleneck in encoding. SSE 4.1 and AVX each give Paeth a minor boost of 10% to 15%, which is not enough to justify the binary size increase. However, AVX2 significantly improves performance across the board, with a 2.5x throughput improvement for Paeth. This is welcome now that we're looking into making adaptive filtering the default.
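For readers unfamiliar with the technique, function multiversioning can be sketched with the standard library alone: the same body is compiled twice, once with extra CPU features enabled, and a version is chosen at run time. This is a minimal illustration and not the code in this PR; the names `filter_paeth_generic`, `filter_paeth_avx2` and `filter_paeth`, and their signatures, are hypothetical.

```rust
/// Paeth predictor as defined by the PNG specification: choose whichever of
/// left (a), above (b) or upper-left (c) is closest to a + b - c.
#[inline(always)]
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (a, b, c) = (i16::from(a), i16::from(b), i16::from(c));
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    if pa <= pb && pa <= pc {
        a as u8
    } else if pb <= pc {
        b as u8
    } else {
        c as u8
    }
}

/// Portable Paeth filter for one row; `bpp` is bytes per pixel.
/// Marked `inline(always)` so the body is recompiled inside each caller.
#[inline(always)]
fn filter_paeth_generic(bpp: usize, prev: &[u8], cur: &[u8], out: &mut [u8]) {
    assert!(prev.len() == cur.len() && out.len() == cur.len());
    for i in 0..cur.len() {
        // Pixels to the left of the row start are treated as zero.
        let a = if i >= bpp { cur[i - bpp] } else { 0 };
        let b = prev[i];
        let c = if i >= bpp { prev[i - bpp] } else { 0 };
        out[i] = cur[i].wrapping_sub(paeth_predictor(a, b, c));
    }
}

/// Same body, but the compiler may use AVX2 instructions here.
/// Hypothetical helper; calling it on a CPU without AVX2 is undefined behavior.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn filter_paeth_avx2(bpp: usize, prev: &[u8], cur: &[u8], out: &mut [u8]) {
    filter_paeth_generic(bpp, prev, cur, out)
}

/// Dispatcher: use the AVX2 build when the CPU supports it, else the portable one.
pub fn filter_paeth(bpp: usize, prev: &[u8], cur: &[u8], out: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            return unsafe { filter_paeth_avx2(bpp, prev, cur, out) };
        }
    }
    filter_paeth_generic(bpp, prev, cur, out)
}
```

Callers only ever see `filter_paeth`. Every build of the body must produce identical output, so the dispatcher can be checked directly against `filter_paeth_generic`.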

Detailed filtering benchmarks

shnatsel@shnatsel-desktop ~/C/image-png (multiversion)> cargo +nightly bench --bench=filter --features=benchmarks,multiversioning
   Compiling png v0.17.14 (/home/shnatsel/Code/image-png)
warning: unexpected `cfg` condition name: `fuzzing`
  --> src/decoder/stream.rs:26:38
   |
26 | const CHECKSUM_DISABLED: bool = cfg!(fuzzing);
   |                                      ^^^^^^^
   |
   = help: expected names are: `clippy`, `debug_assertions`, `doc`, `docsrs`, `doctest`, `feature`, `miri`, `overflow_checks`, `panic`, `proc_macro`, `relocation_model`, `rustfmt`, `sanitize`, `sanitizer_cfi_generalize_pointers`, `sanitizer_cfi_normalize_integers`, `target_abi`, `target_arch`, `target_endian`, `target_env`, `target_family`, `target_feature`, `target_has_atomic`, `target_has_atomic_equal_alignment`, `target_has_atomic_load_store`, `target_os`, `target_pointer_width`, `target_thread_local`, `target_vendor`, `test`, `ub_checks`, `unix`, and `windows`
   = help: consider using a Cargo feature instead
   = help: or consider adding in `Cargo.toml` the `check-cfg` lint config for the lint:
            [lints.rust]
            unexpected_cfgs = { level = "warn", check-cfg = ['cfg(fuzzing)'] }
   = help: or consider adding `println!("cargo::rustc-check-cfg=cfg(fuzzing)");` to the top of the `build.rs`
   = note: see <https://doc.rust-lang.org/nightly/rustc/check-cfg/cargo-specifics.html> for more information about checking conditional configuration
   = note: `#[warn(unexpected_cfgs)]` on by default

warning: `png` (lib) generated 1 warning
    Finished `bench` profile [optimized] target(s) in 1.23s
     Running benches/filter.rs (target/release/deps/filter-aa0bbd87c0c9579e)
filter/filter=Sub/bpp=1 time:   [55.595 ns 55.803 ns 56.028 ns]
                        thrpt:  [68.086 GiB/s 68.360 GiB/s 68.616 GiB/s]
                 change:
                        time:   [-17.829% -17.667% -17.502%] (p = 0.00 < 0.05)
                        thrpt:  [+21.215% +21.458% +21.697%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  13 (13.00%) high severe

filter/filter=Sub/bpp=2 time:   [93.166 ns 93.221 ns 93.289 ns]
                        thrpt:  [81.782 GiB/s 81.842 GiB/s 81.890 GiB/s]
                 change:
                        time:   [-26.224% -26.196% -26.163%] (p = 0.00 < 0.05)
                        thrpt:  [+35.433% +35.494% +35.545%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

filter/filter=Sub/bpp=3 time:   [133.86 ns 133.92 ns 133.98 ns]
                        thrpt:  [85.414 GiB/s 85.452 GiB/s 85.490 GiB/s]
                 change:
                        time:   [-27.796% -27.748% -27.706%] (p = 0.00 < 0.05)
                        thrpt:  [+38.325% +38.405% +38.496%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

filter/filter=Sub/bpp=4 time:   [189.47 ns 189.53 ns 189.60 ns]
                        thrpt:  [80.478 GiB/s 80.509 GiB/s 80.536 GiB/s]
                 change:
                        time:   [-30.084% -30.056% -30.023%] (p = 0.00 < 0.05)
                        thrpt:  [+42.904% +42.972% +43.030%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

filter/filter=Sub/bpp=6 time:   [329.93 ns 329.98 ns 330.02 ns]
                        thrpt:  [69.354 GiB/s 69.363 GiB/s 69.373 GiB/s]
                 change:
                        time:   [-19.571% -19.537% -19.507%] (p = 0.00 < 0.05)
                        thrpt:  [+24.234% +24.281% +24.333%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low mild
  2 (2.00%) high mild

filter/filter=Sub/bpp=8 time:   [443.66 ns 443.69 ns 443.73 ns]
                        thrpt:  [68.775 GiB/s 68.781 GiB/s 68.786 GiB/s]
                 change:
                        time:   [-18.826% -18.815% -18.804%] (p = 0.00 < 0.05)
                        thrpt:  [+23.158% +23.176% +23.193%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  9 (9.00%) high mild
  2 (2.00%) high severe

filter/filter=Up/bpp=1  time:   [52.373 ns 52.583 ns 52.764 ns]
                        thrpt:  [72.298 GiB/s 72.546 GiB/s 72.837 GiB/s]
                 change:
                        time:   [-7.8465% -7.5406% -7.2616%] (p = 0.00 < 0.05)
                        thrpt:  [+7.8302% +8.1555% +8.5146%]
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  22 (22.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

filter/filter=Up/bpp=2  time:   [91.222 ns 91.275 ns 91.333 ns]
                        thrpt:  [83.534 GiB/s 83.587 GiB/s 83.636 GiB/s]
                 change:
                        time:   [-13.259% -13.215% -13.165%] (p = 0.00 < 0.05)
                        thrpt:  [+15.161% +15.227% +15.285%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

filter/filter=Up/bpp=3  time:   [169.43 ns 169.49 ns 169.55 ns]
                        thrpt:  [67.496 GiB/s 67.522 GiB/s 67.546 GiB/s]
                 change:
                        time:   [-14.093% -14.057% -14.021%] (p = 0.00 < 0.05)
                        thrpt:  [+16.308% +16.357% +16.405%]
                        Performance has improved.

filter/filter=Up/bpp=4  time:   [336.60 ns 336.63 ns 336.68 ns]
                        thrpt:  [45.322 GiB/s 45.327 GiB/s 45.333 GiB/s]
                 change:
                        time:   [+12.295% +12.407% +12.501%] (p = 0.00 < 0.05)
                        thrpt:  [-11.112% -11.037% -10.949%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild

filter/filter=Up/bpp=6  time:   [469.55 ns 469.82 ns 470.14 ns]
                        thrpt:  [48.683 GiB/s 48.717 GiB/s 48.745 GiB/s]
                 change:
                        time:   [+3.5166% +3.5665% +3.6265%] (p = 0.00 < 0.05)
                        thrpt:  [-3.4996% -3.4437% -3.3972%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  16 (16.00%) high severe

filter/filter=Up/bpp=8  time:   [617.16 ns 617.21 ns 617.28 ns]
                        thrpt:  [49.439 GiB/s 49.444 GiB/s 49.448 GiB/s]
                 change:
                        time:   [+3.3804% +3.6490% +3.9032%] (p = 0.00 < 0.05)
                        thrpt:  [-3.7565% -3.5206% -3.2699%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  11 (11.00%) high severe

filter/filter=Avg/bpp=1 time:   [90.479 ns 90.547 ns 90.610 ns]
                        thrpt:  [42.100 GiB/s 42.129 GiB/s 42.161 GiB/s]
                 change:
                        time:   [-35.417% -35.346% -35.279%] (p = 0.00 < 0.05)
                        thrpt:  [+54.510% +54.669% +54.838%]
                        Performance has improved.

filter/filter=Avg/bpp=2 time:   [160.97 ns 161.10 ns 161.24 ns]
                        thrpt:  [47.318 GiB/s 47.359 GiB/s 47.397 GiB/s]
                 change:
                        time:   [-36.942% -36.898% -36.858%] (p = 0.00 < 0.05)
                        thrpt:  [+58.374% +58.475% +58.584%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

filter/filter=Avg/bpp=3 time:   [247.24 ns 247.42 ns 247.63 ns]
                        thrpt:  [46.215 GiB/s 46.254 GiB/s 46.288 GiB/s]
                 change:
                        time:   [-49.299% -49.262% -49.227%] (p = 0.00 < 0.05)
                        thrpt:  [+96.953% +97.091% +97.234%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

filter/filter=Avg/bpp=4 time:   [327.27 ns 327.47 ns 327.71 ns]
                        thrpt:  [46.562 GiB/s 46.596 GiB/s 46.625 GiB/s]
                 change:
                        time:   [-39.599% -39.572% -39.545%] (p = 0.00 < 0.05)
                        thrpt:  [+65.413% +65.487% +65.559%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

filter/filter=Avg/bpp=6 time:   [541.78 ns 542.10 ns 542.40 ns]
                        thrpt:  [42.198 GiB/s 42.221 GiB/s 42.246 GiB/s]
                 change:
                        time:   [-33.958% -33.922% -33.884%] (p = 0.00 < 0.05)
                        thrpt:  [+51.248% +51.335% +51.418%]
                        Performance has improved.

filter/filter=Avg/bpp=8 time:   [655.13 ns 655.64 ns 656.20 ns]
                        thrpt:  [46.506 GiB/s 46.546 GiB/s 46.582 GiB/s]
                 change:
                        time:   [-37.821% -37.784% -37.743%] (p = 0.00 < 0.05)
                        thrpt:  [+60.625% +60.730% +60.827%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

filter/filter=Paeth/bpp=1
                        time:   [154.59 ns 154.79 ns 155.01 ns]
                        thrpt:  [24.610 GiB/s 24.644 GiB/s 24.675 GiB/s]
                 change:
                        time:   [-62.512% -62.474% -62.436%] (p = 0.00 < 0.05)
                        thrpt:  [+166.21% +166.48% +166.75%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

filter/filter=Paeth/bpp=2
                        time:   [288.96 ns 289.07 ns 289.19 ns]
                        thrpt:  [26.382 GiB/s 26.393 GiB/s 26.403 GiB/s]
                 change:
                        time:   [-62.052% -61.950% -61.819%] (p = 0.00 < 0.05)
                        thrpt:  [+161.91% +162.81% +163.52%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

filter/filter=Paeth/bpp=3
                        time:   [461.75 ns 461.98 ns 462.18 ns]
                        thrpt:  [24.761 GiB/s 24.772 GiB/s 24.784 GiB/s]
                 change:
                        time:   [-64.958% -64.940% -64.922%] (p = 0.00 < 0.05)
                        thrpt:  [+185.08% +185.22% +185.37%]
                        Performance has improved.

filter/filter=Paeth/bpp=4
                        time:   [637.02 ns 637.17 ns 637.33 ns]
                        thrpt:  [23.942 GiB/s 23.948 GiB/s 23.953 GiB/s]
                 change:
                        time:   [-65.399% -65.327% -65.269%] (p = 0.00 < 0.05)
                        thrpt:  [+187.93% +188.41% +189.01%]
                        Performance has improved.

filter/filter=Paeth/bpp=6
                        time:   [939.10 ns 939.37 ns 939.70 ns]
                        thrpt:  [24.357 GiB/s 24.365 GiB/s 24.372 GiB/s]
                 change:
                        time:   [-63.200% -63.188% -63.176%] (p = 0.00 < 0.05)
                        thrpt:  [+171.57% +171.65% +171.74%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

filter/filter=Paeth/bpp=8
                        time:   [1.1355 µs 1.1364 µs 1.1375 µs]
                        thrpt:  [26.828 GiB/s 26.854 GiB/s 26.876 GiB/s]
                 change:
                        time:   [-64.346% -64.330% -64.312%] (p = 0.00 < 0.05)
                        thrpt:  [+180.20% +180.35% +180.47%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe

Shnatsel · 2024-09-29T13:24:44Z

Adds some transitive unsafe but I'm not worried about it because none of it is reachable with attacker-controlled data.

Entirely opt-in for now.

fintelia · 2024-09-29T18:34:46Z

I'm not seeing end-to-end decoding speed improvements. I think the cost of running runtime feature detection for every row may outweigh the time savings from faster filtering:

decode reprocessed qoi-bench images...

main:                  197.0 MP/s (average)  176.0 MP/s (geomean)
multiversion:          196.0 MP/s (average)  175.0 MP/s (geomean)
main-unstable:         202.7 MP/s (average)  182.9 MP/s (geomean)
multiversion-unstable: 202.5 MP/s (average)  182.7 MP/s (geomean)

Encoding shows only a very slight speedup...

main:           235.8 MP/s
multiversion:   237.4 MP/s

Edit: I should add that RUSTFLAGS='-C target-cpu=x86-64-v3' does provide a performance boost:

x86-64-v3 decode:  212.7 MP/s (average)  188.2 MP/s (geomean)

x86-64-v3 encode:  280.1 MP/s
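If per-row dispatch does turn out to be the cost suspected above, one possible mitigation is to resolve the implementation once and reuse a function pointer for every row. The sketch below uses only the standard library and the simpler Up unfilter. All names (`RowFn`, `select_unfilter_up`, `unfilter_up_image`) are hypothetical and this is not how the PR is structured. As far as I know, the standard library already caches detection results, so this would only remove the remaining per-call check.

```rust
use std::sync::OnceLock;

/// Pointer to a row unfilter. It is `unsafe` because some variants
/// require CPU features that must be detected before the call.
type RowFn = unsafe fn(&[u8], &mut [u8]);

/// Portable Up unfilter: add the previous (already decoded) row to this one.
#[inline(always)]
fn unfilter_up_generic(prev: &[u8], cur: &mut [u8]) {
    for (c, p) in cur.iter_mut().zip(prev) {
        *c = c.wrapping_add(*p);
    }
}

/// Same body compiled with SSE 4.1 enabled (hypothetical variant).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn unfilter_up_sse41(prev: &[u8], cur: &mut [u8]) {
    unfilter_up_generic(prev, cur)
}

/// Run feature detection and return the best available implementation.
fn select_unfilter_up() -> RowFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            return unfilter_up_sse41;
        }
    }
    unfilter_up_generic
}

/// Unfilter an image whose rows all use the Up filter.
/// Selection runs at most once per process, not once per row.
pub fn unfilter_up_image(rows: &mut [Vec<u8>]) {
    static UNFILTER_UP: OnceLock<RowFn> = OnceLock::new();
    let f = *UNFILTER_UP.get_or_init(select_unfilter_up);
    for i in 1..rows.len() {
        let (done, rest) = rows.split_at_mut(i);
        // SAFETY: `f` is either the portable version or a variant whose
        // required CPU feature was detected in `select_unfilter_up`.
        unsafe { f(&done[i - 1], &mut rest[0]) };
    }
}
```

The trade-off is an indirect call per row, which prevents inlining the row function into the loop. Whether that beats a cached feature check would have to be measured.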

Shnatsel · 2024-09-29T19:18:48Z

Regarding decoding: in my profile, a large RGB or RGBA image with mostly Paeth filtering spends 30% of the decoding time in Paeth unfiltering, so if multiversioning introduces no overhead then the end-to-end speedup would be 4%. I am seeing exactly that speedup on a 4K RGBA image. I'm also seeing a 3% improvement on a 1024x1024 RGBA image, so the multiversioning overhead is not dramatic.
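The 4% estimate is consistent with Amdahl's law. A minimal sketch of the arithmetic follows. It assumes the Paeth unfilter itself gets about 15% faster, which is the upper end of the "5% to 15%" range in this PR's first commit title and is not stated in this comment. The function name is hypothetical.

```rust
/// Amdahl's law: overall speedup when `fraction` of the run time
/// becomes `local_speedup` times faster and the rest is unchanged.
fn end_to_end_speedup(fraction: f64, local_speedup: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / local_speedup)
}

/// 30% of decode time in Paeth unfiltering (from the profile above),
/// with an assumed 1.15x speedup of that part.
fn paeth_estimate() -> f64 {
    end_to_end_speedup(0.30, 1.15)
}
```

With these inputs the result is roughly 1.04, that is, about a 4% end-to-end gain. At the lower end of the range (1.05x) the same formula gives closer to 1.4%.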

Those x86-64-v3 gains are very tantalizing! I wonder what gets vectorized, and where, to produce these results. How are you measuring them? I'd be very interested in reproducing them.

Shnatsel · 2024-09-29T19:36:10Z

I cannot measure any end-to-end difference in decoding between SSE 4.1 multiversioning and RUSTFLAGS='-C target-cpu=x86-64-v3'. I guess it just doesn't benefit my machine for whatever reason.

On your machine are you seeing any improvement from replacing "x86_64+sse+sse2+sse3+sse4.1+ssse3" strings in the code with "x86_64+sse+sse2+sse3+sse4.1+sse4.2+ssse3+avx+avx2+fma" ? That is more or less v3.
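Before comparing the two target strings, it can help to confirm that the machine actually supports every feature in the longer one. The following is a small standard-library sketch; the function name `supports_listed_features` is hypothetical. It checks only the features named in the string above, not the full x86-64-v3 level, which includes a few more.

```rust
/// True if the CPU reports every feature named in
/// "x86_64+sse+sse2+sse3+sse4.1+sse4.2+ssse3+avx+avx2+fma".
#[cfg(target_arch = "x86_64")]
pub fn supports_listed_features() -> bool {
    is_x86_feature_detected!("sse")
        && is_x86_feature_detected!("sse2")
        && is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2")
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("fma")
}

/// On other architectures none of these features exist.
#[cfg(not(target_arch = "x86_64"))]
pub fn supports_listed_features() -> bool {
    false
}
```

If this returns `false`, a version compiled for the longer string should never be selected at run time, and identical timings for both strings would be expected.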

fintelia · 2024-09-30T01:23:46Z

I've been using my own benchmark harness for measurements. So far it hasn't been intended for public consumption, so I probably need to polish the interface and expand the readme before it makes sense to anyone other than me 😄

The test data was re-encoded versions of the qoi benchmark suite images, so many/most were far smaller than 1024x1024. Haven't looked at the full distribution, but at first glance I see a bunch of 64x64 icons included.

> On your machine are you seeing any improvement from replacing "x86_64+sse+sse2+sse3+sse4.1+ssse3" strings in the code with "x86_64+sse+sse2+sse3+sse4.1+sse4.2+ssse3+avx+avx2+fma" ? That is more or less v3.

Nope, I see the same performance for both.

Shnatsel · 2024-10-05T11:53:52Z

Interesting! It seems that the performance boost on your machine comes from somewhere other than filtering.

It would be interesting to profile the process with samply at a high sampling rate, say samply record -r 5000, and compare the profiles between binaries compiled with baseline and v3 profiles. That should point us to the part that actually gets the speed boost.

As for this PR, in its current form it's probably not worth it, and without access to a machine that benefits from v3 optimizations I will not be able to work on this any further. Since there's nothing further I can do, I'm going to go ahead and close this PR.

Shnatsel added 4 commits September 29, 2024 13:29

Multiversion unfiltering for 5% to 15% speedups

4269414

Expose filtering as a benchable API and add a benchmark for it

fddcd46

Commit the rest of the benchmark

3a7a70d

Add multiversioning for filtering and make it an optional, opt-in fea…

03be367

…ture

Shnatsel changed the title ~~Multiversion~~ Function multiversioning for a performance boost Sep 29, 2024

cargo fmt

5b705e6

Shnatsel closed this Oct 5, 2024

Shnatsel mentioned this pull request Oct 5, 2024

Consider runtime CPU feature detection #514

Closed
