Function multiversioning for a performance boost #515

Closed
wants to merge 5 commits into from


Contributor

@Shnatsel Shnatsel commented Sep 29, 2024

Implements #514 and adds a benchmarking harness for filtering.

Unfiltering (decoding) only benefits from SSE 4.1, and even then only in Paeth. I think it's still worthwhile because unfiltering accounts for a large share of decoding time and the Paeth predictor is very common. Detailed benchmarks can be found in #514.

Filtering (encoding) is already fast, and is not the bottleneck in encoding. SSE 4.1 and AVX each give Paeth a minor boost of 10% to 15%, which is not enough to justify the binary size increase. However, AVX2 significantly improves performance across the board, with a 2.5x throughput improvement for Paeth. This is welcome now that we're looking into making adaptive filtering the default.
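
For readers unfamiliar with the technique, here is a minimal sketch of function multiversioning using only the standard library. It is not the code from this PR: the function names are hypothetical, and a real implementation may use a helper crate instead of hand-written dispatch. The same Paeth unfiltering body is compiled twice, once for the baseline target and once with SSE 4.1 enabled, and the faster copy is picked at runtime.

```rust
/// Paeth predictor as defined by the PNG specification.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (a as i16, b as i16, c as i16);
    let p = ia + ib - ic;
    let pa = (p - ia).abs();
    let pb = (p - ib).abs();
    let pc = (p - ic).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Shared body. `inline(always)` lets each caller below recompile it
/// with that caller's own set of enabled CPU features.
#[inline(always)]
fn unfilter_paeth_body(bpp: usize, prev: &[u8], cur: &mut [u8]) {
    for i in 0..cur.len() {
        let a = if i >= bpp { cur[i - bpp] } else { 0 };
        let b = prev[i];
        let c = if i >= bpp { prev[i - bpp] } else { 0 };
        cur[i] = cur[i].wrapping_add(paeth_predictor(a, b, c));
    }
}

/// Copy of the body that the compiler may optimize using SSE 4.1.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn unfilter_paeth_sse41(bpp: usize, prev: &[u8], cur: &mut [u8]) {
    unfilter_paeth_body(bpp, prev, cur)
}

/// Hypothetical entry point: dispatches on every call.
pub fn unfilter_paeth(bpp: usize, prev: &[u8], cur: &mut [u8]) {
    assert_eq!(prev.len(), cur.len());
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.1") {
            // SAFETY: SSE 4.1 support was verified on the line above.
            unsafe { unfilter_paeth_sse41(bpp, prev, cur) };
            return;
        }
    }
    unfilter_paeth_body(bpp, prev, cur)
}
```

Both copies must produce identical output; only the generated machine code differs. The extra copies are also where the binary size increase mentioned above comes from.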

Detailed filtering benchmarks
shnatsel@shnatsel-desktop ~/C/image-png (multiversion)> cargo +nightly bench --bench=filter --features=benchmarks,multiversioning
   Compiling png v0.17.14 (/home/shnatsel/Code/image-png)
warning: unexpected `cfg` condition name: `fuzzing`
  --> src/decoder/stream.rs:26:38
   |
26 | const CHECKSUM_DISABLED: bool = cfg!(fuzzing);
   |                                      ^^^^^^^
   |
   = help: expected names are: `clippy`, `debug_assertions`, `doc`, `docsrs`, `doctest`, `feature`, `miri`, `overflow_checks`, `panic`, `proc_macro`, `relocation_model`, `rustfmt`, `sanitize`, `sanitizer_cfi_generalize_pointers`, `sanitizer_cfi_normalize_integers`, `target_abi`, `target_arch`, `target_endian`, `target_env`, `target_family`, `target_feature`, `target_has_atomic`, `target_has_atomic_equal_alignment`, `target_has_atomic_load_store`, `target_os`, `target_pointer_width`, `target_thread_local`, `target_vendor`, `test`, `ub_checks`, `unix`, and `windows`
   = help: consider using a Cargo feature instead
   = help: or consider adding in `Cargo.toml` the `check-cfg` lint config for the lint:
            [lints.rust]
            unexpected_cfgs = { level = "warn", check-cfg = ['cfg(fuzzing)'] }
   = help: or consider adding `println!("cargo::rustc-check-cfg=cfg(fuzzing)");` to the top of the `build.rs`
   = note: see <https://doc.rust-lang.org/nightly/rustc/check-cfg/cargo-specifics.html> for more information about checking conditional configuration
   = note: `#[warn(unexpected_cfgs)]` on by default

warning: `png` (lib) generated 1 warning
    Finished `bench` profile [optimized] target(s) in 1.23s
     Running benches/filter.rs (target/release/deps/filter-aa0bbd87c0c9579e)
filter/filter=Sub/bpp=1 time:   [55.595 ns 55.803 ns 56.028 ns]
                        thrpt:  [68.086 GiB/s 68.360 GiB/s 68.616 GiB/s]
                 change:
                        time:   [-17.829% -17.667% -17.502%] (p = 0.00 < 0.05)
                        thrpt:  [+21.215% +21.458% +21.697%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  13 (13.00%) high severe

filter/filter=Sub/bpp=2 time:   [93.166 ns 93.221 ns 93.289 ns]
                        thrpt:  [81.782 GiB/s 81.842 GiB/s 81.890 GiB/s]
                 change:
                        time:   [-26.224% -26.196% -26.163%] (p = 0.00 < 0.05)
                        thrpt:  [+35.433% +35.494% +35.545%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

filter/filter=Sub/bpp=3 time:   [133.86 ns 133.92 ns 133.98 ns]
                        thrpt:  [85.414 GiB/s 85.452 GiB/s 85.490 GiB/s]
                 change:
                        time:   [-27.796% -27.748% -27.706%] (p = 0.00 < 0.05)
                        thrpt:  [+38.325% +38.405% +38.496%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

filter/filter=Sub/bpp=4 time:   [189.47 ns 189.53 ns 189.60 ns]
                        thrpt:  [80.478 GiB/s 80.509 GiB/s 80.536 GiB/s]
                 change:
                        time:   [-30.084% -30.056% -30.023%] (p = 0.00 < 0.05)
                        thrpt:  [+42.904% +42.972% +43.030%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

filter/filter=Sub/bpp=6 time:   [329.93 ns 329.98 ns 330.02 ns]
                        thrpt:  [69.354 GiB/s 69.363 GiB/s 69.373 GiB/s]
                 change:
                        time:   [-19.571% -19.537% -19.507%] (p = 0.00 < 0.05)
                        thrpt:  [+24.234% +24.281% +24.333%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low mild
  2 (2.00%) high mild

filter/filter=Sub/bpp=8 time:   [443.66 ns 443.69 ns 443.73 ns]
                        thrpt:  [68.775 GiB/s 68.781 GiB/s 68.786 GiB/s]
                 change:
                        time:   [-18.826% -18.815% -18.804%] (p = 0.00 < 0.05)
                        thrpt:  [+23.158% +23.176% +23.193%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  9 (9.00%) high mild
  2 (2.00%) high severe

filter/filter=Up/bpp=1  time:   [52.373 ns 52.583 ns 52.764 ns]
                        thrpt:  [72.298 GiB/s 72.546 GiB/s 72.837 GiB/s]
                 change:
                        time:   [-7.8465% -7.5406% -7.2616%] (p = 0.00 < 0.05)
                        thrpt:  [+7.8302% +8.1555% +8.5146%]
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  22 (22.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

filter/filter=Up/bpp=2  time:   [91.222 ns 91.275 ns 91.333 ns]
                        thrpt:  [83.534 GiB/s 83.587 GiB/s 83.636 GiB/s]
                 change:
                        time:   [-13.259% -13.215% -13.165%] (p = 0.00 < 0.05)
                        thrpt:  [+15.161% +15.227% +15.285%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

filter/filter=Up/bpp=3  time:   [169.43 ns 169.49 ns 169.55 ns]
                        thrpt:  [67.496 GiB/s 67.522 GiB/s 67.546 GiB/s]
                 change:
                        time:   [-14.093% -14.057% -14.021%] (p = 0.00 < 0.05)
                        thrpt:  [+16.308% +16.357% +16.405%]
                        Performance has improved.

filter/filter=Up/bpp=4  time:   [336.60 ns 336.63 ns 336.68 ns]
                        thrpt:  [45.322 GiB/s 45.327 GiB/s 45.333 GiB/s]
                 change:
                        time:   [+12.295% +12.407% +12.501%] (p = 0.00 < 0.05)
                        thrpt:  [-11.112% -11.037% -10.949%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild

filter/filter=Up/bpp=6  time:   [469.55 ns 469.82 ns 470.14 ns]
                        thrpt:  [48.683 GiB/s 48.717 GiB/s 48.745 GiB/s]
                 change:
                        time:   [+3.5166% +3.5665% +3.6265%] (p = 0.00 < 0.05)
                        thrpt:  [-3.4996% -3.4437% -3.3972%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  16 (16.00%) high severe

filter/filter=Up/bpp=8  time:   [617.16 ns 617.21 ns 617.28 ns]
                        thrpt:  [49.439 GiB/s 49.444 GiB/s 49.448 GiB/s]
                 change:
                        time:   [+3.3804% +3.6490% +3.9032%] (p = 0.00 < 0.05)
                        thrpt:  [-3.7565% -3.5206% -3.2699%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  11 (11.00%) high severe

filter/filter=Avg/bpp=1 time:   [90.479 ns 90.547 ns 90.610 ns]
                        thrpt:  [42.100 GiB/s 42.129 GiB/s 42.161 GiB/s]
                 change:
                        time:   [-35.417% -35.346% -35.279%] (p = 0.00 < 0.05)
                        thrpt:  [+54.510% +54.669% +54.838%]
                        Performance has improved.

filter/filter=Avg/bpp=2 time:   [160.97 ns 161.10 ns 161.24 ns]
                        thrpt:  [47.318 GiB/s 47.359 GiB/s 47.397 GiB/s]
                 change:
                        time:   [-36.942% -36.898% -36.858%] (p = 0.00 < 0.05)
                        thrpt:  [+58.374% +58.475% +58.584%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

filter/filter=Avg/bpp=3 time:   [247.24 ns 247.42 ns 247.63 ns]
                        thrpt:  [46.215 GiB/s 46.254 GiB/s 46.288 GiB/s]
                 change:
                        time:   [-49.299% -49.262% -49.227%] (p = 0.00 < 0.05)
                        thrpt:  [+96.953% +97.091% +97.234%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

filter/filter=Avg/bpp=4 time:   [327.27 ns 327.47 ns 327.71 ns]
                        thrpt:  [46.562 GiB/s 46.596 GiB/s 46.625 GiB/s]
                 change:
                        time:   [-39.599% -39.572% -39.545%] (p = 0.00 < 0.05)
                        thrpt:  [+65.413% +65.487% +65.559%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

filter/filter=Avg/bpp=6 time:   [541.78 ns 542.10 ns 542.40 ns]
                        thrpt:  [42.198 GiB/s 42.221 GiB/s 42.246 GiB/s]
                 change:
                        time:   [-33.958% -33.922% -33.884%] (p = 0.00 < 0.05)
                        thrpt:  [+51.248% +51.335% +51.418%]
                        Performance has improved.

filter/filter=Avg/bpp=8 time:   [655.13 ns 655.64 ns 656.20 ns]
                        thrpt:  [46.506 GiB/s 46.546 GiB/s 46.582 GiB/s]
                 change:
                        time:   [-37.821% -37.784% -37.743%] (p = 0.00 < 0.05)
                        thrpt:  [+60.625% +60.730% +60.827%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

filter/filter=Paeth/bpp=1
                        time:   [154.59 ns 154.79 ns 155.01 ns]
                        thrpt:  [24.610 GiB/s 24.644 GiB/s 24.675 GiB/s]
                 change:
                        time:   [-62.512% -62.474% -62.436%] (p = 0.00 < 0.05)
                        thrpt:  [+166.21% +166.48% +166.75%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

filter/filter=Paeth/bpp=2
                        time:   [288.96 ns 289.07 ns 289.19 ns]
                        thrpt:  [26.382 GiB/s 26.393 GiB/s 26.403 GiB/s]
                 change:
                        time:   [-62.052% -61.950% -61.819%] (p = 0.00 < 0.05)
                        thrpt:  [+161.91% +162.81% +163.52%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

filter/filter=Paeth/bpp=3
                        time:   [461.75 ns 461.98 ns 462.18 ns]
                        thrpt:  [24.761 GiB/s 24.772 GiB/s 24.784 GiB/s]
                 change:
                        time:   [-64.958% -64.940% -64.922%] (p = 0.00 < 0.05)
                        thrpt:  [+185.08% +185.22% +185.37%]
                        Performance has improved.

filter/filter=Paeth/bpp=4
                        time:   [637.02 ns 637.17 ns 637.33 ns]
                        thrpt:  [23.942 GiB/s 23.948 GiB/s 23.953 GiB/s]
                 change:
                        time:   [-65.399% -65.327% -65.269%] (p = 0.00 < 0.05)
                        thrpt:  [+187.93% +188.41% +189.01%]
                        Performance has improved.

filter/filter=Paeth/bpp=6
                        time:   [939.10 ns 939.37 ns 939.70 ns]
                        thrpt:  [24.357 GiB/s 24.365 GiB/s 24.372 GiB/s]
                 change:
                        time:   [-63.200% -63.188% -63.176%] (p = 0.00 < 0.05)
                        thrpt:  [+171.57% +171.65% +171.74%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

filter/filter=Paeth/bpp=8
                        time:   [1.1355 µs 1.1364 µs 1.1375 µs]
                        thrpt:  [26.828 GiB/s 26.854 GiB/s 26.876 GiB/s]
                 change:
                        time:   [-64.346% -64.330% -64.312%] (p = 0.00 < 0.05)
                        thrpt:  [+180.20% +180.35% +180.47%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe

@Shnatsel changed the title from "Multiversion" to "Function multiversioning for a performance boost" on Sep 29, 2024
@Shnatsel
Contributor Author

Shnatsel commented Sep 29, 2024

Adds some transitive unsafe code, but I'm not worried about it because none of it is reachable with attacker-controlled data.

Entirely opt-in for now.

@fintelia
Contributor

fintelia commented Sep 29, 2024

I'm not seeing end-to-end decoding speed improvements. I think the cost of running runtime feature detection for every row may outweigh the time savings from faster filtering:

decode reprocessed qoi-bench images...

main:                  197.0 MP/s (average)  176.0 MP/s (geomean)
multiversion:          196.0 MP/s (average)  175.0 MP/s (geomean)
main-unstable:         202.7 MP/s (average)  182.9 MP/s (geomean)
multiversion-unstable: 202.5 MP/s (average)  182.7 MP/s (geomean)

Encoding shows only a very slight speedup...

main:           235.8 MP/s
multiversion:   237.4 MP/s

Edit: I should add that RUSTFLAGS='-C target-cpu=x86-64-v3' does provide a performance boost:

x86-64-v3 decode:  212.7 MP/s (average)  188.2 MP/s (geomean)
x86-64-v3 encode:  280.1 MP/s
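
One common way to address the per-row dispatch concern raised above is to select the implementation once and reuse a function pointer for every row. The sketch below is only an illustration under that assumption, with hypothetical names, and is not taken from this PR. As far as I know the standard library already caches the CPU detection result itself, so what this removes is the repeated check and branch on each call, not a repeated CPUID query.

```rust
use std::sync::OnceLock;

/// Signature shared by all versions of the row function.
type RowFn = fn(&[u8], &mut [u8]);

/// "Up" unfiltering: add the byte directly above to each byte.
#[inline(always)]
fn unfilter_up_body(prev: &[u8], cur: &mut [u8]) {
    for (c, p) in cur.iter_mut().zip(prev) {
        *c = c.wrapping_add(*p);
    }
}

fn unfilter_up_scalar(prev: &[u8], cur: &mut [u8]) {
    unfilter_up_body(prev, cur)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn unfilter_up_avx2(prev: &[u8], cur: &mut [u8]) {
    unfilter_up_body(prev, cur)
}

/// Safe wrapper so the AVX2 version fits the `RowFn` type.
/// Keep it private: it must only be handed out after detection.
#[cfg(target_arch = "x86_64")]
fn unfilter_up_avx2_entry(prev: &[u8], cur: &mut [u8]) {
    // SAFETY: only reachable through `select_unfilter_up`, which
    // checks for AVX2 before returning this function.
    unsafe { unfilter_up_avx2(prev, cur) }
}

/// Picks an implementation on first use and remembers the choice.
fn select_unfilter_up() -> RowFn {
    static CHOSEN: OnceLock<RowFn> = OnceLock::new();
    *CHOSEN.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                return unfilter_up_avx2_entry as RowFn;
            }
        }
        unfilter_up_scalar as RowFn
    })
}
```

A decoder could call `select_unfilter_up()` once per image and then invoke the returned pointer for each row. Whether that helps with very small images, such as 64x64 icons, would need measuring.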

@Shnatsel
Contributor Author

wrt decoding: in my profile, a large RGB or RGBA image with mostly Paeth filtering spends 30% of the decoding time in Paeth unfiltering, so if multiversioning introduces no overhead, the end-to-end speedup would be 4%. I am seeing exactly that speedup on a 4K RGBA image. I'm also seeing a 3% improvement on a 1024x1024 RGBA image, so the multiversioning overhead is not dramatic.
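
The relationship between the 30% and 4% figures above follows Amdahl's law. A small sketch with hypothetical function names, assuming everything outside Paeth unfiltering runs at unchanged speed:

```rust
/// Overall speedup when `fraction` of the runtime is made
/// `local_speedup` times faster and the rest is unchanged.
fn end_to_end_speedup(fraction: f64, local_speedup: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / local_speedup)
}

/// Inverse: the local speedup needed to reach `overall`.
fn implied_local_speedup(fraction: f64, overall: f64) -> f64 {
    fraction / (1.0 / overall - (1.0 - fraction))
}
```

With `fraction = 0.30` and an overall speedup of 1.04, the implied speedup of the Paeth step alone is roughly 1.15x. That number is derived from the two figures in the comment, not quoted from the benchmarks in #514.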

Those x86-64-v3 gains are very tantalizing! I wonder what gets vectorized and where to get us these results. How are you measuring these? I'd be very interested in reproducing these.

@Shnatsel
Contributor Author

I cannot measure any end-to-end difference in decoding between SSE 4.1 multiversioning and RUSTFLAGS='-C target-cpu=x86-64-v3'. I guess it just doesn't benefit my machine for whatever reason.

On your machine are you seeing any improvement from replacing "x86_64+sse+sse2+sse3+sse4.1+ssse3" strings in the code with "x86_64+sse+sse2+sse3+sse4.1+sse4.2+ssse3+avx+avx2+fma" ? That is more or less v3.
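
Before comparing the two target strings, it can help to confirm that the test machine actually supports the longer feature list. A minimal sketch using only the standard library (the function names are hypothetical). It checks exactly the features named in the string above, which is a subset of what x86-64-v3 requires.

```rust
/// Returns the features from the longer target string that the
/// current CPU does not report.
#[cfg(target_arch = "x86_64")]
fn missing_v3_like_features() -> Vec<&'static str> {
    use std::arch::is_x86_feature_detected;
    let checks = [
        ("sse", is_x86_feature_detected!("sse")),
        ("sse2", is_x86_feature_detected!("sse2")),
        ("sse3", is_x86_feature_detected!("sse3")),
        ("ssse3", is_x86_feature_detected!("ssse3")),
        ("sse4.1", is_x86_feature_detected!("sse4.1")),
        ("sse4.2", is_x86_feature_detected!("sse4.2")),
        ("avx", is_x86_feature_detected!("avx")),
        ("avx2", is_x86_feature_detected!("avx2")),
        ("fma", is_x86_feature_detected!("fma")),
    ];
    checks
        .iter()
        .filter(|(_, present)| !*present)
        .map(|(name, _)| *name)
        .collect()
}

/// On other architectures none of these features apply.
#[cfg(not(target_arch = "x86_64"))]
fn missing_v3_like_features() -> Vec<&'static str> {
    vec!["x86_64"]
}

/// True when every feature in the list is available.
fn has_v3_like_features() -> bool {
    missing_v3_like_features().is_empty()
}
```

If `has_v3_like_features()` returns false, a multiversioned function built for that feature list would presumably never be selected at runtime, and the two variants would be expected to perform the same.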

@fintelia
Contributor

I've been using my own benchmark harness for measurements. So far it hasn't been intended for public consumption, so I probably need to polish the interface and expand the readme before it makes sense to anyone other than me 😄

The test data was re-encoded versions of the qoi benchmark suite images, so many/most were far smaller than 1024x1024. Haven't looked at the full distribution, but at first glance I see a bunch of 64x64 icons included.

> On your machine are you seeing any improvement from replacing "x86_64+sse+sse2+sse3+sse4.1+ssse3" strings in the code with "x86_64+sse+sse2+sse3+sse4.1+sse4.2+ssse3+avx+avx2+fma" ? That is more or less v3.

Nope, I see the same performance for both.

@Shnatsel
Contributor Author

Shnatsel commented Oct 5, 2024

Interesting! It seems that the performance boost on your machine comes from somewhere other than filtering.

It would be interesting to profile the process with samply at a high sampling rate, say samply record -r 5000, and compare the profiles between binaries compiled with baseline and v3 profiles. That should point us to the part that actually gets the speed boost.

As for this PR, in its current form it's probably not worth it, and without access to a machine that benefits from v3 optimizations I will not be able to work on this any further. Since there's nothing further I can do, I'm going to go ahead and close this PR.
