GPU_Benchmarks.md
GPUs:

Other:

Comparison of Results

Subgroups

  • Some GPUs terminate helper invocations early if derivatives are not used.
  • TBR architectures can fill multiple triangles with a single subgroup, but only from the same instance.
  • TBDR architectures can fill multiple triangles and multiple instances with a single subgroup.
  • TB* architectures can fill triangles with a single subgroup only inside the tile region.

A probe sketch for these properties follows the table.
| GPU | subgroup size | tile size | helper invocation early termination | merge triangles | merge between instances | always full subgroup in FS |
|-----|---------------|-----------|-------------------------------------|-----------------|-------------------------|----------------------------|
| Adreno 5xx | ? | as large as possible | - | - | - | |
| Adreno 6xx | 64/128 | as large as possible | yes | yes | no | |
| AMD GCN4 | 64 | - | no | yes | no | ? |
| Apple M1 | 32 | 16x16 | no | yes | no | no |
| ARM Mali Midgard gen4 | (4) | 16x16 | - | - | - | - |
| ARM Mali Valhall gen1 | 16 | 16x16 | yes | yes | yes (rare) | no |
| Intel UHD 6xx 9.5gen | 16 | - | no | no | no | ? |
| NV RTX 20xx | 32 | 16x16 | no | yes | no | no |
| PowerVR B‑Series | 128 | 32x32? | no | yes | no | yes |
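
A minimal sketch of the kind of fragment shader that reveals these properties, assuming GL_KHR_shader_subgroup_basic and GL_KHR_shader_subgroup_ballot are available (not the repo's actual test code): rendering many small triangles and coloring by the number of active invocations shows how full the subgroups are and whether the GPU packs several triangles or instances into one subgroup.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic  : require
#extension GL_KHR_shader_subgroup_ballot : require

layout(location = 0) out vec4  out_Color;

void main ()
{
    // Number of active invocations in this subgroup.
    // Comparing it against gl_SubgroupSize per triangle / per instance
    // shows whether the GPU merges triangles or instances into one
    // subgroup, and whether subgroups in the FS are always full.
    uint  active = subgroupBallotBitCount( subgroupBallot( true ));

    out_Color = vec4( float(active) / float(gl_SubgroupSize), 0.0, 0.0, 1.0 );
}
```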

Shader instructions

  • FMA and MAD perform 2 operations (Mul, Add) but can execute in 1 cycle; some old GPUs have a fast MAD and a slow FMA.
  • Some GPUs support 1-cycle HFMA2: an FMA on half2 with 2x performance (2 instructions on 2 half values, 4 flops/cycle).
  • Some GPUs support FAdd at a 2x rate.
  • FMA may be implemented only for the fp32 type; fp16 loses performance when it falls back to this F32FMA, so MAD should be used instead.
  • Some GPUs have parallel datapaths for fp32 and i32, so the scheduler can execute an i32 instruction in parallel with fp32 without performance loss.
    • NV Turing has a 1:1 fp32:i32 configuration.
    • NV Ampere has 1 full fp32 pipe and 1 combined fp32:i32 pipe, so it can not execute i32 in parallel without losing fp32 performance.

A measurement sketch follows the table.
| GPU | fp32 FMA/MAD | fp16x2 FMA/MAD | fp16 FMA/MAD | FAdd rate | parallel fp32 & i32 (confirmed in specs) | parallel fp16 & i16 (confirmed in specs) |
|-----|--------------|----------------|--------------|-----------|------------------------------------------|------------------------------------------|
| Adreno 5xx | fma/mad | mad | - | 1 | no | no |
| Adreno 6xx | fma/mad | mad | - | 1 | 2:1 | 2:1 |
| AMD GCN4 | fma/mad | - | - | 1 | no | no |
| Apple M1 | fma/mad | no | fma/mad | 1 | 2:1 | 2:1 |
| ARM Mali Midgard gen4 | mad | no | mad | 1 | no | no |
| ARM Mali Valhall gen1 | fma/mad | mad | - | 1 | 2:1 | 2:1 |
| Intel UHD 6xx 9.5gen | fma/mad | fma | - | 2 | 2:1 | no |
| NV RTX 20xx (Turing) | fma/mad | fma | - | 2 | 1:1 (specs) | 2:1 |
| PowerVR B‑Series | fma/mad | no | mad | 1 | 1:1 | no |
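
A minimal sketch of how the FMA/MAD rate can be measured from a compute shader (names and constants are illustrative, not the repo's actual test shader): a long dependent chain of fma() calls is timed, and the result is written back so the compiler can not eliminate the loop.

```glsl
#version 450

layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer ResultSSBO { float  results []; };

void main ()
{
    float  a = results[ gl_GlobalInvocationID.x ];

    // Long dependent chain: total execution time divided by the iteration
    // count approximates the cost of a single FMA/MAD instruction.
    for (uint i = 0; i < 1024; ++i)
    {
        a = fma( a, 1.0009765625, 0.0001 );  // usually compiled to FMA
      //a = a * 1.0009765625 + 0.0001;       // may be compiled to MAD
    }

    // Keep the result alive to prevent dead-code elimination.
    results[ gl_GlobalInvocationID.x ] = a;
}
```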

Shader instructions performance groups

| GPU |
|-----|
| Adreno 5xx |
| Adreno 6xx |
| AMD GCN4 |
| Apple M1 |
| ARM Mali Midgard gen4 |
| ARM Mali Valhall gen1 |
| Intel UHD 6xx 9.5gen |
| NV RTX 20xx (Turing) |
| PowerVR B‑Series |

Branching

How much slower the Mul and Matrix variants are compared to a uniform Branch. [12]

  • Uniform branching is faster on most GPUs.
  • GPUs with a vector architecture have a faster uniform Matrix version.
  • A Branch non-uniform value < 2 indicates that the GPU can not optimize short branches.
  • A Branch non-uniform value much greater than Mul non-uniform indicates that non-uniform branches have an additional cost.
| GPU | Mul uniform | Matrix uniform | Mul non-uniform | Branch non-uniform | Matrix non-uniform | Mul avg | Branch avg | Matrix avg |
|-----|-------------|----------------|-----------------|--------------------|--------------------|---------|------------|------------|
| Adreno 5xx | 1.6 | 0.88 | 1.9 | 2.1 | 2.7 | 1.72 | 1.54 | 1.78 |
| Adreno 6xx | 1.6 | 1.0 | 2.3 | 1.8 | 3.0 | 1.95 | 1.4 | 2.0 |
| AMD GCN4 | 1.7 | 0.94 | 2.3 | 1.6 | 2.6 | 2.0 | 1.3 | 1.8 |
| Apple M1 | 1.1 | 0.8 | 1.4 | 1.1 | 1.8 | 1.24 | 1.03 | 1.26 |
| ARM Mali Midgard gen4 | 1.5 | 0.7 | 1.8 | 1.3 | 2.4 | 1.64 | 1.1 | 1.57 |
| ARM Mali Valhall gen1 | 2.1 | 1.4 | 2.3 | 2.1 | 3.5 | 2.18 | 1.56 | 2.45 |
| Intel UHD 6xx 9.5gen | 1.3 | 0.87 | 1.9 | 1.2 | 2.6 | 1.59 | 1.07 | 1.71 |
| NV RTX 20xx (Turing) | 2.1 | 1.5 | 2.4 | 3.1 | 3.0 | 2.1 | 2.1 | 2.1 |
| PowerVR B‑Series | 2.3 | 1.5 | 2.6 | 3.5 | 3.1 | 2.46 | 2.25 | 2.33 |

Subgroup threads order

| GPU | graphics (quads) | graphics (image) | compute wg:8x8 (threads) | compute (image) |
|-----|------------------|------------------|--------------------------|-----------------|
| Adreno 5xx | ? | | | |
| Adreno 6xx | grid of 4 large quads (4x4 threads) with 4 quads, row major | | row major 8x8 | |
| AMD GCN4 | grid of 4 large quads (4x4 threads) with 4 quads, row major | | column major 8x4, 2 threads in row per column | |
| Apple M1 | row major 4x2 | | row major 8x4 | |
| ARM Mali Valhall gen1 | random | | row major 8x2 | |
| Intel UHD 6xx 9.5gen | grid of 4 quads, row major | | column major 4x4 | |
| NV RTX 20xx (Turing) | column major 2x4 | | row major 8x4 | |
| PowerVR B‑Series | [_]-curve, row major 8x4 (Hilbert curve?) | | row major 8x16 | |
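
The thread order can be visualized with a compute shader like the sketch below (assuming GL_KHR_shader_subgroup_basic; illustrative, not the repo's actual test code): writing gl_SubgroupInvocationID as a gradient makes the row-major / column-major / quad mappings in the table directly visible.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require

layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0, rgba8) writeonly uniform image2D  un_OutImage;

void main ()
{
    // Gradient from the first to the last invocation of the subgroup:
    // the resulting image shows the thread-to-pixel mapping.
    float  v = float(gl_SubgroupInvocationID) / float(gl_SubgroupSize - 1);

    imageStore( un_OutImage, ivec2(gl_GlobalInvocationID.xy), vec4( v, v, v, 1.0 ));
}
```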

NaN
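
The op names used in the tables below are assumed to map to GLSL built-ins roughly as follows (a sketch; in particular, SignOrZero is assumed to correspond to sign(), whose result for NaN is implementation-defined, which is exactly what the differences tables capture):

```glsl
// Assumed mapping of the op names to GLSL (x is the tested value):
bool   isNan      = isnan( x );                  // IsNaN
bool   isInf      = isinf( x );                  // IsInfinity
bool   asBool     = bool( x );                   // bool(x): any non-zero bit pattern is true
float  stepR      = step( 0.0, x );              // Step(0,x): x >= 0.0 ? 1.0 : 0.0
float  signOrZero = sign( x );                   // SignOrZero(x): -1, 0 or +1; NaN result is implementation-defined
float  clamped    = clamp( x, 0.0, 1.0 );        // Clamp(x,0,1)
float  sstep      = smoothstep( 0.0, 1.0, x );   // SmoothStep(x,0,1)
vec2   norm       = normalize( vec2( x, x ));    // Normalize(x), applied to a vector containing x
```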

FP32

In this and the following tables, blank cells are GPU-dependent; see the differences lists below each table.

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | nan | nan | nan | nan | inf | -inf | max | -max |
| Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
| Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
| Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
| Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
| Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
| IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x != x | | | | | 0 | 0 | 0 | 0 |
| Step(0,x) | | | | | 1 | 0 | 1 | 0 |
| Step(x,0) | | | | | 0 | 1 | 0 | 1 |
| Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
| Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
| SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
| SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
| SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Normalize(x) | nan | nan | nan | nan | | | | |
differences:

  • FP32 on NV Turing, Adreno 5xx/6xx

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | | | | | nan | nan | 0 | -0 |

  • FP32 on Intel gen 9

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | -1 | -1 | -1 | -1 | | | | |
| SignOrZero(-x) | -1 | -1 | -1 | -1 | | | | |
| Normalize(x) | | | | | nan | nan | 0 | -0 |

  • FP32 on Mali Valhall gen1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | | | | | 0 | -0 | 0 | -0 |

  • FP32 on Mali Midgard gen4

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| SmoothStep(x,0,1) | | | | | 0 | 0 | 0 | 0 |
| Normalize(x) | | | | | 0 | -0 | 0 | -0 |

  • FP32 on PowerVR B-Series

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | nan | nan | nan | nan | | | | |
| SignOrZero(-x) | nan | nan | nan | nan | | | | |
| Normalize(x) | | | | | nan | nan | 18446742974197923840 | -18446742974197923840 |

  • FP32 on AMD GCN4

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(-x) | 1 | 1 | 1 | 1 | | | | |
| Normalize(x) | | | | | 0 | 0 | 0 | 0 |

  • FP32 on Apple M1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | | | | | nan | nan | 0 | -0 |

FP16

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | nan | nan | nan | nan | inf | -inf | max | -max |
| Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
| Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
| Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
| Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
| Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
| IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x != x | | | | | 0 | 0 | 0 | 0 |
| Step(0,x) | | | | | 1 | 0 | 1 | 0 |
| Step(x,0) | | | | | 0 | 1 | 0 | 1 |
| Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
| Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
| SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
| SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
| SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Normalize(x) | | | | | | | 0 | -0 |
differences:

  • FP16 is not supported on Adreno 5xx, Mali Midgard gen4, AMD GCN4.

  • FP16 on NV Turing, Adreno 6xx

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | | |

  • FP16 on Intel gen 9

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | -1 | -1 | -1 | -1 | | | | |
| SignOrZero(-x) | -1 | -1 | -1 | -1 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | | |

  • FP16 on Mali Valhall gen1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | -1 | -1 | -1 | -1 | 1 | -1 | | |

  • FP16 on PowerVR B-Series

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | nan | nan | nan | nan | | | | |
| SignOrZero(-x) | nan | nan | nan | nan | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | | |

  • FP16 on Apple M1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | 0 | -0 |

FP Mediump

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | nan | nan | nan | nan | inf | -inf | | |
| Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | |
| Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | |
| Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | | 0 |
| Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | | 0 |
| Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
| IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x != x | | | | | 0 | 0 | 0 | 0 |
| Step(0,x) | | | | | 1 | 0 | 1 | 0 |
| Step(x,0) | | | | | 0 | 1 | 0 | 1 |
| Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
| Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
| SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
| SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
| SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Normalize(x) | | | | | | | | |
differences:

  • FP Mediump on NV Turing

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | 0 | -0 |

  • FP Mediump on Adreno 5xx/6xx

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | 65504 | -65504 |
| Min(x,0) | | | | | | | | -65504 |
| Min(0,x) | | | | | | | | -65504 |
| Max(x,0) | | | | | | | 65504 | |
| Max(0,x) | | | | | | | 65504 | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | 255 | -255 |

  • FP Mediump on Intel gen9

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | -1 | -1 | -1 | -1 | | | | |
| SignOrZero(-x) | -1 | -1 | -1 | -1 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | nan | nan |

  • FP Mediump on Mali Valhall gen1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |

  • FP Mediump on Mali Midgard gen4

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| SmoothStep(x,0,1) | | | | | 0 | 0 | 0 | 0 |
| Normalize(x) | nan | nan | nan | nan | 0 | -0 | 0 | -0 |

  • FP Mediump on PowerVR B-Series

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | nan | nan | nan | nan | | | | |
| SignOrZero(-x) | nan | nan | nan | nan | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | 18446742974197923840 | -18446742974197923840 |

  • FP Mediump on AMD GCN4

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 1 | 1 | 1 | 1 | | | | |
| Step(0,x) | 1 | 1 | 1 | 1 | | | | |
| Step(x,0) | 1 | 1 | 1 | 1 | | | | |
| Step(0,-x) | 1 | 1 | 1 | 1 | | | | |
| Step(-x,0) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(x) | 1 | 1 | 1 | 1 | | | | |
| SignOrZero(-x) | 1 | 1 | 1 | 1 | | | | |
| Normalize(x) | nan | nan | nan | nan | 0 | 0 | 0 | 0 |

  • FP Mediump on Apple M1

| op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
|-----------|------|------|------|------|-----|------|-----|------|
| x | | | | | | | max | -max |
| Min(x,0) | | | | | | | | -max |
| Min(0,x) | | | | | | | | -max |
| Max(x,0) | | | | | | | max | |
| Max(0,x) | | | | | | | max | |
| x != x | 0 | 0 | 0 | 0 | | | | |
| Step(0,x) | 0 | 0 | 0 | 0 | | | | |
| Step(x,0) | 0 | 0 | 0 | 0 | | | | |
| Step(0,-x) | 0 | 0 | 0 | 0 | | | | |
| Step(-x,0) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(x) | 0 | 0 | 0 | 0 | | | | |
| SignOrZero(-x) | 0 | 0 | 0 | 0 | | | | |
| Normalize(x) | nan | nan | nan | nan | nan | nan | 0 | -0 |

Memory

RAM, VRAM

| GPU | VRAM bandwidth from specs (GB/s) | VRAM bandwidth measured (GB/s) | RAM to VRAM bandwidth from specs (GB/s) | RAM to VRAM bandwidth measured (GB/s) | VRAM to RAM bandwidth measured (GB/s) | RAM to RAM bandwidth measured (GB/s) |
|-----|----------------------------------|--------------------------------|------------------------------------------|----------------------------------------|----------------------------------------|---------------------------------------|
| Adreno 505 | 6.4 | 5 | | | | |
| Adreno 660 | 51.2 | 34 | | | | |
| AMD RX570 (GCN4) | 224.0 | 86 | | | | |
| Apple M1 | 68.25 | | | | | |
| ARM Mali T830 (Midgard gen4) | 14.9 | 4 | | | | |
| ARM Mali G57 (Valhall gen1) | 17.07 | 14.2 | | | | |
| Intel UHD 620 (9.5gen) | 29.8 | 23 | | | | |
| NV RTX 2080 (Turing) | 448.0 | 403 | | | | |
| PowerVR BXM‑8‑256 | 51.2 | 14.2 | | | | |

Cache

  • GMem - the part of the L2 cache (or dedicated memory) used to store attachments on TB(D)R GPUs.
    • Adreno has dedicated GMem memory.
    • Mali uses the L2 cache, and sometimes attachments can be evicted from L2 to RAM.
| GPU | GMem (KB) | L2 cache per SM (KB) | L2 bandwidth (GB/s) | L2 cache line (bytes) | L1 cache per SM (KB) | Texture cache, part of L1 (KB) | L1 bandwidth (GB/s) |
|-----|-----------|----------------------|---------------------|------------------------|----------------------|---------------------------------|---------------------|
| Adreno 505 | 128 | | | | | | |
| Adreno 660 | 1536 | 128 | ? | | 4? | 2? | ? |
| AMD RX570 (GCN4) | - | | | | | | |
| Apple M1 | ? | | | | | | |
| ARM Mali T830 (Midgard gen4) | 4 | 64 | | | | | |
| ARM Mali G57 (Valhall gen1) | 8 | 512 | 49 | 64 | 32? | 32 | ? |
| Intel UHD 620 (9.5gen) | - | 128 | 48? | | 8? | 8? | 112? |
| NV RTX 2080 (Turing) | - | 4096 | ? | 64 | | 32 | ? |
| PowerVR BXM‑8‑256 | ? | 1024 | ? | ? | 256? | 256 | ? |

Render target compression

block - the compression ratio measured between 1x1 noise and noise at the compression block size (4x4 or 8x8).
max - the compression ratio measured between 1x1 noise and a solid color.

| GPU | block size | block RGBA8_UNorm | max RGBA8_UNorm | block RGBA16_UNorm | max RGBA16_UNorm | method | comments |
|-----|------------|-------------------|-----------------|--------------------|------------------|--------|----------|
| Adreno 5xx | 4x4 | 2.5 | 2.7 | ? | ? | exec time | |
| Adreno 6xx | 16x16 | 1.9 | 6.9 | ? | 3.3 | exec time | |
| AMD GCN4 | 4x4 | 2.3 | 3 | 2.3 | 3 | exec time | |
| Apple M1 | 8x8 | 3.4 | 3.4 | 6.8 | 6.8 | exec time | |
| Intel UHD 6xx 9.5gen | 8x8 | 1.6 | 1.8 | 1.8 | 1.85 | exec time | |
| NV RTX 20xx | 4x4 | 3 | 3.2 | 4.1 | 4.1 | exec time | |
| ARM Mali Valhall gen1 | 4x4 | 1.9 | 3.9 | 1.9 | 3.7 | exec time | only 32bit formats, V2 |
| ARM Mali Valhall gen1 | 4x4 | 5.9 | 19 | 5.7 | 20 | mem traffic | used performance counters |
| PowerVR B‑Series | 8x8 | 23 | 134 | 24 | 134 | mem traffic | used performance counters |

Draw Indirect

| GPU | direct vs indirect performance |
|-----|--------------------------------|
| Adreno 5xx | direct is faster (20ms vs 41ms) |
| Adreno 6xx | same |
| AMD GCN4 | |
| Apple M1 | |
| Intel UHD 6xx 9.5gen | |
| NV RTX 20xx | same |
| ARM Mali Midgard gen4 | maxDrawIndirectCount = 1, used instancing instead of multiDraw, indirect is faster (120ms vs 130ms) |
| ARM Mali Valhall gen1 | maxDrawIndirectCount = 1, used instancing instead of multiDraw, indirect is faster (12ms vs 15ms) |
| ARM Mali Valhall gen3 | indirect is faster (25ms vs 31ms) |
| PowerVR B‑Series | same |

Test Sources

1. fp16 instruction performance

Produces more accurate results than the Shader instruction benchmark.
code

2. fp32 instruction performance

Produces more accurate results than the Shader instruction benchmark.
code

3. Render target compression

code

4. Shader instruction benchmark

code

5. Texture lookup performance

  • sequential access - UV coordinates are multiplied by a scale, then a bias is added (sketched after the code link below).
    • scale < 1 gives better texture cache usage.
    • scale > 1 causes many cache misses.
    • scale > 1 is used in practice for noise textures in procedural generation.
  • 'noise NxN' - the screen is divided into blocks of NxN size; each block has a unique offset for the texture lookup, and each pixel in a block is offset by 1px from its nearest pixels.
    • the 1px offset is used to find the case where a neighboring warp can not reuse a cached texel.
    • in practice this pattern corresponds to packed 2D sprites and textures for meshes.

code
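
A minimal sketch of the sequential-access pattern (c_Scale and c_Bias are illustrative constants, not values from the repo's test):

```glsl
#version 450

layout(binding = 0) uniform sampler2D  un_Texture;

layout(location = 0) in  vec2  in_UV;
layout(location = 0) out vec4  out_Color;

// scale < 1 keeps neighboring pixels inside a small cache-friendly region;
// scale > 1 spreads lookups out and increases cache misses.
const vec2  c_Scale = vec2( 4.0 );
const vec2  c_Bias  = vec2( 0.1 );

void main ()
{
    out_Color = texture( un_Texture, in_UV * c_Scale + c_Bias );
}
```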

6. Subgroups

7. Buffer/Image storage access

9. Texture cache

Find the texture size at which performance degrades by nearly 2x; this indicates many cache misses and a bottleneck in a higher-level cache or in external memory (RAM/VRAM). A probe sketch follows the code link below.
Expected hierarchy:

  • texture cache (L1)
  • L2 cache
  • RAM / VRAM

code
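
A loose sketch of the probe, assuming the texture is resized between runs (illustrative, not the repo's exact test shader): lookups are spread over the whole texture so the working set is close to the texture size, and execution time jumps once that no longer fits into a cache level.

```glsl
#version 450

layout(binding = 0) uniform sampler2D  un_Texture;   // resized between runs

layout(location = 0) in  vec2  in_UV;
layout(location = 0) out vec4  out_Color;

void main ()
{
    // Scatter the lookups so the working set is ~ the whole texture;
    // the ~2x time jump marks the transition L1 -> L2 -> RAM/VRAM.
    vec4  sum = vec4( 0.0 );

    for (int i = 0; i < 16; ++i) {
        sum += texture( un_Texture, fract( in_UV + vec2( float(i) * 0.37, float(i) * 0.73 )));
    }
    out_Color = sum * (1.0 / 16.0);
}
```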

10. Shared memory

code

11. NaN

code

12. Branching

Transforms a 2D vector into a 3D cube face direction (see the sketch after the code link). The uniform version uses the same cube face for the whole warp; the non-uniform version uses a unique cube face per thread.
6 branches are used.

code
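
A sketch of the Branch variant (the Mul and Matrix variants presumably compute the same mapping branchlessly, via multiplications or a per-face matrix); the exact sign conventions are an assumption, not the repo's exact code:

```glsl
// 'face' is the same for a whole warp in the uniform test and unique per
// thread in the non-uniform test.
vec3  ToCubeFace (vec2 uv, int face)
{
    if (face == 0)  return vec3(  1.0,  uv.y, -uv.x );   // +X
    if (face == 1)  return vec3( -1.0,  uv.y,  uv.x );   // -X
    if (face == 2)  return vec3( uv.x,   1.0, -uv.y );   // +Y
    if (face == 3)  return vec3( uv.x,  -1.0,  uv.y );   // -Y
    if (face == 4)  return vec3( uv.x,  uv.y,   1.0 );   // +Z
                    return vec3( -uv.x, uv.y,  -1.0 );   // -Z
}
```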

13. Circle geometry

Small circles
Large circles blending