GPUs:
- Adreno 660
- Adreno 505
- AMD RX 570
- Apple M1
- Intel UHD 620
- Mali G57
- Mali G610
- Mali T830
- NVIDIA RTX 2080
- PowerVR BXM-8-256
Other:
- Some GPUs terminate helper invocations early if derivatives are not used.
- TBR architectures can fill multiple triangles with a single subgroup, but only within the same instance.
- TBDR architectures can fill multiple triangles and multiple instances with a single subgroup.
- TB* architectures can fill triangles with a single subgroup only inside the tile region.
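The bullets above can be made concrete with a minimal sketch (illustrative numbers only, not vendor data) estimating fragment-shader lane utilization with and without cross-triangle merging:

```python
import math

def fs_lane_utilization(pixels_per_triangle, subgroup_size, merge_triangles):
    """Fraction of subgroup lanes doing useful work when shading small
    triangles (helper invocations ignored for simplicity)."""
    if merge_triangles:
        # Fragments from several triangles are packed into one subgroup,
        # so lanes are (ideally) fully utilized.
        return 1.0
    # One triangle cannot share a subgroup with others: a triangle that
    # covers fewer pixels than the subgroup size leaves lanes idle.
    groups = math.ceil(pixels_per_triangle / subgroup_size)
    return pixels_per_triangle / (groups * subgroup_size)

# A 10-pixel triangle on a 64-lane subgroup:
print(fs_lane_utilization(10, 64, merge_triangles=False))  # 0.15625
print(fs_lane_utilization(10, 64, merge_triangles=True))   # 1.0
```

This is why architectures that merge triangles (and instances) into one subgroup handle dense meshes with tiny triangles far better.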
GPU | subgroup size | tile size | helper invocation early termination | merge triangles | merge between instances | always full subgroup in FS |
---|---|---|---|---|---|---|
Adreno 5xx | ? | as large as possible | - | - | - | |
Adreno 6xx | 64/128 | as large as possible | yes | yes | no | |
AMD GCN4 | 64 | - | no | yes | no | ? |
Apple M1 | 32 | 16x16 | no | yes | no | no |
ARM Mali Midgard gen4 | (4) | 16x16 | - | - | - | - |
ARM Mali Valhall gen1 | 16 | 16x16 | yes | yes | yes (rare) | no |
Intel UHD 6xx 9.5gen | 16 | - | no | no | no | ? |
NV RTX 20xx | 32 | 16x16 | no | yes | no | no |
PowerVR B‑Series | 128 | 32x32? | no | yes | no | yes |
- FMA and MAD consist of 2 operations (Mul, Add) but can execute in 1 cycle; some older GPUs have a fast MAD and a slow FMA.
- Some GPUs support 1-cycle HFMA2 - an FMA on half2 with 2x performance (2 instructions on 2 half values - 4 flops/cycle).
- Some GPUs support FAdd at a 2x rate.
- FMA may be implemented only for the fp32 type; fp16 loses performance when routed through this fp32 FMA, so MAD should be used instead.
- Some GPUs have parallel datapaths for fp32 and i32, so the scheduler can execute an i32 instruction in parallel with fp32 without performance loss.
- NV Turing has a 1:1 fp32:i32 configuration.
- NV Ampere has 1 full fp32 pipe and 1 shared fp32:i32 pipe, so it cannot execute i32 in parallel without losing fp32 performance.
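The practical difference between FMA and MUL+ADD is the single rounding step. A small sketch in plain Python, emulating the fused result with exact rational arithmetic (so it does not rely on any platform `fma` intrinsic):

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    """a*b + c with a single rounding at the end, emulated exactly
    via rational arithmetic."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Values chosen so that the exact product a*b does not fit in a double:
# MUL+ADD rounds twice and the answer cancels away entirely.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

print(a * b + c)              # 0.0 (two roundings: the 2^-54 term is lost)
print(fma_emulated(a, b, c))  # 5.551115123125783e-17 (= 2**-54, exact)
```

The same cancellation is why "fast MAD" (separate rounding) and FMA can give different results on the GPUs listed below.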
GPU | fp32 FMA/MAD | fp16x2 FMA/MAD | fp16 FMA/MAD | FAdd rate | parallel fp32 & i32 (confirmed in specs) | parallel fp16 & i16 (confirmed in specs) |
---|---|---|---|---|---|---|
Adreno 5xx | fma/mad | mad | - | 1 | no | no |
Adreno 6xx | fma/mad | mad | - | 1 | 2:1 | 2:1 |
AMD GCN4 | fma/mad | - | - | 1 | no | no |
Apple M1 | fma/mad | no | fma/mad | 1 | 2:1 | 2:1 |
ARM Mali Midgard gen4 | mad | no | mad | 1 | no | no |
ARM Mali Valhall gen1 | fma/mad | mad | - | 1 | 2:1 | 2:1 |
Intel UHD 6xx 9.5gen | fma/mad | fma | - | 2 | 2:1 | no |
NV RTX 20xx (Turing) | fma/mad | fma | - | 2 | 1:1 (specs) | 2:1 |
PowerVR B‑Series | fma/mad | no | mad | 1 | 1:1 | no |
How much slower the Mul and Matrix variants are than the uniform Branch variant. [12]
- Uniform branching is faster on most GPUs.
- GPUs with a vector architecture have a faster uniform Matrix version.
- If Branch non-uniform < 2, the GPU cannot optimize short branches.
- If Branch non-uniform is much greater than Mul non-uniform, non-uniform branches have an additional cost.
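For context, here is a minimal scalar sketch of the three select strategies being compared (a hypothetical reconstruction; the actual shader code is not shown in this document). Branch uses real control flow, Mul computes both sides and blends with a 0/1 weight (like GLSL `mix()`), and Matrix folds a multi-way select into a dot product with a one-hot weight vector:

```python
def select_branch(cond, a, b):
    # Real control flow: cheap when cond is uniform across the
    # subgroup, costly when it diverges between lanes.
    return a if cond else b

def select_mul(cond, a, b):
    # Branchless: both sides are always evaluated, then blended
    # by a 0/1 weight (equivalent to mix(b, a, w) in GLSL).
    w = float(cond)
    return a * w + b * (1.0 - w)

def select_matrix(index, values):
    # Multi-way branchless select: dot(one_hot(index), values).
    weights = [1.0 if i == index else 0.0 for i in range(len(values))]
    return sum(w * v for w, v in zip(weights, values))

assert select_branch(True, 3.0, 7.0) == select_mul(True, 3.0, 7.0) == 3.0
assert select_matrix(2, [10.0, 20.0, 30.0, 40.0]) == 30.0
```

The branchless variants always pay for every side of the select, which is why a well-optimized uniform branch usually wins in the table below.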
GPU | Mul uniform | Matrix uniform | Mul non-uniform | Branch non-uniform | Matrix non-uniform | Mul avg | Branch avg | Matrix avg |
---|---|---|---|---|---|---|---|---|
Adreno 5xx | 1.6 | 0.88 | 1.9 | 2.1 | 2.7 | 1.72 | 1.54 | 1.78 |
Adreno 6xx | 1.6 | 1.0 | 2.3 | 1.8 | 3.0 | 1.95 | 1.4 | 2.0 |
AMD GCN4 | 1.7 | 0.94 | 2.3 | 1.6 | 2.6 | 2.0 | 1.3 | 1.8 |
Apple M1 | 1.1 | 0.8 | 1.4 | 1.1 | 1.8 | 1.24 | 1.03 | 1.26 |
ARM Mali Midgard gen4 | 1.5 | 0.7 | 1.8 | 1.3 | 2.4 | 1.64 | 1.1 | 1.57 |
ARM Mali Valhall gen1 | 2.1 | 1.4 | 2.3 | 2.1 | 3.5 | 2.18 | 1.56 | 2.45 |
Intel UHD 6xx 9.5gen | 1.3 | 0.87 | 1.9 | 1.2 | 2.6 | 1.59 | 1.07 | 1.71 |
NV RTX 20xx (Turing) | 2.1 | 1.5 | 2.4 | 3.1 | 3.0 | 2.1 | 2.1 | 2.1 |
PowerVR B‑Series | 2.3 | 1.5 | 2.6 | 3.5 | 3.1 | 2.46 | 2.25 | 2.33 |
op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
---|---|---|---|---|---|---|---|---|
x | nan | nan | nan | nan | inf | -inf | max | -max |
Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
x != x | | | | | 0 | 0 | 0 | 0 |
Step(0,x) | | | | | 1 | 0 | 1 | 0 |
Step(x,0) | | | | | 0 | 1 | 0 | 1 |
Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Normalize(x) | nan | nan | nan | nan |
Differences (values are listed in the column order of the table above; blank cells are omitted):
- FP32 on NV Turing, Adreno 5xx/6xx:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan 0 -0
- FP32 on Intel gen 9:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): -1 -1 -1 -1
  - SignOrZero(-x): -1 -1 -1 -1
  - Normalize(x): nan nan 0 -0
- FP32 on Mali Valhall gen1:
  - x != x: 1 1 1 1
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): 0 -0 0 -0
- FP32 on Mali Midgard gen4:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - SmoothStep(x,0,1): 0 0 0 0
  - Normalize(x): 0 -0 0 -0
- FP32 on PowerVR B-Series:
  - x != x: 0 0 0 0
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): nan nan nan nan
  - SignOrZero(-x): nan nan nan nan
  - Normalize(x): nan nan 18446742974197923840 -18446742974197923840
- FP32 on AMD GCN4:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 1 1 1 1
  - SignOrZero(-x): 1 1 1 1
  - Normalize(x): 0 0 0 0
- FP32 on Apple M1:
  - x != x: 0 0 0 0
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan 0 -0
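The Min/Max rows in these tables show why argument order can matter once NaN is involved: a `min` implemented as a single comparison-and-select drops or propagates NaN depending on which operand feeds the comparison, while an IEEE-754 minNum-style implementation always suppresses NaN (which is what the measured GPUs do: Min(nan, 0) == 0). A plain Python sketch of both behaviors:

```python
import math

nan = float("nan")

def select_min(x, y):
    # Naive comparison-and-select: every comparison with NaN is false,
    # so the result silently depends on operand order.
    return x if x < y else y

def min_num(x, y):
    # IEEE-754 minNum-style: NaN is suppressed in favor of a number,
    # matching the Min rows in the tables above.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return x if x < y else y

assert select_min(nan, 0.0) == 0.0       # NaN in the first slot drops out
assert math.isnan(select_min(0.0, nan))  # NaN in the second slot propagates
assert min_num(nan, 0.0) == 0.0
assert min_num(0.0, nan) == 0.0
```

Shader code that relies on `min`/`max`/`clamp` to scrub NaN is therefore depending on hardware behavior that the GLSL spec leaves undefined.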
op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
---|---|---|---|---|---|---|---|---|
x | nan | nan | nan | nan | inf | -inf | max | -max |
Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | -max |
Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | max | 0 |
Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
x != x | | | | | 0 | 0 | 0 | 0 |
Step(0,x) | | | | | 1 | 0 | 1 | 0 |
Step(x,0) | | | | | 0 | 1 | 0 | 1 |
Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Normalize(x) | | | | | | | 0 | -0 |
Differences (values are listed in the column order of the table above; blank cells are omitted):
- FP16 is not supported on Adreno 5xx, Mali Midgard gen4, AMD GCN4.
- FP16 on NV Turing, Adreno 6xx:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan nan nan nan nan
- FP16 on Intel gen 9:
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): -1 -1 -1 -1
  - SignOrZero(-x): -1 -1 -1 -1
  - Normalize(x): nan nan nan nan nan nan
- FP16 on Mali Valhall gen1:
  - x != x: 1 1 1 1
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): -1 -1 -1 -1 1 -1
- FP16 on PowerVR B-Series:
  - x != x: 0 0 0 0
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): nan nan nan nan
  - SignOrZero(-x): nan nan nan nan
  - Normalize(x): nan nan nan nan nan nan
- FP16 on Apple M1:
  - x != x: 0 0 0 0
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan nan nan nan nan 0 -0
op \ type | nan1 | nan2 | nan3 | nan4 | inf | -inf | max | -max |
---|---|---|---|---|---|---|---|---|
x | nan | nan | nan | nan | inf | -inf | ||
Min(x,0) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | |
Min(0,x) | 0 | 0 | 0 | 0 | 0 | -inf | 0 | |
Max(x,0) | 0 | 0 | 0 | 0 | inf | 0 | 0 | |
Max(0,x) | 0 | 0 | 0 | 0 | inf | 0 | 0 | |
Clamp(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Clamp(x,-1,1) | -1 | -1 | -1 | -1 | 1 | -1 | 1 | -1 |
IsNaN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
IsInfinity | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
bool(x) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
x != x | | | | | 0 | 0 | 0 | 0 |
Step(0,x) | | | | | 1 | 0 | 1 | 0 |
Step(x,0) | | | | | 0 | 1 | 0 | 1 |
Step(0,-x) | | | | | 0 | 1 | 0 | 1 |
Step(-x,0) | | | | | 1 | 0 | 1 | 0 |
SignOrZero(x) | | | | | 1 | -1 | 1 | -1 |
SignOrZero(-x) | | | | | -1 | 1 | -1 | 1 |
SmoothStep(x,0,1) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Normalize(x) |
Differences (values are listed in the column order of the table above; blank cells are omitted):
- FP Mediump on NV Turing:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan nan nan nan nan 0 -0
- FP Mediump on Adreno 5xx/6xx:
  - x: 65504 -65504
  - Min(x,0): -65504
  - Min(0,x): -65504
  - Max(x,0): 65504
  - Max(0,x): 65504
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan nan nan nan nan 255 -255
- FP Mediump on Intel gen9:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): -1 -1 -1 -1
  - SignOrZero(-x): -1 -1 -1 -1
  - Normalize(x): nan nan nan nan nan nan nan nan
- FP Mediump on Mali Valhall gen1:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 1 1 1 1
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): -1 -1 -1 -1 1 -1 1 -1
- FP Mediump on Mali Midgard gen4:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - SmoothStep(x,0,1): 0 0 0 0
  - Normalize(x): nan nan nan nan 0 -0 0 -0
- FP Mediump on PowerVR B-Series:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 0 0 0 0
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): nan nan nan nan
  - SignOrZero(-x): nan nan nan nan
  - Normalize(x): nan nan nan nan nan nan 18446742974197923840 -18446742974197923840
- FP Mediump on AMD GCN4:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 1 1 1 1
  - Step(0,x): 1 1 1 1
  - Step(x,0): 1 1 1 1
  - Step(0,-x): 1 1 1 1
  - Step(-x,0): 1 1 1 1
  - SignOrZero(x): 1 1 1 1
  - SignOrZero(-x): 1 1 1 1
  - Normalize(x): nan nan nan nan 0 0 0 0
- FP Mediump on Apple M1:
  - x: max -max
  - Min(x,0): -max
  - Min(0,x): -max
  - Max(x,0): max
  - Max(0,x): max
  - x != x: 0 0 0 0
  - Step(0,x): 0 0 0 0
  - Step(x,0): 0 0 0 0
  - Step(0,-x): 0 0 0 0
  - Step(-x,0): 0 0 0 0
  - SignOrZero(x): 0 0 0 0
  - SignOrZero(-x): 0 0 0 0
  - Normalize(x): nan nan nan nan nan nan 0 -0
GPU | VRAM bandwidth from specs (GB/s) | VRAM bandwidth measured (GB/s) | RAM to VRAM bandwidth from specs (GB/s) | RAM to VRAM bandwidth measured (GB/s) | VRAM to RAM bandwidth measured (GB/s) | RAM to RAM bandwidth measured (GB/s) |
---|---|---|---|---|---|---|
Adreno 505 | 6.4 | 5 | ||||
Adreno 660 | 51.2 | 34 | ||||
AMD RX570 (GCN4) | 224.0 | 86 | ||||
Apple M1 | 68.25 | |||||
ARM Mali T830 (Midgard gen4) | 14.9 | 4 | ||||
ARM Mali G57 (Valhall gen1) | 17.07 | 14.2 | ||||
Intel UHD 620 (9.5gen) | 29.8 | 23 | ||||
NV RTX 2080 (Turing) | 448.0 | 403 | ||||
PowerVR BXM‑8‑256 | 51.2 | 14.2 |
- GMem - a part of the L2 cache that is used to store attachments on TBDR GPUs.
- Adreno has dedicated GMem memory.
- Mali uses the L2 cache, and sometimes an attachment can be evicted from L2 to RAM.
GPU | GMem (KB) | L2 cache per SM (KB) | L2 bandwidth (GB/s) | L2 cache line (bytes) | L1 cache per SM (KB) | Texture cache - part of L1 (KB) | L1 bandwidth (GB/s) |
---|---|---|---|---|---|---|---|
Adreno 505 | 128 | ||||||
Adreno 660 | 1536 | 128 | ? | 4? | 2? | ? | |
AMD RX570 (GCN4) | - | ||||||
Apple M1 | ? | ||||||
ARM Mali T830 (Midgard gen4) | 4 | 64 | |||||
ARM Mali G57 (Valhall gen1) | 8 | 512 | 49 | 64 | 32? | 32 | ? |
Intel UHD 620 (9.5gen) | - | 128 | 48? | 8? | 8? | 112? | |
NV RTX 2080 (Turing) | - | 4096 | ? | 64 | 32 | ? | |
PowerVR BXM‑8‑256 | ? | 1024 | ? | ? | 256? | 256 | ? |
block - compares compression between 1x1 noise and noise at the block size (4x4 or 8x8).
max - compares compression between 1x1 noise and a solid color.
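Both ratios can be computed from either execution time or measured memory traffic; a trivial sketch (the function and argument names are illustrative, not from the benchmark code):

```python
def compression_ratios(cost_noise_1x1, cost_noise_block, cost_solid_color):
    """'block' and 'max' compression ratios relative to incompressible
    1x1 noise; cost is either exec time or bytes of memory traffic."""
    block_ratio = cost_noise_1x1 / cost_noise_block  # block-size noise
    max_ratio = cost_noise_1x1 / cost_solid_color    # best case: solid color
    return block_ratio, max_ratio

# If block-size noise and a solid color both cost 1.0 while 1x1 noise
# costs 3.4, both ratios come out as 3.4 (cf. the Apple M1 row below):
print(compression_ratios(3.4, 1.0, 1.0))  # (3.4, 3.4)
```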
GPU | block size | block RGBA8_UNorm | max RGBA8_UNorm | block RGBA16_UNorm | max RGBA16_UNorm | method | comments |
---|---|---|---|---|---|---|---|
Adreno 5xx | 4x4 | 2.5 | 2.7 | ? | ? | exec time | |
Adreno 6xx | 16x16 | 1.9 | 6.9 | ? | 3.3 | exec time | |
AMD GCN4 | 4x4 | 2.3 | 3 | 2.3 | 3 | exec time | |
Apple M1 | 8x8 | 3.4 | 3.4 | 6.8 | 6.8 | exec time | |
Intel UHD 6xx 9.5gen | 8x8 | 1.6 | 1.8 | 1.8 | 1.85 | exec time | |
NV RTX 20xx | 4x4 | 3 | 3.2 | 4.1 | 4.1 | exec time | |
ARM Mali Valhall gen1 | 4x4 | 1.9 | 3.9 | 1.9 | 3.7 | exec time | only 32bit formats, V2 |
ARM Mali Valhall gen1 | 4x4 | 5.9 | 19 | 5.7 | 20 | mem traffic | used performance counters |
PowerVR B‑Series | 8x8 | 23 | 134 | 24 | 134 | mem traffic | used performance counters |
GPU | direct vs indirect performance |
---|---|
Adreno 5xx | direct is faster (20ms vs 41ms) |
Adreno 6xx | same |
AMD GCN4 | |
Apple M1 | |
Intel UHD 6xx 9.5gen | |
NV RTX 20xx | same |
ARM Mali Midgard gen4 | maxDrawIndirectCount = 1, used instancing instead of multiDraw, indirect is faster (120ms vs 130ms) |
ARM Mali Valhall gen1 | maxDrawIndirectCount = 1, used instancing instead of multiDraw, indirect is faster (12ms vs 15ms) |
ARM Mali Valhall gen3 | indirect is faster (25ms vs 31ms) |
PowerVR B‑Series | same |
Has more accurate results than the Shader instruction benchmark.
code
- sequential access - UV coordinates are multiplied by a scale, and a bias is added.
- scale < 1 gives better texture cache usage.
- scale > 1 causes many cache misses.
- scale > 1 is used in practice for noise textures in procedural generation.
- 'noise NxN' - the screen is divided into blocks of NxN size; each block has a unique offset for the texture lookup, and each pixel in a block has a 1px offset from its nearest pixels.
- the 1px offset is used to find the case where a nearby warp cannot reuse a cached texel.
- in practice this access pattern matches packed 2D sprites and textures for meshes.
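A sketch of how the 'noise NxN' lookup coordinates could be generated (a hypothetical reconstruction of the pattern described above; the hash is illustrative):

```python
def block_offset(bx, by, tex_size):
    # Unique pseudo-random offset per screen block (simple hash,
    # purely illustrative - any per-block unique value works).
    h = (bx * 73856093) ^ (by * 19349663)
    return (h % tex_size, (h // tex_size) % tex_size)

def noise_uv(px, py, n, tex_size):
    """Texel coordinate for pixel (px, py): screen split into NxN
    blocks, a unique offset per block, plus a 1px step per pixel
    inside the block so a nearby warp cannot reuse a cached texel."""
    bx, by = px // n, py // n
    ox, oy = block_offset(bx, by, tex_size)
    return ((ox + px % n) % tex_size, (oy + py % n) % tex_size)

# Horizontally adjacent pixels in one block differ by exactly 1 texel:
u0 = noise_uv(0, 0, 8, 256)
u1 = noise_uv(1, 0, 8, 256)
assert (u1[0] - u0[0]) % 256 == 1 and u1[1] == u0[1]
```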
- Image/Buffer common cases
- Buffer with variable data size
- Image with thread/group reorder
- Image with RT compression, 4xRGBA8
- Image with RT compression, 2xRGBA16
- Image with RT compression, 1xR32
Find the texture size at which performance degrades by close to 2x; this indicates many cache misses and a bottleneck in a higher-level cache or in external memory (RAM/VRAM).
Expected hierarchy:
- texture cache (L1)
- L2 cache
- RAM / VRAM
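The knee points can be found programmatically: scan increasing texture sizes and flag each size whose access time first reaches about 2x the running baseline, with each knee suggesting a drop to the next level of the hierarchy. A sketch with synthetic timings (all numbers below are made up for illustration):

```python
def find_cache_knees(sizes_kb, times):
    """Sizes where access time first jumps to >= 2x the current
    baseline; each knee suggests falling out of one cache level."""
    knees = []
    baseline = times[0]
    for size, t in zip(sizes_kb, times):
        if t >= 2.0 * baseline:
            knees.append(size)
            baseline = t  # the slower level becomes the new baseline
    return knees

# Synthetic numbers shaped like a texture-L1 -> L2 -> VRAM hierarchy:
sizes = [16, 32, 64, 128, 256, 512, 1024, 4096]
times = [1.0, 1.0, 1.1, 2.3, 2.4, 2.4, 5.1, 5.2]
print(find_cache_knees(sizes, times))  # [128, 1024]
```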
Transforms a 2D vector into a 3D cube face direction. The uniform version uses the same cube face for the whole warp; the non-uniform version uses a unique cube face per thread.
6 branches are used.
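For reference, a scalar sketch of the transform being benchmarked, using the standard OpenGL cube-map face layout (the benchmarked shader itself is not shown in this document; this is a plain reconstruction with one branch per face):

```python
def cube_face_dir(face, u, v):
    """Map a 2D coordinate (u, v in [-1, 1]) plus a face index to a 3D
    direction, OpenGL cube-map convention (+X,-X,+Y,-Y,+Z,-Z). The six
    branches mirror the benchmark's uniform/non-uniform face selection."""
    if face == 0:
        return (1.0, -v, -u)    # +X
    elif face == 1:
        return (-1.0, -v, u)    # -X
    elif face == 2:
        return (u, 1.0, v)      # +Y
    elif face == 3:
        return (u, -1.0, -v)    # -Y
    elif face == 4:
        return (u, -v, 1.0)     # +Z
    else:
        return (-u, -v, -1.0)   # -Z

# Face centers map to the axis directions:
assert cube_face_dir(0, 0.0, 0.0) == (1.0, 0.0, 0.0)
assert cube_face_dir(3, 0.0, 0.0) == (0.0, -1.0, 0.0)
```

When `face` is uniform across the warp only one branch is executed; when it is unique per thread, all six branches may be serialized, which is exactly the cost being measured.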