NVidia RTX 2080 (Turing)

Specs

Pixel Rate: 109.4 GPixel/s
Texture Rate: 314.6 GTexel/s
Clock stable: 1515 MHz, boost from specs: 1710 MHz, boost measured: 1900+ MHz
shaderSMCount: 46 [vk/specs]
shaderWarpsPerSM: 32 [vk] - this is register capacity, not a thread count
Warp size: 32 [vk]

Memory

Memory: 8GB, GDDR6, 256 bit, 1750 MHz, 448.0 GB/s (403 GB/s from tests (at 1515 MHz ?))
Memory max power consumption: 25W (7.5 pJ/bit, 0.06 J/GB) [calc]
L2 cache: 4MB
L1 Cache: 64 KB (per SM)
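
The 448.0 GB/s figure can be reproduced from the bus width and memory clock. The sketch below assumes the usual GDDR6 convention of 8 data transfers per listed memory clock (that multiplier is an assumption, it is not stated above), and the function names are hypothetical. The second helper only converts the quoted 7.5 pJ/bit into J/GB.

```python
def gddr6_bandwidth_gbs(clock_mhz: float, bus_bits: int, transfers_per_clock: int = 8) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=8 is an assumed GDDR6 multiplier, not taken from the notes.
    """
    bits_per_second = clock_mhz * 1e6 * transfers_per_clock * bus_bits
    return bits_per_second / 8 / 1e9


def joules_per_gb(pj_per_bit: float) -> float:
    """Convert energy per bit (pJ) into energy per gigabyte (J)."""
    return pj_per_bit * 1e-12 * 8e9


print(f"{gddr6_bandwidth_gbs(1750, 256):.1f} GB/s")
print(f"{joules_per_gb(7.5):.2f} J/GB")
```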

Floating-point performance

FP16: 20.14 TFLOPS at 1710 MHz
FP32: 10.07 TFLOPS at 1710 MHz
FP64: 314.6 GFLOPS at 1710 MHz
ops per clock per SM: 64 fp32 FMA [compute capability 7.5]
ops per clock per SM: 128 fp16 FMA [compute capability 7.5]
FP32 FMA perf: 4.46 TOp/s at 1515 MHz (4.4 TOp/s from tests)
FP16 FMA perf: 8.9 TOp/s at 1515 MHz (8.9 TOp/s from tests)
Total ALUs: 2944 [calc] - number of simultaneously executing threads

Theoretical performance:

FMA ops/s v1 = clock * ops_per_clock_per_SM * SMCount
FMA ops/s v2 = clock * SMCount * WarpSize * 4 (warp-scheduler units) / 2 (cycles per instruction)
FLOPS = 2 * FMA ops/s (each FMA counts as a multiply plus an add)
specs:
	- Each Turing SM includes 4 warp-scheduler units.
	- Instructions are performed over two cycles.

v1: 1515M * 64 * 46 = 4.46T FMA ops per second = 8.9 TFLOPS
v2: 1515M * 46 * 32 * 4 / 2 = 4.46T FMA ops per second = 8.9 TFLOPS
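
Both estimates can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above; the constant and function names are made up for illustration.

```python
CLOCK_HZ = 1515e6              # fixed clock used in the tests
SM_COUNT = 46
WARP_SIZE = 32
FMA_PER_CLOCK_PER_SM = 64      # fp32, compute capability 7.5
SCHEDULERS_PER_SM = 4
CYCLES_PER_INSTRUCTION = 2


def fma_rate_v1() -> float:
    """FMA ops per second from the per-SM throughput figure."""
    return CLOCK_HZ * FMA_PER_CLOCK_PER_SM * SM_COUNT


def fma_rate_v2() -> float:
    """FMA ops per second from warp schedulers and issue cycles."""
    return CLOCK_HZ * SM_COUNT * WARP_SIZE * SCHEDULERS_PER_SM / CYCLES_PER_INSTRUCTION


def flops(fma_rate: float) -> float:
    """One FMA is counted as two floating-point operations."""
    return 2 * fma_rate


print(f"v1: {fma_rate_v1() / 1e12:.2f} TOp/s, {flops(fma_rate_v1()) / 1e12:.2f} TFLOPS")
print(f"v2: {fma_rate_v2() / 1e12:.2f} TOp/s, {flops(fma_rate_v2()) / 1e12:.2f} TFLOPS")
```

Both variants agree because 32 * 4 / 2 equals the 64 FMA per clock per SM used in v1.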

Tensor Core performance

Tensor Cores: 368 (46 SM * 8 perSM)
FP16 FLOPS/cy per Core: 128 (64 FMA from specs)
INT8 IPS/cy per Core: 256
INT4 IPS/cy per Core: 512
FP16 TFLOPS: 71 at 1515 MHz (69.5 from tests)
INT8 TIPS: 142 at 1515 MHz
INT4 TIPS: 285 at 1515 MHz
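
The three throughput figures appear to be the product of core count, per-core operations per clock, and clock. A small sketch of that calculation, assuming this is how the numbers were derived (the helper name is hypothetical):

```python
TENSOR_CORES = 46 * 8   # 46 SMs, 8 tensor cores per SM
CLOCK_HZ = 1515e6


def tensor_rate_tera(ops_per_clock_per_core: int) -> float:
    """Peak tensor throughput in tera-operations per second."""
    return TENSOR_CORES * ops_per_clock_per_core * CLOCK_HZ / 1e12


for name, ops in [("FP16", 128), ("INT8", 256), ("INT4", 512)]:
    print(f"{name}: {tensor_rate_tera(ops):.1f}")
```

The results land within rounding of the 71, 142 and 285 quoted above.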

Ray tracing performance

RT Cores: 46 (46 SM * 1 perSM)
Giga Rays/s: 69.7 at 1515 MHz

Shader

Quads

Test subgroupQuadBroadcast( gl_HelperInvocation ) with/without texturing - helper invocations are executed. [6]
Test subgroupQuadBroadcast( constant ) with/without texturing - helper invocations are executed. [6]

Subgroups

Result of Rainbow( Hash( subgroupAdd( gl_FragCoord.xy ))) for 4 quads without instancing. [6]
An SM can fill multiple triangles with a single subgroup inside a tile (16x16 px).

Subgroup occupancy, red - full subgroup.
Result of Rainbow( Hash( subgroupAdd( gl_FragCoord.xy ))) for 4 quads with instancing. [6]
An SM can fill multiple triangles that share the same gl_InstanceIndex with a single subgroup inside a tile (16x16 px).

Subgroup occupancy, red - full subgroup, blue - very low number of threads per subgroup
Subgroup occupancy for single triangle with texturing. Helper invocations are executed and included as active thread. Red color - full subgroup. [6]
Subgroup occupancy for too small triangles. Red color - full subgroup. [6]
Subgroup occupancy for too small triangles with instancing. Red color - full subgroup. [6]

Subgroup threads order

Result of Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize ) in fragment shader, gl_SubgroupSize: 32. [6]

Result of Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize ) in compute shader, gl_SubgroupSize: 32, workgroup size: 8x8. [6]

SM order

Result of Rainbow( gl_SMIDNV / gl_SMCountNV ) in fragment shader.
Tile size: 16x16, image size: 102x53, gl_SMCountNV: 46, gl_SMIDNV: 0 and 1 are bound to the first tile and changed every frame, same for other tiles. [6]

Result of Rainbow( gl_SMIDNV / gl_SMCountNV ) in compute shader.
Workgroup size: 8x8, image size: 102x53, gl_SMCountNV: 46. First set (from red to violet) has gl_SMIDNV = 0,2,4..., next set has gl_SMIDNV = 1,3,5... and next - again 0,2,4... [6]

SM tile size depends on register count

An SM has a limited number of registers, but it must run multiple warps to hide memory latency.
When a shader uses a large number of registers, the SM decreases the tile size so that it can still execute the minimal required number of warps.
The maximal tile size is 16x16 px (left side of the image).

Instruction cost

Shader instruction benchmark results: [4]

fp32 & i32 datapaths can execute in parallel at a 1:1 rate.
fp16x2 FMA is used; scalar FMA doesn't have x2 performance.
base rate: 8.9 TOp/s at 1515 MHz
floating point

op \ type	fp16	fp32	fp64
Add	0.5	1	40
Mul	1	2	80
FMA	1	2	80
MulAdd	2	2	80
Lerp	1	2	80
Length	2	3	480
Normalize	2	3	280
Distance	3	5	480
Dot	3	4	160
Cross	4	4	160
Min/Max	1.5	1	120
Clamp(x,0,1)	0.5	1	200
Clamp(x,-1,1)	3	3	200
Clamp	3	3	200
Step	2.4	1.1	120
SmoothStep	3	5	440
Abs	0.5	1	40
SignOrZero	8	8	280
BitCast	4	1	40
FloatToInt	6	6	80
IntToFloat	6	6	80
Ceil, Floor, Trunc, Round, RoundEven	7	7	120
Fract	7	7	200
Div	7	7	800
Mod	15	15	1000
Exp, Exp2	7	7	-
InvSqrt	7	7	800
Sqrt	7	7	1680
Log, Log2	7	7	-
Sin, Cos	7	7	-
Pow	16	16	-
Tan	22	22	-
ASin, ACos	24	24	-
ATan	56	56	-
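
One way to read the table is as a cost relative to the base rate, so that an estimated throughput is the base rate divided by the table entry. That reading is an assumption, although it is consistent with the measured FP32 and FP16 rates listed further down in this section (for example, fp32 FMA at cost 2 gives about 4.4 TOp/s). A minimal sketch with a hypothetical helper:

```python
BASE_RATE_TOPS = 8.9  # TOp/s at 1515 MHz, from the notes above


def estimated_rate_tops(relative_cost: float) -> float:
    """Estimated throughput, assuming table entries are costs relative to the base rate."""
    return BASE_RATE_TOPS / relative_cost


# A few (op, type, cost) entries copied from the table above.
for op, kind, cost in [("Add", "fp16", 0.5), ("FMA", "fp32", 2), ("Div", "fp32", 7), ("Sqrt", "fp64", 1680)]:
    print(f"{op} {kind}: {estimated_rate_tops(cost):.4f} TOp/s")
```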

floating point fast math

op \ type	fp16	fp32	fp64
fast Sign	4	1.2	280
Cbrt	16	16
sRGB	16	17
fast sRGB	9	19
fast Cos	10	18
fast Sin	12	20
fast Tan	10	18
fast ASin	9	16
fast ACos	10	17
fast ASin v2	10	24
fast ACos v2	16	26
fast ATan v2	18	36
fast ATan	34	44

integer

op \ type	i32	u32	i64	u64	i16	u16	i8	u8
Add	1	0.9	2.5	2	1	0.9	0.1 ?	0.1 ?
Mul	2	2	8	8	2	3	2	3
MulAdd	2	2	8	8	2	3	0.2 ?	0.2 ?
Div	54	48	180	140	52	52	0.1 ?	0.1 ?
Mod	48	54	180	140	52	52	0.1 ?	0.1 ?
Min/Max	1	1	8	8	3	4	4	4
Clamp const	3	1	16	8	8	4	8	4
Clamp	1	1	8	8	3	4	4	4
Abs	1.5	-	8	-	5	-	0.1 ?	-
SignOrZero
Shift const	1	1	2	2	1	1	0.1 ?	0.1 ?
Shift	2	2	4	4	4	4	0.1 ?	0.1 ?
And	1	1	4	4	1	1	0.1 ?	0.1 ?
Or	1	1	4	4	1	1	0.1 ?	0.1 ?
Xor	1	1	4	4	1	1	0.1 ?	0.1 ?
BitCount	8	8	-	-	-	-	-	-
FindLSB	16	16	16	16	16	16	16	16
FindMSB	8	8	-	-	-	-	-	-
AddCarry, SubBorrow	-	6	-	-	-	-	-	-
MulExtended	-	7	-	-	-	-	-	-

FP32 instruction performance: [2]
- Loop unrolling can double performance.
- Loop unrolling works for loop counts below 1536; at 2048 performance drops by 2.5x.
- Loop unrolling makes the pipeline creation stage too slow.
- Benchmarking in a compute shader is only 1% faster.
- Minimal dispatch size: 256x276 (1.5x the total thread count); smaller sizes lose performance.
- Measured with the clock fixed at 1515 MHz.
- Minimal workgroup size: 32x2, because FMA is performed over 2 cycles (like SIMD16 with dual issue).

TOp/s	ops	max TFLOPS
8.8	Add	8.8
4.4	MulAdd, FMA	8.8
4.4	Mul	4.4

FP16 instruction performance: [1]
- Measured with fixed clock at 1515 MHz.

TOp/s	ops	max TFLOPS
17.8	Add	17.8
8.9	Mul, Add with deps	8.9
8.9	MulAdd, FMA	17.8
4.4	MulAdd with deps	8.8
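
The minimal dispatch size quoted in the FP32 notes (256x276, about 1.5 of the total thread count) can be checked against the values from the Specs section. The sketch assumes that "total thread count" means SM count x warps per SM x warp size; the names are hypothetical.

```python
SM_COUNT = 46
WARPS_PER_SM = 32
WARP_SIZE = 32


def resident_threads() -> int:
    """Threads that can be resident on the whole GPU at once (assumed definition)."""
    return SM_COUNT * WARPS_PER_SM * WARP_SIZE


def dispatch_ratio(width: int, height: int) -> float:
    """Dispatch size expressed as a multiple of the resident thread count."""
    return width * height / resident_threads()


print(resident_threads())        # 47104
print(dispatch_ratio(256, 276))  # 1.5
```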

NaN / Inf

FP32, Mediump, FP16, FP64. [11]

op \ type	nan1	nan2	nan3	nan4	inf	-inf	max	-max
x	nan	nan	nan	nan	inf	-inf	max	-max
Min(x,0)	0	0	0	0	0	-inf	0	-max
Min(0,x)	0	0	0	0	0	-inf	0	-max
Max(x,0)	0	0	0	0	inf	0	max	0
Max(0,x)	0	0	0	0	inf	0	max	0
Clamp(x,0,1)	0	0	0	0	1	0	1	0
Clamp(x,-1,1)	-1	-1	-1	-1	1	-1	1	-1
IsNaN	1	1	1	1	0	0	0	0
IsInfinity	0	0	0	0	1	1	0	0
bool(x)	1	1	1	1	1	1	1	1
x != x	1	1	1	1	0	0	0	0
Step(0,x)	1	1	1	1	1	0	1	0
Step(x,0)	1	1	1	1	0	1	0	1
Step(0,-x)	1	1	1	1	0	1	0	1
Step(-x,0)	1	1	1	1	1	0	1	0
SignOrZero(x)	0	0	0	0	1	-1	1	-1
SignOrZero(-x)	0	0	0	0	-1	1	-1	1
SmoothStep(x,0,1)	0	0	0	0	1	0	1	0
Normalize(x)	nan	nan	nan	nan	nan	nan	0	-0
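
The NaN and Inf columns for Min, Max, Clamp and Step are consistent with a simple model: min/max return the non-NaN operand, Clamp is min(max(x, lo), hi), and Step is a less-than comparison that is false for NaN. The sketch below is only that model written out in Python, under those assumptions; it does not claim to describe what the driver actually emits.

```python
import math


def gpu_min(a: float, b: float) -> float:
    # Assumption: NaN-ignoring min (returns the other operand if one is NaN).
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def gpu_max(a: float, b: float) -> float:
    # Assumption: NaN-ignoring max.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def clamp(x: float, lo: float, hi: float) -> float:
    return gpu_min(gpu_max(x, lo), hi)


def step(edge: float, x: float) -> float:
    # 0.0 if x < edge, else 1.0; any comparison with NaN is false.
    return 0.0 if x < edge else 1.0


nan, inf = float("nan"), float("inf")
for x in (nan, inf, -inf):
    print(x, gpu_min(x, 0.0), gpu_max(x, 0.0), clamp(x, 0.0, 1.0),
          clamp(x, -1.0, 1.0), step(0.0, x), step(x, 0.0))
```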

Shared memory

TODO

Noise performance

TODO

Circle performance

small circles. [13]
- 65K objects
- 23 MPix

shape	exec time (ms)	diff (%)
quad	0.52	-
fan	0.67	29
strip	0.68	31
max area	0.59	13

4x4 circles with blending. [13]
- 3200x1800
- 128 layers

shape	exec time (ms)	diff (%)
quad	5.03	-
fan	5.27	4.8
strip	5.20	3.4
max area	5.25	4.4

Branching

Mul vs Branch vs Matrix [12]

16.7MPix, 128 iter, 6 mul/branch ops.

op	exec time (ms)	diff
Mul uniform	3.06	2.1
Branch uniform	1.48	-
Matrix uniform	2.19	1.5
-
Mul non-uniform	3.57	2.4
Branch non-uniform	4.58	3.1
Matrix non-uniform	4.37	3.0
-
Mul avg	3.3	2.1
Branch avg	3.3	2.1
Matrix avg	3.3	2.1

Blending

TODO

Resource access

Texture access 105MPix: [5]

expected read: 419MB per frame.
UV bias has no effect.

diff	exec time (ms)	approx traffic (GB/s)	name	comments
0.43	0.55	761	sequential access, scale x0.5	used texture cache
1	1.28	327	sequential access, scale x1	near to VRAM bandwidth
1.15	1.47	285	random access, noise 16x16
1.19	1.52	276	random access, noise 16x16, off 1	1px offset has effect only for 16x16 block size
1.52	1.94	216	random access, noise 8x8
2.1	2.64	159	sequential access, scale x1.5
2.2	2.83	148	random access, noise 4x4
3.5	4.44	94	sequential access, scale x2
5	6.4	65	random access, noise 2x2
12.5	16	26	random access, noise 1x1
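
The "approx traffic" and "diff" columns follow from the expected read size and the execution time. A short sketch of that arithmetic, assuming the "sequential access, scale x1" row is the baseline for diff (helper names are hypothetical):

```python
EXPECTED_READ_MB = 419.0  # expected read per frame, from the text above


def approx_traffic_gbs(exec_time_ms: float, data_mb: float = EXPECTED_READ_MB) -> float:
    """MB per millisecond is numerically equal to GB per second."""
    return data_mb / exec_time_ms


def relative_cost(exec_time_ms: float, baseline_ms: float = 1.28) -> float:
    """Execution time relative to the sequential x1 row (assumed baseline)."""
    return exec_time_ms / baseline_ms


for name, ms in [("scale x0.5", 0.55), ("scale x1", 1.28), ("noise 1x1", 16.0)]:
    print(f"{name}: {approx_traffic_gbs(ms):.0f} GB/s, diff {relative_cost(ms):.2f}")
```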

Buffer/Image storage 16bpp 67.1MPix 2x1.073GB [7]

An image of 1GB size doesn't get RT compression. Is it because the metadata is too large?
An image input attachment is preferred because you don't need to reorder threads and RT compression is used to minimize bandwidth.

diff (%)	exec time (ms)	approx traffic (GB/s)	name	comments
26	6.7	320	Buffer load/store in FS, 16 bytes	cache misses because of non-sequential read/write (?)
19	6.3	340	Buffer load/store, 128 bytes
7.5	5.7	376	Image load/store, workgroup 8x8, row major
7	5.66	379	Image load/store, workgroup 8x8, column major
3	5.45	394	Buffer load/store, 16 bytes
3	5.45	394	Image load/store, workgroup 16x16, column major
2	5.4	397	Image load/store, workgroup 16x16, row major
2	5.4	397	Image read/write input attachment RGBA32F, 1x1 noise	RT compression is not enabled because of > 1GB size
2	2.7	397	Image read/write input attachment 2xRGBA8, 1x1 noise	has RT compression, but performance is low because of 8bpp
1	5.35	401	Image load/store, workgroup 16x16, group reorder, row major
1	5.35	401	Buffer load/store, 32 bytes
0	5.3	405	Buffer load/store, 64 bytes	64 byte L2 cache line, stable for any workgroup size and group order
-10	4.8	447	Image read/write input attachment 2xRG32F, 1x1 noise
-23	4.3	499	Image read/write input attachment 4xRGBA8, 1x1 noise	better compression for RGBA8 ?
-72	2.35	699	Image read/write input attachment RGBA32F, 2x2 noise, 7K	speedup on RT compression
-77	3.0	715	Image read/write input attachment 2xRG32F, 2x2 noise	speedup on RT compression
-77	3.0	715	Image read/write input attachment 4xRGBA8, 2x2 noise	speedup on RT compression

Render target compression

RGBA8 205MPix downsample 1/2, compressed/uncompressed access rate: [3]

read: 822MB, write: 205MB, total: 1027MB per frame.
linear: 6.5ms, fetch: 6.6ms, nearest: 7.3ms.
image storage: load: 8ms, linear/fetch: 7.2ms. Texture sampling is a bit faster because of texture cache.
Compression is disabled when the storage usage flag is set.

diff	exec time (ms)	approx traffic (GB/s)	name	comments
0.97	2.79	368	image storage 1x1 noise
1	2.72	377	image storage (other modes)
1.07	2.53	405	1x1 noise
1.78	1.53	671	2x2 noise
3.2	0.84	1223	4x4 noise	same as block size
3.3	0.81	1268	gradient
3.4	0.79	1300	8x8 noise	better compression for output (4x4 block)
3.4	0.79	1300	16x16 noise
3.4	0.79	1300	solid color
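
Rows whose approximate traffic exceeds the 448 GB/s peak VRAM bandwidth can only be explained by compression (or caching), since the traffic column counts uncompressed bytes. The sketch below recomputes the diff and traffic columns for this RGBA8 table and flags such rows; it assumes the "image storage (other modes)" row at 2.72 ms is the baseline, and the names are hypothetical.

```python
TOTAL_MB = 1027.0       # read + write per frame, from the text above
BASELINE_MS = 2.72      # "image storage (other modes)" row, assumed baseline
PEAK_VRAM_GBS = 448.0   # from the Memory section


def speedup(exec_time_ms: float) -> float:
    return BASELINE_MS / exec_time_ms


def effective_traffic_gbs(exec_time_ms: float) -> float:
    # Uncompressed bytes moved per second; MB/ms equals GB/s.
    return TOTAL_MB / exec_time_ms


def exceeds_vram_peak(exec_time_ms: float) -> bool:
    return effective_traffic_gbs(exec_time_ms) > PEAK_VRAM_GBS


for name, ms in [("1x1 noise", 2.53), ("2x2 noise", 1.53), ("solid color", 0.79)]:
    print(f"{name}: x{speedup(ms):.2f}, {effective_traffic_gbs(ms):.0f} GB/s, "
          f"above peak: {exceeds_vram_peak(ms)}")
```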

RGBA16_UNorm 104.8MPix downsample 1/2, compressed/uncompressed access rate: [3]

read: 838MB, write: 209MB, total: 1048MB per frame.

diff	exec time (ms)	approx traffic (GB/s)	name	comments
0.97	2.71	387	image storage 1x1 noise, gradient	???
1	2.63	398	image storage 8x8 noise, solid color	???
1.02	2.57	408	1x1 noise
1.79	1.47	713	2x2 noise
2.0	1.30	806	gradient	less compression rate than in RGBA16F because of higher precision
4.2	0.62	1690	4x4 noise	same as block size
4.2	0.62	1690	8x8 noise
4.2	0.62	1690	16x16 noise
4.2	0.62	1690	solid color

RGBA16F 104.8MPix downsample 1/2, compressed/uncompressed access rate: [3]

read: 838MB, write: 209MB, total: 1048MB per frame.

diff	exec time (ms)	approx traffic (GB/s)	name	comments
0.95	2.75	381	image storage 1x1 noise	???
1	2.63	398	image storage 4x4 noise, gradient, solid color	???
1.03	2.55	411	1x1 noise
1.8	1.46	718	2x2 noise
3.4	0.77	1361	gradient
4.2	0.62	1690	4x4 noise	same as block size
4.2	0.62	1690	8x8 noise
4.2	0.62	1690	16x16 noise
4.2	0.62	1690	solid color

RGBA32F 37.7MPix downsample 1/2, compressed/uncompressed access rate: [3]

read: 604MB, write: 151MB, total: 755MB per frame.

diff	exec time (ms)	approx traffic (GB/s)	name	comments
1	1.89	399	image storage 1x1 noise
1.03	1.84	410	gradient	low compression rate because of high precision
1.03	1.84	410	1x1 noise
2.4	0.79	956	2x2 noise
3.7	0.51	1480	4x4 noise	same as block size
3.7	0.51	1480	8x8 noise
3.7	0.51	1480	16x16 noise
3.7	0.51	1480	solid color

Texture cache

RGBA8_UNorm texture with random access [9]

Measured cache size: 32 KB, 1 MB, 4MB.
8 texels per pixel, 5.76MPix, 737MB.
from specs: only 32KB of L1 cache is reserved for texture cache.
textureGather() has same performance as texture() with linear sampling.

size (B)	dimension (px)	exec time (ms)	diff	approx bandwidth (GB/s)	VRAM utilization (%)	comments
128	4x8	0.18	-	4096	17	L1 cache line?
256	8x8	0.28	1.6	2630	12
512	8x16	0.33	1.18	2233	11
1K	16x16	0.42	1.27	1755	9
2K	16x32	0.48	1.14	1535	8
4K	32x32	0.50	1.04	1474	7
8K	32x64	0.52	1.04	1417	7
16K	64x64	0.53	1.02	1390	7
32K	64x128	0.60	1.13	1228	5	L1 cache size
64K	128x128	1.5	2.5	491	3	L1 cache size from specs; either there is not enough space to store 64 KB of unique data, or cache lines are not fully used
1M	512x512	1.9	-	387	3
4M	1024x1024	4.07	2.1	181	8	L2 cache
8M	2048x1024	10	2.5	74	73
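
The measured cache sizes (32 KB, 1 MB, 4 MB) correspond to the rows just before a large jump in execution time. A minimal sketch of picking them out of the table, assuming a 2x slowdown threshold (the threshold and the function name are assumptions made for illustration):

```python
# (working set in bytes, exec time in ms), copied from the table above
MEASUREMENTS = [
    (128, 0.18), (256, 0.28), (512, 0.33), (1024, 0.42), (2048, 0.48),
    (4096, 0.50), (8192, 0.52), (16384, 0.53), (32768, 0.60), (65536, 1.5),
    (1 << 20, 1.9), (4 << 20, 4.07), (8 << 20, 10.0),
]


def cache_size_candidates(rows, threshold=2.0):
    """Return working-set sizes that precede a slowdown of at least `threshold`x."""
    sizes = []
    for (prev_size, prev_ms), (_, ms) in zip(rows, rows[1:]):
        if ms / prev_ms >= threshold:
            sizes.append(prev_size)
    return sizes


print(cache_size_candidates(MEASUREMENTS))  # [32768, 1048576, 4194304]
```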

Vertex Cache

| vertices | triangles | VS invocations | overhead (invocations, %) | repeat vertIDs | comment |
|---|---|---|---|---|---|
| 20 | 24 | 20 | 0 | - | |
| 24 | 30 | 24 | 0 | - | |
| 28 | 36 | 32 | 4, 14% | 21-23, 25 | |
| 32 | 42 | 37 | 5, 15.6% | 21-25 | |
| 36 | 48 | 41 | 5, 13.8% | 21-25 | |
| 40 | 54 | 45 | 5, 12.5% | 21-25 | |
| 44 | 60 | 49 | 5, 11.3% | 21-25 | |
| 48 | 66 | 56 | 7, 14.5% | 21-25, 42-43 | |
| 52 | 72 | 62 | 10, 19.2% | 21-25, 42-46 | |
| 64 | 90 | 74 | 10, 15.6% | | |
| 64 | 72 | 72 | 8, 12.5% | | 4 instances |
| 17600 | 24000 | 20800 | 18.2% | ... | 400 instances |
| 24000 | 33600 | 28600 | 19.2% | ... | 400 instances |
| 27200 | 38400 | 31200 | 14.7% | ... | 400 instances |
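
The overhead column appears to be the number of extra vertex shader invocations relative to the unique vertex count. A small sketch of that calculation, assuming this definition (the helper name is hypothetical):

```python
def vs_overhead(vertices: int, vs_invocations: int) -> tuple[int, float]:
    """Return (extra invocations, overhead in percent of the unique vertex count)."""
    extra = vs_invocations - vertices
    return extra, 100.0 * extra / vertices


for vertices, invocations in [(28, 32), (52, 62), (17600, 20800)]:
    extra, percent = vs_overhead(vertices, invocations)
    print(f"{vertices} vertices: {extra} extra invocations, {percent:.1f}%")
```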

Triangle Clipping

When part of a single triangle is clipped by the depth test or gl_ClipDistance, the resulting rectangle is rasterized as 2 triangles with helper invocations along the diagonal.
Possible explanation: the triangle is clipped, but the hardware cannot rasterize a rectangle, so it divides the rectangle into 2 new triangles. This is needed to avoid rasterizing hidden parts of the triangle.
