This repository has been archived by the owner on Jan 3, 2020. It is now read-only.
Fast GPU Ray Traversal 1.4
--------------------------
Implementation by Tero Karras, Timo Aila, and Samuli Laine
Copyright 2009-2012 NVIDIA Corporation
This package contains full source code for the fast GPU-based ray traversal
routines used in the following paper:
"Understanding the Efficiency of Ray Traversal on GPUs",
Timo Aila and Samuli Laine,
Proc. High-Performance Graphics 2009
http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf
In addition to the original kernels that were optimized for NVIDIA GTX 285,
the package also includes kernels specifically hand-tuned for GTX 480
(GF100/Fermi) and GTX 680 (GK104/Kepler). The results for these GPUs have
been published in the following technical report:
"Understanding the Efficiency of Ray Traversal on GPUs - Kepler and Fermi Addendum",
Timo Aila, Samuli Laine, and Tero Karras,
NVIDIA Technical Report NVR-2012-02,
http://research.nvidia.com/publication/understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum
The accompanying benchmark application and test scenes aim to replicate the
published results as accurately as possible, although there are slight
differences in the test setup (e.g. BVH builder, CUDA version).
See results.txt for details.
The source code is licensed under New BSD License (see LICENSE), and
hosted by Google Code:
http://code.google.com/p/understanding-the-efficiency-of-ray-traversal-on-gpus/
System requirements
-------------------
- Microsoft Windows XP, Vista, or 7.
- At least 1GB of system memory.
- NVIDIA CUDA-compatible GPU with compute capability 1.2 or higher and at
least 1.5 gigabytes of RAM. GeForce GTX 480 or GTX 680 is recommended.
- Microsoft Visual Studio 2010. Required even if you do not plan to build
the source code, as the runtime CUDA compilation mechanism depends on it.
Instructions
------------
1. Install Visual Studio 2010. The Express edition can be downloaded from:
http://www.microsoft.com/visualstudio/en-us/products/2010-editions/visual-cpp-express
2. Install the latest NVIDIA GPU drivers and CUDA Toolkit.
http://developer.nvidia.com/object/cuda_archive.html
3. Run rt.exe to start the application in interactive mode. The first
run executes certain initialization tasks that may take a while to
complete.
4. If you get an error during initialization, the most probable explanation
is that the application is unable to launch nvcc.exe contained in the
CUDA Toolkit. In this case, you should:
- Set CUDA_BIN_PATH to point to the CUDA Toolkit "bin" directory, e.g.
"set CUDA_BIN_PATH=C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin".
- Set CUDA_INC_PATH to point to the CUDA Toolkit "include" directory, e.g.
"set CUDA_INC_PATH=C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include".
- Run vcvars32.bat to set up Visual Studio paths, e.g.
"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat".
5. Run benchmark.cmd to measure the performance of all test scenes with
the same settings that were used in the paper. Expected performance
numbers for different GPUs are listed in "results.txt". Note that a
64-bit build is required to benchmark the San Miguel scene.
6. Optional: Build the application yourself.
- Open rt.sln in Visual Studio 2010.
- Right-click the "rt" project and select "Set as StartUp Project".
- Select the Release build. The Debug build is very slow, especially when
sorting secondary rays during the benchmark.
- Build and run.
Test setup
----------
Camera positions:
- Results of each scene are averaged over 5 distinct camera positions.
- Camera positions are specified on the command line using
"signature strings".
- To generate a signature string, click "Show camera controls" in the
interactive mode and then "Export camera signature..."
Ray generation:
- Viewport of 1024x768 pixels.
- Primary rays that hit scene geometry generate 32 secondary rays (AO or
diffuse) distributed according to a cosine-weighted Halton sequence on
the hemisphere.
- Primary rays that miss the geometry generate dummy secondary rays that
are ignored by the ray traversal kernel and excluded from the
rays-per-second figures.
- AO rays are of limited length and terminate immediately after encountering
an intersection.
- Diffuse interreflection rays are very long and continue the traversal
until they find the closest intersection.
- See src/rt/ray/RayGen.cpp
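The secondary ray directions described above can be illustrated with a
minimal C++ sketch. It is not taken from RayGen.cpp: the names
radicalInverse and cosineHaltonDirection, the choice of Halton bases 2 and 3,
and the local frame with the surface normal along +Z are all assumptions made
for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Radical inverse of `index` in the given base: the digits of the index are
// mirrored around the radix point, giving a value in [0, 1).
inline float radicalInverse(std::uint32_t index, std::uint32_t base)
{
    assert(base >= 2);
    float result = 0.0f;
    float digitWeight = 1.0f / static_cast<float>(base);
    while (index > 0)
    {
        result += static_cast<float>(index % base) * digitWeight;
        index /= base;
        digitWeight /= static_cast<float>(base);
    }
    return result;
}

// Maps the i-th 2D Halton point (bases 2 and 3, an assumption) to a unit
// direction on the hemisphere around +Z, with probability density
// proportional to the cosine of the angle to +Z.
inline Vec3 cosineHaltonDirection(std::uint32_t i)
{
    const float pi = 3.14159265358979f;
    float u = radicalInverse(i, 2);
    float v = radicalInverse(i, 3);
    float r = std::sqrt(u);            // radius of a uniform sample on the unit disk
    float phi = 2.0f * pi * v;         // angle of the disk sample
    float z = std::sqrt(1.0f - u);     // lift the disk sample onto the hemisphere
    return Vec3{ r * std::cos(phi), r * std::sin(phi), z };
}
```

Calling cosineHaltonDirection(i) for i = 0..31 would give the 32 directions of
one primary hit in this local frame; a real ray generator would still have to
rotate them so that +Z matches the surface normal.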
Batches and sorting:
- All primary rays are traced in a single launch.
- Secondary rays are divided into batches of 2^20 rays, and each batch
is traced in a separate launch.
- Primary rays are generated according to 2D Morton order in screen space.
- Each batch of secondary rays is sorted according to 6D Morton order based
on ray origin and direction vectors.
- The sorting is performed on the CPU and is quite a heavy operation,
taking roughly one second per batch. It is disabled in the interactive
mode.
- See src/rt/ray/PixelTable.cpp and src/rt/ray/RayBuffer.cpp
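As a rough illustration of the Morton orderings mentioned above, the
following C++ sketch computes a 2D Morton index from pixel coordinates and a
6D Morton key from six quantized ray coordinates. The function names, the
64-bit key, and the quantization to at most 10 bits per coordinate are
assumptions; the actual PixelTable.cpp and RayBuffer.cpp may differ.

```cpp
#include <cassert>
#include <cstdint>

// Spreads the low 16 bits of v so that a zero bit separates each pair of
// consecutive input bits.
inline std::uint32_t spreadBits2D(std::uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D Morton index of a pixel: bits of x and y are interleaved, x in the
// even bit positions and y in the odd ones.
inline std::uint32_t morton2D(std::uint32_t x, std::uint32_t y)
{
    return spreadBits2D(x) | (spreadBits2D(y) << 1);
}

// 6D Morton key: interleaves bitsPerDim bits from each of six quantized
// coordinates (e.g. origin x, y, z followed by direction x, y, z),
// most significant bits first. q[0] ends up most significant in each group.
inline std::uint64_t morton6D(const std::uint32_t q[6], int bitsPerDim = 10)
{
    assert(bitsPerDim >= 1 && bitsPerDim <= 10);  // 6 * 10 bits fit in 64
    std::uint64_t key = 0;
    for (int bit = bitsPerDim - 1; bit >= 0; --bit)
        for (int dim = 0; dim < 6; ++dim)
            key = (key << 1) | ((q[dim] >> bit) & 1u);
    return key;
}
```

Sorting a batch by such a key places rays with similar origins and
directions next to each other, which is the purpose of the sorting step.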
Acceleration structure:
- AABB-based binary bounding volume hierarchy.
- Built using spatial triangle splits (Stich et al.) to improve quality:
http://www.nvidia.com/docs/IO/77714/sbvh.pdf
- Triangles are represented using Woop's affine triangle transformation:
http://www.sven-woop.de/publications/Diplom_SvenWoop_Final.pdf
- Memory layout varies between individual traversal kernels.
- See src/rt/bvh/SplitBVHBuilder.cpp and src/cuda/CudaBVH.cpp
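The idea behind Woop's affine triangle transformation can be sketched in C++
as follows. Each triangle stores an affine matrix that maps it onto the unit
triangle (0,0,0), (1,0,0), (0,1,0) in the z = 0 plane, so the intersection
test reduces to a plane hit followed by a barycentric range check. The struct
layout and all names here are hypothetical and do not reflect the memory
layouts used by the kernels in this package.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Three rows of a 3x4 affine matrix mapping world space to unit-triangle
// space (hypothetical layout for illustration only).
struct WoopTriangle
{
    float rowU[4];  // yields the barycentric u coordinate
    float rowV[4];  // yields the barycentric v coordinate
    float rowZ[4];  // yields the coordinate perpendicular to the triangle
};

// Applies one matrix row to a point (translation included).
inline float transformPoint(const float row[4], const Vec3& p)
{
    return row[0] * p.x + row[1] * p.y + row[2] * p.z + row[3];
}

// Applies one matrix row to a direction (translation ignored).
inline float transformVector(const float row[4], const Vec3& d)
{
    return row[0] * d.x + row[1] * d.y + row[2] * d.z;
}

// Returns true if the ray hits the triangle with tMin < t < tMax, and
// writes the hit distance and barycentric coordinates.
inline bool intersectWoop(const WoopTriangle& tri, const Vec3& orig,
                          const Vec3& dir, float tMin, float tMax,
                          float& t, float& u, float& v)
{
    assert(tMin <= tMax);
    float oz = transformPoint(tri.rowZ, orig);
    float dz = transformVector(tri.rowZ, dir);
    if (dz == 0.0f)
        return false;                       // ray parallel to the triangle plane
    float tHit = -oz / dz;                  // where the transformed ray reaches z = 0
    if (!(tHit > tMin && tHit < tMax))
        return false;
    float hu = transformPoint(tri.rowU, orig) + tHit * transformVector(tri.rowU, dir);
    float hv = transformPoint(tri.rowV, orig) + tHit * transformVector(tri.rowV, dir);
    if (hu < 0.0f || hv < 0.0f || hu + hv > 1.0f)
        return false;                       // outside the unit triangle
    t = tHit; u = hu; v = hv;
    return true;
}
```

For the unit triangle itself the matrix is the identity, which makes the
sketch easy to check by hand. Computing the matrix for an arbitrary triangle
requires inverting a 3x3 system and is not shown.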
Ray traversal:
- Performance results are based on the time spent in ray traversal for
the selected ray type. Ray generation and sorting are excluded from
the measurements.
- The code for launching the traversal kernels can be found in
src/cuda/CudaTracer.cpp
- The kernels themselves are located in src/rt/kernels:
fermi_speculative_while_while
Hand-tuned to yield the best performance on GTX 480.
Works on older GPUs as well, but is not optimal.
kepler_dynamic_fetch
Hand-tuned to yield the best performance on GTX 680.
Works on older GPUs as well, but is not optimal.
tesla_persistent_packet
"Persistent packet" kernel from the paper.
tesla_persistent_speculative_while_while
"Persistent speculative while-while" kernel from the paper.
This is the fastest kernel on GTX 285.
tesla_persistent_while_while
"Persistent while-while" kernel from the paper.
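The "while-while" loop structure named above can be outlined with a
single-threaded C++ sketch: an outer loop per ray, one inner loop that
descends the BVH until it reaches a leaf, and a second inner loop that tests
the primitives of that leaf. This is only an illustration of the control
flow under simplifying assumptions. It runs on the CPU, uses axis-aligned
boxes as primitives to stay short, and does not model persistent threads,
speculative traversal, or the memory layouts of the actual kernels. All type
and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Box { float lo[3], hi[3]; };
struct Ray { float orig[3], invDir[3], tMin, tMax; };

// Inner node: left/right are child indices. Leaf: left < 0 and
// [primBegin, primEnd) indexes the primitive array.
struct Node { Box bounds; int left, right, primBegin, primEnd; };

// Slab test: true if the ray overlaps the box within [ray.tMin, tMax].
inline bool hitBox(const Ray& r, const Box& b, float tMax, float& tEntry)
{
    float tNear = r.tMin, tFar = tMax;
    for (int a = 0; a < 3; ++a)
    {
        float t0 = (b.lo[a] - r.orig[a]) * r.invDir[a];
        float t1 = (b.hi[a] - r.orig[a]) * r.invDir[a];
        tNear = std::max(tNear, std::min(t0, t1));
        tFar  = std::min(tFar,  std::max(t0, t1));
    }
    tEntry = tNear;
    return tNear <= tFar;
}

// Returns the index of the closest primitive hit, or -1 if there is none.
inline int traceWhileWhile(const std::vector<Node>& nodes,
                           const std::vector<Box>& prims,
                           const Ray& ray, float& tHit)
{
    assert(!nodes.empty());
    const int sentinel = -1;          // popped when nothing is left to visit
    std::vector<int> stack(1, sentinel);
    int hitIndex = -1;
    int node = 0;                     // start at the root
    tHit = ray.tMax;
    while (node != sentinel)
    {
        // First inner loop: descend until the current node is a leaf.
        while (node != sentinel && nodes[node].left >= 0)
        {
            const Node& n = nodes[node];
            float tL, tR;
            bool goL = hitBox(ray, nodes[n.left].bounds, tHit, tL);
            bool goR = hitBox(ray, nodes[n.right].bounds, tHit, tR);
            if (goL && goR)
            {
                // Visit the nearer child first and postpone the other.
                bool leftFirst = (tL <= tR);
                stack.push_back(leftFirst ? n.right : n.left);
                node = leftFirst ? n.left : n.right;
            }
            else if (goL) node = n.left;
            else if (goR) node = n.right;
            else { node = stack.back(); stack.pop_back(); }
        }
        // Second inner loop: test every primitive in the leaf.
        if (node != sentinel)
        {
            const Node& leaf = nodes[node];
            int prim = leaf.primBegin;
            while (prim < leaf.primEnd)
            {
                float t;
                if (hitBox(ray, prims[prim], tHit, t) && t < tHit)
                {
                    tHit = t;         // shorten the ray to the closest hit so far
                    hitIndex = prim;
                }
                ++prim;
            }
            node = stack.back(); stack.pop_back();
        }
    }
    return hitIndex;
}
```

An AO-style ray would return as soon as any hit is found instead of
continuing the search for the closest one.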
Version history
---------------
Version 1.4, May 22, 2012
- Include hand-tuned kernels for Kepler-based GPUs.
- Improve fermi_speculative_while_while perf using vmin/vmax PTX instructions.
- Include San Miguel test scene in the package.
- Improve robustness of the BVH builder with degenerate input.
- Switch to New BSD License (previously Apache License 2.0).
- Upgrade to Visual Studio 2010 (previously 2008).
- Fix a CUDA compilation issue with Visual Studio Express.
- General bugfixes and improvements to framework.
Version 1.3, Jul 08, 2011
- Fix compatibility issues with CUDA 4.0.
Version 1.2, Dec 17, 2010
- Fix issues with nvcc path autodetection with CUDA 3.2.
Version 1.1, Dec 01, 2010
- Update the codebase to support GF104 and CUDA 3.2.
- Speed up ray sorting significantly by utilizing all available CPU cores.
- Minor stability improvements.
Version 1.0, Jun 29, 2010
- Initial release.
Known issues
------------
- When using CUDA 3.2 or later, the performance of device-side code drops
slightly in 64-bit builds. This is because CUDA 3.2 disallows "mixed-bitness
mode", which we utilize on earlier CUDA versions to get maximum performance.
With CUDA 3.2 or later, we must always compile device code with the same
bitness as host code, which generally results in higher register pressure in
64-bit builds.
For more information, see:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_3.2_Readiness_Tech_Brief.pdf
- The mesh importer only supports a limited subset of the Wavefront OBJ
file format. If you have trouble importing a mesh, you may want to try
enabling WAVEFRONT_DEBUG in src/framework/io/MeshWavefrontIO.cpp.
Acknowledgements
----------------
Anat Grynberg and Greg Ward for the Conference room model.
University of Utah for the Fairy scene.
Marko Dabrovic (www.rna.hr) for the Sibenik cathedral model.
Guillermo M. Leal Llaguno (www.evvisual.com) for the San Miguel model.