introduction to realtime ray tracing course 41 philipp slusallek peter shirley bill mark gordon...
TRANSCRIPT
Introduction to Realtime Ray Tracing
Course 41
Introduction to Realtime Ray Tracing
Course 41
Philipp Slusallek Peter Shirley
Bill Mark Gordon Stoll Ingo Wald
Hardware for Realtime Ray TracingHardware for Realtime Ray Tracing
• Custom Hardware for Realtime Ray Tracing
– Characteristics and requirements
– RPU Design and Implementation
• GPU + Recursion + Custom Traversal HW
– Programming Model
– FPGA Prototype
– Performance and Scalability
Ray Tracing on CPUsRay Tracing on CPUs
• Characteristics
– Commodity, well understood HW
– High FP performance, yet still too slow
– Limited parallelism, bulky clusters
– Poor silicon usage (e.g. cache)
• Outlook
– Multi-core designs are coming
– Will still take too long
Ray Tracing on GPUsRay Tracing on GPUs
• Characteristics
– Very high raw FP performance
– High degree of parallelism
– Fast development cycle
• Stream programming model
– Still too limited for efficient ray tracing
• No support for recursion
• Limited memory access
Ray Tracing Characteristics: kd-Tree TraversalRay Tracing Characteristics: kd-Tree Traversal
• One-dimensional computation along ray
– Compute location of d relative to t_min / t_max
– Iterate or recurse with updated t_max / t_max
t_min
t_max
d
t_min
t_max
dsplit
t_min
t_max d
split
Near: t_min< t_max < d Both: t_min < d < t_max Far: d < t_min < t_max
Ray Tracing Characteristics: kd-Tree TraversalRay Tracing Characteristics: kd-Tree Traversal
• Inner traversal looptmp = node.split – ray.origin
d = tmp * 1/ray.direction
near = d > t_min
far = d < t_max
if (near & far) push(node.far, d, t_max)
if (near) iterate(node.near, t_min, d)
else iterate(node.far, d, t_max)
• Advantages of using kd-trees
– Simple and fast traversal & building algorithm
– Robust & very good handling of large scenes
t_min
t_max
d
split
Ray Tracing Characteristics: kd-Tree TraversalRay Tracing Characteristics: kd-Tree Traversal
• Traversal Processing
– 50-80 k-D steps per ray @ 10 instructions/step many instructions many clock cycles
– Serial dependency low pipeline efficiency, stalls, latency
– Limited but flexible control flow and memory access
Custom HW unit
– One clock tick per traversal step (fully pipelined)
– Up to 100:1 improvement
Ray Tracing Characteristics: IntersectionRay Tracing Characteristics: Intersection
• Intersection computation
– Triggered by traversal at every leaf node
• Called with: ray and address of geometry
– Option 1: Custom hardware [SaarCOR’05]
– Option 2: Software on programmable processor
• Can be implemented efficiently
• Enables arbitrary programmable primitives
Do not use costly dedicated hardware
Ray Tracing Characteristics: ShadingRay Tracing Characteristics: Shading
• Shading computation
– Triggered by finished ray traversal
• Called with: ray, hit point, shader-id, address of parameters
– Characteristics:
• General-purpose computation, many 3-/4-vectors
• Needs support for efficient texture and memory access
• Needs support for arbitrary recursive tracing rays
– E.g. support dependent ray tracing
Main feature of ray tracing: Do not put limits on it
Ray Tracing Characteristics: CoherenceRay Tracing Characteristics: Coherence
• Ray coherence
– Neighboring primary rays
• Traverse highly similar kd-node in same order
• Often hit same geometric primitives
• Often execute the same shader, access same textures, …
– Similar for shadow rays to one light source
– Often (but not always) applies for secondary rays
HW should take advantage of this coherence
Previous WorkPrevious Work
• SaarCOR I
– Fixed function ray tracing chip [GH’05]
RPU ApproachRPU Approach
• Take GPUs as basis and core component
– Highly parallel, highly efficient
• Improve programming model
– Add efficient recursion, conditionals
– Add memory access options
• Add custom traversal unit
– Slave to RPU
– Performs indirect, data dependent functions calls
RPU DesignRPU Design
Shader Processing Units (SPU)- General purpose computation
- For shading, geometry, lighting computations
- Operates on 4-component vectors
- Integer and float
- Dual issue, split vector
- GPU-like instruction set
- Arbitrary read/write
- Texture addressing mode
- No texture filtering SW
RPU DesignRPU Design
Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU)
- Efficient traversal of k-D trees
- Communicates with SPU over dedicated registers
RPU DesignRPU Design
Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading
- Increases usage of HW resources
- Hides latency due to
- Memory access
- Instruction dependencies
- Long traversal operations
- Separate thread pool for SPU & TPU
- Software scheduling (compiler)
- No overhead for switching threads
- Increases resources (mainly register file)
RPU DesignRPU Design
Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading Chunking
- SIMD execution (SPUs & TPUs)
- Takes advantage of coherence
- Reduces hardware complexity
- Can combine of memory requests
- Reduces external bandwidth
- Must allow for incoherence
- Chunks may split at conditionals
- Inactive sub-chunk put on stack
- Masked execution
- Worst case: serial computation
RPU DesignRPU Design
Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading Chunking Mailbox Processing (MPU)
Per thread caching mechanism Avoids multiple processing
of same kd-tree entry (e.g. triangle) 10x performance for some scenes
RPU ArchitectureRPU Architecture
SPUSPUSPU
SPUSPU
PCI/AGP/PCI-XINT ERFACE
SPU
DDR-RAM 1
VM U
T PUT -CACHE
READ ONLY
SPUR4-CACHE
R/W
M PUL-CACHE
READ ONLY
M EM -IF(...)
DDR-RAM N
T hread GeneratorDM A-IF
RPU 1
(...)
RPU Chip 1FastChip Inter-
Connect
RPU 2
SPUSPUSPU
SPUSPUSPU
T PUT -CACHE
READ ONLY
SPUR4-CACHE
R/W
M PUL-CACHE
READ ONLY
RPU N
RegisterStack
VEC4-ALU,Mul, Add, ...
I0 to I3 A =(A0,A1,A2,A3)
R0 to R15
ORG, DIR,(SID,MIN, MAX, ROOT)
C
C0 ... C15
C0 ... C15
P0 to P15
set of parameter registers
hardware managed register stack, only topmost registers are visible
output registers to control the TPU
input, output and address registers for memory control
register stack frame
SPU Vector RegistersSPU Vector Registers
• All registers have 4- component (float or integer)
• R0 to R15: General registers
– Index into a HW managed register stack
– Allows for single-cycle function call
• P0 to P15: shader parameters
• I0 to I3: data read from memory
• A = (A0,A1,A2,A3)
– Memory addressing
• ORG, DIR, ...
– TPU communication registers
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Instruction Set of SPUInstruction Set of SPU
Ray traversal operation trace
Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return
Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump,
call, return
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4
Input modifiers Swizzeling, negation, masking Multiply with power of 2
Special operations (modifiers) rcp, rsq, sat
Fast 2D texture lookups texload, texload4x
Read from and write to memory load, load4x, store
Ray Triangle IntersectionUnit-Triangle TestRay Triangle IntersectionUnit-Triangle Test
; load triangle transformation load4x A.y,0
; transform ray dp3_rcp R7.z,I2,R3 dp3 R7.y,I1,R3 dp3 R7.x,I0,R3 dph3 R6.x,I0,R2 dph3 R6.y,I1,R2 dph3 R6.z,I2,R2
; compute hit distance mul R8.z,-R6.z,S.z + if z <0 return
; barycentric coordinates mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
; hit if u + v < 1 add R8.w,R8.x,R8.y + if w >=1 return
; hit distance closer than last one? add R8.w,R8.z,-R4.z + if w >=0 return
; save hit information mov SID,I3.x + mov MAX,R8.z mov R4.xyz,R8 + return
InputArithmetic (dot products)Multi-issue (arith. & cond.)
Read Instruction
Read 3 Source Registers
Swizzeling mov R0,R1* mov R2,R3
* mov R0,R2
Masking
Writeback
*
Memory Access
Writeback I0 – I3
* * *+ + +
+
ClampThread ControlBranching
StackControl
RCP,RSQ
Writeback
Masking
Shader Processing UnitPipeliningShader Processing UnitPipelining
RPU Programming ModelRPU Programming Model
• ↨: Direct function calls
• ↔: Indirect function calls via TPU
...
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
raysSPU Processing
TPU / MPU Processing
... TPU/MPU
shadow
rays
RPU Programming ModelRPU Programming Model
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
RPU Programming ModelRPU Programming Model
• Threads are started for each pixel
• Registers initialized from an input stream
– 2D Hilbert curve generator sampling the screen
– Memory stream for multi-pass
• Shader computes ray
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
RPU Programming ModelRPU Programming Model
• Threads are started
• Registers initialized from an input stream
– 2D Hilbert curve generator sampling the screen
– Memory stream for multi-pass
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
RPU Programming ModelRPU Programming Model
• Shooting Primary Rays
– Ray traversal performed onthe TPU
– Started in top-level kd-tree
– Intersector transforms ray into local coordinate system
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
top-level kd-tree
RPU Programming ModelRPU Programming Model
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
top-level kd-tree
RPU Programming ModelRPU Programming Model
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
top-level kd-tree
RPU Programming ModelRPU Programming Model
• Shooting Primary Rays (II)
– Transformed ray traversed through object kd-tree on TPU
– Geometry intersection performed on programmable SPU
– Programmable geometry: triangles, spheres, bicubic splines, quadrics, …
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
object-level kd-tree
RPU Programming ModelRPU Programming Model
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
object-level kd-tree
RPU Programming ModelRPU Programming Model
• Surface shading performed on programmable SPU
– Surface shader is called directly from primary shader
– Arguments passed on HW stack
– May trace secondary rays at any time: reflection, refraction, …
– Writing shaders is easy due to global access to the scene and physically-based computation
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
RPU Programming ModelRPU Programming Model
• Light properties and illumination can be abstracted using function calls
• Illumination shader iterates over all light sources
• For each light source a Light source shader is called
PrimaryRay
Shader
TPU/MPU
Top-LevelObject
Intersector
TPU/MPU
Surface/BRDFShader
LightSourceShader
LightingShader
TPU/MPU
LightSourceShader
GeometryIntersector
primary
ray
secondary
rays
TPU/MPU
shadow
rays
Prototype ImplementationPrototype Implementation
SPUSPUSPU
SPUSPU
PCI/AGP/PCI-XINT ERFACE
SPU
DDR-RAM 1
VM U
T PUT -CACHE
READ ONLY
SPUR4-CACHE
R/W
M PUL-CACHE
READ ONLY
M EM -IF(...)
DDR-RAM N
T hread GeneratorDM A-IF
RPU 1
(...)
RPU Chip 1FastChip Inter-
Connect
RPU 2
SPUSPUSPU
SPUSPUSPU
T PUT -CACHE
READ ONLY
SPUR4-CACHE
R/W
M PUL-CACHE
READ ONLY
RPU N
PrototypePerformancePrototypePerformance
• FPGA prototype
– Xilinx Virtex II 6000
– 128 MB DDR-RAM at 350 MB/s
– PCI bus for up-/download (no VGA)
• Single RPU at only 66 MHz
– Up to 4 million rays per second
– Up to 20 fps @ 512x384
– Same ray tracing performance as Intel P4 @ 2.66 GHz
ScalabilityScalability
• Larger Chunk Size
– Less ray coherence
– More data is accessed
– Increased cache bandwidth
– Larger caches
ScalabilityScalability
• Larger Chunk Size
• Multiple RPUs on a Chip
– Limited by
• VLSI technology
• Memory bandwidth
– FPGA prototype versus current GPUs
• Floating point units 50x
• Memory bandwidth 100x
• Clock rate 7x
RPU
RPU RPU
RPU
RAMRAM
ScalabilityScalability
• Larger Chunk Size
• Multiple RPUs on a Chip
• Multiple chips on a board
– Fast interconnect for data exchange
– Cache sizes accumulate
– Managed through virtual memory [Schmittler’2003]
– Limited through external bandwidth due to scene changes
RAMRAM RPU C HIP RAMRAM RPU C HIP
RPU board
ScalabilityScalability
• Larger Chunk Size
• Multiple RPUs on a Chip
• Multiple chips on a board
• Multiple boards in a PC
– Similar to today’s PC clusters in a much smaller form factor
RAMRAM RPU C HIP RAMRAM RPU C HIP
RPU board
RAMRAM RPU C HIP RAMRAM RPU C HIP
RPU board
RAMRAM RPU C HIP RAMRAM RPU C HIP
RPU board
RAMRAM RPU C HIP RAMRAM RPU C HIP
RPU board
VideoVideo
Future WorkFuture Work
• Support for fully dynamic scenes
– Vertex shader + building kd-trees
• Efficient photon mapping
– kd-tree construction + kNN filtering
• OpenRT-API [Dietrich’03]
• ASIC prototype
Questions?Questions?
http://graphics.cs.uni-sb.de
http://www.OpenRT.de
http://www.SaarCOR.de