
Introduction to Realtime Ray Tracing

Course 41


Philipp Slusallek Peter Shirley

Bill Mark Gordon Stoll Ingo Wald

Hardware for Realtime Ray Tracing

• Custom Hardware for Realtime Ray Tracing

– Characteristics and requirements

– RPU Design and Implementation

• GPU + Recursion + Custom Traversal HW

– Programming Model

– FPGA Prototype

– Performance and Scalability

Ray Tracing on CPUs

• Characteristics

– Commodity, well understood HW

– High FP performance, yet still too slow

– Limited parallelism, bulky clusters

– Poor silicon usage (e.g. cache)

• Outlook

– Multi-core designs are coming

– Will still take too long

Ray Tracing on GPUs

• Characteristics

– Very high raw FP performance

– High degree of parallelism

– Fast development cycle

• Stream programming model

– Still too limited for efficient ray tracing

• No support for recursion

• Limited memory access

Ray Tracing Characteristics: kd-Tree Traversal

• One-dimensional computation along ray

– Compute location of the split distance d relative to t_min / t_max

– Iterate or recurse with updated t_min / t_max

[Figure: three cases for the distance d to the split plane along the ray]

– Near: t_min < t_max < d

– Both: t_min < d < t_max

– Far: d < t_min < t_max


• Inner traversal loop

    tmp = node.split - ray.origin
    d = tmp * 1/ray.direction
    near = d > t_min
    far = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else iterate(node.far, d, t_max)

• Advantages of using kd-trees

– Simple and fast traversal & building algorithm

– Robust & very good handling of large scenes
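The inner traversal loop above can be sketched in software as follows. This is a minimal illustration, not the TPU implementation: the `Node` class, the `traverse` function, and the choice of near/far child by the sign of the ray direction are assumptions made for the example. Unlike the slide pseudocode, the sketch leaves the ray interval unchanged when only one child is visited, so the interval never grows beyond the current node. It only records which leaves are visited and runs no intersection tests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """kd-tree node; a node is a leaf when `prims` is not None."""
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None    # child below the split plane
    right: Optional["Node"] = None   # child above the split plane
    prims: Optional[List[str]] = None


def traverse(root, origin, direction, t_min, t_max):
    """Return the primitive lists of all leaves the ray visits, front to back."""
    visited = []
    stack = []  # deferred far children with their ray intervals
    node = root
    while True:
        while node.prims is None:
            o, dir_k = origin[node.axis], direction[node.axis]
            if dir_k == 0.0:
                # Ray parallel to the split plane: it stays on one side.
                node = node.left if o <= node.split else node.right
                continue
            d = (node.split - o) / dir_k
            # Assumption: near/far are chosen by the sign of the direction.
            if dir_k > 0:
                near_child, far_child = node.left, node.right
            else:
                near_child, far_child = node.right, node.left
            near = d > t_min
            far = d < t_max
            if near and far:
                stack.append((far_child, d, t_max))
                node, t_max = near_child, d
            elif near:
                node = near_child   # only the near child, interval unchanged
            else:
                node = far_child    # only the far child, interval unchanged
        visited.append(node.prims)
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()


tree = Node(axis=0, split=1.0,
            left=Node(prims=["A"]),
            right=Node(axis=0, split=2.0,
                       left=Node(prims=["B"]), right=Node(prims=["C"])))
print(traverse(tree, (0.0, 0.5, 0.5), (1.0, 0.0, 0.0), 0.0, 3.0))
```

A real traversal would intersect the primitives of each leaf and stop at the first hit inside the current interval.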



• Traversal Processing

– 50-80 kd-tree steps per ray at 10 instructions per step: many instructions, many clock cycles

– Serial dependency: low pipeline efficiency, stalls, latency

– Limited but flexible control flow and memory access

• Custom HW unit

– One clock tick per traversal step (fully pipelined)

– Up to 100:1 improvement
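As a rough check of the numbers above, the instruction count per ray can be worked out directly. The function name is hypothetical. The sketch covers instruction and cycle counts only, and does not model the stalls and latency that the slide also lists, so it does not by itself reproduce the 100:1 figure.

```python
def instructions_per_ray(steps, instructions_per_step=10):
    """Software cost of traversal: steps times instructions per step."""
    return steps * instructions_per_step


for steps in (50, 80):
    software = instructions_per_ray(steps)
    hardware_cycles = steps  # one clock tick per step when fully pipelined
    print(steps, software, hardware_cycles)
```

With 50 to 80 steps this gives 500 to 800 instructions per ray in software, against 50 to 80 clock ticks for the pipelined unit.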

Ray Tracing Characteristics: Intersection

• Intersection computation

– Triggered by traversal at every leaf node

• Called with: ray and address of geometry

– Option 1: Custom hardware [SaarCOR’05]

– Option 2: Software on programmable processor

• Can be implemented efficiently

• Enables arbitrary programmable primitives

Do not use costly dedicated hardware

Ray Tracing Characteristics: Shading

• Shading computation

– Triggered by finished ray traversal

• Called with: ray, hit point, shader-id, address of parameters

– Characteristics:

• General-purpose computation, many 3-/4-vectors

• Needs support for efficient texture and memory access

• Needs support for arbitrary recursive tracing of rays

– E.g. support dependent ray tracing

Main feature of ray tracing: Do not put limits on it

Ray Tracing Characteristics: Coherence

• Ray coherence

– Neighboring primary rays

• Traverse highly similar kd-tree nodes in the same order

• Often hit same geometric primitives

• Often execute the same shader, access same textures, …

– Similar for shadow rays to one light source

– Often (but not always) applies for secondary rays

HW should take advantage of this coherence

Previous Work

• SaarCOR I

– Fixed function ray tracing chip [GH’05]

RPU Approach

• Take GPUs as basis and core component

– Highly parallel, highly efficient

• Improve programming model

– Add efficient recursion, conditionals

– Add memory access options

• Add custom traversal unit

– Slave to RPU

– Performs indirect, data-dependent function calls

RPU Design

Shader Processing Units (SPU)

- General purpose computation

- For shading, geometry, lighting computations

- Operates on 4-component vectors

- Integer and float

- Dual issue, split vector

- GPU-like instruction set

- Arbitrary read/write

- Texture addressing mode

- No texture filtering in hardware (done in software)


Custom Ray Traversal Unit (TPU)

- Efficient traversal of k-D trees

- Communicates with SPU over dedicated registers


Multi-Threading

- Increases usage of HW resources

- Hides latency due to

- Memory access

- Instruction dependencies

- Long traversal operations

- Separate thread pool for SPU & TPU

- Software scheduling (compiler)

- No overhead for switching threads

- Increases resources (mainly register file)


Chunking

- SIMD execution (SPUs & TPUs)

- Takes advantage of coherence

- Reduces hardware complexity

- Can combine memory requests

- Reduces external bandwidth

- Must allow for incoherence

- Chunks may split at conditionals

- Inactive sub-chunk put on stack

- Masked execution

- Worst case: serial computation
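The chunk splitting described above can be sketched in software as follows. This is a minimal illustration under stated assumptions: `run_conditional` and its arguments are hypothetical names, a chunk is modeled as a Python list, and only a single non-nested conditional is handled.

```python
def run_conditional(chunk, cond, then_fn, else_fn):
    """Run `then_fn if cond else else_fn` over a chunk using lane masks.

    Returns the per-lane results and the number of masked passes needed.
    """
    mask = [cond(x) for x in chunk]
    stack = []
    if not all(mask):
        # Lanes that fail the condition form the inactive sub-chunk,
        # which is put on the stack and executed later.
        stack.append(([not m for m in mask], else_fn))
    if any(mask):
        stack.append((mask, then_fn))
    result = list(chunk)
    passes = 0
    while stack:
        lane_mask, fn = stack.pop()
        passes += 1
        for i, active in enumerate(lane_mask):
            if active:  # masked execution: inactive lanes are skipped
                result[i] = fn(chunk[i])
    return result, passes


double = lambda x: x * 2
negate = lambda x: -x
print(run_conditional([1, 2, 3, 4], lambda x: x > 0, double, negate))
print(run_conditional([1, -2, 3, -4], lambda x: x > 0, double, negate))
```

A coherent chunk needs one pass, while a divergent chunk needs one pass per branch. With nested conditionals a chunk can keep splitting, which is how the serial worst case arises.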


Mailbox Processing (MPU)

- Per-thread caching mechanism

- Avoids multiple processing of the same kd-tree entry (e.g. triangle)

- 10x performance for some scenes
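The idea behind mailboxing can be illustrated with a small per-ray cache of primitive IDs. This is only a sketch: the `Mailbox` class, the direct-mapped slot layout, and the slot count are assumptions for the example and are not taken from the MPU design.

```python
class Mailbox:
    """Small direct-mapped cache of primitive IDs already tested by one ray."""

    def __init__(self, slots=4):
        self.slots = [None] * slots

    def seen(self, prim_id):
        """Return True if prim_id was recorded earlier; otherwise record it."""
        i = prim_id % len(self.slots)
        if self.slots[i] == prim_id:
            return True
        self.slots[i] = prim_id  # may evict another ID that maps to this slot
        return False


def count_tests(leaves, mailbox=None):
    """Count the intersection tests needed for the leaves a ray visits."""
    tests = 0
    for leaf in leaves:
        for prim_id in leaf:
            if mailbox is not None and mailbox.seen(prim_id):
                continue  # already processed for this ray, skip the test
            tests += 1
    return tests


# Primitive 1 overlaps three leaves, primitive 2 overlaps two.
leaves = [[1, 2], [1, 3], [1, 2]]
print(count_tests(leaves), count_tests(leaves, Mailbox()))
```

Because the cache is small, an evicted ID is simply tested again, so the result stays correct and only the saving is reduced.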

RPU Architecture

[Figure: RPU architecture block diagram. Each RPU (RPU 1 ... RPU N) contains several SPUs with an R4-cache (R/W), a TPU with a T-cache (read only), and an MPU with an L-cache (read only). RPU Chip 1 also contains a thread generator, a DMA interface, a PCI/AGP/PCI-X interface, a VMU, and a memory interface to DDR-RAM 1 ... DDR-RAM N. A fast chip interconnect links the chips.]

SPU Vector Registers

[Figure: SPU registers feeding the VEC4 ALU (mul, add, ...). P0 to P15: set of parameter registers. R0 to R15: register stack frame in a hardware managed register stack, only the topmost registers are visible. ORG, DIR, (SID, MIN, MAX, ROOT): output registers to control the TPU. I0 to I3 and A = (A0,A1,A2,A3): input, output and address registers for memory control. Also shown: C0 ... C15.]

• All registers have 4 components (float or integer)

• R0 to R15: General registers

– Index into a HW managed register stack

– Allows for single-cycle function call

• P0 to P15: shader parameters

• I0 to I3: data read from memory

• A = (A0,A1,A2,A3)

– Memory addressing

• ORG, DIR, ...

– TPU communication registers

Instruction Set of SPU

• Short vector instruction set

– mov, add, mul, mad, frac

– dph2, dp3, dph3, dp4

• Input modifiers

– Swizzling, negation, masking

– Multiply with power of 2

• Special operations (modifiers)

– rcp, rsq, sat

• Fast 2D texture lookups

– texload, texload4x

• Read from and write to memory

– load, load4x, store

• Ray traversal operation

– trace

• Conditional instructions (paired)

– if <condition> jmp label

– if <condition> call <fun>

– if <condition> return

• Dual issue (pairing)

– 3/1 and 2/2 arithmetic splitting

– Arithmetic + load

– Arithmetic + conditional jump, call, return

Ray Triangle Intersection: Unit-Triangle Test

    ; load triangle transformation
    load4x A.y,0

    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3     R7.y,I1,R3
    dp3     R7.x,I0,R3
    dph3    R6.x,I0,R2
    dph3    R6.y,I1,R2
    dph3    R6.z,I2,R2

    ; compute hit distance
    mul R8.z,-R6.z,S.z     + if z <0 return

    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6   + if or xy (<0 or >=1) return

    ; hit if u + v < 1
    add R8.w,R8.x,R8.y     + if w >=1 return

    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z    + if w >=0 return

    ; save hit information
    mov SID,I3.x           + mov MAX,R8.z
    mov R4.xyz,R8          + return

Highlighted on the slide: input, arithmetic (dot products), multi-issue (arithmetic & conditional)
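The unit-triangle test above can be followed more easily in a high-level sketch. The function below is an illustration and not a translation of the RPU code: the name `unit_triangle_hit`, the row layout of the transformation, and the explicit check for rays parallel to the triangle plane are assumptions. The transformation is taken to map the triangle to the unit triangle with vertices (0,0,0), (1,0,0), (0,1,0).

```python
def unit_triangle_hit(rows, origin, direction, t_best=float("inf")):
    """Return (t, u, v) for a hit closer than t_best, else None.

    rows: three 4-tuples, the affine world-to-unit-triangle transformation.
    """
    def dp3(r, v):   # 3-component dot product
        return r[0] * v[0] + r[1] * v[1] + r[2] * v[2]

    def dph3(r, v):  # homogeneous dot product: adds the translation term
        return dp3(r, v) + r[3]

    # Transform the ray: direction without translation, origin with it.
    d = [dp3(r, direction) for r in rows]
    o = [dph3(r, origin) for r in rows]
    if d[2] == 0.0:
        return None  # ray parallel to the triangle plane (added check)
    # Hit distance: where the transformed ray crosses the plane z = 0.
    t = -o[2] / d[2]
    if t < 0:
        return None
    # Barycentric coordinates are the x/y position of the hit point.
    u = o[0] + t * d[0]
    v = o[1] + t * d[1]
    if u < 0 or v < 0 or u + v >= 1:
        return None
    if t >= t_best:
        return None  # not closer than the last hit
    return t, u, v


identity = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]
print(unit_triangle_hit(identity, (0.25, 0.25, 1.0), (0.0, 0.0, -1.0)))
```

Storing one transformation per triangle reduces the test to a few dot products and comparisons, which suits a short-vector instruction set.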

Shader Processing Unit Pipelining

[Figure: SPU pipeline. Stages shown: Read Instruction; Read 3 Source Registers; Swizzling; Masking; multiply (*) and add (+) stages; RCP, RSQ; Clamp; Masking; Writeback. Side paths: Memory Access with writeback to I0 – I3; Thread Control; Branching; Stack Control. Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2.]

RPU Programming Model

• ↨: Direct function calls

• ↔: Indirect function calls via TPU

[Figure: call graph of the RPU programming model. SPU processing: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. TPU/MPU processing: traversal of primary rays, secondary rays, and shadow rays between these shaders.]




• Threads are started for each pixel

• Registers initialized from an input stream

– 2D Hilbert curve generator sampling the screen

– Memory stream for multi-pass

• Shader computes ray
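A 2D Hilbert curve generator of the kind mentioned above can be sketched with the standard index-to-coordinate conversion. The function name and the restriction to a square grid with a power-of-two side length are assumptions for this example; the slides do not say how the hardware generator handles a non-square screen.

```python
def hilbert_d2xy(order, d):
    """Map index d in [0, 4**order) to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Pixel order for a 2x2 block of the screen.
print([hilbert_d2xy(1, d) for d in range(4)])
```

Consecutive indices always map to neighboring pixels, so threads started one after another trace neighboring rays, which is the coherence the chunked execution relies on.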



• Shooting Primary Rays

– Ray traversal performed on the TPU

– Started in top-level kd-tree

– Intersector transforms ray into local coordinate system



• Shooting Primary Rays (II)

– Transformed ray traversed through object kd-tree on TPU

– Geometry intersection performed on programmable SPU

– Programmable geometry: triangles, spheres, bicubic splines, quadrics, …



• Surface shading performed on programmable SPU

– Surface shader is called directly from primary shader

– Arguments passed on HW stack

– May trace secondary rays at any time: reflection, refraction, …

– Writing shaders is easy due to global access to the scene and physically-based computation



• Light properties and illumination can be abstracted using function calls

• Illumination shader iterates over all light sources

• For each light source a Light source shader is called
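The abstraction described above, a lighting shader that calls one light source shader per light, can be sketched with ordinary function calls. All names here are hypothetical, the point light with inverse-square falloff and the cosine-weighted sum are assumptions chosen for illustration, and the `occluded` callback stands in for tracing a shadow ray.

```python
def point_light(position, intensity):
    """Build a light source shader for a point light (assumed light model)."""
    def shader(hit_point):
        direction = tuple(p - h for p, h in zip(position, hit_point))
        dist2 = sum(c * c for c in direction)
        return direction, intensity / dist2  # inverse-square falloff
    return shader


def lighting_shader(hit_point, normal, lights, occluded=lambda p, d: False):
    """Iterate over all light sources and sum their contributions."""
    total = 0.0
    for light in lights:
        direction, radiance = light(hit_point)  # light source shader call
        length = sum(c * c for c in direction) ** 0.5
        cos_theta = sum(n * c for n, c in zip(normal, direction)) / length
        if cos_theta > 0 and not occluded(hit_point, direction):
            total += radiance * cos_theta
    return total


lights = [point_light((0.0, 0.0, 2.0), 4.0), point_light((0.0, 0.0, 2.0), 4.0)]
print(lighting_shader((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), lights))
```

Because each light is just a callable, new light types can be added without changing the lighting shader.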


Prototype Implementation

[Figure: the same block diagram as on the RPU Architecture slide]

Prototype Performance

• FPGA prototype

– Xilinx Virtex II 6000

– 128 MB DDR-RAM at 350 MB/s

– PCI bus for up-/download (no VGA)

• Single RPU at only 66 MHz

– Up to 4 million rays per second

– Up to 20 fps @ 512x384

– Same ray tracing performance as Intel P4 @ 2.66 GHz
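The frame rate and ray rate above are consistent with each other, as a short calculation shows. The function name is hypothetical, and the assumption of exactly one ray per pixel is made only for this check; the slides do not state which rays are counted.

```python
def rays_per_second(width, height, fps, rays_per_pixel=1):
    """Ray throughput needed for a given resolution and frame rate."""
    return width * height * fps * rays_per_pixel


rate = rays_per_second(512, 384, 20)
print(rate)                # 3932160, close to 4 million rays per second
print(66_000_000 / rate)   # clock cycles available per ray at 66 MHz
```

At 66 MHz this leaves roughly 17 clock cycles per ray for a single RPU.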

Scalability

• Larger Chunk Size

– Less ray coherence

– More data is accessed

– Increased cache bandwidth

– Larger caches



• Multiple RPUs on a Chip

– Limited by

• VLSI technology

• Memory bandwidth

– FPGA prototype versus current GPUs

• Floating point units 50x

• Memory bandwidth 100x

• Clock rate 7x

[Figure: one chip with multiple RPUs, connected to RAM]




• Multiple chips on a board

– Fast interconnect for data exchange

– Cache sizes accumulate

– Managed through virtual memory [Schmittler’2003]

– Limited through external bandwidth due to scene changes

[Figure: RPU board with multiple RPU chips, each with its own RAM]




• Multiple chips on a board

• Multiple boards in a PC

– Similar to today’s PC clusters in a much smaller form factor


[Figure: four RPU boards in one PC]

Video

Future Work

• Support for fully dynamic scenes

– Vertex shader + building kd-trees

• Efficient photon mapping

– kd-tree construction + kNN filtering

• OpenRT-API [Dietrich’03]

• ASIC prototype

Questions?

http://graphics.cs.uni-sb.de

http://www.OpenRT.de

http://www.SaarCOR.de
