
High Performance Graphics (2010), M. Doggett, S. Laine, and W. Hunt (Editors)

Architecture Considerations for Tracing Incoherent Rays

Timo Aila Tero Karras

NVIDIA Research

Abstract

This paper proposes a massively parallel hardware architecture for efficient tracing of incoherent rays, e.g. for global illumination. The general approach is centered around hierarchical treelet subdivision of the acceleration structure and repeated queueing/postponing of rays to reduce cache pressure. We describe a heuristic algorithm for determining the treelet subdivision, and show that our architecture can reduce the total memory bandwidth requirements by up to 90% in difficult scenes. Furthermore the architecture allows submitting rays in an arbitrary order with practically no performance penalty. We also conclude that scheduling algorithms can have an important effect on results, and that using fixed-size queues is not an appealing design choice. Increased auxiliary traffic, including traversal stacks, is identified as the foremost remaining challenge of this architecture.

Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Hardware Architecture—Graphics Processors; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Raytracing

1. Introduction

We study how GPUs and other massively parallel computing architectures could evolve to more efficiently trace incoherent rays in massive scenes. In particular our focus is on how architectural as well as algorithmic decisions affect the memory bandwidth requirements in cases where memory traffic is the primary factor limiting performance. In these cases, the concurrently executing rays tend to access different portions of the scene data, and their collective working set is too large to fit into caches. To alleviate this effect, the execution needs to be modified in some non-trivial ways.

It is somewhat poorly understood when exactly the memory bandwidth is the primary performance bottleneck of ray tracing. It was widely believed to already be the case on GPUs, but Aila and Laine [AL09] recently concluded that in simple scenes it was not, even without caches. Yet, those scenes were very simple. We believe that applications will generally follow the trend seen in rendered movies; for example, Avatar already has billions of geometric primitives in most complex shots. If we furthermore target incoherent rays arising from global illumination computations, it seems unlikely that currently available caches would be large enough to absorb a significant portion of the traffic.

Also, the processing cores of current massively parallel computing systems have not been tailored for ray tracing computations, and significant performance improvements could be obtained by using e.g. fixed-function hardware for traversal and intersection [SWS02, SWW∗04, WSS05]. However, for such custom units to be truly useful, we must be able to feed them enough data, and eventually we will need to sustain immense data rates when the rays are highly incoherent, perhaps arising from global illumination computation. In this paper we seek architectural solutions for situations where memory traffic is the primary bottleneck, without engaging in further discussion about when and where these conditions might arise.

We study this problem on a hypothetical parallel computing architecture, which is sized according to NVIDIA Fermi [NVI10]. We observe that with current approaches the memory bandwidth requirements can be well over a hundred times higher than the theoretical lower bound, and that traversal stack traffic is a significant issue. We build on the ray scheduling work of Pharr et al. [PKGH97] and Navratil et al. [NFLM07], and extend their concept of queue-based scheduling to massively parallel architectures. We analyze the primary architectural requirements as well as the bottlenecks in terms of memory traffic.

Our results indicate that scheduling algorithms can have an important effect on results, and that using fixed-size queues is not an appealing design choice. While our approach can reduce the scene data-related transfers by up to 95%, the overall savings are somewhat smaller due to traversal stacks and newly introduced queue traffic. An important feature of the proposed architecture is that the memory traffic is virtually independent of ray ordering.

                          VEGETATION   HAIRBALL   VEYRON
Number of triangles       1.1M         2.9M       1.3M
Number of BVH nodes       629K         1089K      751K
Total memory footprint    86MB         209MB      104MB
Traffic lower bound       158MB        104MB      47MB

Table 1: Key statistics of the test scenes. Half of the BVH nodes are leaves. VEGETATION and HAIRBALL are difficult scenes considering their triangle count, whereas VEYRON has very simple structure.

2. Test setup

We developed a custom architecture simulator for the measurements. In order to have a realistic amount of parallelism, we sized our ray tracing hardware roughly according to NVIDIA Fermi: 16 processors, each 32-wide SIMD, 32×32 threads/processor for latency hiding, each processor coupled with a private L1 cache (48KB, 128B lines, 6-way), and a single L2 cache (768KB, 128B lines, 16-way) shared among the processors. Throughout this paper we will use "B" to denote byte. Additionally, we chose round robin scheduling and implemented the L1 and L2 as conventional write-back caches with LRU eviction policy.
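For concreteness, the behavior we simulate for each cache level is that of an ordinary set-associative write-back LRU cache; the following C++ sketch illustrates it (class and field names are ours, not part of the simulator):

    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <vector>

    // Illustrative model of one cache level; the sizes below follow the text
    // (48KB 6-way L1, 768KB 16-way L2, 128B lines, write-back, LRU).
    struct CacheConfig {
        uint32_t lineBytes, ways, totalBytes;
        uint32_t numSets() const { return totalBytes / (lineBytes * ways); }
    };

    class WriteBackCache {
    public:
        explicit WriteBackCache(const CacheConfig& cfg) : cfg_(cfg), sets_(cfg.numSets()) {}

        uint64_t misses = 0, writeBacks = 0;       // traffic towards the next level

        // Returns true on a hit; on a miss the line is filled and a dirty LRU
        // victim is counted as a write-back towards the next level / DRAM.
        bool access(uint64_t address, bool isWrite) {
            uint64_t line = address / cfg_.lineBytes;
            Set& set = sets_[line % sets_.size()];
            auto it = set.lookup.find(line);
            if (it != set.lookup.end()) {                          // hit: move to MRU
                set.lru.splice(set.lru.begin(), set.lru, it->second);
                set.lru.front().dirty |= isWrite;
                return true;
            }
            ++misses;
            if (set.lru.size() == cfg_.ways) {                     // evict LRU line
                if (set.lru.back().dirty) ++writeBacks;
                set.lookup.erase(set.lru.back().tag);
                set.lru.pop_back();
            }
            set.lru.push_front(Line{line, isWrite});
            set.lookup[line] = set.lru.begin();
            return false;
        }

    private:
        struct Line { uint64_t tag; bool dirty; };
        struct Set {
            std::list<Line> lru;                                   // front = most recently used
            std::unordered_map<uint64_t, std::list<Line>::iterator> lookup;
        };
        CacheConfig cfg_;
        std::vector<Set> sets_;
    };

    // Example configurations matching the text.
    static const CacheConfig kL1{128, 6, 48 * 1024};
    static const CacheConfig kL2{128, 16, 768 * 1024};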

We assume that the processors are very fast in acceleration structure traversal and primitive intersection tasks, possibly due to a highly specialized instruction set or dedicated fixed-function units, similar to the RPU architecture [WSS05]. Whether the processors are programmable or hardwired logic is not relevant for the rest of the paper.

Our primary focus will be on the amount of data transferred between the chip and external DRAM. We count the number of DRAM atoms read and written. The size of the atoms is 32B, which gives good performance with currently available memories. We ignore several intricate details of DRAM, such as page activations or bus flips (read/write), because existing chips already have formidable machinery for optimizing the request streams for these aspects.

We assume that the memory transfer bottleneck is either between L2 and DRAM or between L1 and L2, and not elsewhere in the memory hierarchy. As a consequence, we require that each L1 can access multiple cachelines per clock (e.g. [Kra07] p. 110), as otherwise the L1 fetches could become a considerable bottleneck. We also acknowledge that L2 caches are not infinitely fast compared to DRAM — for example AMD HD 5870's L2 is only about 3× faster than DRAM — so we need to keep an eye on the ratio between the total L2 and DRAM bandwidth requirements.

2.1. Rays

We are primarily interested in incoherent ray loads that could arise in global illumination computations. These rays typically start from surfaces and need to find the closest intersection; furthermore any larger collection of rays tends to contain rays going in almost all directions and covering large portions of the scene. The exact source of such rays is not important for our study, and we choose to work with diffuse interreflection rays. We first determine the primary hit point for each pixel of a 512×384 viewport, and then generate 16 diffuse interreflection rays per pixel, distributed according to the Halton sequence [Hal60] on a hemisphere and furthermore randomly rotated around the normal vector. This gives us 3 million diffuse rays, and we submit these in batches of 1 million rays to the simulator. Each batch corresponds to a rectangle on the screen (256×256, 256×256, 512×128); we did not explore other batch sizes or shapes.
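The ray generation can be sketched as follows; the paper specifies only the Halton sequence and a random rotation about the normal, so the cosine-weighted hemisphere mapping and the helper names below are our assumptions:

    #include <cmath>
    #include <cstdint>

    struct Vec3 { float x, y, z; };

    // Radical inverse in a given base; a 2D Halton sample i uses bases 2 and 3.
    static float radicalInverse(uint32_t i, uint32_t base) {
        float inv = 1.0f / base, f = inv, result = 0.0f;
        while (i > 0) {
            result += f * (i % base);
            i /= base;
            f *= inv;
        }
        return result;
    }

    // Map a 2D Halton sample to a hemisphere direction around the local z axis
    // (cosine-weighted here), then apply a per-pixel random rotation phi0 around
    // the normal, as described in the text.
    static Vec3 diffuseDirection(uint32_t sampleIndex, float phi0) {
        float u = radicalInverse(sampleIndex, 2);
        float v = radicalInverse(sampleIndex, 3);
        float r = std::sqrt(u);
        float phi = 6.2831853f * v + phi0;        // random rotation around the normal
        return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u) };
    }
    // The local direction is finally transformed into the frame defined by the
    // surface normal at the primary hit point.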

To gauge the effect of ray ordering we sort the rays inside a batch in two ways prior to sending them to the simulator. MORTON sorting orders the rays according to a 6D (origin and direction) space-filling curve. This order is probably better than what one could realistically accomplish in practice. RANDOM sorting orders the rays randomly, arriving at pretty much the worst possible ordering. From a usability point of view, an ideal ray tracing platform would allow submitting the rays in a more or less arbitrary order without compromising the performance. Also, the benefit of static reordering can be expected to diminish with increasing scene complexity as the nodes visited by two neighboring rays will generally be further apart in the tree.
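MORTON sorting can be illustrated with the following sketch; the quantization resolution and exact key layout are our assumptions, as the paper does not specify them:

    #include <algorithm>
    #include <cstdint>

    // Quantize one coordinate into [0, 2^bits - 1] given scene/direction bounds.
    static uint32_t quantize(float v, float lo, float hi, int bits) {
        float t = std::min(std::max((v - lo) / (hi - lo), 0.0f), 1.0f);
        return (uint32_t)(t * (float)((1u << bits) - 1));
    }

    // Interleave the bits of six quantized coordinates (origin xyz, direction xyz),
    // most significant bit first, into a single 6D Z-order (Morton) sort key.
    static uint64_t mortonKey6D(const uint32_t q[6], int bits) {
        uint64_t key = 0;
        for (int b = bits - 1; b >= 0; --b)
            for (int d = 0; d < 6; ++d)
                key = (key << 1) | ((q[d] >> b) & 1u);
        return key;
    }
    // Rays in a batch are then sorted by this key (e.g. std::sort over
    // (key, rayIndex) pairs); RANDOM ordering is simply a random shuffle.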


Figure 1: High-level view of our architecture along with the most important aspects of data flow (processors with 32 warps each and a launcher unit, private L1 caches, a shared L2, and DRAM holding the rays, scene data, queues, and traversal stacks).

2.2. Scenes

Ideally, we would conduct experiments with very large and difficult scenes. Due to simulator infrastructure weaknesses (execution speed, memory usage) we had to settle for much smaller scenes, in the range of 1–5M triangles, but we were able to select scenes with fairly difficult structure. In fact, the number of triangles in a scene is a poor predictor for the amount of memory traffic. With highly tessellated compact objects such as cars and statues, rays tend to terminate quickly after having descended to a leaf node. With vegetation and other organic shapes, on the other hand, it is common for rays to graze multiple surfaces before finally intersecting one. We therefore selected two organic scenes, VEGETATION and HAIRBALL, in addition to a car interior, VEYRON. Key statistics of the scenes are shown in Table 1. Traffic lower bound is the total amount of scene-related traffic when each node and triangle is fetched exactly once per batch. It can exceed the footprint of the scene because we trace the rays in three batches.

We use bounding volume hierarchies (BVH) built using greedy surface area heuristic (SAH) without triangle splitting. The maximum size of a leaf node is set to 8 triangles.

3. Architecture

We will start by explaining some basic concepts and the functionality of our architecture. We then run an existing GPU ray tracer on our architecture and analyze its memory bandwidth requirements. It turns out that stack-related traffic is an important subproblem, and we introduce a simple approach for avoiding most of it. The ideas in this section can be employed independently from the more advanced aspects of our architecture that will be explained in Section 4.

Figure 1 illustrates the basic concepts of our architecture. Each processor hosts 32 warps and each warp consists of a static collection of 32 threads that execute simultaneously in SIMD fashion. We assume one-to-one mapping between rays and threads. Traversal and intersection require approximately 30 registers for each thread (32 bits per register); about 12 for storing the ray and ray state (Table 2), and the rest for intermediate results during computation. Each processor contains a launcher unit that is responsible for fetching work from a queue the processor is currently bound to. The queue stores ray states, and the ray itself is fetched later directly from DRAM. The launcher copies the data from the queue to a warp's registers. Once the warp gets full, it is ready for execution, and the launcher starts filling the next vacant warp if one exists.

Data type     Size   Contents
Ray           32B    Origin, direction, tmin, tmax
Ray state     16B    Ray index (24 bits), stack pointer (8 bits), current node index, closest hit (index, t)
Node          32B    AABB, child node indices
Triangle      32B    Three vertex positions, padding
Stack entry   4B     Node index

Table 2: The most common data types.
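Some of these records could be laid out as follows (an illustrative C++ sketch; only the sizes and field lists come from Table 2, while the exact bit positions and field order are assumptions; the Triangle and Stack entry types are omitted):

    #include <cstdint>

    // 32B ray, as in Table 2.
    struct Ray {
        float origin[3];
        float tmin;
        float direction[3];
        float tmax;
    };

    // 16B ray state, as in Table 2: 24-bit ray index and 8-bit stack pointer
    // packed into one word, plus the current node index and the closest hit so far.
    struct RayState {
        uint32_t rayIndexAndStackPtr;   // rayIndex in bits 0..23, stackPtr in bits 24..31
        uint32_t currentNode;           // treelet index + node index within the treelet
        uint32_t hitIndex;              // closest hit primitive so far
        float    hitT;                  // closest hit distance so far
    };

    // 32B BVH node: axis-aligned bounding box and the indices of the two children.
    struct Node {
        float    bboxMin[3];
        float    bboxMax[3];
        uint32_t leftChild;
        uint32_t rightChild;
    };

    static_assert(sizeof(Ray) == 32 && sizeof(RayState) == 16 && sizeof(Node) == 32,
                  "sizes should match Table 2");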

At this point, it is sufficient to only consider a single queue (i.e. the input queue), to which all processors are bound from the beginning. The queue is initialized to contain all the rays of a batch, and processors consume rays from it whenever one or more of their warps become vacant. Since ray states in the input queue always represent new rays that start from the root, it would be possible to optimize their memory footprint by omitting the unnecessary fields. However, the total traffic generated by the input queue is negligible in any case, so we choose to ignore such optimizations for simplicity. A scenario with multiple queues is introduced in Section 4.

3.1. Work compaction

Whenever a warp notices that more than 50% of its rays have terminated, it triggers compaction if the launcher grants permission for it. The terminated rays write their result to memory and are ejected from the processor. Non-terminated rays are sent back to the launcher, which then copies their data to the warp it is currently filling. The process is similar to fetching new rays from a queue, but in case of compaction we also copy the ray to avoid having to read it again. Since compaction is a relatively rare operation, it does not have to happen instantaneously.
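A minimal sketch of this compaction trigger (the data structures and the launcher handshake are our own simplifications):

    #include <cstdint>
    #include <vector>

    struct WarpSlot { bool terminated; uint32_t rayIndex; /* plus register state ... */ };

    // After a traversal step, a warp checks whether more than half of its rays
    // have terminated; if so, and if the launcher grants permission, it compacts.
    static bool shouldCompact(const std::vector<WarpSlot>& warp) {
        size_t terminated = 0;
        for (const WarpSlot& s : warp)
            terminated += s.terminated ? 1 : 0;
        return terminated * 2 > warp.size();
    }

    // Terminated rays write their results and are ejected; live rays are handed
    // back to the launcher, which copies them into the warp it is currently filling.
    static void compact(std::vector<WarpSlot>& warp, std::vector<WarpSlot>& launcherFill) {
        for (const WarpSlot& s : warp)
            if (!s.terminated)
                launcherFill.push_back(s);    // ray data is copied, not re-read from DRAM
        warp.clear();                         // the warp becomes vacant
    }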

Compaction is very important with highly divergent ray loads. Without it the percentage of non-terminated rays dropped to as low as 25% in our tests. With this kind of simple compaction we can sustain around 60–75% even in difficult scenarios, which suggests over 2× potential increase in overall performance. However, since compaction increases the number of concurrently executing rays, their collective working set grows as well. While this inevitably puts more pressure on the caches, it is hard to imagine a situation where the resulting increase in memory traffic would nullify the benefits of increased parallelism.

Compaction is turned on in all tests so that the numbers are comparable.

Test setup      Total traffic   Scene traffic   Stack traffic   L1↔L2 vs.          Threads
                L2↔DRAM (GB)    L2↔DRAM (GB)    L2↔DRAM (GB)    L2↔DRAM required   alive (%)
VEGETATION M    13.6            5.6             7.7             2.3×               75
VEGETATION R    24.9            12.5            12.1            1.6×               73
HAIRBALL M      11.2            5.0             6.0             2.0×               75
HAIRBALL R      18.5            9.2             9.1             1.6×               65
VEYRON M        1.9             0.6             1.1             6.9×               68
VEYRON R        9.2             3.9             5.1             2.5×               73

Table 3: Measurements for the baseline method [AL09] with compaction. The remaining ∼0.2GB of traffic is due to ray fetches and result writes. M=MORTON, R=RANDOM.

3.2. Baseline ray tracing method

We take the persistent while-while GPU tracer [AL09] as our baseline method. In their method processors are pointed to a pool of rays and autonomously fetch new rays whenever a warp terminates. During execution a ray always fetches and intersects the two child nodes together, proceeds to the closer intersected child, and pushes the other intersected child's index to a per-ray traversal stack.
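The per-ray inner loop of this baseline can be sketched as follows; this is a simplified scalar version, and the leaf-flag encoding and the leaf routine are placeholders of our own rather than the exact method of [AL09]:

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Ray  { float origin[3], tmin, direction[3], tmax; };
    struct Node { float bboxMin[3], bboxMax[3]; uint32_t child[2]; };

    static bool     isLeaf(uint32_t index) { return (index & 0x80000000u) != 0; }
    static uint32_t nodeId(uint32_t index) { return index & 0x7fffffffu; }

    // Standard slab test; returns the entry distance of the ray against the node's AABB.
    static bool intersectAABB(const Node& n, const Ray& ray, float tmax, float& tEntry) {
        float t0 = ray.tmin, t1 = tmax;
        for (int a = 0; a < 3; ++a) {
            float inv   = 1.0f / ray.direction[a];
            float tNear = (n.bboxMin[a] - ray.origin[a]) * inv;
            float tFar  = (n.bboxMax[a] - ray.origin[a]) * inv;
            if (inv < 0.0f) std::swap(tNear, tFar);
            t0 = std::max(t0, tNear);
            t1 = std::min(t1, tFar);
        }
        tEntry = t0;
        return t0 <= t1;
    }

    // Placeholder for the leaf routine (up to 8 triangles per leaf); returns the
    // closest hit distance found, or tmax if nothing was hit.
    static float intersectLeaf(uint32_t /*leaf*/, const Ray&, float tmax) { return tmax; }

    // One ray of the while-while baseline: fetch and test both children together,
    // descend into the closer intersected child, push the other on the per-ray stack.
    static float traceRay(const std::vector<Node>& nodes, const Ray& ray) {
        std::vector<uint32_t> stack;               // per-ray traversal stack
        uint32_t current = 0;                      // root, assumed to be internal
        float closest = ray.tmax;

        for (;;) {
            if (isLeaf(current)) {
                closest = std::min(closest, intersectLeaf(nodeId(current), ray, closest));
            } else {
                uint32_t c0 = nodes[nodeId(current)].child[0];
                uint32_t c1 = nodes[nodeId(current)].child[1];
                float t0, t1;
                bool hit0 = intersectAABB(nodes[nodeId(c0)], ray, closest, t0);
                bool hit1 = intersectAABB(nodes[nodeId(c1)], ray, closest, t1);
                if (hit0 && hit1) {
                    if (t1 < t0) std::swap(c0, c1);
                    stack.push_back(c1);           // postpone the farther child
                    current = c0;
                    continue;
                }
                if (hit0 || hit1) { current = hit0 ? c0 : c1; continue; }
            }
            if (stack.empty()) return closest;     // done: closest hit distance
            current = stack.back();
            stack.pop_back();
        }
    }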

Table 3 shows the baseline method's memory traffic in our scenes with MORTON and RANDOM sorting of rays. The total L2↔DRAM bandwidth requirements are very large compared to the lower bound shown in Table 1. With RANDOM we see 150–200× more traffic than theoretically necessary. MORTON sorting reduced the traffic almost 5× in VEYRON, but in other scenes the effect was surprisingly small given that this is basically the worst-to-best case jump. In VEYRON we can furthermore see that while DRAM traffic was greatly reduced, the required L2 bandwidth compared to DRAM bandwidth was rather large (6.9×), implying that L1 caches were not effective even in this simple case.

3.3. Stack top cache

Our baseline method uses the stack memory layout of Aila and Laine [AL09], who kept the stacks in CUDA's [NVI08] thread-local memory. This memory space is interleaved so that the first stack entries of 32 adjacent threads (i.e. a warp) are consecutive in memory, same for the second entry, and so on. This is an attractive memory layout as long as the stack accesses of a warp stay approximately in sync, but with incoherent rays this is no longer the case, and reordering of rays due to compaction further amplifies the problem. An additional test that ignored all traversal stack traffic revealed that stacks are an important problem and responsible for approximately half of the total traffic in all six cases of Table 3.

Test setup           Total traffic   Stack traffic   Total traffic   Stack traffic
                     N = 4 (GB)      N = 4 (GB)      N = 8 (GB)      N = 8 (GB)
VEGETATION MORTON    6.0             0.186           5.9             0.016
VEGETATION RANDOM    12.9            0.186           12.8            0.016
HAIRBALL MORTON      5.3             0.104           5.2             0.009
HAIRBALL RANDOM      9.6             0.104           9.5             0.009
VEYRON MORTON        0.9             0.119           0.8             0.006
VEYRON RANDOM        4.2             0.119           4.1             0.006

Table 4: Total and stack traffic measurements for the baseline augmented with an N-entry stack top cache.

An alternative layout is to dedicate a linear chunk of stack memory for each ray. While this approach is immune to compaction and incoherence between rays, all rays are constantly accessing distinct cache lines. Since this exceeds the capacity of our caches, thrashing follows.

Horn et al. [HSHH07] describe a shortstack that maintains only a few topmost (latest) stack entries, and restarts (kd-tree) traversal from the root node with a shortened ray when the top is exhausted. This policy can lead to a very significant amount of redundant work when the missing stack entries are reconstructed. We use a related approach that keeps the full per-ray traversal stack in memory and accesses it through a stack top cache, whose logic is optimized for direct DRAM communication. Our implementation uses a ring buffer with a capacity of N entries (the size is dynamic, [0,N]), each of which consists of 4B data and a dirty flag. The ring buffer is initially empty. Push and pop are implemented as follows (a software sketch of the cache follows the list):

• push: Add a new entry to the end of the ring buffer, mark it as dirty, and increment the stack pointer. In case of overflow, evict the first entry. If the evicted entry was marked as dirty, write all entries belonging to the same atom to DRAM. Mark them as non-dirty and keep them in cache.

• pop: Remove an entry from the end of the ring buffer and decrement the stack pointer. In case of underflow, fetch the required entry as well as all earlier stack entries that belong to the same DRAM atom, and add them to the ring buffer as non-dirty. If the number of entries exceeds N, discard the earliest ones.
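A small software model of this stack top cache is sketched below; the DRAM interface is abstracted into per-entry read/write callbacks and the names are ours, but the overflow and underflow handling follows the two rules above:

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <utility>

    // Software model of the per-ray stack top cache. The full traversal stack
    // lives in DRAM; dramRead/dramWrite stand in for 32B-atom transfers
    // (one atom holds 8 entries of 4B).
    class StackTopCache {
    public:
        StackTopCache(int capacity,
                      std::function<uint32_t(int)> dramRead,
                      std::function<void(int, uint32_t)> dramWrite)
            : capacity_(capacity), read_(std::move(dramRead)), write_(std::move(dramWrite)) {}

        // push: append a dirty entry and advance the stack pointer; on overflow,
        // evict the oldest cached entry, flushing its atom if it was dirty.
        void push(uint32_t nodeIndex) {
            if ((int)cache_.size() == capacity_) evictOldest();
            cache_.push_back({nodeIndex, true});
            ++stackPtr_;
        }

        // pop: remove the newest entry and step the stack pointer back; on
        // underflow, refill the cache from the entry's DRAM atom (non-dirty).
        uint32_t pop() {
            if (cache_.empty()) refill();
            uint32_t value = cache_.back().value;
            cache_.pop_back();
            --stackPtr_;
            return value;
        }

    private:
        struct Entry { uint32_t value; bool dirty; };
        static const int kAtom = 8;                       // entries per 32B DRAM atom

        int depthOf(size_t i) const {                     // stack depth of cache_[i]
            return stackPtr_ - (int)cache_.size() + (int)i;
        }

        void evictOldest() {
            if (cache_.front().dirty) {
                int atom = depthOf(0) / kAtom;
                for (size_t i = 0; i < cache_.size(); ++i)
                    if (depthOf(i) / kAtom == atom) {
                        write_(depthOf(i), cache_[i].value);
                        cache_[i].dirty = false;          // stays cached, now clean
                    }
            }
            cache_.pop_front();
        }

        void refill() {
            int top = stackPtr_ - 1;                      // entry we are about to pop
            int atomStart = (top / kAtom) * kAtom;
            atomStart = std::max(atomStart, top - capacity_ + 1);  // keep at most N
            for (int d = atomStart; d <= top; ++d)
                cache_.push_back({read_(d), false});
        }

        int capacity_;
        int stackPtr_ = 0;                                // entries on the full stack
        std::deque<Entry> cache_;                         // front = oldest cached entry
        std::function<uint32_t(int)> read_;
        std::function<void(int, uint32_t)> write_;
    };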

We maintain the ring buffer for each thread in the register file because a tiny stack top of 4 entries typically suffices, and it would be wasteful to allocate a 128B cacheline for it when the register file can store it in 16B. The stack top absorbs almost all traversal stack traffic and consequently reduces the total traffic by approximately 50% in our tests, as shown in Table 4. In the rest of the paper we will refer to the baseline with stack top as improved baseline.

To avoid unnecessary flushes during compaction (Section 3.1), we copy the cached stack entries similarly to the rest of the state. However, if we push a ray to a queue, we must flush dirty stack entries, as we have chosen not to store cached stack entries in the queue. This design gave slightly lower memory traffic on our system, but if the DRAM atom was any larger, it would be better to store the cached entries to the queue.

Figure 2: Example treelet subdivision of an acceleration structure (internal nodes, leaf nodes, and treelets T0–T5).

4. Treelets

The primary problem with the baseline method is that rays explore the scene independently of each other, and thus the concurrent working set can grow very large. This problem was approached by Pharr et al. [PKGH97] in the context of out-of-core ray tracing. They collected rays that enter nodes of a uniform space subdivision, and later on processed the rays as batches in order to reduce swapping. More recently, Navratil et al. [NFLM07] optimized L2↔DRAM traffic of a sequential ray tracer with a setup that collected rays into queue points, and then processed all rays in one queue point together. Each queue point was placed so that the entire subtree under it fits into cache. All data between the root and queue points was assumed to stay in cache, and thus the related memory traffic was not tracked. There was also no stack traffic since whenever a ray exited a queue point, it was shortened and traversal restarted from the root node. We extend this work to hierarchical queue points and massive parallelism.

Figure 2 shows a simple tree that has been partitioned into six treelets. Nodes are statically assigned to treelets when the BVH, or any other acceleration structure, is built (Section 4.1). We store this assignment into node indices that encode both a treelet index and a node index inside the treelet. A ray starts the traversal from the root and proceeds as usual until it encounters a treelet boundary. At this point the processing is suspended and the ray's state is pushed to a queue that corresponds to the target treelet. At some point a scheduler (Section 4.2) decides that it is time to start processing rays from the target treelet. Hopefully by this time many more rays have been collected so that a number of rays will be requesting the same data (nodes and triangles) and thus the L1 cache can absorb most of the requests. The tracing of the rays then continues as usual, and when another treelet boundary is crossed the rays are queued again.

 1:  // Start with an empty treelet.
 2:  cut ← treeletRoot
 3:  bytesRemaining ← maxTreeletFootprint
 4:  bestCost[treeletRoot] ← ∞
 5:
 6:  // Grow the treelet until it is full.
 7:  loop
 8:      // Select node from the cut.
 9:      (bestNode, bestScore) ← (∅, −∞)
10:      for all n ∈ cut do
11:          if footprint[n] ≤ bytesRemaining then
12:              gain ← area[n] + ε
13:              price ← min(subtreeFootprint[n], bytesRemaining)
14:              score ← gain / price
15:              if score > bestScore then
16:                  (bestNode, bestScore) ← (n, score)
17:              end if
18:          end if
19:      end for
20:      if bestNode = ∅ then break
21:
22:      // Add to the treelet and compute cost.
23:      cut ← (cut \ bestNode) ∪ children[bestNode]
24:      bytesRemaining ← bytesRemaining − footprint[bestNode]
25:      cost ← (area[treeletRoot] + ε) + Σn∈cut bestCost[n]
26:      bestCost[treeletRoot] ← min(bestCost[treeletRoot], cost)
27:  end loop

Figure 3: Pseudocode for treelet assignment using dynamic programming. This code is executed for each node in reverse depth-first order.
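The node indices mentioned earlier in this section encode both a treelet index and a node index within the treelet; one possible packing (the 16/16-bit split is our assumption, not specified in the paper) is:

    #include <cstdint>

    // One possible encoding of the global node index: the upper bits select the
    // treelet (and thus its queue), the lower bits select the node within it.
    // A 16/16 split comfortably covers e.g. the ~15K treelets of HAIRBALL (Table 5).
    static inline uint32_t makeNodeIndex(uint32_t treelet, uint32_t local) {
        return (treelet << 16) | (local & 0xffffu);
    }
    static inline uint32_t treeletOf(uint32_t nodeIndex) { return nodeIndex >> 16; }
    static inline uint32_t localOf(uint32_t nodeIndex)   { return nodeIndex & 0xffffu; }

    // During traversal, a treelet boundary crossing is detected by comparing
    // treeletOf(child) against the treelet the processor is currently bound to;
    // on a mismatch, the ray state is pushed to that treelet's queue.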

While a treelet subdivision can be arbitrary, it should generally serve two goals. We want to improve cache hit rates and therefore should aim for treelets whose memory footprint approximately matches the size of L1. We should also place the treelet boundaries so that the expected number of treelet transitions per ray is minimized because each transition causes memory traffic.

The rest of the paper assumes one-to-one mapping between treelets and queues, and that a ray can reside in at most one queue at any given time. While one could speculatively push a ray to multiple queues, we have found that this is not beneficial in practice.

4.1. Treelet assignment

The goal of our treelet assignment algorithm is to minimize the expected number of treelets visited by a random ray. Disregarding occlusion, the probability of a long random ray intersecting a treelet is proportional to its surface area. We therefore chose to optimize the total surface area of the treelet roots, which was also experimentally verified to have a strong correlation with treelet-related memory traffic. Our method produces treelets that have a single root node, although the rest of this paper does not rely on that.

We use dynamic programming, and the algorithm's execution consists of two stages. In the first stage the pseudocode in Figure 3 is executed for each BVH node in reverse depth-first order (starting from leaves) to find the lowest-cost treelet that starts from the specified node. The expected cost of a treelet equals its root node's surface area plus the sum of costs of all nodes in the cut, i.e. the set of nodes that are just below the treelet.

The algorithm is greedy because instead of trying all possible treelet shapes it selects nodes from the cut according to a simple heuristic (lines 12–14). The heuristic is primarily based on surface area: if a node has large area, a treelet starting from it would probably be intersected often. Since any node in the cut can be added to the current treelet without increasing the treelet's area, we choose the largest. However, if a node in the cut represents a subtree with tiny memory footprint, we would like to include it into the current treelet instead of creating a separate, tiny treelet for it.

In the second stage of the algorithm nodes are assigned to treelets. We start from the root node, with essentially the same code, and enlarge the treelet until its cost equals the value that was stored during the first pass. Once the cost is met, the loop terminates and the function is called recursively for each node in the cut.

The additional control parameter ε encourages subdivisions with larger average treelet footprint. We prefer larger (and thus fewer) treelets for two reasons. First, fewer treelets means fewer queues during simulation. Second, our subsequent analysis of the optimal treelet size would be dubious if the average size was nowhere near the specified maximum. If the scene footprint (nodes and triangles) is S and maximum treelet size is T, at least S/T treelets will be needed. As a crude approximation for the expected surface area of the smallest treelets we use ε = AreaBVHroot · T/(S · 10) to discourage unnecessarily small treelets. This penalty has a negligible effect on the total surface area while increasing the average treelet memory footprint, e.g., from 3KB to 19KB.
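As a rough worked example using the VEGETATION numbers from Table 1 (illustrative only): with S ≈ 86MB and T = 48KB, at least S/T ≈ 1800 treelets are needed, and ε = AreaBVHroot · 48KB/(86MB · 10) ≈ AreaBVHroot/18000, i.e. the ε term is a tiny fraction of the root area and perturbs the surface-area optimization only slightly. The 4588 treelets reported for this scene in Table 5 stay well above the S/T lower bound.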

Table 5 provides statistics for our test scenes with treelets not larger than 48KB. Our method deposited the BVH leaf nodes adaptively to levels 1–8, whereas prior work had exactly 2 layers of treelets [NFLM07]. As a further validation of our greedy treelet assignment, we also implemented an O(NM²) search that tried all possible cuts in the subtree instead of the greedy heuristic. Here, N and M denote the number of nodes in the tree and treelet, respectively.

Scene        Num 48KB   Avg. size   Stddev   Avg. treelets   Treelet   Area vs.
             treelets   (KB)        size     per ray         layers    optimal
VEGETATION   4588       19.3        15.4     11.3            1–5       +20%
HAIRBALL     14949      14.3        16.8     6.1             2–8       +15%
VEYRON       3563       30.0        13.3     9.2             1–8       +5%

Table 5: Treelet assignment statistics with 48KB max size.

4.2. Scheduler

The remaining task is to choose which queues the processors should fetch rays from. We will analyze two different scheduling algorithms for this purpose.

LAZY SCHEDULER is as simple as possible: when the queue a processor is currently bound to gets empty, we bind the processor to the queue that currently has the most rays. This policy has several consequences. The scheduler tends to assign the same queue to many processors, and most of the time only a few treelets are being processed concurrently by the chip. Considering DRAM traffic, this behavior tends to favor treelets that are roughly the size of the L2 cache. Also, scene traffic is well-optimized because queues are able to collect a maximal number of rays before they are bound.
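The LAZY policy amounts to little more than the following (a sketch; the queue bookkeeping interface is ours, and the input queue is treated as just another queue here):

    #include <cstdint>
    #include <vector>

    // LAZY scheduler sketch: when a processor's current queue runs dry, rebind
    // it to the queue that currently holds the most rays.
    static int lazyRebind(const std::vector<uint32_t>& queueSizes) {
        int best = 0;
        for (int q = 1; q < (int)queueSizes.size(); ++q)
            if (queueSizes[q] > queueSizes[best]) best = q;
        return best;   // the processor stays bound to this queue until it is empty
    }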

BALANCED SCHEDULER defines a target size for all queues (e.g. 16K rays). When a queue exceeds the target size, it starts requesting progressively more processors to drain from it. This count grows linearly from zero at the target size to the total number of processors at 2× target size. We also track how many processors are currently bound to a queue, and therefore know how many additional processors the queue thinks it needs. The primary sorting key of queues is the additional workforce needed, while the number of rays in the queue serves as a secondary key. The binding of a processor is changed if its current queue has too much workforce while some other queue has too little. The input queue that initially contains all rays requires special treatment: we clamp the number of processors it requests to four to prevent the input queue from persistently demanding all processors to itself. This policy leads to more frequent binding changes than the LAZY scheduler, but different processors are often bound to different queues and therefore L1-sized treelets are favored. It also creates additional opportunities for queue bypassing (Section 4.4).
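The linear ramp of the BALANCED policy can be written as follows (a sketch with our own parameter names):

    #include <algorithm>
    #include <cstdint>

    // Processors requested by a queue under the BALANCED policy: zero at the
    // target size, growing linearly to all processors at twice the target size.
    // The deficit (requested minus currently bound) is the primary rebinding key.
    static int processorsRequested(uint32_t queuedRays, uint32_t targetSize, int numProcessors) {
        if (queuedRays <= targetSize) return 0;
        float t = (float)(queuedRays - targetSize) / (float)targetSize;   // 0..1 over [target, 2*target]
        return std::min((int)(t * numProcessors), numProcessors);
    }

    // The input queue is special-cased: its request is clamped to at most four
    // processors so that it cannot monopolize the whole chip.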

The graphs in Figure 4 show the total DRAM traffic as a function of maximum treelet size for both schedulers in VEGETATION. It can be seen that LAZY scheduler works best with L2-sized treelets and BALANCED scheduler with L1-sized treelets, although this simplified view disregards potential L1↔L2 bottlenecks. In both cases the minimum traffic is reached with a maximum size that is slightly larger than the caches. Two possible explanations are that the average treelet size is always somewhat smaller than the maximum, and that rays may not actually access every node and triangle of a treelet, thus reducing the actual cache footprint.


Figure 4: DRAM traffic (MB) as a function of maximum treelet size (KB) in VEGETATION for both schedulers, broken down into total, scene, and auxiliary (queues, rays, stacks) traffic. LAZY scheduler gives the lowest total traffic with treelets that are sized according to L2, whereas the BALANCED scheduler works best with L1-sized treelets.

We will provide detailed results in Section 5, but it is already evident that scene traffic remains clearly higher than the lower bound for this scene (158MB). The crucial observation here is that we could expect to process each treelet once only if all rays visited the nodes in the same order. In practice, however, some rays traverse the scene left-to-right and others right-to-left. Taking this further, we can expect to traverse the scene roughly once for each octant of ray direction, which would suggest a more realistic expectation of 8× the lower bound.

Ramani et al. [RGD09] propose an architecture that essentially treats every node as a treelet. As can be extrapolated from our graphs, their design point is bound to exhibit very significant auxiliary traffic and therefore mandates keeping the rays and related state in on-chip caches, which is what Ramani et al. do. One challenge is that we have about 16K rays in execution at any given time and, based on preliminary experiments, we would need to keep at least 10× as many rays in on-chip caches to benefit significantly from postponing. Considering current technology, this is a fairly substantial requirement. The scene traffic might also be higher with their approach since we can expect to fetch the scene data approximately 8 times per batch of rays, and currently our batch size can be in millions, whereas their approach would require smaller batches.

4.3. Queue management

We are implicitly assuming that queue operations are very fast, since otherwise our treelet-based approach might be limited by processing overhead instead of memory traffic. This probably requires dedicated hardware for queue operations, which should be quite realistic considering that plenty of other uses exist for queues [SFB∗09]. Our queues are implemented as ring buffers with FIFO policy in external memory, and the traffic goes directly to DRAM because whenever we push a ray to a queue, we do not expect to need it for a while. Caching would therefore be ineffective. Writes are compacted on-chip before sending the atoms to external memory; queues could therefore work efficiently with any DRAM atom size.

We tried two memory allocation schemes for the queues. The easier option is to dedicate a fixed amount of storage for each queue at initialization time. This means that queues can get full, and therefore pre-emption must be supported to prevent deadlocks. Our pre-emption moves the rays whose output is blocked back to the input queue, which is guaranteed to have enough space. With fixed-size queues, it is not clear to us whether any realistic scheduler could guarantee deadlock-free execution without pre-emption; at least our current schedulers cannot. BALANCED SCHEDULER can often avoid pre-emptions but it still needs them occasionally. However, the real problem with this approach is that a large amount of memory has to be allocated (e.g. thousands of entries per treelet) and most of this memory will never be used.

We implemented dynamic resizing of queues to use memory more efficiently and guarantee no deadlocks even without pre-emption. When a queue gets full, we extend it with another page (a fixed-size chunk of memory, e.g. 256 entries) from a pool allocator. Queues maintain a list of pages in their possession. When a page is no longer needed, it is returned to the pool. If we limit the system to have a certain maximum number of rays in flight at a time, the memory requirements are bounded by this threshold plus one partial page per treelet as well as one per processor. As long as the pool allocator has been appointed a sufficient amount of memory, queues cannot get full and hence deadlocking is impossible. This is the method actually used in our measurements. We did not study the possible microarchitecture of this mechanism in detail, but we do not foresee any fundamental problems with it. Lalonde [Lal09] describes other prominent uses for pool allocators in graphics pipelines.
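Such a paged queue could look roughly as follows (a software sketch; the page size of 256 entries follows the example above, the pool interface is our assumption, and the real unit would of course be dedicated hardware rather than C++):

    #include <array>
    #include <cassert>
    #include <cstdint>
    #include <deque>
    #include <vector>

    static const int kPageEntries = 256;               // entries per fixed-size page

    struct RayStateRecord { uint8_t bytes[16]; };      // stand-in for the 16B ray state

    // Shared pool of fixed-size pages; in hardware this would be a simple free list.
    class PagePool {
    public:
        explicit PagePool(int numPages) : pages_(numPages) {
            for (int i = 0; i < numPages; ++i) freeList_.push_back(i);
        }
        int acquire()          { assert(!freeList_.empty()); int p = freeList_.back(); freeList_.pop_back(); return p; }
        void release(int page) { freeList_.push_back(page); }
        RayStateRecord* data(int page) { return pages_[page].data(); }
    private:
        std::vector<std::array<RayStateRecord, kPageEntries>> pages_;
        std::vector<int> freeList_;
    };

    // Per-treelet FIFO queue that grows and shrinks one page at a time.
    class PagedQueue {
    public:
        explicit PagedQueue(PagePool& pool) : pool_(pool) {}

        void push(const RayStateRecord& r) {
            if (tailOffset_ == kPageEntries || pages_.empty()) {
                pages_.push_back(pool_.acquire());     // extend with a new page
                tailOffset_ = 0;
            }
            pool_.data(pages_.back())[tailOffset_++] = r;
            ++count_;
        }

        RayStateRecord pop() {
            assert(count_ > 0);
            RayStateRecord r = pool_.data(pages_.front())[headOffset_++];
            --count_;
            if (count_ == 0) {                          // everything consumed: reset
                while (!pages_.empty()) { pool_.release(pages_.front()); pages_.pop_front(); }
                headOffset_ = 0;
                tailOffset_ = kPageEntries;
            } else if (headOffset_ == kPageEntries) {   // front page fully consumed
                pool_.release(pages_.front());
                pages_.pop_front();
                headOffset_ = 0;
            }
            return r;
        }

        uint32_t size() const { return count_; }

    private:
        PagePool& pool_;
        std::deque<int> pages_;                         // pages owned by this queue
        int headOffset_ = 0, tailOffset_ = kPageEntries;
        uint32_t count_ = 0;
    };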


Scene        Scheduler   Treelet    Treelet    Total     Scene traffic (cached)            Aux. traffic (uncached)    Threads   Queue ops
                         max size   changes    traffic   L2↔DR.   vs. lower   L1↔L2 vs.    Queues   Rays    Stacks    alive     bypassed
                         (KB)       per ray    (GB)      (GB)     bound       L2↔DR.       (GB)     (GB)    (GB)      (%)       (%)
VEGETATION   LAZY        768        6.9        3.01      1.12     7.1×        7.7×         0.59     0.64    0.57      62        18
VEGETATION   BALANCED    768        6.9        6.76      5.26     33.3×       1.7×         0.44     0.49    0.48      52        41
VEGETATION   LAZY        48         11.3       3.97      1.18     7.5×        2.3×         0.89     0.93    0.89      64        23
VEGETATION   BALANCED    48         11.3       3.77      1.18     7.5×        1.7×         0.81     0.86    0.83      63        30
HAIRBALL     LAZY        768        2.7        1.78      0.94     9.0×        9.0×         0.24     0.29    0.24      62        25
HAIRBALL     BALANCED    768        2.7        5.78      5.05     48.5×       1.7×         0.19     0.24    0.21      53        43
HAIRBALL     LAZY        48         6.1        2.68      1.10     10.6×       2.4×         0.50     0.55    0.47      64        23
HAIRBALL     BALANCED    48         6.1        2.53      1.08     10.4×       1.7×         0.44     0.49    0.43      63        32
VEYRON       LAZY        768        5.7        1.73      0.16     3.4×        23.3×        0.48     0.52    0.47      64        21
VEYRON       BALANCED    768        5.7        2.94      1.75     37.2×       2.5×         0.34     0.39    0.37      51        47
VEYRON       LAZY        48         9.2        2.49      0.19     4.0×        3.3×         0.75     0.79    0.67      67        21
VEYRON       BALANCED    48         9.2        2.03      0.20     4.3×        3.6×         0.57     0.61    0.56      60        41

Table 6: Measurements for our treelet-based approach with two schedulers using 48KB and 768KB treelets. Stack top size was fixed to four and bypassing according to two previous queue bindings was allowed. Additionally, result writes used ∼0.1GB.

4.4. Queue bypassing

We can avoid the round-trip through a queue if the queue is already bound to another processor. This optimization is obviously possible only in parallel architectures. Since there is no need to preserve the ordering of rays, we can forward the ray along with its ray state and stack top to the other processor's launcher. This is basically the same operation as compaction (Section 3.1) but to a different processor. In our architecture, as well as in most massively parallel designs, any two processors can be reached from the shared L2 cache, and thus the queue bypass operation can be carried out at L2 bandwidth without causing DRAM traffic.

Since the stack top cache talks directly to DRAM, there is no risk that some of the traversal stack data would be only in the previous processor's L1 cache. Therefore coherent L1 caches are not required.

We can optionally allow bypassing decisions according to e.g. two previous queues in addition to the current one. The justification is that whenever a processor is bound to a new queue, it will continue to have rays from the previous queue/treelet in flight for some time. Also, towards the end many treelets will have fewer rays than fit into a processor, and therefore rays from more than two treelets will be in flight simultaneously. Having multiple treelets in flight does not immediately lead to a working set explosion because when a smaller number of rays propagate inside a treelet, only a part of its data is accessed.

5. Results

Table 6 shows results for RANDOM sorted rays in the three scenes using both schedulers with L1- and L2-sized treelets. We do not provide separate results for MORTON sorting because they were virtually identical with RANDOM; it therefore does not matter in which order the rays are submitted to the simulator. The lowest total traffic was obtained in all scenes with LAZY scheduler and L2-sized treelets. For this design point to be realistic L2 should be 7.7–23.3× faster than DRAM, which would require a radical departure from current architectures. The second lowest total traffic was obtained with BALANCED scheduler and L1-sized treelets, and this combination places only modest assumptions on L2 speed (1.7–3.6× DRAM). In the following we will focus on the latter design point.

Figure 5 shows the total and scene traffic for baseline, improved baseline (baseline with stack top cache), and treelets with RANDOM sorted rays. The improved baseline essentially eliminates the stack traffic, and the total traffic is roughly halved. The reduction is virtually identical for MORTON sorted rays (Tables 3 and 4).

Treelets offer an additional 50–75% reduction in total memory bandwidth on top of the improved baseline, yielding a total reduction of 80–85%. Figure 5 also reveals that when treelets are used, the scene traffic is greatly diminished but the auxiliary traffic (rays, stacks, queues) increases a lot. This is the most significant remaining problem with the treelet-based approach, and the reason why it can increase memory traffic in simple scenes. With MORTON sorted rays the reduction on top of the improved baseline is still around 50% in VEGETATION and HAIRBALL, but in VEYRON treelets actually lose (2.0GB vs 0.9GB).

These results assume that bypassing is allowed to two previous queues. Table 7 shows that if bypassing is allowed only to the current queue, the total traffic would increase by about 10%, and disabling bypassing would cost another 10%. Increasing the number of previous queues further did not help.

Figure 5: Memory traffic (GB) with RANDOM sorted rays for baseline, improved baseline, and treelets (from left to right), split into scene and auxiliary traffic. The totals are 24.9, 12.9, and 3.8GB in VEGETATION, 18.5, 9.6, and 2.5GB in HAIRBALL, and 9.2, 4.2, and 2.0GB in VEYRON.

To gauge scalability, we performed additional tests with 64 processors and L1 caches instead of 16, all other parameters held constant. With BALANCED scheduler and 48KB treelets the total memory traffic changed very little (<20%) and actually decreased in VEYRON. While the scene traffic did increase somewhat, there were more bypassing opportunities on a larger chip, which reduced the auxiliary traffic. This suggests that our approach scales well, and that a relatively small L2 can be sufficient even in massive chips.

So far we have assumed that each processor can drain data from L1 at least 512 bits at a time (2 adjacent sibling nodes). This may not be the case on existing architectures; for example on NVIDIA Fermi the widest load instruction is 128 bits. This means that four consecutive loads are needed, and there is a possibility that the data gets evicted from L1 between these loads. We introduced this constraint to the 16-processor chip, and it almost doubled the total traffic in the baseline, whereas our treelet-based approach was barely affected because cache thrashing happens so rarely. In the most difficult cases, RANDOM sorted rays in VEGETATION and HAIRBALL, the reduction in total memory traffic compared to the baseline was now slightly over 90%, whereas the simplest case (VEYRON with MORTON) was unaffected.

6. Possible future improvements

6.1. Batch processing vs. continuous flow of rays

We have concentrated on processing the rays in batches. A potentially appealing alternative would be to let the tracing process run continuously so that new rays could be pulled in whenever the scheduler so decides. Unfortunately our current schedulers do not guarantee fairness in the sense that rays that end up in a rarely visited treelet could stay there indefinitely long. With batch processing this cannot happen since all queues/treelets are processed upon the batch's end. It should be possible to improve the schedulers so that fairness could be guaranteed. We also suspect, without tangible evidence, that the schedulers can be improved in other ways as well. Another complication from a continuous flow of rays is memory allocation for traversal stacks, ray data, and ray results. In batch processing all of this is trivial, but when individual rays come and go, a more sophisticated memory allocation scheme would be called for.

Test                Total     Scene     Aux. traffic                  Bypass
                    traffic   L2↔DR.    Queues    Rays     Stacks
                    (GB)      (GB)      (GB)      (GB)     (GB)       (%)
VEGETATION 2 PQ     3.77      1.18      0.81      0.86     0.83       30
VEGETATION 0 PQ     4.07      1.21      0.92      0.97     0.88       20
VEGET. no bypass    4.61      1.23      1.14      1.19     0.97       0
HAIRBALL 2 PQ       2.53      1.08      0.44      0.49     0.43       32
HAIRBALL 0 PQ       2.68      1.09      0.50      0.55     0.46       28
HAIRB. no bypass    3.00      1.10      0.63      0.68     0.51       0
VEYRON 2 PQ         2.03      0.20      0.57      0.61     0.56       41
VEYRON 0 PQ         2.41      0.20      0.71      0.76     0.64       24
VEYRON no bypass    2.94      0.20      0.93      0.98     0.74       0

Table 7: Measurements with bypassing to two previous queue bindings (2 PQ), with bypassing only to the current queue (0 PQ), and without bypassing, using BALANCED scheduler and 48KB treelets. PQ=previous queue bindings.

6.2. Stackless traversal

Stack traffic constitutes 17–28% of the overall memory communication when treelets are used. One way to remove it altogether would be to use fixed-order skiplist traversal [Smi98], which would also reduce the scene traffic since there is a well defined global order for processing the treelets. The potentially significant downside of fixed-order traversal is that rays are not traversed in front-to-back order, and in scenes that have high depth complexity a considerable amount of redundant work may be performed. That said, in scenes with low depth complexity (e.g. VEYRON) fixed-order traversal might indeed be a good idea, and with shadow rays that do not need to find the closest hit, it may generally win.

6.3. Wide trees

Our preliminary tests suggest that wide trees [DHK08, WBB08] could allow visiting fewer treelets per ray than binary trees. Wide trees have other potentially favorable properties too. One could map the node intersections to N lanes in an N-wide tree, instead of dedicating one ray per thread. As a result, the number of concurrently executing rays would be reduced to 1/N of the original, and the scene-related memory traffic should also be lower. We consider this a promising avenue of future work.

6.4. Compression

Both the baseline method and our treelet-based approach would benefit from a more compact representation of scene data and rays. The first step could be to drop the planes from BVH nodes that are shared with the parent node [Kar07], or possibly use less expressive but more compact nodes, such as the bounding interval hierarchy [WK06]. Alternatively some variants of more complex compression methods [LYM07, KMKY09] might prove useful.


It is not clear to us how rays could be compressed. One could imagine quantizing the direction, for example, but presumably that would have to be controllable by the application developer. One could also represent, e.g., hemispheres as a bunch of directions and a shared origin, although in our design that would in fact increase the bandwidth since one would need to fetch two DRAM atoms per ray (unique and shared components). Nevertheless, this is a potentially fruitful topic for future work.

6.5. Keeping rays on-chip

Looking much further to the future, it may be possible that L3 caches of substantial size (e.g. 100MB) could be either pressed on top of GPUs as stacked die or placed right next to GPUs and connected via high-speed interconnect. In such a scenario it might make sense to keep a few million rays and perhaps stack tops on chip, thus eliminating the related memory traffic.

6.6. Shading

We have completely ignored shading in this paper. We postulate that the queues offer a very efficient way of collecting shading requests for a certain material, shader, or object, and then executing them when many have been collected, similarly to the stream filtering of Ramani et al. [RGD09].

7. Conclusions

Our simulations show that stack top caching and the treelet-based approach can significantly reduce the memory bandwidth requirements arising from incoherent rays. We believe that for many applications the possibility of submitting rays in an arbitrary order will be a valuable feature.

In terms of getting the required features actually built, several milestones can be named. Before dedicated support makes financial sense, ray tracing has to become an important technology in mainstream computing. We also have to have important applications whose performance is limited by the memory bandwidth. Regardless, we believe that most of the required features, such as hardware-accelerated queues, have important uses outside this particular application.

Acknowledgements Thanks to Peter Shirley and Lauri Savioja for proofreading. Vegetation and Hairball scenes courtesy of Samuli Laine.

References

[AL09] AILA T., LAINE S.: Understanding the efficiency of ray traversal on GPUs. In Proc. High-Performance Graphics 2009 (2009), pp. 145–149.

[DHK08] DAMMERTZ H., HANIKA J., KELLER A.: Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. Comp. Graph. Forum 27, 4 (2008), 1225–1234.

[Hal60] HALTON J.: On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik 2, 1 (1960), 84–90.

[HSHH07] HORN D. R., SUGERMAN J., HOUSTON M., HANRAHAN P.: Interactive k-d tree GPU raytracing. In Proc. Symposium on Interactive 3D Graphics and Games (2007), ACM, pp. 167–174.

[Kar07] KARRENBERG R.: Memory aware realtime ray tracing: The bounding plane hierarchy. Bachelor thesis, Saarland University, 2007.

[KMKY09] KIM T.-J., MOON B., KIM D., YOON S.-E.: RACBVHs: Random-accessible compressed bounding volume hierarchies. In ACM SIGGRAPH '09: Posters (2009), pp. 1–1.

[Kra07] KRASHINSKY R. M.: Vector-Thread Architecture and Implementation. PhD thesis, MIT, 2007.

[Lal09] LALONDE P.: Innovating in a software graphics pipeline. ACM SIGGRAPH 2009 course: Beyond Programmable Shading, 2009.

[LYM07] LAUTERBACH C., YOON S.-E., MANOCHA D.: Ray-strips: A compact mesh representation for interactive ray tracing. In Proc. IEEE Symposium on Interactive Ray Tracing 2007 (2007), pp. 19–26.

[NFLM07] NAVRATIL P. A., FUSSELL D. S., LIN C., MARK W. R.: Dynamic ray scheduling to improve ray coherence and bandwidth utilization. In Proc. IEEE Symposium on Interactive Ray Tracing 2007 (2007), pp. 95–104.

[NVI08] NVIDIA: NVIDIA CUDA Programming Guide Version 2.1. 2008.

[NVI10] NVIDIA: NVIDIA's next generation CUDA compute architecture: Fermi. Whitepaper, 2010.

[PKGH97] PHARR M., KOLB C., GERSHBEIN R., HANRAHAN P.: Rendering complex scenes with memory-coherent ray tracing. In Proc. ACM SIGGRAPH 97 (1997), pp. 101–108.

[RGD09] RAMANI K., GRIBBLE C. P., DAVIS A.: StreamRay: A stream filtering architecture for coherent ray tracing. In ASPLOS '09: Proc. 14th International Conference on Architectural Support for Programming Languages and Operating Systems (2009), ACM, pp. 325–336.

[SFB∗09] SUGERMAN J., FATAHALIAN K., BOULOS S., AKELEY K., HANRAHAN P.: GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph. 28, 1 (2009), 1–11.

[Smi98] SMITS B.: Efficiency issues for ray tracing. J. Graph. Tools 3, 2 (1998), 1–14.

[SWS02] SCHMITTLER J., WALD I., SLUSALLEK P.: SaarCOR: A hardware architecture for ray tracing. In Proc. ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (2002), pp. 27–36.

[SWW∗04] SCHMITTLER J., WOOP S., WAGNER D., PAUL W. J., SLUSALLEK P.: Realtime ray tracing of dynamic scenes on an FPGA chip. In Proc. ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (2004), pp. 95–106.

[WBB08] WALD I., BENTHIN C., BOULOS S.: Getting rid of packets: Efficient SIMD single-ray traversal using multi-branching BVHs. In Proc. IEEE/Eurographics Symposium on Interactive Ray Tracing 2008 (2008), pp. 49–57.

[WK06] WÄCHTER C., KELLER A.: Instant ray tracing: The bounding interval hierarchy. In Proc. Eurographics Symposium on Rendering 2006 (2006), pp. 139–149.

[WSS05] WOOP S., SCHMITTLER J., SLUSALLEK P.: RPU: A programmable ray processing unit for realtime ray tracing. ACM Trans. Graph. 24, 4 (2005), 434–444.

© The Eurographics Association 2010.