
Page 1: FlexRender: A distributed rendering architecture for ray ...users.csc.calpoly.edu/~zwood/research/pubs/flexrender.pdf · California Polytechnic University, San Luis Obispo, CA, USA

FlexRender: A distributed rendering architecture for ray tracing huge scenes on commodity hardware.

Bob Somers and Zoë J. Wood
California Polytechnic University, San Luis Obispo, CA, USA

[email protected], [email protected]

Keywords: Rendering, distributed rendering, cluster computing

Abstract: As the quest for more realistic computer graphics marches steadily on, the demand for rich and detailed imagery is greater than ever. However, the current “sweet spot” in terms of price, power consumption, and performance is in commodity hardware. If we desire to render scenes with tens or hundreds of millions of polygons as cheaply as possible, we need a way of doing so that maximizes the use of the commodity hardware that we already have at our disposal. We propose a distributed rendering architecture based on message-passing that is designed to partition scene geometry across a cluster of commodity machines, allowing the entire scene to remain in-core and enabling parallel construction of hierarchical spatial acceleration structures. The design uses feed-forward, asynchronous messages to allow distributed traversal of acceleration structures and evaluation of shaders without maintaining any suspended shader execution state. We also provide a simple method for throttling work generation to keep message queueing overhead small. The results of our implementation show roughly an order of magnitude speedup in rendering time compared to image plane decomposition, while keeping memory overhead for message queuing around 1%.

Figure 1: Field of high resolution meshes whose geometric complexity is nearly 87 million triangles. Rendered across 4 commodity machines using FlexRender.

1 INTRODUCTION

Rendering has advanced at an incredible pace in recent years. At the heart of rendering is describing the world we wish to draw, which has traditionally been done by defining surfaces. While exciting developments in volume rendering techniques happen on a regular basis, it is unlikely we will abandon surface rendering any time soon. The champion of surface representations in rendering has been the polygonal mesh. Even when used to store smooth surfaces in a compressed way (as with subdivision surfaces), meshes are usually tessellated into large numbers of fine polygons for rendering. Since their representation is defined explicitly, meshes with fine levels of detail have significantly higher storage requirements. Thus, as the demand for higher visual fidelity increases, the natural tendency is to increase geometric complexity.

Graphics, and ray tracing in particular, has long been said to be a problem that is embarrassingly parallel, given that many graphics algorithms operate on pixels independently. Graphics processing units (GPUs) have exploited this fact for years to achieve amazing throughput of graphics primitives in real time. While processor architectures have become exceedingly parallel and posted impressive speedups, the memory hierarchy has not had time to catch up. For a processor to perform well, the CPI (cycles per instruction) must remain low to ensure time is spent doing useful work and not waiting on data.

In current memory hierarchies, data access time can take anywhere from around 12 cycles (4 nanoseconds for an L1 cache hit) to over 300 cycles (100 nanoseconds for main memory). Techniques such as out-of-order execution are helpful in filling this wasted time, but for memory intensive applications it can be difficult to fill all the gaps with useful work.

Because of this fact, there is a lot more to parallel rendering than initially meets the eye. Graphics may indeed be highly parallel, but its voracious appetite for memory access is actively working against its parallel efficiency on current architectures.

This paper presents FlexRender, a ray tracer designed for rendering scenes with high geometric complexity on commodity hardware. We specifically target commodity hardware because it currently has an excellent cost to performance ratio, but still typically lacks enough memory to fit large scenes entirely in RAM.

Our work describes the following core contributions:

1. A feed-forward messaging design between workers that passes enough state with every ray to never require reply messages.

2. An approach to shading using this messaging design that does not require suspending execution of shaders and maintaining suspended state.

3. An extension to a stackless BVH traversal algorithm (Hapala et al., 2011) that makes it possible to suspend traversal at one worker and resume it at another.

4. A simple throttling method for managing memory overhead for message queueing without resorting to a complicated dynamic scheduling algorithm.

5. Technical insight into the challenges of building such a renderer, an analysis of our results, and an open-source implementation.

In particular, we show that FlexRender can achieve speedups that are around an order of magnitude better than image plane decomposition for commodity machines, and can naturally self-regulate the cluster of workers to keep the memory overhead due to message queuing around 1% of each worker’s main memory.

It is important to remember that FlexRender is an experimental design, and some of our design goals worked better than others. In the hopes of making future experimentation easier for others, we provide the complete source code of FlexRender at http://www.flexrender.org.

2 Background

FlexRender leverages observations from several fundamental building blocks, including radiometry, bounding volume hierarchies, and Morton coding.

Radiometry: For the purposes of FlexRender, we note the following observations of the radiometry model of light:

Light behaves linearly. The combined effect of two rays of light in a scene is the same as the sum of their independent effects on the scene.

Energy is conserved. When a ray reflects off of a surface, it can never do so with more energy than it started with.

FlexRender exploits these observations in the following ways:

The location of computation does not matter. If the scene is distributed across many machines, it makes no difference which machine computes the effect of a ray. The sum of all computations will be the same as if all the work was performed on a single machine.

Transmittance models energy conservation. If we store the amount of energy traveling along a ray (the transmittance) with the ray itself, we need not know anything about the preceding rays or state that brought this ray into existence. We can compute its contribution to the scene independently and ensure that linearity and energy conservation are both respected.

Bounding Volume Hierarchies: The traversal algorithm of a BVH is naturally recursive, but recursive implementations keep their state on the call stack. In FlexRender, BVH traversal may need to be suspended on one machine and resumed on another, so all of the traversal state needs to be explicitly exposed. Refactoring it as an iterative traversal explicitly exposes the state for capturing, but unfortunately it still requires a traversal stack of child nodes that must be visited on the way back up the tree. In FlexRender, rays carry all necessary state along with them, and this approach would require the entire stack to be included as ray state.

However, (Hapala et al., 2011) presents a stackless traversal method. The key insight is that if parent links are stored in the tree, the same traversal can be achieved using a three-state automaton describing the direction you came from when you reached the current node (i.e. from the parent, from the sibling, or from the child). They show that their traversal algorithm produces identical tree walks and never retests nodes that have already been tested.

FlexRender leverages this traversal algorithm due to its low state storage requirements. Each ray only needs to know the index of the current node it is traversing and the state of the traversal automaton.
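As a concrete illustration, the stackless walk can be sketched as a small state machine. This is a simplified sketch in the spirit of (Hapala et al., 2011), not the FlexRender source: it omits near/far child ordering by ray direction, and the box test is abstracted into a predicate. Note that the only persistent state is the pair (current node, automaton state), which is exactly what a ray can carry with it.

```python
# States of the traversal automaton: the direction we arrived from.
FROM_CHILD, FROM_SIBLING, FROM_PARENT = 0, 1, 2

def traverse_stackless(nodes, root, hit, visit_leaf):
    """Walk a BVH with parent links and no stack.
    nodes[i] = (parent, left, right); a leaf has left == right == -1.
    hit(i) is the ray-vs-bounding-box test; visit_leaf(i) records a
    leaf intersection. (Illustrative sketch, not the paper's code.)"""
    current, state = nodes[root][1], FROM_PARENT  # descend into root's first child
    while True:
        parent, left, right = nodes[current]
        if state == FROM_CHILD:
            if current == root:
                return                                           # walk complete
            if current == nodes[parent][1]:                      # finished first child:
                current, state = nodes[parent][2], FROM_SIBLING  # visit its sibling
            else:                                                # finished second child:
                current, state = parent, FROM_CHILD              # keep ascending
        elif not hit(current):
            # Missed the box: try the sibling (if we came from the parent)
            # or ascend (if we came from the sibling).
            current, state = ((nodes[parent][2], FROM_SIBLING)
                              if state == FROM_PARENT else (parent, FROM_CHILD))
        elif left == -1:
            visit_leaf(current)                                  # leaf hit: record it
            current, state = ((nodes[parent][2], FROM_SIBLING)
                              if state == FROM_PARENT else (parent, FROM_CHILD))
        else:
            current, state = left, FROM_PARENT                   # box hit: descend
```

On a tree whose box tests all pass, the walk visits each leaf exactly once without ever revisiting a node, matching the property the authors describe.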


Morton Coding and the Z-Order Curve: Morton coding is a method for mapping coordinates in multidimensional space to a single dimension. In particular, walking the multidimensional space with a Morton coding produces a space-filling Z-order curve.

Figure 2: Examples of two dimensional and three dimensional Z-order curves. Credit: “David Eppstein” and “Robert Dickau” (Creative Commons License)

More concretely, FlexRender needs to distribute scene data to machines in the cluster, without a preprocessing step, in a spatially coherent way. If the geometry on each machine consists of a localized patch of the overall geometry, it allows us to minimize communication between the machines, and thus, only pay the network cost when absolutely needed.

Because the Morton coding produces a spatially coherent traversal of 3D space, dividing up the 1D Morton code address space among all the machines participating in the render gives a reasonable assurance of spatial locality for the geometry sent to each machine.

The Morton coding is also simple to implement. For example, to map a point P in a region of 3D space (defined by its bounding extents min and max) to a Morton-coded 64-bit integer, discretizing each axis evenly allows for 21 bits per axis, yielding a 63-bit address space (and one unused bit in the integer). We compute the Morton code by calculating the 21-bit discretized component of P along each axis, then shifting the components from each axis into the 64-bit integer one bit at a time from the most-significant to least-significant bit.
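The encoding just described can be sketched as follows (a minimal illustration; the function name and argument layout are ours, not the FlexRender API):

```python
def morton_code_63(p, min_ext, max_ext):
    """Map point p in the box [min_ext, max_ext] to a 63-bit Morton code:
    21 bits per axis, interleaved from most- to least-significant bit."""
    BITS = 21
    comps = []
    for axis in range(3):
        span = max_ext[axis] - min_ext[axis]
        # Discretize this axis into 2^21 evenly spaced cells.
        cell = int((p[axis] - min_ext[axis]) / span * ((1 << BITS) - 1))
        comps.append(max(0, min(cell, (1 << BITS) - 1)))  # clamp into range
    code = 0
    for i in range(BITS - 1, -1, -1):   # most-significant bit first
        for c in comps:                 # one bit from each axis per level
            code = (code << 1) | ((c >> i) & 1)
    return code                         # fits in 63 bits of a 64-bit integer
```

Nearby points share long Morton-code prefixes, which is what gives the address-space split described above its spatial coherence.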

3 Related Work

A number of algorithms are designed to completely avoid large amounts of geometry altogether. Displacement maps (Krishnamurthy and Levoy, 1996) and normal maps using texture data (Cohen et al., 1998) and (Cignoni et al., 1998) are commonly used, especially in real-time applications. However, normal mapping displays visual artifacts, particularly around the edges of the mesh or in areas where the texture is stretched or compressed. Similarly, level of detail is commonly used to replace high resolution meshes with lower resolution versions when they are far enough away from the viewer (Clark, 1976). However, it requires the creation of multiple resolution meshes and algorithms to smoothly handle resolution changes.

More recently, (Djeu et al., 2011) provides a framework for efficiently tracing multiresolution geometry, but it has large memory requirements and is not applied to clusters of commodity machines. Similarly, (Afra, 2012) presents a voxel-based paging system designed for interactive use, but it requires an expensive preprocessing step and displays visual artifacts at low LOD levels.

In (DeMarle et al., 2004), distributed shared memory is used to page in necessary data for rendering, but it relies on a complicated task scheduler to avoid flooding the network with large data blocks and it suffers from cache storms in a multithreaded environment. Efficient memory use is also the focus of (Moon et al., 2010), which obtains an order of magnitude speedup from reordering ray evaluation in a cache-oblivious way. This approach would complement our method very well.

The focus of (Reinhard et al., 1999) was development of a hybrid work scheduler which would improve performance with more effective hardware utilization, but it had problems with high memory overhead. Earlier this year, (Navratil et al., 2012) presented an improved approach which uses a dynamic scheduler to move work from an image plane decomposition to a domain decomposition as rays begin to diverge.

A similar approach to ours was presented in (Badouel et al., 1994), but it failed to address the issue of shader execution within its asynchronous cluster design. Another similar approach is the Kilauea renderer in (Kato and Saito, 2002), but its pipeline is very latency sensitive and network communication is heavy due to rays being duplicated across many workers.

Recent research in strict out-of-core methods has focused on techniques for using main memory as an efficient cache. In (Kontkanen et al., 2011), they present a technique for dealing with huge point clouds used in point-based global illumination and demonstrate impressive cache hit rates of 95% and higher. Prior to that, (Pantaleoni et al., 2010) generated directional occlusion data using ray tracing, which was compressed into spherical harmonics for evaluation at render time. Ray tracing was not used for primary visibility. Like FlexRender, they also used hierarchical BVHs, but for the purpose of chunking scene data into units of work that could be loaded into main memory quickly.

Using GPGPU techniques, (Garanzha et al., 2011) presented a method for ray tracing large scenes that are out-of-core on the GPU. Parts of their architecture are similar to FlexRender on a single-device scale, specifically their use of hierarchical BVHs and ray queues to handle rays cast from shaders. However, the global ray queue is fairly memory intensive, potentially storing up to 32 million rays.

Leveraging MapReduce (Dean and Ghemawat, 2004), an implementation of a ray tracer was presented by (Northam and Smits, 2011), but sending large amounts of scene data to workers significantly slowed down processing. Their solution was to break the scene into small chunks and resolve the intersection tests in the reduce step. Unfortunately, this does a lot of unnecessary work, because many non-winning intersections are computed that would have been pruned in a typical BVH traversal.

4 FlexRender Architecture

In general, rendering with FlexRender proceeds in the following way:

1. Read in the scene data and define the maximum bounding extents of the entire scene. As data is read in, it is distributed to the workers using the Morton coding.

2. Once the scene is distributed, workers build BVHs in parallel for their respective local region of the scene. When complete, they send their maximum bounding extents to the renderer.

3. The renderer constructs a top-level BVH from each of the workers’ bounds and sends this top-level BVH to all the workers.

4. Workers create image buffers. All shading computed on a worker is written to its own buffer.

5. Workers cast primary rays from the camera and test for intersections.

• During intersection testing, rays may be forwarded over the network to another worker if they need to test geometry within the bounding extents of that worker.

• Once the nearest intersection is found, illumination ray messages are created and sent to workers that have lights. These workers send light rays back towards the point of intersection to test for occlusion.

• If the light rays reach the original intersection, the point is illuminated and the worker computes shading.

6. Once all rays have been accounted for, workers send their image buffers back to the renderer, which composites them together into a final image.

In the next few sections, we will examine each part of this process in greater detail.

4.1 Workers and the Renderer

There are two potential roles a machine can play during the rendering process.

Worker These machines receive a region of the scene and act as ray processors, finding intersections and computing shading. They produce an image that is a component of the final render. There may be an arbitrary number of workers participating.

Renderer This machine distributes scene data to the workers. Once rendering begins, it monitors the status of the workers and halts any potential runaway situations (see Section 4.4.3). When the renderer decides the process is complete, it requests the image components from each worker and merges them into the final image. There is only a single renderer in any given cluster.

The network architecture is client/server, connected in a star topology. Each worker exposes a server which processes messages and holds client connections open to every other worker for passing messages around the cluster. The renderer also holds a client connection to every worker for sending configuration data, preparing the cluster for rendering (described in Section 4.3), and monitoring render progress (described in Section 4.5).

The current implementation does not include any provisions for fault tolerance. However, there is no architectural restriction preventing such measures. For topics involving the design of robust and resilient clusters, we refer the reader to the distributed systems literature.

The graphics machinery is fairly straightforward. A scene consists of a collection of meshes, which are stored as indexed face sets of vertices and faces. Each mesh is assigned a material, which consists of a shader and potentially a set of bindings from textures to names in the shader. A shader is a small program that computes the lighting on the surface at a particular point.


4.2 Fat Rays

The core message type in FlexRender is the fat ray. They are so named because they carry additional state information along with their geometric definition of an origin and a direction. Their counterparts, slim rays, consist of only the geometric components.

Specifically, a fat ray contains the following data, which is detailed in the subsequent sections:

• The type of ray this is.

• The source pixel that this ray contributes to.

• The bounce count, or number of times this ray has reflected off a surface (to prevent infinite loops).

• The origin and direction of the ray.

• The ray’s transmittance (related to the amount of its final pixel contribution).

• The emission from a light source carried along the ray, if any.

• The target intersection point of the ray, if any.

• The traversal state of the top-level BVH.

• The hit record, which contains the worker, mesh, and t value of the nearest intersection.

• The current worker this ray should be forwarded to over the network.

• The number of workers touched by the ray. (Not used for rendering, only analysis.)

• A next pointer for locally queuing rays. (Obviously invalid over the network.)

In total, the size of a fat ray is 128 bytes.
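As a sketch, the fields above might be grouped like this. The field names and types here are illustrative guesses; the actual 128-byte message is a packed binary structure whose exact layout the paper does not give.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class FatRay:
    kind: int                      # ray type: intersection, illumination, or light
    source_pixel: Tuple[int, int]  # pixel this ray ultimately contributes to
    bounces: int                   # reflection count, guards against infinite loops
    origin: Vec3
    direction: Vec3
    transmittance: float           # energy fraction reaching the final pixel
    emission: Vec3                 # light energy carried along the ray, if any
    target: Optional[Vec3]         # intersection point a light ray must reach
    traversal_node: int            # index of the current top-level BVH node
    traversal_state: int           # three-state traversal automaton value
    hit: Optional[Tuple[int, int, float]]  # hit record: (worker, mesh, t)
    current_worker: int            # worker this ray should be forwarded to
    workers_touched: int           # analysis only, not used for rendering
    # The local 'next' pointer for queuing is omitted here: it is
    # meaningless once the ray is serialized over the network.
```

Keeping every field a ray could need in the message itself is what lets the design stay feed-forward: no worker ever has to ask the sender for more context.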

4.3 Render Preparation

To begin rendering, FlexRender ensures that all the workers agree on basic rendering parameters, such as the minimum and maximum extents of the scene, which are used for driving Morton coding discretization along each axis. Once this data has been synchronized with all workers, each worker establishes client connections with every other worker that remain open for the duration of the render.

The renderer then divides up the 63-bit Morton code address space evenly by the number of workers participating and assigns ranges of the address space (which translate into regions of 3D space) to each worker. At this point the renderer begins reading in scene data and takes the following actions with each mesh:

1. Compute the centroid of the mesh by averaging its vertices.

2. Compute the Morton code of the mesh centroid. The range this falls in determines which worker the mesh will be sent to.

3. Ensure that any asset dependencies (such as materials, shaders, and textures) for this mesh are present on the designated worker.

4. Send the mesh data to the designated worker.

5. Delete the mesh data from renderer memory.
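Steps 1 and 2 can be sketched together as follows. The function is a hypothetical helper, not the FlexRender source; the even split of the 63-bit address space follows the description above.

```python
def worker_for_mesh(vertices, scene_min, scene_max, num_workers):
    """Return the index of the worker a mesh should be sent to, by
    Morton-coding its centroid and mapping the code into one of
    num_workers equal ranges of the 63-bit address space."""
    # Step 1: centroid by averaging the vertices.
    n = len(vertices)
    centroid = [sum(v[a] for v in vertices) / n for a in range(3)]
    # Step 2: 21-bit discretization per axis, bits interleaved MSB-first.
    comps = []
    for a in range(3):
        span = scene_max[a] - scene_min[a]
        c = int((centroid[a] - scene_min[a]) / span * ((1 << 21) - 1))
        comps.append(max(0, min(c, (1 << 21) - 1)))
    code = 0
    for i in range(20, -1, -1):
        for c in comps:
            code = (code << 1) | ((c >> i) & 1)
    # The 63-bit address space is divided evenly among the workers.
    range_size = (1 << 63) // num_workers + 1
    return code // range_size
```

Because spatially close centroids produce numerically close Morton codes, meshes assigned to the same worker tend to form a localized patch of the scene.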

Although the current implementation reads scene data in at the renderer and distributes it over the network, there is no inherent reason why it needs to do so. For example, if all workers have access to shared network storage, the renderer could simply instruct each worker to load a particular mesh itself, reducing load time.

4.3.1 Parallel Construction of Bounding Volume Hierarchies

Next, workers construct a BVH for each mesh they have, for accelerating spatial queries against the meshes. While building each BVH, the worker tracks the bounding extents of each mesh. Workers then build a root BVH for the entire region owned by the worker with the mesh extents as its leaf nodes. When testing for intersections locally, a worker first tests against this root BVH to determine candidate meshes, then traverses each mesh’s individual BVH to compute absolute intersections. After construction of this root BVH, the root node’s bounding extents describe the spatial extent of all geometry owned by the worker.

Once construction of all local BVHs is complete, workers send their root bounding extents to the renderer. The renderer then constructs a final top-level BVH with worker extents as its leaf nodes. This top-level BVH is then distributed to all workers. This is a quick and lightweight process, since the number of nodes is only 2n − 1, where n is the number of workers. This top-level BVH guides where rays are sent over the network in Section 4.4.4.

4.4 Ray Processing

Workers are essentially multithreaded ray processors. Their general work consists of processing rays by testing for intersection, potentially forwarding rays to another worker, or computing shading values if a ray terminates.

To manage these various tasks when all the geometry and lights may not be local to the worker, there are three different types of rays:


• Intersection rays are very much like traditional rays, which identify points in space that require shading.

• Illumination ray messages are copies of intersection rays that have terminated. They are sent to workers who have emissive geometry and assist in the computation of shading.

• Light rays are Monte Carlo samples that contribute direct illumination to a point being shaded.

Figure 3: The three different ray types and their interactions.

The typical lifetime of these various rays is as follows:

1. An intersection ray is cast into the scene.

2. When that intersection ray terminates at a point in space, it dies and spawns an illumination ray message sent to each worker with emissive geometry.

3. When those illumination ray messages are delivered, they die and spawn light rays cast toward the original intersection.

4. If they terminate at the original intersection, the worker computes shading for it. When those light rays terminate, they die.

A single intersection ray can spawn many additional rays, and light rays are the most likely to terminate without generating additional rays. These are key factors for deciding ray processing order for the cluster.

4.4.1 Illumination

In FlexRender there is no special “light” type, only meshes that are emissive, which is a property set by the assigned material. Emissive meshes are known to inject light into the scene through their shader. During scene loading, the renderer maintains a list of workers that have been sent at least one emissive mesh and sends the list to all workers before rendering begins.

Once intersection testing has identified a point needing shading, FlexRender must determine the visibility of light sources from that point. In a traditional ray tracer, shadow rays are cast from the point of intersection to sample points on the surface of the light to determine visibility. However, from the perspective of the worker computing shading, it has no idea where the lights in the scene are, or if there is any geometry occluding the light; it just knows which workers have emissive meshes.

To solve this problem, illumination ray messages are used to request that emissive meshes instead test for visibility from their perspective. In this way, FlexRender traces an equivalent of shadow rays but in the opposite direction, which we call light rays. Instead of originating at the intersection point and traveling in the direction of the light, they originate at the light and travel in the direction of the intersection point. Area lights are supported through Monte Carlo integration by creating multiple light rays with origins sampled across the surface of the light.

If a light ray returns to the original intersection point after being traced through the cluster, which can be determined via the information carried forward in the fat rays, the renderer knows its path from the light was unoccluded. The advantage of this method is that it leverages the same method for tracing rays through a distributed BVH that is used for testing ray intersections.

In summary, illuminating an intersection is done as follows:

1. On the worker where an intersection was found, copy the ray data into an illumination ray message with the target set to the point of intersection.

2. Send the illumination ray message to all emissive workers.

3. When a worker receives an illumination ray message, generate sample points across the surface of all emissive meshes. Set these sample points as the origins for new light rays.

4. Set the directions of all the light rays such that they are oriented in the direction of the target.

5. Push each light ray into the ray queue and let the cluster process them as usual.
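The steps above can be sketched as follows. The mesh sampling interface here is a hypothetical stand-in, not the FlexRender API, and pushing the resulting rays into the queue (step 5) is left to the caller.

```python
import math

def spawn_light_rays(target, emissive_meshes, samples_per_mesh=4):
    """Build light rays for one illumination ray message: sample each
    emissive mesh's surface (step 3) and aim a ray from each sample at
    the target intersection point (step 4). Returns a list of
    (origin, direction, target) tuples."""
    rays = []
    for mesh in emissive_meshes:
        for _ in range(samples_per_mesh):
            origin = mesh.sample_surface()        # assumed sampling method
            d = [t - o for t, o in zip(target, origin)]
            norm = math.sqrt(sum(c * c for c in d)) or 1.0  # guard zero-length
            rays.append((origin, tuple(c / norm for c in d), target))
    return rays
```

Carrying the target point in each light ray is what later lets a worker check whether the ray reached the original intersection unoccluded.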

4.4.2 Ray Queues

To manage the numerous rays of various types present in the system, each worker has a ray queue, with typical push and pop operations. Internally, this is implemented as three separate queues, where rays are separated by type. It also contains information about the scene camera, for generating new primary rays. When a new ray arrives at a worker, it is immediately pushed into the queue according to its type (intersection, illumination, or light).


When the worker pops a ray from the queue, rays are pulled from the internal queues in the following order:

1. The light queue, since these are least likely to generate new rays.

2. If the light queue is empty, the worker pops from the illumination queue, since these will generate a limited number of new rays.

3. If the illumination queue is empty, the worker pops from the intersection queue, since these can generate the most new rays.

4. If all of the internal queues are empty, the worker uses the camera to cast a new primary ray into the scene.
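A minimal sketch of this pop policy (the class and field names are ours, not the FlexRender source):

```python
from collections import deque

class RayQueue:
    """Three internal FIFOs popped in an order that favors rays least
    likely to spawn new work; falls back to casting primary rays."""
    def __init__(self, camera_rays):
        self.light = deque()
        self.illumination = deque()
        self.intersection = deque()
        self.camera_rays = camera_rays          # iterator of primary rays

    def push(self, ray):
        # ray.kind is "light", "illumination", or "intersection".
        getattr(self, ray.kind).append(ray)

    def pop(self):
        if self.light:                          # 1. least likely to spawn work
            return self.light.popleft()
        if self.illumination:                   # 2. spawns a bounded number of light rays
            return self.illumination.popleft()
        if self.intersection:                   # 3. can spawn the most new rays
            return self.intersection.popleft()
        return next(self.camera_rays, None)     # 4. idle: cast a new primary ray
```

Draining the cheapest rays first keeps the in-flight ray population, and therefore the queuing memory, small.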

Processing rays as a priority queue is essential for managing the exponential explosion of work that can occur if too many primary rays enter the system at once. It also helps reduce memory required for queuing because new work is not generated until the worker has nothing else to work on.

4.4.3 Primary Ray Casting

To give the cluster the ability to regulate itself, each worker in the cluster is responsible for casting a portion of the primary rays in the scene. Consider the case where a worker has not received any work recently from other workers in the cluster for whatever reason. By giving the worker control over primary ray casting for a portion of the scene, we enable workers to generate work when they have nothing to do.

However, in the case where a worker contains mostly background geometry, it will infrequently receive work from others because its size in screen space is small, yet it is in charge of casting primary rays for a larger slice of the image. In this case the worker may go on an unfettered spree of primary ray casting, generating lots of work for others and little for itself. To prevent this “runaway” case from overburdening the ray queues, workers report statistics about their progress to the renderer occasionally (10 times per second in our implementation). If a worker is getting significantly further ahead of the others in primary ray casting, the renderer temporarily disables primary ray casting on that worker.

This is not ideal, since lightly loaded workers will sit idle until they receive work to do, but it is simple to implement and works remarkably well in practice for mitigating the ray explosion and associated memory required for message queueing. See (Reinhard et al., 1999) and (Navratil et al., 2012) for more sophisticated approaches that use dynamic scheduling to increase performance.
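One way to sketch the renderer-side check: the `slack` threshold below is an assumed tuning knob, since the paper does not give a concrete formula for "significantly further ahead".

```python
def primary_casting_enabled(primaries_cast, slack=10000):
    """Given each worker's reported count of primary rays cast so far,
    return which workers may continue casting: any worker more than
    `slack` rays ahead of the slowest is temporarily paused.
    (Sketch of the throttling policy, not the FlexRender source.)"""
    slowest = min(primaries_cast.values())
    return {worker: cast - slowest <= slack
            for worker, cast in primaries_cast.items()}
```

The renderer would re-evaluate this on every statistics report (10 times per second in the implementation above), re-enabling casting on a paused worker once the others catch up.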

4.4.4 Distributed BVH Traversal

Distributed traversal of the scene BVH happens at two levels. Recall that the top-level BVH contains bounding boxes for each worker and is present on all workers. Below that, each worker contains a BVH consisting of only the mesh data present on that worker.

When a ray is generated, traversal begins at the root of the top-level tree and proceeds as described by (Hapala et al., 2011). When a leaf node intersection occurs closer to the camera than the current closest intersection, the traversal state is packed into the ray and the ray is forwarded to the candidate worker. The packed traversal state includes the index of the current top-level node and the state of the traversal automaton (from the parent, from the sibling, or from the child). The logic for top-level traversal is shown in Algorithm 1.
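The packed state described above might look like the following (an illustrative layout; the field and type names are ours). The key point is that two small fields are enough for the receiving worker to reinstate the stackless traversal of (Hapala et al., 2011) exactly where it left off:

```cpp
#include <cstdint>

// The three states of the stackless-traversal automaton: which neighbor
// the traversal arrived from.
enum class From : std::uint8_t { Parent, Sibling, Child };

// Traversal state carried inside a forwarded ray (illustrative layout).
struct PackedTraversal {
    std::uint32_t node_index;  // current node in the shared top-level BVH
    From          state;       // automaton state at the moment of packing
};
```

Because every worker holds an identical copy of the top-level BVH, the node index is meaningful on any machine, and no suspended call stack ever needs to cross the network.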

Algorithm 1: Top-level BVH traversal.

    while current node != root do
        if ray intersects leaf node then
            if leaf node t < ray t then
                pack traversal state into ray;
                send ray to leaf node worker;
            end
        end
    end

When a ray is received over the network and unpacked at the next worker, the traversal state is examined. If the traversal state points to the root of the top-level tree, the scene traversal has completed, and the worker can shade the intersection if it was the worker on which that intersection occurred. Otherwise, the ray must be forwarded to the worker who won, since that worker has the shading assets.

If the traversal state is not at the root, the worker tests the ray against its local mesh BVH to find the closest intersection on this worker. If the result of that test is closer to the camera than the current closest intersection, the worker updates the hit record of the ray and resumes traversal of the top-level BVH by reinstating the traversal state packed into the ray and continuing where it left off. This logic is shown in Algorithm 2.

In the best case, a ray is generated on, intersects only with, and is shaded on a single worker. These rays never need to touch the network. In the worst case, the ray potentially intersects geometry on all workers, so it must visit every node in the tree.

In addition, there are two interesting corner cases to consider.


Algorithm 2: Worker-level BVH traversal.

    if ray traversal state == root then
        if winning worker == this worker then
            shade;
        else
            send ray to winning worker;
        end
    else
        test for intersections against worker BVH;
        if worker t < ray t then
            write hit record into ray;
        end
        resume top-level traversal;
    end

• If the ray is generated on a node that it does not intersect until later in the traversal, it consumes one additional network hop at the very beginning of the traversal.

• If the ray completes its traversal on a worker different from the one that had the closest intersection, it consumes one additional network hop at the end of the traversal to put the ray on the correct worker for shading.

Thus, the worst case scenario actually touches n + 2 workers, where n is the number of workers in the cluster.

We discuss possible optimizations that may eliminate both of these corner cases in the Future Work section. Additionally, in Section 5 we show that, in practice, the number of workers that each ray touches is less than or equal to the theoretical O(log n) performance of the binary tree for the majority of rays in the scene.

4.4.5 Shading

When a light ray arrives at a worker, it checks that its point of intersection is within some epsilon of the target. If it is, the worker considers the light sample visible, looks up the material, shader, and textures for the mesh, and runs the shader. The shader is responsible for writing its computed values into the worker's image buffer.

A shader may do any (or none) of the following:

• Sample textures.

• Compute light based on a shading model.

• Accumulate computed light into image buffers.

• Cast additional rays into the scene.

When new rays are cast into the scene from a shader, the results of that cast are not immediately available. Instead, the cast pushes the new rays into the queue for processing, and the traversal and shading systems ensure that the result of secondary and n-ary traced rays will be included in the final image. The source pixel of the new ray is inherited from its parent to ensure that it contributes to the correct pixel in the final image. In addition, the desired transmittance along the new ray is multiplied with the transmittance of the parent ray to ensure energy conservation is preserved.
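The pixel inheritance and transmittance accumulation can be sketched as follows (a minimal sketch; the type and field names are ours, and FlexRender's fat rays carry additional state):

```cpp
// Bookkeeping a shader-cast child ray inherits from its parent.
struct TraceRay {
    int   pixel_x, pixel_y;  // which image pixel this ray contributes to
    float transmittance;     // fraction of energy surviving along the path
};

// Spawn a child ray: the source pixel carries over so the result lands in
// the right place, and transmittance is multiplied down the path so energy
// conservation is preserved regardless of which worker finishes the ray.
TraceRay spawn_child(const TraceRay& parent, float surface_transmittance) {
    TraceRay child = parent;  // inherit the source pixel
    child.transmittance *= surface_transmittance;
    return child;
}
```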

4.5 Render Completion

In a traditional recursive ray tracer, determining when the render is complete is a simple task: when the last primary ray pops its final stack frame, the render is over. In FlexRender, however, no one worker (or the renderer, for that matter) knows where the "last ray" is in the cluster. To determine when a render is complete, the workers report statistics about their activities to the renderer at regular intervals (in our current implementation, 10 times per second). We found the following four statistics to be useful for determining render completion:

Primary Ray Casting Progress  The portion of the worker's primary rays that have been cast into the scene.

Rays Produced  The number of rays created at the worker during this measurement interval. This includes new intersection rays cast from the camera or a shader, illumination ray messages created by terminating intersection rays, or light rays created by processing illumination ray messages.

Rays Killed  The number of rays finalized at the worker during this measurement interval. This includes intersection rays that terminated or did not hit anything, illumination ray messages that were destroyed after spawning light rays, or light rays that hit occluders or computed a shading value.

Rays Queued  The number of rays currently in this worker's ray queue.

If any worker has not finished casting its primary rays, we know for certain the render is not complete. Secondly, if no rays are produced, killed, or queued at any worker for some number of consecutive measurement intervals, it is reasonable to assume that the render has concluded. Our current implementation (reporting at 10 Hz) waits for 10 consecutive intervals of "uninteresting" activity on all workers before declaring the render complete. In practice, this achieves a reasonable balance between ending the render as quickly as possible and risking concluding it before all rays have been processed.
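The quiescence test above can be sketched as follows (the class and field names are ours; in the real system the statistics arrive over the network at 10 Hz rather than via a direct call):

```cpp
#include <vector>

// Per-worker statistics for one reporting interval (illustrative fields).
struct WorkerStats {
    bool primary_done;           // has this worker cast all its primary rays?
    long produced, killed, queued;
};

// Declare the render complete only after every worker has finished primary
// casting AND the whole cluster has been quiet (nothing produced, killed,
// or queued) for `needed` consecutive reporting intervals.
class CompletionDetector {
    int quiet_ = 0;
    int needed_;
public:
    explicit CompletionDetector(int needed = 10) : needed_(needed) {}

    bool report(const std::vector<WorkerStats>& stats) {
        for (const auto& s : stats) {
            if (!s.primary_done || s.produced || s.killed || s.queued) {
                quiet_ = 0;  // any activity resets the quiescence counter
                return false;
            }
        }
        return ++quiet_ >= needed_;
    }
};
```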

4.5.1 Image Synthesis

Once the render has been deemed "complete", the renderer requests the image buffers from each worker. Because all rendering was computed by respecting linearity, computing a pixel in the final output image simply requires summing the corresponding pixels in the worker buffers.
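Because shading respects linearity, the compositing step is just a per-pixel sum of the worker buffers. A grayscale sketch (the real buffers are HDR color images written out as OpenEXR):

```cpp
#include <cstddef>
#include <vector>

// Sum the per-worker image buffers into the final image. Each buffer holds
// one float per pixel; all buffers are assumed to be the same size.
std::vector<float> composite(const std::vector<std::vector<float>>& buffers) {
    std::vector<float> out(buffers.front().size(), 0.0f);
    for (const auto& buf : buffers)
        for (std::size_t i = 0; i < buf.size(); ++i)
            out[i] += buf[i];  // linearity: contributions simply add
    return out;
}
```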

This process can yield some interesting intermediate images, as seen in Figure 4. Each worker's buffer represents the light in the final image that interacts in some way with geometry present on that worker. For direct light, this shows up as shaded samples where geometry was present and dark areas where it was not. For other effects, such as reflections and global illumination, we see the reflected light that interacted with the geometry present on the worker.

Figure 4: The bottom images show the geometry split between two workers. The composite image is on top. Note how the Buddha reflects the left side geometry in the left image and the right side geometry in the right image. The actual Buddha mesh data was distributed to the worker on the left.

5 RESULTS

For testing, we used 2–16 Dell T3500 workstations with dual-core Intel Xeon W3503 CPUs clocked at 2.4 GHz. Each workstation had 4 GB of system RAM, a 4 GB swap partition, and was running CentOS 6. FlexRender consists of around 14,000 lines of code, mostly C++. We leveraged several popular open source libraries, including LuaJIT, libuv, MsgPack, GLM, and OpenEXR. We used pfstools for tone mapping our output images and pdiff for computing perceptual diffs.

5.1 Test Scene

Our test scene, "Toy Store", has a geometric complexity of nearly 42 million triangles. The room geometry is relatively simple, but the toys on the shelves are unique (non-instanced) copies of the Stanford bunny, Buddha, and dragon meshes that have been remeshed from their highest quality reconstructions down to 14,249 faces, 49,968 faces, and 34,972 faces, respectively. There are 30 toys per shelf and 42 shelves in the scene, for a total of nearly 1,300 meshes. Approximately one quarter of the meshes are rendered using a mirrored shader, while the others use a standard Phong shader. The mirrored shader is what causes some black pixels in the image, due to the scene not being completely enclosed. The scene consists of 1.09 GB of mesh data and 5 GB of BVH data once the acceleration structures are built.

We rendered a test image at a resolution of 1024×768 with no subpixel antialiasing, 10 Monte Carlo samples per light (32 lights in the scene), and a recursion depth limit of 3 bounces. For discussing speedups initially, we will focus on the 8-worker case (8 workers in the image plane decomposition configuration vs. 8 workers in the FlexRender configuration), but we will discuss varying cluster sizes in Section 5.6.

5.2 Image Plane Decomposition vs. FlexRender

In the image plane decomposition case, we consider 8 workers by chopping up the image into several vertical slices, with a different machine responsible for each slice (see Figure 5). The slices are then reassembled to form the final output image.

We report the time for each machine to complete each phase of rendering (loading the scene, building the BVH, and rendering its slice of the image) in Table 1. The total duration of the render is 10,061 seconds (the slowest worker's time), because the last slice is needed before the final image can be assembled. We report average times for the image plane decomposition workers compared to FlexRender for each of the 3 rendering phases in Table 2, showing FlexRender's speedup of 6.57 times over the average case. We also report the total render time (all three phases combined), using the slowest image plane decomposition worker (time to final image).

Figure 5: Slices of the Toy Store scene rendered with 8 machines using image plane decomposition (composite result image shown in Figure 6).

Worker   Load   BVH   Render   Total
   1      288   169      533     990
   2      290   172     1119    1581
   3      290   167     1491    1948
   4      289   261     9511   10061
   5      295   174     1792    2261
   6      290   192      677    1159
   7      287   167     1109    1561
   8      294   171      303     768

Table 1: Time for each phase of the IPD algorithm, for each of the 8 workers, in seconds. These include loading the scene, building the BVH, rendering the slice, and the total time of all three phases. Worker 4 was hit particularly hard with divergent secondary rays, and spent a large amount of time in I/O wait.

Figure 6: Toy Store scene, comprising nearly 1,300 meshes and 42 million triangles.

             IPD               FlexRender   Speedup
Loading      290.4 (avg)              326     0.89x
Build BVH    183.6 (avg)               20     9.18x
Rendering    2066.9 (avg)            1186     1.74x
Total        10061 (slowest)         1532     6.57x

Table 2: Time (in seconds) for each phase of rendering with FlexRender versus the IPD configuration.

5.3 Geometry Distribution

To ensure the entire scene stays in-core, FlexRender must distribute the geometry across the available RAM in the cluster effectively. With the exception of one worker, the Morton coding and Z-order curve did a decent job of partitioning the scene data evenly. Figure 7 shows a breakdown of the geometry distribution. The one worker which did not contain very much geometry was in the top corner of the Toy Store closest to the camera. This octant of the scene only contained a fill light facing the rest of the geometry.

Figure 7: Percentage of total geometry that was distributed to each worker. Other than the worker with little geometry (top corner with only a fill light), the size of geometry per worker varied from 9% (106 MB of geometry and 488 MB of BVH data) to 20% (221 MB of geometry and 1007 MB of BVH data) of the entire scene.

5.4 Network Hops

The intent of the top-level BVH is to reduce network cost by only sending rays across the network when they venture into that worker's region of space. Since a BVH exhibits O(log n) traversal, we expect that with 8 workers, rays would be processed by 3 workers in the average case.

Table 3 shows that 63.2% of rays are handled by 3 workers or less, and nearly 16% never even require the network (they are generated on, intersect with, and are shaded on a single worker). In addition, 36.8% of rays must touch more than the expected 3 workers. However, including hops for the corner cases mentioned in Section 4.4.4, 82% of rays are at or below (log n) + 1 workers and 90.9% of rays are at or below (log n) + 2 workers.

# of Workers    1       2       3       4       5       6       7       8+
% of Rays     15.8%   25.6%   21.8%   18.8%   8.9%    7.4%    1.5%    0.2%

Table 3: Percentage of rays that were processed by the given number of workers.

5.5 Ray Queue Sizes

Keeping the ray queue size small is critical to the long-term health of the render. If rays begin queuing up faster than the cluster can process them, eventually the cluster will begin swapping when accessing rays, which violates our fundamental performance goal of staying in-core.

Because of this, we logged ray queue sizes on each worker over time while rendering Toy Store on a cluster of 16 workers. Figure 8 shows the queue sizes over time, with workers represented as different colors. Table 4 breaks down the average queue size, as well as the maximum size over the course of the entire render and the storage demands of that maximum size, both in terms of raw storage space and as a fraction of the system RAM.

Most workers sit comfortably below 1% of their RAM being used for queued rays, while the busiest worker used just over 1%. This demonstrates that our work throttling mechanisms, while simple, keep the cluster from generating more work than it can handle.

Figure 8: Number of rays queued on each worker over time. Different colors correspond to different workers.

Worker   Avg Queued   Max Queued   Max Storage   Memory Use
   1            107        12221       1.49 MB        0.04%
   2            506        29796       3.64 MB        0.09%
   3            969        34970       4.27 MB        0.10%
   4           3116        57597       7.03 MB        0.17%
   5           1687        36981       4.51 MB        0.11%
   6           1430        44053       5.38 MB        0.13%
   7           4070        80899       9.88 MB        0.24%
   8           2000        16861       2.06 MB        0.05%
   9           3767        97429      11.89 MB        0.29%
  10           7921       178191      21.75 MB        0.53%
  11           4477        69702       8.51 MB        0.21%
  12           4943        92043      11.24 MB        0.27%
  13          51164       430736      52.58 MB        1.28%
  14           2193        45428       5.55 MB        0.14%
  15              0            0          0 MB        0.00%
  16             18         1374       0.17 MB        0.01%

Table 4: Size of ray queues when rendering Toy Store with a 16-worker FlexRender cluster.

5.6 Cluster Size

To evaluate the scalability of the FlexRender architecture, we ran Toy Store renders with cluster sizes of 4, 8, and 16 workers. For comparison, we ran the same renders in the image plane decomposition configuration, also with 4, 8, and 16 workers.

Table 5 shows a continual and impressive improvement in the BVH construction time as the number of workers increases. This is thanks to the huge parallelization speedup from partitioning the scene geometry with Morton coding and building each subtree in parallel.

We also see render time improvements of roughly an order of magnitude, depending on how FlexRender chooses to distribute the geometry based on the number of workers available. It is likely that this performance advantage will slowly erode at much larger cluster sizes (due to increasing network communication costs), but for small to medium cluster sizes we see approximately linear scaling.

5.7 Example Renders

Figure 6 shows the final Toy Store scene render, with nearly 42 million triangles and 1,300 meshes.

Figure 1 shows a scene with even higher resolution meshes, bringing the geometric complexity up to 87 million triangles, rendered on a cluster of 4 machines.

Figure 9 shows a monochromatic Cornell Box rendered by two workers. The final composite is on the left, while the individual image buffers are stacked on the right. With only direct lighting and no reflections, this clearly shows which geometry was assigned to each worker.

# of Workers         4        8       16
IPD BVH Build      203      184      171
Flex. BVH Build     39       20       10
Speedup          5.21x    9.20x    17.1x
IPD Render       14833    10061     7702
Flex. Render      1742     1532      970
Speedup          8.51x    6.57x    7.94x
IPD Total        14536     9800     7413
Flex. Total       1405     1206      540
Speedup         10.35x    8.13x   13.73x

Table 5: Comparison of cluster sizes with both image plane decomposition (IPD) and the FlexRender configuration (Flex). Times are in seconds.

Figure 9: A monochromatic Cornell Box, showing the distribution of geometry between two workers.

Figure 10 demonstrates FlexRender's Lua-based shader system. The bunny mesh is toon shaded, while the Buddha mesh has a perfect mirror finish. Reflection rays for the mirrored Buddha and floor are cast directly from the shaders.

5.8 Summary

With these results, we have demonstrated that FlexRender meets our claimed contributions in the following ways.

• Rays carry enough state that they never require replies, thanks to the feed-forward messaging design.

• This design allows the implementation of a shader system that requires no suspension of shader execution state as rays are moved between workers.

• FlexRender can correctly and efficiently traverse the scene when workers only contain spatially local BVHs, linked by a top-level BVH.

Figure 10: A Cornell Box with a toon shaded bunny and mirrored Buddha.

• The system keeps itself regulated and reduces memory usage by throttling new work generation and processing rays in an intelligent order.

• FlexRender shows consistent speedups over image plane decomposition, which suffers greatly from swapping to disk frequently during the rendering process.

5.9 Future Work

FlexRender could benefit from specialized tuning to scheduling and spatial distribution. Specifically, adding dynamic task scheduling as described in (Navratil et al., 2012) could dramatically improve performance, and adopting a more sophisticated spatial distribution method as described in (Badouel et al., 1994) could also help with workload distribution.

The cache-oblivious ray reordering technique presented in (Moon et al., 2010) would complement our work quite nicely, and could offer another order of magnitude improvement in performance. Explorations with multiresolution geometry, or coupling FlexRender with existing out-of-core methods, would also dramatically increase the complexity of scenes FlexRender can handle efficiently.

With respect to our existing implementation, some network hops could be avoided when the ray finishes traversing the top-level BVH but ends up on a worker that does not contain the point of intersection. Because that worker has no shading assets for the terminating geometry, it must pass the ray back to the "winning" worker for shading. By distributing all of the shading assets (shaders, textures, and materials) to all workers, any worker would be capable of computing shading regardless of whether it was responsible for the terminating geometry.

Memory optimizations, such as object pooling, could exploit the transient nature of many fixed-size fat rays to reduce allocation overhead and heap fragmentation. In addition, the linear BVH node structures were intentionally padded to 64 bytes to match the cache line size on current CPUs. However, without an aligned STL allocator for the vector that stores this array of nodes, cache alignment is not guaranteed. Additionally, greater code optimization (e.g. use of SIMD instructions) could also improve performance.

Finally, one of FlexRender's strengths is that it decouples the ray tracing computations from shading with a feed-forward design that requires no replies. Because messages are both asynchronous and never return, this opens up the possibility for batching computation to specialized hardware coprocessors, such as GPUs or the upcoming Intel MIC cards.

REFERENCES

Afra, A. (2012). Interactive ray tracing of large models using voxel hierarchies. Computer Graphics Forum, 31(1):75–88.

Badouel, D., Bouatouch, K., and Priol, T. (1994). Distributing data and control for ray tracing in parallel. Computer Graphics and Applications, IEEE, 14(4):69–77.

Cignoni, P., Montani, C., Scopigno, R., and Rocchini, C. (1998). A general method for preserving attribute values on simplified meshes. In Proceedings of the conference on Visualization '98, VIS '98, pages 59–66, Los Alamitos, CA, USA. IEEE Computer Society Press.

Clark, J. H. (1976). Hierarchical geometric models for visible-surface algorithms. In Proceedings of the 3rd annual conference on Computer graphics and interactive techniques, SIGGRAPH '76, pages 267–267, New York, NY, USA. ACM.

Cohen, J., Olano, M., and Manocha, D. (1998). Appearance-preserving simplification. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, SIGGRAPH '98, pages 115–122, New York, NY, USA. ACM.

Dean, J. and Ghemawat, S. (2004). MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI '04, pages 10–10, Berkeley, CA, USA. USENIX Association.

DeMarle, D. E., Gribble, C. P., and Parker, S. G. (2004). Memory-savvy distributed interactive ray tracing. In Proceedings of the 5th Eurographics conference on Parallel Graphics and Visualization, EG PGV '04, pages 93–100, Aire-la-Ville, Switzerland. Eurographics Association.

Djeu, P., Hunt, W., Wang, R., Elhassan, I., Stoll, G., and Mark, W. R. (2011). Razor: An architecture for dynamic multiresolution ray tracing. ACM Trans. Graph., 30(5):115:1–115:26.

Garanzha, K., Bely, A., Premoze, S., and Galaktionov, V. (2011). Out-of-core GPU ray tracing of complex scenes. In ACM SIGGRAPH 2011 Talks, SIGGRAPH '11, pages 21:1–21:1, New York, NY, USA. ACM.

Hapala, M., Davidovic, T., Wald, I., Havran, V., and Slusallek, P. (2011). Efficient stack-less BVH traversal for ray tracing. In 27th Spring Conference on Computer Graphics, SCCG '11.

Kato, T. and Saito, J. (2002). "Kilauea": parallel global illumination renderer. In Proceedings of the Fourth Eurographics Workshop on Parallel Graphics and Visualization, EGPGV '02, pages 7–16, Aire-la-Ville, Switzerland. Eurographics Association.

Kontkanen, J., Tabellion, E., and Overbeck, R. S. (2011). Coherent out-of-core point-based global illumination. In Eurographics Symposium on Rendering.

Krishnamurthy, V. and Levoy, M. (1996). Fitting smooth surfaces to dense polygon meshes. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, SIGGRAPH '96, pages 313–324, New York, NY, USA. ACM.

Moon, B., Byun, Y., Kim, T.-J., Claudio, P., Kim, H.-S., Ban, Y.-J., Nam, S. W., and Yoon, S.-E. (2010). Cache-oblivious ray reordering. ACM Trans. Graph., 29(3):28:1–28:10.

Navratil, P. A., Fussell, D. S., Lin, C., and Childs, H. (2012). Dynamic scheduling for large-scale distributed-memory ray tracing. In Proceedings of Eurographics Symposium on Parallel Graphics and Visualization, pages 61–70.

Northam, L. and Smits, R. (2011). HORT: Hadoop online ray tracing with MapReduce. In ACM SIGGRAPH 2011 Posters, SIGGRAPH '11, pages 22:1–22:1, New York, NY, USA. ACM.

Pantaleoni, J., Fascione, L., Hill, M., and Aila, T. (2010). PantaRay: fast ray-traced occlusion caching of massive scenes. In ACM SIGGRAPH 2010 papers, SIGGRAPH '10, pages 37:1–37:10, New York, NY, USA. ACM.

Reinhard, E., Chalmers, A., and Jansen, F. W. (1999). Hybrid scheduling for parallel rendering using coherent ray tasks. In Proceedings Parallel Visualization and Graphics Symposium, pages 21–28.