dynamically scheduled region-based image compositing · 2016-06-02 · eurographics symposium on...

Eurographics Symposium on Parallel Graphics and Visualization (2016)W Bethel E Gobbetti (Editors)

Dynamically Scheduled Region-Based Image Compositing

AVPascal Grosset Aaron Knoll amp Charles Hansen

Scientific Computing and Imaging Institute University of Utah Salt Lake City UT USA

AbstractAlgorithms for sort-last parallel volume rendering on large distributed memory machines usually divide a datasetequally across all nodes for rendering Depending on the features that a user wants to see in a dataset all thenodes will rarely finish rendering at the same time Existing compositing algorithms do not often take this intoconsideration which can lead to significant delays when nodes that are compositing wait for other nodes that arestill rendering In this paper we present an image compositing algorithm that uses spatial and temporal awarenessto dynamically schedule the exchange of regions in an image and progressively composite images as they becomeavailable Running on the Edison supercomputer at NERSC we show that a scheduler-based algorithm withawareness of the spatial contribution from each rendering node can outperform traditional image compositingalgorithms

Categories and Subject Descriptors (according to ACM CCS) I31 [Computer Graphics] Hardware ArchitecturemdashParallel processing I32 [Computer Graphics] Graphics SystemsmdashDistributednetwork graphics

1 Introduction

Visualization is increasingly important in the scientific com-munity Several High Performance Computing (HPC) cen-ters such as the Texas Advanced Computing Center (TACC)and Livermore Computing Center (LC) now have clus-ters dedicated to visualization Most clusters in HPC cen-ters are usually distributed memory machines with hundredsor thousands of nodes each of which has a very powerfulCPU andor GPU with lots of memory connected through ahigh-speed network The most commonly used approach forparallel rendering on these systems is sort-last [MCEF94]In sort-last parallel rendering the data to be visualized isequally distributed among the nodes Each node loads its as-signed subset of the dataset that it renders to an image Dur-ing the compositing stage the images are exchanged andthe final image is gathered on the display node In this paperour focus is on the compositing stage of distributed volumerendering

Image compositing has two parts computation (blend-ing) and communication Many algorithms such as BinarySwap [MPHK93] and Radix-k [PGRlowast09] have been de-veloped for image compositing These algorithms try toevenly distribute the computation among the nodes How-ever as shown by Grosset et al [GPClowast15] image composit-ing algorithms should pay more attention to communica-

tion than to computation Nowadays the computing powerof nodes in a supercomputer greatly exceeds the communi-cation speed between nodes Trying to minimize communi-cation and overlapping communication with computation ismore important than focusing on evenly balancing the work-load In this paper we focus specifically on communicationand threads and auto-vectorization are used to fully benefitfrom the computational power of CPUs

The time each node takes to finish rendering its assignedregion of a dataset in sort-last parallel rendering is rarelythe same There are several reasons for this first it is rarefor datasets to have a uniform distribution of data Figure 1shows two commonly used test volume datasets that havenumerous empty regions after a transfer function has beenapplied to extract interesting features in each dataset Thenodes assigned to rendering these empty regions have muchless work to do and will finish early Second when us-ing perspective projection nodes closer to the camera pro-duce a larger image compared to nodes far from the cam-era Rendering a larger image takes more time than render-ing a smaller image Finally if the user zooms in on onespecific region of a dataset part of the dataset might fall out-side the viewing frustum and not need to be rendered More-over the difference in rendering speed is further increasedif lighting is used and normals need to be calculated and ifthe rendering takes place on a medium-sized cluster where

ccopy The Eurographics Association 2016

P Grosset Aaron Knoll amp C Hansen Dynamically Scheduled Region-Based Image Compositing

there are hundreds rather than thousands of nodes the timetaken to render a large image can be substantially greaterthan the time to render a small image If we do not wantthe uneven rendering to slow down compositing nodes thatare done rendering should exchange images only with nodesthat are also done rendering and not wait on nodes which arestill rendering In this paper we keep track of which nodesare done compositing and only schedule compositing whennodes have completed rendering

Figure 1 Two commonly used test datasets the Bonsaidataset on the left and Backpack dataset on the right

One of the common approaches for load balancing in dis-tributed volume rendering is to split and distribute the datasetbased on how long each region will take to render the ap-proach used by Marchesin et al [MMD06] and Fogal etal [FCSlowast10] rather than just diving the data equally usingfor example a k-d tree However arbitrarily assigning datato nodes may not be an effective strategy when consideringin situ visualization Data movement between nodes or writ-ing to disk is very costly in large scale simulations As men-tioned by Yu et al [YWGlowast10] for in situ visualization itis the simulation code that dictates the data partitioning anddistribution among nodes Thus in situ visualization uses thesame nodes for visualization as those generating the data inthe simulation and is best performed without requiring datamovement between nodes or disk In this paper we proposethat work is scheduled at the compositing stage and does notrequire data redistribution for balanced rendering Since theimage compositing only transfers sub-images our proposedtechnique would be easily integrated with existing in situ vi-sualization and analysis software

The main contribution of this paper is an image composit-ing algorithm that uses a scheduler with both spatial andtemporal awareness of the compositing process We start bydividing the final image into a number of regions r and createa depth-ordered list of nodes for each region Based on thedata loaded by each node and the properties of the camerathe contribution of each node to regions of the final imagecan be determined Nodes not contributing to a region canthen be removed from that regionrsquos list The scheduler alsoupdates the region list after each node is done rendering byeliminating nodes that rendered nothing for a region Thisprocess ensures that a node not contributing to a region is

never made to receive data for that region thus minimizingcommunication The algorithm then schedules the exchangeof images and ensures that no nodes wait for a node thatis still rendering if another option for compositing is avail-able Thus when the slowest node is done rendering most ofthe regions of the final image have already been compositedand there is minimal overhead to assemble the final imageThe algorithm uses one MPI rank per node and threads forCPU cores which Howison et al [HBC10] showed to be bet-ter than one MPI rank per core Auto-vectorization is alsoused to fully leverage the compute capabilities of modernCPUs and asynchronous MPI communication is also usedto overlap communication with computation We comparethis scheduling-based image compositing algorithm againstTOD-Tree on the Edison supercomputer at NERSC using abox and sphere artificial dataset and a combustion dataset

The paper is organized as follows Section 2 describes thecommonly used compositing algorithms and techniques thatare used to achieve load balance in distributed volume ren-dering Section 3 describes our algorithm and the implemen-tation details The results are presented in section 4 wherewe also discuss the results of strong scaling on three typesof datasets Finally the paper is wrapped up in section 5 withthe conclusion and future work

2 Previous Work

Many algorithms have been designed to tackle image com-positing in distributed volume rendering The simplest algo-rithm is serial direct send in which all the processes involvedin rendering send their rendered image to the display nodewhich then blends them together This approach results in amassive load imbalance and can be quite slow for large im-ages and many nodes Parallel direct send [Hsu93] [Neu94]improves on serial direct send by dividing the workloadamong all the processes Each process is responsible for onesection of the final image and gathers that section from allother processes In the gathering stage each process sendsits authoritative section to the display node

Tree-based algorithms have also been used for imagecompositing In binary tree compositing each node is rep-resented as a leaf of the tree The height h of the tree islog2n where n is the number of nodes in the tree In eachsubtree a child sends its data to its sibling for blending andbecomes inactive This exchange goes on for each level ofthe tree until the final image is at the root (display node)of the tree Binary Swap [MPHK93] improves on this ap-proach by keeping all nodes active until the gathering stageInitially each node is responsible for the whole image Ateach level of the tree nodes responsible for the same regionare paired in a subtree and exchange information so that eachis responsible for half of the image for which it was initiallyresponsible for This process continues until there is onlyone node responsible for each 1n section of the final im-age Then the display node gathers these sections from all



n nodes Binary Swap has been extended for non-powers of2 by Yu et al [YWM08] In Radix-k [PGRlowast09] instead ofgrouping nodes in pairs for a round the size of regions tobe grouped is determined by a vector k where k = k1k2 In each ki-sized group the nodes exchange information ina parallel direct send way so that each node in a ki-sizedgroup is responsible for 1ki of the final image All nodeswith the same authoritative section of the image are then col-lected into groups of size ki+1 which continues for i roundsfollowed by a gather stage in which each authoritative sec-tion is assembled on the display node These algorithmshave been implemented by Moreland et al [MKPH11] inICET [Mor11] along with some optimizations for communi-cation

To account for the much faster compute speed com-pared to communication speed in supercomputers Grossetet al [GPClowast15] developed the TOD-Tree algorithm whichfocuses on reducing and overlapping communication withcomputation TOD-Tree has a parallel direct send stage tobalance the workload followed by k-ary compositing to re-duce communication Also Howison et al [HBC12] showedthat parallel rendering is faster when one MPI rank is usedper node instead of per core Therefore we also use one MPIrank per node and we use threads and auto-vectorization onthe CPU

However although these algorithms are fast they do notake into account the contents of the image from each ren-dering process They all decide statically for which region acomputing process should be responsible and stick to that al-location A process then may be responsible for a region forwhich it does not have any initial content which needlesslyincreases communication However some algorithms takeinto account the image contents of a node The ScheduledLinear Image Compositing (SLIC) algorithm of Stompel etal [SMLlowast03] ensures that the region to which a process isassigned is the one to which it contributes The contributionto the final image from each process is computed based onthe data extents loaded by a process and the camera posi-tion Scan lines of the overlapping regions are assigned toprocesses contributing to them in an interleaving fashionAlso image regions that do not overlap with other imagesare directly sent to the display node without any blendingStrengert et al [SMWlowast04] used the SLIC algorithm for im-age compositing on GPU clusters However although SLIChas spatial awareness of the contribution of each renderingprocess it does not have any temporal awareness that is itdoes not know when a process will finish rendering and isready to participate in compositing

Load balancing approaches to distributed volume render-ing usually take rendering time into account when composit-ing Fang et al [FSZlowast10] use a pipeline approach in whichthey overlap rendering and compositing Systems that pro-vide a complete solution to rendering and compositing suchas Equalizer system [EMP09] and Chromium [HHNlowast02]

have knowledge of both compositing and rendering thatgives them more flexibility to balance the workload TheEqualizer framework uses the direct send technique of Eile-mann et al [EP07] that splits images into tiles to improve im-age compositing Other approaches such as that of Moloneyet al [MWMS07] use an estimate on the cost to ren-der a pixel to do dynamic load balancing using a sort-firstrendering approach and Muller et al [MSE07] Fogal etal [FCSlowast10] and Marchesin et al [MMD06] use the previ-ous rendering time in a time varying datatset to estimate thecost of rendering the current timestep Frey and Ertl [FE11]redundantly distribute blocks of volume data across the ren-dering nodes to allow for easier load balancing In this pa-per we do not try to move the data between nodes and esti-mate the rendering time Instead we communicate with therendering nodes to schedule compositing accordingly Beingable to move the data between nodes may help reduce therendering imbalance among nodes but in the case of in situvisualization data movement is often too costly and we haveto use the data decomposition dictated by the simulation Inthese situations the only place to deal with load imbalancewould be at the compositing stage

3 Methodology

It is rare for rendering on all the nodes of a distributed mem-ory machine to finish at the same time Improving composit-ing time therefore requires minimizing the time betweenwhen the slowest process finishes rendering and composit-ing is complete the orange region in figure 2 For that tohappen processes still rendering should not delay composit-ing

Figure 2 Rendering + Compositing timeline

One of the issues with compositing algorithms such asparallel Direct Send Binary Swap Radix-k and TOD-Treeis their lack of awareness of which processes have finishedrendering and which processes are still rendering whichsometimes delays image compositing as some processeswait for images from other processes that are still renderingFigure 3 shows an example of eight processes doing com-positing using Radix-k Let us assume that two rounds areneeded and vector k = 42 To get the correct final imagepartial images need to be blended in the correct depth order(front-to-back or back-to-front) So if processes 6 and 0 infigure 3 are still rendering while the remaining processes arecompositing Radix-k will have to wait for 6 and 0 to finishrendering and be stuck in round 1 of parallel direct send for



all regions The same delay would occur in Binary Swap andTOD-Tree if some nodes waiting to exchange images withnodes that are still rendering since they too lack temporalawareness

Figure 3 The first round of Radix-k for eight processes andfour regions The green rectangle shows the region for whicheach process is responsible and the blue region shows thedata from each process Process 6 and 0 have more data torender and will finish rendering after the other processes

The same set of processes can be represented as a graphas shown in figure 4 If we blend exclusively based on depthprocesses 4 1 7 and 5 can start compositing while waitingfor 6 and 0 to finish rendering Also since there are neverany cycles in the graph we will refer to it as a chain

This procedure however can still be improved upon If 6and 0 do not contain information relevant to the whole im-age they should not delay compositing for the whole imageIt is common for compositing algorithms to divide an imageinto regions and allocate each region to a process If for ex-ample four regions are used as shown in figure 3 processes6 and 0 do not contribute to the first and last regions andso they should not delay compositing for these regions ofthe image If we use a chain to represent each region pro-cess 6 and 0 will be omitted from the first and last chain Asthe number of processes increases to hundreds or even thou-sands the contribution of one process to the whole image de-creases Therefore stalling the whole compositing becauseof a few slow rendering processes can be avoided we needto stall a only few regions Having spatial awareness willhelp mitigate this issue Moreover spatial awareness willprevent the algorithm from making a process authoritativeon a region for which it has no data For example in fig-ure 3 process 2 is responsible for the last region but has no

Figure 4 Nodes sorted by depth in a chain The red nodesare still rendering while the green nodes are done rendering

data contributing to that region This increases communica-tion as process 2 has to transfer all its data to other regionsand needs to get all the data for its responsible region fromother processes

For our algorithm we divide the image into a set of r re-gions with a depth-sorted chain for each region To createthe chains for each region we can obtain information aboutthe data extents each process is loading using MPI Gatheror if a k-d tree is used to partition the data this informa-tion can be obtained programmatically for each region fromthe k-d tree Using the extents and camera information wecan compute the depth of each process and the position andarea contributed by each process in the final image For eachchain we also need to decide which processes will be re-sponsible for gathering information To try to ensure that dif-ferent nodes are used to collect information for each chainthe first collector node in the chain for region i is the ith nodein the chain The second is the (i+ r)th node If a chain hasfewer than r nodes the last node is made the collector nodefor that region The collector processes are marked with ablack circle inside as in figure 5 The number of regions inthis case is 4 The first chain chain 0 colored pink has onlythree nodes So the last node is set as the collector The sec-ond chain chain 1 colored cyan has seven nodes Thereforenode 1 and node 5 are set as collectors

Figure 5 Four chains one for each of the four regions (pur-ple blue yellow and gray) into which the final image is split

This approach will only work for depth-orderable de-composition Many simulations use block-structured AMRgrids which are depth-orderable If for example unstruc-tured grids are used concave regions could be generatedthrough domain decomposition where sorting by depth andthen compositing the various subdomains would result in in-correct images In this paper our focus is on block struc-tured grids with the block-structured decomposition alreadyimposed by the simulation

31 Algorithm

For our algorithm we have set aside one process that is notinvolved in compositing and rendering to act as a schedulerThe scheduler builds a chain for each region and the com-positing processes contact the scheduler to determine with



Algorithm 1 Initialize SchedulerCollect the depth and extents for each processSort the processes based on depthConstruct a chain based on depthfor each region do

Use the computed depth chain as the starting pointCompute and store the extents for that regionfor each process in the chain do

Compute the extents of the processif extents of process does not overlap thechainrsquos then

Delete the process from the chainAdjust the to and from neighbor for thedeleted process

if length of chain lt number of regions r thenSet the last process as a collector

elseSet every r process to be a collector

Create a buffer for final receiveLaunch asynchronous MPI receive for final image

which processes they should exchange images The chainfor each region is constructed as indicated in algorithm 1

Based on the depth information from each process adepth-sorted chain as shown in figure 4 is constructed thatis used as the initial chain for each region For each regionprocesses that do not contribute to that region are removedfrom the chain which creates spatial awareness for each re-gion and reduces the length of each chain In software eachchain is implemented using a hash map unordered_map inC++ so that access time is always O(1) and each node ofthe chain stores the neighbors to and from it The last step isto create a buffer to receive the composited image for eachregion This step ensures that when the final image regionsare sent to the display node they are not written to temporarybuffer but directly to the final image

The scheduler is then started and awaits communicationfrom the compositing processes Algorithm 2 shows the al-gorithm for the scheduler If the scheduler is receiving in-formation from a process for the first time it also receivesthe extents of the rendered image The chain for each regionis initialized based on the expected rendered extents fromeach process but depending on the transfer function someregions might not have been rendered for a process Basedon the rendered extents therefore some nodes are removedfrom region chains if they do not have any information forthat region If that process p was marked as a collector pro-cess for a specific region its neighbor is made a collectorprocess and the process p is deleted to minimize unneces-sary transfer of data to that process

Next the scheduler performs dynamic scheduling bydeciding which processes should communicate with each

Algorithm 2 Scheduler

while done doWait for communication from rendering processesif first communication from a process then

Receive rendered extents from the processfor each region do

Determine extents of the regionif extents of a process does not overlap thechainrsquos then

Remove the process from the chainAdjust neighbors to and from fordeleted process

Mark the process as active in the chains where theyexistsfor each active chain do

if only one process in chain thenMark process to send information todisplay nodeErase chain

elseFind neighbor for incoming nodeif neighbor found then

Determine if sender or receiverMark receiver as busyDelete sender from chainSave sender and receiver information

for each active chain doif size is 1 AND process is ready then

Process will send data to display node

for each process marked for communication doSend information

if all chains are empty thenExit Scheduler

other In each region for which the received process is ac-tive the received node in that chain is marked as ready andthe chain is checked to see if there is any neighbor processmarked as ready If a valid neighbor is found and it is a col-lector process the non-collector will send its data to the col-lector process Otherwise the node having the smaller imagewill send data to the node with the larger image to mini-mize communication time In each case the sender node ismarked for deletion and the receiver is marked as busy Thelast step of the algorithm is to check if any chains are nowempty or have only one remaining ready process If there isonly one ready process it is made to send its information tothe display node and the chain is erased The next step is tosend all the information at once to each node that has workto do A process might need to send data to a node x for aspecific region and receive data from the same node x foranother region All the communication to a node from thescheduler is done in one step



Algorithm 3 CompositorGet the extents of the image rendered by the processCount the number of active regions covered by theimage (countActiveRegions)while done do

if first time thenSend extents to the scheduler

elseTell scheduler that it is ready

Wait for the scheduler to respondfor each process to communicate with do

if Only process in chain thenSend data to display nodecountActiveRegions -= 1

elseif Send then

Async send to neighborcountActiveRegions -= 1

elseReceive imageif last round then

Create opaque imageCreate alpha bufferBlend current image with thebackgroundBlend in opaque bufferSend to display nodecountActiveRegions -= 1

elseBlend with image on node

if no active regions left thenExit loop

Each compositing node runs the Compositor algorithmshown in algorithm 3 The first time a process communicateswith the scheduler once it is done rendering it sends its ex-tents to the scheduler As mentioned before based on thetransfer function a process will not always render all data ithas loaded and as spatial awareness is a key component ofour algorithm we want to update the region chains to reflectthe state of the rendering Also each process will receive inone message all the other processes with which it needs tocommunicate to keep communication in the system to a min-imum Information for each communication will contain theneighbor with which to communicate the region blendingdirection and MPI tag Also each send from a process is inthe form of an asynchronous send to maximize overlappingcommunication with computation

32 Choosing number of regions

For the scaling run we have set the number of regions tobe 16 This number was determined after a series of initial

test runs where we experimented with 1 2 4 8 16 and32 regions for 4096 x 4096 sized images When few re-gions are used a slow node impacts few regions but sinceeach region occupies a substantial portion of the image com-positing ends up being slow For example if we use onlytwo regions for an 8K x 8K image and there is one slownode in the upper region half of the compositing is delayedby one node If too many regions are used one slow nodewill impact many small regions but since there are many re-gions the overall impact of a slow region will be less How-ever many regions will result in lots of communication withmany exchanges which we want to avoid Sixteen regionsprovided a good balance between avoiding too much com-munication and one node having too much of an impact onthe whole compositing process

4 Testing and Results

The test platform used is the Edison Cray XC30 supercom-puter at NERSC Edison uses the Cray Aries high-speed in-terconnect with Dragonfly topology that has an MPI band-width of about 8 GBsec and latency in the range of 025to 37 usec It has 5576 compute nodes each of which hastwo 24 GHz 12-core Ivy Bridge processors with 64 GB ofmemory per node We scaled up to 2048 nodes of the 5576nodes of Edison

The test datasets that we used are an artificial box and ar-tificial sphere test dataset and a combustion dataset shownin figure 6 The combustion dataset has 106457688 cellsstored as doubles and is split into 5996 blocks for a totalsize of 09 GB It is part of combustion dataset that has 30scalar values per timestep and about 500 timesteps Fuel isinjected into the combustion chamber through a number oftubes located at the bottom of the dataset Combustion startsabove these tubes and rises to the top of the combustionchamber hitting the ceiling and the walls When visualiz-ing this dataset much more work has to be done in the upperregions of the dataset thereby creating an imbalance in therendering workload The artificial datasets are simpler eachrendering process is assigned one block of uniform scalardata per node The box dataset is similar to what was used byMoreland et al [MWP01] and we also introduced a spheredataset whose diameter is equal to the length of the cube

Figure 6 The datasets box (left) sphere (middle) andcombustion (right)



The algorithm we compared against is the TOD-Tree al-gorithm of Grosset et al [GPClowast15] Grosset et al haveshown that TOD-Tree generally performs better than Radix-k and both TOD-Tree and our algorithm uses threads andauto-vectorization compared to the ICET library [Mor11]which does not use threads

Figure 7 Scaling of the combustion dataset on Edison -showing rendering and compositing

41 Scheduler Cost

Building and running the scheduler is fast the time it took toconstruct the region chains and using MPI Gather to collectthe depth and extents information from each node for 2048nodes was measured to be on average 05 millisecond Thetime it took the scheduler to respond to a compositing nodeif neighbors were available was on average 02 millisecondWith a latency of at most 37 millionth of a second the cost

Figure 8 Scaling of the combustion dataset on Edison -showing compositing only



of communicating with the scheduler is minimal comparedto the cost of exchanging data among nodes

42 Scaling Studies

For each of the three datasets and for each of the three im-age sizes used (2048 x 2048 4096 x 4096 and 8192 x8192 pixels) we performed 10 runs after an initial warm-uprun and the results are the average of these runs after someoutliers have been eliminated

Figure 7 shows the total time it takes to render and com-posite the combustion dataset for up to 2048 nodes on Edi-son As expected as the number of nodes increases the totaltime it takes to render the dataset decreases The focus ofthis paper is image compositing and so for the remainder ofthis section we focus on compositing

Depending on the amount of rendering work each nodehas to do compositing will start at different times on eachnode The compositing time that needs to be minimized isthe time interval between when the slowest rendering jobfinishes and the final image is ready on the display nodethe orange region in figure 2 Any compositing done in theinterval of time between the fastest rendering node and theslowest rendering node does not slow down the entire com-positing process Therefore the compositing time that wemeasured and plotted in figures 8 and 9 is the time intervalbetween the slowest rendering job and the image being readyon the display node

Figure 8 shows the compositing time for the combustiondataset for 2048 x 2048 (2Kx2K) 4096 x 4096 (4Kx4K)and 8192 x 8192 (8Kx8K) sized images for TOD-Treeand our Dynamically Scheduled Region-Based (DSRB) al-gorithm When there are few nodes each node renders alarger region and so influences many regions of the chainWe therefore do not gain much from overlapping render-ing with compositing since compositing in most regions isstalled by waiting for other nodes As the number of nodesincreases and the contribution of each node to regions de-creases the overlapping of compositing and rendering al-lows us to perform better than the TOD-Tree which doesnot have any spatial or temporal awareness of the image be-ing rendered from each node We also see that there is morevariation for the 2K x 2K image compared to the 4K x 4Kimage and 8K x 8K image since it is more communicationbound The 8K x 8K image has the least variation as it ismore computation bound

The Dynamically Scheduled Region-Based compositingalgorithm also performs faster than TOD-Tree on the artifi-cial dataset The difference in compositing times betweenthe sphere and box is minimal in most cases Howeversince there is less data for the sphere dataset it takes lesstime to render compared to the box dataset and so has lessfree compositing time compared to the box dataset This is

translated in the chains by the box having a faster composit-ing time since what we are showing as compositing timeis the time interval between the slowest rendering and finalimage being ready For TOD-Tree the sphere is generallyfaster since there is overall less data to process As with thecombustion dataset compositing gets faster as the numberof nodes increases Here again when more nodes are usedeach node has a smaller share of the entire image and a slownode impacts fewer regions resulting in faster compositing

Figure 9 Scaling of the artificial box and sphere datasetson Edison - showing compositing only



5 Conclusion and Future Work

In this paper we have introduced an image compositingalgorithm that has both spatial and temporal awareness ofcompositing Spatial awareness ensures that no compositingprocesses will ever receive data for a region to which it doesnot contribute thereby minimizing communication Tempo-ral awareness ensures that processes do not try to commu-nicate with processes that are still rendering thereby min-imizing delays Combining spatial and temporal awarenessstreamlines compositing by allowing several regions of animage to be fully composited fairly quickly Compositing isdelayed only for data-intensive regions of an image Thisgives us a substantial gain compared to TOD-Tree whichlacks spatial and temporal awareness The DSRB algorithmcan also beneficial in situ visualization scenarios where thedomain decomposition is dictated by the simulation

As future work we would like to run the scheduler asa thread on one of the compositing nodes instead of on aseparate node Also we would like to try to find a way toestimate the time it takes to render on each node and seehow this approach can be used to reduce communicationMore complex visualization workloads involving polygonsglyphs and mixed non-volumetric data may require a moresophisticated scheduler perhaps employing directed graphsinstead of chains Also we would like to run the DSRB al-gorithm on a GPU-accelerated supercomputer where the ren-dering times are likely to be shorter than on a CPU-only ac-celerated supercomputer

One of the limitations of the DSRB algorithm is where theload is perfectly balanced Having the extra communicationwith the scheduler will decrease the performance of DSRBalgorithm Also as the number of nodes we use to renderincreases there will be a point at which each node will fin-ish rendering even with lighting and imbalance in workloadnearly at the same time and the differences in renderingcompletion time will become negligible We would like torun experiments on large supercomputers to determine whenthis will happen for various data and image sizes This willhelp establish the architecture dependent crossover point atwhich we should switch over to algorithms such as TOD-Tree and Radix-k which minimize communication Anotherlimitation is the depth-orderable requirement described inSection 3 While DSRB performs well for block-structureddecompositions including block-structured AMR grids un-structured grids are not guaranteed to be depth-orderable dueto the potential of concave regions This could be overcomewith some limited data replication and would be interestingfuture research

6 Acknowledgments

The authors would like to thank the National Energy Re-search Scientific Computing Center (NERSC) for providingaccess to the Edison supercomputer and the support staff atNERSC for helping resolve compilation issues

This research was partially supported by the Departmentof Energy National Nuclear Security Administration un-der Award Number(s) DE-NA0002375 the DOE SciDACInstitute of Scalable Data Management Analysis and Visual-ization DOE DE-SC0007446 NASA NSSC-NNX16AB93Aand NSF ACI-1339881 NSF IIS-1162013

References

[EMP09] EILEMANN S MAKHINYA M PAJAROLA R Equal-izer A Scalable Parallel Rendering Framework IEEE Transac-tions on Visualization and Computer Graphics 15 3 (May 2009)436ndash452 URL httpdxdoiorg101109TVCG2008104 doi101109TVCG2008104 3

[EP07] EILEMANN S PAJAROLA R Direct send compositingfor parallel sort-last rendering In Proceedings of the 7thEurographics Conference on Parallel Graphics and Visualiza-tion (Aire-la-Ville Switzerland Switzerland 2007) EGPGVrsquo07 Eurographics Association pp 29ndash36 URL httpdxdoiorg102312EGPGVEGPGV07029-036doi102312EGPGVEGPGV07029-036 3

[FCSlowast10] FOGAL T CHILDS H SHANKAR S KRUumlGER JBERGERON R D HATCHER P Large data visualization on dis-tributed memory multi-gpu clusters In Proceedings of the Con-ference on High Performance Graphics (Aire-la-Ville Switzer-land Switzerland 2010) HPG rsquo10 Eurographics Associationpp 57ndash66 URL httpdlacmorgcitationcfmid=19214791921489 2 3

[FE11] FREY S ERTL T Load Balancing Utilizing Data Redun-dancy in Distributed Volume Rendering In Eurographics Sym-posium on Parallel Graphics and Visualization (2011) KuhlenT Pajarola R Zhou K (Eds) The Eurographics Associationdoi102312EGPGVEGPGV11051-060 3

[FSZlowast10] FANG W SUN G ZHENG P HE T CHEN GNetwork and Parallel Computing IFIP International Confer-ence NPC 2010 Zhengzhou China September 13-15 2010Proceedings Springer Berlin Heidelberg Berlin Heidelberg2010 ch Efficient Pipelining Parallel Methods for Image Com-positing in Sort-Last Rendering pp 289ndash298 URL httpdxdoiorg101007978-3-642-15672-4_25doi101007978-3-642-15672-4_25 3

[GPClowast15] GROSSET A V P PRASAD M CHRISTENSEN CKNOLL A HANSEN C Tod-tree Task-overlapped direct sendtree image compositing for hybrid mpi parallelism In Proceed-ings of the 15th Eurographics Symposium on Parallel Graph-ics and Visualization (Aire-la-Ville Switzerland Switzerland2015) PGV rsquo15 Eurographics Association pp 67ndash76 URLhttpdxdoiorg102312pgv20151157 doi102312pgv20151157 1 3 7

[HBC10] HOWISON M BETHEL E W CHILDS HMPI-hybrid Parallelism for Volume Rendering on LargeMulti-core Systems In Proceedings of the 10th Euro-graphics Conference on Parallel Graphics and Visualiza-tion (Aire-la-Ville Switzerland Switzerland 2010) EGPGVrsquo10 Eurographics Association pp 1ndash10 URL httpdxdoiorg102312EGPGVEGPGV10001-010doi102312EGPGVEGPGV10001-010 2

[HBC12] HOWISON M BETHEL E CHILDS H Hybrid Paral-lelism for Volume Rendering on Large- Multi- and Many-CoreSystems Visualization and Computer Graphics IEEE Trans-actions on 18 1 (Jan 2012) 17ndash29 doi101109TVCG201124 3



[HHNlowast02] HUMPHREYS G HOUSTON M NG R FRANK RAHERN S KIRCHNER P D KLOSOWSKI J T ChromiumA Stream-processing Framework for Interactive Rendering onClusters ACM Trans Graph 21 3 (July 2002) 693ndash702 URL httpdoiacmorg101145566654566639 doi101145566654566639 3

[Hsu93] HSU W M Segmented Ray Casting for Data Paral-lel Volume Rendering In Proceedings of the 1993 Symposiumon Parallel Rendering (New York NY USA 1993) PRS rsquo93ACM pp 7ndash14 URL httpdoiacmorg101145166181166182 doi101145166181166182 2

[MCEF94] MOLNAR S COX M ELLSWORTH D FUCHS HA sorting classification of parallel rendering Computer Graphicsand Applications IEEE 14 4 (1994) 23ndash32 doi10110938291528 1

[MKPH11] MORELAND K KENDALL W PETERKA THUANG J An Image Compositing Solution at Scale In Pro-ceedings of 2011 International Conference for High PerformanceComputing Networking Storage and Analysis (New York NYUSA 2011) SC rsquo11 ACM pp 251ndash2510 URL httpdoiacmorg10114520633842063417 doi10114520633842063417 3

[MMD06] MARCHESIN S MONGENET C DISCHLER J-MDynamic Load Balancing for Parallel Volume Rendering In Eu-rographics Symposium on Parallel Graphics and Visualization(2006) Heirich A Raffin B dos Santos L P (Eds) The Eu-rographics Association doi102312EGPGVEGPGV06043-050 2 3

[Mor11] MORELAND K IceT Usersrsquo Guide and ReferenceTech rep Sandia National Lab January 2011 3 7

[MPHK93] MA K-L PAINTER J HANSEN C KROGH M Adata distributed parallel algorithm for ray-traced volume render-ing In Parallel Rendering Symposium 1993 (1993) pp 15ndash22105 doi101109PRS1993586080 1 2

[MSE07] MUumlLLER C STRENGERT M ERTL T Adaptive loadbalancing for raycasting of non-uniformly bricked volumes Par-allel Comput 33 6 (June 2007) 406ndash419 URL httpdxdoiorg101016jparco200612002 doi101016jparco200612002 3

[MWMS07] MOLONEY B WEISKOPF D MOumlLLER TSTRENGERT M Scalable sort-first parallel direct volumerendering with dynamic load balancing In Proceedings of the7th Eurographics Conference on Parallel Graphics and Visual-ization (Aire-la-Ville Switzerland Switzerland 2007) EGPGVrsquo07 Eurographics Association pp 45ndash52 URL httpdxdoiorg102312EGPGVEGPGV07045-052doi102312EGPGVEGPGV07045-052 3

[MWP01] MORELAND K WYLIE B N PAVLAKOS C JSort-last parallel rendering for viewing extremely large datasets on tile displays In IEEE Symposium on Paralleland Large-Data Visualization and Graphics (2001) BreenD E Heirich A Koning A H J (Eds) IEEE pp 85ndash92 URL httpdblpuni-trierdedbconfpvgpvg2001htmlMorelandWP01 6

[Neu94] NEUMANN U Communication Costs for ParallelVolume-Rendering Algorithms IEEE Comput Graph Appl14 4 (July 1994) 49ndash58 URL httpdxdoiorg10110938291531 doi10110938291531 2

[PGRlowast09] PETERKA T GOODELL D ROSS R SHEN H-WTHAKUR R A Configurable Algorithm for Parallel Image-compositing Applications In Proceedings of the Conference onHigh Performance Computing Networking Storage and Anal-ysis (New York NY USA 2009) SC rsquo09 ACM pp 41ndash

410 URL httpdoiacmorg10114516540591654064 doi10114516540591654064 1 3

[SMLlowast03] STOMPEL A MA K-L LUM E B AHRENSJ PATCHETT J Slic Scheduled linear image compositingfor parallel volume rendering In Proceedings of the 2003IEEE Symposium on Parallel and Large-Data Visualizationand Graphics (Washington DC USA 2003) PVG rsquo03 IEEEComputer Society pp 6ndash URL httpdxdoiorg101109PVGS20031249040 doi101109PVGS20031249040 3

[SMWlowast04] STRENGERT M MAGALLAtildeSN M WEISKOPF DGUTHE S ERTL T Hierarchical Visualization and Com-pression of Large Volume Datasets Using GPU Clusters InEurographics Workshop on Parallel Graphics and Visualiza-tion (2004) Bartz D Raffin B Shen H-W (Eds) The Eu-rographics Association doi102312EGPGVEGPGV04041-048 3

[YWGlowast10] YU H WANG C GROUT R W CHEN J H MAK-L In Situ Visualization for Large-Scale Combustion Sim-ulations IEEE Comput Graph Appl 30 3 (May 2010) 45ndash57 URL httpdxdoiorg101109MCG201055 doi101109MCG201055 2

[YWM08] YU H WANG C MA K-L Massively Parallel Vol-ume Rendering Using 2-3 Swap Image Compositing In Pro-ceedings of the 2008 ACMIEEE Conference on Supercomput-ing (Piscataway NJ USA 2008) SC rsquo08 IEEE Press pp 481ndash4811 URL httpdlacmorgcitationcfmid=14133701413419 3



there are hundreds rather than thousands of nodes the timetaken to render a large image can be substantially greaterthan the time to render a small image If we do not wantthe uneven rendering to slow down compositing nodes thatare done rendering should exchange images only with nodesthat are also done rendering and not wait on nodes which arestill rendering In this paper we keep track of which nodesare done compositing and only schedule compositing whennodes have completed rendering

Figure 1 Two commonly used test datasets the Bonsaidataset on the left and Backpack dataset on the right

One of the common approaches for load balancing in dis-tributed volume rendering is to split and distribute the datasetbased on how long each region will take to render the ap-proach used by Marchesin et al [MMD06] and Fogal etal [FCSlowast10] rather than just diving the data equally usingfor example a k-d tree However arbitrarily assigning datato nodes may not be an effective strategy when consideringin situ visualization Data movement between nodes or writ-ing to disk is very costly in large scale simulations As men-tioned by Yu et al [YWGlowast10] for in situ visualization itis the simulation code that dictates the data partitioning anddistribution among nodes Thus in situ visualization uses thesame nodes for visualization as those generating the data inthe simulation and is best performed without requiring datamovement between nodes or disk In this paper we proposethat work is scheduled at the compositing stage and does notrequire data redistribution for balanced rendering Since theimage compositing only transfers sub-images our proposedtechnique would be easily integrated with existing in situ vi-sualization and analysis software

The main contribution of this paper is an image composit-ing algorithm that uses a scheduler with both spatial andtemporal awareness of the compositing process We start bydividing the final image into a number of regions r and createa depth-ordered list of nodes for each region Based on thedata loaded by each node and the properties of the camerathe contribution of each node to regions of the final imagecan be determined Nodes not contributing to a region canthen be removed from that regionrsquos list The scheduler alsoupdates the region list after each node is done rendering byeliminating nodes that rendered nothing for a region Thisprocess ensures that a node not contributing to a region is

never made to receive data for that region thus minimizingcommunication The algorithm then schedules the exchangeof images and ensures that no nodes wait for a node thatis still rendering if another option for compositing is avail-able Thus when the slowest node is done rendering most ofthe regions of the final image have already been compositedand there is minimal overhead to assemble the final imageThe algorithm uses one MPI rank per node and threads forCPU cores which Howison et al [HBC10] showed to be bet-ter than one MPI rank per core Auto-vectorization is alsoused to fully leverage the compute capabilities of modernCPUs and asynchronous MPI communication is also usedto overlap communication with computation We comparethis scheduling-based image compositing algorithm againstTOD-Tree on the Edison supercomputer at NERSC using abox and sphere artificial dataset and a combustion dataset

The paper is organized as follows Section 2 describes thecommonly used compositing algorithms and techniques thatare used to achieve load balance in distributed volume ren-dering Section 3 describes our algorithm and the implemen-tation details The results are presented in section 4 wherewe also discuss the results of strong scaling on three typesof datasets Finally the paper is wrapped up in section 5 withthe conclusion and future work

2 Previous Work

Many algorithms have been designed to tackle image com-positing in distributed volume rendering The simplest algo-rithm is serial direct send in which all the processes involvedin rendering send their rendered image to the display nodewhich then blends them together This approach results in amassive load imbalance and can be quite slow for large im-ages and many nodes Parallel direct send [Hsu93] [Neu94]improves on serial direct send by dividing the workloadamong all the processes Each process is responsible for onesection of the final image and gathers that section from allother processes In the gathering stage each process sendsits authoritative section to the display node

Tree-based algorithms have also been used for imagecompositing In binary tree compositing each node is rep-resented as a leaf of the tree The height h of the tree islog2n where n is the number of nodes in the tree In eachsubtree a child sends its data to its sibling for blending andbecomes inactive This exchange goes on for each level ofthe tree until the final image is at the root (display node)of the tree Binary Swap [MPHK93] improves on this ap-proach by keeping all nodes active until the gathering stageInitially each node is responsible for the whole image Ateach level of the tree nodes responsible for the same regionare paired in a subtree and exchange information so that eachis responsible for half of the image for which it was initiallyresponsible for This process continues until there is onlyone node responsible for each 1n section of the final im-age Then the display node gathers these sections from all








3 Methodology















31 Algorithm




































elseif Send then


















41 Scheduler Cost






42 Scaling Studies














6 Acknowledgments



References



































3 Methodology















31 Algorithm




































elseif Send then


















41 Scheduler Cost






42 Scaling Studies














6 Acknowledgments



References







































31 Algorithm




































elseif Send then


















41 Scheduler Cost






42 Scaling Studies














6 Acknowledgments



References






























































elseif Send then


















41 Scheduler Cost






42 Scaling Studies














6 Acknowledgments



References
































41 Scheduler Cost






42 Scaling Studies














6 Acknowledgments



References































42 Scaling Studies














6 Acknowledgments



References


































6 Acknowledgments



References





























dynamically scheduled region-based image compositing · 2016-06-02 · eurographics symposium on...

Documents