
Volume 11 (1992), number 4, pp. 213-220

Performance of Space Subdivision Techniques in Ray Tracing

M. D. J. McNeill, B. C. Shah, M.-P. Hebert, P. F. Lister and R. L. Grimsdale

VLSI and Computer Graphics Research Group, School of Engineering, University of Sussex, Brighton BN1 9QT, UK

Abstract Whilst providing images of excellent quality, ray tracing is a computationally intensive task. The first part of this paper compares the speed-up achieved in ray tracing using various space subdivision algorithms and discusses the implications of implementing the algorithms on parallel processing systems. The second part addresses the problem of building the data structure within the rendering process, a situation which occurs when the rendering process is parallelised and dynamic scenes are rendered. Greater performance can be achieved with dynamic structure building compared to creation of the structure prior to rendering. The dynamic building algorithm proposed reduces the building time and storage cost of space subdivision structures, and decreases the data structure creation-render cycle time, thus enhancing image parallelism performance.

1. Introduction

Ray tracing [19] is one of the most photorealistic methods of image synthesis. In order to simulate geometric optics, rays of light are traced backward from the observer towards the scene; further rays are cast from the visible objects to the light sources and in the directions determined by reflection and refraction. In naive implementations of ray tracing, the visible object is determined by testing every object for intersection with each ray; this is so computationally expensive that only very simple scenes can be rendered. Since most computations are performed while determining the closest intersected object [19], speeding up the object search has been a subject of intensive research [8]. Space subdivision techniques, the most popular approach to fast ray tracing, are reviewed in Section 2. Although the processing time is reduced dramatically, parallelisation of the ray tracing process is still necessary to reach interactive rates. Parallel algorithms are surveyed in Section 3. Section 4 discusses the traversal costs and the storage requirements for the various space subdivision techniques. Traditionally the data structure is built prior to rendering, which is acceptable when rendering static scenes. However, in a parallel system, where rendering times are significantly reduced, this data structure creation time becomes large relative to the rendering time. Additionally, when tracing dynamic scenes, the data structure creation and initialisation become part of the rendering cycle, since the data structure needs to be re-built when an object moves. An algorithm for dynamic building of the data structure is suggested in Section 5, which can be used to enhance the performance of the ray tracer by reducing the overall rendering time and the storage costs of the data structure.

2. Space Subdivision Techniques

The expense of naive ray tracing results from the testing of every object for intersection with every ray. Space subdivision techniques [8] subdivide the scene into small cells pointing to lists of objects. Identifying the cells intersected by a ray determines the objects which lie near the ray path. It is these objects which are more likely to be intersected by the ray. Grids [5], octrees [7] and binary space partition trees (BSP trees) [13] are traditional space subdivision techniques.

The scene is first bounded by a rectangular parallelepiped. In a uniform grid the bounded scene is subdivided into equally sized cells. An octree is built recursively by subdividing the bounding volume into eight equally sized parallelepipeds, which are then further subdivided into eight smaller volumes if they contain too many objects. The subdivision of a cell stops when it contains a sufficiently small number of objects or if the cell size becomes too small. Note that because we use the HERO algorithm for octree traversal [1], neither the scene bounding volume nor the octants have to be cubes.
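As a minimal sketch of the recursive octree construction just described (not the authors' implementation), the following Python fragment subdivides a bounding box until a cell holds few enough objects. It assumes that objects are represented by their axis-aligned bounding boxes and that the minimum cell size is modelled by a depth limit; the names `Box`, `Node`, `build_octree`, `count_leaves`, `max_objects` and `max_depth` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def overlaps(self, other):
        # Two axis-aligned boxes overlap if their extents overlap on every axis.
        return all(self.lo[i] <= other.hi[i] and other.lo[i] <= self.hi[i]
                   for i in range(3))

    def octants(self):
        # Split at the midpoint of each axis; the box need not be a cube.
        mid = tuple((self.lo[i] + self.hi[i]) / 2 for i in range(3))
        result = []
        for k in range(8):
            lo = tuple(mid[i] if (k >> i) & 1 else self.lo[i] for i in range(3))
            hi = tuple(self.hi[i] if (k >> i) & 1 else mid[i] for i in range(3))
            result.append(Box(lo, hi))
        return result


@dataclass
class Node:
    box: Box
    objects: list = field(default_factory=list)   # object boxes held by a leaf
    children: list = field(default_factory=list)  # empty for a leaf node


def build_octree(box, objects, max_objects=4, max_depth=8, depth=1):
    node = Node(box, objects)
    # Stop when the cell holds few enough objects or is as small as allowed.
    if len(objects) <= max_objects or depth >= max_depth:
        return node
    node.children = [
        build_octree(octant, [o for o in objects if o.overlaps(octant)],
                     max_objects, max_depth, depth + 1)
        for octant in box.octants()
    ]
    node.objects = []  # branch nodes keep no object list of their own
    return node


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)
```

An object that straddles a cell boundary is referenced by every leaf it overlaps, which is why the leaf lists are built by an overlap test rather than by partitioning the object set.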


Figure 1. Distribution of processing time among ray tracing tasks.

The BSP tree is also built recursively but a cell is subdivided into two instead of eight. The slicing plane is successively chosen to be perpendicular to each of the coordinate axes. The grid can be represented by an array; the octree and the BSP tree can be represented by trees whose leaf nodes point to object lists.

For a scene modelled with 10,000 polygons, the number of objects intersected by a ray is typically 10 [11]. Considerable gain in performance can therefore be obtained from adopting space subdivision techniques. Nevertheless, their use introduces new costs such as the identification of intersected cells and the building of the structure. These costs are negligible compared to the improvement obtained through using space subdivision. However, they are significant if compared with the total cost of rendering. For example, the distribution of the processing time (on a general purpose workstation) when ray tracing the Utah teapot, tessellated with 9,120 triangles and structured with an octree, is indicated in Figure 1. Although this distribution varies with each type of data structure and each scene, the cost of identifying intersected cells is always significant enough that further performance can be gained by optimising the traversal of the data structure. The performance of the traversal algorithms associated with grids, octrees and BSP trees is discussed in Section 4.1.

3. Parallel Ray Tracing

Despite the improvement obtained with space subdivision techniques, ray tracing a 1024 x 1024 image can still require about 900 million floating-point operations. Parallelism is thus the only current means to achieve interactivity. Up to 1990, three approaches have been considered:

Pipelining. The different processes of ray tracing are distributed among processors. For example, in the LINKS-1 pipeline [15], one processor searches for all objects intersected by a ray; a second processor determines the nearest intersected object and a third processor computes the intensity of the ray. This parallelism exhibits poor load balancing.

Pixel parallelism. As pixel computations are independent, they can be computed by different processors. The intrinsic load balancing and the low overheads of this parallel algorithm result in a very efficient use of computing power. However, each processor needs to access the whole database. Duplicating the database for each processor wastes large amounts of memory, while sharing memory leads to serious delays in fetching data [4].

Space parallelism. The scene is split into several volumes which are distributed among processors. A ray is sent from one processor to another as it moves from region to region. This parallelisation uses memory efficiently. As the number of processors increases, the computation-communication ratio decreases, and results show that the efficiency of such systems drops for a number of processors exceeding ten or twenty [12].

More recent research [16, 9, 3, 14] shows that a slight modification to pixel parallelism allows a single database to be shared by several processors without contention. If processors compute a packet of neighbouring pixels instead of single pixels, image coherence leads to a high locality of reference. Thus a cache can overcome some of the problems associated with pixel parallelism. However, this new image parallelism can degrade if the shared database is accessed too frequently. As space subdivision techniques select the objects which lie near the ray path, it can be seen that the same objects are likely to be required when processing two neighbouring pixels. Similarly, there is a high probability that the same cells will be accessed when computing neighbouring pixel intensities.


Nevertheless, if the data structure is too cumbersome (cells are small and very few objects are tested for intersection), then many cells together with their objects will be accessed and may overwrite data which will be requested while processing the next pixel. Cache efficiency can then decrease, leading to database contention. If, however, the cells are larger, many more objects will be checked for intersection and the performance also degrades. Therefore an important aspect affecting the performance of space subdivision techniques in the image parallelism scheme is the amount of memory required for storing an efficient data structure. Section 4.2 examines the memory requirements of grids, octrees and BSP trees.
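The packet-of-pixels idea behind image parallelism can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the square packet shape, the packet size of 8 and the static round-robin assignment are choices made here for brevity, and the function names `pixel_packets` and `assign_packets` are hypothetical; the systems cited above may schedule work quite differently.

```python
def pixel_packets(width, height, packet=8):
    """Yield (x0, y0, x1, y1) tiles of neighbouring pixels covering the image."""
    for y0 in range(0, height, packet):
        for x0 in range(0, width, packet):
            # Clamp the last tile in each row and column to the image edge.
            yield x0, y0, min(x0 + packet, width), min(y0 + packet, height)


def assign_packets(width, height, n_processors, packet=8):
    """Distribute packets over processors in round-robin order (an assumption)."""
    work = [[] for _ in range(n_processors)]
    for i, tile in enumerate(pixel_packets(width, height, packet)):
        work[i % n_processors].append(tile)
    return work
```

Because every pixel in a packet is traced by the same processor, consecutive rays tend to visit the same cells and objects, which is the locality of reference that a per-processor cache can exploit.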

4. Comparing the Performance of Space Subdivision Techniques

4.1. Ray-Structure Intersection

Uniform subdivision allows fast cell-to-cell propagation of the ray. However, objects are poorly distributed among grid cells. Thus, a ray is likely to intersect either more cells or more objects than when using an adaptive structure like an octree or a BSP tree. In order to make a fair comparison of the relative computational cost of traversing a grid, an octree or a BSP tree, an efficient traversing algorithm is associated with each of these structures. For the grid, we consider an incremental technique similar to the one developed by Amanatides and Woo [2]. Only two additions and two comparisons are used to determine the next hit cell. For the octree, we consider the HERO algorithm [1] which requires on average three divisions and additions, comparisons and one XOR mask to determine the next intersected leaf node [11]. For the BSP tree, we consider a traditional algorithm based on Glassner's approach [7] which requires six divisions, seven additions and five comparisons in order to locate the next hit leaf node. All these costs can be computed per grid cell length, by using the Samet octree model [17] and Devillers' analysis [6]. If these results are extended to the BSP tree, we obtain an upper limit on the cost for the traversal of the BSP tree. For the purpose of the analysis, if the width of a grid cell is one unit and equal to the smallest octree node at a depth d, then the width of the whole grid structure is $2^{d-1}$ units (assuming the root node of the octree is at a depth of one). The cost of traversing the space subdivision structure, per unit of length, is on average:

Considering that d is usually larger than ten [11], intersecting tree structures is computationally less expensive than intersecting grids on most computers. The main advantage of incremental techniques is that they can be performed on simple hardware. This could lead to a fast ray tracing machine for scenes modelled with 3D cells. However, for complex scenes the repeated cycle of structure traversal and object intersection rules out the possibility of building simple and efficient hardware for grid traversal.
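The incremental grid traversal referred to in Section 4.1 can be sketched as follows. This is a minimal Python illustration in the spirit of the Amanatides and Woo technique, not the implementation measured in this paper; it assumes that the grid starts at the origin, that the ray origin lies inside the grid, and that all cells have the same edge length. The names `grid_cells`, `t_max` and `t_delta` are illustrative.

```python
import math


def grid_cells(origin, direction, grid_size, cell=1.0):
    """Yield the (i, j, k) grid cells pierced by a ray, in order of traversal."""
    if not any(direction):
        raise ValueError("direction must be non-zero")
    idx = [int(math.floor(o / cell)) for o in origin]
    step, t_max, t_delta = [], [], []
    for o, d, i in zip(origin, direction, idx):
        if d > 0:
            step.append(1)
            t_max.append(((i + 1) * cell - o) / d)  # ray parameter at next boundary
            t_delta.append(cell / d)                # parameter increment per cell
        elif d < 0:
            step.append(-1)
            t_max.append((i * cell - o) / d)
            t_delta.append(-cell / d)
        else:
            step.append(0)
            t_max.append(math.inf)                  # never crosses this axis
            t_delta.append(math.inf)
    while all(0 <= idx[a] < grid_size[a] for a in range(3)):
        yield tuple(idx)
        axis = t_max.index(min(t_max))  # comparisons select the nearest boundary
        idx[axis] += step[axis]         # one addition moves to the neighbour cell
        t_max[axis] += t_delta[axis]    # one addition advances the boundary value
```

For example, `list(grid_cells((0.5, 0.5, 0.5), (1, 0, 0), (4, 4, 4)))` visits the four cells along the x axis. All divisions occur once per ray in the set-up phase, so the per-cell work reduces to comparisons and additions, which is what makes the method attractive for simple hardware.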

4.2. Memory Requirements of Space Subdivision Techniques

Comparing BSP Trees and Octrees

In a BSP tree, the nodes are subdivided along the axis directions in successive stages. Therefore, if similar criteria are used for stopping the subdivision of a node for building a BSP tree and an octree (the same threshold of objects per leaf and the same minimum size of cell), then the BSP tree will have fewer leaf nodes than the octree. However, this is offset by the fact that the ratio of branches per leaf node is one half in a BSP tree and one eighth in an octree. We have compared the number of nodes in an octree and in a BSP tree for a set of sample scenes (see Table 1). The gears and tetrahedra scenes have been proposed in [10]. The teapot1 and teapot2 scenes represent the Utah teapot tessellated into 980 and 9,120 triangles respectively. The numbers of nodes are very similar for both structures. However, a BSP tree is deeper than an octree if it is built with the same criteria of subdivision. Thus more branch nodes are accessed in order to reach a leaf node in a BSP tree than in an octree, for a similar size of leaf. This leads to an increase in data accesses.

Comparing Grids with Tree Structures

BSP trees and octrees are always smaller than grids for similar efficiency. If a grid is built to the same depth as an octree or BSP tree, then the latter structures will be smaller since they are adaptive structures. If, on the other hand, a grid is built with a similar number of cells as the octree or BSP tree has nodes, then the grid cells will be larger than the smallest tree node. Therefore the grid cells may contain more objects than the tree leaf nodes. The number of objects tested for intersection is thus larger for grid than for tree structures, which leads to an increase in rendering time.

Table 1: Comparison of the number of nodes in BSP trees and octrees

Figure 2. Memory requirement per level of octree.

Figure 3. Number of accesses to nodes per level of octree.

5. Octree Building

5.1. Performance of Ray Tracing Algorithm with Varying Octree Complexity

In [14], we pointed out that the nodes of the upper levels of an octree were heavily used, while a significant percentage of the deepest nodes were never accessed. In the teapot2 scene 42% of the nodes belong to levels 8 to 12 (Figure 2), whereas the accesses to these nodes represent less than 1% of the total accesses, as shown in Figure 3. This arises because the probability that a node is intersected by a ray decreases with its size. If a ray intersects the root node, the probability that a node of level i is accessed is $4^{-i}$. Nevertheless, we also observed that it is not possible to truncate the deepest levels of the octree without increasing the number of computations dramatically, due to the large number of objects residing in the subsequent leaf nodes. This leads to a proposal for dynamic octree building, where the upper levels of the tree are built prior to rendering, and the lower levels are built on demand during the rendering. Only those nodes belonging to the lower depths of the octree which are actually accessed by rays are therefore built. The problem of deciding the optimum depth at which to stop building the octree must then be addressed.

Clipping algorithms based on the Cohen-Sutherland and the Sutherland-Hodgman algorithms can be used to determine the objects located in a particular node.

5.2. Dynamic Building of Octrees

In dynamic building, the octree nodes are classified into one of three possible states: leaf, branch, or partially built (L, B, or P). A partially built node is one which contains more than the maximum allowed number of objects at the current depth. During the initialisation stage, only the upper levels of the octree are built. The usual criterion of a maximum number of objects per leaf is used. In order to decide the optimum depth to build the octree initially, the number of accesses to nodes at each level was analysed. Several test scenes were then rendered and results are presented in Section 5.2.2.

If a branch node is intersected by a random ray, the maximum number of its children which the ray can intersect is 4. The probability that a child is hit by the ray is 0.5. Therefore, if there are $A_l$ requests to the nodes of level $l$, the number of requests to the nodes of level $l+1$ is in the worst case

$$A_{l+1} = 4 A_l \, \frac{P_l + B_l}{P_l + B_l + L_l}$$

where $P_l$ is the number of partially built nodes at level $l$, $B_l$ is the number of branch nodes at level $l$, and $L_l$ is the number of leaf nodes at level $l$.

Therefore the number of accesses to level $l+1$ is smaller than at level $l$ if

$$L_l > 3 \, (P_l + B_l)$$

i.e. if there are more than three times as many leaf nodes as branch and partially built nodes at level $l$. This can be calculated dynamically by examining the nodes at each level, and can therefore be used as a guide to determine when to stop building the octree and start rendering.
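The level-by-level test can be written down in a few lines of Python. This is a sketch under an explicit assumption: the worst-case bound is taken to be A(l+1) = 4 A(l) (P + B) / (P + B + L), where P, B and L count the partially built, branch and leaf nodes of level l, which is the form consistent with the factor of three stated in the text. The function names and the `ratio` parameter are illustrative.

```python
def worst_case_accesses(a_l, n_partial, n_branch, n_leaf):
    """Worst-case number of requests to level l+1 given a_l requests to level l."""
    total = n_partial + n_branch + n_leaf
    if total == 0:
        return 0.0
    # Only branch and partially built nodes forward a ray to (at most 4) children.
    return 4.0 * a_l * (n_partial + n_branch) / total


def accesses_drop(n_partial, n_branch, n_leaf, ratio=3):
    """True when leaves outnumber branch and partial nodes by more than `ratio`."""
    return n_leaf > ratio * (n_partial + n_branch)
```

With one partially built node, one branch node and seven leaves at a level, `accesses_drop(1, 1, 7)` holds and 100 requests to that level give fewer than 100 worst-case requests to the next one. A larger `ratio` makes the test more conservative, so that initial building continues to a greater depth.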

During the octree crossing procedure, if the ray hits a node of type ‘P’, then the node is subdivided once and the normal octree traversal algorithm is performed. The octree is therefore built level by level as required by the renderer.

5.2.1. Algorithm

Pseudocode for the octree building algorithm is given below. The octree is created in breadth-first order, and a queue is used to store the nodes as they are processed.
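A minimal Python sketch of such a procedure is given here. It is a reconstruction from the description in Section 5.2 and should not be read as the authors' own pseudocode. It assumes that objects are axis-aligned boxes given as `(lo, hi)` tuples, that the initial build stops at a fixed `initial_depth` supplied by the caller, and that the minimum cell size is modelled by `max_depth`; all names are illustrative.

```python
from collections import deque


class Node:
    def __init__(self, lo, hi, objects, depth):
        self.lo, self.hi = lo, hi
        self.objects = objects
        self.depth = depth
        self.children = []
        self.state = "P"  # 'L' leaf, 'B' branch, 'P' partially built


def overlaps(lo, hi, obj):
    obj_lo, obj_hi = obj
    return all(lo[i] <= obj_hi[i] and obj_lo[i] <= hi[i] for i in range(3))


def subdivide(node, max_objects, max_depth):
    """Split a node once into eight children and classify each child."""
    mid = tuple((node.lo[i] + node.hi[i]) / 2 for i in range(3))
    for k in range(8):
        lo = tuple(mid[i] if (k >> i) & 1 else node.lo[i] for i in range(3))
        hi = tuple(node.hi[i] if (k >> i) & 1 else mid[i] for i in range(3))
        objs = [o for o in node.objects if overlaps(lo, hi, o)]
        child = Node(lo, hi, objs, node.depth + 1)
        # A child with too many objects stays 'P' until a ray asks for it.
        done = len(objs) <= max_objects or child.depth >= max_depth
        child.state = "L" if done else "P"
        node.children.append(child)
    node.objects = []
    node.state = "B"


def build_initial(lo, hi, objects, initial_depth, max_objects=4, max_depth=12):
    """Build only the upper levels, breadth first, using a queue of nodes."""
    root = Node(lo, hi, objects, 1)
    root.state = "L" if len(objects) <= max_objects else "P"
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.state == "P" and node.depth < initial_depth:
            subdivide(node, max_objects, max_depth)
            queue.extend(node.children)
    return root


def visit(node, max_objects=4, max_depth=12):
    """Called when a ray reaches `node`: a 'P' node is subdivided exactly once."""
    if node.state == "P":
        subdivide(node, max_objects, max_depth)
    return node.children  # empty list for a leaf
```

In a renderer the traversal routine would call `visit` on each node it reaches, so that lower levels appear only along the paths that rays actually follow. The fixed `initial_depth` could be replaced by the per-level count of leaf, branch and partially built nodes discussed above.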

5.2.2. Results

Results are presented for three test scenes (rendering times are given in seconds; all scenes were rendered on an Apollo DN4500). It can be seen from Table 2 that since there is a negligible reduction in rendering time for the small gears database (1,169 polygons), dynamic building is not worthwhile in this case. For the teapot2 database, with 9,120 polygons, there is no appreciable advantage in continuing construction of the octree beyond level 8, since the rendering times remain similar after this point (Table 3). Theory predicts that in the worst case the number of accesses to nodes starts to drop after level 6, when the number of leaves is greater than three times the number of branch and partially built nodes. Note that in this case the number of accesses actually starts to drop after level 5. However, an appreciable drop in accesses only occurs after level 8, after which the rendering times remain relatively constant. Similar results are seen for tracing the same database to a greater ray depth, i.e. spawning more reflection and shadow rays (Table 4). Finally, results are presented in Table 5 for a larger database of over 19,000 polygons. Theory predicts that in the worst case accesses start to drop after level 6. It can be seen that rendering times are similar beyond level 8, and therefore there is little point in initially building the octree beyond this depth.

Table 2: Rendering times per level of gears octree (1,169 polygons)

Table 3: Rendering times per level of teapot2 octree (9,120 polygons), ray depth 2

Table 4: Rendering times per level of teapot2 octree, ray depth 4

Table 5: Rendering times per level of large octree (19,000 polygons)

In conclusion, while it is possible to predict in theory when, in the worst case, the number of accesses to a particular level starts to drop, only when there is a significant drop in accesses does the rendering time remain relatively constant. It is therefore advantageous to continue building the octree until this significant drop in accesses occurs. From our analysis, when the number of leaves of level n is more than ten times the number of branches and partially built nodes of that level, then octree construction should be halted, since the rendering time will not improve much for a more complete octree. Time is saved by not building all the lower branches and leaves (only those nodes which are accessed will eventually be built) and less memory is used to store the structure. In the case of the database with 19,000 polygons, this results in a saving of some 45% of the memory required for the structure, and a saving of 20% of the time taken to build the structure for no significant penalty in rendering time.


5.3. Discussion

Advantages of the dynamic octree building algorithm are:

Memory requirements are reduced, which is particularly attractive in multiprocessing environments where processor memory size may be severely limited. If image space parallel algorithms are used then each processor traces only a small fraction of the screen and only a small part of the data structure is accessed. Therefore most of the structure is redundant, and with dynamic building is not built.

Building and initialisation costs are reduced, as only the upper levels of the octree are created initially.

A faster rendering time is achieved in dynamic environments, where part of the database is altered between frames. The part of the data structure containing the altered volume has to be reconstructed prior to the rendering stage and therefore the octree construction costs become integrated into the total rendering costs. Hence accelerating the octree construction task implies a better overall performance in the data structure creation-render cycle time.

If voxel antialiasing [18] is supported, then that part of the octree where the cross-sectional projected area of the octant onto the screen is smaller than a pixel need not be traversed any further, as the illumination model is implemented directly on the voxel. Construction of this part of the octree is therefore redundant, and as this will inevitably be lower levels of the structure this construction is saved using the dynamic build algorithm.

The penalty of this algorithm is that of a small overhead in comparing the number and type of nodes built at each stage of the data structure initialisation phase. This is not significant compared to the reduction in rendering time achieved.

6. Conclusion

A comparative study of spatial data structures has shown that octree-based algorithms perform better than 3D grids and BSP tree structures in both sequential and parallel ray tracers which use general purpose processors. It has been demonstrated that despite the inexpensive cell-to-cell propagation in uniform subdivision structures, the costs of traversing grids and tree structures are very similar, particularly when a fast tree traversal algorithm is used. In addition, the adaptive tree structures are in general smaller than uniform grids for a similar level of efficiency. As fewer nodes are accessed in octree structures than in BSP trees, the octree was found to be the most effective structure for accelerating the ray tracing task.

Interactive ray tracing can only be achieved with current technology through the use of massive parallelism obtained from large multiprocessing systems. In a parallel ray tracer, where dynamic scenes are rendered, data structure creation and storage play an important part in the performance of the system. A dynamic octree building strategy has been proposed which reduces the cost of this data structure creation time and the storage requirements, thus enhancing the performance of the parallel ray tracing algorithm. Results have been quoted for several test databases which indicate that the algorithm is particularly suited to large databases.

References

1. M. Agate, R. L. Grimsdale and P. F. Lister, “The HERO algorithm for ray-tracing octrees”, in Advances in Computer Graphics Hardware IV (1991).

2. J. Amanatides and A. Woo, “A fast voxel traversal algorithm for ray tracing”, in Proc. of Eurographics ’87, pp. 3-10 (August 1987).

3. D. Badouel and T. Priol, “An Efficient Parallel Ray Tracing Scheme for Highly Parallel Architectures”, in Advances in Computer Graphics Hardware V, forthcoming. Presented at the Eurographics ’90 Hardware workshop.

4. C. Bouville, R. Brusq, J. L. Dubois and I. Marchal, “Synthese d’images par lancer de rayon algorithmes et architecture”, Acta Electronica, 26(3-4) pp. 249-259 (1984).

5. J. G. Cleary and G. Wyvill, “Analysis of an Algorithm for Ray Tracing using Uniform Space Subdivision”, The Visual Computer, 4(2) pp. 65-83 (July 1988).

6. O. Devillers, “The Macro-regions: an Efficient Space Subdivision Structure for Ray Tracing”, Proc. of the Eurographics ’89, pp. 27-38 (1989).

7. A. S. Glassner, “Space subdivision for fast ray tracing”, IEEE Computer Graphics and Applications, 4(10) pp. 15-22 (October 1984).

8. A. S. Glassner (ed.), An Introduction To Ray Tracing, Academic Press Inc. (1989).

9. S. A. Green and D. J. Paddon, “Exploiting coherency for multiprocessor ray tracing”, IEEE Computer Graphics and Applications, pp. 12-26 (November 1989).

10. E. Haines, “A proposal for standard graphics environments”, IEEE Computer Graphics and Applications, 7(11) pp. 3-5 (1987).

11. M.-P. Hebert, A Multiprocessor Architecture for Ray Tracing, PhD thesis, University of Sussex, UK (1990).

12. G. M. Birtwistle, J. G. Cleary, B. M. Wyvill and R. Vatti, “Multiprocessor Ray Tracing”, Computer Graphics Forum, 5(1) pp. 3-12 (1986).

13. M. R. Kaplan, “The Use of Spatial Coherence in Ray Tracing”, in Techniques for Computer Graphics, Springer Verlag, pp. 173-193 (1987).

14. M.-P. Hebert, M. D. J. McNeill, B. Shah, R. L. Grimsdale and P. F. Lister, “MARTI - a multiprocessor architecture for ray tracing images”, in Advances in Computer Graphics Hardware V, forthcoming. Presented at the Eurographics ’90 Hardware workshop.

15. H. Nishimura, H. Ohno, T. Kawata, I. Shirakawa and K. Omura, “Links-1: A parallel pipelined multimicrocomputer system for image creation”, in SIGARCH, 10th Annual International Symposium on Computer Architecture, pp. 387-394 (June 1983).

16. M. Potmesil and E. Hoffert, “The Pixel machine: A Parallel Image Computer”, Computer Graphics, 23(3) pp. 69-78 (1989). SIGGRAPH 89 Proceedings.

17. H. Q. Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing and GIS, Computer Series, Addison Wesley (1989).

18. B. C. Shah, Photo-realistic Image Generation Techniques, DPhil thesis, University of Sussex, UK, in progress.

19. T. Whitted, “An improved illumination model for shaded display”, Communications of the ACM, 23(6) pp. 343-349 (1980).
