vertex culling illustration at sbr07

SBR07: Faster ray packets through vertex culling. Illustration and implementation by Syoyo Fujita

Upload: syoyo-fujita

Post on 06-May-2015


TRANSCRIPT

Page 1: Vertex Culling illustration at SBR07

SBR07

Faster ray packets through vertex culling

Illustration and implementation by Syoyo Fujita

Page 2: Vertex Culling illustration at SBR07

Faster ray packets through vertex culling

Alexander Reshetov

IEEE/EG Symposium on Interactive Ray Tracing 2007

Faster Ray Packets - Triangle Intersection through Vertex Culling

Alexander Reshetov

Intel Corporation

ABSTRACT

Acceleration structures are used in ray tracing to sharply reduce the number of ray-triangle intersection tests at the expense of traversing such structures. Bigger structures eliminate more tests, but their traversal becomes less efficient, especially for ray packets, for which the number of inactive rays increases at the lower levels of the acceleration structures. For dynamic scenes, building or updating acceleration structures is one of the major performance impediments.

We propose a new way to reduce the total number of tests by creating a special transient frustum every time a leaf is traversed by a packet of rays. This frustum contains the intersections of active rays with a leaf node and eliminates over 90% of all potential tests. It allows a tenfold reduction in the size of the acceleration structure whilst still achieving better performance.

Keywords: computational complexity, ray tracing, ray-triangle intersection.

1 INTRODUCTION

Finding the intersection of a ray and a triangle is equivalent to solving the linear system of three equations

$$o + t\,d = p_0 + u\,(p_1 - p_0) + v\,(p_2 - p_0) \qquad (1)$$

with five additional requirements

$$0 \le t \le t_{\mathrm{old}} \qquad (2)$$

$$0 \le u, \quad 0 \le v, \quad u + v \le 1 \qquad (3)$$

The left part of the system (1) defines a ray with the origin $o$ and the direction $d$; the right part, a point inside the triangle with vertices $p_0$, $p_1$, and $p_2$. The unknown variables are $t$, the distance to the intersection point from the ray's origin, and $u$, $v$, the barycentric coordinates of the point inside the triangle. It is required that the found intersection point be closer to the ray's origin than the previously found one (2) and within the triangle's boundaries (3).
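As an illustration, system (1) with requirements (2) and (3) can be solved for a single ray with Cramer's rule, in the style of the well-known Möller-Trumbore test. The following is a minimal scalar sketch, not the paper's packet code: the `Vec3` type, the helper functions, the name `ray_triangle`, the `t_old` parameter (the previously found distance), and the `1e-12` parallelism threshold are all assumptions made for this example.

```c
#include <math.h>
#include <stdbool.h>

typedef struct { double x, y, z; } Vec3;   /* hypothetical helper type */

static Vec3 sub(Vec3 a, Vec3 b) { return (Vec3){ a.x - b.x, a.y - b.y, a.z - b.z }; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b)
{
    return (Vec3){ a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

/* Solves o + t d = p0 + u (p1 - p0) + v (p2 - p0) by Cramer's rule.
 * Returns true and writes t, u, v only if 0 <= t <= t_old (2) and
 * 0 <= u, 0 <= v, u + v <= 1 (3) all hold; exits early otherwise. */
bool ray_triangle(Vec3 o, Vec3 d, Vec3 p0, Vec3 p1, Vec3 p2,
                  double t_old, double *t, double *u, double *v)
{
    Vec3 e1 = sub(p1, p0), e2 = sub(p2, p0);
    Vec3 pv = cross(d, e2);
    double det = dot(e1, pv);
    if (fabs(det) < 1e-12) return false;        /* ray parallel to the triangle plane */
    double inv = 1.0 / det;

    Vec3 tv = sub(o, p0);
    double uu = dot(tv, pv) * inv;
    if (uu < 0.0) return false;                 /* violates 0 <= u */

    Vec3 qv = cross(tv, e1);
    double vv = dot(d, qv) * inv;
    if (vv < 0.0 || uu + vv > 1.0) return false; /* violates 0 <= v or u + v <= 1 */

    double tt = dot(e2, qv) * inv;
    if (tt < 0.0 || tt > t_old) return false;    /* violates 0 <= t <= t_old */

    *t = tt; *u = uu; *v = vv;
    return true;
}
```

For a ray starting at (0.25, 0.25, -1) with direction (0, 0, 1) and the triangle (0,0,0), (1,0,0), (0,1,0), this sketch reports t = 1 and u = v = 0.25. A packet implementation would replace the early returns with masks.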

One way to solve the system (1) is to use Cramer's rule, which allows finding numerical values directly for all three variables. This results in implementations similar to Möller-Trumbore [13]. On the other hand, the system (1) could also be solved using Gaussian elimination. It is computationally efficient only if the distance $t$ is calculated first (which, of course, could also be computed directly by using the well-known expression for the distance to a plane along a ray). By substituting the found value of $t$ in any two of the remaining equations, the $u$ and $v$ variables could be found. This substitution geometrically corresponds to projecting the triangle and the intersection point to the plane defined by the coordinates of the two chosen equations and leads to algorithms comparable with the one given by Badouel [1]. The SSE implementation for groups of four rays, proposed by Wald [21], is based on the equivalent 2D projection. An example of the SSE-based 3D approach (formulated in terms of Plücker coordinates) can be found in Benthin's Ph.D. [2].

For vector implementations, conditions (2-3) are usually converted to masks used to selectively update the $t$, $u$, and $v$ values. It makes sense to use packets bigger than the intrinsic SIMD width of the targeted architecture, as this amortizes per-triangle computations and saves bandwidth. It also allows making decisions summarily for the whole packet without processing individual rays, for example by using frustum or interval arithmetic ([5], [7], [11], [22], [23]).

There is no one universal way of solving the system (1) that is suitable for all situations and architectures. In particular, CPU implementations could use conditions (2-3) for exiting from the test early if any one of the five conditions is false. For implementations on GPU [14] or Cell [3], branchless algorithms are more efficient.

Accepted to IEEE/EG Symposium on Interactive Ray Tracing 2007, http://www.uni-ulm.de/rt07/RT07.html, Ulm, Germany, September 10-12, 2007.

For most scenes, the great majority of ray-triangle intersection tests can be eliminated by creating special acceleration structures such as kd-trees, grids, and Bounding Volume Hierarchies (BVH), which exploit the spatial coherency of a scene ([2], [7], [22], [23]). During rendering, the acceleration structure is traversed and all triangles in visited leaf nodes are tested for intersections. For structures which utilize spatial subdivision (kd-trees, grids), it is possible to create an instance of this type for which the total number of tests will be just slightly more than the number of ray-segments (one test per ray), but this may not result in the best performance. The optimal size of the acceleration structure depends on how fast it can be traversed compared with the average speed of the ray-triangle intersection test used. For hierarchical acceleration structures, the higher levels of the hierarchy are usually the most effective in reducing the number of potential tests. Going down the spatial hierarchy, nodes become smaller and smaller and some of the rays in the packet may miss them. This negatively affects the utilization of SIMD units [15].

In recent years, the focus of ray tracing research has shifted to dynamic scenes ([8], [12], [17], [19], [22], [23]), for which it is necessary to optimize for the total execution time (build/update time plus rendering time). Frequently, to improve the build time, only axis-aligned bounding boxes (AABB) of triangles are used, even for kd-trees ([8], [17]). For this reason, it is important to analyze the performance of ray-triangle intersection tests in the situation when the intersection could not be eliminated by testing a ray against the AABB of a triangle. This approach was chosen by Kensler and Shirley [10], who analyzed different 3D versions of intersection algorithms and found the best one via genetic optimization of the fitness function.

Even when a ray does intersect the AABB, the probability of it intersecting the triangle is only about 20%. It is important to swiftly reject the remaining 80%. Amongst the approaches analyzed in the literature [4] are: use of interval arithmetic; testing the intersection of a frustum containing rays (typically represented as 4 corner rays) with a triangle [5]; and culling of the AABB of the individual triangle against the frustum. In these approaches, per-packet data structures are computed before the traversal and then used to eliminate unnecessary tests. The culling techniques could also be used for individual rays, as first proposed by Snyder and Barr in a 1987 paper [18], in which a ray was first tested against the box formed by the intersection of the object's bounding box and a visited cell.

A few years ago, ray-tracing researchers were mostly interested in rendering static scenes, for which persistent data structures were created, optimized for all possible camera positions. Eventually, the focus shifted to dynamic scenes, for which data structures are created (or updated) every frame. There are also initial studies on how to create kd-trees lazily, deeply subdividing only those areas of space that are visited during rendering of a particular frame [9]. In a sense, we pursue this trend to the extreme, where transient data structures are created for every packet and every visited leaf node. This allows very tight structures, optimized for a given packet and a cell. Surprisingly, these fleeting structures can be created and handled very efficiently, building on top of the clipping algorithms characteristic of modern packet traversal techniques.

Figure 1. The transient frustum is created every time a ray packet visits a leaf node (solid/dashed lines correspond to active/inactive rays). Only active rays are used to compute the frustum. Left: general rays. Right: primary rays (common origin).

Page 3: Vertex Culling illustration at SBR07

Demo

Page 4: Vertex Culling illustration at SBR07

Key contribution

Page 5: Vertex Culling illustration at SBR07

Culling rate: 90%

Spatial data structure size: 1/10

Page 6: Vertex Culling illustration at SBR07

Ray-triangle intersection cost

Page 7: Vertex Culling illustration at SBR07

1 cycle (on average)

Page 8: Vertex Culling illustration at SBR07

It’s extremely fast!

Page 9: Vertex Culling illustration at SBR07

Algorithm

Page 10: Vertex Culling illustration at SBR07

What does VC do? (VC = Vertex Culling)

Page 11: Vertex Culling illustration at SBR07

Every time a packet enters a leaf node, do the following:

Page 12: Vertex Culling illustration at SBR07

Create packet

Precompute coeff

For each triangle:

  triangle-packet cull

  triangle-ray intersect (if the cull failed)

Page 13: Vertex Culling illustration at SBR07

Create packet

Precompute coeff

For each triangle:

  triangle-packet cull

  triangle-ray intersect (if the cull failed)

Page 14: Vertex Culling illustration at SBR07

Create packet from active rays

[Figure legend: active ray, inactive ray]

Page 15: Vertex Culling illustration at SBR07

Create packet

Precompute coeff

For each triangle:

  triangle-packet cull

  triangle-ray intersect (if the cull failed)

Page 16: Vertex Culling illustration at SBR07

Think about the triangle-packet test

Page 17: Vertex Culling illustration at SBR07

Axis aligned

Page 18: Vertex Culling illustration at SBR07

Frustum plane

Page 19: Vertex Culling illustration at SBR07

[Figure labels: v, o, n]

(v - o) . n < 0 = outside

Page 20: Vertex Culling illustration at SBR07

[Figure label: v]

any of dot(v, n) < 0 = outside of the frustum

Page 21: Vertex Culling illustration at SBR07

[Figure label: v]

all of dot(v, n) >= 0 = inside of the frustum

Page 22: Vertex Culling illustration at SBR07

[Figure labels: v0, v1, v2, o, n]

all of (v . n) < 0 = triangle is outside of the plane

Page 23: Vertex Culling illustration at SBR07

[Figure labels: v0, v1, v2]

Finally, we get:

any of (v . n) > 0 = frustum and triangle (possibly) intersect
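The plane tests on the preceding slides can be sketched in plain C as follows. This is only an illustration under stated assumptions: each frustum plane is stored as a point `o` on the plane plus an inward-facing normal `n` (as drawn on the slide with v, o, n), and the names `Vec3`, `Plane`, `signed_dist`, and `triangle_outside_frustum` are hypothetical, not taken from the original implementation.

```c
#include <stdbool.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 n; Vec3 o; } Plane;   /* inward normal n, point o on the plane */

/* (v - o) . n : negative means v is outside this plane. */
static float signed_dist(const Plane *p, Vec3 v)
{
    return (v.x - p->o.x) * p->n.x
         + (v.y - p->o.y) * p->n.y
         + (v.z - p->o.z) * p->n.z;
}

/* Returns true when all three vertices are outside the SAME plane,
 * so the triangle cannot touch the frustum and can be culled.
 * A false result only means the triangle POSSIBLY intersects. */
bool triangle_outside_frustum(const Plane planes[4], const Vec3 tri[3])
{
    for (int i = 0; i < 4; ++i) {
        bool all_outside = true;
        for (int k = 0; k < 3; ++k) {
            if (signed_dist(&planes[i], tri[k]) >= 0.0f) {
                all_outside = false;
                break;
            }
        }
        if (all_outside)
            return true;
    }
    return false;
}
```

The test is conservative: a triangle whose vertices are outside different planes is not culled even if it misses the frustum, which is why later stages (corner rays, then per-ray tests) are still needed.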

Page 24: Vertex Culling illustration at SBR07

The inside/outside test of 1 vertex against 4 planes can be reduced to

d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0

Page 25: Vertex Culling illustration at SBR07

d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0

q1, q0 = SIMD variables, precomputable when the packet enters a leaf node.

Page 26: Vertex Culling illustration at SBR07

Culling cost per vertex

Page 27: Vertex Culling illustration at SBR07

1 SIMD mul + 2 SIMD add

Page 28: Vertex Culling illustration at SBR07

Very efficient

Page 29: Vertex Culling illustration at SBR07

This is the core of VC

Page 30: Vertex Culling illustration at SBR07

Create packet

Precompute coeff

For each triangle:

  triangle-packet cull

  triangle-ray intersect (if the cull failed)

Page 31: Vertex Culling illustration at SBR07

Just apply the efficient vertex-frustum culling to each triangle vertex

Page 32: Vertex Culling illustration at SBR07

Culling cost per triangle per packet = 3 (vtx) * 3 (mul, add) + compare
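The per-vertex formula and the per-triangle cost can be sketched as below. To keep the example portable it uses a four-float struct in place of an SSE register, so each loop over four lanes stands for one SIMD instruction. The interpretation of `q1` as the x-components of the four plane normals and `q0` as the four plane offsets is an assumption read off the formula, and the names `Vec4`, `plane_distances`, `outside_mask`, and `vertex_cull` are hypothetical.

```c
#include <stdbool.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { float lane[4]; } Vec4;   /* stand-in for one SSE register */

/* d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0
 * One multiply and two adds per lane, i.e. 1 SIMD mul + 2 SIMD add. */
static Vec4 plane_distances(Vec3 v, Vec4 q1, Vec4 q0)
{
    const float yz[4] = { v.z, -v.y, -v.z, v.y };
    Vec4 d;
    for (int i = 0; i < 4; ++i)
        d.lane[i] = yz[i] + v.x * q1.lane[i] + q0.lane[i];
    return d;
}

/* Bit i is set when the vertex is outside plane i (d < 0). */
static unsigned outside_mask(Vec4 d)
{
    unsigned m = 0u;
    for (int i = 0; i < 4; ++i)
        if (d.lane[i] < 0.0f)
            m |= 1u << i;
    return m;
}

/* 3 vertices * (1 mul + 2 add) + compare: the triangle is culled when
 * some plane has all three vertices on its outside. */
bool vertex_cull(const Vec3 tri[3], Vec4 q1, Vec4 q0)
{
    unsigned m = 0xFu;
    for (int k = 0; k < 3; ++k)
        m &= outside_mask(plane_distances(tri[k], q1, q0));
    return m != 0u;
}
```

With `q1 = {1, 1, 1, 1}` and `q0 = {0, 0, 0, 0}` the four planes bound a pyramid opening along +x with |y| <= x and |z| <= x, which is a convenient configuration for trying the function out.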

Page 33: Vertex Culling illustration at SBR07

Create packet

Precompute coeff

For each triangle:

  triangle-packet cull

  triangle-ray intersect (if the cull failed)

Page 34: Vertex Culling illustration at SBR07

Page 35: Vertex Culling illustration at SBR07

First, test the corner rays against the triangle

Page 36: Vertex Culling illustration at SBR07

Such a case cannot be culled by VC.

Page 37: Vertex Culling illustration at SBR07

Page 38: Vertex Culling illustration at SBR07

[Figure labels: u, v]

all(u) < 0 or all(v) < 0 or all(u+v) > 1

= no ray in the packet intersects the triangle.
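The corner-ray rejection rule can be written as a small predicate. This is a sketch under assumptions: `u` and `v` hold the barycentric coordinates at which the four corner rays hit the triangle's plane, the corner rays bound all rays of the packet, and the function name `corner_rays_all_miss` is hypothetical.

```c
#include <stdbool.h>

/* Returns true when every corner ray lies on the outer side of the same
 * triangle edge: all u < 0, or all v < 0, or all u + v > 1.
 * In that case no ray bounded by the corner rays can hit the triangle. */
bool corner_rays_all_miss(const float u[4], const float v[4])
{
    bool all_u_neg = true;
    bool all_v_neg = true;
    bool all_sum_gt1 = true;
    for (int i = 0; i < 4; ++i) {
        all_u_neg   = all_u_neg   && (u[i] < 0.0f);
        all_v_neg   = all_v_neg   && (v[i] < 0.0f);
        all_sum_gt1 = all_sum_gt1 && (u[i] + v[i] > 1.0f);
    }
    return all_u_neg || all_v_neg || all_sum_gt1;
}
```

Each of the three conditions corresponds to one edge of the triangle, so the predicate is the barycentric analogue of the "all vertices outside one plane" test used for the frustum.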

Page 39: Vertex Culling illustration at SBR07

If all culling tests fail, do the ray-triangle test for each ray in the packet.

Page 40: Vertex Culling illustration at SBR07

That's all!

Page 41: Vertex Culling illustration at SBR07

Benefit of VC

Page 42: Vertex Culling illustration at SBR07

Culling rate: 90%

Spatial data structure size: 1/10

Page 43: Vertex Culling illustration at SBR07

Less traversal, more VCs

More traversal, fewer VCs

Page 44: Vertex Culling illustration at SBR07

Right has 12x more nodes than left, but right is just 30% faster than left.

Page 45: Vertex Culling illustration at SBR07

Spatial data structure: Coarse vs. Fine

Coarse: sequential access, more isects, less traversals, less memory

Fine: random access, less isects, more traversals, much memory

Structures shown: BVH, BIH, kd-tree

Page 46: Vertex Culling illustration at SBR07

Spatial data structure: Coarse vs. Fine

Coarse: sequential access, more isects, less traversals, less memory

Fine: random access, less isects, more traversals, much memory

Shown: VC

Page 47: Vertex Culling illustration at SBR07

The strongest acceleration structure on the earth?

Page 48: Vertex Culling illustration at SBR07

Implementation

Page 49: Vertex Culling illustration at SBR07

Here is my implementation of VC + BVH

Page 50: Vertex Culling illustration at SBR07

Original: C++, for VC

My impl.: C, for gcc

Page 51: Vertex Culling illustration at SBR07

Performance & Profiling

Page 52: Vertex Culling illustration at SBR07

512x512, 10k tris, SAH-BVH

Page 53: Vertex Culling illustration at SBR07

Compile flags: -O3 -ffast-math -mfpmath=sse -msse2

Page 54: Vertex Culling illustration at SBR07

4.7M rays/sec @ 2.16 GHz Core2 Mac

Page 55: Vertex Culling illustration at SBR07

= 460 cycles/ray

Hmm...
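The cycles-per-ray figure follows from dividing the clock rate by the measured ray throughput, assuming a single core running at the nominal 2.16 GHz (the slides do not say how many cores were used):

$$\frac{2.16 \times 10^{9}\ \text{cycles/s}}{4.7 \times 10^{6}\ \text{rays/s}} \approx 460\ \text{cycles/ray}$$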

Page 56: Vertex Culling illustration at SBR07

Statistics

nPackets : 8192
nTraversedLeafs(per packet) : 43066 (5.257080)
nTraversedInnerNodes(per packet) : 109922 (13.418213)
nPacketBBoxTests(per packet) : 228036 (27.836426)
  nBVHAnyHits : 131968 (57.871564 %)
  nBVHAllMisses : 61462 (26.952762 %)
  nBVHLastResorts : 34606 (15.175674 %)
  nBVHLastResortRayBBoxTests : 322288 (per last resort tests) : 9.313067
nTestedTriangles(per packet) : 314650 (38.409424)
  nCulledByBeams(rate) : 223618 (71.068807 %)
  nCulledByUV(rate) : 18536 (5.890990 %)
nActuallyTestedTrianglesPerRay : 2.060669
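As a rough reading of these counters, both culling rates are taken relative to nTestedTriangles. The last term below is an inference rather than a counter printed on the slide: it is the share of triangles that survive both culling stages and fall through to per-ray tests.

$$\frac{223618}{314650} \approx 71.1\%, \qquad \frac{18536}{314650} \approx 5.9\%, \qquad \frac{314650 - 223618 - 18536}{314650} = \frac{72496}{314650} \approx 23.0\%$$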

Page 57: Vertex Culling illustration at SBR07

Core2 Mac

Page 58: Vertex Culling illustration at SBR07

[Figure labels: VC, BVH]

Page 59: Vertex Culling illustration at SBR07

Redundant calcs...

Page 60: Vertex Culling illustration at SBR07

Non-SIMDized code...

Page 61: Vertex Culling illustration at SBR07

Yeah, I have to optimize the BVH code before optimizing VC :-)

Page 62: Vertex Culling illustration at SBR07

Implementation period

Page 63: Vertex Culling illustration at SBR07

packet BVH / Vertex Culling

2 days / 6 days

Page 64: Vertex Culling illustration at SBR07

Very easy to implement!

MLRTA required half a year in my case.

Page 65: Vertex Culling illustration at SBR07

TODO

Page 66: Vertex Culling illustration at SBR07

Much more optimization.

Page 67: Vertex Culling illustration at SBR07

Try using the VC technique for incoherent rays

Page 69: Vertex Culling illustration at SBR07

?

Page 70: Vertex Culling illustration at SBR07

Thank you!