Masked Software Occlusion Culling
Magnus Andersson
2
Occlusion Culling
Stanford Bunny in the Crytek Sponza Atrium
Eye
View frustum
3
Occlusion Culling
Stanford Bunny in the Crytek Sponza Atrium
Fully occluded
4
Occlusion Culling
Stanford Bunny in the Crytek Sponza Atrium
Partially occluded
Pixel processing
Geometry processing
Draw call
5
Hardware Fixed-function Occlusion Culling
Handled automatically under the hood
Per-tile culling granularity
– Semi-occluded triangles can be partially culled
Very late in the pipeline
[Figure: frame timelines for the CPU side and the GPU side. Stages: game logic, upload frame data, draw call, geometry processing, Z tile culling, pixel processing. A second timeline shows game logic + SW culling.]
6
Software Occlusion Culling
Cull very early in the pipeline
– Cull both CPU and GPU work
Short delay
– Can be integrated with scene traversal
7
Binary Space Partitioning (BSP) trees & portals
Precomputed – very efficient
Scene (occluders) must be static
Difficult to handle general scenes
Potentially Visible Sets (PVS)
Quake II, id Software, 1997
Half-Life 2, Valve Corporation, 2004
8
Potentially Visible Sets (PVS)
Quake II, id Software, 1997
Half-Life 2, Valve Corporation, 2004
Player
Not part of PVS
Leaf boundaries
9
Increasingly popular
Modern games have more complex and dynamic worlds
No complex pre-computation
– Simpler content pipeline
Dynamic Occlusion Culling
Assassin’s Creed Unity, Ubisoft, 2014
Battlefield 4, EA DICE, 2013
[HA15]
[Col11]
10
Hierarchical Z Buffer (HiZ) [Greene93]
Rasterize to full resolution z buffer
Create HiZ buffer
– Find the maximum depth in each NxN tile
Perform occlusion query with HiZ buffer
General algorithm works for both SW and HW occlusion culling
Z-buffer Based Culling
Full resolution depth buffer
HiZ buffer
Complex object
Bounding shape
Dragon model courtesy of Stanford University Computer Graphics Laboratory
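The three steps above (full resolution depth, per-tile maximum, query) can be sketched in a few lines. This is a minimal scalar illustration, not code from any of the cited systems: the type `HiZBuffer` and the functions `BuildHiZ` and `IsOccluded` are hypothetical names, and the sketch assumes that larger depth values are farther away and that the buffer size is a multiple of the tile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical types and names for illustration only; not taken from any library.
struct HiZBuffer {
    int tilesX = 0;
    int tilesY = 0;
    int tileSize = 0;
    std::vector<float> zMax;  // farthest depth per tile (larger value = farther away)
};

// Reduce a full resolution depth buffer to one maximum depth per N x N tile.
// Assumes width and height are multiples of tileSize.
HiZBuffer BuildHiZ(const std::vector<float>& depth, int width, int height, int tileSize) {
    assert(tileSize > 0 && width % tileSize == 0 && height % tileSize == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    HiZBuffer hiz;
    hiz.tilesX = width / tileSize;
    hiz.tilesY = height / tileSize;
    hiz.tileSize = tileSize;
    hiz.zMax.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY,
                    std::numeric_limits<float>::lowest());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& tileZ = hiz.zMax[(y / tileSize) * hiz.tilesX + (x / tileSize)];
            tileZ = std::max(tileZ, depth[y * width + x]);
        }
    }
    return hiz;
}

// Conservative occlusion query for the pixel rectangle [x0, x1) x [y0, y1),
// assumed non-empty and inside the buffer. zMin is the nearest depth of the
// queried object (for example, of its bounding shape).
bool IsOccluded(const HiZBuffer& hiz, int x0, int y0, int x1, int y1, float zMin) {
    for (int ty = y0 / hiz.tileSize; ty <= (y1 - 1) / hiz.tileSize; ++ty) {
        for (int tx = x0 / hiz.tileSize; tx <= (x1 - 1) / hiz.tileSize; ++tx) {
            if (zMin <= hiz.zMax[ty * hiz.tilesX + tx]) {
                return false;  // the object might be visible in this tile
            }
        }
    }
    return true;  // behind the farthest occluder depth in every overlapped tile
}
```

The query only ever reads one value per tile, which is why the HiZ buffer is much cheaper to test against than the full resolution depth buffer.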
11
Intel Software Occlusion Culling Framework [CMK16]
Algorithm phases:
1. Rasterize a few designated occluder objects to z buffer
– Heavily SSE/AVX optimized
– Parallel triangle setup
– Parallel pixel depth computation
2. Compute 1-level HiZ buffer (and throw away z buffer)
3. Perform queries and render surviving objects
12
Rendering to z-buffer per pixel
Updating HiZ tile needs all pixels within the tile
Occlusion Query per tile
Wouldn’t it be nice to compute HiZ directly?
– Being conservative is the only requirement
Idea: use alternative HiZ representation
Z-buffer Based Culling
Full resolution depth buffer
HiZ buffer
13
Alternative HiZ buffer representation
Masked Occlusion Culling for Graphics Hardware [AHAM15]
Two depth values per tile
Per-pixel selection mask
$z_{\max}^{0}$, $z_{\max}^{1}$, layer selection mask
[Figure: example tiles with per-pixel 0/1 layer selection masks]
14
Masked Occlusion Culling [AHAM15]
20
Masked Occlusion Culling [AHAM15]
Merge
?
22
Masked Occlusion Culling [AHAM15]
Culled / Not culled
23
Masked Occlusion Culling [AHAM15]
Triangle meshes
24
Originally designed for graphics hardware
Directly update HiZ buffer without computing a full res z buffer
Decouples coverage sampling (rasterization) and depth computation
Masked Occlusion Culling [AHAM15]
Approximate, conservative HiZ buffer
Depth buffer
25
Masked Software Occlusion Culling
Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling?
Much less memory to read/write than full res z-buffer
Updates use bitmasks – can process many pixels in parallel (i.e. SSE/AVX)
No need to compute per-pixel depths
– Would need a fast SW rasterizer to compute coverage
Turns out it can
Paper presented at High Performance Graphics this year [HAAM16]
Source code available!
26
Single Instruction, Multiple Data (SIMD)
x86 (32 bits):
A     = 3
B     = 5
A + B = 8
AVX (256 bits, 8 lanes of 32 bits):
A     =  3   5   6   2   4   1   4  10
B     =  5   7   3   5   5  11   4   5
A + B =  8  12   9   7   9  12   8  15
27
Single Instruction, Multiple Data (SIMD)
x86 (32 bits):
A     = 0xAC1DBA5E
B     = 0x51CAFE37
A & B = 0x0008BA16
AVX (256 bits):
A     = 0xAC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E51CAFE3751CAFE3751CAFE3751CAFE37
B     = 0x51CAFE3751CAFE3751CAFE3751CAFE37AC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E
A & B = 0x0008BA160008BA160008BA160008BA160008BA160008BA160008BA160008BA16
New algorithm target architecture
Supported in our library code
Easily extended to AVX-512
28
An abridged history of Intel’s SIMD instruction sets
SSE, 1999: 128b wide
SSE2, 2001
SSE4, 2006: Intel® microarchitecture code name Nehalem
AVX, 2011: 256b wide, 2nd Gen Intel® Core™ Processors
AVX2, 2013: 4th Gen Intel® Core™ Processors
AVX-512, 2016: 512b wide
[Timeline axis: 1998 to 2017]
Masked software occlusion culling
30
Algorithm Overview
Pipeline stages, in order:
1. Transform
2. Clip
3. Compute bounds
4. Depth plane
5. Traversal setup
6. Compute coverage
7. Depth test
8. Update
Triangle setup: 8-wide triangle setup
Tile traversal: 8 scanlines, 256 pixels (8 tiles with 8x4 pixels)
31
Transform and Clip
32
Compute Bounding Box
Padded to 32x8 pixel supertiles
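As a rough illustration of the padding step, the sketch below snaps a screen-space bounding box outwards to 32x8 supertile boundaries. The names `PixelRect` and `PadToSuperTiles` are hypothetical, the maximum corner is treated as exclusive, and the screen size is assumed to be a multiple of the supertile size.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kSuperTileWidth = 32;
constexpr int kSuperTileHeight = 8;

// Hypothetical rectangle type for illustration; max coordinates are exclusive.
struct PixelRect {
    int minX, minY, maxX, maxY;
};

// Clamp a triangle's pixel bounding box to the screen, then grow it outwards
// so that every side lies on a supertile boundary.
PixelRect PadToSuperTiles(PixelRect r, int screenWidth, int screenHeight) {
    assert(screenWidth % kSuperTileWidth == 0 && screenHeight % kSuperTileHeight == 0);
    r.minX = std::max(0, std::min(r.minX, screenWidth));
    r.maxX = std::max(0, std::min(r.maxX, screenWidth));
    r.minY = std::max(0, std::min(r.minY, screenHeight));
    r.maxY = std::max(0, std::min(r.maxY, screenHeight));
    // Round the minimum down and the maximum up to the nearest boundary.
    r.minX = (r.minX / kSuperTileWidth) * kSuperTileWidth;
    r.minY = (r.minY / kSuperTileHeight) * kSuperTileHeight;
    r.maxX = ((r.maxX + kSuperTileWidth - 1) / kSuperTileWidth) * kSuperTileWidth;
    r.maxY = ((r.maxY + kSuperTileHeight - 1) / kSuperTileHeight) * kSuperTileHeight;
    return r;
}
```

With padded bounds, the traversal can step through whole supertiles without per-step boundary checks.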
33
Compute Depth Plane
Depth = ax + by + c
– Conservative tile depth: check the signs of a and b
– Can be incrementally updated: one pixel step in x adds a, one pixel step in y adds b
– Clamp to vertex depths
[Figure: the four sign combinations of (a, b): (-, -), (+, -), (-, +), (+, +)]
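The sign check can be made concrete with a small sketch. It fits the plane through three screen-space vertices with Cramer's rule and evaluates it at the tile corner where the depth is largest, then clamps to the largest vertex depth. The names `DepthPlane`, `FitDepthPlane` and `ConservativeTileZMax` are hypothetical, and the sketch assumes a non-degenerate triangle and that larger depth means farther away.

```cpp
#include <algorithm>
#include <cassert>

// Depth = a * x + b * y + c in screen space (hypothetical type for illustration).
struct DepthPlane {
    float a, b, c;
};

// Fit the plane through three screen-space vertices (x, y, z) using Cramer's rule.
// Assumes the triangle has non-zero area.
DepthPlane FitDepthPlane(float x0, float y0, float z0,
                         float x1, float y1, float z1,
                         float x2, float y2, float z2) {
    const float dx1 = x1 - x0, dy1 = y1 - y0, dz1 = z1 - z0;
    const float dx2 = x2 - x0, dy2 = y2 - y0, dz2 = z2 - z0;
    const float det = dx1 * dy2 - dx2 * dy1;
    assert(det != 0.0f);
    DepthPlane p;
    p.a = (dz1 * dy2 - dz2 * dy1) / det;
    p.b = (dx1 * dz2 - dx2 * dz1) / det;
    p.c = z0 - p.a * x0 - p.b * y0;
    return p;
}

// Largest plane depth over the tile [tileX, tileX + w] x [tileY, tileY + h].
// The signs of a and b select the corner where the plane is farthest away.
// The plane extends beyond the triangle, so the result is clamped to the
// largest vertex depth.
float ConservativeTileZMax(const DepthPlane& p, float tileX, float tileY,
                           float w, float h, float zMaxVertex) {
    const float x = (p.a > 0.0f) ? tileX + w : tileX;
    const float y = (p.b > 0.0f) ? tileY + h : tileY;
    return std::min(p.a * x + p.b * y + p.c, zMaxVertex);
}
```

Because the plane is linear, moving to the next tile only adds a constant multiple of a or b, which is what makes the incremental update possible.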
34
Supertile Traversal Order
35
AVX Register Layout
36
AVX Register Layout
One scanline per SIMD lane
Compute slopes (∆y/∆x) once
– Similar to regular scanline rasterizers
37
Edge Slopes
38
Compute Intersections
Compute intersections for each scanline
– Eight scanlines in parallel using AVX
39
Compute Coverage Mask
Start with full coverage mask
40
Compute Coverage Mask
Start with full coverage mask
– Shift each lane (scanline) to its intersection
– AVX2 and later have a per-lane shift instruction
41
Compute Coverage Mask
Repeat the same process for the next edge
[Figure: triangle with its left edge and right edges marked]
42
Compute Coverage Mask
Repeat the same process for the next edge
– Edge is facing right: invert mask
43
Compute Coverage Mask
Combine masks of all overlapping edges
44
Compute Coverage Mask
Combine masks of all overlapping edges
– Using bitwise AND
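The shift, invert and AND steps can be emulated with scalar code, one loop iteration per SIMD lane. This is a sketch under stated assumptions rather than the library's implementation: bit x of a lane is taken to be pixel x of that scanline (least significant bit leftmost), `ShiftedFullMask`, `EdgeMask` and `CombineEdges` are hypothetical names, and which edge orientation needs the inversion depends on that bit order. On AVX2 the per-lane shift would map to a variable shift such as `_mm256_sllv_epi32`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 8;  // one scanline of 32 pixels per lane
using LaneMasks = std::array<std::uint32_t, kLanes>;
using LaneInts = std::array<int, kLanes>;

// Full coverage mask shifted to pixel x: bits [x, 32) stay set.
// A shift by 32 is undefined in C++, so that case is handled explicitly.
std::uint32_t ShiftedFullMask(int x) {
    x = std::max(0, std::min(x, 32));
    return (x >= 32) ? 0u : (0xFFFFFFFFu << x);
}

// Coverage of one edge on eight scanlines. crossX holds the pixel where the
// edge intersects each scanline. With invert == false the pixels at and to
// the right of the intersection are kept; with invert == true the mask is
// flipped so the pixels to the left are kept instead.
LaneMasks EdgeMask(const LaneInts& crossX, bool invert) {
    LaneMasks masks{};
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint32_t m = ShiftedFullMask(crossX[lane]);
        masks[lane] = invert ? ~m : m;
    }
    return masks;
}

// A pixel is inside the triangle only if it is inside every edge: bitwise AND.
LaneMasks CombineEdges(const LaneMasks& a, const LaneMasks& b) {
    LaneMasks out{};
    for (int lane = 0; lane < kLanes; ++lane) {
        out[lane] = a[lane] & b[lane];
    }
    return out;
}
```

For example, an edge crossing every scanline at pixel 4 combined with an inverted edge crossing at pixel 12 leaves pixels 4 to 11 covered on each scanline.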
46
Shuffle Mask
Shuffle mask to form better shaped tiles
– Before: each SIMD lane is a scanline
– After: each SIMD lane is an 8x4 tile
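A scalar sketch of the shuffle follows. It rearranges eight 32-pixel scanline masks (one 32x8 supertile) into eight 8x4 tile masks. The function name `ScanlinesToTiles`, the tile numbering (four tiles across, two down) and the bit order inside a tile (row-major, 8 bits per row) are assumptions made for illustration; the SIMD version would use byte shuffles instead of loops.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Input:  rows[y], y in [0, 8): bit x is pixel (x, y) of a 32x8 supertile.
// Output: tiles[ty * 4 + tx]: bit (r * 8 + c) is pixel (tx * 8 + c, ty * 4 + r).
std::array<std::uint32_t, 8> ScanlinesToTiles(const std::array<std::uint32_t, 8>& rows) {
    std::array<std::uint32_t, 8> tiles{};
    for (int ty = 0; ty < 2; ++ty) {
        for (int tx = 0; tx < 4; ++tx) {
            for (int r = 0; r < 4; ++r) {
                // Take the 8 pixels of this tile from scanline ty * 4 + r
                // and place them in row r of the tile.
                const std::uint32_t byte = (rows[ty * 4 + r] >> (tx * 8)) & 0xFFu;
                tiles[ty * 4 + tx] |= byte << (r * 8);
            }
        }
    }
    return tiles;
}
```

After the shuffle, each 32-bit lane describes one compact 8x4 tile, so one depth value per lane is enough for the depth test and update.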
48
Depth Test
Interpolate conservative depth (per 8x4 tile)
Test against buffer
49
Update Tile
Two code paths (can be switched at compile time)
– Original update method [AHAM15]
– New update method tailored for SW [HAAM16]
Why use a new update method?
– Faster, with the same culling power
– Less accurate than original, more dependent on render order
– Works best if you render front-to-back
50
Update Tile, New Method [HAAM16]
$z_{\max}^{0}$ is the reference layer
– Maximum value for the entire tile
$z_{\max}^{1}$ is the working layer
– Maximum value for a subset of the tile
– Updated as
– New depth = $\max(z_{\max}^{1}, z_{\max}^{\mathrm{tri}})$
– New mask = TriangleMask OR LayerMask
Whenever working layer mask is full, overwrite reference layer
55
Update Tile
Discard heuristic: if $z_{\max}^{1} - z_{\max}^{\mathrm{tri}} > z_{\max}^{0} - z_{\max}^{1}$, discard working layer
56
Update Tile
Restart
59
Update Tile
Full overwrite: restart from new value
60
Update Tile
Update is quicker than original [AHAM15]
Test is also quicker
– Need only to test against reference layer ($z_{\max}^{0}$)
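The tile update described on the preceding slides can be sketched for a single 8x4 tile as follows. This is a scalar illustration, not the library source: `MaskedTile`, `UpdateTile` and `IsTileOccluded` are hypothetical names, depth is assumed to lie in [0, 1] with larger values farther away, an empty working layer is represented by depth 0 and mask 0, and the early-out for triangles that are not in front of the reference layer is a simplification of the depth test step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One 8x4 tile: two depth values plus a 32-bit per-pixel selection mask.
struct MaskedTile {
    float zMax0 = 1.0f;      // reference layer: bound for the entire tile
    float zMax1 = 0.0f;      // working layer: bound for the pixels set in mask
    std::uint32_t mask = 0;  // pixels currently covered by the working layer
};

// Merge one triangle into the tile. triMask is the triangle's coverage in this
// tile and zMaxTri its conservative (farthest) depth inside the tile.
void UpdateTile(MaskedTile& t, std::uint32_t triMask, float zMaxTri) {
    // Assumed simple depth test: skip triangles that cover nothing or are not
    // in front of the reference layer.
    if (triMask == 0u || zMaxTri >= t.zMax0) {
        return;
    }
    // Discard heuristic: if the triangle is farther in front of the working
    // layer than the working layer is in front of the reference layer, drop
    // the working layer and restart from the triangle.
    if (t.zMax1 - zMaxTri > t.zMax0 - t.zMax1) {
        t.zMax1 = 0.0f;
        t.mask = 0u;
    }
    // Merge the triangle into the working layer.
    t.zMax1 = std::max(t.zMax1, zMaxTri);
    t.mask |= triMask;
    // Working layer covers the whole tile: it becomes the new reference layer.
    if (t.mask == 0xFFFFFFFFu) {
        t.zMax0 = t.zMax1;
        t.zMax1 = 0.0f;
        t.mask = 0u;
    }
}

// Occlusion test against the reference layer only. zMinQuery is the nearest
// depth of the queried object inside this tile.
bool IsTileOccluded(const MaskedTile& t, float zMinQuery) {
    return zMinQuery >= t.zMax0;
}
```

Note that both functions operate on whole masks at once and never touch per-pixel depths, which is the property that makes the representation attractive for SIMD.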
62
Results
Intel Occlusion Culling Sample
Clear: Clearing the depth buffer
Geom: Transform & project geometry
Rast: Triangle setup & occluder rasterization
Gen: Compute HiZ buffer from full resolution z buffer
Test: Perform occlusion queries
[Figure: time per phase (μs), Old [CMK16] vs. New [HAAM16]; chart annotations: 3.7x and 16x]
63
Performance comparison for camera animation
Results
First frame
Last frame
Old New Frustum only
Code is available as open-source
65
Masked Occlusion Culling API
Setup
void SetResolution();
void SetNearClipPlane();
void ClearBuffer();
Render & query
static void TransformVertices();
Result RenderTriangles();
Result TestTriangles();
Result TestRect();
Debug
void ComputePixelDepthBuffer();
OcclusionCullingStatistics GetStatistics();
66
Masked Occlusion Culling API
Render to the software HiZ buffer
Result RenderTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices into the inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
67
Masked Occlusion Culling API
Result RenderTriangles(
float *inVtx,
uint *inTris,
int nTris,
ClipPlanes mask,
ScissorRect *scissor,
VertexLayout &layout
);
[Figure: eye, view frustum and near plane, with example objects labeled mask = 0 and mask = leftPlane | nearPlane]
Clipping is not free...
– If you’re already doing frustum culling, let the API know the outcome
68
Masked Occlusion Culling API
Result RenderTriangles(
float *inVtx,
uint *inTris,
int nTris,
ClipPlanes mask,
ScissorRect *scissor,
VertexLayout &layout
);
[Figure: eye and view frustum with a scissor region (screen space AABB)]
Can be used for threading
– One scissor region per thread
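One possible way to derive per-thread scissor regions is to cut the screen into horizontal bands, so that no two threads write the same part of the buffer. The sketch below is an assumption about how such a split could look, not the library's scheme: `ScissorRegion` and `SplitScreen` are hypothetical names (the real `ScissorRect` type is not reproduced here), the maximum corner is exclusive, and aligning band boundaries to 8-pixel rows is an illustrative choice.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical screen-space AABB for illustration; max coordinates are exclusive.
struct ScissorRegion {
    int minX, minY, maxX, maxY;
};

// Split the screen into at most numThreads horizontal bands whose boundaries
// fall on multiples of 8 pixel rows. Each thread would render the occluders
// using its own band as the scissor region.
std::vector<ScissorRegion> SplitScreen(int width, int height, int numThreads) {
    assert(width > 0 && height > 0 && numThreads > 0);
    const int rowHeight = 8;
    const int rows = (height + rowHeight - 1) / rowHeight;
    std::vector<ScissorRegion> regions;
    for (int i = 0; i < numThreads; ++i) {
        const int rowBegin = (rows * i) / numThreads;
        const int rowEnd = (rows * (i + 1)) / numThreads;
        if (rowBegin == rowEnd) {
            continue;  // more threads than rows: skip empty bands
        }
        regions.push_back(ScissorRegion{0, rowBegin * rowHeight, width,
                                        std::min(rowEnd * rowHeight, height)});
    }
    return regions;
}
```

For a 1920x1080 buffer and four threads this yields bands starting at rows 0, 264, 536 and 808.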
69
Masked Occlusion Culling API
Test triangles against the software HiZ buffer
– Does not update the buffer
// Returns the collective culling outcome of the triangles
Result TestTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices into the inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
70
Masked Occlusion Culling API
Test rectangle against the software HiZ buffer
– Does not update the buffer
// Returns the culling outcome of the screen space rectangle
Result TestRect(
    float xmin,  // Screen space bounds: [xmin, ymin] – [xmax, ymax]
    float ymin,
    float xmax,
    float ymax,
    float wmin   // Conservative clip space w (typically the w-component of the nearest bbox vertex in clip space)
);
71
Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling
ClearBuffer();
prioQueue.push(root);
while (!prioQueue.empty()) {
    Node node = prioQueue.pop();               // next node in priority order
    if (FrustumTest(node) == Culled)
        continue;                              // outside the view frustum
    bounds = compute_screen_space_bounds(node);
    if (TestRect(bounds) == Culled)
        continue;                              // hidden behind occluders rendered so far
    if (node is InnerNode) {
        prioQueue.push(node.left, dist);       // dist: priority of the child
        prioQueue.push(node.right, dist);
    } else {                                   // node is Leaf
        xfVertices = TransformVertices(node.vertices);
        RenderTriangles(xfVertices);           // the leaf now acts as an occluder
        send_leaf_to_GPU();
    }
}
RenderFrame
Culled!
72
Essential Tools We Have Relied On
Intel® VTune™
– https://software.intel.com/en-us/intel-vtune-amplifier-xe
SSE/AVX intrinsics guide
– https://software.intel.com/sites/landingpage/IntrinsicsGuide/
73
References
[AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9
[CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016)
[Col11] COLLIN D.: Culling the Battlefield. Game Developers Conference (presentation), (2011)
[Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238
[HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015)
[HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016)
74
Check it out!
GitHub: Lightweight library
– https://github.com/GameTechDev/MaskedOcclusionCulling
GitHub: Example integrated in Intel’s Software Occlusion Culling demo
– https://github.com/GameTechDev/OcclusionCulling
Project page: Masked Software Occlusion Culling
– https://software.intel.com/en-us/articles/masked-software-occlusion-culling
Questions and feedback welcome
© 2016 Intel Corporation. Intel, the Intel logo, VTune and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.