Masked Software Occlusion Culling

Magnus Andersson

Uploaded by intel-software on 23-Jan-2017

TRANSCRIPT

Page 1: Masked Software Occlusion Culling

Magnus Andersson

Page 2: Masked Software Occlusion Culling

2

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Eye

View frustum

Page 3: Masked Software Occlusion Culling

3

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Fully occluded

Page 4: Masked Software Occlusion Culling

4

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Partially occluded

Page 5: Masked Software Occlusion Culling

Hardware Fixed-function Occlusion Culling

Handled automatically under the hood

Per-tile culling granularity

– Semi-occluded triangles can be partially culled

Very late in the pipeline

[Figure: frame pipeline. CPU side: Game logic, Upload frame data, Draw call. GPU side: Geometry processing, Z Tile Culling, Pixel processing.]

Page 6: Masked Software Occlusion Culling

Software Occlusion Culling

Cull very early in the pipeline

– Cull both CPU and GPU work

Short delay

– Can be integrated with scene traversal

[Figure: frame pipeline. CPU side: Game logic + SW culling, Upload frame data, Draw call. GPU side: Geometry processing, Z Tile Culling, Pixel processing.]

Page 7: Masked Software Occlusion Culling

7

Binary Space Partitioning (BSP) trees & portals

Precomputed – very efficient

Scene (occluders) must be static

Difficult to handle general scenes

Potentially Visible Sets (PVS)

Quake II, id Software, 1997

Half-Life 2, Valve Corporation, 2004

Page 8: Masked Software Occlusion Culling

8

Potentially Visible Sets (PVS)

Quake II, id Software, 1997

Half-Life 2, Valve Corporation, 2004

Player

Not part of PVS

Leaf boundaries

Page 9: Masked Software Occlusion Culling

9

Increasingly popular

Modern games have more complex and dynamic worlds

No complex pre-computation

– Simpler content pipeline

Dynamic Occlusion Culling

Assassin’s Creed Unity, Ubisoft, 2014

Battlefield 4, EA DICE, 2013

[HA15]

[Col11]

Page 10: Masked Software Occlusion Culling

10

Hierarchical Z Buffer (HiZ) [Greene93]

Rasterize to full resolution z buffer

Create HiZ buffer

– Find the maximum depth in each NxN tile

Perform occlusion query with HiZ buffer

General algorithm works for both SW and HW occlusion culling
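The HiZ steps listed above can be sketched in a few lines of C++. This is only an illustration, not code from any of the cited frameworks: the `HiZBuffer` type and both function names are hypothetical, depth is assumed to grow with distance, and the query rectangle is assumed to lie inside the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical types for illustration; not taken from any library.
struct HiZBuffer {
    int tilesX, tilesY, tileSize;
    std::vector<float> zMax;  // farthest depth found in each tile
};

// Reduce a full resolution depth buffer (row-major, larger = farther)
// to one maximum depth per tileSize x tileSize tile.
HiZBuffer BuildHiZ(const std::vector<float>& depth, int width, int height, int tileSize) {
    assert(tileSize > 0 && width % tileSize == 0 && height % tileSize == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    HiZBuffer hiz{width / tileSize, height / tileSize, tileSize, {}};
    hiz.zMax.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY,
                    std::numeric_limits<float>::lowest());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& z = hiz.zMax[(y / tileSize) * hiz.tilesX + x / tileSize];
            z = std::max(z, depth[static_cast<std::size_t>(y) * width + x]);
        }
    }
    return hiz;
}

// Conservative query: the pixel rectangle [x0, x1] x [y0, y1] (inclusive) with
// nearest depth zMin is reported occluded only if no touched tile could show it.
bool IsOccluded(const HiZBuffer& hiz, int x0, int y0, int x1, int y1, float zMin) {
    for (int ty = y0 / hiz.tileSize; ty <= y1 / hiz.tileSize; ++ty)
        for (int tx = x0 / hiz.tileSize; tx <= x1 / hiz.tileSize; ++tx)
            if (zMin < hiz.zMax[ty * hiz.tilesX + tx])
                return false;  // may be in front of something in this tile
    return true;
}
```

A bounding shape would be reduced to its screen rectangle and nearest depth before calling `IsOccluded`; a `false` result only means "possibly visible".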

Z-buffer Based Culling

Full resolution depth buffer

HiZ buffer

Complex object

Bounding shape

Dragon model courtesy of Stanford University Computer Graphics Laboratory

Page 11: Masked Software Occlusion Culling

11

Intel Software Occlusion Culling Framework [CMK16]

Algorithm phases:

1. Rasterize a few designated occluder objects to z buffer

– Heavily SSE/AVX optimized

– Parallel triangle setup

– Parallel pixel depth computation

2. Compute 1-level HiZ buffer (and throw away z buffer)

3. Perform queries and render surviving objects

Page 12: Masked Software Occlusion Culling

12

Rendering to z-buffer per pixel

Updating HiZ tile needs all pixels within the tile

Occlusion Query per tile

Wouldn’t it be nice to compute HiZ directly?

– Being conservative is the only requirement

Idea: use alternative HiZ representation

Z-buffer Based Culling

Full resolution depth buffer

HiZ buffer

Page 13: Masked Software Occlusion Culling

13

Alternative HiZ buffer representation

Masked Occlusion Culling for Graphics Hardware [AHAM15]

Two depth values per tile

Per-pixel selection mask

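zmax0, zmax1, layer selection mask

[Figure: example layer selection mask bits, shown in groups of four]

0 0 0 1 | 0 0 1 1 | 0 0 1 1 | 0 1 1 1

0 0 0 0 | 0 0 0 0 | 0 0 0 1 | 0 0 0 1

1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1

0 0 0 1 | 0 0 1 1 | 0 0 0 1 | 0 0 0 1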
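One way to picture the representation is as a small struct per tile. The sketch below is an assumption-laden illustration: the `MaskedTile` layout and helper names are hypothetical, and the 32-bit mask assumes the 8x4 pixel tiles used by the software version later in the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical layout for one 8x4 pixel tile (32 pixels, 32 mask bits).
struct MaskedTile {
    float zMax0;         // depth bound for pixels whose mask bit is 0
    float zMax1;         // depth bound for pixels whose mask bit is 1
    std::uint32_t mask;  // one layer selection bit per pixel
};

// Conservative depth bound of a single pixel (bit index 0..31).
float PixelZMax(const MaskedTile& t, int bit) {
    assert(bit >= 0 && bit < 32);
    return ((t.mask >> bit) & 1u) ? t.zMax1 : t.zMax0;
}

// Conservative depth bound for the whole tile: the farthest layer in use.
float TileZMax(const MaskedTile& t) {
    if (t.mask == 0u) return t.zMax0;   // every pixel selects layer 0
    if (t.mask == ~0u) return t.zMax1;  // every pixel selects layer 1
    return std::max(t.zMax0, t.zMax1);
}
```

Two floats and one mask replace 32 per-pixel depths, which is where the memory saving comes from.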

Page 14: Masked Software Occlusion Culling

14

Masked Occlusion Culling [AHAM15]

Page 15: Masked Software Occlusion Culling

15

Masked Occlusion Culling [AHAM15]

Page 16: Masked Software Occlusion Culling

16

Masked Occlusion Culling [AHAM15]

Page 17: Masked Software Occlusion Culling

17

Masked Occlusion Culling [AHAM15]

Page 18: Masked Software Occlusion Culling

18

Masked Occlusion Culling [AHAM15]

Page 19: Masked Software Occlusion Culling

19

Masked Occlusion Culling [AHAM15]

Page 20: Masked Software Occlusion Culling

20

Masked Occlusion Culling [AHAM15]

Merge

?

Page 21: Masked Software Occlusion Culling

21

Masked Occlusion Culling [AHAM15]

Page 22: Masked Software Occlusion Culling

22

Masked Occlusion Culling [AHAM15]

Culled

Not culled

Page 23: Masked Software Occlusion Culling

23

Masked Occlusion Culling [AHAM15]

Triangle meshes

Page 24: Masked Software Occlusion Culling

24

Originally designed for graphics hardware

Directly update HiZ buffer without computing a full res z buffer

Decouples coverage sampling (rasterization) and depth computation

Masked Occlusion Culling [AHAM15]

Approximate, conservative HiZ buffer

Depth buffer

Page 25: Masked Software Occlusion Culling

25

Masked Software Occlusion Culling

Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling?

Much less memory to read/write than full res z-buffer

Updates use bitmasks – can process many pixels in parallel (i.e. SSE/AVX)

No need to compute per-pixel depths

– Would need a fast SW rasterizer to compute coverage

Turns out it can

Paper presented at High Performance Graphics this year [HAAM16]

Source code available!

Page 26: Masked Software Occlusion Culling

26

Single Instruction, Multiple Data (SIMD)

[Figure: 32-bit scalar addition (x86) beside lane-wise addition on a 256-bit register of 32-bit lanes (AVX). Operand values as extracted, across both diagrams:]

A:     3  3  5  6  2  4  1  4  10

B:     5  5  7  3  5  5  11 4  5

A + B: 8  8  12 9  7  9  12 8  15

Page 27: Masked Software Occlusion Culling

27

Single Instruction, Multiple Data (SIMD)

[Figure: bitwise AND on a 32-bit scalar (x86) and on a 256-bit register (AVX)]

x86, 32 bits:

A     = 0xAC1DBA5E

B     = 0x51CAFE37

A & B = 0x0008BA16

AVX, 256 bits:

A     = 0xAC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E51CAFE3751CAFE3751CAFE3751CAFE37

B     = 0x51CAFE3751CAFE3751CAFE3751CAFE37AC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E

A & B = 0x0008BA160008BA160008BA160008BA160008BA160008BA160008BA160008BA16
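The lane-wise add and AND from the two SIMD slides can be tried with real intrinsics. The sketch below uses the 128-bit SSE2 versions (`_mm_add_epi32`, `_mm_and_si128`) rather than AVX, so it builds without extra compiler flags on x86-64; the `Lanes4` alias, the function names and the scalar fallback for other targets are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1
#endif

using Lanes4 = std::array<std::uint32_t, 4>;  // four 32-bit lanes = 128 bits

#ifdef SKETCH_HAS_SSE2
static __m128i Load(const Lanes4& v) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(v.data()));
}
static Lanes4 Store(__m128i v) {
    Lanes4 out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
#endif

// Add all four lanes with a single instruction when SSE2 is available.
Lanes4 AddLanes(const Lanes4& a, const Lanes4& b) {
#ifdef SKETCH_HAS_SSE2
    return Store(_mm_add_epi32(Load(a), Load(b)));
#else
    Lanes4 out{};
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];  // scalar fallback
    return out;
#endif
}

// Bitwise AND over the whole 128-bit register.
Lanes4 AndLanes(const Lanes4& a, const Lanes4& b) {
#ifdef SKETCH_HAS_SSE2
    return Store(_mm_and_si128(Load(a), Load(b)));
#else
    Lanes4 out{};
    for (int i = 0; i < 4; ++i) out[i] = a[i] & b[i];  // scalar fallback
    return out;
#endif
}
```

With four of the operand pairs from the addition slide, `AddLanes({4, 1, 4, 10}, {5, 11, 4, 5})` gives `{9, 12, 8, 15}`, and ANDing lanes of `0xAC1DBA5E` with `0x51CAFE37` gives `0x0008BA16` in every lane.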

Page 28: Masked Software Occlusion Culling

An abridged history of Intel’s SIMD instruction sets

[Timeline, 1998 to 2017]

– SSE, 1999: 128b wide

– SSE2, 2001

– SSE4, 2006: Intel® microarchitecture code name Nehalem

– AVX, 2011: 256b wide, 2nd Gen Intel® Core™ Processors

– AVX2, 2013: 4th Gen Intel® Core™ Processors

– AVX-512, 2016: 512b wide

Timeline annotations: New algorithm target architecture; Supported in our library code; Easily extended to AVX-512

Page 29: Masked Software Occlusion Culling

Masked software occlusion culling

Page 30: Masked Software Occlusion Culling

30

Algorithm Overview

Triangle setup: 8-wide triangle setup

Tile traversal: 8 scanlines, 256 pixels (8 tiles with 8x4 pixels)

Pipeline stages:

1. Transform

2. Clip

3. Compute bounds

4. Depth plane

5. Traversal setup

6. Compute coverage

7. Depth test

8. Update

Page 31: Masked Software Occlusion Culling

31

Transform and Clip


Page 32: Masked Software Occlusion Culling

32

Compute Bounding Box

Padded to 32x8 pixel supertiles
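A rough sketch of the padding step follows. The `PixelRect` type and function name are hypothetical, the triangle is assumed to overlap the screen, and the screen size is assumed to be a multiple of the 32x8 supertile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct PixelRect { int minX, minY, maxX, maxY; };  // max edges are exclusive

// Bounding box of a screen-space triangle, grown outward to the 32x8 pixel
// supertile grid and clamped to the screen.
PixelRect SupertileBounds(const float x[3], const float y[3], int width, int height) {
    const int kSuperW = 32, kSuperH = 8;
    assert(width % kSuperW == 0 && height % kSuperH == 0);
    // Pixel bounds of the triangle, clamped to the screen.
    int minX = std::max(0, static_cast<int>(std::floor(std::min({x[0], x[1], x[2]}))));
    int minY = std::max(0, static_cast<int>(std::floor(std::min({y[0], y[1], y[2]}))));
    int maxX = std::min(width, static_cast<int>(std::floor(std::max({x[0], x[1], x[2]}))) + 1);
    int maxY = std::min(height, static_cast<int>(std::floor(std::max({y[0], y[1], y[2]}))) + 1);
    // Round the minimum down and the maximum up to supertile multiples.
    PixelRect r;
    r.minX = (minX / kSuperW) * kSuperW;
    r.minY = (minY / kSuperH) * kSuperH;
    r.maxX = std::min(width, ((maxX + kSuperW - 1) / kSuperW) * kSuperW);
    r.maxY = std::min(height, ((maxY + kSuperH - 1) / kSuperH) * kSuperH);
    return r;
}
```

Aligning the box this way means traversal always starts and ends on whole supertiles, so each step can fill a full set of SIMD lanes.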


Page 33: Masked Software Occlusion Culling

33

Compute Depth Plane

Depth = ax + by + c

– Conservative tile depth: Check sign of a and b

– Can be incrementally updated

– Clamp to vertex depths

[Figure: the signs of (a, b) select the tile corner used for the conservative depth: (-, -), (+, -), (-, +), (+, +). Steps along the axes are labeled + a and + b.]
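The depth plane idea can be sketched as follows. The names `DepthPlane`, `MakeDepthPlane` and `ConservativeTileDepth` are hypothetical, and depth is assumed to grow with distance so that the conservative value for an occluder is the maximum over the tile.

```cpp
#include <algorithm>
#include <cassert>

struct DepthPlane {
    float a, b, c;     // depth = a*x + b*y + c
    float zVertexMax;  // farthest vertex depth, used for clamping
};

// Fit the plane through three screen-space vertices (x, y, depth).
DepthPlane MakeDepthPlane(const float x[3], const float y[3], const float z[3]) {
    float dx1 = x[1] - x[0], dy1 = y[1] - y[0], dz1 = z[1] - z[0];
    float dx2 = x[2] - x[0], dy2 = y[2] - y[0], dz2 = z[2] - z[0];
    float det = dx1 * dy2 - dx2 * dy1;
    assert(det != 0.0f);  // degenerate triangles have no unique plane
    DepthPlane p;
    p.a = (dz1 * dy2 - dz2 * dy1) / det;
    p.b = (dx1 * dz2 - dx2 * dz1) / det;
    p.c = z[0] - p.a * x[0] - p.b * y[0];
    p.zVertexMax = std::max({z[0], z[1], z[2]});
    return p;
}

// Maximum plane depth over a tile: the signs of a and b pick the corner,
// and the result is clamped so it never exceeds the farthest vertex.
float ConservativeTileDepth(const DepthPlane& p, float tileX, float tileY,
                            float tileW, float tileH) {
    float x = p.a > 0.0f ? tileX + tileW : tileX;
    float y = p.b > 0.0f ? tileY + tileH : tileY;
    return std::min(p.a * x + p.b * y + p.c, p.zVertexMax);
}
```

Because the plane is linear, moving one tile to the right changes the unclamped value by `a * tileW` and moving one tile down changes it by `b * tileH`, which is what makes an incremental update possible.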

Page 34: Masked Software Occlusion Culling

34

Supertile Traversal Order


Page 35: Masked Software Occlusion Culling

35

AVX Register Layout


Page 36: Masked Software Occlusion Culling

36

AVX Register Layout

One scanline per SIMD lane


Page 37: Masked Software Occlusion Culling

Edge Slopes

Compute slopes (∆y/∆x) once

– Similar to regular scanline rasterizers

Page 38: Masked Software Occlusion Culling

38

Compute Intersections

Compute intersections for each scanline

– Eight scanlines in parallel using AVX
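A scalar sketch of the intersection step is shown below, with the loop over eight lanes standing in for one AVX operation. The `Edge` type and function names are hypothetical; the sketch stores the x step per scanline (∆x/∆y) and samples at scanline centers, both of which are choices made for illustration.

```cpp
#include <array>
#include <cassert>

struct Edge {
    float x0, y0;  // a point on the edge
    float dxdy;    // x step per scanline, computed once per edge
};

Edge MakeEdge(float xa, float ya, float xb, float yb) {
    assert(ya != yb);  // a horizontal edge has no per-scanline x step
    return Edge{xa, ya, (xb - xa) / (yb - ya)};
}

// x coordinate where the edge crosses each of eight consecutive scanlines,
// sampled at the scanline center (y + 0.5). Each loop iteration corresponds
// to one SIMD lane.
std::array<float, 8> Intersections(const Edge& e, int firstScanline) {
    std::array<float, 8> x{};
    for (int lane = 0; lane < 8; ++lane) {
        float yc = static_cast<float>(firstScanline + lane) + 0.5f;
        x[lane] = e.x0 + (yc - e.y0) * e.dxdy;
    }
    return x;
}
```

For an edge from (4, 0) to (0, 8), the step is -0.5 pixels per scanline, so the crossings run from 3.75 on the first scanline to 0.25 on the eighth.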


Page 39: Masked Software Occlusion Culling

39

Compute Coverage Mask

Start with full coverage mask


Page 40: Masked Software Occlusion Culling

40

Compute Coverage Mask

Start with full coverage mask

– Shift each lane (scanline) to intersection

– AVX2 and later have per-lane shift instruction

Page 41: Masked Software Occlusion Culling

41

Compute Coverage Mask

Repeat the same process for the next edge

[Figure labels: Left edge, Right edge, Right edge]

Page 42: Masked Software Occlusion Culling

42

Compute Coverage Mask

Repeat the same process for the next edge

– Edge is facing right: invert mask

Page 43: Masked Software Occlusion Culling

43

Compute Coverage Mask

Combine masks of all overlapping edges


Page 44: Masked Software Occlusion Culling

44

Compute Coverage Mask

Combine masks of all overlapping edges

– Using bitwise AND

Page 45: Masked Software Occlusion Culling

45

Compute Coverage Mask

Combine masks of all overlapping edges

– Using bitwise AND
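The shift, invert and AND steps of the coverage mask slides can be sketched in scalar form, one 32-bit mask per scanline. Several details are assumptions: the most significant bit is taken to be the leftmost pixel (so a right shift matches the arrows on the slide), `facesRight` is read as "the triangle interior lies to the left of the edge", and the type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Mask with all pixel columns >= x set; bit 31 is column 0 (leftmost).
std::uint32_t RightOf(int x) {
    if (x <= 0) return ~0u;
    if (x >= 32) return 0u;  // shifting by 32 or more is undefined, so handle it
    return ~0u >> x;
}

struct EdgeCrossing {
    std::array<int, 8> x;  // first covered pixel column per scanline
    bool facesRight;       // true: interior is to the left, so invert the mask
};

// One 32-pixel coverage mask per scanline: start from full coverage and AND
// in the half-plane mask of every edge.
std::array<std::uint32_t, 8> CoverageMask(const std::vector<EdgeCrossing>& edges) {
    std::array<std::uint32_t, 8> mask;
    mask.fill(~0u);
    for (const EdgeCrossing& e : edges) {
        for (int lane = 0; lane < 8; ++lane) {
            std::uint32_t m = RightOf(e.x[lane]);
            if (e.facesRight) m = ~m;
            mask[lane] &= m;
        }
    }
    return mask;
}
```

With a left edge at column 4 and a right-facing edge at column 12 on every scanline, each lane ends up as `0x0FF00000`, i.e. columns 4 to 11 covered. In a SIMD version the inner loop would be a single per-lane shift.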


Page 46: Masked Software Occlusion Culling

46

Shuffle Mask

Shuffle mask to form better shaped tiles

– Before: each SIMD lane is a scanline

Page 47: Masked Software Occlusion Culling

47

Shuffle Mask

Shuffle mask to form better shaped tiles

– Before: each SIMD lane is a scanline

– After: each SIMD lane is an 8x4 tile
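The rearrangement can be written bit by bit to make the before and after layouts explicit. The bit order (most significant bit first, row-major inside a tile) and the tile numbering are assumptions for this sketch, and the function name is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Input: 8 scanlines of 32 pixels each (one 32x8 supertile), bit 31 = column 0.
// Output: 8 tiles of 8x4 pixels, numbered left to right, then top to bottom.
// Inside a tile, pixel (px, py) is stored at bit 31 - (py * 8 + px).
std::array<std::uint32_t, 8> ScanlinesToTiles(const std::array<std::uint32_t, 8>& rows) {
    std::array<std::uint32_t, 8> tiles{};
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 32; ++x) {
            std::uint32_t bit = (rows[y] >> (31 - x)) & 1u;
            int tile = (y / 4) * 4 + x / 8;  // which 8x4 tile the pixel is in
            int pos = (y % 4) * 8 + x % 8;   // position inside that tile
            tiles[tile] |= bit << (31 - pos);
        }
    }
    return tiles;
}
```

Since every run of eight pixels in a scanline is exactly one byte, the whole operation is a byte permutation, which is why it maps well to a shuffle.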


Page 48: Masked Software Occlusion Culling

48

Depth Test

Interpolate conservative depth (per 8x4 tile)

Test against buffer

Page 49: Masked Software Occlusion Culling

49

Update Tile


Two code paths (can be switched at compile time)

– Original update method [AHAM15]

– New update method tailored for SW [HAAM16]

Why use a new update method?

– Faster – same culling power

– Less accurate than original, more dependent on render order

– Works best if you render front-to-back

Page 50: Masked Software Occlusion Culling

50

Update Tile, New Method [HAAM16]

zmax0 is the reference layer

– Maximum value for the entire tile

zmax1 is the working layer

– Maximum value for a subset of the tile

– Updated as

– New depth = max(zmax1, zmax_tri)

– New mask = TriangleMask OR LayerMask

Whenever working layer mask is full, overwrite reference layer

Page 51: Masked Software Occlusion Culling

51

Update Tile

Page 52: Masked Software Occlusion Culling

52

Update Tile

Page 53: Masked Software Occlusion Culling

53

Update Tile

Page 54: Masked Software Occlusion Culling

54

Update Tile

Page 55: Masked Software Occlusion Culling

55

Update Tile

Page 56: Masked Software Occlusion Culling

Update Tile

Discard heuristic: If zmax1 – zmax_tri > zmax0 – zmax1, discard working layer

Restart

Page 57: Masked Software Occlusion Culling

57

Update Tile

Page 58: Masked Software Occlusion Culling

58

Update Tile

Page 59: Masked Software Occlusion Culling

59

Update Tile

Full overwrite: Restart from new value
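The update walked through on the preceding slides can be condensed into a short sketch. It follows the steps as the slides describe them (discard heuristic, merge into the working layer, overwrite the reference layer when the mask is full), but the `Tile` layout, the function names and the depth convention (0 = near, 1 = far, buffer cleared to 1) are assumptions, and this is not the library's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Tile {
    float zMax0 = 1.0f;       // reference layer: bound for the entire tile
    float zMax1 = 0.0f;       // working layer: bound for the pixels in mask
    std::uint32_t mask = 0u;  // pixels covered by the working layer
};

// The depth test only needs the reference layer.
bool TriangleOccluded(const Tile& t, float triZMin) {
    return triZMin >= t.zMax0;
}

void UpdateTile(Tile& t, float triZMax, std::uint32_t triMask) {
    assert(triMask != 0u);
    // Discard heuristic: the triangle is farther from the working layer than
    // the working layer is from the reference layer, so restart the layer.
    if (t.zMax1 - triZMax > t.zMax0 - t.zMax1) {
        t.zMax1 = 0.0f;
        t.mask = 0u;
    }
    // Merge the triangle into the working layer.
    t.zMax1 = std::max(t.zMax1, triZMax);
    t.mask |= triMask;
    // Working layer covers the whole tile: it becomes the new reference.
    if (t.mask == ~0u) {
        t.zMax0 = t.zMax1;
        t.zMax1 = 0.0f;
        t.mask = 0u;
    }
}
```

Rendering two half-tile triangles with depths 0.5 and 0.6 into a cleared tile fills the working mask, so the reference layer drops from 1.0 to 0.6 and the working layer restarts empty.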

Page 60: Masked Software Occlusion Culling

60

Update Tile

Update is quicker than original [AHAM15]

Test is also quicker

– Need only to test against reference layer (zmax0)

Page 61: Masked Software Occlusion Culling
Page 62: Masked Software Occlusion Culling

62

Results

Intel Occlusion Culling Sample

Clear: Clearing the depth buffer

Geom: Transform & project geometry

Rast: Triangle setup & occluder rasterization

Gen: Compute HiZ buffer from full resolution z buffer

Test: Perform occlusion queries

[Figure: bar chart of time per phase (μs), Old [CMK16] vs. New [HAAM16]; speedup labels: 3.7x and 16x]

Page 63: Masked Software Occlusion Culling

63

Performance comparison for camera animation

Results

[Figure: plot over the camera animation, from first frame to last frame; series: Old, New, Frustum only]

Page 64: Masked Software Occlusion Culling

Code is available as open-source

Page 65: Masked Software Occlusion Culling

65

Masked Occlusion Culling API

void SetResolution();

void SetNearClipPlane();

void ClearBuffer();

static void TransformVertices();

Result RenderTriangles();

Result TestTriangles();

Result TestRect();

void ComputePixelDepthBuffer();

OcclusionCullingStatistics GetStatistics();

Setup

Debug

Render & query

Page 66: Masked Software Occlusion Culling

66

Masked Occlusion Culling API

Render to the software HiZ buffer

Result RenderTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (Indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast-path for AoS with
                           // (x, y, z, w) coordinates
);

Page 67: Masked Software Occlusion Culling

67

Masked Occlusion Culling API

Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);

Clipping is not free...

– If you’re already doing frustum culling, let the API know the outcome

[Figure: eye, view frustum and near plane, with examples labeled mask = 0 and mask = leftPlane | nearPlane]

Page 68: Masked Software Occlusion Culling

68

Masked Occlusion Culling API

Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);

Scissor region (screen space AABB)

Can be used for threading

– One scissor region per thread

[Figure: eye and view frustum with the scissor region marked]

Page 69: Masked Software Occlusion Culling

69

Masked Occlusion Culling API

Test triangles against the software HiZ buffer

– Does not update the buffer

Result TestTriangles(      // Returns the collective culling outcome of the triangles
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (Indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast-path for AoS with
                           // (x, y, z, w) coordinates
);

Page 70: Masked Software Occlusion Culling

70

Masked Occlusion Culling API

Test rectangle against the software HiZ buffer

– Does not update the buffer

Result TestRect(   // Returns the culling outcome of the screen space rectangle
    float xmin,    // Screen space bounds:
    float ymin,    // [xmin, ymin] – [xmax, ymax]
    float xmax,
    float ymax,
    float wmin     // Conservative clip space w (typically the w-component of the
                   // nearest bbox vertex in clip space)
);

Page 71: Masked Software Occlusion Culling

71

Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling

ClearBuffer();
prioQueue.push(root);
while (!prioQueue.empty()) {
    Node node = prioQueue.pop();
    if (FrustumTest(node) == Culled)
        continue;
    bounds = compute_screen_space_bounds(node);
    if (TestRect(bounds) == Culled)
        continue;
    if (node is InnerNode) {
        prioQueue.push(node.left, dist);
        prioQueue.push(node.right, dist);
    } else { // node is Leaf
        xfVertices = TransformVertices(node.vertices);
        RenderTriangles(xfVertices);
        send_leaf_to_GPU();
    }
}

[Figure labels: RenderFrame; Culled!]
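The traversal loop above can be made runnable if the culling calls are replaced by stand-ins. In the sketch below `Node`, `Hooks` and `TraverseFrontToBack` are hypothetical names; the three callbacks take the place of FrustumTest, TestRect and TransformVertices/RenderTriangles, and none of it is the library's real interface.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

struct Node {
    int id;
    float dist;         // distance from the camera to the node's bounds
    const Node* left;   // both children are null for a leaf
    const Node* right;
};

// Stand-ins for the frustum test, the occlusion query and occluder rendering.
struct Hooks {
    std::function<bool(const Node&)> frustumCulled;
    std::function<bool(const Node&)> occlusionCulled;
    std::function<void(const Node&)> renderOccluder;
};

// Visit nodes nearest first, so close geometry is rendered into the occlusion
// buffer before farther nodes are tested. Returns the ids of visible leaves.
std::vector<int> TraverseFrontToBack(const Node& root, const Hooks& hooks) {
    auto farther = [](const Node* a, const Node* b) { return a->dist > b->dist; };
    std::priority_queue<const Node*, std::vector<const Node*>, decltype(farther)>
        queue(farther);
    std::vector<int> visible;
    queue.push(&root);
    while (!queue.empty()) {
        const Node* node = queue.top();
        queue.pop();
        if (hooks.frustumCulled(*node)) continue;
        if (hooks.occlusionCulled(*node)) continue;
        if (node->left || node->right) {  // inner node: enqueue children
            if (node->left) queue.push(node->left);
            if (node->right) queue.push(node->right);
        } else {                          // leaf: becomes an occluder
            hooks.renderOccluder(*node);
            visible.push_back(node->id);  // here the leaf would go to the GPU
        }
    }
    return visible;
}
```

Ordering the queue by distance is what makes the scheme effective: a leaf that is popped early updates the buffer, and a whole subtree behind it can then be rejected with one rectangle test.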

Page 72: Masked Software Occlusion Culling

72

Essential Tools We Have Relied On

Intel® VTune™

– https://software.intel.com/en-us/intel-vtune-amplifier-xe

SSE/AVX intrinsics guide

– https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Page 73: Masked Software Occlusion Culling

73

References

[AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9

[CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016)

[Col11] COLLIN D.: Culling the Battlefield. Game Developers Conference (presentation), (2011)

[Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238

[HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015)

[HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016)

Page 74: Masked Software Occlusion Culling

74

Check it out!

GitHub: Lightweight library

– https://github.com/GameTechDev/MaskedOcclusionCulling

GitHub: Example integrated in Intel’s Software Occlusion Culling demo

– https://github.com/GameTechDev/OcclusionCulling

Project page: Masked Software Occlusion Culling

– https://software.intel.com/en-us/articles/masked-software-occlusion-culling

Questions and feedback welcome


Page 75: Masked Software Occlusion Culling

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2016 Intel Corporation. Intel, the Intel logo, VTune and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Page 76: Masked Software Occlusion Culling