
Real-time Mesh Simplification Using the GPU

Christopher DeCoro
Natasha Tatarchuk
3D Application Research Group

2

Introduction

• Implement Mesh Decimation in real-time
• Utilizes new Geometry Shader stage of GPU

• Achieves a 20x speedup over CPU

3

Project Motivation

• Massive increases in submitted geometry
• Geometry rendered per shadow map (6x for cubemap!)
• Not always needed at highest resolution

• Geometry not always known at build-time
• Dynamically-skinned objects only finalized at run-time
• May be customized to user's machine based on capabilities, would need to be adapted at program load time
• Could be dynamically generated per level, need to be adapted at level load time
• Simplification therefore needs to be fast (or even real-time)

Also, just as importantly…

• We want applications that exercise & stress GS/GPU
• Evaluate new capabilities of the GPU
• Learn how to adapt previously CPU-bound algorithms
• Develop GPU-centric methodologies

• Identify future feature set for GS/GPU as a whole
• Limitations still exist – which should be addressed?

4

Contributions

• Mapping of Decimation to GPU
• 20x speedup vs. CPU
• Enables load-time or real-time usage

• Detail Preservation by Non-linear Warping
• Also applicable to CPU out-of-core decimation

• General-purpose GPU Octree
• Adaptive decimation w/ constant memory
• Applications not limited to simplification: collision detection, frustum culling, etc.

5

Outline

• Project Introduction and Motivation
• Background

• Decimation with Vertex Clustering
• Geometry Shaders in Direct3D 10

• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU

6

Vertex Clustering

• Reduces mesh resolution
• High-res mesh as input
• Low-res as output

• All implemented on the GPU
• Ideal for processing streamed out data
• Useful when rendering multiple times (e.g. shadows)
• Can handle enormous models from scanned data

• Based on “Out-of-Core Simplification of Large Polygonal Models,” P. Lindstrom, 2000

Figure from [Lindstrom 2000]

7

Previous Rendering Pipeline

• Vertex Shaders and Pixel Shaders
• Limits 1 output per 1 input

• No culling of triangles for decimation
• Fixed destination for each stage

• Result meshes cannot be (easily) saved and reused

8

DirectX10 Rendering Pipeline

• Geometry Shader in between VS & PS
• Called for each primitive (usually triangle)

• Able to access all vertices of a primitive
• Can compute per-face quantities

• Breaks 1:1 input-output limitation
• Allows triangles to be culled from pipeline

• Allows stream-out of processed geometry
• Decimated meshes can easily be saved and reused

9

Outline

• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering

• Overview
• Quadric Generation
• Optimal Position Computation
• Final Clustering

• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU

10

Algorithm Overview

• Start with the input mesh
• Shown divided into clusters

• Pass 1: Compute the quadric map from mesh
• Use GS to compute quadric
• Accumulate in cluster map, an RT used as large array

• Pass 2: For each cluster, compute optimal position
• Solves a linear system given by quadrics

• Pass 3: Collapse each vertex to representative
• 9x9x9 grid shown

Model Courtesy of Stanford Graphics Lab

11

Vertex Clustering Pipeline

• Pass 1: Create Quadric Map
• Input: Original Mesh
• Computation:

• Determine plane equation, face quadrics for triangle
• Compute the cluster and address of each vertex
• Pack quadric into RT at appropriate address

• Output: Render Targets representing clusters with packed quadrics and average positions
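
The Pass 1 computation can be sketched on the CPU as follows. This is a minimal Python illustration, not the shader from the talk: `cluster_id`, `face_quadric` and `accumulate_quadrics` are hypothetical names, a dict stands in for the cluster-map render targets, and the grid bounds `lo`/`hi` and resolution `n` are assumed parameters.

```python
import math

def cluster_id(p, lo, hi, n):
    """Cell (ix, iy, iz) of point p in an n x n x n grid spanning [lo, hi]."""
    return tuple(
        min(n - 1, max(0, int((p[k] - lo[k]) / (hi[k] - lo[k]) * n)))
        for k in range(3)
    )

def face_quadric(p0, p1, p2):
    """Per-vertex share of a triangle's area-weighted plane quadric (A, b, c)."""
    e1 = [p1[k] - p0[k] for k in range(3)]
    e2 = [p2[k] - p0[k] for k in range(3)]
    nrm = [e1[1] * e2[2] - e1[2] * e2[1],
           e1[2] * e2[0] - e1[0] * e2[2],
           e1[0] * e2[1] - e1[1] * e2[0]]
    length = math.sqrt(sum(c * c for c in nrm))
    if length == 0.0:
        return None  # a degenerate triangle defines no plane
    w = length / 6.0  # one third of the triangle area (cross length / 2 / 3)
    nrm = [c / length for c in nrm]
    d = -sum(nrm[k] * p0[k] for k in range(3))  # plane offset: n.p + d = 0
    A = [[w * nrm[i] * nrm[j] for j in range(3)] for i in range(3)]
    b = [w * nrm[i] * d for i in range(3)]
    return A, b, w * d * d

def accumulate_quadrics(vertices, triangles, lo, hi, n):
    """Sum each face quadric into the cell of each of the face's vertices."""
    cells = {}
    for tri in triangles:
        q = face_quadric(*(vertices[i] for i in tri))
        if q is None:
            continue
        A, b, c = q
        for i in tri:
            cell = cells.setdefault(
                cluster_id(vertices[i], lo, hi, n),
                {"A": [[0.0] * 3 for _ in range(3)], "b": [0.0] * 3,
                 "c": 0.0, "pos": [0.0] * 3, "count": 0})
            for r in range(3):
                for s in range(3):
                    cell["A"][r][s] += A[r][s]
                cell["b"][r] += b[r]
                cell["pos"][r] += vertices[i][r]  # running sum for the average
            cell["c"] += c
            cell["count"] += 1
    return cells
```

The summation in `accumulate_quadrics` plays the role that additive accumulation into the render target plays on the GPU; each cell ends up with a summed quadric plus a position sum and count for the fallback average.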

12

Quadric Map Implementation


• Compute the quadric map from mesh
• Use GS to compute quadric
• Accumulate in cluster map, an RT used as large array

• For each cluster, compute optimal position

• Collapse each vertex to representative
• 9x9x9 grid shown

//Map a point to its location in the cluster map array
float2 writeAddr( float3 vPos )
{
    uint iX = clusterId(vPos) / iClusterMapSize.x;
    uint iY = clusterId(vPos) % iClusterMapSize.y;
    return expand( float2(iX,iY)/float(iClusterMapSize.x) ) + 1.0/iClusterMapSize.x;
}

[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout PointStream<FragmentData> stream )
{
    //For the current triangle, compute the area and normal
    float3 vNormal = cross( input[1].vWorldPos - input[0].vWorldPos, input[2].vWorldPos - input[0].vWorldPos );
    //length(vNormal)/2 is the triangle area; /6 gives each of the 3 vertices one third of it
    float fArea = length(vNormal)/6;
    vNormal = normalize(vNormal);

    //Then compute the distance of plane to the origin along the normal
    float fDist = -dot(vNormal, input[0].vWorldPos);

    //Compute the components of the face quadrics using the plane coefficients
    float3x3 qA = fArea*outer(vNormal, vNormal);
    float3 qb = fArea*vNormal*fDist;
    float qc = fArea*fDist*fDist;

    //Loop over each vertex in input triangle primitive
    for(int i=0; i<3; i++)
    {
        //Assign the output position in the quadric map
        FragmentData output;
        output.vPos = float4(writeAddr(input[i].vPos),0,1);

        //Write the quadric to be accumulated in the quadric map
        packQuadric( qA, qb, qc, output );
        stream.Append( output );
    }
}

13


• Pass 2: Find Optimal Positions
• Input: Cluster Map Render Targets, Full-screen Quad
• Computation:

• Determine if we can solve for optimal position
• If not, fall back to vertex average

• Output: Render Targets representing clusters with optimal position of representative vertex
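
As a rough CPU illustration of the Pass 2 computation, the sketch below solves the 3x3 system A x = -b with Cramer's rule and falls back to the vertex average when the determinant is too small. The function names and the argument layout (a summed quadric plus a position sum and count) are assumptions made for this example; the threshold value is the one used in the shader listing in this deck.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def optimal_position(A, b, pos_sum, count, threshold=1e-11):
    """Representative vertex for one cluster, or None if the cluster is empty."""
    if count == 0:
        return None
    d = det3(A)
    if d > threshold:
        # Minimizing x^T A x + 2 b^T x + c gives A x = -b (Cramer's rule).
        rhs = [-b[0], -b[1], -b[2]]
        x = []
        for col in range(3):
            m = [row[:] for row in A]
            for r in range(3):
                m[r][col] = rhs[r]
            x.append(det3(m) / d)
        return tuple(x)
    # Singular or ill-conditioned quadric: fall back to the vertex average.
    return tuple(s / count for s in pos_sum)
```

For example, quadrics from the three planes x=1, y=2 and z=3 with unit weights sum to A = I and b = (-1, -2, -3), and the solve returns their intersection (1, 2, 3). A cluster that only saw coplanar faces has a singular A and gets the average instead.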

14

Optimal Positions

• For each cell, need representative

• Naïve solution: Use averages
• Looks very blocky
• Does not consider the original faces, only vertices

• Implemented solution: Use quadrics
• Quadrics are a measure of surface
• We can solve for optimal position

Original Mesh

Simplified w/ Averages

Simplified w/ Quadrics

15

Optimal Positions Implementation





float3 optimalPosition(float2 vTexcoord)
{
    float3 vPos = float3(0,0,0);
    float4 dataWorld, dataA0, dataB, dataA1;

    //Read the vertex average from the cluster map
    dataWorld = tClusterMap0.SampleLevel( sClusterMap0, vTexcoord, 0 );
    int iCount = dataWorld.w;

    //Only compute optimal position if there are vertices in this cluster
    if( iCount != 0 )
    {
        //Read all the data from the clustermap to reconstruct the quadric
        dataA0 = tClusterMap1.SampleLevel( sClusterMap1, vTexcoord, 0 );
        dataA1 = tClusterMap2.SampleLevel( sClusterMap2, vTexcoord, 0 );
        dataB = tClusterMap3.SampleLevel( sClusterMap3, vTexcoord, 0 );

        //Then reassemble the quadric (qA is symmetric, so six values suffice)
        float3x3 qA = { dataA0.x, dataA0.y, dataA0.z,
                        dataA0.y, dataA0.w, dataA1.x,
                        dataA0.z, dataA1.x, dataA1.y };
        float3 qB = dataB.xyz;
        float qC = dataA1.z;

        //Determine if inverting A is stable, if so, compute optimal position
        //If not, default to using the average position
        //(inverse() is a user-supplied 3x3 helper; HLSL has no matrix-inverse intrinsic)
        const float SINGULAR_THRESHOLD = 1e-11;
        if( determinant(qA) > SINGULAR_THRESHOLD )
            vPos = -mul( inverse(qA), qB );
        else
            vPos = dataWorld.xyz / dataWorld.w;
    }
    return vPos;
}

16


• Pass 3: Decimate Mesh
• Input: Cluster Map Render Targets, Input Mesh
• Computation:

• Find clusters, Remap vertices to representative
• Determine if triangle becomes degenerate
• If not, stream output new triangle at new positions

• Output: Low-resolution Mesh
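
The Pass 3 logic is small enough to sketch directly. In this hypothetical Python version, `cell_of` is any function mapping a position to its cluster and `representative` is a dict holding the per-cluster positions from Pass 2; both names are assumptions for illustration, and the returned list stands in for the streamed-out vertex buffer.

```python
def decimate(vertices, triangles, cell_of, representative):
    """Collapse every vertex to its cluster representative and drop
    triangles whose corners no longer lie in three distinct clusters."""
    out = []
    for tri in triangles:
        cells = [cell_of(vertices[i]) for i in tri]
        if len(set(cells)) == 3:  # still a real triangle after collapsing
            out.append(tuple(representative[c] for c in cells))
    return out

def unit_cell(p):
    """Hypothetical cluster function: a grid of unit-sized cells."""
    return (int(p[0]), int(p[1]), int(p[2]))
```

A triangle with two or three corners in the same cell would collapse to an edge or a point, so it is culled; this is how the output triangle count drops without any explicit edge-collapse bookkeeping.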

17

Final Clustering Implementation





[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout TriangleStream<StreamoutVertex> stream )
{
    //Only emit a triangle if all three vertices are in different clusters
    if( all_different(clusterId(input[0].vPos),
                      clusterId(input[1].vPos),
                      clusterId(input[2].vPos)) )
    {
        for(int i=0; i<3; i++)
        {
            //Look up the optimal position of vertex i's cluster in the RT computed in Pass 2
            //(assumes StreamoutVertex has a float4 vPos member)
            StreamoutVertex output;
            output.vPos = tClusterMap3.SampleLevel( sClusterMap3, readAddr(input[i].vPos), 0 );

            //Output vertex to stream out
            stream.Append( output );
        }
    }
    return;
}

18


• Alternate Pass 2: Downsample RTs
• Input and Output as before
• Computation:

• Collapse 8 adjacent cells by adding cluster quadrics
• Compute optimal position for 2x larger cell

• Create multiple lower levels of detail without repeatedly incurring Pass 1 overhead (~75%)
• Pass 3 can use previous streamed-out mesh
• Lower levels of detail almost free
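
Because quadrics are additive, the downsampling step amounts to summing the eight child cells of each 2x larger cell. The sketch below assumes, purely for illustration, that the cluster map is a dict from integer cell coordinates to a flat list of accumulated values (quadric coefficients, position sum, count); `downsample` is a hypothetical name.

```python
def downsample(cluster_map):
    """Merge each 2x2x2 block of cells by summing their accumulated values."""
    coarse = {}
    for (ix, iy, iz), values in cluster_map.items():
        parent = (ix // 2, iy // 2, iz // 2)
        acc = coarse.setdefault(parent, [0.0] * len(values))
        for k, value in enumerate(values):
            acc[k] += value
    return coarse
```

The coarse map can then be fed to the same optimal-position solve as before, so each further level costs only a pass over the already accumulated cells rather than over the mesh.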

19

Timing Results

• Recorded Time Spent in Decimation
• GPU: AMD/ATI XXX
• CPU: 3 GHz Intel P4

• Significant Improvement over CPU
• Averages ~20x speedup on large models
• Scales linearly

20

More Results

• Models shown at varying resolutions

Models Courtesy of Stanford Graphics Lab

Buddha, 45x130x45 grid

Bunny, 90x90x90 grid

Dragon, 100x60x20 grid

21

More Results




22

More Results




23

Outline

• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps

• View-dependent Simplification
• Region-of-interest Simplification

• Probabilistic Octrees on the GPU

24

View-dependent Simplification

• Standard simplification does not consider view
• Preserves uniform amount of detail all over

• Simplify in post-projection space to use view
• Preserves more detail closer to viewer (left)

View Direction
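
One way to picture clustering in post-projection space: project each vertex, apply the perspective divide, and only then quantize to a uniform grid. The sketch below is an assumption-laden illustration: `perspective` builds a simplified D3D-style matrix with unit focal length (column-vector convention), and all function names are hypothetical.

```python
def perspective(near, far):
    """Simplified projection: x, y unchanged, depth mapped to [0, 1], w = z."""
    a = far / (far - near)
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, a, -near * a],
            [0.0, 0.0, 1.0, 0.0]]

def project(m, p):
    """Transform point p by the 4x4 matrix m and apply the perspective divide."""
    v = [m[r][0] * p[0] + m[r][1] * p[1] + m[r][2] * p[2] + m[r][3]
         for r in range(4)]
    return (v[0] / v[3], v[1] / v[3], v[2] / v[3])

def post_projection_cell(m, p, n):
    """Cell of p in an n^3 grid over the post-projection volume."""
    x, y, z = project(m, p)
    unit = ((x + 1.0) / 2.0, (y + 1.0) / 2.0, z)  # x, y: [-1,1] -> [0,1]
    return tuple(min(n - 1, max(0, int(t * n))) for t in unit)
```

With `near=1`, `far=100` and an 8x8x8 grid, two vertices 0.6 units apart land in different cells at depth 2 but share a cell at depth 50, which is the "more detail closer to the viewer" behaviour described above.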

25

Arbitrary Warping Functions

• View Transform special case of nonlinear warp
• Can use arbitrary warp for adaptive simplification

• Regular grids allow data-independence, parallelism
• Constant time mapping from position to grid cell
• Maps well onto GPU render targets
• Forces uniform resolution throughout output mesh

• Irregular geometry grids allow non-uniform output
• Cells can be larger/smaller in certain regions
• Corresponds to lower/greater output triangle density
• We lose constant-time mapping of position to cell

• Solution: apply inverse warp to vertices
• Equivalent to applying forward warp to grid cells
• Clustering still performed in uniform grid
• Flexibility of irregular geometry w/ speed of regular
• One proposal: Gaussian weighting functions

26

Region-of-Interest Specification

• Importance specified w/ biased Gaussian

• Highest preservation at mean
• Width of region given by sigma
• Bias prevents falloff to zero

• Integrate to produce corresponding warp function

(Derivation given in paper)
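
The exact formula is not reproduced on the slide, so the following is only a plausible one-dimensional sketch. It assumes an importance function of the form bias + (1 - bias) * exp(-(x - mean)^2 / (2 sigma^2)) on [0, 1] and integrates it in closed form with the error function, normalizing so that the warp maps [0, 1] onto [0, 1]. The parametrization and the names `roi_warp` and `warped_cell` are assumptions, not the paper's definitions.

```python
import math

def roi_warp(x, mean, sigma, bias):
    """Normalized integral of a biased Gaussian importance function on [0, 1]."""
    def integral(t):
        gauss = (sigma * math.sqrt(math.pi / 2.0)
                 * math.erf((t - mean) / (sigma * math.sqrt(2.0))))
        return bias * t + (1.0 - bias) * gauss
    return (integral(x) - integral(0.0)) / (integral(1.0) - integral(0.0))

def warped_cell(x, n, mean, sigma, bias):
    """Uniform-grid cell index of x after warping (one axis only)."""
    return min(n - 1, int(roi_warp(x, mean, sigma, bias) * n))
```

With `mean=0.5`, `sigma=0.1`, `bias=0.2` and `n=10`, the interval [0.45, 0.55] spans more cells after warping than the single cell boundary it crosses in the unwarped grid, so the region of interest receives more clusters while the grid itself stays uniform. The nonzero bias keeps the warp strictly increasing, so regions far from the mean still receive some cells.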

27

Region-of-Interest Specification

• Warping allows non-uniform/adaptive level of detail

• Head has most semantic importance

• Detail lost in uniform simplification

• We can warp first to expand center

• Equivalent to grid density increasing

• Adaptive simplification preserves head detail

28

Outline

• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU

• Motivation
• Probabilistic Storage
• Adaptive Simplification
• Randomized Construction
• Results

29

Octrees - Motivation

• Basic grid
• Regular geometry, regular topology
• Limitations as we discussed

• Warped grid
• Irregular geometry, regular topology
• Much improved; however, we can do better
• May be difficult to know required detail a priori

• CPU Solution: Multi-resolution grid (i.e. octree)
• Irregular topology (irregular geometry w/ warping)
• Store grid at many levels of detail
• Measure error at each level, use coarse as possible
• Efficiency requires dynamic memory, storage O(L^3)
• Requires O(L) writes to produce correct tree

30

GPU Solution – Probabilistic Octrees

• Proposal
• Successful storage not guaranteed, w/ Prob. <= 1
• However, storage failure detected on read

• Assumptions allow much flexibility
• We can have unlimited depth tree (but lim P=0)
• Sparse storage of data

• Require conservative algorithms for task
• Vertex clustering (conveniently!) is such an example
• So is collision detection and frustum culling

• Only studied in brief in this paper, we would like to analyze more for future work

31

Implementation Details

• Storage: Spatial Hashes
• Map (position,level) to cell, cell hashed to index
• Additive blending for quadric accumulation (app-specific)
• Max blending to store (key,-key) with data (i.e. min_key,max_key)

• Retrieval:
• Again map (position, level) to index
• Retrieve key value from data, collision iff min_key != max_key
• Use parent level, which will have higher storage probability

• Usage for Adaptive Simplification
• For each vertex, find maximum error level below some threshold
• Use this as the representative vertex
• Can perform binary search along path
• Conservative, because we can maintain validity even when using parent of optimal node (just adds some error)
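
The storage and retrieval rules above can be mimicked on the CPU with a small hash table. This is a loose sketch under several assumptions: the key packing, the modulo hash, the class name `ProbabilisticOctree`, and the choice to write every level on insert are all illustrative and are not taken from the talk. Python's `max` stands in for max blending and `+=` for additive blending.

```python
class ProbabilisticOctree:
    """Fixed-size spatial hash whose slots detect collisions on read."""

    PACK = 1024  # assumed upper bound on cells per axis when packing keys

    def __init__(self, size, max_level):
        self.size = size
        self.max_level = max_level
        self.data = [0.0] * size
        self.max_key = [float("-inf")] * size      # max over stored keys
        self.max_neg_key = [float("-inf")] * size  # max over -key, i.e. -min_key

    def _key(self, position, level):
        # Positions are assumed to lie in [0, 1)^3.
        ix, iy, iz = (int(c * (1 << level)) for c in position)
        return ((level * self.PACK + ix) * self.PACK + iy) * self.PACK + iz

    def insert(self, position, value):
        # Simplification: write all levels rather than a randomized subset.
        for level in range(self.max_level + 1):
            key = self._key(position, level)
            slot = key % self.size
            self.data[slot] += value                               # additive
            self.max_key[slot] = max(self.max_key[slot], key)      # max blend
            self.max_neg_key[slot] = max(self.max_neg_key[slot], -key)

    def lookup(self, position, level):
        """Return (level, data) from the finest collision-free ancestor."""
        while level >= 0:
            key = self._key(position, level)
            slot = key % self.size
            min_key = -self.max_neg_key[slot]
            if min_key == self.max_key[slot] == key:
                return level, self.data[slot]
            level -= 1  # collision or empty slot: fall back to the parent
        return None
```

When two different cells hash to the same slot, the stored minimum and maximum keys differ, so the read detects the collision and retries one level up. The answer is then coarser than requested but still valid, which is why only conservative algorithms suit this structure.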

33

Probabilistic Octree Results

• Adaptive simplification shown on bunny (~4K tris)
• Preserves detail around leg, eyes and ears
• Simplifies significantly on large, flat regions

• Using 8% of storage of total tree, we have < 10% collisions
• Only ~20% performance hit vs. standard grids

34

Conclusions

• GS is a powerful tool for interactive graphics

• Amplification and decimation are important applications of GS

35

Geometry Shaders and Other Feature Wish-List

• Bring back the Point fill mode
• Important for scatter in GPGPU applications

• Data amplification improvements with indexed stream out
• Avoiding triangle soups very non-trivial

• Efficient indexable temps

36

Thanks a lot!

• Various people here…

37

Questions?
