PARALLEL TESSELLATION USING COMPUTE SHADERS
Team Members: David Sierra, Erwin Holzhauser, Matt Faller
Project Sponsors: Mangesh Nijasure, Todd Martin, Saad Arrabi
THE NEED FOR TESSELLATION
• Problem: more polygons!
• Meshes have gone from this…
• …to this
• High-poly scenes = high calculation cost
• Better to spend the scene's polygon "budget" only where needed
• This is called level of detail (LOD)
• Tessellation is a vital LOD technique
• Other LOD techniques are not appealing to artists
Credit: Utah Teapot by Martin Newell
Credit: Crytek
TESSELLATION OVERVIEW
• Tessellation subdivides geometry to increase detail
1. A tessellation control shader (TCS) decides which vertices need additional detail
2. The tessellator subdivides a selected primitive
3. A tessellation evaluation shader (TES) applies the subdivided primitive across the selected vertices
• This selected group of vertices is called a patch
TESSELLATION OVERVIEW
• DirectX 11 Tessellation Pipeline: Hull Shader (TCS) → Tessellator → Domain Shader (TES)
• For each patch, a Hull Shader outputs (see the sketch below):
  • Tessellation factors
  • Control points
• The Tessellator outputs a subdivided primitive based on these factors
  • The output is a series of "domain points" that make up a subdivided primitive
• The Domain Shader produces the final result
  • Using the control points…
  • …it applies math to each domain point according to some high-order surface
[Diagram: B-spline algorithm. Credit: Wikipedia]
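For concreteness, here is a minimal sketch of the hull-shader stage described above, written in HLSL for a triangle patch. The structure names, the pass-through control points, and the hard-coded factors (borrowed from the example values used later in these slides) are illustrative, not the project's actual shader.

    // Minimal DX11 hull shader for a triangle patch (illustrative only).
    struct ControlPoint
    {
        float3 pos : POSITION;
    };

    struct PatchConstants
    {
        float edges[3] : SV_TessFactor;        // one outer factor per triangle edge
        float inside   : SV_InsideTessFactor;  // inner subdivision factor
    };

    // Patch-constant function: outputs the tessellation factors that the
    // fixed-function tessellator consumes.
    PatchConstants ConstantsHS(InputPatch<ControlPoint, 3> patch)
    {
        PatchConstants pc;
        pc.edges[0] = 2.0f;  // 1st outer factor
        pc.edges[1] = 5.0f;  // 2nd outer factor
        pc.edges[2] = 1.0f;  // 3rd outer factor
        pc.inside   = 4.0f;  // inner factor
        return pc;
    }

    [domain("tri")]
    [partitioning("integer")]
    [outputtopology("triangle_cw")]
    [patchconstantfunc("ConstantsHS")]
    [outputcontrolpoints(3)]
    ControlPoint MainHS(InputPatch<ControlPoint, 3> patch,
                        uint i : SV_OutputControlPointID)
    {
        return patch[i];  // pass each control point through unchanged
    }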
TESSELLATION PROBLEMS
• Look at where each stage of the pipeline is running:
  • Hull Shader (TCS): runs on general-purpose Compute Units (CUs) – massive throughput
  • Tessellator: runs on fixed-function hardware – limited throughput
  • Domain Shader (TES): runs on general-purpose CUs – massive throughput
• The general-purpose hardware provides incredible throughput: 1 TFLOPS or more
• The tessellator has limited throughput, handling only a few patches at a time
• Including multiple tessellators on a GPU might mitigate the problem…
• …but a scalable compute implementation could be superior
GPU ARCHITECTURE
• Wavefront – 16 ALUs
• Compute Unit – 4 wavefronts
• A GPU can have up to 44 compute units!
• 2,816 ALUs (16 × 4 × 44) for those keeping track
DIRECTCOMPUTE
• Three steps (see the kernel sketch below):
  • Configure
    • Tell the GPU how to store information
  • Dispatch
    • Tell the GPU to start doing work
  • Retrieve
    • Copy the results of the calculations back to main RAM
[Diagram: the CPU configures buffers and dispatches work to the GPU; the GPU's results are copied back to the CPU]
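As a concrete (if illustrative) picture of what gets dispatched, here is a minimal HLSL compute kernel matching the three steps: the host configures gOutput as an unordered-access buffer, issues a Dispatch, then copies the buffer back to CPU memory. The buffer name and the per-thread math are assumptions for this sketch.

    // Minimal DirectCompute kernel (illustrative).
    RWStructuredBuffer<float> gOutput : register(u0);

    [numthreads(64, 1, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID)
    {
        // Each thread writes one element; the CPU retrieves these afterwards.
        gOutput[dtid.x] = 2.0f * dtid.x;
    }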
DIRECTCOMPUTE PROS & CONS
PROS
• Uses familiar C-like syntax
• Utilizes intrinsic hardware features (see the sketch below)
  • float2, float3, float4, …
    • N-wide primitive data types
    • Operations applied simultaneously to all members
  • mad(a, b, c)
    • Computes a*b + c in one cycle!
CONS
• Branching is inefficient
  • Both paths are almost always taken
• Copying calculation results back to CPU memory costs significant processor time
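The pros above can be made concrete with a short HLSL sketch; everything used here (float4, mad) is a standard HLSL intrinsic, while the kernel and buffer names are illustrative.

    // N-wide types and the mad() intrinsic (illustrative kernel).
    RWStructuredBuffer<float4> gData : register(u0);

    [numthreads(64, 1, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID)
    {
        float4 a = gData[dtid.x];
        float4 b = float4(2.0f, 2.0f, 2.0f, 2.0f);  // operations hit all 4 lanes at once
        float4 c = float4(1.0f, 1.0f, 1.0f, 1.0f);
        gData[dtid.x] = mad(a, b, c);               // a*b + c as one fused operation
    }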
DIRECTCOMPUTE ESSENTIALS
• Groups
  • Split up work at a high level
  • e.g., one group for each texture or model
• Dispatch Groups
  • Split up individual groups of work
• In each group, a thread does work on a small group of data elements (see the sketch below)
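A hedged sketch of how a thread finds its data from these IDs; the system-value semantics are standard HLSL, while the 64-element slicing and buffer name are assumptions for illustration.

    RWStructuredBuffer<float> gElements : register(u0);

    [numthreads(64, 1, 1)]
    void CSMain(uint3 gid  : SV_GroupID,          // which group (e.g., which model/texture)
                uint3 gtid : SV_GroupThreadID,    // which thread within the group
                uint3 dtid : SV_DispatchThreadID) // global index = gid.x * 64 + gtid.x
    {
        // Each thread works on its own element and writes nowhere else.
        gElements[dtid.x] = (float)gid.x;
    }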
PROJECT GOALS
1. Parallel compute shader implementation
2. Output matches the reference tessellator
3. Faster than the CPU implementation
4. Better than fixed-function hardware
ISOLINE TESSELLATION
• Relatively easy to parallelize
• Initial implementation
  • One thread per point
  • The compiler ended up assigning one whole compute unit per point
  • Terrible performance: only 1/64 of the threads per compute unit were used
ISOLINE TESSELLATION
• 2nd-generation implementation
  • Each thread computes an n×n grid of points
  • The compiler now splits threads evenly across compute units
  • Tested 8x8, 4x4, and 2x2 grids
  • 2x2 was by far the fastest
ISOLINE TESSELLATION
• 3rd-generation implementation
  • Each thread computes one point
  • Threads are launched 64 at a time (as 8x8 or 64x1 groups) to minimize the resources used
  • Very fast (see the sketch below)
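A minimal sketch of this 3rd-generation shape: one point per thread, launched in groups of 64 (64x1 here). The constant-buffer fields and the point math are placeholders; the real shader evaluates isoline positions from the tessellation factors.

    struct IsoPoint { float2 pos; };
    RWStructuredBuffer<IsoPoint> gPoints : register(u0);

    cbuffer Params : register(b0)
    {
        uint  gNumPoints;  // total points implied by the tess factors
        float gStepU;      // parametric step between neighboring points
    };

    [numthreads(64, 1, 1)]
    void IsolineCS(uint3 dtid : SV_DispatchThreadID)
    {
        if (dtid.x >= gNumPoints)
            return;  // guard the partial last group of 64

        IsoPoint p;
        p.pos = float2(dtid.x * gStepU, 0.0f);  // one point per thread
        gPoints[dtid.x] = p;
    }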
TRI TESSELLATION
• Inputs: 4 tessellation factors, partitioning mode
[Diagram: a Point Generation shader writes the vertex buffer and a Point Connectivity shader writes the index buffer; example factors: 1st outer = 2, 2nd outer = 5, 3rd outer = 1, inner = 4]
QUAD TESSELLATION
• Inputs: 6 tessellation factors, partitioning mode
[Diagram: the compute implementation writes the vertex buffer and the index buffer; example factors: 1st outer = 3, 2nd outer = 1, 3rd outer = 1, 4th outer = 1, inner = 3, inner = 4]
QUAD & TRI TESSELLATION: HIGH LEVEL DESIGN
• Two designs to implement
Design #1
• Do the following for each patch:
  1. The CPU loads the patch input data onto the GPU (RWStructuredBuffer)
  2. Dispatch a Context Processing pass that computes TF_Context
  3. Dispatch a Point Generation group and a Point Connectivity group, which write the vertex buffer and the index buffer
QUAD & TRI TESSELLATION: HIGH LEVEL DESIGN
Design #2
• The CPU loads N input patches onto the GPU at once (Input Data[N]) and dispatches N thread groups
• Point Generation Shader: each group computes TF_Context into groupshared memory, group-syncs, then computes the point locations into the vertex buffer
• Point Connectivity Shader: each group computes TF_Context into groupshared memory, group-syncs, then computes the connectivity into the index buffer (see the sketch below)
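A minimal sketch of Design #2's shape in HLSL: one thread group per patch, TF_Context in groupshared memory, and a group sync before the real work. TF_Context's fields and the per-point math are placeholders.

    struct PatchInput { float4 tessFactors; };
    struct TFContext  { uint numPoints; };

    StructuredBuffer<PatchInput> gPatches  : register(t0);
    RWStructuredBuffer<float2>   gVertices : register(u0);

    groupshared TFContext sContext;  // one context per group (one group per patch)

    [numthreads(64, 1, 1)]
    void PointGenCS(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID)
    {
        // One thread derives the patch's context for the whole group.
        if (gtid.x == 0)
            sContext.numPoints = (uint)gPatches[gid.x].tessFactors.x + 1;

        GroupMemoryBarrierWithGroupSync();  // the "Group Sync" step in the diagram

        if (gtid.x < sContext.numPoints)
        {
            // Compute this thread's point location (stand-in math).
            gVertices[gid.x * 64 + gtid.x] = float2(gtid.x, 0.0f);
        }
    }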
TRI TESSELLATION: LOW LEVEL DETAILS
[Diagram: the 14 points of an example tessellated triangle, labeled P0–P13 in a spiral pattern]
• The shader must output each point in an exact order
• The order follows a spiral pattern
• This regular pattern allows connectivity and generation to be done in parallel
• The point generation shader computes a point per thread, based on its global thread ID
  • There is overhead for each thread to figure out its contextual information (edge, offset within edge); see the sketch below
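A sketch of that per-thread overhead for a single ring, assuming (for illustration only) that every edge contributes the same number of points; the real factors vary per edge, which is what makes this bookkeeping costly.

    cbuffer RingParams : register(b0)
    {
        uint gPointsPerEdge;  // points contributed by each of the ring's 3 edges
    };

    // Map a linear spiral index within one ring to its edge and its
    // offset along that edge.
    void DecodeRingIndex(uint linearIndex, out uint edge, out uint offset)
    {
        edge   = linearIndex / gPointsPerEdge;  // which of the 3 edges (0..2)
        offset = linearIndex % gPointsPerEdge;  // position along that edge
    }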
QUAD TESSELLATION: LOW LEVEL DETAILS
[Diagram: the 26 points of an example tessellated quad, labeled P0–P25 in a spiral pattern]
• The points are generated in a spiral pattern
• This regular pattern allows connectivity to be done in parallel
• Implemented in the Microsoft reference using nested for-loops:
  • for each ring
    • for each edge
      • for each point (on the edge)
• This for-loop structure makes indexing threads tricky
QUAD TESSELLATION: LOW LEVEL DETAILS
• The connectivity follows the same spiral pattern and assumes each point has the correct value
• Triangles are created by connecting three points (see the sketch below)
[Diagram: Triangle 1 = (0, 1, 6), Triangle 2 = (6, 1, 7), Triangle 3 = (7, 1, 8), Triangle 4 = (1, 8, 2)]
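A sketch of the connectivity idea: each thread emits one triangle as three indices into the spiral-ordered point list. This shows a regular strip stitch between an outer ring and an inner ring with matching point counts; the reference pattern generalizes it when the counts differ (which is why the diagram fans several triangles around point 1). The base offsets are illustrative.

    RWStructuredBuffer<uint> gIndices : register(u0);

    cbuffer StitchParams : register(b0)
    {
        uint gOuterBase;  // index of the ring's first outer point (0 in the diagram)
        uint gInnerBase;  // index of the ring's first inner point (6 in the diagram)
    };

    [numthreads(64, 1, 1)]
    void ConnectivityCS(uint3 dtid : SV_DispatchThreadID)
    {
        uint t = dtid.x;  // triangle number along the edge
        uint k = t / 2;   // segment number along the edge

        // Alternate between an outer-anchored and an inner-anchored triangle:
        // t=0 -> (0, 1, 6) and t=1 -> (6, 1, 7), matching the diagram.
        uint3 tri = (t & 1)
            ? uint3(gInnerBase + k, gOuterBase + k + 1, gInnerBase + k + 1)
            : uint3(gOuterBase + k, gOuterBase + k + 1, gInnerBase + k);

        gIndices[3 * t + 0] = tri.x;
        gIndices[3 * t + 1] = tri.y;
        gIndices[3 * t + 2] = tri.z;
    }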
QUAD & TRI TESSELLATION: LOW LEVEL DETAILS
• Ideal: have meaningful work for each thread in a group
  • The number of threads per thread group is a multiple of 64
  • Each thread in a group must access the appropriate data and not write to any other thread's location
• To calculate the correct results we must find, for any given buffer location:
  • The current ring number
  • The current edge
  • The correct offset based on that edge
• One problem: calculating this information introduces divergent flow control for each thread
• To counter this, each group of threads is instead made responsible for placing the points and connections of one edge, although this sacrifices cache performance (see the sketch below)
[Diagram: one group's thread IDs map to output buffer locations 0–64, the next group's to 64–128, and so on]
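A sketch of the edge-per-group layout: SV_GroupID selects the edge, so every thread in a group takes the same code path (no divergent flow control), and each group writes its own 64-wide slice of the output, as in the diagram. The buffer names and per-point math are illustrative.

    RWStructuredBuffer<float2> gOutput : register(u0);

    cbuffer EdgeParams : register(b0)
    {
        uint gPointsPerEdge;  // points each edge contributes (<= 64 here)
    };

    [numthreads(64, 1, 1)]
    void EdgeCS(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID)
    {
        uint edge   = gid.x;   // the whole group works on one edge: uniform control flow
        uint offset = gtid.x;  // this thread's position along that edge

        if (offset < gPointsPerEdge)
        {
            // Each group writes only its own 64-element slice.
            gOutput[edge * 64 + offset] = float2(edge, offset);
        }
    }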
WORK DISTRIBUTION
• Quad Tessellation – Matthew
• Triangle Tessellation – Erwin
• Isoline Tessellation – David
• Additional tools:
  • DXQuery (David) – simplifies the creation of DX11 queries, which are used to collect accurate performance data
  • Testing environment using the Google Test API (David)
  • Library to simplify writing to and reading from buffers (Matt)
BUDGET
• Radeon R9 290X graphics card – the cheapest runs from $360 on Newegg; purchased with an AMD donation
• Practical Rendering and Computation with Direct3D 11 – runs ~$50 on Amazon
• Bitbucket – free for teams under 5 users
  • Private code repository
• Visual Studio Professional – free through UCF's DreamSpark membership
Credit: Newegg.com
TESTING
• The output of the compute shader must match the reference exactly
• We test for accuracy using Google's Test API
  • Loop through every possible tessellation factor and mode
  • Check the output bit-for-bit
• The test API allows for a small margin of error
  • IEEE floats vs. fixed-point decimals
  • In most cases our output differs because of a higher degree of accuracy
[Diagram: the same tess factors and partitioning mode feed both the reference implementation and the shader implementation; Google Test compares the resulting vertex and index buffers and reports pass/fail]
EXPERIMENTAL RESULTS (ISOLINE I)
[Chart: "Isoline Test Results" – time (ms) vs. grid size (1x1, 8x8, 4x4, 2x2, 4x1, 8x8 bound, 64x1 bound) for the HD 8490, R9 290X, and Intel integrated graphics; times range from 0 to 400 ms]
EXPERIMENTAL RESULTS (ISOLINE II)
[Chart: "Isoline Test Results" – time (ms) vs. grid size (16x16, 32x16, 64x15, 32x32, 48x48, 64x64) for CPU vs. GPU; times range from 0 to 0.025 ms]
EXPERIMENTAL RESULTS
[Chart: "(Quads) CPU vs GPU" – time (ms) vs. tessellation factors (16x16x16x16x16x16, 32x32x32x32x32x32, 48x48x48x48x48x48, 64x64x64x64x64x64, 32x16x8x6x4x2, 64x15x23x14x13x46); times range from 0 to 2.5 ms]
EXPERIMENTAL RESULTS
[Chart: "(Quads) Naïve vs Parallel Implementations" – time (ms) vs. tess factors (16xALL, 32xALL, 48xALL, 64xALL) for the parallel vs. naïve implementations; times range from 0 to 4.5 ms]
EXPERIMENTAL RESULTS (TRI)
[Chart: "Tri Tessellation CPU vs GPU Implementation in Integer Partitioning Mode" – time (ms) vs. tess factors (32x16x8x6, 16x16x16x16, 64x15x23x14, 32x32x32x32, 48x48x48x48, 64x64x64x64); times range from 0 to 1.4 ms]