GPU Computation Strategies & Tricks, Ian Buck, NVIDIA


Upload: leonard-jennings

Post on 04-Jan-2016


TRANSCRIPT

Page 1: GPU Computation Strategies & Tricks Ian Buck NVIDIA

GPU Computation Strategies & Tricks

Ian Buck

NVIDIA

Page 2: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Recent Trends

Page 3: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Compute is Cheap

• parallelism
  • to keep 100s of ALUs per chip busy

• shading is highly parallel
  • millions of fragments per frame

[Figure: 90nm chip (12mm) with a 64-bit FPU (0.5mm) drawn to scale. Courtesy of Bill Dally.]

Page 4: GPU Computation Strategies & Tricks Ian Buck NVIDIA

...but Bandwidth is Expensive

• latency tolerance
  • to cover 500 cycle remote memory access time

• locality
  • to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth

[Figure: 90nm chip. Labels: 12mm, 0.5mm, 1 clock. Courtesy of Bill Dally.]

Page 5: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Optimizing for GPUs

• shading is compute intensive
  • 100s of floating point operations
  • output 1 32-bit color value

• compute to bandwidth ratio
  • arithmetic intensity

[Figure: 90nm chip. Labels: 12mm, 0.5mm, 1 clock. Courtesy of Bill Dally.]

Page 6: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Compute vs. Bandwidth

[Chart: GFLOPS and GFloats/sec for R300, R360, R420]

Page 7: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Arithmetic Intensity

[Chart: GFLOPS and GFloats/sec for R300, R360, R420, annotated "7x Gap"]

Page 8: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Arithmetic Intensity

GPU wins when…
• Arithmetic intensity

[Chart: GeForce 7800 GTX vs. Pentium 4 3.0 GHz. Labels: Segment, 3.7 ops per word, 11 GFLOPS]

Page 9: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Arithmetic Intensity

• Overlapping computation with communication

Page 10: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Memory Bandwidth

GPU wins when…
• Streaming memory bandwidth

[Chart: SAXPY and FFT, GeForce 7800 GTX vs. Pentium 4 3.0 GHz]

Page 11: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Memory Bandwidth

• Streaming Memory System
  • Optimized for sequential performance

• GPU cache is limited
  • Optimized for texture filtering
  • Read-only
  • Small

• Local storage
  • CPU >> GPU

[Chart: GeForce 7800 GTX vs. Pentium 4]

Page 12: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Computational Intensity

• Considering GPU transfer costs: $T_r$

[Diagram labels: GPU, Memory, Memory, CPU]

Page 13: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Computational Intensity

• Considering GPU transfer costs: $T_r$

• Computational intensity: $K_{gpu} / T_r$ (work per word transferred)

• to outperform the CPU: $\frac{K_{gpu}}{T_r} > \frac{1}{s - 1}$, with speedup $s = K_{cpu} / K_{gpu}$
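The break-even condition on this slide can be checked numerically. The sketch below assumes that the total GPU time is the kernel time plus the transfer time with no overlap between the two, which is an assumption of this example and is not stated on the slide. The function names `gpu_beats_cpu` and `intensity_condition` are hypothetical.

```python
def gpu_beats_cpu(k_cpu, k_gpu, t_r):
    """Direct test: GPU kernel time plus transfer time is below CPU time."""
    return k_gpu + t_r < k_cpu


def intensity_condition(k_cpu, k_gpu, t_r):
    """Same test written as K_gpu / T_r > 1 / (s - 1), with s = K_cpu / K_gpu."""
    s = k_cpu / k_gpu
    if s <= 1.0:
        # A kernel that is no faster on the GPU can never pay back the transfer.
        return False
    return k_gpu / t_r > 1.0 / (s - 1.0)
```

Both functions agree because K_gpu + T_r < K_cpu rearranges to T_r < K_gpu (s - 1). The second form makes the point of the slide visible: the smaller the speedup, the more work per transferred word is needed.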

Page 14: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Kernel Overhead

• Considering CPU cost of issuing a kernel
  • Generating compute geometry
  • Graphics driver

[Chart regions: CPU limited, GPU limited]

Page 15: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Floating Point Precision

• NVIDIA FP32
  • s23e8

• ATI 24-bit float
  • s16e7

• NVIDIA FP16
  • s10e5

[Diagram: bit layout with fields s (sign), exponent, mantissa]

sign * 1.mantissa * 2^(exponent+bias)

Page 16: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Floating Point Precision

• Common Bug
  • Pack large 1D array in 2D texture
  • Compute 1D address in shader
  • Convert 1D address into 2D
  • FP precision will leave unaddressable texels!

Largest Counting Number
  NVIDIA FP32: 16,777,217
  ATI 24-bit float: 131,073
  NVIDIA FP16: 2,049
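The three limits listed on this slide equal 2^(m+1) + 1 for m = 23, 16 and 10 stored mantissa bits, which is the first integer such a format cannot hold. The sketch below reproduces them and shows the addressing bug for FP32 only, emulating single precision with Python's `struct` module. The helper names are hypothetical, and the rounding behavior of the actual ATI 24-bit and FP16 hardware is not modeled.

```python
import struct


def largest_counting_number(mantissa_bits):
    """First positive integer a float with this many stored mantissa bits cannot hold."""
    return 2 ** (mantissa_bits + 1) + 1


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def address_1d_to_2d_fp32(index, width):
    """Split a 1D index into (x, y) after the index has passed through FP32."""
    i = int(to_fp32(float(index)))
    return i % width, i // width
```

With a texture 4096 texels wide, indices 16,777,216 and 16,777,217 collapse to the same (x, y) pair, so the second texel can never be addressed.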

Page 17: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Scatter Techniques

• Problem: a[i] = p
  • Indirect write
  • Can’t set the x,y of fragment in pixel shader
  • Often want to do: a[i] += p

Page 18: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Scatter Techniques

• Solution 1: Convert to Gather

[Diagram: masses m1, m2 joined by springs with forces f1, f2, f3]

for each spring
    f = computed force
    mass_force[left] += f;
    mass_force[right] -= f;

Page 19: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Scatter Techniques

• Solution 1: Convert to Gather

[Diagram: masses m1, m2 joined by springs with forces f1, f2, f3]

for each spring
    f = computed force

for each mass
    mass_force = f[left] - f[right];
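The scatter and gather forms of the spring example can be compared on the CPU. The sketch below assumes a 1D chain in which spring i joins mass i (its left end) and mass i + 1 (its right end), with a simple linear spring law. The chain layout, the spring law and all function names are assumptions of this example, not details from the slides.

```python
def spring_forces(positions, rest_length=1.0, stiffness=1.0):
    """Force of each spring i joining mass i (left) and mass i + 1 (right)."""
    return [stiffness * ((positions[i + 1] - positions[i]) - rest_length)
            for i in range(len(positions) - 1)]


def mass_forces_scatter(positions):
    """Scatter form: each spring writes into the two masses it touches."""
    forces = spring_forces(positions)
    mass_force = [0.0] * len(positions)
    for i, f in enumerate(forces):
        mass_force[i] += f       # mass at the left end of spring i
        mass_force[i + 1] -= f   # mass at the right end of spring i
    return mass_force


def mass_forces_gather(positions):
    """Gather form: each mass reads the springs it is attached to."""
    forces = spring_forces(positions)
    last = len(positions) - 1
    mass_force = []
    for j in range(len(positions)):
        as_left = forces[j] if j < last else 0.0     # spring whose left end is mass j
        as_right = forces[j - 1] if j > 0 else 0.0   # spring whose right end is mass j
        mass_force.append(as_left - as_right)
    return mass_force
```

In the gather form every output element is written exactly once by its own loop iteration, which is the access pattern a fragment shader supports.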

Page 20: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Scatter Techniques

• Solution 2: Address Sorting
  • Sort & Search
    • Shader outputs destination address and data
    • Bitonic sort based on address
    • Run binary search shader over destination buffer
      – Each fragment searches for source data
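The sort and search steps can be emulated on the CPU as follows. This is a sketch only: Python's built-in sort stands in for the GPU bitonic sort named on the slide, `bisect_left` stands in for the binary search shader, and the function name and the `default` fill value are hypothetical.

```python
from bisect import bisect_left


def scatter_by_sort_and_search(addresses, data, size, default=0.0):
    """Emulate a[addr] = value with a sort pass followed by a per-element search."""
    # Pass 1: the "shader" emits (destination address, data) pairs.
    pairs = list(zip(addresses, data))
    # Pass 2: sort the pairs by address (built-in sort replaces bitonic sort here).
    pairs.sort(key=lambda p: p[0])
    keys = [p[0] for p in pairs]
    # Pass 3: every destination element binary-searches for its own address.
    out = []
    for dest in range(size):
        k = bisect_left(keys, dest)
        if k < len(keys) and keys[k] == dest:
            out.append(pairs[k][1])
        else:
            out.append(default)
    return out
```

When several pairs share one address, this version keeps the first pair in sorted order. Supporting a[i] += p would require summing the whole run of equal addresses instead.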

Page 21: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Scatter Techniques

• Solution 3: Vertex processor
  • Render points
    • Use vertex shader to set destination
    • or just read back the data and re-issue
  • Vertex Textures
    • Render data and address to texture
    • Issue points, set point x,y in vertex shader using address texture
    • Requires texld instruction in vertex program

Page 22: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Strategies & Tricks: Conditionals

Page 23: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Conditionals

• Problem:
  • Limited fragment shader conditional support

if (a)
    b = f();
else
    b = g();

Page 24: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Pre-computation

• Pre-compute anything that will not change every iteration!

• Example: static obstacles in fluid sim
  • When user draws obstacles, compute texture containing boundary info for cells
  • Reuse that texture until obstacles are modified
  • Combine with Z-cull for higher performance!

Page 25: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Static Branch Resolution

• Avoid branches where outcome is fixed
  • One region is always true, another false
  • Separate fragment programs (FPs) for each region, no branches

• Example: boundaries
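The boundary example can be illustrated with a CPU emulation. The sketch below assumes a 1D grid in which boundary cells are copied through and interior cells take a three-point average. The grid, both update rules and all function names are assumptions made for illustration only.

```python
def smooth_with_branch(cells):
    """Single kernel: every cell tests whether it lies on the boundary."""
    n = len(cells)
    out = []
    for i in range(n):
        if i == 0 or i == n - 1:           # branch evaluated for every cell
            out.append(cells[i])
        else:
            out.append((cells[i - 1] + cells[i] + cells[i + 1]) / 3.0)
    return out


def boundary_kernel(cells, i):
    """Branch-free program for boundary cells."""
    return cells[i]


def interior_kernel(cells, i):
    """Branch-free program for interior cells."""
    return (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0


def smooth_static_resolution(cells):
    """Two branch-free kernels, each launched only over its own region."""
    n = len(cells)
    out = [0.0] * n
    for i in (0, n - 1):                   # "draw" the boundary region
        out[i] = boundary_kernel(cells, i)
    for i in range(1, n - 1):              # "draw" the interior region
        out[i] = interior_kernel(cells, i)
    return out
```

The region test moves out of the per-cell program and into the choice of which geometry to draw, so neither kernel contains a conditional.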

Page 26: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Branching with Occlusion Query

• Use it for iteration termination

Do
{ // outer loop on CPU
    BeginOcclusionQuery
    {
        // Render with fragment program that
        // discards fragments that satisfy
        // termination criteria
    }
    EndQuery
} While query returns > 0
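The control flow of the loop above can be emulated on the CPU. In this sketch the "render" is a plain Python loop, halving a value stands in for the real per-fragment work, and the returned count plays the role of the occlusion query result. The function names, the halving rule and the `tolerance` parameter are hypothetical.

```python
def render_pass(values, tolerance):
    """One 'render': update live elements and return how many fragments were drawn."""
    drawn = 0
    for i, v in enumerate(values):
        if v <= tolerance:
            continue            # fragment discarded: termination criterion met
        values[i] = v * 0.5     # stand-in for the real per-fragment work
        drawn += 1              # what the occlusion query would count
    return drawn


def iterate_until_done(values, tolerance=1.0):
    """Outer loop on the CPU, driven only by the returned fragment count."""
    passes = 0
    while True:
        count = render_pass(values, tolerance)
        passes += 1
        if count == 0:
            break
    return passes
```

The CPU never inspects the data itself; it only reads back one integer per pass, and the final pass draws nothing.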

Page 27: GPU Computation Strategies & Tricks Ian Buck NVIDIA

27

ConditionalsConditionals

• Using the depth bufferUsing the depth buffer• Set Z buffer to aSet Z buffer to a

• Z-test can prevent shader executionZ-test can prevent shader execution

•glEnable(GL_DEPTH_TEST)glEnable(GL_DEPTH_TEST)

• Locality in conditionalLocality in conditional

if (a) b = f();else b = g();
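The effect of the depth-buffer approach can be emulated on the CPU: the condition is written once into a per-pixel mask, and each branch then runs only where its test passes. This is a sketch of the idea only. How the depth test is configured for each pass on real hardware is not modeled, and `run_with_depth_mask` is a hypothetical name.

```python
def run_with_depth_mask(a, f, g):
    """Emulate 'if (a) b = f(); else b = g();' with two masked passes."""
    # Pass 0: write the condition into the "Z buffer".
    z_buffer = [1.0 if cond else 0.0 for cond in a]
    b = [None] * len(a)
    shader_runs = 0
    # Pass 1: run f only for fragments whose test passes (Z == 1).
    for i, z in enumerate(z_buffer):
        if z == 1.0:
            b[i] = f(i)
            shader_runs += 1
    # Pass 2: run g only for the remaining fragments (Z == 0).
    for i, z in enumerate(z_buffer):
        if z == 0.0:
            b[i] = g(i)
            shader_runs += 1
    return b, shader_runs
```

Each pixel executes exactly one of the two shaders, so the total number of shader runs equals the pixel count rather than twice that.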

Page 28: GPU Computation Strategies & Tricks Ian Buck NVIDIA

28

ConditionalsConditionals

• Using the depth bufferUsing the depth buffer• Optimization disabled with:Optimization disabled with:

ATI: ATI: • Writing Z in shaderWriting Z in shader

• Enabling Alpha testEnabling Alpha test

• Using texkill in shaderUsing texkill in shader

NVIDIA:NVIDIA:

• Changing depth test Changing depth test direction in framedirection in frame

• Writing stencil while Writing stencil while rejecting based on stencil rejecting based on stencil

• Changing stencil Changing stencil func/ref/mask in framefunc/ref/mask in frame

Page 29: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Depth Conditionals

[Chart: GeForce 7800 GTX]

Page 30: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Conditionals

• Predication
  • Execute both f and g
  • Use CMP instruction
    • CMP b, -a, f, g
  • Executes all conditional code

if (a)
    b = f();
else
    b = g();
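The CMP-based predication can be emulated component by component. The sketch below follows the ARB fragment program definition of CMP, which selects the second operand where the first is negative and the third otherwise. The Python function names are hypothetical, and treating "a is true" as a > 0 is an assumption of this example.

```python
def cmp(src0, src1, src2):
    """Component-wise select: src1 where src0 < 0, otherwise src2."""
    return [s1 if s0 < 0.0 else s2 for s0, s1, s2 in zip(src0, src1, src2)]


def predicated_select(a, f_value, g_value):
    """b = f where a > 0, else g; both results have already been computed."""
    neg_a = [-x for x in a]
    return cmp(neg_a, f_value, g_value)
```

Both `f_value` and `g_value` must be fully evaluated before the select, which is the cost the slide points out: all conditional code executes.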

Page 31: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Conditionals

• Predication
  • Use DP4 instruction
    • DP4 b.x, a, f
  • Executes all conditional code

if (a.x) b = x;
else if (a.y) b = y;
else if (a.z) b = z;
else if (a.w) b = w;

a = (0, 1, 0, 0)
f = (x, y, z, w)
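The DP4 trick replaces the if/else chain with a four-component dot product against a one-hot mask. A minimal emulation, with hypothetical function names:

```python
def dp4(u, v):
    """Four-component dot product."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2] + u[3] * v[3]


def select_by_mask(a, f):
    """Pick the component of f whose mask entry in a is 1."""
    return dp4(a, f)
```

This matches the if/else chain only when exactly one component of `a` is 1. An all-zero mask yields 0, and a mask with several ones yields a sum of the selected components.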

Page 32: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Conditionals

• Conditional Instructions
  • Available with NV_fragment_program2

MOVC CC, R0;
IF GT.x;
    MOV R0, R1; # executes if R0.x > 0
ELSE;
    MOV R0, R2; # executes if R0.x <= 0
ENDIF;

Page 33: GPU Computation Strategies & Tricks Ian Buck NVIDIA

GeForce 6+ Series Branching

• True, SIMD branching
  • Lots of incoherent branching can hurt performance
  • Should have coherent regions of ~1000 pixels
    • That is only about 30x30 pixels, so still very useable!

• Don’t ignore overhead of branch instructions
  • Branching over < 5 instructions may not be worth it

• Use branching for early exit from loops
  • Save a lot of computation

Page 34: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Conditional Instructions

[Chart: GeForce 7800 GTX]

Page 35: GPU Computation Strategies & Tricks Ian Buck NVIDIA

Branching Techniques

• Fragment program branches can be expensive
  • No true fragment branching on GeForce FX or Radeon 9x00-X850
  • SIMD branching on GeForce 6/7 Series
    • Incoherent branching hurts performance

• Sometimes better to move decisions up the pipeline
  • Pre-computation
  • Replace with math
  • Occlusion Query
  • Static Branch Resolution
  • Depth Buffer