Brook for GPUs

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan

February 10th, 2003

February 11th, 2004 2

Brook: general purpose streaming language

• developed for PCA Program/Merrimac
  – compiler: RStream
    • Reservoir Labs
  – DARPA PCA Program
    • Stanford: SmartMemories
    • UT Austin: TRIPS
    • MIT: RAW
  – Brook version 0.2 spec: http://merrimac.stanford.edu
  – Brook for GPUs: http://brook.sourceforge.net

[Figure: block diagram with components labeled Stream Execution Unit, Stream Register File, Memory System, Network Interface, Scalar Execution Unit, DRDRAM, Network]

Brook: general purpose streaming language

• stream programming model
  – enforce data parallel computing
    • streams
  – encourage arithmetic intensity
    • kernels
• C with streams

Brook for GPUs

• demonstrate GPU streaming coprocessor
  – make programming GPUs easier
    • hide texture/pbuffer data management
    • hide graphics-based constructs in Cg/HLSL
    • hide rendering passes
    • virtualize resources
  – performance!
    • … on applications that matter
  – highlight GPU areas for improvement
    • features required for general purpose stream computing

system outline

• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library

Brook language

streams

• streams
  – collection of records requiring similar computation
    • particle positions, voxels, FEM cell, …

      float3 positions<200>;
      float3 velocityfield<100,100,100>;

  – encourage data parallelism
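For readers coming from plain C, the two stream declarations can be pictured as fixed-shape arrays of records. The sketch below is only an analogy: the `float3` struct, the `POSITIONS_LEN` and `FIELD_DIM` constants, and the `field_index` helper are illustrative names rather than part of Brook, and the row-major layout is an assumption, since the slides do not say how the runtime stores a stream.

```c
#include <stddef.h>

/* Hypothetical C stand-in for Brook's float3 record type. */
typedef struct { float x, y, z; } float3;

enum { POSITIONS_LEN = 200, FIELD_DIM = 100 };

/* float3 positions<200>;  ->  a 1D collection of 200 records. */
static float3 positions[POSITIONS_LEN];

/* float3 velocityfield<100,100,100>;  ->  a 3D collection, stored flat here. */
static float3 velocityfield[FIELD_DIM * FIELD_DIM * FIELD_DIM];

/* Map a 3D element coordinate to a flat offset (row-major; an assumption). */
static size_t field_index(size_t i, size_t j, size_t k)
{
    return (i * FIELD_DIM + j) * FIELD_DIM + k;
}

/* Number of records in each stream. */
static size_t positions_count(void)
{
    return sizeof positions / sizeof positions[0];
}

static size_t velocityfield_count(void)
{
    return sizeof velocityfield / sizeof velocityfield[0];
}
```

The point of the analogy is that a stream has a shape and an element type but no notion of "current position", which is what lets every record be processed independently.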

Brook language

kernels

• kernels
  – functions applied to streams
    • similar to for_all construct

      kernel void foo (float a<>, float b<>, out float result<>) {
        result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;

      foo(a,b,c);

    equivalent C loop:

      for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

  – no dependencies between stream elements
    • encourage high arithmetic intensity
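The kernel/loop correspondence on this slide can be sketched in plain C as a per-element function plus a driver that applies it to every element. `foo_element` and `apply_kernel2` are hypothetical names chosen for illustration; this is not the code that brcc generates.

```c
#include <stddef.h>

/* C counterpart of the Brook kernel body: one element in, one element out. */
static float foo_element(float a, float b)
{
    return a + b;
}

/* Hypothetical helper: apply a two-input kernel to every stream element.
 * No iteration reads another iteration's result, so the iterations could
 * run in any order, or all at once. */
static void apply_kernel2(float (*kernel)(float, float),
                          const float *a, const float *b,
                          float *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[i] = kernel(a[i], b[i]);
}
```

Calling `apply_kernel2(foo_element, a, b, c, 100)` on three 100-element arrays performs the same work as `foo(a,b,c)` on the slide.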

Brook language

kernels

• Ray Triangle Intersection

    kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                     RayState oldraystate<>,
                                     GridTrilist trilist[],
                                     out Hit candidatehit<>) {
      float idx, det, inv_det;
      float3 edge1, edge2, pvec, tvec, qvec;
      if(oldraystate.state.y > 0) {
        idx = trilist[oldraystate.state.w].trinum;
        edge1 = tris[idx].v1 - tris[idx].v0;
        edge2 = tris[idx].v2 - tris[idx].v0;
        pvec = cross(ray.d, edge2);
        det = dot(edge1, pvec);
        inv_det = 1.0f/det;
        tvec = ray.o - tris[idx].v0;
        candidatehit.data.y = dot( tvec, pvec ) * inv_det;
        qvec = cross( tvec, edge1 );
        candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
        candidatehit.data.x = dot( edge2, qvec ) * inv_det;
        candidatehit.data.w = idx;
      } else {
        candidatehit.data = float4(0,0,0,-1);
      }
    }
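The arithmetic in this kernel follows the Möller-Trumbore ray-triangle formulation. A minimal C sketch of the same computation for a single ray and a single triangle is given below. The `vec3` and `Hit` layouts and the helper names are assumptions made for illustration; only the field names `o`, `d`, `v0`, `v1`, `v2` come from the slide. Like the slide's kernel, the sketch does not reject near-parallel rays (`det` close to zero) or out-of-range barycentric coordinates.

```c
typedef struct { float x, y, z; } vec3;
typedef struct { vec3 o, d; } Ray;            /* origin, direction */
typedef struct { vec3 v0, v1, v2; } Triangle;
typedef struct { float t, u, v; } Hit;        /* ray parameter, barycentrics */

static vec3 sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

/* Same sequence of operations as the Brook kernel's "if" branch. */
static Hit intersect_triangle(Ray ray, Triangle tri)
{
    vec3 edge1 = sub(tri.v1, tri.v0);
    vec3 edge2 = sub(tri.v2, tri.v0);
    vec3 pvec = cross(ray.d, edge2);
    float det = dot(edge1, pvec);
    float inv_det = 1.0f / det;
    vec3 tvec = sub(ray.o, tri.v0);
    vec3 qvec = cross(tvec, edge1);
    Hit hit;
    hit.u = dot(tvec, pvec) * inv_det;   /* candidatehit.data.y */
    hit.v = dot(ray.d, qvec) * inv_det;  /* candidatehit.data.z */
    hit.t = dot(edge2, qvec) * inv_det;  /* candidatehit.data.x */
    return hit;
}
```

For a ray starting at (0.25, 0.25, -1) with direction (0, 0, 1) and the triangle (0,0,0), (1,0,0), (0,1,0), this yields t = 1 and u = v = 0.25.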

Brook language

additional features

• reductions
  – scalar
  – stream
• stride & repeat
• GatherOp & ScatterOp
  – a[i] += p
  – p = a[i]++
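As a rough illustration of what these features compute, here is a sequential C sketch. The function names are hypothetical. The grouping used for the stream reduction (each run of `factor` adjacent inputs collapses to one output) and the pairing of `a[i] += p` with ScatterOp and `p = a[i]++` with GatherOp are assumptions about the intended semantics, not statements taken from the Brook specification.

```c
#include <stddef.h>

/* Scalar reduction: collapse a whole stream to one value (here, a sum). */
static float reduce_sum(const float *s, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; k++)
        acc += s[k];
    return acc;
}

/* Stream reduction: collapse each run of `factor` inputs to one output,
 * so n inputs yield n / factor outputs (n assumed divisible by factor). */
static void reduce_sum_stream(const float *s, size_t n, size_t factor,
                              float *out)
{
    for (size_t k = 0; k < n / factor; k++)
        out[k] = reduce_sum(s + k * factor, factor);
}

/* ScatterOp-style update: a[i] += p for each (index, value) pair. */
static void scatter_add(float *a, const size_t *idx, const float *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        a[idx[k]] += p[k];
}

/* GatherOp-style fetch: p = a[i]++ (read the old value, then increment). */
static void gather_inc(float *a, const size_t *idx, float *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        p[k] = a[idx[k]]++;
}
```

Note that both indexed operations read and modify the same array, so two pairs that target the same index must be applied one after the other; that ordering requirement is what makes them harder to express than a plain kernel.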

brcc compiler

infrastructure

• based on ctool
  – http://ctool.sourceforge.net
• parser
  – build code tree
  – extend C grammar to accept Brook
• convert
  – tree transformations
• codegen
  – generate Cg & HLSL code
  – call cgc, fxc
  – generate stub function

Applications

• Ray-tracer
• FFT
• Segmentation
• Linear Algebra:
  – BLAS, LINPACK, LAPACK

Brook Performance

GPU Gotchas

NVIDIA NV3x: Register usage vs. Time

[Figure: plot with axes labeled Time and Registers Used]

GPU Gotchas

NVIDIA:
• Register Penalty
• Render to Texture Limitation
  – Requires explicit copy or heavy pbuffer solution
  – Superbuffer extension needed
    http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions SIG03.pdf

GPU Gotchas

ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
  – s16e7
    Integers up to 131,072
    (s23e8: 16,777,216)
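The two integer limits quoted here follow from the mantissa width: with m stored fraction bits plus one implicit leading bit, every integer up to 2^(m+1) is exactly representable. The C sketch below checks that arithmetic. It assumes both formats use an implicit leading bit and that the host `float` is IEEE 754 binary32; the function names are illustrative.

```c
#include <stdint.h>

/* Largest N such that every integer in [0, N] is exactly representable
 * when the format stores `mantissa_bits` explicit fraction bits plus one
 * implicit leading bit: N = 2^(mantissa_bits + 1). */
static uint32_t max_contiguous_int(unsigned mantissa_bits)
{
    return (uint32_t)1 << (mantissa_bits + 1);
}

/* Check on the host's 32-bit float (s23e8): returns 1 if adding 1.0f to n
 * leaves the stored value unchanged, i.e. the increment is lost. */
static int float_loses_increment_at(uint32_t n)
{
    volatile float f = (float)n;   /* volatile: force rounding to float */
    volatile float g = f + 1.0f;
    return g == f;
}
```

`max_contiguous_int(16)` gives 131,072 and `max_contiguous_int(23)` gives 16,777,216, matching the slide. In practice this means a 24-bit float can safely index at most 131,072 distinct elements when array addresses are carried in floating point.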

[Figure: four numbered stages (1 to 4), each pairing Memory Refs with Math Ops]

GPU Catch-Up!

• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
  – Hand code performance critical code
• No native reduction support
• No native scatter support
  – p[i] = a (indirect write)
• No programmable blend
  – GatherOp / ScatterOp
• Limited 4x4 output
  – Brook virtualized kernel outputs
• Readback still slow
  – NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
  – ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback
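To put the readback figures in perspective, the sketch below converts a bandwidth into a transfer time for one buffer. The 1024 x 1024 buffer of four 32-bit floats per element is a hypothetical workload chosen for illustration, and 1 MB is taken as 10^6 bytes; neither comes from the slides.

```c
/* Seconds needed to move `bytes` at `mb_per_sec`, taking 1 MB = 10^6 bytes
 * (an assumption; the slide does not define MB). */
static double transfer_seconds(double bytes, double mb_per_sec)
{
    return bytes / (mb_per_sec * 1.0e6);
}

/* Size in bytes of a width x height buffer holding four 32-bit floats
 * per element. */
static double float4_buffer_bytes(unsigned width, unsigned height)
{
    return (double)width * (double)height * 4.0 * 4.0;
}
```

Under these assumptions the buffer is 16,777,216 bytes, so readback takes roughly 0.1 s at 170 MB/sec and roughly 0.34 s at 50 MB/sec, compared with about 0.03 s to download the same buffer at 600 MB/sec.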

GPUs of the future (we hope)

• Complete Instruction Sets
  – Integers, Bit Ops, Doubles, Mem Access
• Integration
  – Streaming coprocessor, not just a rendering device
• Streaming architectures

[Figure: streaming architecture diagram with four SDRAM blocks, a Stream Register File, and three ALU Clusters]

Brook for GPUs

• Release v0.3 available on SourceForge
• Project Page
  – http://graphics.stanford.edu/projects/brook
• Source
  – http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?

Fly-fishing fly images from The English Fly Fishing Shop
