Graphics on GRAMPS
DESCRIPTION
Graphics on GRAMPS. Jeremy Sugerman, Kayvon Fatahalian. Background. Context: a broader research investigation into generalizing GPU/Cell/"compute" cores and combining them with CPUs. Fundamental belief: real data-parallel apps still have performance-critical non-data-parallel pieces.
FLASHG 15 Oct 2007
Graphics on GRAMPS
Jeremy Sugerman, Kayvon Fatahalian
Background
Context: a broader research investigation into generalizing GPU/Cell/"compute" cores and combining them with CPUs.
Fundamental beliefs:
– Real data-parallel apps still have performance-critical non-data-parallel pieces
– Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
– Queues are an excellent idiom to capture producer-consumer parallelism – thread and data
– Fixed-function execution units are not a problem, but fixed control paths are
Compute Cores
CPUs are designed for single threads per core – minimal FLOPS per core.
Compute cores are designed for lots of math per core:
– Many "threads" per core
– Sometimes wider SIMD per thread
– SIMD width * # hardware threads ops per core
And more compute cores than CPU cores fit per chip.
Many examples: GPU, Cell, Niagara, Larrabee
Simplified Direct3D Pipeline
Application launches some drawing…
1. Vertex Assembly (Fixed, Non-Data Parallel)
2. Vertex Processing (Programmable, Data Parallel)
3. Primitive Assembly (Fixed, Non-Data Parallel)
4. Primitive Processing (Programmable, Data Parallel)
5. Fragment Assembly (Fixed, Non-Data Parallel)
6. Fragment Processing (Programmable, Data Parallel)
7. Pixel / Image Assembly (Fixed, Non-Data Parallel)
Only the data-parallel stages are programmable!
Direct3D Pipeline Properties
There is a reason only the data-parallel stages are programmable.
'Shader' stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between elements.
'Assembly' stages also run on many elements, but they have inter-element dependencies:
– State can be remembered (vertex caching)
– Inputs can be used by multiple outputs (strips)
Programmable 'Assembly' requires heavier (more serial) threads than 'Shaders'.
Question
Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
This does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…).
This Talk
GRAMPS: our current model for programming compute cores.
Implementing Direct3D 10 "in software" with GRAMPS.
(Potentially) thoughts on how REYES and ray tracers map to GRAMPS.
No explicit discussion of heterogeneous cores.
No fancy scheduling algorithms (yet?)
Example: Simple 3D Pipeline
[Pipeline diagram] Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels
GRAMPS
General Runtime/Architecture for Multicore Parallel Systems.
Models execution as a graph of queues connected by threads.
The graph is specified by the host program.
Simulator for exploring compute cores:
– Currently conflates "hardware" and runtime
– # of cores, thread contexts, and SIMD width are all parameters
Simple GRAMPS core
[Core diagram] T hardware threads (Thread 0 … Thread T-1) share S SIMD ALUs (ALU 0 … ALU S-1), R registers per thread, and an L1 data cache (or scratchpad).
T – threads/core, S – SIMD ALUs/core, R – registers/thread
One thread runs in each clock.
Threads issue vector instructions (think S-wide SSE).
D3D10 Setup
1. App defines 3 shading environments
– Vertex, geometry, fragment
– Attach programs and resources
2. App configures fixed-function units
– Fixed number of "modes"
– Attach resources
3. App submits work (vertices) to the pipeline
4. Graphics runtime executes until completion
GRAMPS Setup
1. App defines a set of queues
2. App defines a set of thread environments
3. App attaches queues as thread inputs and outputs
4. App bootstraps computation by inserting data into a queue
5. Runtime executes threads until completion
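The five setup steps above can be sketched in C. This is a hypothetical illustration only: the slide does not show the real GRAMPS API, so every type and function name here (`Queue`, `ThreadEnv`, `attach_input`, `bootstrap`, …) is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the five GRAMPS setup steps; these type and
 * function names are invented for illustration, not the real API. */

#define MAX_QUEUES 8

typedef struct Queue {
    int id;
    int num_items;          /* items bootstrapped into the queue */
} Queue;

typedef struct ThreadEnv {
    const char *name;       /* e.g. "vtxShade" */
    Queue *inputs[MAX_QUEUES];
    Queue *outputs[MAX_QUEUES];
    int num_inputs, num_outputs;
} ThreadEnv;

/* Step 1: app defines a set of queues. */
static void queue_init(Queue *q, int id) { q->id = id; q->num_items = 0; }

/* Steps 2-3: app defines thread environments and attaches queues. */
static void env_init(ThreadEnv *t, const char *name) {
    t->name = name; t->num_inputs = 0; t->num_outputs = 0;
}
static void attach_input(ThreadEnv *t, Queue *q)  { t->inputs[t->num_inputs++] = q; }
static void attach_output(ThreadEnv *t, Queue *q) { t->outputs[t->num_outputs++] = q; }

/* Step 4: app bootstraps computation by inserting data into a queue.
 * Step 5 (running threads to completion) is the runtime's job. */
static void bootstrap(Queue *q, int n) { q->num_items += n; }
```

Wiring a single shader stage between two queues would then look like: create the queues, create the environment, attach, and push the initial work.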
GRAMPS Entities: Execution
Threads come in three flavors: Assemble, Shader, Fixed.
– Assemble: stateful, akin to a regular thread
– Fixed: special-purpose hardware wrapped to appear as an Assemble thread
– Shader: stateless and data parallel
GRAMPS Entities: Data
Queues for producer-consumer parallelism.
Queues for aggregating coherent work.
Queues support push and reserve/commit for in-place assembly.
Chunks are the units / granularity at which queues are manipulated.
GRAMPS Scheduling
GRAMPS assigns threads to hardware contexts
– Based on the graph and the current queue contents
Tiered scheduling model:
– Tier-0: trivially puts threads onto hardware threads
– Tier-1: builds schedules for Tier-0
– Tier-N: arbitrarily clever. Doesn't exist.
System (how it works today)
D3D10 on GRAMPS
[Execution graph] index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue → prePrimAssemble queue → primAssemble → prePrimShade queue → primShade → postPrimShade queue → rastAssemble → preRast queue → tri setup / clip / cull → tri queue 0 … tri queue N
Each tri queue then feeds its own chain: rasterize → preFragShade queue → fragShade → postFragShade queue → blend / ztest
Legend: vtxShade, primShade, fragShade = shader threads; idxVtxAssemble, primAssemble, rastAssemble = assemble threads; tri setup / clip / cull, rasterize, blend / ztest = fixed function in GPU
Internal Queues
Queues are just memory plus a state struct (see below).
– For now: queues are finite
– Queues are a contiguous array of chunks
Chunks = granularity of manipulation

queue {
    BYTE ptr[num_chunks * chunk_byte_width];
    int  num_chunks;
    int  chunk_byte_width;
    int  head;
    int  tail;
    int  reclaim;
    bool done[num_chunks];
};
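One concrete reading of this struct is a finite ring of fixed-size chunks. The sketch below is an assumption-laden simplification: sizes are compile-time constants, the `reclaim`/`done` bookkeeping is omitted, and a `count` field (not on the slide) replaces the head/tail-only fullness test.

```c
#include <assert.h>
#include <string.h>
#include <stdbool.h>

/* A minimal, concrete reading of the slide's queue struct, assuming the
 * head/tail indices count chunks and wrap around the contiguous array.
 * reclaim/done are dropped and sizes are constants to keep it simple. */

#define NUM_CHUNKS 4
#define CHUNK_BYTE_WIDTH 64

typedef struct queue {
    unsigned char ptr[NUM_CHUNKS * CHUNK_BYTE_WIDTH];
    int head;                 /* next chunk to consume */
    int tail;                 /* next chunk to produce into */
    int count;                /* chunks currently in flight */
} queue;

static void q_init(queue *q) { q->head = 0; q->tail = 0; q->count = 0; }

/* Producer side: copy one chunk in ("push"); fails when the finite
 * queue is full. */
static bool q_push(queue *q, const unsigned char chunk[CHUNK_BYTE_WIDTH]) {
    if (q->count == NUM_CHUNKS) return false;
    memcpy(q->ptr + q->tail * CHUNK_BYTE_WIDTH, chunk, CHUNK_BYTE_WIDTH);
    q->tail = (q->tail + 1) % NUM_CHUNKS;
    q->count++;
    return true;
}

/* Consumer side: release the oldest chunk. */
static bool q_pop(queue *q, unsigned char chunk[CHUNK_BYTE_WIDTH]) {
    if (q->count == 0) return false;
    memcpy(chunk, q->ptr + q->head * CHUNK_BYTE_WIDTH, CHUNK_BYTE_WIDTH);
    q->head = (q->head + 1) % NUM_CHUNKS;
    q->count--;
    return true;
}
```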
Ex: GRAMPS has chunks
[Graph fragment] index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue
index_queue chunks contain vertex indices
preVtxShade_queue chunks contain 16 pre-transformed vertices
postVtxShade_queue chunks contain 16 transformed vertices
Ex: GRAMPS has chunks
[Graph fragment] rasterize → preFragShade queue → fragShade
preFragShade_queue chunks contain:
– Interpolated inputs for 16 fragments
– A liveness mask per fragment
– x,y position per quad
– Uniform data shared across all fragments
Queue API
A window is a view into a contiguous range of chunks for assemble threads.
Access is symmetric for producing and consuming.

qwin {
    BYTE* ptr;
    int   num;
    int   id;
};

Shader threads just have "push".
Queue manipulation
All threads:
void produce()  – "push"
Assemble threads only:
qwin* reserve(qwin* q, int num_chunks)
qwin* commit(qwin* q, int num_chunks)
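The reserve/commit idiom can be sketched as bookkeeping over a window: reserve claims a contiguous range of fresh chunks, commit publishes the oldest chunks and slides the window forward. The counters below (`reserved`, `committed`) are invented for illustration; the slide does not show the runtime's internals.

```c
#include <assert.h>

/* Sketch of window-based reserve/commit: reserve claims a contiguous
 * range of chunks, commit publishes them. The bookkeeping here is an
 * assumption, not the real GRAMPS implementation. */

#define NUM_CHUNKS 8

typedef struct qwin {
    int first;   /* index of the first chunk in the window */
    int num;     /* chunks currently reserved in the window */
} qwin;

typedef struct wqueue {
    int reserved;   /* chunks handed out via reserve() */
    int committed;  /* chunks made visible via commit() */
} wqueue;

/* Grow the caller's window by num_chunks of fresh space; NULL on full. */
static qwin *reserve(wqueue *q, qwin *w, int num_chunks) {
    if (q->reserved + num_chunks > NUM_CHUNKS) return 0;  /* queue full */
    if (w->num == 0) w->first = q->reserved;
    q->reserved += num_chunks;
    w->num += num_chunks;
    return w;
}

/* Publish the oldest num_chunks of the window and slide it forward. */
static qwin *commit(wqueue *q, qwin *w, int num_chunks) {
    if (num_chunks > w->num) return 0;
    q->committed += num_chunks;
    w->first += num_chunks;
    w->num -= num_chunks;
    return w;
}
```

In a real runtime a failed reserve would block the assemble thread rather than return NULL, matching the blocking semantics described on the assemble-threads slide.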
Internal threads
A ThreadEnv defines a "type" of thread:

ThreadEnv {
    type = {shader, assemble, fixed-func}
    program code
    uniforms / constant data
    sampler / texture / resource id bindings
    list of input queues
    list of output queues
};
Shader threads
The shading language is unchanged (HLSL):
– Shaders are still written in terms of single elements
– Compilation produces code that operates on chunks

void hlsl_likefn(const element* inputEl,
                 element* outputEl,
                 const sampler foo,
                 const tex3d tex)
Internal shader threads
Shader thread code processes chunks.
Input:
– GRAMPS pre-reserved chunks from the in/out queues
– Environment info (uniforms, constants, etc.)

void shaderFn(const chunk* in_chunks[],
              chunk* out_chunks[],
              const env* env)

Dispatched shader threads run to completion. Completion implies:
– in_chunks are released
– out_chunks are committed
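The per-element-to-chunk compilation step can be sketched as a wrapper loop: the author writes a single-element body, and the chunk-level entry point runs it over all 16 elements of a chunk. The element layout and the `scale_elem` body are invented for illustration; only the chunk-granularity `shaderFn` shape comes from the slide.

```c
#include <assert.h>

/* Sketch of how compilation turns a per-element HLSL-like function into
 * chunk-granularity code: the wrapper loops the single-element body over
 * every element of the pre-reserved input/output chunks. Element and
 * chunk layouts are invented for illustration. */

#define ELEMS_PER_CHUNK 16

typedef struct element { float x, y, z, w; } element;
typedef struct chunk   { element elems[ELEMS_PER_CHUNK]; } chunk;

/* Per-element body, as the shader author writes it. */
static void scale_elem(const element *in, element *out) {
    out->x = in->x * 2.0f; out->y = in->y * 2.0f;
    out->z = in->z * 2.0f; out->w = in->w;
}

/* Chunk-level entry point the runtime dispatches (cf. shaderFn). */
static void shaderFn(const chunk *in_chunk, chunk *out_chunk) {
    for (int i = 0; i < ELEMS_PER_CHUNK; i++)
        scale_elem(&in_chunk->elems[i], &out_chunk->elems[i]);
}
```

Because the body is stateless between elements, the runtime is free to map the loop onto the core's S-wide SIMD ALUs.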
Assemble threads
Assemble threads build chunks.
They access queue data via windows.
commit/reserve/consume may block the thread.

void assembleFn(qwin* in_win[],
                qwin* out_win[],
                const env* env)
Ex: primitive assembly
Input chunks = 16 verts; output chunks = 16 prims.
The primitive structure depends on the type of primitive:
– Points, lines, triangles, triangles w/ adjacency, etc.
Creating prims from verts depends on the topology:
– Strips or lists
– Triangle strip: data for an output chunk can come from multiple input chunks
[Graph fragment] prePrimAssemble queue → primAssemble → prePrimShade queue
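The strip case can be sketched concretely: each output triangle after the first reuses the previous two vertices, which is why one output chunk may draw on vertices from more than one input chunk. The index layout below is an illustrative assumption; only the strip topology rule is standard.

```c
#include <assert.h>

/* Sketch of strip-to-list primitive assembly. A strip of n vertices
 * yields n - 2 triangles; each new triangle reuses the previous two
 * vertices. Data layout invented for illustration. */

typedef struct tri { int v[3]; } tri;

/* Assemble triangles from a strip of n vertex indices; returns the
 * number of triangles written. */
static int assemble_strip(const int *verts, int n, tri *out) {
    int count = 0;
    for (int i = 2; i < n; i++) {
        /* Swap winding on odd triangles so all face the same way. */
        out[count].v[0] = verts[i - 2];
        out[count].v[1] = (i % 2 == 0) ? verts[i - 1] : verts[i];
        out[count].v[2] = (i % 2 == 0) ? verts[i] : verts[i - 1];
        count++;
    }
    return count;
}
```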
Ex: frag assembly (rast)

for (each input triangle) {
    add triangle uniform data to chunk
    while (chunk not full && triangle not done) {
        rasterize next tile of quads…
        for (each nonempty quad) {
            add 4 fragments to chunk
            add quad description to chunk
        }
    }
    if (chunk is full) {
        qwin_out = commit(qwin_out, 1);
        // grow window with reserve() if necessary…
    }
}

Building chunks:
1. Compact valid quads
2. Data at various frequencies
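The "compact valid quads" step can be sketched as follows: only nonempty quads are appended to the fragment chunk, with a 4-bit liveness mask recording coverage within each quad. The struct layout and `add_quad` helper are invented for illustration; only the 16-fragments-per-chunk, mask-per-fragment organization comes from the earlier chunk slide.

```c
#include <assert.h>

/* Sketch of quad compaction during fragment assembly: empty quads are
 * skipped, and each kept quad carries a 4-bit fragment liveness mask.
 * Layout invented for illustration. */

#define QUADS_PER_CHUNK 4   /* 16 fragments per chunk / 4 per quad */

typedef struct fquad {
    int x, y;           /* quad position */
    unsigned mask;      /* 4-bit fragment liveness mask */
} fquad;

typedef struct fchunk {
    fquad quads[QUADS_PER_CHUNK];
    int num_quads;
} fchunk;

/* Append a quad if any fragment is live; returns 0 when the chunk is
 * full (the caller would then commit and reserve a fresh chunk). */
static int add_quad(fchunk *c, int x, int y, unsigned mask) {
    if (mask == 0) return 1;                 /* empty quad: skip it */
    if (c->num_quads == QUADS_PER_CHUNK) return 0;
    c->quads[c->num_quads].x = x;
    c->quads[c->num_quads].y = y;
    c->quads[c->num_quads].mask = mask & 0xFu;
    c->num_quads++;
    return 1;
}
```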
Execution: Tier 1
[Scheduler diagram] Shader and assemble ThreadEnvs, connected by queues, feed the Tier 1 scheduler; Tier 1 issues shader-thread dispatches and assemble-thread resumes into a Tier 1 → Tier 0 FIFO for the core (threads T 0 … T T-1 sharing an L1 $). Events driving Tier 1: Thread_Done() (implicit commit), Produce(), Reserve(), Commit().
Execution: Tier 0
[Core diagram] Threads 0 … T-1 over SIMD ALUs 0 … S-1, R registers/thread, and an L1 data cache (or scratchpad), fed by the Tier 0 scheduler from the Tier 1 → Tier 0 FIFO.
Each cycle: round-robin among the runnable threads.
When a thread stalls: place it on a wait list.
When a thread completes:
1. Pull the next thread from the FIFO and assign it to the empty thread slot
2. Send a completion message up to Tier 1
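The Tier-0 policy can be sketched as a round-robin pick over hardware thread slots plus a refill-on-completion rule. The slot states and the FIFO model below are invented simplifications of the slide's description.

```c
#include <assert.h>

/* Sketch of the Tier-0 policy: round-robin over runnable hardware
 * thread slots, stalled threads wait, and a completed slot is refilled
 * from the Tier 1 -> Tier 0 FIFO. States and the FIFO are modeled
 * minimally; this is not the simulator's actual code. */

#define NUM_SLOTS 4

typedef enum { EMPTY, RUNNABLE, STALLED } slot_state;

typedef struct tier0 {
    slot_state slots[NUM_SLOTS];
    int last;                  /* slot run on the previous cycle */
    int fifo_pending;          /* threads waiting in the tier-1 FIFO */
} tier0;

/* One cycle: pick the next runnable slot after `last`; -1 if none. */
static int t0_pick(tier0 *s) {
    for (int i = 1; i <= NUM_SLOTS; i++) {
        int slot = (s->last + i) % NUM_SLOTS;
        if (s->slots[slot] == RUNNABLE) { s->last = slot; return slot; }
    }
    return -1;
}

/* A thread completed: free its slot, then refill from the FIFO. */
static void t0_complete(tier0 *s, int slot) {
    s->slots[slot] = EMPTY;
    if (s->fifo_pending > 0) {         /* pull next thread from the FIFO */
        s->fifo_pending--;
        s->slots[slot] = RUNNABLE;
    }
    /* (A completion message would also be sent up to Tier 1 here.) */
}
```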
Validation
– "Fat enough" cores for assemble threads can still deliver sufficient FLOPS
– Assemble threads can keep the compute cores + fixed-function units busy
– We can give up domain-specific heuristics in scheduling