graphics on gramps

FLASHG 15 Oct 2007 1 Graphics on GRAMPS Jeremy Sugerman Kayvon Fatahalian


Page 1: Graphics on GRAMPS

FLASHG 15 Oct 2007 1

Graphics on GRAMPS
Jeremy Sugerman
Kayvon Fatahalian

Page 2: Graphics on GRAMPS

Background

Context: Broader research investigation generalizing GPU/Cell/“compute” cores and combining them with CPUs.

Fundamental Beliefs:
– Real data parallel apps still have performance critical non-data parallel pieces
– Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
– Queues are an excellent idiom to capture producer-consumer parallelism, both thread and data
– Fixed function execution units are not a problem, but fixed control paths are

Page 3: Graphics on GRAMPS

Compute Cores

CPUs designed for single threads per core
– Minimal FLOPS per core
Compute cores designed for lots of math per core
– Many “threads” per core
– Sometimes wider SIMD per thread
– SIMD width * # hardware threads ops / core
And, more compute than CPU cores fit per chip
Many examples: GPU, Cell, Niagara, Larrabee

Page 4: Graphics on GRAMPS

Simplified Direct3D Pipeline

Application launches some drawing…

1. Vertex Assembly (Fixed, Non-Data Parallel)
2. Vertex Processing (Programmable, Data Parallel)
3. Primitive Assembly (Fixed, Non-Data Parallel)
4. Primitive Processing (Programmable, Data Parallel)
5. Fragment Assembly (Fixed, Non-Data Parallel)
6. Fragment Processing (Programmable, Data Parallel)
7. Pixel / Image Assembly (Fixed, Non-Data Parallel)

Only Data Parallel stages are programmable!

Page 5: Graphics on GRAMPS

Direct3D Pipeline Properties

There is a reason only data parallel stages are programmable.
‘Shader’ stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between them.
‘Assembly’ stages also run on many elements, but they have inter-element dependencies
– State can be remembered (vertex caching)
– Inputs can be used by multiple outputs (strips)
Programmable ‘Assembly’ requires heavier (more serial) threads than ‘Shaders’.

Page 6: Graphics on GRAMPS

Question

Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?

Does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…)

Page 7: Graphics on GRAMPS

This Talk

GRAMPS: Our current model for programming compute cores.
Implementing Direct3D 10 “in software” with GRAMPS.
(Potentially) thoughts about how REYES, ray tracers map to GRAMPS.

No explicit discussion of heterogeneous cores.
No fancy scheduling algorithms (yet?)

Page 8: Graphics on GRAMPS

Example: Simple 3D Pipeline

[Figure: pipeline graph. Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels]

Page 9: Graphics on GRAMPS

GRAMPS

General Runtime/Architecture for Multicore Parallel Systems
Models execution graph of queues connected by threads
Graph specified by host program
Simulator for exploring compute cores
– Currently conflates “hardware” and runtime
– # of cores, thread contexts, SIMD width are all parameters

Page 10: Graphics on GRAMPS

Simple GRAMPS core

[Figure: one core with hardware threads 0 through T-1, each with R registers, sharing SIMD ALUs 0 through S-1 and an L1 data cache (or scratchpad)]

T - threads/core, S - SIMD ALUs/core, R - registers/thread
1 thread runs in each clock
Threads issue vector instructions (think S-wide SSE)

Page 11: Graphics on GRAMPS

D3D10 Setup

1. App defines 3 shading environments
– Vertex, geometry, fragment
– Attach programs and resources
2. App configures fixed function units
– Fixed number of “modes”
– Attach resources
3. App submits work (vertices) to pipeline
4. Graphics runtime executes until completion

Page 12: Graphics on GRAMPS

GRAMPS Setup

1. App defines a set of queues
2. App defines a set of thread environments
3. App attaches queues as thread inputs and outputs
4. App bootstraps computation by inserting data into queue
5. Runtime executes threads until completion
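As a rough illustration of setup steps 1 through 4, the host program might build the graph along the following lines. This is only a sketch: the slides do not show the host API, so every name here (`struct graph`, `graph_add_queue`, `graph_add_thread`, `graph_attach_input`, `graph_attach_output`, `graph_insert`) and the fixed array limits are assumptions made for the example. Step 5, the runtime executing threads, is not modeled.

```c
/* Hypothetical host-side graph description; none of these names come from
 * the slides. They only mirror setup steps 1 through 4. */
enum thread_type { THREAD_SHADER, THREAD_ASSEMBLE, THREAD_FIXED };

#define MAX_QUEUES   8
#define MAX_THREADS  8
#define MAX_BINDINGS 4

struct queue_desc { int num_chunks, chunk_byte_width, pending_chunks; };

struct thread_env {
    enum thread_type type;
    int inputs[MAX_BINDINGS],  num_inputs;
    int outputs[MAX_BINDINGS], num_outputs;
};

struct graph {
    struct queue_desc queues[MAX_QUEUES];
    struct thread_env threads[MAX_THREADS];
    int num_queues, num_threads;
};

/* Step 1: define a queue. Returns its id, or -1 if the graph is full. */
int graph_add_queue(struct graph *g, int num_chunks, int chunk_byte_width) {
    if (g->num_queues == MAX_QUEUES) return -1;
    struct queue_desc *q = &g->queues[g->num_queues];
    q->num_chunks = num_chunks;
    q->chunk_byte_width = chunk_byte_width;
    q->pending_chunks = 0;
    return g->num_queues++;
}

/* Step 2: define a thread environment. Returns its id, or -1 if full. */
int graph_add_thread(struct graph *g, enum thread_type type) {
    if (g->num_threads == MAX_THREADS) return -1;
    struct thread_env *t = &g->threads[g->num_threads];
    t->type = type;
    t->num_inputs = 0;
    t->num_outputs = 0;
    return g->num_threads++;
}

static int valid_ids(const struct graph *g, int thread, int queue) {
    return thread >= 0 && thread < g->num_threads &&
           queue >= 0 && queue < g->num_queues;
}

/* Step 3: attach a queue as a thread input. Returns 0 on success. */
int graph_attach_input(struct graph *g, int thread, int queue) {
    if (!valid_ids(g, thread, queue)) return -1;
    struct thread_env *t = &g->threads[thread];
    if (t->num_inputs == MAX_BINDINGS) return -1;
    t->inputs[t->num_inputs++] = queue;
    return 0;
}

/* Step 3: attach a queue as a thread output. Returns 0 on success. */
int graph_attach_output(struct graph *g, int thread, int queue) {
    if (!valid_ids(g, thread, queue)) return -1;
    struct thread_env *t = &g->threads[thread];
    if (t->num_outputs == MAX_BINDINGS) return -1;
    t->outputs[t->num_outputs++] = queue;
    return 0;
}

/* Step 4: bootstrap by inserting chunks. Queues are finite, so this can fail. */
int graph_insert(struct graph *g, int queue, int chunks) {
    if (queue < 0 || queue >= g->num_queues || chunks <= 0) return -1;
    struct queue_desc *q = &g->queues[queue];
    if (q->pending_chunks + chunks > q->num_chunks) return -1;
    q->pending_chunks += chunks;
    return 0;
}
```

A host would start from a zero-initialized `struct graph`, add its queues and thread environments, attach them, and then insert the first chunks of work into the head queue.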

Page 13: Graphics on GRAMPS

GRAMPS Entities: Execution

Threads: Assemble, Shader, Fixed
– Assemble: Stateful, akin to a regular thread
– Fixed: Special purpose hardware wrapped to appear as an Assemble thread
– Shader: Stateless and data parallel

Page 14: Graphics on GRAMPS

GRAMPS Entities: Data

Queues for producer-consumer parallelism
Queues for aggregating coherent work
Queues support push and reserve/commit for in-place Assembly
Chunks are the units / granularity at which Queues are manipulated.

Page 15: Graphics on GRAMPS

GRAMPS Scheduling

GRAMPS assigns Threads to hw contexts
– Based on graph, current Queue contents
Tiered scheduling model
– Tier-0: Trivially puts threads onto hw threads
– Tier-1: Builds schedules for Tier-0.
– Tier-N: Arbitrarily clever. Doesn’t exist.

Page 16: Graphics on GRAMPS

System (how it works today)

Page 17: Graphics on GRAMPS

D3D10 on GRAMPS

[Figure: the D3D10 pipeline as a GRAMPS graph. Front-end nodes as labeled: Index queue, idxVtxAssemble, preVtxShade queue, vtxShade, postVtxShade queue, prePrimAssemble queue, primAssemble, prePrimShade queue, primShade, postPrimShade queue, rastAssemble, preRast queue, tri setup / clip / cull. Back end: tri queues 0 through N, each feeding its own rasterize → preFragShade queue → fragShade → postFragShade queue → blend / ztest chain. Legend: shader thread, assemble thread, fixed function in GPU.]

Page 18: Graphics on GRAMPS

Internal Queues

Queues just memory + state struct (see below)
– For now: Queues are finite
– Queues are contiguous array of chunks
Chunks = granularity of manipulation

    queue {
        BYTE ptr[num_chunks * chunk_byte_width];
        int  num_chunks;
        int  chunk_byte_width;
        int  head;
        int  tail;
        int  reclaim;
        bool done[num_chunks];
    };

Page 19: Graphics on GRAMPS

Ex: GRAMPS has chunks

[Figure: Index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue]

index_queue chunks contain vertex indices
preVtxShade_queue chunks contain 16 pre-transformed vertices
postVtxShade_queue chunks contain 16 transformed vertices

Page 20: Graphics on GRAMPS

Ex: GRAMPS has chunks

[Figure: rasterize → preFragShade queue → fragShade]

preFragShade_queue chunks contain:
– Interpolated inputs for 16 fragments
– liveness mask per fragment
– x,y position per quad
– uniform data shared across all fragments

Page 21: Graphics on GRAMPS

Queue API

Window = view into a contiguous range of chunks for assemble threads
Symmetric for producing/consuming access

    qwin {
        BYTE* ptr;
        int   num;
        int   id;
    };

Shader threads just have “push”

Page 22: Graphics on GRAMPS

Queue manipulation

(All threads)
    void produce()    “push”

(Assemble threads only)
    qwin* reserve(qwin* q, int num_chunks)
    qwin* commit(qwin* q, int num_chunks)
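To make the window operations concrete, here is a minimal single-threaded sketch of producer-side `reserve`, `commit`, and `produce` over a finite queue. Several things in it are assumptions, not taken from the slides: the meanings given to `head`, `tail`, and `done[]`; passing the queue explicitly instead of finding it through the window; returning NULL or false where the real runtime may block the thread; and the absence of wrap-around and reclaim. `queue_init` and `committed_chunks` are hypothetical helpers added for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char BYTE;

struct queue {
    BYTE *ptr;             /* num_chunks * chunk_byte_width bytes */
    int   num_chunks;
    int   chunk_byte_width;
    int   head;            /* assumption: first chunk not yet consumed */
    int   tail;            /* assumption: first chunk not yet reserved */
    bool *done;            /* assumption: done[i] is set when chunk i is committed */
};

struct qwin { BYTE *ptr; int num; int id; };  /* id is unused in this sketch */

bool queue_init(struct queue *q, int num_chunks, int chunk_byte_width) {
    q->ptr  = calloc((size_t)num_chunks, (size_t)chunk_byte_width);
    q->done = calloc((size_t)num_chunks, sizeof *q->done);
    q->num_chunks = num_chunks;
    q->chunk_byte_width = chunk_byte_width;
    q->head = q->tail = 0;
    if (!q->ptr || !q->done) { free(q->ptr); free(q->done); return false; }
    return true;
}

/* Grow the window by num_chunks at the queue tail. NULL if there is no room
 * or the window would stop being contiguous. */
struct qwin *reserve(struct queue *q, struct qwin *w, int num_chunks) {
    BYTE *tail_ptr = q->ptr + (size_t)q->tail * q->chunk_byte_width;
    if (num_chunks <= 0 || q->tail + num_chunks > q->num_chunks) return NULL;
    if (w->num == 0)
        w->ptr = tail_ptr;
    else if (w->ptr + (size_t)w->num * q->chunk_byte_width != tail_ptr)
        return NULL;
    w->num  += num_chunks;
    q->tail += num_chunks;
    return w;
}

/* Publish the first num_chunks of the window and shrink it from the front. */
struct qwin *commit(struct queue *q, struct qwin *w, int num_chunks) {
    if (num_chunks <= 0 || num_chunks > w->num) return NULL;
    int first = (int)((w->ptr - q->ptr) / q->chunk_byte_width);
    for (int i = 0; i < num_chunks; i++) q->done[first + i] = true;
    w->ptr += (size_t)num_chunks * q->chunk_byte_width;
    w->num -= num_chunks;
    return w;
}

/* "push": reserve one chunk, fill it, commit it. */
bool produce(struct queue *q, const void *chunk) {
    struct qwin w = {0};
    if (!reserve(q, &w, 1)) return false;
    memcpy(w.ptr, chunk, (size_t)q->chunk_byte_width);
    return commit(q, &w, 1) != NULL;
}

/* Number of chunks a consumer could take right now, in order from head. */
int committed_chunks(const struct queue *q) {
    int n = 0;
    while (q->head + n < q->tail && q->done[q->head + n]) n++;
    return n;
}
```

Under these assumptions a chunk that is reserved but not yet committed holds back every chunk behind it, which is one plausible reason for a per-chunk `done` flag: producers can finish out of order while consumers still see chunks in order.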

Page 23: Graphics on GRAMPS

Internal threads

Defines a “type” of thread

    ThreadEnv {
        type = {shader, assemble, fixed-func}
        Program Code
        uniforms/constant data
        sampler/texture/resource id bindings
        List of input queues
        List of output queues
    };

Page 24: Graphics on GRAMPS

Shader threads

Shading language unchanged (HLSL)
– Still write shaders in terms of single elements
– Compilation produces code to operate on chunks

    void hlsl_likefn(const element* inputEl,
                     element* outputEl,
                     const sampler foo,
                     const tex3d tex)

Page 25: Graphics on GRAMPS

Internal shader threads

Shader thread code processes chunks
Input:
– GRAMPS pre-reserved chunks from in/out queues
– Environment info (uniforms, consts, etc)

    void shaderFn(const chunk* in_chunks[],
                  chunk* out_chunks[],
                  const env* env)

Dispatched shader threads run to completion
Completion implies:
– inChunks are released
– outChunks are committed
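One way to picture the step from a per-element shader to chunk-processing thread code is a generated wrapper loop. The sketch below is an assumption about what such output could look like, not the actual compiler output: the `element`, `chunk`, and `env` layouts, the `scale_element` shader, and the extra `num_chunks` parameter are all invented for the example. Only the overall shape of `shaderFn` follows the slide.

```c
#define CHUNK_ELEMS 16   /* the chunk examples in these slides hold 16 elements */

struct element { float x, y, z, w; };                 /* hypothetical layout */
struct chunk   { int count; struct element el[CHUNK_ELEMS]; };
struct env     { float scale; };                      /* hypothetical uniform */

/* Per-element shader, written in terms of a single element as in HLSL. */
static void scale_element(const struct element *in, struct element *out,
                          const struct env *env) {
    out->x = in->x * env->scale;
    out->y = in->y * env->scale;
    out->z = in->z * env->scale;
    out->w = in->w;
}

/* Chunk-level thread code: applies the same shader to every element of every
 * pre-reserved input chunk, writing the matching output chunk. */
void shaderFn(const struct chunk *in_chunks[], struct chunk *out_chunks[],
              int num_chunks, const struct env *env) {
    for (int c = 0; c < num_chunks; c++) {
        out_chunks[c]->count = in_chunks[c]->count;
        for (int i = 0; i < in_chunks[c]->count; i++)
            scale_element(&in_chunks[c]->el[i], &out_chunks[c]->el[i], env);
    }
}
```

Because each element is independent and the function keeps no state between calls, the inner loop is the part that could map onto S-wide vector instructions.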

Page 26: Graphics on GRAMPS

Assemble threads

Assemble threads build chunks
Access queue data via windows
Commit/reserve/consume may block thread

    void assembleFn(qwin* in_win[],
                    qwin* out_win[],
                    const env* env)

Page 27: Graphics on GRAMPS

Ex: primitive assembly

[Figure: prePrimAssemble queue → primAssemble → prePrimShade queue]

Input chunks = 16 verts
Output chunks = 16 prims
Prim structure depends on type of prim
– Points, lines, triangles, triangles w/ adj, etc
Creating prims from verts dependent on topology
– Strips or lists
– Triangle strip: data for output chunk comes from multiple input chunks
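The triangle strip case shows why primitive assembly needs a stateful thread: the first triangles built from one input chunk reuse the last two vertices of the previous chunk. The following sketch carries that state across calls. It is an illustration under assumptions: integer vertex ids stand in for full vertices, `struct strip_state` and `assemble_strip` are hypothetical names, and the winding fix (swapping the first two vertices on odd triangles) is one common convention that is not claimed to match Direct3D's exact vertex order.

```c
#define CHUNK_VERTS 16

struct vert_chunk { int count; int vert[CHUNK_VERTS]; };  /* ids stand in for vertices */
struct tri        { int v[3]; };

/* State the assemble thread keeps between input chunks. */
struct strip_state {
    int prev[2];   /* the last two strip vertices seen */
    int seen;      /* total vertices consumed so far */
};

/* Consumes one input chunk and writes the triangles it completes to out.
 * Returns the number of triangles written, or -1 if out is too small. */
int assemble_strip(struct strip_state *s, const struct vert_chunk *in,
                   struct tri *out, int max_out) {
    if (max_out < in->count) return -1;
    int n = 0;
    for (int i = 0; i < in->count; i++) {
        int v = in->vert[i];
        if (s->seen >= 2) {
            /* Swap the first two vertices on odd triangles so every
             * triangle in the strip keeps the same winding. */
            int odd = (s->seen - 2) & 1;
            out[n].v[0] = s->prev[odd ? 1 : 0];
            out[n].v[1] = s->prev[odd ? 0 : 1];
            out[n].v[2] = v;
            n++;
        }
        s->prev[0] = s->prev[1];
        s->prev[1] = v;
        s->seen++;
    }
    return n;
}
```

With a zero-initialized state, a first chunk of 16 vertices yields 14 triangles, and a following chunk yields one triangle per vertex because the two carried vertices complete each new one.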

Page 28: Graphics on GRAMPS

Ex: frag assembly (rast)

    for (each input triangle) {
        add triangle uniform data to chunk
        while (chunk not full && triangle not done) {
            rasterize next tile of quads…
            for (each nonempty quad) {
                add 4 fragments to chunk
                add quad description per chunk
            }
        }
        if (chunk is full) {
            qwin_out = commit(qwin_out, 1);
            grow window with reserve() if necessary…
        }
    }

Building chunks:
1. Compact valid quads
2. Data at various frequencies

Page 29: Graphics on GRAMPS

Execution: Tier 1

[Figure: Tier 1 scheduling. Labels: core threads T 0 through T T-1 with L1 $; events Thread_Done() (implicit commit), Produce(), Reserve(), Commit(); a set of queues; shader threadEnvs and assemble threadEnvs; commands ShaderThr dispatch and AssembleThr resume; Tier 1 to Tier 0 FIFO]

Page 30: Graphics on GRAMPS

Execution: Tier 0

[Figure: the Tier 1 to Tier 0 FIFO feeds the Tier 0 Scheduler on a core with threads 0 through T-1, R registers, SIMD ALUs 0 through S-1, and an L1 data cache (or scratchpad)]

Each cycle: round robin runnable threads
Thread stalls: place on wait list
When thread completes:
1. Pull next thread from fifo, assign to empty thread slot
2. Send completion message to tier 0
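The Tier 0 behavior described here can be sketched as a small per-cycle simulation. This is an illustrative model only: the slot count, the FIFO capacity, the `struct work` descriptor with a fixed cycle count, and the `completed` counter standing in for the completion message are all assumptions, and the names `core_enqueue` and `core_cycle` are hypothetical.

```c
#include <stdbool.h>

#define NUM_SLOTS 4    /* T hardware thread contexts; value chosen for illustration */
#define FIFO_CAP  16

enum slot_state { SLOT_EMPTY, SLOT_RUNNABLE, SLOT_WAITING };

struct work { int thread_id; int cycles; };   /* hypothetical dispatch record */

struct slot { enum slot_state state; struct work work; };

struct core {
    struct slot slots[NUM_SLOTS];
    struct work fifo[FIFO_CAP];   /* stands in for the Tier 1 to Tier 0 FIFO */
    int fifo_head, fifo_count;
    int next;                     /* round-robin cursor */
    int completed;                /* stands in for completion messages */
};

bool core_enqueue(struct core *c, struct work w) {
    if (c->fifo_count == FIFO_CAP || w.cycles <= 0) return false;
    c->fifo[(c->fifo_head + c->fifo_count) % FIFO_CAP] = w;
    c->fifo_count++;
    return true;
}

/* Pull the next thread from the FIFO into a slot, or leave the slot empty. */
static void fill_slot(struct core *c, struct slot *s) {
    if (c->fifo_count == 0) { s->state = SLOT_EMPTY; return; }
    s->work = c->fifo[c->fifo_head];
    c->fifo_head = (c->fifo_head + 1) % FIFO_CAP;
    c->fifo_count--;
    s->state = SLOT_RUNNABLE;
}

/* Simulates one clock. Returns the id of the thread that issued,
 * or -1 if no thread was runnable. */
int core_cycle(struct core *c) {
    for (int i = 0; i < NUM_SLOTS; i++)
        if (c->slots[i].state == SLOT_EMPTY) fill_slot(c, &c->slots[i]);

    for (int k = 0; k < NUM_SLOTS; k++) {
        int i = (c->next + k) % NUM_SLOTS;
        struct slot *s = &c->slots[i];
        if (s->state != SLOT_RUNNABLE) continue;   /* skips empty and waiting slots */
        c->next = (i + 1) % NUM_SLOTS;
        int id = s->work.thread_id;
        if (--s->work.cycles == 0) {               /* thread completed */
            c->completed++;
            fill_slot(c, s);
        }
        return id;
    }
    return -1;
}
```

A stalled thread would be modeled by setting its slot to `SLOT_WAITING`, which removes it from the round-robin until something sets it back to `SLOT_RUNNABLE`.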

Page 31: Graphics on GRAMPS

Validation

“Fat enough” cores for assemble threads can deliver sufficient FLOPS
Assemble threads can keep compute cores + fixed-function units busy
Can give up domain-specific heuristics in the scheduling