Graphics on GRAMPS
DESCRIPTION
Graphics on GRAMPS. Jeremy Sugerman, Kayvon Fatahalian. Background. Context: a broader research investigation into generalizing GPU/Cell/"compute" cores and combining them with CPUs. Fundamental belief: real data-parallel apps still have performance-critical non-data-parallel pieces.
FLASHG 15 Oct 2007
Graphics on GRAMPS
Jeremy Sugerman, Kayvon Fatahalian
Background
Context: a broader research investigation into generalizing GPU/Cell/"compute" cores and combining them with CPUs.
Fundamental beliefs:
– Real data-parallel apps still have performance-critical non-data-parallel pieces
– Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
– Queues are an excellent idiom to capture producer-consumer parallelism – thread and data
– Fixed-function execution units are not a problem, but fixed control paths are
Compute Cores
CPUs are designed for single threads per core – minimal FLOPS per core.
Compute cores are designed for lots of math per core:
– Many "threads" per core
– Sometimes wider SIMD per thread
– SIMD width * # hardware threads ops per core
And more compute cores than CPU cores fit per chip.
Many examples: GPU, Cell, Niagara, Larrabee
Simplified Direct3D Pipeline
Application launches some drawing…
1. Vertex Assembly (Fixed, Non-Data Parallel)
2. Vertex Processing (Programmable, Data Parallel)
3. Primitive Assembly (Fixed, Non-Data Parallel)
4. Primitive Processing (Programmable, Data Parallel)
5. Fragment Assembly (Fixed, Non-Data Parallel)
6. Fragment Processing (Programmable, Data Parallel)
7. Pixel / Image Assembly (Fixed, Non-Data Parallel)
Only the data-parallel stages are programmable!
Direct3D Pipeline Properties
There is a reason only the data-parallel stages are programmable.
'Shader' stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between elements.
'Assembly' stages also run on many elements, but they have inter-element dependencies:
– State can be remembered (vertex caching)
– Inputs can be used by multiple outputs (strips)
Programmable 'Assembly' requires heavier (more serial) threads than 'Shaders'.
Question
Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
This does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…).
This Talk
GRAMPS: our current model for programming compute cores.
Implementing Direct3D 10 "in software" with GRAMPS.
(Potentially) thoughts on how REYES and ray tracers map to GRAMPS.
No explicit discussion of heterogeneous cores.
No fancy scheduling algorithms (yet?)
Example: Simple 3D Pipeline
[Pipeline diagram] Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels
GRAMPS
General Runtime/Architecture for Multicore Parallel Systems.
Models execution as a graph of queues connected by threads.
The graph is specified by the host program.
Simulator for exploring compute cores:
– Currently conflates "hardware" and runtime
– # of cores, thread contexts, and SIMD width are all parameters
Simple GRAMPS core
[Core diagram] T hardware threads (Thread 0 … Thread T-1) share S SIMD ALUs (ALU 0 … ALU S-1), R registers per thread, and an L1 data cache (or scratchpad).
T – threads/core, S – SIMD ALUs/core, R – registers/thread
One thread runs in each clock.
Threads issue vector instructions (think S-wide SSE).
D3D10 Setup
1. App defines 3 shading environments
– Vertex, geometry, fragment
– Attach programs and resources
2. App configures fixed-function units
– Fixed number of "modes"
– Attach resources
3. App submits work (vertices) to the pipeline
4. Graphics runtime executes until completion
GRAMPS Setup
1. App defines a set of queues
2. App defines a set of thread environments
3. App attaches queues as thread inputs and outputs
4. App bootstraps computation by inserting data into a queue
5. Runtime executes threads until completion
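The five setup steps above can be sketched in C. This is a hypothetical illustration only: the slide does not show the real GRAMPS API, so every type and function name here (`Queue`, `ThreadEnv`, `attach_input`, `bootstrap`, …) is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the five GRAMPS setup steps; these type and
 * function names are invented for illustration, not the real API. */

#define MAX_QUEUES 8

typedef struct Queue {
    int id;
    int num_items;          /* items bootstrapped into the queue */
} Queue;

typedef struct ThreadEnv {
    const char *name;       /* e.g. "vtxShade" */
    Queue *inputs[MAX_QUEUES];
    Queue *outputs[MAX_QUEUES];
    int num_inputs, num_outputs;
} ThreadEnv;

/* Step 1: app defines a set of queues. */
static void queue_init(Queue *q, int id) { q->id = id; q->num_items = 0; }

/* Steps 2-3: app defines thread environments and attaches queues. */
static void env_init(ThreadEnv *t, const char *name) {
    t->name = name; t->num_inputs = 0; t->num_outputs = 0;
}
static void attach_input(ThreadEnv *t, Queue *q)  { t->inputs[t->num_inputs++] = q; }
static void attach_output(ThreadEnv *t, Queue *q) { t->outputs[t->num_outputs++] = q; }

/* Step 4: app bootstraps computation by inserting data into a queue.
 * Step 5 (running threads to completion) is the runtime's job. */
static void bootstrap(Queue *q, int n) { q->num_items += n; }
```

Wiring a single shader stage between two queues would then look like: create the queues, create the environment, attach, and push the initial work.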
GRAMPS Entities: Execution
Threads come in three flavors: Assemble, Shader, Fixed.
– Assemble: stateful, akin to a regular thread
– Fixed: special-purpose hardware wrapped to appear as an Assemble thread
– Shader: stateless and data parallel
GRAMPS Entities: Data
Queues for producer-consumer parallelism.
Queues for aggregating coherent work.
Queues support push and reserve/commit for in-place assembly.
Chunks are the units / granularity at which queues are manipulated.
GRAMPS Scheduling
GRAMPS assigns threads to hardware contexts
– Based on the graph and the current queue contents
Tiered scheduling model:
– Tier-0: trivially puts threads onto hardware threads
– Tier-1: builds schedules for Tier-0
– Tier-N: arbitrarily clever. Doesn't exist.
System (how it works today)
D3D10 on GRAMPS
[Execution graph] index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue → prePrimAssemble queue → primAssemble → prePrimShade queue → primShade → postPrimShade queue → rastAssemble → preRast queue → tri setup / clip / cull → tri queue 0 … tri queue N
Each tri queue then feeds its own chain: rasterize → preFragShade queue → fragShade → postFragShade queue → blend / ztest
Legend: vtxShade, primShade, fragShade = shader threads; idxVtxAssemble, primAssemble, rastAssemble = assemble threads; tri setup / clip / cull, rasterize, blend / ztest = fixed function in GPU
Internal Queues
Queues are just memory plus a state struct (see below).
– For now: queues are finite
– Queues are a contiguous array of chunks
Chunks = granularity of manipulation

queue {
    BYTE ptr[num_chunks * chunk_byte_width];
    int  num_chunks;
    int  chunk_byte_width;
    int  head;
    int  tail;
    int  reclaim;
    bool done[num_chunks];
};
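One concrete reading of this struct is a finite ring of fixed-size chunks. The sketch below is an assumption-laden simplification: sizes are compile-time constants, the `reclaim`/`done` bookkeeping is omitted, and a `count` field (not on the slide) replaces the head/tail-only fullness test.

```c
#include <assert.h>
#include <string.h>
#include <stdbool.h>

/* A minimal, concrete reading of the slide's queue struct, assuming the
 * head/tail indices count chunks and wrap around the contiguous array.
 * reclaim/done are dropped and sizes are constants to keep it simple. */

#define NUM_CHUNKS 4
#define CHUNK_BYTE_WIDTH 64

typedef struct queue {
    unsigned char ptr[NUM_CHUNKS * CHUNK_BYTE_WIDTH];
    int head;                 /* next chunk to consume */
    int tail;                 /* next chunk to produce into */
    int count;                /* chunks currently in flight */
} queue;

static void q_init(queue *q) { q->head = 0; q->tail = 0; q->count = 0; }

/* Producer side: copy one chunk in ("push"); fails when the finite
 * queue is full. */
static bool q_push(queue *q, const unsigned char chunk[CHUNK_BYTE_WIDTH]) {
    if (q->count == NUM_CHUNKS) return false;
    memcpy(q->ptr + q->tail * CHUNK_BYTE_WIDTH, chunk, CHUNK_BYTE_WIDTH);
    q->tail = (q->tail + 1) % NUM_CHUNKS;
    q->count++;
    return true;
}

/* Consumer side: release the oldest chunk. */
static bool q_pop(queue *q, unsigned char chunk[CHUNK_BYTE_WIDTH]) {
    if (q->count == 0) return false;
    memcpy(chunk, q->ptr + q->head * CHUNK_BYTE_WIDTH, CHUNK_BYTE_WIDTH);
    q->head = (q->head + 1) % NUM_CHUNKS;
    q->count--;
    return true;
}
```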
Ex: GRAMPS has chunks
[Graph fragment] index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue
index_queue chunks contain vertex indices
preVtxShade_queue chunks contain 16 pre-transformed vertices
postVtxShade_queue chunks contain 16 transformed vertices
Ex: GRAMPS has chunks
[Graph fragment] rasterize → preFragShade queue → fragShade
preFragShade_queue chunks contain:
– Interpolated inputs for 16 fragments
– A liveness mask per fragment
– x,y position per quad
– Uniform data shared across all fragments
Queue API
A window is a view into a contiguous range of chunks for assemble threads.
Access is symmetric for producing and consuming.

qwin {
    BYTE* ptr;
    int   num;
    int   id;
};

Shader threads just have "push".
Queue manipulation
All threads:
void produce()  – "push"
Assemble threads only:
qwin* reserve(qwin* q, int num_chunks)
qwin* commit(qwin* q, int num_chunks)
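The reserve/commit idiom can be sketched as bookkeeping over a window: reserve claims a contiguous range of fresh chunks, commit publishes the oldest chunks and slides the window forward. The counters below (`reserved`, `committed`) are invented for illustration; the slide does not show the runtime's internals.

```c
#include <assert.h>

/* Sketch of window-based reserve/commit: reserve claims a contiguous
 * range of chunks, commit publishes them. The bookkeeping here is an
 * assumption, not the real GRAMPS implementation. */

#define NUM_CHUNKS 8

typedef struct qwin {
    int first;   /* index of the first chunk in the window */
    int num;     /* chunks currently reserved in the window */
} qwin;

typedef struct wqueue {
    int reserved;   /* chunks handed out via reserve() */
    int committed;  /* chunks made visible via commit() */
} wqueue;

/* Grow the caller's window by num_chunks of fresh space; NULL on full. */
static qwin *reserve(wqueue *q, qwin *w, int num_chunks) {
    if (q->reserved + num_chunks > NUM_CHUNKS) return 0;  /* queue full */
    if (w->num == 0) w->first = q->reserved;
    q->reserved += num_chunks;
    w->num += num_chunks;
    return w;
}

/* Publish the oldest num_chunks of the window and slide it forward. */
static qwin *commit(wqueue *q, qwin *w, int num_chunks) {
    if (num_chunks > w->num) return 0;
    q->committed += num_chunks;
    w->first += num_chunks;
    w->num -= num_chunks;
    return w;
}
```

In a real runtime a failed reserve would block the assemble thread rather than return NULL, matching the blocking semantics described on the assemble-threads slide.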
Internal threads
A ThreadEnv defines a "type" of thread:

ThreadEnv {
    type = {shader, assemble, fixed-func}
    program code
    uniforms / constant data
    sampler / texture / resource id bindings
    list of input queues
    list of output queues
};
Shader threads
The shading language is unchanged (HLSL):
– Shaders are still written in terms of single elements
– Compilation produces code that operates on chunks

void hlsl_likefn(const element* inputEl,
                 element* outputEl,
                 const sampler foo,
                 const tex3d tex)
Internal shader threads
Shader thread code processes chunks.
Input:
– GRAMPS pre-reserved chunks from the in/out queues
– Environment info (uniforms, constants, etc.)

void shaderFn(const chunk* in_chunks[],
              chunk* out_chunks[],
              const env* env)

Dispatched shader threads run to completion. Completion implies:
– in_chunks are released
– out_chunks are committed
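The per-element-to-chunk compilation step can be sketched as a wrapper loop: the author writes a single-element body, and the chunk-level entry point runs it over all 16 elements of a chunk. The element layout and the `scale_elem` body are invented for illustration; only the chunk-granularity `shaderFn` shape comes from the slide.

```c
#include <assert.h>

/* Sketch of how compilation turns a per-element HLSL-like function into
 * chunk-granularity code: the wrapper loops the single-element body over
 * every element of the pre-reserved input/output chunks. Element and
 * chunk layouts are invented for illustration. */

#define ELEMS_PER_CHUNK 16

typedef struct element { float x, y, z, w; } element;
typedef struct chunk   { element elems[ELEMS_PER_CHUNK]; } chunk;

/* Per-element body, as the shader author writes it. */
static void scale_elem(const element *in, element *out) {
    out->x = in->x * 2.0f; out->y = in->y * 2.0f;
    out->z = in->z * 2.0f; out->w = in->w;
}

/* Chunk-level entry point the runtime dispatches (cf. shaderFn). */
static void shaderFn(const chunk *in_chunk, chunk *out_chunk) {
    for (int i = 0; i < ELEMS_PER_CHUNK; i++)
        scale_elem(&in_chunk->elems[i], &out_chunk->elems[i]);
}
```

Because the body is stateless between elements, the runtime is free to map the loop onto the core's S-wide SIMD ALUs.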
Assemble threads
Assemble threads build chunks.
They access queue data via windows.
commit/reserve/consume may block the thread.

void assembleFn(qwin* in_win[],
                qwin* out_win[],
                const env* env)
Ex: primitive assembly
Input chunks = 16 verts; output chunks = 16 prims.
The primitive structure depends on the type of primitive:
– Points, lines, triangles, triangles w/ adjacency, etc.
Creating prims from verts depends on the topology:
– Strips or lists
– Triangle strip: data for an output chunk can come from multiple input chunks
[Graph fragment] prePrimAssemble queue → primAssemble → prePrimShade queue
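The strip case can be sketched concretely: each output triangle after the first reuses the previous two vertices, which is why one output chunk may draw on vertices from more than one input chunk. The index layout below is an illustrative assumption; only the strip topology rule is standard.

```c
#include <assert.h>

/* Sketch of strip-to-list primitive assembly. A strip of n vertices
 * yields n - 2 triangles; each new triangle reuses the previous two
 * vertices. Data layout invented for illustration. */

typedef struct tri { int v[3]; } tri;

/* Assemble triangles from a strip of n vertex indices; returns the
 * number of triangles written. */
static int assemble_strip(const int *verts, int n, tri *out) {
    int count = 0;
    for (int i = 2; i < n; i++) {
        /* Swap winding on odd triangles so all face the same way. */
        out[count].v[0] = verts[i - 2];
        out[count].v[1] = (i % 2 == 0) ? verts[i - 1] : verts[i];
        out[count].v[2] = (i % 2 == 0) ? verts[i] : verts[i - 1];
        count++;
    }
    return count;
}
```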
Ex: frag assembly (rast)

for (each input triangle) {
    add triangle uniform data to chunk
    while (chunk not full && triangle not done) {
        rasterize next tile of quads…
        for (each nonempty quad) {
            add 4 fragments to chunk
            add quad description to chunk
        }
    }
    if (chunk is full) {
        qwin_out = commit(qwin_out, 1);
        // grow window with reserve() if necessary…
    }
}

Building chunks:
1. Compact valid quads
2. Data at various frequencies
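The "compact valid quads" step can be sketched as follows: only nonempty quads are appended to the fragment chunk, with a 4-bit liveness mask recording coverage within each quad. The struct layout and `add_quad` helper are invented for illustration; only the 16-fragments-per-chunk, mask-per-fragment organization comes from the earlier chunk slide.

```c
#include <assert.h>

/* Sketch of quad compaction during fragment assembly: empty quads are
 * skipped, and each kept quad carries a 4-bit fragment liveness mask.
 * Layout invented for illustration. */

#define QUADS_PER_CHUNK 4   /* 16 fragments per chunk / 4 per quad */

typedef struct fquad {
    int x, y;           /* quad position */
    unsigned mask;      /* 4-bit fragment liveness mask */
} fquad;

typedef struct fchunk {
    fquad quads[QUADS_PER_CHUNK];
    int num_quads;
} fchunk;

/* Append a quad if any fragment is live; returns 0 when the chunk is
 * full (the caller would then commit and reserve a fresh chunk). */
static int add_quad(fchunk *c, int x, int y, unsigned mask) {
    if (mask == 0) return 1;                 /* empty quad: skip it */
    if (c->num_quads == QUADS_PER_CHUNK) return 0;
    c->quads[c->num_quads].x = x;
    c->quads[c->num_quads].y = y;
    c->quads[c->num_quads].mask = mask & 0xFu;
    c->num_quads++;
    return 1;
}
```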
Execution: Tier 1
[Scheduler diagram] Shader and assemble ThreadEnvs, connected by queues, feed the Tier 1 scheduler; Tier 1 issues shader-thread dispatches and assemble-thread resumes into a Tier 1 → Tier 0 FIFO for the core (threads T 0 … T T-1 sharing an L1 $). Events driving Tier 1: Thread_Done() (implicit commit), Produce(), Reserve(), Commit().
Execution: Tier 0
[Core diagram] Threads 0 … T-1 over SIMD ALUs 0 … S-1, R registers/thread, and an L1 data cache (or scratchpad), fed by the Tier 0 scheduler from the Tier 1 → Tier 0 FIFO.
Each cycle: round-robin among the runnable threads.
When a thread stalls: place it on a wait list.
When a thread completes:
1. Pull the next thread from the FIFO and assign it to the empty thread slot
2. Send a completion message up to Tier 1
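The Tier-0 policy can be sketched as a round-robin pick over hardware thread slots plus a refill-on-completion rule. The slot states and the FIFO model below are invented simplifications of the slide's description.

```c
#include <assert.h>

/* Sketch of the Tier-0 policy: round-robin over runnable hardware
 * thread slots, stalled threads wait, and a completed slot is refilled
 * from the Tier 1 -> Tier 0 FIFO. States and the FIFO are modeled
 * minimally; this is not the simulator's actual code. */

#define NUM_SLOTS 4

typedef enum { EMPTY, RUNNABLE, STALLED } slot_state;

typedef struct tier0 {
    slot_state slots[NUM_SLOTS];
    int last;                  /* slot run on the previous cycle */
    int fifo_pending;          /* threads waiting in the tier-1 FIFO */
} tier0;

/* One cycle: pick the next runnable slot after `last`; -1 if none. */
static int t0_pick(tier0 *s) {
    for (int i = 1; i <= NUM_SLOTS; i++) {
        int slot = (s->last + i) % NUM_SLOTS;
        if (s->slots[slot] == RUNNABLE) { s->last = slot; return slot; }
    }
    return -1;
}

/* A thread completed: free its slot, then refill from the FIFO. */
static void t0_complete(tier0 *s, int slot) {
    s->slots[slot] = EMPTY;
    if (s->fifo_pending > 0) {         /* pull next thread from the FIFO */
        s->fifo_pending--;
        s->slots[slot] = RUNNABLE;
    }
    /* (A completion message would also be sent up to Tier 1 here.) */
}
```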
Validation
– "Fat enough" cores for assemble threads can still deliver sufficient FLOPS
– Assemble threads can keep the compute cores + fixed-function units busy
– We can give up domain-specific heuristics in scheduling