TRANSCRIPT

Page 1: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

University of Michigan

December 3, 2012

Guest Speakers Today: Daya Khudia and Mehrzad Samadi

Page 2: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Announcements & Reading Material

Exams graded and will be returned in Wednesday's class

This class
» "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – Research Topic 3: Security
» "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005.

Page 3: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stream Graph Modulo Scheduling

Page 4: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs

Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data – DMA engine independent of PEs

Filters = operations, cores = function units

[Figure: Cell processor block diagram – eight SPEs (SPE0–SPE7), each an SPU with a 256 KB local store (LS) and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]

Page 5: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Preliminaries

Synchronous Data Flow (SDF) [Lee '87], StreamIt [Thies '02]

Stateless filter:

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Push and pop items from input/output FIFOs.

Stateful filter: the weights become a mutable field that is updated across executions of the work function, e.g.

int wgts[N];
...
wgts = adapt(wgts);
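To make the FIFO semantics concrete, here is a minimal plain-C sketch of what one firing of the FIR work function does against its input and output channels. This is not StreamIt syntax; the buffer sizes, names, and the driver in main are made up for illustration.

#include <stdio.h>

#define N 4                          /* hypothetical number of taps                 */
static int in_buf[64], out_buf[64];  /* hypothetical FIFOs backing the channels     */
static int in_head = 0, out_tail = 0;

static int  peek(int i) { return in_buf[in_head + i]; }  /* read without consuming  */
static void pop(void)   { in_head++; }                   /* consume one input item  */
static void push(int v) { out_buf[out_tail++] = v; }     /* produce one output item */

/* one firing of the FIR filter: peeks N items, pops 1, pushes 1 */
static void fir_work(const int wgts[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
    push(sum);
    pop();
}

int main(void) {
    int wgts[N] = {1, 2, 3, 4};
    for (int i = 0; i < 8; i++) in_buf[i] = i;   /* fill the input FIFO             */
    for (int f = 0; f < 4; f++) fir_work(wgts);  /* four firings                    */
    for (int f = 0; f < 4; f++) printf("%d ", out_buf[f]);
    printf("\n");                                /* prints: 20 30 40 50             */
    return 0;
}

Because each firing reads only the FIFO and its parameters, the filter is stateless and the compiler is free to replicate (fission) it; the stateful variant above, which carries wgts across firings, cannot be split that way.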

Page 6: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Overview

[Figure: the single-PE schedule of the stream graph (time T1) is split across PE0–PE3 and coarse-grain software pipelined; DMA transfers between stages overlap with computation, the pipeline has a prologue and an epilogue around the steady state, and the four-PE iteration time T4 gives a speedup of roughly T1/T4 ≈ 4]

Page 7: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Phases

Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation

Page 8: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Processor Assignment: Maximizing Throughput

• Assigns each filter to a processor (four processing elements PE0–PE3 in the example; w_i is the workload of filter i, a_ij = 1 if filter i is assigned to PE j, W: workload)

Minimize II (the initiation interval), subject to:
  Σ_j a_ij = 1           for all filters i = 1, …, N   (each filter goes to exactly one PE)
  Σ_i w_i · a_ij ≤ II    for all PEs j = 1, …, P       (no PE is loaded beyond II)

[Figure: example stream graph with six filters A–F whose workloads (20, 20, 20, 30, 50, 30) total 170; on a single PE, T1 = 170, while a balanced assignment onto the four PEs achieves the minimum II of 50 (T2 = 50) – balanced workload, maximum throughput, speedup T1/T2 = 3.4]
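As a sanity check on the formulation, a small brute-force search (a sketch, not the ILP solver the paper uses) can enumerate every assignment of the six example filters to four PEs and report the minimum achievable II. The weights below are taken from the slide's example; everything else is made up.

#include <stdio.h>

#define NF 6   /* filters A..F */
#define NP 4   /* processing elements */

int main(void) {
    int w[NF] = {20, 20, 20, 30, 50, 30};        /* workloads from the example     */
    int T1 = 0;
    for (int f = 0; f < NF; f++) T1 += w[f];     /* single-PE time: 170            */
    int best_II = T1;

    long total = 1;                              /* NP^NF candidate assignments    */
    for (int f = 0; f < NF; f++) total *= NP;

    for (long code = 0; code < total; code++) {
        int load[NP] = {0};
        long c = code;
        for (int f = 0; f < NF; f++) { load[c % NP] += w[f]; c /= NP; }
        int II = 0;                              /* II = maximum PE load           */
        for (int p = 0; p < NP; p++) if (load[p] > II) II = load[p];
        if (II < best_II) best_II = II;
    }
    printf("minimum II = %d (T1 = %d, speedup = %.1f)\n",
           best_II, T1, (double)T1 / best_II);   /* prints: minimum II = 50 ... 3.4 */
    return 0;
}

The search confirms that 50 is both achievable and a lower bound here (no PE can do better than the heaviest single filter).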

Page 9: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Need More Than Just Processor Assignment

Assign filters to processors
» Goal: equal work distribution

Graph partitioning? Bin packing?

[Figure: original stream program with filters A, B, C, D (workloads 5, 5, 40, and 10; total 60) on two PEs – the heavy filter dominates, so the best assignment (B alone on one PE; A, C, D on the other) gives speedup = 60/40 = 1.5. In the modified stream program, B is fissioned into B1 and B2 behind a split (S) and join (J); the PEs can then be balanced (A, S, B1, D on one PE; B2, C, J on the other), giving speedup = 60/32 ≈ 2]

Page 10: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Filter Fission Choices

[Figure: several ways to fission and distribute the filters across PE0–PE3 – can any choice reach a speedup of ~4?]

Page 11: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Integrated Fission + PE Assignment

Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
» Objective function: minimize the maximal load on any PE

Result
» Number of times to "split" each filter
» Filter → processor mapping

Page 12: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Step 2: Forming the Software Pipeline

To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped

Processor assignment alone is insufficient information.

[Figure: with filter A on PE0 and filters B, C on PE1, running the assigned chunks back-to-back does not help – B(i) cannot start until A(i)'s output has been transferred. Software pipelining staggers the iterations so that A(i+2) on PE0 overlaps with B(i) on PE1, with the A→B DMA transfer in between]

Page 13: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stage Assignment

Producer i and consumer j on the same PE:
» S_j ≥ S_i – preserve causality (producer–consumer dependence)

Producer i on PE 1, consumer j on PE 2:
» The DMA gets its own stage, S_DMA > S_i
» The consumer gets S_j = S_DMA + 1 – communication/computation overlap

Data-flow traversal of the stream graph
» Assign stages using the above two rules
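A minimal C sketch of how such a traversal could apply the two rules. The four-filter graph, its weights, and the PE mapping are made up for illustration; this is not the Streamroller implementation.

#include <stdio.h>
#include <string.h>

#define NF 4
#define NE 4

int pe[NF]   = {0, 1, 1, 0};   /* hypothetical mapping: A,D on PE0; B,C on PE1        */
int esrc[NE] = {0, 0, 1, 2};   /* edges A->B, A->C, B->D, C->D, listed producer-first */
int edst[NE] = {1, 2, 3, 3};
int stage[NF];

int main(void) {
    memset(stage, 0, sizeof stage);
    /* data-flow traversal: every edge is visited after its producer's stage is final */
    for (int e = 0; e < NE; e++) {
        int i = esrc[e], j = edst[e];
        if (pe[i] == pe[j]) {
            /* same PE: S_j >= S_i preserves the producer-consumer dependence         */
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            /* different PEs: the DMA gets its own stage (S_DMA > S_i),
               and the consumer starts at S_DMA + 1 to overlap transfer and compute   */
            int s_dma = stage[i] + 1;
            if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
        }
    }
    for (int f = 0; f < NF; f++)
        printf("filter %c: PE %d, stage %d\n", "ABCD"[f], pe[f], stage[f]);
    return 0;
}

For this made-up graph the traversal puts A in stage 0, B and C in stage 2 (their DMAs in stage 1), and D in stage 4 – the same five-stage shape as the example on the next slide.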

Page 14: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stage Assignment Example

[Figure: stages for the fissioned example spread across PE 0 and PE 1 – Stage 0: A, S, B1; Stage 1: DMA transfers; Stage 2: B2, C, J; Stage 3: DMA transfers; Stage 4: D]

Page 15: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs

One thread / SPE

Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt the kernel-only code schema of a modulo schedule

Page 16: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Complete Example

Code for one SPE; the stage[] predicates select which pipeline stages are active in each iteration:

void spe1_work() {
  char stage[5] = {0};
  stage[0] = 1;
  int i;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    barrier();
  }
}
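As written on the slide, only stage[0] ever becomes true; presumably the full schema also advances the stage predicates each iteration so the pipeline fills and drains. A hedged sketch of that missing piece – the shift logic and the MAX+4 trip count are assumptions, not taken from the paper:

  /* hypothetical completion: run enough iterations to drain the 5-stage pipeline */
  for (i = 0; i < MAX + 4; i++) {
    if (stage[0]) { A(); S(); B1(); }
    /* ... stages 1-4 guarded exactly as above ... */
    barrier();
    /* shift predicates so work started in iteration i reaches stage s at iteration i+s */
    for (int s = 4; s > 0; s--) stage[s] = stage[s - 1];
    stage[0] = (i + 1 < MAX);   /* keep injecting new iterations for MAX iterations */
  }

The first few iterations form the prologue (stages light up one by one) and the last four form the epilogue (stages drain), with the kernel-only body shared by all of them.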

[Figure: execution timeline across SPE1, DMA1, SPE2, and DMA2 – A, S, B1 execute; their outputs are DMA-transferred (A→C, S→B2, B1→J); B2, J, C execute; the J→D and C→D transfers follow; and D executes. Successive iterations overlap, filling the software pipeline]

Page 17: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS (ILP) vs. Greedy

[Chart: relative speedup (y-axis 0–9) on the benchmarks bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, and radar, comparing ILP partitioning against greedy partitioning (MIT method, ASPLOS '06) and exposed DMA]

• Solver time < 30 seconds for 16 processors

Page 18: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Conclusions

Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining

Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)

Scheduling framework
» Trade off memory space vs. load balance
» Memory-constrained (embedded) systems vs. cache-based systems

Page 19: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Discussion Points

Is it possible to convert stateful filters into stateless ones?

What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized?

Now you have seen 3 forms of software pipelining:
» 1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?

Page 20: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Compilation for GPUs

Page 21: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Why GPUs?

Page 22: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Efficiency of GPUs

High memory bandwidth
» GTX 285: 159 GB/s vs. i7: 32 GB/s

High flop rate
» Single precision: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» Double precision: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS

High flops per watt
» GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W

High flops per dollar
» GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$

Page 23: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

GPU Architecture

[Figure: the GPU contains 30 streaming multiprocessors (SM 0 – SM 29), each with its own register file, shared memory, and 8 scalar cores (0–7); the SMs reach global (device) memory through an interconnection network, and the device connects to the CPU and host memory over PCIe through a bridge]
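These architectural parameters can be read back at runtime with the standard CUDA runtime API; a short sketch (the numbers it prints will of course differ from the GTX 285 figures above, depending on the GPU you run it on):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           /* properties of device 0            */
    printf("name:                  %s\n",  prop.name);
    printf("multiprocessors (SMs): %d\n",  prop.multiProcessorCount);
    printf("warp size:             %d\n",  prop.warpSize);
    printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:   %d\n",  prop.regsPerBlock);
    printf("global memory:         %zu bytes\n", prop.totalGlobalMem);
    return 0;
}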

Page 24: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

CUDA

"Compute Unified Device Architecture"

General-purpose programming model
» User kicks off batches of threads on the GPU

Advantages of CUDA
» Interface designed for compute – graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
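As a concrete, minimal illustration of "kicking off a batch of threads" – a sketch of a vector-add kernel and its launch, with explicit device memory management as in the bullet above. The names and sizes are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per thread        */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* explicit GPU memory management: allocate device buffers and copy inputs over   */
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* kick off a batch of threads: a grid of blocks, 256 threads per block           */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);                   /* prints 3.000000               */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The <<<blocks, threadsPerBlock>>> launch configuration is exactly the grid-of-blocks structure shown in the programming-model figure on the next slide.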

Page 25: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Programming Model

[Figure: over time, the host launches Kernel 1 and then Kernel 2 on the device; each kernel executes as a grid of thread blocks (Grid 1, Grid 2)]

Page 26: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

GPU Scheduling

[Figure: the thread blocks of Grid 1 are distributed by the hardware scheduler across the SMs (SM 0 through SM 30 in the figure), each SM with its own registers, shared memory, and 8 scalar cores]

Page 27: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Warp Generation

[Figure: Blocks 0–3 are assigned to SM0 (with its registers, shared memory, and cores 0–7); within a block, threads are grouped by thread ID into warps of 32 – threads 0–31 form Warp 0, threads 32–63 form Warp 1]
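A small device-side sketch of the same grouping (the warp size of 32 is hard-coded to match the slide; portable code would read the built-in warpSize instead):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void warpInfo(void) {
    int tid  = threadIdx.x;            /* thread ID within the block                     */
    int warp = tid / 32;               /* threads 0-31 -> warp 0, threads 32-63 -> warp 1 */
    int lane = tid % 32;               /* position within the warp                       */
    if (lane == 0)                     /* let one thread report per warp                 */
        printf("threads %d-%d form warp %d\n", warp * 32, warp * 32 + 31, warp);
}

int main(void) {
    warpInfo<<<1, 64>>>();             /* one block of 64 threads -> warps 0 and 1       */
    cudaDeviceSynchronize();
    return 0;
}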

Page 28: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Memory Hierarchy

» Per-thread register: int RegisterVar
» Per-thread local memory: int LocalVarArray[10]
» Per-block shared memory: __shared__ int SharedVar
» Per-application global memory: __device__ int GlobalVar
» Per-application constant memory: __constant__ int ConstVar
» Per-application texture memory: texture<float, 1, ReadMode> TextureVar

Global, constant, and texture memory are visible to the host; registers, local memory, and shared memory are private to a thread or a thread block on the device.
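A compact sketch tying the qualifiers together in one kernel (variable names are arbitrary and the texture reference is omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

__device__   int GlobalVar;          /* per-application, in device (global) memory    */
__constant__ int ConstVar;           /* per-application, cached constant memory       */

__global__ void memSpaces(int *out) {
    __shared__ int SharedVar;        /* one copy per thread block                     */
    int RegisterVar = threadIdx.x;   /* per-thread, normally lives in a register      */
    int LocalVarArray[10];           /* per-thread, may spill to local memory         */

    if (threadIdx.x == 0) SharedVar = ConstVar + GlobalVar;
    __syncthreads();                 /* make SharedVar visible to the whole block     */

    for (int i = 0; i < 10; i++) LocalVarArray[i] = RegisterVar + i;
    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVarArray[9];
}

int main(void) {
    int h_const = 2, h_global = 3, *d_out, h_out[32];
    cudaMemcpyToSymbol(ConstVar,  &h_const,  sizeof(int));  /* init constant memory   */
    cudaMemcpyToSymbol(GlobalVar, &h_global, sizeof(int));  /* init the global var    */
    cudaMalloc(&d_out, 32 * sizeof(int));
    memSpaces<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 0 result: %d\n", h_out[0]);              /* 5 + (0 + 9) = 14       */
    cudaFree(d_out);
    return 0;
}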