TRANSCRIPT

Page 1: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

University of Michigan

December 3, 2012

Guest Speakers Today: Daya Khudia and Mehrzad Samadi

Page 2: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Announcements & Reading Material

Exams graded and will be returned in Wednesday's class

This class
» "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – Research Topic 3: Security
» "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005.

Page 3: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stream Graph Modulo Scheduling

Page 4: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs

Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data – DMA engine independent of PEs

Filters = operations, cores = function units

[Figure: Cell processor block diagram – eight SPEs (SPE0–SPE7), each an SPU with a 256 KB local store (LS) and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]

Page 5: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Preliminaries

Synchronous Data Flow (SDF) [Lee '87], StreamIt [Thies '02]

Stateless filter:

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Push and pop items from input/output FIFOs.

Stateful filter: the weights become a mutable field that is updated across executions of the work function, e.g.

int wgts[N];
...
wgts = adapt(wgts);
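To make the FIFO semantics concrete, here is a minimal plain-C sketch of what one firing of the FIR work function does against its input and output channels. This is not StreamIt syntax; the buffer sizes, names, and the driver in main are made up for illustration.

#include <stdio.h>

#define N 4                          /* hypothetical number of taps                 */
static int in_buf[64], out_buf[64];  /* hypothetical FIFOs backing the channels     */
static int in_head = 0, out_tail = 0;

static int  peek(int i) { return in_buf[in_head + i]; }  /* read without consuming  */
static void pop(void)   { in_head++; }                   /* consume one input item  */
static void push(int v) { out_buf[out_tail++] = v; }     /* produce one output item */

/* one firing of the FIR filter: peeks N items, pops 1, pushes 1 */
static void fir_work(const int wgts[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
    push(sum);
    pop();
}

int main(void) {
    int wgts[N] = {1, 2, 3, 4};
    for (int i = 0; i < 8; i++) in_buf[i] = i;   /* fill the input FIFO             */
    for (int f = 0; f < 4; f++) fir_work(wgts);  /* four firings                    */
    for (int f = 0; f < 4; f++) printf("%d ", out_buf[f]);
    printf("\n");                                /* prints: 20 30 40 50             */
    return 0;
}

Because each firing reads only the FIFO and its parameters, the filter is stateless and the compiler is free to replicate (fission) it; the stateful variant above, which carries wgts across firings, cannot be split that way.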

Page 6: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Overview

[Figure: the single-PE schedule of the stream graph (time T1) is split across PE0–PE3 and coarse-grain software pipelined; DMA transfers between stages overlap with computation, the pipeline has a prologue and an epilogue around the steady state, and the four-PE iteration time T4 gives a speedup of roughly T1/T4 ≈ 4]

Page 7: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Phases

Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation

Page 8: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Processor Assignment: Maximizing Throughput

• Assigns each filter to a processor (four processing elements PE0–PE3 in the example; w_i is the workload of filter i, a_ij = 1 if filter i is assigned to PE j, W: workload)

Minimize II (the initiation interval), subject to:
  Σ_j a_ij = 1           for all filters i = 1, …, N   (each filter goes to exactly one PE)
  Σ_i w_i · a_ij ≤ II    for all PEs j = 1, …, P       (no PE is loaded beyond II)

[Figure: example stream graph with six filters A–F whose workloads (20, 20, 20, 30, 50, 30) total 170; on a single PE, T1 = 170, while a balanced assignment onto the four PEs achieves the minimum II of 50 (T2 = 50) – balanced workload, maximum throughput, speedup T1/T2 = 3.4]
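As a sanity check on the formulation, a small brute-force search (a sketch, not the ILP solver the paper uses) can enumerate every assignment of the six example filters to four PEs and report the minimum achievable II. The weights below are taken from the slide's example; everything else is made up.

#include <stdio.h>

#define NF 6   /* filters A..F */
#define NP 4   /* processing elements */

int main(void) {
    int w[NF] = {20, 20, 20, 30, 50, 30};        /* workloads from the example     */
    int T1 = 0;
    for (int f = 0; f < NF; f++) T1 += w[f];     /* single-PE time: 170            */
    int best_II = T1;

    long total = 1;                              /* NP^NF candidate assignments    */
    for (int f = 0; f < NF; f++) total *= NP;

    for (long code = 0; code < total; code++) {
        int load[NP] = {0};
        long c = code;
        for (int f = 0; f < NF; f++) { load[c % NP] += w[f]; c /= NP; }
        int II = 0;                              /* II = maximum PE load           */
        for (int p = 0; p < NP; p++) if (load[p] > II) II = load[p];
        if (II < best_II) best_II = II;
    }
    printf("minimum II = %d (T1 = %d, speedup = %.1f)\n",
           best_II, T1, (double)T1 / best_II);   /* prints: minimum II = 50 ... 3.4 */
    return 0;
}

The search confirms that 50 is both achievable and a lower bound here (no PE can do better than the heaviest single filter).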

Page 9: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Need More Than Just Processor Assignment

Assign filters to processors
» Goal: equal work distribution

Graph partitioning? Bin packing?

[Figure: original stream program with filters A, B, C, D (workloads 5, 5, 40, and 10; total 60) on two PEs – the heavy filter dominates, so the best assignment (B alone on one PE; A, C, D on the other) gives speedup = 60/40 = 1.5. In the modified stream program, B is fissioned into B1 and B2 behind a split (S) and join (J); the PEs can then be balanced (A, S, B1, D on one PE; B2, C, J on the other), giving speedup = 60/32 ≈ 2]

Page 10: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Filter Fission Choices

[Figure: several ways to fission and distribute the filters across PE0–PE3 – can any choice reach a speedup of ~4?]

Page 11: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Integrated Fission + PE Assignment

Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
» Objective function: minimize the maximal load on any PE

Result
» Number of times to "split" each filter
» Filter → processor mapping

Page 12: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Step 2: Forming the Software Pipeline

To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped

Processor assignment alone is insufficient information.

[Figure: with filter A on PE0 and filters B, C on PE1, running the assigned chunks back-to-back does not help – B(i) cannot start until A(i)'s output has been transferred. Software pipelining staggers the iterations so that A(i+2) on PE0 overlaps with B(i) on PE1, with the A→B DMA transfer in between]

Page 13: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stage Assignment

Producer i and consumer j on the same PE:
» S_j ≥ S_i – preserve causality (producer–consumer dependence)

Producer i on PE 1, consumer j on PE 2:
» The DMA gets its own stage, S_DMA > S_i
» The consumer gets S_j = S_DMA + 1 – communication/computation overlap

Data-flow traversal of the stream graph
» Assign stages using the above two rules
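A minimal C sketch of how such a traversal could apply the two rules. The four-filter graph, its weights, and the PE mapping are made up for illustration; this is not the Streamroller implementation.

#include <stdio.h>
#include <string.h>

#define NF 4
#define NE 4

int pe[NF]   = {0, 1, 1, 0};   /* hypothetical mapping: A,D on PE0; B,C on PE1        */
int esrc[NE] = {0, 0, 1, 2};   /* edges A->B, A->C, B->D, C->D, listed producer-first */
int edst[NE] = {1, 2, 3, 3};
int stage[NF];

int main(void) {
    memset(stage, 0, sizeof stage);
    /* data-flow traversal: every edge is visited after its producer's stage is final */
    for (int e = 0; e < NE; e++) {
        int i = esrc[e], j = edst[e];
        if (pe[i] == pe[j]) {
            /* same PE: S_j >= S_i preserves the producer-consumer dependence         */
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            /* different PEs: the DMA gets its own stage (S_DMA > S_i),
               and the consumer starts at S_DMA + 1 to overlap transfer and compute   */
            int s_dma = stage[i] + 1;
            if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
        }
    }
    for (int f = 0; f < NF; f++)
        printf("filter %c: PE %d, stage %d\n", "ABCD"[f], pe[f], stage[f]);
    return 0;
}

For this made-up graph the traversal puts A in stage 0, B and C in stage 2 (their DMAs in stage 1), and D in stage 4 – the same five-stage shape as the example on the next slide.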

Page 14: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Stage Assignment Example

[Figure: stages for the fissioned example spread across PE 0 and PE 1 – Stage 0: A, S, B1; Stage 1: DMA transfers; Stage 2: B2, C, J; Stage 3: DMA transfers; Stage 4: D]

Page 15: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs

One thread / SPE

Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt the kernel-only code schema of a modulo schedule

Page 16: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Complete Example

Code for one SPE; the stage[] predicates select which pipeline stages are active in each iteration:

void spe1_work() {
  char stage[5] = {0};
  stage[0] = 1;
  int i;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    barrier();
  }
}
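As written on the slide, only stage[0] ever becomes true; presumably the full schema also advances the stage predicates each iteration so the pipeline fills and drains. A hedged sketch of that missing piece – the shift logic and the MAX+4 trip count are assumptions, not taken from the paper:

  /* hypothetical completion: run enough iterations to drain the 5-stage pipeline */
  for (i = 0; i < MAX + 4; i++) {
    if (stage[0]) { A(); S(); B1(); }
    /* ... stages 1-4 guarded exactly as above ... */
    barrier();
    /* shift predicates so work started in iteration i reaches stage s at iteration i+s */
    for (int s = 4; s > 0; s--) stage[s] = stage[s - 1];
    stage[0] = (i + 1 < MAX);   /* keep injecting new iterations for MAX iterations */
  }

The first few iterations form the prologue (stages light up one by one) and the last four form the epilogue (stages drain), with the kernel-only body shared by all of them.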

[Figure: execution timeline across SPE1, DMA1, SPE2, and DMA2 – A, S, B1 execute; their outputs are DMA-transferred (A→C, S→B2, B1→J); B2, J, C execute; the J→D and C→D transfers follow; and D executes. Successive iterations overlap, filling the software pipeline]

Page 17: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS (ILP) vs. Greedy

[Chart: relative speedup (y-axis 0–9) on the benchmarks bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, and radar, comparing ILP partitioning against greedy partitioning (MIT method, ASPLOS '06) and exposed DMA]

• Solver time < 30 seconds for 16 processors

Page 18: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

SGMS Conclusions

Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining

Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)

Scheduling framework
» Trade off memory space vs. load balance
» Memory-constrained (embedded) systems vs. cache-based systems

Page 19: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Discussion Points

Is it possible to convert stateful filters into stateless ones?

What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized?

Now you have seen 3 forms of software pipelining:
» 1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?

Page 20: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Compilation for GPUs

Page 21: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Why GPUs?

Page 22: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Efficiency of GPUs

High memory bandwidth
» GTX 285: 159 GB/s vs. i7: 32 GB/s

High flop rate
» Single precision: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» Double precision: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS

High flops per watt
» GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W

High flops per dollar
» GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$

Page 23: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

GPU Architecture

[Figure: the GPU contains 30 streaming multiprocessors (SM 0 – SM 29), each with its own register file, shared memory, and 8 scalar cores (0–7); the SMs reach global (device) memory through an interconnection network, and the device connects to the CPU and host memory over PCIe through a bridge]
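These architectural parameters can be read back at runtime with the standard CUDA runtime API; a short sketch (the numbers it prints will of course differ from the GTX 285 figures above, depending on the GPU you run it on):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           /* properties of device 0            */
    printf("name:                  %s\n",  prop.name);
    printf("multiprocessors (SMs): %d\n",  prop.multiProcessorCount);
    printf("warp size:             %d\n",  prop.warpSize);
    printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:   %d\n",  prop.regsPerBlock);
    printf("global memory:         %zu bytes\n", prop.totalGlobalMem);
    return 0;
}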

Page 24: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

CUDA

"Compute Unified Device Architecture"

General-purpose programming model
» User kicks off batches of threads on the GPU

Advantages of CUDA
» Interface designed for compute – graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
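As a concrete, minimal illustration of "kicking off a batch of threads" – a sketch of a vector-add kernel and its launch, with explicit device memory management as in the bullet above. The names and sizes are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per thread        */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* explicit GPU memory management: allocate device buffers and copy inputs over   */
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* kick off a batch of threads: a grid of blocks, 256 threads per block           */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);                   /* prints 3.000000               */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The <<<blocks, threadsPerBlock>>> launch configuration is exactly the grid-of-blocks structure shown in the programming-model figure on the next slide.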

Page 25: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Programming Model

[Figure: over time, the host launches Kernel 1 and then Kernel 2 on the device; each kernel executes as a grid of thread blocks (Grid 1, Grid 2)]

Page 26: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

GPU Scheduling

[Figure: the thread blocks of Grid 1 are distributed by the hardware scheduler across the SMs (SM 0 through SM 30 in the figure), each SM with its own registers, shared memory, and 8 scalar cores]

Page 27: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Warp Generation

[Figure: Blocks 0–3 are assigned to SM0 (with its registers, shared memory, and cores 0–7); within a block, threads are grouped by thread ID into warps of 32 – threads 0–31 form Warp 0, threads 32–63 form Warp 1]
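A small device-side sketch of the same grouping (the warp size of 32 is hard-coded to match the slide; portable code would read the built-in warpSize instead):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void warpInfo(void) {
    int tid  = threadIdx.x;            /* thread ID within the block                     */
    int warp = tid / 32;               /* threads 0-31 -> warp 0, threads 32-63 -> warp 1 */
    int lane = tid % 32;               /* position within the warp                       */
    if (lane == 0)                     /* let one thread report per warp                 */
        printf("threads %d-%d form warp %d\n", warp * 32, warp * 32 + 31, warp);
}

int main(void) {
    warpInfo<<<1, 64>>>();             /* one block of 64 threads -> warps 0 and 1       */
    cudaDeviceSynchronize();
    return 0;
}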

Page 28: EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation

Memory Hierarchy

» Per-thread register: int RegisterVar
» Per-thread local memory: int LocalVarArray[10]
» Per-block shared memory: __shared__ int SharedVar
» Per-application global memory: __device__ int GlobalVar
» Per-application constant memory: __constant__ int ConstVar
» Per-application texture memory: texture<float, 1, ReadMode> TextureVar

Global, constant, and texture memory are visible to the host; registers, local memory, and shared memory are private to a thread or a thread block on the device.
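A compact sketch tying the qualifiers together in one kernel (variable names are arbitrary and the texture reference is omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

__device__   int GlobalVar;          /* per-application, in device (global) memory    */
__constant__ int ConstVar;           /* per-application, cached constant memory       */

__global__ void memSpaces(int *out) {
    __shared__ int SharedVar;        /* one copy per thread block                     */
    int RegisterVar = threadIdx.x;   /* per-thread, normally lives in a register      */
    int LocalVarArray[10];           /* per-thread, may spill to local memory         */

    if (threadIdx.x == 0) SharedVar = ConstVar + GlobalVar;
    __syncthreads();                 /* make SharedVar visible to the whole block     */

    for (int i = 0; i < 10; i++) LocalVarArray[i] = RegisterVar + i;
    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVarArray[9];
}

int main(void) {
    int h_const = 2, h_global = 3, *d_out, h_out[32];
    cudaMemcpyToSymbol(ConstVar,  &h_const,  sizeof(int));  /* init constant memory   */
    cudaMemcpyToSymbol(GlobalVar, &h_global, sizeof(int));  /* init the global var    */
    cudaMalloc(&d_out, 32 * sizeof(int));
    memSpaces<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 0 result: %d\n", h_out[0]);              /* 5 + (0 + 9) = 14       */
    cudaFree(d_out);
    return 0;
}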