Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
S. M. Farhad (University of Sydney)
Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz

Post on 29-Dec-2015


Outline
Motivation
Research question
Contributions
Summary

Cores are the New Gates
[Figure: number of cores per chip (log scale, 1 to 512) against year (1975 to 2010), with processors grouped as unicore, homogeneous multicore, and heterogeneous multicore. Chips shown include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80, Core, Core2Duo, and Core2Quad. Programming models shown: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and Stream Programming. (Shekhar Borkar, Intel). Courtesy: Scott '08]

Stream Programming Paradigm
Programs expressed as stream graphs
Streams: Sequence of data elements
Actor: Functions applied to streams
[Figure: a stream graph in which an actor/filter is connected by input and output streams]
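The actor and stream vocabulary can be made concrete with a small sketch. This is not StreamIt code; it only assumes that a stream can be modelled as a Python iterable and an actor as a function applied to each element. The helper names `actor` and `pipeline` are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def actor(fn: Callable[[int], int]) -> Callable[[Iterable[int]], Iterator[int]]:
    """Lift a per-element function into an actor that transforms a stream."""
    def run(stream: Iterable[int]) -> Iterator[int]:
        for item in stream:
            yield fn(item)
    return run


def pipeline(*actors):
    """Connect actors so that each one consumes the previous actor's output stream."""
    def run(stream):
        for a in actors:
            stream = a(stream)
        return stream
    return run


scale = actor(lambda x: 2 * x)    # first actor: doubles every element
offset = actor(lambda x: x + 1)   # second actor: adds one to every element
graph = pipeline(scale, offset)   # a two-actor stream graph

result = list(graph(range(5)))
print(result)  # [1, 3, 5, 7, 9]
```

Because each actor only sees its input stream, the two actors could in principle run on different cores, which is what the mapping problem in the following slides is about.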

StreamIt Language
Basic and hierarchical structures
Each construct has single input/output stream
[Figure: the StreamIt constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter); a nested element may be any StreamIt language construct]


How to Orchestrate a Stream Graph?
Mapping actors
Eliminate bottlenecks (a.k.a. hot actors)

Mapping Actors
[Figure: stream graph with actors A (5), B (60), C (60), D (5) mapped onto three cores]
Core 1: A (5), D (5); Load = 10
Core 2: B (60); Load = 60
Core 3: C (60); Load = 60
Makespan = 60, Speedup = 130/60 = 2.17
Ideally speedup = 3
Actors B and C are the bottlenecks
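The arithmetic on this slide can be reproduced with a short sketch. The work values and the mapping are taken from the slide; the function name `makespan_and_speedup` is hypothetical, and speedup is assumed to mean sequential work divided by the load of the busiest core.

```python
def makespan_and_speedup(work, mapping):
    """Return per-core loads, the makespan, and the speedup over one core."""
    loads = {}
    for name, core in mapping.items():
        loads[core] = loads.get(core, 0) + work[name]
    makespan = max(loads.values())            # the busiest core limits throughput
    speedup = sum(work.values()) / makespan   # sequential work / parallel time
    return loads, makespan, speedup


work = {"A": 5, "B": 60, "C": 60, "D": 5}
mapping = {"A": 1, "D": 1, "B": 2, "C": 3}   # actor -> core, as on the slide

loads, makespan, speedup = makespan_and_speedup(work, mapping)
print(loads, makespan, round(speedup, 2))  # {1: 10, 2: 60, 3: 60} 60 2.17
```

No assignment of whole actors can do better here, because B and C each need 60 units on their own.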

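Bottleneck Elimination
[Figure: the original graph A (5), B (60), C (60), D (5) is transformed by hot actor duplication. B becomes splitter s1 (6), replicas B_1, B_2, B_3 (60 each), and joiner j1 (6); C becomes splitter s2 (6), replicas C_1, C_2, C_3 (60 each), and joiner j2 (6); A and D become 15 each]
Hot actor duplication
Core 1: A (15), s1 (6), B_1 (60), C_1 (60); Load = 141
Core 2: B_2 (60), j1 (6), s2 (6), C_2 (60), j2 (6); Load = 138
Core 3: B_3 (60), C_3 (60), D (15); Load = 135
Makespan = 141, Speedup = 130x3/141 = 2.77
28% increased speedup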
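The numbers after duplication can be checked with a sketch. It assumes the duplicated graph processes three iterations per period (which is why A and D cost 15 and the sequential work is 130x3), and that each splitter and joiner costs 2 per iteration, inferred from the 6 shown on the slide. The function `duplicate_hot_actors` and the names `s_B`, `j_B`, `s_C`, `j_C` (standing in for s1, j1, s2, j2) are hypothetical.

```python
def duplicate_hot_actors(work, hot, k, sj_cost):
    """Work per k iterations after replacing each hot actor by k replicas."""
    new_work = {}
    for name, cost in work.items():
        if name in hot:
            new_work[f"s_{name}"] = sj_cost * k       # splitter handles all k iterations
            for i in range(1, k + 1):
                new_work[f"{name}_{i}"] = cost         # each replica handles one iteration
            new_work[f"j_{name}"] = sj_cost * k       # joiner handles all k iterations
        else:
            new_work[name] = cost * k                  # unchanged actor runs k times
    return new_work


work = {"A": 5, "B": 60, "C": 60, "D": 5}
K = 3
dup = duplicate_hot_actors(work, hot={"B", "C"}, k=K, sj_cost=2)

cores = [                                   # mapping as shown on the slide
    ["A", "s_B", "B_1", "C_1"],
    ["B_2", "j_B", "s_C", "C_2", "j_C"],
    ["B_3", "C_3", "D"],
]
loads = [sum(dup[a] for a in core) for core in cores]
makespan = max(loads)
speedup = K * sum(work.values()) / makespan
gain_pct = round((speedup / (130 / 60) - 1) * 100)   # relative to the 130/60 mapping

print(loads, round(speedup, 2), gain_pct)  # [141, 138, 135] 2.77 28
```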

Orchestration of Stream Program Contd.
Current state of the art
  Integer Linear Programming: Intractable
  Heuristics: Unknown performance
How to find a fast and good solution?
  Approximation algorithms that have
    Polynomial runtime
    Quality bound for solution


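Data Transfer Model
Arrival rate depends on the data rate of the actors (maximize)
Data transfer model forms a system of simultaneous linear functional equations
Compute a closed form of the output data rate
We also consider a processor utilization function for each actor
[Figure: pipeline A → B → C fed at arrival rate z, with edge labels 1 and 5 between A and B, and 1 and 1 between B and C]
System of equations: $x_A = z$, $x_B = 0.2\,x_A$, $x_C = x_B$
Closed form: $x_A = z$, $x_B = 0.2\,z$, $x_C = 0.2\,z$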
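For a simple pipeline, the closed form can be obtained by propagating rates from the source. The sketch below assumes the edge labels in the figure are (push, pop) pairs, so that A pushes 1 item and B pops 5 per firing; that reading is an assumption, as is the function name `closed_form_rates`. It handles pipelines only, not general stream graphs.

```python
from fractions import Fraction


def closed_form_rates(actors, edges):
    """Return c[a] such that the firing rate of actor a is c[a] * z.

    actors: actor names in pipeline order.
    edges:  one (push, pop) pair per consecutive pair of actors.
    """
    coeff = {actors[0]: Fraction(1)}            # the first actor fires at rate z
    for src, dst, (push, pop) in zip(actors, actors[1:], edges):
        # dst fires once for every `pop` items, src emits `push` items per firing
        coeff[dst] = coeff[src] * Fraction(push, pop)
    return coeff


rates = closed_form_rates(["A", "B", "C"], [(1, 5), (1, 1)])
for name, c in rates.items():
    print(f"x_{name} = {float(c)} z")
# x_A = 1.0 z
# x_B = 0.2 z
# x_C = 0.2 z
```

Exact fractions are used so that rate ratios such as 1/5 do not pick up floating-point error along long pipelines.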

Bottleneck Analysis
The arrival rate is limited by
  Processor capacity of the cores
  Memory bandwidth
A quantitative analysis determines
  An upper bound of the arrival rate imposed by an actor
  An upper bound of the arrival rate imposed by the parallel system
Hot actor
  Upper bound (actor) < upper bound (system)
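The hot actor test can be sketched as follows, under strong simplifying assumptions that are not taken from the slides: utilization is linear in the arrival rate (an actor with cost c uses c * z of one core), an actor cannot exceed 100% of a single core, and memory bandwidth is ignored. The function name `find_hot_actors` is hypothetical; the costs reuse the A, B, C, D example.

```python
def find_hot_actors(cost, num_cores):
    """Compare each actor's arrival-rate bound with the system-wide bound."""
    system_ub = num_cores / sum(cost.values())       # all cores fully used
    actor_ub = {a: 1 / c for a, c in cost.items()}   # one actor fills one core
    hot = sorted(a for a, ub in actor_ub.items() if ub < system_ub)
    return system_ub, actor_ub, hot


cost = {"A": 5, "B": 60, "C": 60, "D": 5}
system_ub, actor_ub, hot = find_hot_actors(cost, num_cores=3)

print(round(system_ub, 4))      # 0.0231
print(round(actor_ub["B"], 4))  # 0.0167
print(hot)                      # ['B', 'C']
```

Under these assumptions B and C come out as hot, which agrees with the earlier mapping slide.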

Approximation of Actor Allocation Problem
The actor allocation problem (AAP) is NP-hard
For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
There exist approximation algorithms for bin-packing
  Polynomial running time
  Solution quality is bounded
$$\frac{z_{apx}}{ub(z)} \le \frac{z_{apx}}{z_{opt}} \le 1$$
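The reduction to bin packing can be illustrated with a feasibility check for one fixed arrival rate. The sketch uses first-fit decreasing as a stand-in; the slides do not say which bin-packing approximation the authors use. Costs are taken from the bottleneck-free example later in the deck, utilization is assumed to be cost times z, and the names `first_fit_decreasing` and `utilizations` are hypothetical.

```python
def first_fit_decreasing(utilization, num_cores, capacity=1.0):
    """Pack actors onto cores; return the per-core actor lists, or None if they do not fit."""
    loads = [0.0] * num_cores
    cores = [[] for _ in range(num_cores)]
    by_size = sorted(utilization.items(), key=lambda kv: kv[1], reverse=True)
    for name, u in by_size:
        for i in range(num_cores):
            if loads[i] + u <= capacity + 1e-9:   # small tolerance for float rounding
                loads[i] += u
                cores[i].append(name)
                break
        else:
            return None                            # this actor fits on no core
    return cores


cost = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20,
        "j1": 2, "C_1": 20, "C_2": 20, "C_3": 20, "D": 5}


def utilizations(z):
    """Assumed linear utilization model: an actor with cost c uses c * z of one core."""
    return {a: c * z for a, c in cost.items()}


packing_low = first_fit_decreasing(utilizations(0.02), num_cores=3)
packing_high = first_fit_decreasing(utilizations(0.03), num_cores=3)
print(packing_low is not None, packing_high is None)  # True True
```

At z = 0.03 each replica needs 60% of a core, so six replicas cannot share three cores, and the check reports that no allocation exists.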


Summary (ASPLOS 2011)

Novel data transfer model

A simple quantitative analysis to detect and eliminate bottlenecks

A novel 2-approximation algorithm for deploying stream graphs on multicore platforms

Results are within 5% of the optimal solution

Achieves a geometric mean speedup of 6.95x for 8 processors over single processor execution


Related Works
[1] Static Scheduling of SDF Programs for DSP [IEEE ’87]
[2] StreamIt: A language for streaming applications [Springer ’02]
[3] Phased Scheduling of Stream Programs [LCTES ’03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [ASPLOS ’06]
[5] Orchestrating the Execution of Stream Programs on Cell [PLDI ’08]
[6] Software Pipelined Execution of Stream Programs on GPUs [IEEE ’09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [IEEE ’09]
[8] An empirical characterization of stream programs and its implications for language and compiler design [PACT ’10]

Questions?

Focus of Our Work
[Figure: StreamIt compiler phases (Stream Graph Scheduling, Stream Graph Partitioning, Layout on Target Architecture, Communication Scheduling) shown together with the components of this work (Linear Functional Equation Solver, Bottleneck Resolver, Actor Allocation on Processors)]

Actor Allocation Constraint
Actors with their utilizations
Each core has 100% utilization
[Figure: actors 1 to n, each drawn with its utilization, packed into cores of 100% capacity]
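Written as a formula, the constraint on this slide says that the actors placed on a core may not use more than 100% of it. The symbols are assumptions introduced here for illustration: $A_p$ is the set of actors allocated to core $p$, $u_a(z)$ is the utilization of actor $a$ at arrival rate $z$, and $P$ is the number of cores.

```latex
\sum_{a \in A_p} u_a(z) \le 1 \qquad \text{for every core } p \in \{1, \dots, P\}
```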

Binary Search
[Figure, shown over three successive slides: the solution space for the arrival rate z on an axis marked 0, ub(z), and 1.0, with left, mid, and right pointers. At each step the question "Allocation possible?" is asked for mid, and the interval between left and right narrows]
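The search over the arrival rate can be sketched as a standard bisection on a feasibility predicate. Several things here are assumptions rather than the authors' implementation: the feasibility check is a plain first-fit decreasing packing, utilization is cost times z, the upper bound is the number of cores divided by the total cost, and feasibility is treated as monotone in z. The names `fits` and `max_feasible_rate` are hypothetical.

```python
def fits(cost, z, num_cores):
    """Assumed feasibility check: first-fit decreasing with utilization = cost * z."""
    loads = [0.0] * num_cores
    for c in sorted(cost.values(), reverse=True):
        for i in range(num_cores):
            if loads[i] + c * z <= 1.0 + 1e-9:
                loads[i] += c * z
                break
        else:
            return False      # some actor fits on no core at this rate
    return True


def max_feasible_rate(feasible, upper_bound, tol=1e-6):
    """Bisect [0, upper_bound] for the largest rate at which an allocation exists."""
    left, right = 0.0, upper_bound
    while right - left > tol:
        mid = (left + right) / 2
        if feasible(mid):     # "Allocation possible?" -> search higher rates
            left = mid
        else:                 # not possible -> search lower rates
            right = mid
    return left


cost = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20,
        "j1": 2, "C_1": 20, "C_2": 20, "C_3": 20, "D": 5}
CORES = 3
ub = CORES / sum(cost.values())                      # assumed system upper bound
z_best = max_feasible_rate(lambda z: fits(cost, z, CORES), ub)

print(round(130 * z_best, 2))  # 2.89
```

With these costs the search settles near z = 1/45, where the busiest core carries a load of 45; multiplying by the sequential work of 130 gives the speedup of 2.89 reported for the bottleneck-free program.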

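Actor Allocation of Bottleneck Free Program
[Figure: after efficient bottleneck resolving, the graph consists of A (5), splitter s1 (2), replicas B_1, B_2, B_3 (20 each), replicas C_1, C_2, C_3 (20 each), joiner j1 (2), and D (5); it is then mapped onto three cores]
Core 1: A (5), B_1 (20), C_1 (20); Load = 45
Core 2: s1 (2), B_2 (20), j1 (2), C_2 (20); Load = 44
Core 3: B_3 (20), C_3 (20), D (5); Load = 45
Makespan = 45, Speedup = 130/45 = 2.89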

Experiments
Our method is implemented as an extension of the StreamIt compiler
We compare to an ILP-based method [Scott 08] (solved with CPLEX)
Hardware setup
  2.33 GHz dual quad-core Intel Xeon processors
  16 GB memory
  Linux kernel version 2.6.23
Profiler uses the x86-64’s hardware cycle counters

Experiments Contd.
Experimental Process
  Profiling
  Computing closed form
  Resolve bottlenecks
  Compute the mapping
  Compute the layout scheduling
  Invoke the StreamIt back end
  Finally we measure the performance

Experiments Contd.

Benchmark          Actors   Stateful
DCT                22       18
FMRadio            67       23
TDE                55       27
FFT                26       14
MergeSort          31       2
FilterBankNew      53       34
RadixSort          13       2
EqualizerProgram   65       22
BitonicSort        452      2
DES                375      180
MPEG               39       7
MatrixMult         52       2

Experimental Results for 2 – 4 Processors

Benchmark     Proc#   ILP Time (Optimal) (s)
DCT           2       0.27
              3       1585.69
              4       2285.01
FMRadio       2       0.08
              3       3.22
              4       1.29
TDE           2       0.09
              3       0.17
              4       274.69
FFT           2       0.46
              3       44694.25
              4       249240.09
Equalizer     2       0.06
              3       5.29
              4       57553.83
BitonicSort   2       0.3
              3       3.06
              4       16371.99
DES           2       0.51
              3       2.73
              4       11.24
MPEG          2       0.12
              3       1.37
              4       0.44

Our method’s run time: <1s

Experimental Results for 2 – 4 Processors

Benchmark     Proc#   Arrival rate ratio (Apx/Optimal)   Apx. Arrival Rate (MBps)
DCT           2       0.99                               42.56
              3       0.97                               62.46
              4       0.96                               82.57
FMRadio       2       1.00                               2.69
              3       1.00                               4.00
              4       0.99                               5.34
TDE           2       1.00                               13.31
              3       1.00                               19.80
              4       1.00                               28.81
FFT           2       0.98                               43.91
              3       0.98                               46.85
              4       0.95                               95.50
Equalizer     2       1.00                               0.56
              3       1.00                               0.83
              4       1.00                               1.59
BitonicSort   2       1.00                               3.16
              3       1.00                               4.73
              4       1.00                               10.14
DES           2       1.00                               0.12
              3       1.00                               0.18
              4       1.00                               0.24
MPEG          2       1.00                               36.59
              3       0.99                               54.68
              4       1.00                               73.22
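The claim that the approximation stays within 5% of the optimum can be checked directly against the ratio column above. The sketch below only copies the ratios from the table (for 2, 3, and 4 processors, in that order); the variable names are hypothetical.

```python
# Arrival rate ratios (approximation / optimal) for 2, 3, and 4 processors.
ratios = {
    "DCT":         [0.99, 0.97, 0.96],
    "FMRadio":     [1.00, 1.00, 0.99],
    "TDE":         [1.00, 1.00, 1.00],
    "FFT":         [0.98, 0.98, 0.95],
    "Equalizer":   [1.00, 1.00, 1.00],
    "BitonicSort": [1.00, 1.00, 1.00],
    "DES":         [1.00, 1.00, 1.00],
    "MPEG":        [1.00, 0.99, 1.00],
}

# Find the benchmark with the lowest ratio over all processor counts.
worst_name, worst = min(
    ((name, min(values)) for name, values in ratios.items()),
    key=lambda pair: pair[1],
)
gap_pct = round((1 - worst) * 100)

print(worst_name, worst, gap_pct)  # FFT 0.95 5
```

The largest gap in the table is FFT on 4 processors, at 5% below the optimal arrival rate.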

Speedup Results


Summary
Approximation algorithm for solving actor allocation problem
Data rate transfer model that resolves bottlenecks
We separate the bottleneck elimination from the actor allocation
We implemented our approach and compared with an optimal approach
  Optimal approach has unpredictable time
  Our approach has negligible time for all benchmarks
  Quality of our approach is at most 5% off the optimum
For up to 8 processors we achieve a geometric mean speedup of 6.95x over single processor execution

Questions?