Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures
S. M. Farhad (University of Sydney)
Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz
Cores are the New Gates
[Figure: number of cores per chip (1 to 512) by year (1975 to 2010), grouped into unicore, homogeneous multicore, and heterogeneous multicore processors; examples include the 4004, 8086, Pentium, Core2Quad, Niagara, Cell, NVIDIA G80, and Larrabee. (Shekhar Borkar, Intel). Courtesy: Scott’08]
Programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming
Stream Programming Paradigm
Programs expressed as stream graphs
Streams: Sequence of data elements
Actor: Functions applied to streams
[Figure: stream graph in which actors/filters are connected by streams]
StreamIt Language
Basic and hierarchical structures
Each construct has single input/output stream
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter); each component may be any StreamIt language construct]
Mapping Actors
Stream graph (actor, work): A 5, B 60, C 60, D 5
Core 1: A 5, D 5 (Load = 10)
Core 2: B 60 (Load = 60)
Core 3: C 60 (Load = 60)
Makespan = 60, Speedup = 130/60 = 2.17
Ideally speedup = 3
Actors B and C are the bottlenecks
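The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `core_loads` and the dictionary layout are illustrative assumptions, and only the work estimates and the actor-to-core assignment come from the slide.

```python
def core_loads(mapping):
    """Sum the work of the actors assigned to each core."""
    return {core: sum(actors.values()) for core, actors in mapping.items()}

# Work estimates and actor-to-core assignment taken from the slide.
mapping = {
    "Core 1": {"A": 5, "D": 5},
    "Core 2": {"B": 60},
    "Core 3": {"C": 60},
}

loads = core_loads(mapping)
total_work = sum(loads.values())   # time on a single core: 130
makespan = max(loads.values())     # the busiest core sets the pace: 60
speedup = total_work / makespan    # 130 / 60

print(loads, makespan, round(speedup, 2))
```

Because B and C each fill a whole core, no reassignment of whole actors can bring the makespan below 60, which is why the slide calls them the bottlenecks.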
Bottleneck Elimination
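Original stream graph (actor, work): A 5, B 60, C 60, D 5
Hot actor duplication
Stream graph after duplication (actor, work): A 15, s1 6, B_1 60, B_2 60, B_3 60, j1 6, s2 6, C_1 60, C_2 60, C_3 60, j2 6, D 15
Core 1: A 15, s1 6, B_1 60, C_1 60 (Load = 141)
Core 2: B_2 60, j1 6, s2 6, C_2 60, j2 6 (Load = 138)
Core 3: B_3 60, C_3 60, D 15 (Load = 135)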
Makespan = 141, Speedup = (130 × 3)/141 = 2.77
28% increased speedup
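The 28% figure follows from the two speedups. The sketch below recomputes it from the numbers on the slide. It assumes, based on the factor of 3 in the slide's formula and on A and D growing from 5 to 15, that one steady state of the duplicated graph does the work of three steady states of the original. Variable names are illustrative.

```python
COPIES = 3  # B and C are each duplicated into three copies

# Single-core work of the original graph: A 5, B 60, C 60, D 5.
original_work = 5 + 60 + 60 + 5

# Mapping of the duplicated graph, as read from the slide.
mapping = {
    "Core 1": {"A": 15, "s1": 6, "B_1": 60, "C_1": 60},
    "Core 2": {"B_2": 60, "j1": 6, "s2": 6, "C_2": 60, "j2": 6},
    "Core 3": {"B_3": 60, "C_3": 60, "D": 15},
}
loads = {core: sum(actors.values()) for core, actors in mapping.items()}
makespan = max(loads.values())

# Assumption: the duplicated graph handles COPIES times the data per steady state.
speedup_after = original_work * COPIES / makespan
speedup_before = original_work / 60          # makespan before duplication
gain = speedup_after / speedup_before - 1    # relative improvement

print(loads, round(speedup_after, 2), f"{gain:.0%}")
```

The splitters and joiners (s1, j1, s2, j2) add 24 units of work in total, which is part of the reason the speedup stays below the ideal value of 3.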
Orchestration of Stream Program Contd.
Current state of the art
  Integer Linear Programming: intractable
  Heuristics: unknown performance
How to find a fast and good solution?
  Approximation algorithms that have
    Polynomial runtime
    Quality bound for solution
Data Transfer Model
Arrival rate depends on the data rate of the actors (maximize)
The data transfer model forms a system of simultaneous linear functional equations
Compute a closed form of the output data rate
We also consider a processor utilization function for each actor
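[Figure: pipeline A → B → C with arrival rate z and edge rate labels 1, 5, 1, 1]
System of equations: $x_A = z$, $x_B = 0.2\,x_A$, $x_C = x_B$
Closed form: $x_A = z$, $x_B = 0.2\,z$, $x_C = 0.2\,z$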
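For the pipeline example, with x_A = z, x_B = 0.2 x_A, and x_C = x_B, the closed form can be obtained by propagating coefficients of z from the source actor downstream. The sketch below does this only for acyclic graphs in which every actor has a single producer. The function name `closed_form_rates` and the edge-gain representation are assumptions made for illustration. Splitjoins and feedback loops, which the full model has to handle, are not covered.

```python
from fractions import Fraction

def closed_form_rates(source, edges):
    """Return each actor's data rate as a coefficient of the arrival rate z.

    edges maps (producer, consumer) to the gain g in x_consumer = g * x_producer.
    Assumes an acyclic graph, one producer per actor, and edges listed in
    topological order (dicts keep insertion order in Python 3.7+).
    """
    coeff = {source: Fraction(1)}  # x_source = z
    for (producer, consumer), gain in edges.items():
        coeff[consumer] = gain * coeff[producer]
    return coeff

# Pipeline from the slide: x_A = z, x_B = 0.2 * x_A, x_C = x_B.
edges = {("A", "B"): Fraction(1, 5), ("B", "C"): Fraction(1)}
rates = closed_form_rates("A", edges)

for actor, c in rates.items():
    print(f"x_{actor} = {float(c)} * z")
```

Exact fractions are used so that the coefficients stay free of rounding error when gains are multiplied along long pipelines.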
Bottleneck Analysis
The arrival rate is limited by
  Processor capacity of the cores
  Memory bandwidth
A quantitative analysis determines
  An upper bound of the arrival rate imposed by an actor
  An upper bound of the arrival rate imposed by the parallel system
Hot actor
  Upper bound (actor) < upper bound (system)
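The hot-actor test can be illustrated on the four-actor example from the mapping slides. The sketch below makes several simplifying assumptions that are not stated on the slide: each actor's utilization grows linearly with the arrival rate z, a core offers a capacity of 1, an actor cannot be spread over several cores, and memory bandwidth is ignored. The name `hot_actors` and its parameters are hypothetical.

```python
def hot_actors(cost, num_cores):
    """Classify actors whose own bound on z is below the system-wide bound.

    cost[a] is the processor time actor a needs per unit of arrival rate,
    so its utilization at rate z is assumed to be cost[a] * z.
    """
    # All cores together cannot exceed num_cores units of utilization.
    system_bound = num_cores / sum(cost.values())
    # A single actor cannot exceed the capacity of one core.
    actor_bound = {a: 1 / c for a, c in cost.items()}
    hot = [a for a, bound in actor_bound.items() if bound < system_bound]
    return hot, actor_bound, system_bound

cost = {"A": 5, "B": 60, "C": 60, "D": 5}
hot, actor_bound, system_bound = hot_actors(cost, num_cores=3)

print(hot, round(system_bound, 4))
```

Under these assumptions B and C are each limited to 1/60, below the system bound of 3/130, which matches the earlier observation that B and C are the bottlenecks.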
Approximation of Actor Allocation Problem
The actor allocation problem (AAP) is NP-hard
For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
There exist approximation algorithms for bin-packing
  Polynomial running time
  Solution quality is bounded
[Figure: quality bound relating the approximate arrival rate $z_{apx}$, the optimal arrival rate $z_{opt}$, and the upper bound $z_{ub}$]
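To make the reduction concrete: once the arrival rate z is fixed, every actor has a fixed utilization, and the question becomes whether these utilizations can be packed into as many bins as there are cores. The slide does not say which bin-packing approximation is used, so the sketch below swaps in first-fit decreasing, a well-known polynomial-time heuristic, purely as an illustration. The function names, the capacity of 1 per core, and the linear utilization model are assumptions. The work values are those of the bottleneck-free example graph (A 5, s1 2, B_1 to B_3 20, C_1 to C_3 20, j1 2, D 5).

```python
from fractions import Fraction

def first_fit_decreasing(utilization, capacity=1):
    """Pack actors into bins (cores); return the actor names per bin."""
    bins = []  # each entry: [remaining capacity, [actor names]]
    # Place the largest utilizations first; ties keep their input order.
    for actor, u in sorted(utilization.items(), key=lambda kv: -kv[1]):
        for b in bins:
            if u <= b[0]:          # first bin with enough room
                b[0] -= u
                b[1].append(actor)
                break
        else:                      # no bin had room: open a new one
            bins.append([capacity - u, [actor]])
    return [names for _, names in bins]

def fits(work, z, num_cores):
    """True if first-fit decreasing packs all actors at arrival rate z."""
    utilization = {a: w * z for a, w in work.items()}
    if max(utilization.values()) > 1:
        return False               # one actor alone overloads a core
    return len(first_fit_decreasing(utilization)) <= num_cores

work = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20,
        "C_1": 20, "C_2": 20, "C_3": 20, "j1": 2, "D": 5}

print(fits(work, Fraction(1, 45), 3))   # feasible at a load of 45 per core
print(fits(work, Fraction(1, 44), 3))   # total work 134 exceeds 3 * 44
```

A search over z, for example a bisection between 0 and the system upper bound, could call `fits` repeatedly to find the largest arrival rate the heuristic can pack. That outer loop is left out to keep the sketch short.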
Summary (ASPLOS 2011)
Novel data transfer model
A simple quantitative analysis to detect and eliminate bottlenecks
A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
Results are within 5% of the optimal solution
Achieves a geometric mean speedup of 6.95x for 8 processors over single processor execution
Related Works
[1] Static Scheduling of SDF Programs for DSP [IEEE ’87]
[2] StreamIt: A language for streaming applications [Springer ’02]
[3] Phased Scheduling of Stream Programs [LCTES ’03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [ASPLOS ’06]
[5] Orchestrating the Execution of Stream Programs on Cell [PLDI ’08]
[6] Software Pipelined Execution of Stream Programs on GPUs [IEEE ’09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [IEEE ’09]
[8] An empirical characterization of stream programs and its implications for language and compiler design [PACT ’10]
Focus of Our Work
Stream Graph Scheduling
Stream Graph Partitioning
Layout on Target Architecture
Communication Scheduling
Linear Functional Equation Solver
Actor Allocation on Processors
Bottleneck Resolver
StreamIt Compiler Phases
Actor Allocation Constraint
Actors with their utilizations
Each core has 100% utilization
[Figure: actors 1 to n, each with its utilization, packed onto cores of 100% capacity]
Actor Allocation of Bottleneck Free Program
Efficient bottleneck resolving yields the stream graph (actor, work): A 5, s1 2, B_1 20, B_2 20, B_3 20, C_1 20, C_2 20, C_3 20, j1 2, D 5
Mapping
Core 1: A 5, B_1 20, C_1 20 (Load = 45)
Core 2: s1 2, B_2 20, j1 2, C_2 20 (Load = 44)
Core 3: B_3 20, C_3 20, D 5 (Load = 45)
Makespan = 45, Speedup = 130/45 = 2.89
Experiments
Our method is implemented as an extension of the StreamIt compiler
We compare to the ILP-based method [Scott 08] (solved with CPLEX)
Hardware setup
  2.33 GHz dual quad-core Intel Xeon processors
  16 GB memory
  Linux kernel version 2.6.23
Profiler uses the x86-64’s hardware cycle counters
Experiments Contd.
Experimental process
  Profiling
  Computing closed form
  Resolve bottlenecks
  Compute the mapping
  Compute the layout scheduling
  Invoke the StreamIt back end
  Finally we measure the performance
Experiments Contd.
Benchmark         Actors  Stateful
DCT                   22        18
FMRadio               67        23
TDE                   55        27
FFT                   26        14
MergeSort             31         2
FilterBankNew         53        34
RadixSort             13         2
EqualizerProgram      65        22
BitonicSort          452         2
DES                  375       180
MPEG                  39         7
MatrixMult            52         2
Experimental Results for 2 – 4 Processors
Benchmark    Proc#  ILP Time (Optimal) (s)
DCT              2                    0.27
DCT              3                 1585.69
DCT              4                 2285.01
FMRadio          2                    0.08
FMRadio          3                    3.22
FMRadio          4                    1.29
TDE              2                    0.09
TDE              3                    0.17
TDE              4                  274.69
FFT              2                    0.46
FFT              3                44694.25
FFT              4               249240.09
Equalizer        2                    0.06
Equalizer        3                    5.29
Equalizer        4                57553.83
BitonicSort      2                    0.3
BitonicSort      3                    3.06
BitonicSort      4                16371.99
DES              2                    0.51
DES              3                    2.73
DES              4                   11.24
MPEG             2                    0.12
MPEG             3                    1.37
MPEG             4                    0.44
Our method’s run time: <1s
Experimental Results for 2 – 4 Processors
Benchmark    Proc#  Arrival rate ratio (Apx/Optimal)  Apx. Arrival Rate (MBps)
DCT              2                              0.99                     42.56
DCT              3                              0.97                     62.46
DCT              4                              0.96                     82.57
FMRadio          2                              1.00                      2.69
FMRadio          3                              1.00                      4.00
FMRadio          4                              0.99                      5.34
TDE              2                              1.00                     13.31
TDE              3                              1.00                     19.80
TDE              4                              1.00                     28.81
FFT              2                              0.98                     43.91
FFT              3                              0.98                     46.85
FFT              4                              0.95                     95.50
Equalizer        2                              1.00                      0.56
Equalizer        3                              1.00                      0.83
Equalizer        4                              1.00                      1.59
BitonicSort      2                              1.00                      3.16
BitonicSort      3                              1.00                      4.73
BitonicSort      4                              1.00                     10.14
DES              2                              1.00                      0.12
DES              3                              1.00                      0.18
DES              4                              1.00                      0.24
MPEG             2                              1.00                     36.59
MPEG             3                              0.99                     54.68
MPEG             4                              1.00                     73.22
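The arrival rate ratios reported for 2 to 4 processors can be checked against the claim that the approximation stays within 5% of the optimum. The snippet below is only a convenience check over the values copied from the results. The variable names are illustrative.

```python
PROCS = (2, 3, 4)

# Arrival rate ratio (approximation / optimal) per benchmark for 2, 3, 4 processors.
ratios = {
    "DCT":         [0.99, 0.97, 0.96],
    "FMRadio":     [1.00, 1.00, 0.99],
    "TDE":         [1.00, 1.00, 1.00],
    "FFT":         [0.98, 0.98, 0.95],
    "Equalizer":   [1.00, 1.00, 1.00],
    "BitonicSort": [1.00, 1.00, 1.00],
    "DES":         [1.00, 1.00, 1.00],
    "MPEG":        [1.00, 0.99, 1.00],
}

# Smallest ratio over all benchmarks and processor counts.
worst_ratio, worst_bench, worst_procs = min(
    (r, name, p) for name, rs in ratios.items() for r, p in zip(rs, PROCS)
)
gap_percent = round((1 - worst_ratio) * 100)

print(worst_bench, worst_procs, worst_ratio, gap_percent)
```

The largest gap in these results occurs for FFT on 4 processors.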
Summary
Approximation algorithm for solving actor allocation problem
Data rate transfer model that resolves bottlenecks
We separate the bottleneck elimination from the actor allocation
We implemented our approach and compared with an optimal approach
Optimal approach has unpredictable time
Our approach has negligible time for all benchmarks
Quality of our approach is at most 5% off the optimum
For up to 8 processors we achieve a geometric mean speedup of 6.95x over single processor execution