Auto-tuning Performance on Multicore Computers
Ph.D. Dissertation Talk
Samuel Williams (samw@cs.berkeley.edu)
Committee: David Patterson (advisor and chair), Kathy Yelick, Sara McMains
Parallel Computing Laboratory (Berkeley Par Lab), Electrical Engineering and Computer Sciences, UC Berkeley
Multicore Processors
Superscalar Era
- Single-thread performance scaled at ~50% per year.
- Bandwidth increased much more slowly, but we could add additional bits or channels.
- Lack of diversity in architecture = lack of individual tuning.
- The power wall has capped single-thread performance.
[Figure: log(performance) vs. year (1990-2005), comparing processor performance (~50%/year) with memory bandwidth and latency, which improve far more slowly (annotations of ~25%, ~12%, and ~7% per year); after the power wall, single-thread growth flattens toward 0%/year.]
Multicore is the agreed solution.
But there is no agreement on what it should look like.
Multicore Era (the Intel / AMD model)
- Take existing power-limited superscalar processors, and add cores.
- Serial code shows no improvement.
- Parallel performance might improve at 25% per year = the improvement in power efficiency from process technology.
- DRAM bandwidth is currently only improving by ~12% per year.
[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +25%/year (parallel), ~12%/year (DRAM bandwidth).]
Multicore Era (Niagara, Cell, GPUs)
- A radically different approach: give up superscalar out-of-order cores in favor of small, efficient, in-order cores.
- Huge, abrupt, shocking drop in single-thread performance.
- May eliminate the power wall, allowing many cores and performance to scale at up to 40% per year.
- Troublingly, the number of DRAM channels may need to double every three years.
[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]
Multicore Era (reduced per-core performance)
- Another option would be to reduce per-core performance (and power) every year.
- Graceful degradation in serial performance.
- Bandwidth is still an issue.
[Figure: log(performance) vs. year (2003-2009): -12%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]
Multicore Era (architectural diversity)
There are still many other architectural knobs:
- Superscalar? Dual issue? VLIW?
- Pipeline depth
- SIMD width?
- Multithreading (vertical, simultaneous)?
- Shared caches vs. private caches?
- FMA vs. MUL+ADD vs. MUL or ADD
- Clustering of cores
- Crossbars vs. rings vs. meshes
Currently, there is no consensus on the optimal configuration. As a result, there is a plethora of multicore architectures.
Computational Motifs
- Evolved from Phil Colella's "Seven Dwarfs" of parallel computing: numerical methods common throughout scientific computing.
  - Dense and sparse linear algebra
  - Computations on structured and unstructured grids
  - Spectral methods
  - N-body particle methods
  - Monte Carlo
- Within each dwarf, there are a number of computational kernels.
- The Berkeley View, and subsequently the Par Lab, expanded these to many other domains in computing (embedded, SPEC, databases, games): graph algorithms, combinational logic, finite state machines, etc. They were rechristened "computational motifs."
- Each could be black-boxed into libraries or frameworks by domain experts. But how do we get good performance given the diversity of architectures?
Auto-tuning
Auto-tuning (motivation)
- Given the huge diversity of processor architectures, code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.
- We want a single code base that delivers performance portability across the breadth of architectures today and into the future.
- Auto-tuners are composed of two principal components: a code generator based on high-level functionality rather than parsing C, and the auto-tuner proper, which searches for the optimal parameters for each optimization.
- Auto-tuners don't invent or discover optimizations; they search through the parameter space of a variety of known optimizations.
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
Auto-tuning (code generation)
- The code generator produces many code variants for some numerical kernel using known optimizations and transformations. For example:
  - Cache blocking adds several parameterized loop nests.
  - Prefetching adds parameterized intrinsics to the code.
  - Loop unrolling and reordering explicitly unrolls the code; each unrolling is a unique code variant.
- Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.
- The code generators used in this work are kernel-specific, and were written in Perl.
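As an illustrative sketch (not taken from the dissertation's Perl generators), the kind of parameterized C variant such a generator might emit for a simple loop could look like the following, where BLOCK and UNROLL are the tunable parameters the search later explores:

```c
/* Hypothetical example of one auto-generated code variant: a cache-blocked,
 * unrolled vector scale. BLOCK and UNROLL are tuning parameters; the
 * generator would emit one such variant per (BLOCK, UNROLL) pair, with the
 * unrolled body expanded to match UNROLL. */
#define BLOCK  1024   /* cache-blocking factor (tunable) */
#define UNROLL 4      /* unrolling depth (tunable)       */

void scale_variant(double *y, const double *x, double a, int n)
{
    for (int jj = 0; jj < n; jj += BLOCK) {              /* parameterized blocking loop */
        int jmax = (jj + BLOCK < n) ? jj + BLOCK : n;
        int j;
        for (j = jj; j + UNROLL <= jmax; j += UNROLL) {   /* explicitly unrolled body    */
            y[j + 0] = a * x[j + 0];
            y[j + 1] = a * x[j + 1];
            y[j + 2] = a * x[j + 2];
            y[j + 3] = a * x[j + 3];
        }
        for (; j < jmax; j++)                             /* cleanup for the remainder   */
            y[j] = a * x[j];
    }
}
```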
Auto-tuning (search)
In this work, we use two search techniques:
- Exhaustive: examine every combination of parameters for every optimization (often intractable).
- Heuristics: use knowledge of the architecture or algorithm to restrict the search space.
[Figure: the parameter space spanned by Optimization A and Optimization B, shown for each search technique.]
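A minimal sketch of what an exhaustive search harness looks like (the actual tuners in this work are Perl scripts; benchmark_variant() and the parameter ranges below are hypothetical):

```c
/* Hypothetical exhaustive-search harness: time every (block, unroll, prefetch)
 * combination and keep the best. benchmark_variant() is assumed to build and
 * run one code variant and return its performance in GFLOP/s. */
#include <stdio.h>

extern double benchmark_variant(int block, int unroll, int prefetch);

int main(void)
{
    const int blocks[]     = {128, 256, 512, 1024, 2048};
    const int unrolls[]    = {1, 2, 4, 8};
    const int prefetches[] = {0, 64, 128, 256};   /* prefetch distance in bytes */
    double best = 0.0;
    int best_b = 0, best_u = 0, best_p = 0;

    for (int b = 0; b < 5; b++)
        for (int u = 0; u < 4; u++)
            for (int p = 0; p < 4; p++) {
                double gflops = benchmark_variant(blocks[b], unrolls[u], prefetches[p]);
                if (gflops > best) {
                    best = gflops;
                    best_b = blocks[b]; best_u = unrolls[u]; best_p = prefetches[p];
                }
            }

    printf("best: block=%d unroll=%d prefetch=%d (%.2f GFLOP/s)\n",
           best_b, best_u, best_p, best);
    return 0;
}
```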
Overview
- Multicore SMPs
- The Roofline Model
- Auto-tuning LBMHD
- Auto-tuning SpMV
- Summary
- Future Work
Thesis Contributions
- Introduced the Roofline Model.
- Extended auto-tuning to the structured grid motif (specifically LBMHD).
- Extended auto-tuning to multicore: fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV.
- Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD.
- Discussed future directions in auto-tuning.
Multicore SMPs (Chapter 3)
Multicore SMP Systems
The four systems studied:
- Intel Xeon E5345 (Clovertown)
- AMD Opteron 2356 (Barcelona)
- Sun UltraSPARC T2+ T5140 (Victoria Falls)
- IBM QS20 (Cell Blade)
[Figure: block diagrams of the four SMPs. Clovertown: two quad-core sockets (pairs of cores sharing 4MB L2) on dual front-side buses (10.66 GB/s each) into an MCH with 4x64b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write). Barcelona: two quad-core sockets (512KB L2 per core, 2MB shared victim cache, SRI/crossbar) with 2x64b controllers to 667MHz DDR2 DIMMs (10.6 GB/s per socket), linked by HyperTransport (4 GB/s each direction). Victoria Falls: two multithreaded 8-core SPARC chips, each with a crossbar (179 GB/s / 90 GB/s), a 4MB shared L2 (16-way, 64b interleaved), 4 coherency hubs, and 2x128b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write per chip), with 8 x 6.4 GB/s links (one per hub per direction). Cell Blade: two Cell chips, each with a VMT PPE (512KB L2) and 8 SPEs (256KB local store, MFC) on an EIB ring, with XDR memory controllers to 512MB XDR DRAM (25.6 GB/s per chip), linked by BIF (<20 GB/s each direction).]
Multicore SMP Systems (conventional cache-based memory hierarchy)
[Figure: the same four block diagrams; the Xeon, Opteron, and UltraSPARC systems (and the Cell PPEs) use conventional hardware-managed cache hierarchies.]
Multicore SMP Systems (local store-based memory hierarchy)
[Figure: the same four block diagrams; each Cell SPE uses a software-managed 256KB local store with DMA (via its MFC) rather than a cache.]
Multicore SMP Systems (CMT = Chip Multithreading)
[Figure: the same four block diagrams; Victoria Falls' MT SPARC cores (and the Cell's VMT PPE) rely on chip multithreading.]
Multicore SMP Systems (peak double-precision FLOP rates)
[Figure: the same four block diagrams annotated with peak double-precision rates: 74.66 GFLOP/s (Clovertown), 73.60 GFLOP/s (Barcelona), 18.66 GFLOP/s (Victoria Falls), and 29.25 GFLOP/s (Cell Blade SPEs).]
Multicore SMP Systems (DRAM pin bandwidth)
[Figure: the same four block diagrams annotated with aggregate DRAM pin bandwidth: 21 GB/s read / 10 GB/s write (Clovertown), 21 GB/s (Barcelona), 42 GB/s read / 21 GB/s write (Victoria Falls), and 51 GB/s (Cell Blade).]
Multicore SMP Systems (Non-Uniform Memory Access)
[Figure: the same four block diagrams; Barcelona, Victoria Falls, and the Cell Blade attach DRAM to each socket or chip, so memory access is non-uniform, whereas Clovertown accesses all memory uniformly through the MCH.]
The Roofline Model (Chapter 4)
Memory Traffic
- Total bytes to/from DRAM.
- Can be categorized into: compulsory misses, capacity misses, conflict misses, write allocations, ...
- Oblivious to a lack of sub-cache-line spatial locality.
Arithmetic Intensity
- For the purposes of this talk, we'll deal with floating-point kernels.
- Arithmetic Intensity (AI) ~ Total FLOPs / Total DRAM bytes; this includes cache effects.
- Many interesting problems have constant AI (with respect to problem size), which is bad given slowly increasing DRAM bandwidth.
- Bandwidth and traffic are the key optimizations.
[Figure: a spectrum of arithmetic intensity, from O(1) (SpMV, BLAS 1 and 2; stencils/PDEs; lattice methods) through O(log N) (FFTs) to O(N) (dense linear algebra / BLAS3; particle methods).]
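A quick worked example of constant arithmetic intensity (my own, not from the slides), using DAXPY, y[i] += a*x[i]: each element costs 2 FLOPs and moves three doubles to or from DRAM (read x and y, write y), so

```latex
\mathrm{AI}_{\mathrm{DAXPY}} = \frac{2\ \mathrm{FLOPs}}{3 \times 8\ \mathrm{bytes}} \approx 0.083\ \mathrm{FLOPs/byte}, \quad \text{independent of } N .
```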
Basic Idea
- Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insight as to which optimizations will potentially be beneficial.
AttainablePerformance_ij = min( FLOP/s with Optimizations_1..i , AI × Bandwidth with Optimizations_1..j )
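A minimal sketch of that bound as code (my own illustration; the peak and bandwidth values below are placeholders, not measurements from the dissertation):

```c
#include <stdio.h>

/* Roofline bound: attainable GFLOP/s is the lesser of the in-core compute
 * ceiling and the bandwidth ceiling scaled by arithmetic intensity. */
static double roofline(double peak_gflops, double stream_gbps, double ai)
{
    double compute_bound = peak_gflops;        /* after in-core optimizations 1..i   */
    double memory_bound  = ai * stream_gbps;   /* after bandwidth optimizations 1..j */
    return (compute_bound < memory_bound) ? compute_bound : memory_bound;
}

int main(void)
{
    /* placeholder machine: 73.6 GFLOP/s peak DP, 16 GB/s sustained bandwidth */
    for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
        printf("AI = %6.3f  ->  %6.2f GFLOP/s attainable\n",
               ai, roofline(73.6, 16.0, ai));
    return 0;
}
```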
Constructing a Roofline Model (computational ceilings)
- Plot on a log-log scale. Given AI, we can easily bound performance.
- But architectures are much more complicated: we will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: roofline for the Opteron 2356 (Barcelona) — attainable GFLOP/s (0.5 to 256) vs. actual FLOP:byte ratio (1/8 to 16), bounded by peak DP and Stream bandwidth.]
Constructing a Roofline Model (computational ceilings)
- Opterons have dedicated multipliers and adders; if the code is dominated by adds, then attainable performance is half of peak.
- We call these ceilings. They act like constraints on performance.
[Figure: the Barcelona roofline with a "mul / add imbalance" ceiling drawn below peak DP.]
Constructing a Roofline Model (computational ceilings)
- Opterons have 128-bit datapaths; if instructions aren't SIMDized, attainable performance will be halved.
[Figure: the Barcelona roofline with a "w/out SIMD" ceiling added below the mul/add imbalance ceiling.]
Constructing a Roofline Model (computational ceilings)
- On Opterons, floating-point instructions have a 4-cycle latency; if we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: the Barcelona roofline with a "w/out ILP" ceiling added below the SIMD ceiling.]
Constructing a Roofline Model (communication ceilings)
- We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: the Barcelona roofline with the Stream bandwidth diagonal highlighted.]
Constructing a Roofline Model (communication ceilings)
- Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: the Barcelona roofline with a "w/out SW prefetch" bandwidth ceiling below Stream bandwidth.]
Constructing a Roofline Model (communication ceilings)
- Opterons are NUMA; as such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.
[Figure: the Barcelona roofline with a "w/out NUMA" bandwidth ceiling below the prefetch ceiling.]
Constructing a Roofline Model (computation + communication)
- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: the complete Barcelona roofline with both the computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (Stream bandwidth, w/out SW prefetch, w/out NUMA).]
Constructing a Roofline Model (locality walls)
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.
- Including each additional component of traffic lowers the effective AI:
  AI = FLOPs / Compulsory Misses
  AI = FLOPs / (Write Allocations + Compulsory Misses)
  AI = FLOPs / (Capacity Misses + Write Allocations + Compulsory Misses)
  AI = FLOPs / (Conflict Misses + Capacity Misses + Write Allocations + Compulsory Misses)
[Figure: the Barcelona roofline with vertical locality walls marking the AI obtained with only compulsory-miss traffic, then with write-allocation, capacity-miss, and conflict-miss traffic successively added.]
Roofline Models for SMPs
- Note: the multithreaded Niagara (Victoria Falls) is limited by the instruction mix (ceilings at 25%, 12%, and 6% FP) rather than by a lack of expressed in-core parallelism.
- Clearly, some architectures are more dependent on bandwidth optimizations while others depend on in-core optimizations.
[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), UltraSPARC T2+ T5140 (Victoria Falls), QS20 Cell Blade PPEs, and Cell Blade SPEs. Each shows its own computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or % FP instruction mix) and bandwidth ceilings (Stream bandwidth, bandwidth on small vs. large datasets, w/out SW prefetch, w/out NUMA, misaligned DMA).]
Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD) (Chapter 6)
Introduction to Lattice Methods
- Structured grid code, with a series of time steps.
- Popular in CFD; allows for complex boundary conditions.
- No temporal locality between points in space within one time step.
- Higher-dimensional phase space.
- Simplified kinetic model that maintains the macroscopic quantities: distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct macroscopic quantities.
- Significant memory capacity requirements.
[Figure: a 3-D lattice with 27 numbered velocity directions (0-26) at a point in space.]
LBMHD (general characteristics)
- Plasma turbulence simulation.
- Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
- Must read 73 doubles and update 79 doubles per point in space.
- Requires about 1300 floating-point operations per point in space.
- Just over 1.0 FLOPs/byte (ideal).
[Figure: the 27-component momentum distribution lattice, the 15-component magnetic distribution lattice, and the macroscopic variables at a point.]
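To make the ideal figure concrete (my arithmetic, using only the counts above and the write-allocate behavior discussed with the Roofline analysis later):

```latex
\mathrm{AI}_{\mathrm{ideal}} \approx \frac{1300\ \mathrm{FLOPs}}{(73+79)\times 8\ \mathrm{bytes}} \approx 1.07,
\qquad
\mathrm{AI}_{\mathrm{write\text{-}allocate}} \approx \frac{1300\ \mathrm{FLOPs}}{(73+79+79)\times 8\ \mathrm{bytes}} \approx 0.70
```

which is the 0.7 vs. 1.0 split quoted for LBMHD in the Roofline analysis below.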
LBMHD (implementation details)
- Data structure choices:
  - Array of structures: no spatial locality, strided access.
  - Structure of arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization:
  - The Fortran version used MPI to communicate between nodes = a bad match for multicore.
  - The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language); MPI is not used when auto-tuning.
- Two problem sizes: 64^3 (~330MB) and 128^3 (~2.5GB).
SOA Memory Access Pattern
- Consider a simple D2Q9 lattice method using SOA: there are 9 read arrays and 9 write arrays, but all accesses are unit stride.
- LBMHD has 73 read and 79 write streams per thread.
[Figure: the nine D2Q9 velocity directions, each mapped to its own read_array[ ][ ] and write_array[ ][ ] plane along the x dimension.]
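A minimal sketch of the two layouts for a D2Q9-style lattice (illustrative only; the names and the trivial update are placeholders, not the LBMHD code):

```c
#define NV 9     /* velocities per point (D2Q9) */
#define NX 64
#define NY 64

/* Array of structures: the 9 components of one point are contiguous, so
 * sweeping one component across the grid is a stride-9 access. */
struct point_aos { double f[NV]; };
struct point_aos grid_aos[NY][NX];

/* Structure of arrays: one array per component, giving 9 unit-stride read
 * streams and 9 unit-stride write streams per thread. */
double f_read [NV][NY][NX];
double f_write[NV][NY][NX];

/* placeholder update touching every component of every point */
void sweep_soa(void)
{
    for (int v = 0; v < NV; v++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++)
                f_write[v][y][x] = f_read[v][y][x];  /* stand-in for collision/stream */
}

int main(void) { sweep_soa(); return 0; }
```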
Roofline Models for SMPs (LBMHD)
- LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write allocate.
- MUL / ADD imbalance.
- Some architectures will be bandwidth-bound, while others are compute-bound.
[Figure: the five rooflines (Clovertown, Barcelona, Victoria Falls, Cell PPEs, Cell SPEs) with LBMHD's arithmetic intensity marked against each machine's ceilings.]
LBMHD Performance (reference implementation)
- The standard cache-based implementation can be easily parallelized with pthreads.
- NUMA is implicitly exploited.
- Although scalability looks good, is performance?
[Figure: performance vs. concurrency and problem size for the reference implementation on each system.]
LBMHD Performance (reference implementation vs. Roofline)
- Superscalar performance is surprisingly good given the complexity of the memory access pattern.
- Cell PPE performance is abysmal.
[Figure: reference LBMHD performance plotted on the five machines' rooflines.]
LBMHD Performance (array padding)
- LBMHD touches >150 arrays; most caches have limited associativity, so conflict misses are likely.
- Apply a heuristic to pad arrays.
LBMHD Performance (vectorization)
- LBMHD touches >150 arrays; most TLBs have far fewer than 128 entries.
- The vectorization technique creates a vector of points that are being updated; loops are interchanged and strip-mined.
- Exhaustively search for the optimal "vector length" that balances page locality with L1 cache misses.
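A schematic of the transformation (illustrative C, not the actual LBMHD tuner output; VL stands in for the tunable vector length):

```c
/* Before: for each point, loop over all ~150 component arrays (one TLB page
 * touched per array per point). After strip-mining and loop interchange, a
 * strip of VL consecutive points is processed per array before moving on to
 * the next array, so each page is reused VL times. VL is chosen by search. */
void update_strip_mined(double **comp, int ncomp, int npoints, int VL)
{
    for (int ii = 0; ii < npoints; ii += VL) {           /* strip loop                    */
        int imax = (ii + VL < npoints) ? ii + VL : npoints;
        for (int c = 0; c < ncomp; c++)                   /* interchanged: component outer */
            for (int i = ii; i < imax; i++)               /* vector of points inner        */
                comp[c][i] *= 0.5;                        /* placeholder update            */
    }
}
```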
LBMHD Performance (other optimizations)
- Heuristic-based software prefetching.
- Exhaustive search for low-level optimizations:
  - loop unrolling/reordering
  - SIMDization
  - cache bypass (increases arithmetic intensity by 50%; see the sketch after this list)
  - small TLB pages on Victoria Falls
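On x86 the cache bypass is typically realized with non-temporal (streaming) stores, which write full cache lines to DRAM without first reading them in. A minimal sketch (my illustration, not the dissertation's generated code):

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_pd */

/* Scaled copy that bypasses the cache on the store side. _mm_stream_pd
 * avoids the write-allocate read of dst, cutting DRAM traffic from three
 * doubles per element (read src, read dst, write dst) to two — a 50% gain
 * in arithmetic intensity. Assumes dst is 16-byte aligned and n is even. */
void scaled_copy_nt(double *dst, const double *src, double a, int n)
{
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(va, _mm_loadu_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);   /* non-temporal store: no write allocate */
    }
    _mm_sfence();                    /* make streaming stores globally visible */
}
```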
LBMHD Performance (Cell SPE implementation)
- We can write a local store implementation and run it on the Cell SPEs.
- Ultimately, Cell's weak double-precision FPU hampers performance.
LBMHD Performance (speedup for the largest problem)
[Figure: auto-tuned speedups over the reference implementation: 1.6x (Clovertown), 4x (Barcelona), 3x (Victoria Falls), and 130x (Cell Blade, whose reference ran only on the PPEs).]
LBMHD Performance (vs. Roofline)
- Most architectures reach their roofline-bound performance.
- Clovertown's snoop filter is ineffective.
- Niagara suffers from instruction mix issues.
- Cell PPEs are latency-limited; Cell SPEs are compute-bound.
[Figure: auto-tuned LBMHD performance plotted on the five machines' rooflines.]
LBMHD Performance (summary)
- The reference code is clearly insufficient.
- Portable C code is insufficient on Barcelona and Cell.
- Cell gets all its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.
Auto-tuning Sparse Matrix-Vector Multiplication (Chapter 8)
Sparse Matrix-Vector Multiplication
- Sparse matrix: most entries are 0.0; there is a performance advantage in only storing and operating on the nonzeros, but this requires significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance.
[Figure: y = Ax with a sparse A and dense vectors x and y.]
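For reference, a plain CSR (compressed sparse row) SpMV kernel of the kind the reference implementation starts from (a generic textbook version, not the dissertation's exact code):

```c
/* y = A*x with A stored in CSR: row_ptr has nrows+1 entries, col_idx/vals
 * hold the nonzeros row by row. Each nonzero costs one multiply-add plus an
 * indexed (irregular) load of x[col_idx[k]], which is why AI is so low. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[r] = sum;
    }
}
```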
Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Subdivided them into 4 categories; rank ranges from 2K to 1M.
- Dense: a 2K x 2K dense matrix stored in sparse format.
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship.
- Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase.
- Extreme aspect ratio (linear programming): LP.
Roofline Models for SMPs (SpMV)
- The reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
- The best we can hope for is an AI of 0.25 for non-symmetric matrices.
- All architectures are memory-bound, but some may need in-core optimizations.
[Figure: the five rooflines with SpMV's arithmetic intensity marked against each machine's ceilings.]
SpMV Performance (reference implementation)
- The reference implementation is CSR.
- Simple parallelization by rows, balancing nonzeros.
- No implicit NUMA exploitation.
- Despite the superscalars' use of 8 cores, they see little speedup.
- Niagara and the PPEs show near-linear speedups.
[Figure: performance vs. matrix for the reference implementation on each system.]
SpMV Performance (reference implementation vs. Roofline)
- Roofline shown for the dense matrix in sparse format.
- The superscalars achieve bandwidth-limited performance.
- Niagara comes very close to its bandwidth limit.
- Clearly, NUMA and prefetching will be essential.
[Figure: reference SpMV performance plotted on the five machines' rooflines.]
SpMV Performance (NUMA and software prefetching)
- NUMA-aware allocation is essential on memory-bound NUMA SMPs.
- Explicit software prefetching can boost bandwidth and change cache replacement policies.
- The Cell PPEs are likely latency-limited.
- Used exhaustive search.
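A sketch of the two ideas (my illustration; the first-touch initialization and the PF_DIST constant are simplified placeholders, not the dissertation's routines):

```c
#include <stddef.h>

/* NUMA: with a first-touch page-placement policy, having each (pinned)
 * thread initialize the rows it will later process places those pages in the
 * memory attached to that thread's socket. Pinning code omitted. */
void first_touch_rows(double *vals, const size_t *my_lo, const size_t *my_hi, int tid)
{
    for (size_t k = my_lo[tid]; k < my_hi[tid]; k++)
        vals[k] = 0.0;                     /* touch = page lands near this thread */
}

/* Software prefetching: request nonzeros PF_DIST elements ahead so DRAM
 * latency overlaps the current multiply-adds. PF_DIST would be one of the
 * exhaustively searched tuning parameters. */
#define PF_DIST 64
void spmv_row_prefetched(int lo, int hi, const int *col_idx,
                         const double *vals, const double *x, double *y, int r)
{
    double sum = 0.0;
    for (int k = lo; k < hi; k++) {
        __builtin_prefetch(&vals[k + PF_DIST], 0, 0);     /* GCC builtin: read, non-temporal hint */
        __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);
        sum += vals[k] * x[col_idx[k]];
    }
    y[r] = sum;
}
```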
SpMV Performance (matrix compression)
- After maximizing memory bandwidth, the only remaining hope is to minimize memory traffic.
- Exploit: register blocking, other formats, smaller indices.
- Use a traffic-minimization heuristic rather than search.
- The benefit is clearly matrix-dependent.
- Register blocking enables efficient software prefetching (one prefetch per cache line).
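To illustrate register blocking (a generic 2x2 BCSR kernel, not the dissertation's generated code): the matrix is stored as dense r x c blocks, so one column index covers r*c values, and smaller index types shrink the metadata further.

```c
/* y = A*x with A in 2x2 BCSR: each block row covers two rows of A, each
 * stored block is a dense 2x2 tile (4 values) sharing a single column index,
 * so index traffic per nonzero is a quarter of CSR's. */
void spmv_bcsr_2x2(int nblockrows, const int *brow_ptr, const int *bcol_idx,
                   const double *bvals, const double *x, double *y)
{
    for (int br = 0; br < nblockrows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = brow_ptr[br]; b < brow_ptr[br + 1]; b++) {
            const double *blk = &bvals[4 * b];   /* row-major 2x2 tile */
            int c = 2 * bcol_idx[b];
            y0 += blk[0] * x[c] + blk[1] * x[c + 1];
            y1 += blk[2] * x[c] + blk[3] * x[c + 1];
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```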
SpMV Performance (cache and TLB blocking)
- Based on limited architectural knowledge, create a heuristic to choose a good cache and TLB block size.
- Hierarchically store the resultant blocked matrix.
- The benefit can be significant on the most challenging matrix.
SpMV Performance (Cell SPE implementation)
- Cache blocking can be easily transformed into local store blocking.
- With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell: BCOO only, register blocks of 2x1 and larger, always blocked.
SpMV Performance (median speedup)
[Figure: median auto-tuned speedups over the reference implementation: 2.8x (Clovertown), 2.6x (Barcelona), 1.4x (Victoria Falls), and 15x (Cell Blade).]
SpMV Performance (vs. Roofline)
- Roofline shown for the dense matrix in sparse format.
- Compression improves AI.
- Auto-tuning allows us to slightly exceed Stream bandwidth (but not pin bandwidth).
- The Cell PPEs perennially deliver poor performance.
[Figure: auto-tuned SpMV performance plotted on the five machines' rooflines.]
SpMV Performance (summary)
Median SpMV performance aside:
- Unlike LBMHD, SSE was unnecessary to achieve performance.
- Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
- Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Summary
- Introduced the Roofline Model: apply bound and bottleneck analysis; performance and the requisite optimizations are inferred visually.
- Extended auto-tuning to multicore: fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV.
- Auto-tuning LBMHD and SpMV:
  - Multicore has had a transformative effect on auto-tuning (the move from latency-limited to bandwidth-limited).
  - Maximizing memory bandwidth and minimizing memory traffic is key.
  - Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues; a library or framework is a necessity in managing these issues.
- Comments on architecture:
  - Ultimately, machines are bandwidth-limited without new algorithms.
  - Architectures with caches required significantly more tuning than the local store-based Cell.
Future Directions in Auto-tuning (Chapter 9)
Future Work (Roofline)
- Automatic generation of Roofline figures: kernel-oblivious; select the computational metric of interest; select the communication channel of interest; designate common "optimizations"; requires a benchmark.
- Using performance counters to generate runtime Roofline figures: given a real kernel, we wish to understand the bottlenecks to performance; a much friendlier visualization of performance counter data.
Future Work (making search tractable)
- Given the explosion in optimizations, exhaustive search is clearly not tractable.
- Moreover, heuristics require extensive architectural knowledge.
- In our SC08 work, we tried a greedy approach (one optimization at a time).
- We could make it iterative, or we could make it look like steepest descent (with some local search).
[Figure: the parameter space spanned by Optimization A and Optimization B, contrasting the points visited by exhaustive search with those visited by a restricted search.]
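A minimal sketch of the greedy, one-optimization-at-a-time idea (my illustration; the parameter layout and benchmark_config() are hypothetical):

```c
/* Greedy search: tune one optimization's parameter at a time, holding the
 * others at their current best, instead of enumerating the full cross
 * product. benchmark_config() is assumed to time one configuration. */
extern double benchmark_config(const int *params, int nparams);

/* choices is laid out as nparams rows of up to 8 candidate values each;
 * nchoices[p] (<= 8) gives how many are valid for parameter p. */
void greedy_tune(int *best, const int *choices, const int *nchoices, int nparams)
{
    for (int p = 0; p < nparams; p++) {            /* one optimization at a time */
        double best_perf = -1.0;
        int best_choice = best[p];
        for (int c = 0; c < nchoices[p]; c++) {    /* sweep only this parameter  */
            best[p] = choices[p * 8 + c];
            double perf = benchmark_config(best, nparams);
            if (perf > best_perf) { best_perf = perf; best_choice = best[p]; }
        }
        best[p] = best_choice;                     /* lock in the winner, move on */
    }
}
```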
Future Work (auto-tuning motifs)
- We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner.
- However, we should strive for motif-wide auto-tuning. Moreover, we want to decouple the data type (e.g. double precision) from the parallelization structure.
- 1. A motif description or pattern language for each motif: e.g. a taxonomy of structured grids plus a code snippet for the stencil; the auto-tuner parses these and produces optimized code.
- 2. A series of DAG rewrite rules for each motif. Rules allow: insertion of additional nodes, duplication of nodes, and reordering.
Rewrite Rules
- Consider SpMV: in floating point, each node in the DAG is a multiply-accumulate (MAC).
- The DAG makes locality explicit (e.g. local store blocking).
- BCSR adds zeros to the DAG.
- We can cut edges and reconnect them to enable parallelization.
- We can reorder operations.
- Any other data type / node type conforming to these rules can reuse all our auto-tuning effort.
[Figure: the SpMV dataflow DAG for y = Ax, shown before and after rewrite rules are applied.]
Acknowledgments
- Berkeley Par Lab: thesis committee (David Patterson, Kathy Yelick, Sara McMains); the BeBOP group (Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, etc.); the rest of the Par Lab.
- Lawrence Berkeley National Laboratory: the FTG group (Lenny Oliker, John Shalf, Jonathan Carter, ...).
- Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Julich, Intel.
- This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding from U.C. Discovery through award #DIG07-10227.
Questions?