
Page 1: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Performance Optimization and Auto-tuning

Samuel Williams1

Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2, James Demmel1,2


1Lawrence Berkeley National Laboratory

2University of California Berkeley 3Georgia Institute of Technology

[email protected]

Page 2: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Outline

- Fundamentals
- Software Solutions to Challenges
  - Sequential Programming Model
  - Shared Memory World
  - Message Passing World
- Optimizing across an evolving hardware base
- Automating Performance Tuning (auto-tuning)
  - Example 1: LBMHD
  - Example 2: SpMV
- Summary

2

Page 3: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Fundamentals

3

Page 4: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Performance Optimization: Contending Forces

- Contending forces of Efficiency and Computational Complexity
- We improve time to solution by improving throughput (efficiency: Gflop/s, GB/s, etc.) and by reducing computational complexity (flops, GBs moved, etc.)

4

Page 5: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Performance Optimization: Contending Forces

- Contending forces of Efficiency and Computational Complexity
- We improve time to solution by improving throughput (efficiency) and by reducing computational complexity
- In practice, we're willing to sacrifice one in order to improve the time to solution

5

[Figure: restructure to satisfy Little's Law; implementation & algorithmic optimization]

Page 6: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Basic Efficiency Quantities

- At all levels of the system (register files through networks), there are three fundamental (efficiency-oriented) quantities:
  - Latency: every operation requires time to execute (e.g., instruction, memory, or network latency)
  - Bandwidth: number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network)
  - Concurrency: total number of operations in flight

6

Page 7: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Little’s Law

  Little’s Law relates these three: Concurrency = Latency * Bandwidth - or - Effective Throughput = Expressed Concurrency / Latency

  This concurrency must be filled with parallel operations   Can’t exceed peak throughput with superfluous concurrency.

(each channel has a maximum throughput)

7
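As a concrete illustration of Little's Law applied to a memory channel, here is a minimal sketch; the latency and bandwidth numbers are hypothetical, not taken from the slides.

/* Little's Law sketch: concurrency needed to saturate a memory channel.
   Hypothetical numbers, not taken from the slides. */
#include <stdio.h>

int main(void) {
    double latency_s  = 100e-9;   /* 100 ns DRAM latency          */
    double bandwidth  = 10e9;     /* 10 GB/s sustained bandwidth  */
    double line_bytes = 64.0;     /* cache-line size              */

    double bytes_in_flight = latency_s * bandwidth;        /* concurrency in bytes */
    double lines_in_flight = bytes_in_flight / line_bytes; /* outstanding misses   */

    printf("bytes in flight: %.0f, cache lines in flight: %.1f\n",
           bytes_in_flight, lines_in_flight);
    return 0;
}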

Page 8: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Computational Complexity Quantities

- Complexity is often expressed in terms of:
  - #floating-point operations (flops)
  - #bytes moved (from registers, cache, DRAM, the network)
- Just as channels have throughput limits, kernels and algorithms can have lower bounds on complexity (traffic)

8

Page 9: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Architects, Mathematicians, Programmers

- Architects invent paradigms to improve (peak) throughput and to make Little's Law easier to satisfy
- Mathematicians invent new algorithms that improve performance by reducing (bottleneck) complexity or traffic
- As programmers, we must restructure algorithms and implementations to exploit these new features
- This often boils down to several key challenges:
  - management of data/task locality
  - management of data dependencies
  - management of communication
  - management of variable and dynamic parallelism

9

Page 10: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Software Solutions to Challenges

10

Page 11: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Challenges: Sequential

- Even with only one thread, there is parallelism due to pipelining and SIMD
- Data Dependencies:
  - HW (via out-of-order execution) manages the data dependencies arising from ILP
  - the user/compiler manages those arising from DLP (SIMDize only where possible)
- Data Locality:
  - compilers do a pretty good job of register-file locality
  - for consumer apps, caches hide the complexity of attaining good on-chip locality fairly well
  - for performance-critical HPC apps, however, working sets can be so large and unpredictable that caches do poorly; coupled with finite memory bandwidth, performance can suffer
  - cache block (reorder loops) or change data structures/types to improve arithmetic intensity
- Communication (limited to processor-DRAM):
  - modern architectures predominantly use HW stream prefetching to hide latency; structure memory access patterns into N unit-stride streams
- Variable/Dynamic Parallelism:
  - OOO processors can mitigate the complexity of variable parallelism (ILP/DLP) within the instruction stream, so long as it occurs within a window of a few dozen instructions

11

Page 12: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 12

Arithmetic Intensity

- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); on others it remains constant
- Arithmetic intensity is ultimately limited by compulsory traffic
- Arithmetic intensity is diminished by conflict or capacity misses

[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
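For intuition, a minimal sketch of computing AI for a DAXPY-like loop; this kernel is my own illustrative example, not one from the talk.

/* Arithmetic-intensity sketch for y[i] = a*x[i] + y[i] (double precision).
   Illustrative example; not taken from the slides.
   Per element: 2 flops (multiply + add).
   DRAM traffic per element: read x (8 B) + read y (8 B) + write y (8 B) = 24 B. */
double axpy_ai(void) {
    double flops = 2.0, bytes = 24.0;
    return flops / bytes;   /* ~0.083 flops/byte: firmly memory-bound */
}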

Page 13: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model

13

  Visualizes how bandwidth, compute, locality, and optimization bound performance.

  Based on application characteristics, one can infer what needs to be done to improve performance.

[Figure: Roofline model for an Opteron 2356 (Barcelona). Attainable GFlop/s (0.5 to 256) vs. actual flop:byte ratio (1/8 to 16). In-core ceilings: peak DP, mul/add imbalance, w/out ILP, w/out SIMD. Bandwidth ceilings: compulsory-miss traffic only, +write-allocation traffic, +capacity-miss traffic, +conflict-miss traffic]
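A minimal sketch of the roofline bound itself; the peak and bandwidth numbers below are placeholders, not the measured Opteron 2356 values.

/* Roofline sketch: attainable GFlop/s = min(peak compute, AI * peak bandwidth).
   The peak/bandwidth values are placeholders, not measurements from the talk. */
#include <stdio.h>

static double roofline(double ai, double peak_gflops, double peak_gbs) {
    double bw_bound = ai * peak_gbs;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 74.0, peak_gbs = 21.0;   /* placeholder machine */
    for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
        printf("AI %6.3f -> %6.2f GFlop/s attainable\n",
               ai, roofline(ai, peak_gflops, peak_gbs));
    return 0;
}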

Page 14: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Challenges: Shared Memory

- Multiple threads in a shared-memory environment
- Data Dependencies:
  - no inherent support for managing data dependencies
  - use bulk-synchronous execution (barriers), locks/semaphores, atomics
- Data Locality:
  - must manage NUMA and NUCA as well as on-/off-chip locality
  - use Linux affinity routines (and parallelize accordingly), cache block (reorder loops), or change data structures/types
- Communication:
  - message aggregation, concurrency throttling, etc.
  - ideally the HW/SW runtime (cache coherency / GASNet) should manage this; use collectives and/or memory copies in UPC
- Variable/Dynamic Parallelism:
  - variable TLP is a huge challenge; task queues (?)

14

Page 15: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Challenges: Message Passing

- Multiple processes in a message-passing environment
- Data Dependencies:
  - no inherent support for managing data dependencies
  - the user must express all data dependencies via send/recv/wait
- Data Locality:
  - no shared memory, so all communication is via MPI
  - the user manages partitioning/replication of data at the process level
- Communication:
  - message aggregation, concurrency throttling, etc.
  - ideally MPI should manage this; use collectives, larger messages, and limit the number of outstanding send/recv's
- Variable/Dynamic Parallelism:
  - no good solution for variable TLP

15

Page 16: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Coping with Diversity of Hardware

- There are dozens of processor and machine architectures in use today
- The best implementation of an algorithm depends on:
  - machine
  - data set
  - concurrency
  - machine load
  - ...
- Hand-optimizing each architecture/dataset combination is not feasible

16

Page 17: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Auto-tuning

17

Page 18: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Auto-tuning

- Automatic Performance Tuning (auto-tuning) is an empirical, feedback-driven technique designed to automate performance engineering
- Our auto-tuning approach finds a good solution by a combination of heuristics and exhaustive search:
  - a Perl script generates many possible kernels
  - (it can also generate SIMD-optimized kernels)
  - an auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
  - performance depends on the optimizations generated
  - heuristics are often desirable when the search space isn't tractable
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)

18
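A minimal sketch of the benchmark-and-select half of such a framework; the variant table and timing routine are hypothetical placeholders, not the authors' Perl/C infrastructure.

/* Auto-tuning driver sketch: time each generated variant, keep the fastest.
   kernel_variant_t, variants[], num_variants, and now_seconds() are
   hypothetical placeholders for code a generator would emit. */
#include <stddef.h>

typedef void (*kernel_variant_t)(void *args);
extern kernel_variant_t variants[];      /* generated kernel variants */
extern size_t           num_variants;
extern double           now_seconds(void);

size_t pick_best_variant(void *args, int trials) {
    size_t best = 0;
    double best_time = 1e30;
    for (size_t v = 0; v < num_variants; v++) {
        double t0 = now_seconds();
        for (int i = 0; i < trials; i++)
            variants[v](args);           /* empirical search: just run it */
        double t = (now_seconds() - t0) / trials;
        if (t < best_time) { best_time = t; best = v; }
    }
    return best;                         /* index of the fastest variant */
}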

Page 19: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 19

Auto-tuning

- Provides performance portability across the existing breadth and evolution of microprocessors
- The one-time, up-front productivity cost is amortized over the number of machines it's used on
- Auto-tuning does not invent new optimizations
- Auto-tuning automates the code generation and the exploration of the optimization and parameter space
- Two components:
  - a parameterized code generator (we wrote ours in Perl)
  - an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search)
- Can be extended with ISA-specific optimizations (e.g., DMA, SIMD)

Page 20: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Advancing the State of the Art in Optimizations

- Over the last 15 years, the set of optimizations available to auto-tuning has grown immensely:
  - Loop/Code Transformations (ATLAS, FFTW, SPIRAL, etc.)
  - Data Structure Transformations (OSKI, GTC, ...)
  - Parallelism (GTC, LBMHD, Stencils, PLASMA/MAGMA, etc.)
  - Algorithmic Parameters (FMM, KSM/Akx, MG V-cycle, ...)

20

Page 21: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Lattice Boltzmann Magnetohydrodynamics

(LBMHD) Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning", Supercomputing (SC), 2011.

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Applications Track

21

Page 22: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 22

LBMHD

- Lattice Boltzmann Magnetohydrodynamics (CFD + Maxwell's equations)
- Plasma turbulence simulation via the Lattice Boltzmann Method, for simulating astrophysical phenomena and fusion devices
- Three macroscopic quantities:
  - density
  - momentum (vector)
  - magnetic field (vector)
- Two distributions:
  - momentum distribution (27 scalar components)
  - magnetic distribution (15 Cartesian vector components)

[Figure: the momentum distribution (27 components), the magnetic distribution (15 components), and the macroscopic variables, each shown on the 3D lattice with +X, +Y, +Z axes]

Page 23: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 23

LBMHD

- Code structure: time evolution through a series of collision() and stream() functions
- stream():
  - performs a ghost-zone exchange of data to facilitate distributed-memory implementations as well as (periodic) boundary conditions
  - should constitute ~10% of the runtime
- collision()'s arithmetic intensity:
  - must read 73 doubles and update 79 doubles per lattice update (1216 bytes)
  - requires about 1300 floating-point operations per lattice update
  - just over 1.0 flops/byte (on an ideal architecture)
  - suggests LBMHD is memory-bound on the Cray XT4/XE6
- A structure-of-arrays layout (components are separated) ensures that cache-capacity requirements are independent of problem size
- However, the TLB capacity requirement increases to >150 entries

Page 24: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

LBMHD Stencil

- Simplified example: reading from 9 arrays and writing to 9 arrays
- The actual LBMHD collision() reads 73 arrays and writes 79 arrays

24

[Figure: (a) read_array[ ][ ] accessed at offsets (0,0), (±1,0), (0,±1), (±1,±1) along the x dimension; (b) write_array[ ][ ]]

Page 25: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

25

Auto-tuning LBMHD on Multicore SMPs

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.

Page 26: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 26

LBMHD Performance (reference implementation)

- Generally, scalability looks good
- But is performance good?

[Chart: collision()-only performance, reference implementation + NUMA]

Page 27: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Lattice-Aware Padding

- For a given lattice update, the requisite velocities map to a relatively narrow range of cache sets (lines)
- As one streams through the grid, one cannot fully exploit the capacity of the cache, as conflict misses evict entire lines
- In a structure-of-arrays format, pad each component such that, when referenced with the relevant offsets (±x, ±y, ±z), the accesses are uniformly distributed across the sets of the cache
- This maximizes cache utilization and minimizes conflict misses

27

Page 28: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

LBMHD Performance (lattice-aware array padding)

28

- LBMHD touches >150 arrays
- Most caches have limited associativity, so conflict misses are likely
- Apply a heuristic to pad the arrays

[Chart: Reference+NUMA vs. +Padding]

Page 29: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Vectorization

- A lattice method's collision() operator has two phases:
  - reconstruction of macroscopic variables
  - updating the discretized velocities
- Normally this is done one lattice point at a time
- Change it to do a vector's worth of points at a time (loop interchange + tuning)

29

Page 30: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 30

LBMHD Performance (architecture specific optimizations)

- Add unrolling and reordering of the inner loop
- Additionally, this exploits SIMD where the compiler doesn't
- Include an SPE / local-store optimized version

[Chart: collision()-only performance. Bars: Reference+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages]

Page 31: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 31

LBMHD Performance (architecture specific optimizations)

- Add unrolling and reordering of the inner loop
- Additionally, this exploits SIMD where the compiler doesn't
- Include an SPE / local-store optimized version

[Chart: same optimization stack; overall speedups of 1.6x, 4x, 3x, and 130x on the four platforms]

Page 32: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Limitations

- Ignored the MPP (distributed-memory) world
- Kept the problem size fixed and cubical
- When run with only one process per SMP, maximizing threads per process always looked best

32

Page 33: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

33

Auto-tuning LBMHD on Multicore MPPs

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning", Supercomputing (SC), 2011.

Page 34: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

34

MPI+Pthreads and MPI+OpenMP Implementations

Explored performance on 3 ultrascale machines, using 2048 nodes on each and running 1GB, 4GB, and (where possible) 16GB per-node problem sizes:
- IBM Blue Gene/P at Argonne (Intrepid): 8,192 cores
- Cray XT4 at NERSC (Franklin): 8,192 cores
- Cray XE6 at NERSC (Hopper): 49,152 cores

Page 35: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: flat MPI timeline. Processes 0-3 (one per core) each alternate collision() with pack()/MPI()/unpack() phases for the ghost-zone exchange]

Flat MPI

- In the flat MPI world, there is one process per core and only one thread per process
- All communication is through MPI

35

Page 36: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: hybrid timeline with two processes of two threads each (one thread per core). collision()/pack()/unpack() run on all threads, MPI() runs on thread 0 only, and barrier() calls separate the phases]

Hybrid MPI + Pthreads/OpenMP

- As multicore processors already provide cache coherency for free, we can exploit it to reduce MPI overhead and traffic
- We examine using pthreads and OpenMP for threading (other possibilities exist)
- For correctness with pthreads, we must include an intra-process (thread) barrier between function calls (we wrote our own)
- OpenMP barriers implicitly via the #pragma
- We can choose any balance between processes/node and threads/process
- In both the pthreads and OpenMP versions, only thread 0 performs MPI calls

36

[Figure: the same timeline with one process of four threads (cores 0-3)]
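A minimal sketch of the "only thread 0 calls MPI" pattern described above; the routine names are hypothetical placeholders, not the authors' code.

/* Hybrid MPI+OpenMP sketch: compute on all threads, communicate on thread 0.
   exchange_ghosts(), collision_block(), pack_block(), unpack_block() are
   hypothetical placeholders for the application's routines. */
#include <mpi.h>
#include <omp.h>

void exchange_ghosts(void);          /* posts MPI sends/recvs and waits    */
void collision_block(int tid, int nthreads);
void pack_block(int tid, int nthreads);
void unpack_block(int tid, int nthreads);

void timestep(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nt  = omp_get_num_threads();

        collision_block(tid, nt);    /* local computation, all threads      */
        pack_block(tid, nt);         /* fill MPI send buffers in parallel   */
        #pragma omp barrier          /* buffers complete before MPI         */

        #pragma omp master
        exchange_ghosts();           /* only thread 0 touches MPI           */
        #pragma omp barrier          /* wait for communication to finish    */

        unpack_block(tid, nt);       /* scatter received ghosts in parallel */
    }
}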

Page 37: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

The Distributed Auto-tuning Problem

- We believe that, even for relatively large problems, auto-tuning only the local computation (e.g., the IPDPS'08 work) will deliver sub-optimal MPI performance
- We want to explore the MPI/hybrid decomposition as well
- We have a combinatoric explosion in the search space, coupled with a large problem size (number of nodes)
- To remedy this, we employ a greedy search that:
  - determines the best single-node implementation (~ the IPDPS work)
  - explores the best parallel MPI decomposition among nodes and the best on-node programming model (8-64 nodes)
  - evaluates performance at scale (2048 nodes = 49,152 cores)

37

Page 38: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Stage 2

38

- In stage 2, we prune the MPI space
- Given a fixed memory footprint per node, explore the different ways of partitioning it among processes and threads:
  - "Flat MPI": 4 processes per node (no threading)
  - "Hybrid": 2 processes per node, 2 threads per process
  - "Hybrid": 1 process per node, 4 threads per process

[Figure: the three decompositions of one node into processes and threads]

[Chart: GFlop/s per core (0.00-2.25) for Flat MPI, MPI+OpenMP, and MPI+Pthreads on Intrepid (1GB), Franklin (1GB, 4GB), and Hopper (1GB, 4GB, 16GB)]

Page 39: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Stage 2

- Hybrid auto-tuning requires that we mimic the SPMD environment
- Suppose we wish to explore this color-coded optimization space
- In the serial world (or with fully threaded nodes), the tuning is easily run
- However, in the MPI or hybrid world a problem arises, because processes are not guaranteed to be synchronized
- As such, one process may execute some optimizations faster than others simply due to fortuitous scheduling against another process's trials
- Solution: add an MPI_Barrier() around each trial (a configuration with hundreds of iterations)

39

[Figure: per-process timelines of tuning trials on one node, with and without synchronization]
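A minimal sketch of barrier-synchronized trial timing as described above; the kernel and trial count are placeholders, not the authors' harness.

/* Sketch: synchronize all processes around each auto-tuning trial so one
   rank's timing is not skewed by another rank still running a different
   variant. run_variant() and NITER are hypothetical placeholders. */
#include <mpi.h>

void run_variant(int variant);            /* one tuning configuration   */
#define NITER 100                         /* iterations per trial       */

double time_trial(int variant, MPI_Comm comm) {
    MPI_Barrier(comm);                    /* start everyone together    */
    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++)
        run_variant(variant);
    MPI_Barrier(comm);                    /* wait for the slowest rank  */
    double t = MPI_Wtime() - t0;

    double tmax;                          /* report the global worst    */
    MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, comm);
    return tmax / NITER;
}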

Page 40: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

40

Results

Page 41: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Performance Results (using 2048 nodes on each machine)

- We present the best data for progressively more aggressive auto-tuning efforts
- Remember, Hopper has 6x as many cores per node as Intrepid or Franklin, so performance per node is far greater
- Auto-tuning can improve performance; ISA-specific optimizations (e.g., SIMD intrinsics) help even more
- Overall, we see speedups of up to 3.4x
- As the problem size increases, so too does performance; however, the value of threading is diminished

41

[Chart: GFlop/s per core (0.00-2.25) on Intrepid (1GB), Franklin (1GB, 4GB), and Hopper (1GB, 4GB, 16GB). Bars: reference, auto-tuned (portable C), auto-tuned (ISA-specific), threading of collision(), threading of stream()]

Page 42: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Performance Results (using 2048 nodes on each machine)

- We present the best data for progressively more aggressive auto-tuning efforts
- Remember, Hopper has 6x as many cores per node as Intrepid or Franklin, so performance per node is far greater
- Auto-tuning can improve performance; ISA-specific optimizations (e.g., SIMD intrinsics) help even more
- As the problem size increases, so too does performance; however, the value of threading is diminished
- For small problems, MPI time can dominate runtime on Hopper
- Threading mitigates this

42

[Chart: time relative to reference (split into collision() and stream()) for the Reference, Optimized, and Threaded versions on Intrepid (1GB), Franklin (1GB), and Hopper (1GB)]

Page 43: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Performance Results (using 2048 nodes on each machine)

- We present the best data for progressively more aggressive auto-tuning efforts
- Remember, Hopper has 6x as many cores per node as Intrepid or Franklin, so performance per node is far greater
- Auto-tuning can improve performance; ISA-specific optimizations (e.g., SIMD intrinsics) help even more
- As the problem size increases, so too does performance; however, the value of threading is diminished
- For large problems, MPI time remains a small fraction of overall time

43

[Chart: time relative to reference (split into collision() and stream()) for the Reference, Optimized, and Threaded versions on Intrepid (1GB), Franklin (4GB), and Hopper (16GB)]

Page 44: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Energy Results (using 2048 nodes on each machine)

- Ultimately, energy is becoming the great equalizer among machines
- Hopper has 6x the cores but burns 15x the power of Intrepid
- To visualize this, we examine energy efficiency (MFlop/s per Watt)
- Clearly, despite the performance differences, energy efficiency is remarkably similar

44

[Chart: MFlop/s per Watt (0-100), reference vs. fully optimized, on Intrepid (1GB), Franklin (1GB, 4GB), and Hopper (1GB, 4GB, 16GB)]

Page 45: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Sparse Matrix Vector Multiplication (SpMV)

45

Page 46: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

46

Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Page 47: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 47

Sparse Matrix Vector Multiplication

  What’s a Sparse Matrix ?   Most entries are 0.0   Performance advantage in only storing/operating on the nonzeros   Requires significant meta data to record the matrix structure

  What’s SpMV ?   Evaluate y=Ax   A is a sparse matrix, x & y are dense vectors

  Challenges   Very memory-intensive (often <0.166 flops/byte)   Difficult to exploit ILP (bad for pipelined or superscalar),   Difficult to exploit DLP (bad for SIMD)

[Figure: (a) algebra conceptualization of y = Ax; (b) CSR data structure: A.val[ ], A.col[ ], A.rowStart[ ]]

(c) CSR reference code:

for (r = 0; r < A.rows; r++) {
  double y0 = 0.0;
  for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];   /* accumulate one row's dot product */
  }
  y[r] = y0;
}

Page 48: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 48

The Dataset (matrices)

- Unlike DGEMV, performance is dictated by sparsity
- A suite of 14 matrices, all bigger than the caches of our SMPs
- We'll also include a median performance number

[Figure: the 14-matrix suite: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP. Categories: a 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)]

Page 49: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Parallelization

  How do we parallelize a matrix-vector multiplication ?

49

Page 50: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Parallelization

- How do we parallelize a matrix-vector multiplication?
- We could parallelize by columns (sparse matrix times dense sub-vector), which in the worst case simplifies the random-access challenge, but:
  - each thread would need to store a temporary partial sum
  - and we would need to perform a reduction (an inter-thread data dependency)

50

thread 0 thread 1 thread 2 thread 3

Page 51: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Parallelization

- How do we parallelize a matrix-vector multiplication?
- We could parallelize by columns (sparse matrix times dense sub-vector), which in the worst case simplifies the random-access challenge, but:
  - each thread would need to store a temporary partial sum
  - and we would need to perform a reduction (an inter-thread data dependency)

51

thread 0 thread 1 thread 2 thread 3

Page 52: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Parallelization

- How do we parallelize a matrix-vector multiplication?
- Alternately, we could parallelize by rows
- Clearly, there are now no inter-thread data dependencies, but in the worst case one must still deal with challenging random access

52

[Figure: matrix rows partitioned among threads 0-3]
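A minimal sketch of the row-parallel CSR SpMV described above, written with OpenMP; this is my own illustrative code, not the authors' pthreads implementation.

/* Row-parallel CSR SpMV sketch: each thread owns a contiguous block of rows,
   so there are no inter-thread dependencies on y. Illustrative only. */
#include <omp.h>

typedef struct {
    int     rows;
    int    *rowStart;   /* length rows+1 */
    int    *col;        /* length nnz    */
    double *val;        /* length nnz    */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < A->rows; r++) {
        double y0 = 0.0;
        for (int i = A->rowStart[r]; i < A->rowStart[r+1]; i++)
            y0 += A->val[i] * x[A->col[i]];   /* random access into x */
        y[r] = y0;
    }
}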

Page 53: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 53

SpMV Performance (simple parallelization)

- Out-of-the-box SpMV performance on a suite of 14 matrices
- Simplest solution = parallelization by rows (solves the data-dependency challenge)
- Scalability isn't great
- Is this performance good?

[Chart: Naïve vs. Naïve Pthreads]

Page 54: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: dual-socket Opteron SMP. Each socket has four Opteron cores with 512KB L2 caches, a shared 2MB victim cache, an SRI/crossbar, and two 64-bit DDR2-667 memory controllers (10.66 GB/s); the sockets are connected by HyperTransport links (4 GB/s in each direction)]

NUMA (Data Locality for Matrices)

- On NUMA architectures, all large arrays should be partitioned either:
  - explicitly (multiple malloc()'s + affinity), or
  - implicitly (parallelize the initialization and rely on first touch)
- You cannot partition at granularities smaller than the page size:
  - 512 elements on x86
  - 2M elements on Niagara
- For SpMV, partition the matrix and perform multiple malloc()'s
- Pin submatrices so they are co-located with the cores tasked to process them

54
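A minimal sketch of the implicit (first-touch) option above: initialize each page from the thread that will later use it. Illustrative only, assuming a Linux-style first-touch page-placement policy.

/* First-touch NUMA placement sketch: pages of the vector end up on the memory
   node of the thread that first writes them, which is also the thread that
   processes those rows in a statically scheduled SpMV. Illustrative only. */
#include <stdlib.h>
#include <omp.h>

double *alloc_first_touch(int n) {
    double *v = malloc((size_t)n * sizeof(double));
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        v[i] = 0.0;          /* same static partitioning as the compute loop */
    return v;
}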

Page 55: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: the same dual-socket Opteron SMP, showing where the partitioned matrix and vectors are placed]

NUMA (Data Locality for Matrices)

55

Page 56: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Prefetch for SpMV

- SW prefetch injects more MLP (memory-level parallelism) into the memory subsystem
- (it attempts to supplement the HW prefetchers in their attempt to satisfy Little's Law)
- Can try to prefetch the values, the indices, the source vector, or any combination thereof
- In general, one should insert only one prefetch per cache line (works best on unrolled code)

56

for (all rows) {
  y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
  for (all tiles in this row) {
    PREFETCH(V + i + PFDistance);   /* one prefetch per cache line of values */
    y0 += V[i  ] * X[C[i]];
    y1 += V[i+1] * X[C[i]];
    y2 += V[i+2] * X[C[i]];
    y3 += V[i+3] * X[C[i]];
  }
  y[r+0] = y0;  y[r+1] = y1;
  y[r+2] = y2;  y[r+3] = y3;
}

Page 57: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Performance (NUMA and Software Prefetching)

57

- NUMA-aware allocation is essential on memory-bound NUMA SMPs
- Explicit software prefetching can boost bandwidth and change cache-replacement policies
- Cell PPEs are likely latency-limited
- (used exhaustive search)

Page 58: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

ILP/DLP vs Bandwidth

- In the multicore era, which is the bigger issue?
  - a lack of ILP/DLP (a major advantage of BCSR), or
  - insufficient memory bandwidth per core?
- On many architectures running low-arithmetic-intensity kernels, there is so little available memory bandwidth per core that you won't notice a complete lack of ILP
- Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP
- Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint

58
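A minimal sketch of that footprint-minimizing heuristic: estimate the BCSR storage for each candidate r x c block and keep the smallest. The fill-ratio estimate and storage model are simplified assumptions, not the OSKI/paper heuristic verbatim.

/* Footprint-minimization sketch: pick the r x c register block whose padded
   BCSR storage (values + block column indices + block-row pointers) is
   smallest. fill_ratio() is a placeholder estimate of explicit zeros added. */
double fill_ratio(int r, int c);             /* >= 1.0; e.g. estimated by sampling */

void pick_block(long nnz, long nrows, int *best_r, int *best_c) {
    double best_bytes = 1e30;
    for (int r = 1; r <= 4; r *= 2) {        /* power-of-two blocks, as in */
        for (int c = 1; c <= 4; c *= 2) {    /* the study                  */
            double blocks = nnz * fill_ratio(r, c) / (double)(r * c);
            double bytes  = blocks * (r * c * 8.0    /* padded values        */
                                      + 4.0)         /* one index per block  */
                          + (nrows / r + 1) * 4.0;   /* block-row pointers   */
            if (bytes < best_bytes) { best_bytes = bytes; *best_r = r; *best_c = c; }
        }
    }
}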

Page 59: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: (a)-(d) the same 8x8 sparse matrix stored with 1x1, 2x1, 1x2, and 2x2 register blocks; larger blocks add explicit zeros but reduce index overhead per nonzero]

Matrix Compression Strategies

- Register blocking creates small dense tiles:
  - better ILP/DLP
  - reduced overhead per nonzero
- Let each thread select a unique register blocking
- In this work:
  - we only considered power-of-two register blocks
  - we select the register blocking that minimizes memory traffic

59

Page 60: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

[Figure: the same register-blocking illustration as the previous slide]

Matrix Compression Strategies

- Where possible, we may encode indices with fewer than 32 bits
- We may also select different matrix formats
- In this work:
  - we considered 16-bit and 32-bit indices (relative to the thread's start)
  - we explored BCSR/BCOO (GCSR in the book chapter)

60

Page 61: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SpMV Performance (Matrix Compression)

61

- After maximizing memory bandwidth, the only hope is to minimize memory traffic
- Exploit: register blocking, other formats, smaller indices
- Use a traffic-minimization heuristic rather than search
- The benefit is clearly matrix-dependent
- Register blocking enables efficient software prefetching (one prefetch per cache line)

Page 62: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Cache blocking for SpMV (Data Locality for Vectors)

- Cache-blocking sparse matrices is very different from cache-blocking dense matrices
- Rather than changing loop bounds, store entire submatrices contiguously
- The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. touch the same number of unique source-vector cache lines
- TLB blocking is a similar concept, but instead of cache-line (64-byte) granularity it uses 4KB (page) granularity

62

[Figure: cache blocks spanning column ranges of the thread-partitioned matrix]

Page 63: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Cache blocking for SpMV (Data Locality for Vectors)

- Cache-blocking sparse matrices is very different from cache-blocking dense matrices
- Rather than changing loop bounds, store entire submatrices contiguously
- The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. touch the same number of unique source-vector cache lines
- TLB blocking is a similar concept, but instead of cache-line (64-byte) granularity it uses 4KB (page) granularity

63

[Figure: cache blocks spanning column ranges of the thread-partitioned matrix]

Page 64: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 64

Auto-tuned SpMV Performance (cache and TLB blocking)

- Fully auto-tuned SpMV performance across the suite of matrices
- Why do some optimizations work better on some architectures?
  - matrices with naturally small working sets
  - architectures with giant caches

[Chart legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]

Page 65: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 65

Auto-tuned SpMV Performance (architecture specific optimizations)

- Fully auto-tuned SpMV performance across the suite of matrices
- Included the SPE / local-store optimized version
- Why do some optimizations work better on some architectures?

[Chart legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]

Page 66: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 66

Auto-tuned SpMV Performance (max speedup)

- Fully auto-tuned SpMV performance across the suite of matrices
- Included the SPE / local-store optimized version
- Why do some optimizations work better on some architectures?

[Chart legend as above; maximum speedups of 2.7x, 4.0x, 2.9x, and 35x]

Page 67: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Summary

67

Page 68: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Summary

- There is a continual interplay between computer architects, mathematicians, and computer scientists:
  - architects will increase peak performance
  - architects may attempt to make Little's Law easier to satisfy
  - mathematicians create new, more efficient algorithms
- To minimize time to solution, we often must simultaneously satisfy Little's Law and minimize computation/communication
- Even if we satisfy Little's Law, applications may be severely bottlenecked by computation/communication
- Perennially, we must manage:
  - data/task locality
  - data dependencies
  - communication
  - variable and dynamic parallelism

68

Page 69: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Summary (2)

- When optimizing code, the ideal solution for one machine is often found to be deficient on another
- To that end, we are faced with the prospect of optimizing key computations for every architecture-input combination
- Automatic performance tuning (auto-tuning) has been shown to mitigate these challenges by parameterizing some of the optimizations
- Unfortunately, the more diverse the architectures, the more we must rely on radically different implementations and algorithms to improve time to solution

69

Page 70: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Questions?

Acknowledgments: Research supported by the DOE Office of Science under contract number DE-AC02-05CH11231. All XT4/XE6 simulations were performed on the XT4 (Franklin) and XE6 (Hopper) at the National Energy Research Scientific Computing Center (NERSC). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. George Vahala and his research group provided the original (FORTRAN) version of the LBMHD code. Jonathan Carter provided the optimized (FORTRAN) implementation that served as a basis for the LBMHD portion of this talk.

70

Page 71: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

BACKUP SLIDES

71

Page 72: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Evolution of Computer Architecture

and Little’s Law

72

Page 73: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Yesterday’s Constraint: Instruction Latency & Parallelism

73

Page 74: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Single-issue, non-pipelined

- Consider a single-issue, non-pipelined processor
- Little's Law:
  - Bandwidth = issue width = 1
  - Latency = 1
  - Concurrency = 1
- Very easy to get good performance, even if all instructions are dependent

74

[Figure: issue width, instructions in flight, completed instructions, future instructions]

Page 75: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Pipelined

- By pipelining, we can increase the processor frequency
- However, we must ensure the pipeline remains filled to achieve better performance
- Little's Law:
  - Bandwidth = issue width = 1
  - Latency = 3
  - Concurrency = 3
- Performance may drop to 1/3 of peak

75


Page 76: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Pipelined (w/unrolling, reordering)

- There may be inherent but untapped parallelism in the code
- Compilers/programmers must find the parallelism and unroll/reorder the code to keep the pipeline full

76


Page 77: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Out-of-order

- Alternately, the hardware can try to find instruction-level parallelism (ILP)
- Instructions are:
  - queued up
  - executed out-of-order
  - reordered
  - committed in-order
- Useful when parallelism or latency cannot be determined at compile time

77

[Figure: out-of-order execution with a reorder buffer and reservation stations]

Page 78: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Superscalar

- Increase throughput by executing multiple instructions in parallel
- Usually separate pipelines for different instruction types: FP, integer, memory
- Significantly complicates out-of-order execution

78

[Figure: superscalar out-of-order execution; multiple instructions issued per cycle]

Page 79: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Instruction-Level Parallelism

  On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready).

  However, independent adds can be pipelined one after another.   Although this increases the peak flop rate, one can only achieve peak flops on the condition that on any given cycle the program has >4 independent adds ready to execute.   Failing to do so will result in a >4x drop in performance.   The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium.

  One must often reorganize kernels to express more instruction-level parallelism

79

Page 80: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

1x1 Register Block

ILP Example (1x1 BCSR)

for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
        y0 += V[i] * X[C[i]];
    }
    y[r] = y0;
}

  Consider the core of SpMV   No ILP in the inner loop   OOO can’t accelerate serial FMAs

[Diagram: the dependent FMAs complete at times 0, 4, 8, 12, 16 (one every 4 cycles)]

Page 81: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

ILP Example (1x4 BCSR)

for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]  ];
        y0 += V[i+1] * X[C[i]+1];
        y0 += V[i+2] * X[C[i]+2];
        y0 += V[i+3] * X[C[i]+3];
    }
    y[r] = y0;
}

  What about 1x4 BCSR ?   Still no ILP in the inner loop   FMAs are still dependent on each other

1x4 Register Block

[Diagram: the four FMAs remain dependent and complete at times 0, 4, 8, 12 (one every 4 cycles)]

Page 82: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

ILP Example (4x1 BCSR)

for (all rows) {
    y0 = 0.0; y1 = 0.0;
    y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1;
    y[r+2] = y2; y[r+3] = y3;
}

  What about 4x1 BCSR ?   Updating 4 different rows   The 4 FMAs are independent   Thus they can be pipelined.

4x1 Register Block

[Diagram: the four independent FMAs issue back to back, completing at times 0 through 3; the next block's complete at times 4 through 7 (one per cycle)]
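The following C sketch is not the authors' code; function and array names (spmv_csr, row_start, brow_start, bval, etc.) are illustrative assumptions. It contrasts the scalar CSR kernel with a 4x1 register-blocked variant whose four accumulators form independent FMA chains that can be pipelined.

/* Hedged sketch (not the authors' code): scalar CSR SpMV vs. a 4x1
   register-blocked variant. Names (row_start, col, val, brow_start,
   bcol, bval) are illustrative assumptions. */
#include <stddef.h>

/* Baseline CSR: one serial chain of multiply-adds per row. */
void spmv_csr(size_t nrows, const size_t *row_start, const int *col,
              const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double y0 = 0.0;
        for (size_t i = row_start[r]; i < row_start[r + 1]; i++)
            y0 += val[i] * x[col[i]];              /* each FMA waits on the last */
        y[r] = y0;
    }
}

/* 4x1 BCSR: each stored block holds 4 vertically adjacent nonzeros, so the
   four accumulators are independent and the FMAs can be pipelined. */
void spmv_bcsr_4x1(size_t nbrows, const size_t *brow_start, const int *bcol,
                   const double *bval, const double *x, double *y)
{
    for (size_t br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0, y3 = 0.0;
        for (size_t i = brow_start[br]; i < brow_start[br + 1]; i++) {
            double xc = x[bcol[i]];                /* one x element per block */
            y0 += bval[4*i + 0] * xc;              /* four independent FMAs   */
            y1 += bval[4*i + 1] * xc;
            y2 += bval[4*i + 2] * xc;
            y3 += bval[4*i + 3] * xc;
        }
        y[4*br + 0] = y0;  y[4*br + 1] = y1;
        y[4*br + 2] = y2;  y[4*br + 3] = y3;
    }
}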

Page 83: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

SIMD

  Many codes perform the same operations on different pieces of data (Data level parallelism = DLP)

  SIMD : Single Instruction Multiple Data

  Register sizes are increased.   Instead of each register holding a single 64-bit FP number, each register holds 2 or 4 FP numbers

  Much more efficient solution than superscalar on data parallel codes

83

Page 84: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Data-level Parallelism

  DLP = apply the same operation to multiple independent operands.

  Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance. (SSE, Double Hummer, AltiVec, Cell, GPUs, etc…)

  Typically these instructions operate on four single precision (or two double precision) numbers at a time.

  However, some are wider: GPUs (32), Larrabee (16), and AVX (8)   Failing to use these instructions may cause a 2-32x drop in performance   Unfortunately, most compilers utterly fail to generate these instructions.

84

[Diagram: one SIMD instruction performs four additions in parallel]
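As a concrete illustration, here is a hedged sketch of explicit SIMD using SSE2 intrinsics. It assumes an x86 target with SSE2 and that n is even; it makes no claim about what any particular compiler will auto-generate.

/* Hedged sketch: explicit SSE2 SIMD for a double-precision vector add.
   A production version would handle the remainder and alignment. */
#include <emmintrin.h>
#include <stddef.h>

void vadd(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);              /* load 2 doubles          */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));      /* 2 adds in 1 instruction */
    }
}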

Page 85: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Memory-Level Parallelism (1)

  Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM.

  Memory latency (as measured in core cycles) grew by an order of magnitude in the 90’s

  Today, the latency of a memory operation can exceed 200 cycles (1 double every 80ns is unacceptably slow).

  Like ILP, we wish to pipeline requests to DRAM   Several solutions exist today

  HW stream prefetchers   HW Multithreading (e.g. hyperthreading)   SW line prefetch   DMA

85

Page 86: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Memory-Level Parallelism (2)

  HW stream prefetchers are by far the easiest to implement and exploit.

  They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.

  To effectively exploit a HW prefetcher, ensure your array references access hundreds of consecutive addresses.

  e.g. read A[i]…A[i+255] without any jumps or discontinuities

  This constraint limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed (see the sketch below): A[(j+0)*N+i]…A[(j+0)*N+i+B], jump

A[(j+1)*N+i]…A[(j+1)*N+i+B], jump

A[(j+2)*N+i]…A[(j+2)*N+i+B], jump

86
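A minimal sketch of the point above, assuming a row-major N x N array and illustrative block sizes BJ and BI: keeping the block dimension along the unit-stride direction large (hundreds of elements) preserves the long consecutive runs a HW stream prefetcher needs.

/* Hedged sketch: column sums of a row-major N x N array, cache-blocked in
   both dimensions. N, BJ, and BI are illustrative; choosing BI (the
   unit-stride block dimension) in the hundreds keeps each inner run long
   enough to keep the HW stream prefetcher engaged. */
#include <stddef.h>

void blocked_colsum(int N, int BJ, int BI, const double *A, double *colsum)
{
    for (int jj = 0; jj < N; jj += BJ)
        for (int ii = 0; ii < N; ii += BI)
            for (int j = jj; j < jj + BJ && j < N; j++)
                for (int i = ii; i < ii + BI && i < N; i++)
                    colsum[i] += A[(size_t)j * N + i];   /* unit-stride run of BI doubles */
}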

Page 87: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Multithreaded

  Superscalars fail when there is no ILP or DLP   However, there are many codes with thread-level parallelism (TLP)

  Consider architectures that are virtualized to appear as N cores.   In reality, there is one core maintaining multiple contexts and dynamically switching between them

  There are 3 main types of multithreaded architectures:

  Coarse-grained multithreading (CGMT)   Fine-grained multithreading (FGMT), aka Vertical Multithreading   Simultaneous multithreading (SMT)

87

Page 88: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Coarse-grained Multithreading

  Maintain multiple contexts   On a long-latency instruction:

  dispatch the instruction   Switch to a ready thread   Hide latency with multiple ready threads   Eventually switch back to the original

88


Page 89: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Fine-grained Multithreading

  Maintain multiple contexts   On every cycle choose a ready thread   May now satisfy Little's Law through multithreading: threads ≈ latency × bandwidth

89


Page 90: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Simultaneous Multithreading

  Maintain multiple contexts   On every cycle choose as many ready instructions from the thread pool as possible

  Can be applied to both in-order and out-of-order architectures

90


Page 91: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Today’s Constraint: The Memory Wall

91

Page 92: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Abstract Machine Model (as seen in programming model)

  In the abstract, processor architectures appear to have just memory, and functional units.

  On early HPC machines, reading from memory required just one cycle.

92

[Diagram: cores connected to DRAM. DRAM holds float x[N] and float y[N]; the core holds int i and float z and executes: z = 0; z += x[i]*y[i]; i++;]

Page 93: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Abstract Machine Model (as seen in programming model)

  In the abstract, processor architectures appear to have just memory, and functional units.

  On early HPC machines, reading from memory required just one cycle.

93


  Unfortunately, as processors developed, DRAM latencies (in terms of core cycles) dramatically increased.

  Eventually a small memory (the register file) was added so that one could hide this latency (by keeping data in the RF)

  The programming model and compiler evolved to hide the management of data locality in the RF from the programmer.

Register Files

  Unfortunately, today, latency to DRAM can be 1000x that to the register file.

  As the RF is too small for today’s problems, architects inserted another memory (cache) between the register file and the DRAM.

  Data is transparently copied into the cache for future reference.   This memory is entirely invisible to the programmer and the

compiler, but still has latency 10x higher than the register file.

Cache

Page 94: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Abstract Machine Model (as seen in programming model)

  Not only are the differences in latencies substantial, so too are the bandwidths

  Once a link has been saturated (Little’s law is satisfied), it acts as a bottleneck against increased performance.

  The only solution is to reduce the volume of traffic across that link

94

[Diagram: DRAM, cache, register files, and cores, with link bandwidths increasing toward the core: <50 GB/s, <1000 GB/s, <6000 GB/s]

Page 95: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Impact on Little’s Law ?

  Today, utilizing the full DRAM bandwidth and minimizing memory traffic are paramount.

  DRAM latency can exceed 1000 CPU cycles.   Impact on Little's Law (200 ns × 20 GB/s): 4 KB of data in flight

  How did architects solve this?

95

Page 96: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

out-of-order ?

  Out-of-order machines can only scale to ~100 instructions in flight of which only ~40 can be loads

  This is ~10% of what Little’s Law requires

  Out-of-order execution can now only hide cache latency, not DRAM latency

96

Page 97: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Software prefetch ?

  A software prefetch is an instruction similar to a load   However,

  It does not write to a register   Its execution is decoupled from the core   It is designed to bring data into the cache before it is needed   It must be scheduled by software early enough to hide DRAM latency

  Limited applicability   work best on patterns for which many addresses are known well in advance.   must be inserted by hand, with the prefetch distance tuned for the target machine

97
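A minimal sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance DIST is illustrative and must be tuned per machine, as noted above.

/* Hedged sketch: software prefetch applied to a dot product.
   DIST (prefetch distance, in elements) is an illustrative value. */
#define DIST 64

double dot_prefetched(long n, const double *x, const double *y)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&x[i + DIST], 0, 0);   /* read, no temporal reuse */
            __builtin_prefetch(&y[i + DIST], 0, 0);
        }
        s += x[i] * y[i];
    }
    return s;
}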

Page 98: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Hardware Stream Prefetchers ?

  Hardware examines the cache miss pattern   Detects unit-stride (and, more recently, strided) miss patterns, and begins to prefetch data before the next miss occurs.

  Summary   Only works for simple memory access patterns   Can be tripped up if there are too many streams (>8)   Cannot prefetch beyond TLB page boundaries

98

Page 99: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Local Store + DMA

  Create an on-chip memory (Local Store) disjoint from the cache/TLB hierarchy

  Use DMA to transfer data from DRAM to local store   Basic operation: specify a long contiguous transfer (unit stride)   Can be extended to specifying a list of transfers (random access)   DMA is decoupled from execution (poll to check for completion)

  Allows one to efficiently satisfy the concurrency from Little's Law:   Concurrency = #DMAs × DMA length

  Requires major software effort

99

Page 100: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Multithreading

  Another approach to satisfying Little's Law   But more threads/core -> less cache (& associativity) per thread

  Allow one cache miss per thread:   #threads × 1 cache line vs. latency × bandwidth   #threads = latency × bandwidth / cache line ≈ 64

  64 threads/core is unrealistically high

100

Page 101: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Roofline Model

101

Page 102: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model Basic Concept

102

  Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.

  where optimization i can be SIMDize, or unroll, or SW prefetch, …   Given a kernel’s arithmetic intensity (based on DRAM traffic after

being filtered by the cache), programmers can inspect the figure, and bound performance.

  Moreover, provides insights as to which optimizations will potentially be beneficial.

Attainable Performance(i,j) = min( FLOP/s with Optimizations 1..i , AI × Bandwidth with Optimizations 1..j )
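A minimal sketch of this bound; the inputs are illustrative numbers, not measurements of any particular machine.

/* Hedged sketch: attainable GFLOP/s is the lesser of the in-core ceiling
   and AI times attained bandwidth. */
#include <stdio.h>

static double roofline_bound(double peak_gflops, double bandwidth_gbs, double ai)
{
    double memory_bound = ai * bandwidth_gbs;      /* GFLOP/s limited by DRAM traffic */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    /* e.g., a 74 GFLOP/s in-core ceiling, 16 GB/s attained bandwidth, AI = 0.25 */
    printf("attainable = %.1f GFLOP/s\n", roofline_bound(74.0, 16.0, 0.25));  /* prints 4.0 */
    return 0;
}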

Page 103: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model Basic Concept

103

  Plot on log-log scale   Given AI, we can easily bound performance   But architectures are much more complicated

  We will bound performance as we eliminate specific forms of in-core parallelism

[Roofline figure: attainable GFLOP/s (0.5 to 256, log scale) vs. actual FLOP:Byte ratio (1/8 to 16, log scale), Opteron 2356 (Barcelona); roofline: peak DP]

Page 104: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model computational ceilings

104

  Opterons have dedicated multipliers and adders.

  If the code is dominated by adds, then attainable performance is half of peak.

  We call these Ceilings   They act like constraints on performance

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance]

Page 105: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model computational ceilings

105

  Opterons have 128-bit datapaths.

  If instructions aren’t SIMDized, attainable performance will be halved

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD]

Page 106: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model computational ceilings

106

  On Opterons, floating-point instructions have a 4 cycle latency.

  If we don’t express 4-way ILP, performance will drop by as much as 4x

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP]

Page 107: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model communication ceilings

107

  We can perform a similar exercise taking away parallelism from the memory subsystem

[Roofline figure, Opteron 2356 (Barcelona); roofline: peak DP]

Page 108: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model communication ceilings

108

  Explicit software prefetch instructions are required to achieve peak bandwidth

[Roofline figure, Opteron 2356 (Barcelona); roofline: peak DP]

Page 109: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model communication ceilings

109

  Opterons are NUMA   As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.

  We could continue this by examining strided or random memory access patterns

[Roofline figure, Opteron 2356 (Barcelona); roofline: peak DP]

Page 110: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model computation + communication ceilings

110

  We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP]

Page 111: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

111

  Remember, memory traffic includes more than just compulsory misses.

  As such, actual arithmetic intensity may be substantially lower.

  Walls are unique to the architecture-kernel combination

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; wall: only compulsory miss traffic]

AI = FLOPs / Compulsory Misses

Page 112: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

112

  Remember, memory traffic includes more than just compulsory misses.

  As such, actual arithmetic intensity may be substantially lower.

  Walls are unique to the architecture-kernel combination

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; walls: compulsory miss traffic + write allocation traffic]

AI = FLOPs / (Write Allocations + Compulsory Misses)

Page 113: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

113

  Remember, memory traffic includes more than just compulsory misses.

  As such, actual arithmetic intensity may be substantially lower.

  Walls are unique to the architecture-kernel combination

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; walls: compulsory miss + write allocation + capacity miss traffic]

AI = FLOPs / (Capacity Misses + Write Allocations + Compulsory Misses)

Page 114: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

114

  Remember, memory traffic includes more than just compulsory misses.

  As such, actual arithmetic intensity may be substantially lower.

  Walls are unique to the architecture-kernel combination

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; walls: compulsory miss + write allocation + capacity miss + conflict miss traffic]

AI = FLOPs / (Conflict Misses + Capacity Misses + Write Allocations + Compulsory Misses)

Page 115: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 115

Optimization Categorization

Maximizing (attained) In-core Performance

Minimizing (total) Memory Traffic

Maximizing (attained) Memory Bandwidth

Page 116: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 116

Optimization Categorization

Minimizing Memory Traffic

Maximizing Memory Bandwidth

Maximizing In-core Performance • Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

Page 117: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 117

Optimization Categorization

Minimizing Memory Traffic

Maximizing Memory Bandwidth

Maximizing In-core Performance • Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Page 118: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY


118

Optimization Categorization

Maximizing In-core Performance

Minimizing Memory Traffic

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

Maximizing Memory Bandwidth • Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking
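A minimal sketch of one technique from this list, memory affinity via "first touch" page placement with OpenMP; it assumes a first-touch NUMA allocation policy (typical of Linux), and the function name is illustrative, not the slides'.

/* Hedged sketch: first-touch NUMA placement. Initialize with the same
   static schedule the compute loops will use, so each page is first
   touched, and therefore allocated, in the DRAM attached to the socket
   whose threads will later use it. */
#include <stdlib.h>

double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;                        /* first touch places the page locally */
    return a;
}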

Page 119: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY


119

Optimization Categorization

Maximizing In-core Performance

Maximizing Memory Bandwidth

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

Minimizing Memory Traffic

Eliminate: • Capacity misses • Conflict misses • Compulsory misses • Write allocate behavior

Techniques: cache blocking, array padding, compress data, streaming stores
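As one example from the list above, a hedged sketch of streaming (non-temporal) stores with SSE2, which write the destination without first reading it into the cache and so eliminate write-allocate traffic. It assumes a 16-byte-aligned destination and even n.

/* Hedged sketch: non-temporal copy. */
#include <emmintrin.h>
#include <stddef.h>

void copy_streaming(size_t n, const double *a, double *c)
{
    for (size_t i = 0; i < n; i += 2)
        _mm_stream_pd(c + i, _mm_loadu_pd(a + i));   /* bypass the cache on the store */
    _mm_sfence();                                    /* make the NT stores globally visible */
}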

Page 120: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 120

Optimization Categorization

Maximizing In-core Performance

Minimizing Memory Traffic

Maximizing Memory Bandwidth

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Eliminate: • Capacity misses • Conflict misses • Compulsory misses • Write allocate behavior

Techniques: cache blocking, array padding, compress data, streaming stores

Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Page 121: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

121

  Optimizations remove these walls and ceilings which act to constrain performance.

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; walls: compulsory, write allocation, capacity, and conflict miss traffic]

Page 122: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

122

  Optimizations remove these walls and ceilings which act to constrain performance.

[Roofline figure, Opteron 2356 (Barcelona); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; wall: only compulsory miss traffic remains]

Page 123: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Model locality walls

123

  Optimizations remove these walls and ceilings which act to constrain performance.

[Roofline figure, Opteron 2356 (Barcelona); in-core ceilings removed; performance bounded by peak DP and the compulsory-miss-traffic wall]


Page 125: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY


125

Optimization Categorization

Maximizing In-core Performance

Minimizing Memory Traffic

Maximizing Memory Bandwidth

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

Eliminate: • Capacity misses • Conflict misses • Compulsory misses • Write allocate behavior

Each optimization has a large parameter space. What are the optimal parameters?

Page 126: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY 126

Auto-tuning?

  Provides performance portability across the existing breadth and evolution of microprocessors

  A one-time, up-front productivity cost is amortized over the number of machines it's used on

  Auto-tuning does not invent new optimizations   Auto-tuning automates the code generation and exploration of the optimization and parameter space

  Two components:

  parameterized code generator (we wrote ours in Perl)   Auto-tuning exploration benchmark (combination of heuristics and exhaustive search)

  Can be extended with ISA-specific optimizations (e.g. DMA, SIMD)

Page 127: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

Multicore: Architectures & Challenges

127

Page 128: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Options:

  Moore's law continues to double the number of transistors; what do we do with them?   More out-of-order (prohibited by complexity, performance, power)   More threading (asymptotic performance)   More DLP/SIMD (limited applications, compilers?)   Bigger caches (doesn't address compulsory misses, asymptotic perf.)   Place an SMP on a chip = 'multicore'

128

Page 129: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

What are SMPs? What is multicore ? What are multicore SMPs ?

129

  SMP = shared-memory multiprocessor   In the past, it meant multiple chips (typically < 32) could address any location in a

large shared memory through a network or bus

  Today, multiple cores are integrated on the same chip   Almost universally this is done in an SMP fashion   For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs. (easy transition)

  Multiple cores can share:   memory controllers   caches   occasionally FPUs

  Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.

Page 130: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Multicore & SMP Comparison

  Advances in Moore's Law allow for increased integration on-chip.   Nevertheless, the basic architecture and programming model

remained the same:   Physically partitioned, logically shared caches and DRAM

130

[Diagram: evolution from many single-core sockets, each with its own cache and DRAM, to multicore chips whose cores share a cache and DRAM]

Page 131: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA vs NUCA

  When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)

  UCA & UMA architecture:

131

[Diagram: four cores sharing one cache and one DRAM]


Page 136: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA vs NUCA

  When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)

  NUCA & UMA architecture:

136

[Diagram: four cores with private caches, connected through a memory controller hub to one DRAM]


Page 142: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA vs NUCA

  When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)

  NUCA & NUMA architecture:

142

[Diagram: four cores with private caches and two directly attached DRAMs]


Page 145: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA vs NUCA

  Proper cache locality

145

[Diagram: four cores with private caches and two DRAMs; each thread's working set resides in its own core's cache]

Page 146: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA vs NUCA

  Proper DRAM locality

146

[Diagram: four cores with private caches and two DRAMs; each thread's data is placed in the DRAM attached to its own socket]

Page 147: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Multicore and Little’s Law?

  Like MT, for codes with enough TLP, multicore helps in satisfying Little’s Law

  Combination of multicore and multithreading also works

Concurrency per chip = concurrency per thread × threads per core × cores per chip = latency × bandwidth

147

Page 148: Performance Optimization and Auto-tuning...Performance Optimization and Auto-tuning Samuel Williams1 Jonathan Carter1, Richard Vuduc3, Leonid Oliker1, John Shalf1, Katherine Yelick1,2,

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Best Architecture?

  Short answer: there’s not one

  Architectures have diversified into different markets (different balance between design options)

  Architectures are constrained by a company’s manpower, money, target price, volume, as well as a new Power Wall

  As a result, architectures are becoming simpler:   shallower pipelines (hard to increase frequency)   narrower superscalar or in-order

  But there are   more cores (6 minimum)   more threads per core (2-4)

148