Introduction to Communication-Avoiding Algorithms
www.cs.berkeley.edu/~demmel/SC13_tutorial
Jim Demmel, EECS & Math Departments
UC Berkeley
Why avoid communication? (1/2)
Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)
[Diagram: sequential case – CPU with cache connected to DRAM; parallel case – several CPU/DRAM nodes connected over a network]
Why avoid communication? (2/3)
• Running time of an algorithm is sum of 3 terms:
  – # flops × time_per_flop
  – # words moved / bandwidth
  – # messages × latency
  (the last two terms are the communication cost)
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
Annual improvements:
              Time_per_flop   Bandwidth   Latency
  Network          59%           26%        15%
  DRAM                           23%        5.5%
Why Minimize Communication? (2/2)
[Figures: energy cost per operation for arithmetic vs. data movement – Source: John Shalf, LBL]
Minimize communication to save energy
Goals
• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
  • Current algorithms often far from lower bounds
  • Large speedups and energy savings possible
"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."
FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD)
"Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki, …
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra
Lower bound for all "direct" linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
• Let M = "fast" memory size (per processor)
  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
• Parallel case: assume either load or memory balanced
Lower bound for all "direct" linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
• Let M = "fast" memory size (per processor)
  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  #messages_sent ≥ #words_moved / largest_message_size
• Parallel case: assume either load or memory balanced
SIAM SIAG/Linear Algebra Prize 2012: Ballard, D., Holtz, Schwartz
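To make the two bounds above concrete, here is a tiny Python sketch that simply evaluates them for a load-balanced dense n x n matmul; the values of n, P and M are made-up illustrations, not numbers from the slides.

    # Illustrative only: plug hypothetical numbers into the lower bounds above.
    n, P = 10_000, 1_024            # assumed matrix size and processor count
    M = 2 * n**2 // P               # assumed fast memory per processor (in words)

    flops_per_proc = 2 * n**3 / P                 # load-balanced dense matmul
    words_lb = flops_per_proc / M**0.5            # Omega(#flops / M^(1/2)) words moved
    messages_lb = flops_per_proc / M**1.5         # Omega(#flops / M^(3/2)) messages
    print(f"words_moved >= ~{words_lb:.3g}, messages_sent >= ~{messages_lb:.3g}")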
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply
implements C = C + A·B
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)

[Diagram: C(i,j) updated from row A(i,:) and column B(:,j)]
Naïve Matrix Multiply
implements C = C + A·B
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)
    {write C(i,j) back to slow memory}
Naïve Matrix Multiply
implements C = C + A·B
for i = 1 to n
  {read row i of A into fast memory}            … n² reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n² reads altogether
    {read column j of B into fast memory}       … n³ reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)
    {write C(i,j) back to slow memory}          … n² writes altogether

n³ + 3n² reads/writes altogether – dominates 2n³ arithmetic
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}

[Diagram: C(i,j) block updated from block row A(i,:) and block column B(:,j)]
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b² × (n/b)² = n² reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b² × (n/b)³ = n³/b reads
      {read block B(k,j) into fast memory}      … b² × (n/b)³ = n³/b reads
      C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}    … b² × (n/b)² = n² writes

2n³/b + 2n² reads/writes << 2n³ arithmetic – Faster! (a runnable sketch of this loop nest follows below)
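A minimal NumPy sketch of the blocked loop nest above; the explicit block copies stand in for "fast memory", and the sizes n and b below are illustrative assumptions, not values from the slides.

    import numpy as np

    def blocked_matmul(A, B, C, b):
        """C = C + A*B using b-by-b blocks; block copies model slow/fast traffic."""
        n = A.shape[0]
        assert n % b == 0, "sketch assumes b divides n"
        for i in range(0, n, b):
            for j in range(0, n, b):
                Cij = C[i:i+b, j:j+b].copy()            # read block C(i,j): b^2 words
                for k in range(0, n, b):
                    Aik = A[i:i+b, k:k+b].copy()        # read block A(i,k): b^2 words
                    Bkj = B[k:k+b, j:j+b].copy()        # read block B(k,j): b^2 words
                    Cij += Aik @ Bkj                    # b-by-b matmul in "fast memory"
                C[i:i+b, j:j+b] = Cij                   # write block C(i,j) back
        return C

    n, b = 8, 4
    A = np.random.rand(n, n); B = np.random.rand(n, n); C = np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, C.copy(), b), C + A @ B)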
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so #reads/writes ≥ 2n³/(M/3)^(1/2) + 2n²
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul, anyway?
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij, etc., is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)
func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)²  if n > 1,  else 1
     = 2n³  … same operations as usual, in different order
W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)²  if 3n² > M,  else 3n²
     = O( n³ / M^(1/2) + n² )  … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea (a runnable sketch follows below)
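A minimal NumPy sketch of the recursive scheme; the C = A·B form (rather than C = C + A·B), the power-of-two restriction, and the tiny test size are simplifying assumptions for illustration.

    import numpy as np

    def rmm(A, B):
        """Recursive matrix multiply; sketch assuming n is a power of 2."""
        n = A.shape[0]
        if n == 1:
            return A * B                                 # 1x1 base case
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        C = np.empty_like(A)
        C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
        C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
        C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
        C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
        return C

    n = 8                                                # n = 2^m
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(rmm(A, B), A @ B)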
CARMA Performance: Shared Memory
[Plot: square case, m = k = n (log scale); GFLOPS (linear scale) for MKL vs. CARMA, single and double precision, against single- and double-precision peak]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory
[Plot: "inner product" case, m = n = 64 with k varying (log scale); GFLOPS (linear scale) for MKL vs. CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64; k = 524,288)
[Plot: L3 cache misses (linear scale) – CARMA incurs 97% and 86% fewer misses than MKL]
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

C = A · B, with each of C, A, B laid out on the same 4 x 4 grid:
  P00 P01 P02 P03
  P10 P11 P12 P13
  P20 P21 P22 P23
  P30 P31 P32 P33
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n³/P) / (n²/P)^(1/2) ) = Ω( n² / P^(1/2) )
    • #messages = Ω( #flops / M^(3/2) ) = Ω( (n³/P) / (n²/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (lawn 96, lawn 100)
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Diagram: block row of A(i,k) panels and block column of B(k,j) panels contributing to C(i,j)]
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol · Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
(a serial simulation sketch of this loop follows below)
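A serial NumPy simulation of the SUMMA data flow above, not MPI code: the dictionaries stand in for the row and column broadcasts, and the grid size pgrid and panel width b are illustrative assumptions.

    import numpy as np

    def summa(A, B, pgrid, b):
        """Serial simulation of SUMMA on a pgrid x pgrid 'processor' grid."""
        n = A.shape[0]
        s = n // pgrid                                   # local block dimension
        C = np.zeros_like(A)
        for k in range(0, n, b):
            Acols = {i: A[i*s:(i+1)*s, k:k+b] for i in range(pgrid)}   # row broadcasts
            Brows = {j: B[k:k+b, j*s:(j+1)*s] for j in range(pgrid)}   # column broadcasts
            for i in range(pgrid):                       # every processor updates its C block
                for j in range(pgrid):
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += Acols[i] @ Brows[j]
        return C

    n, pgrid, b = 16, 4, 2
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(summa(A, B, pgrid, b), A @ B)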
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD
Can we do better?
Can we do better?
• Aren't we already optimal?
• Why assume M = O(n²/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n²/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n²/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Diagram: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid. Example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis, so that P(i,j,0) owns C(i,j) (see the sketch below)
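A serial NumPy sketch of the three steps above: c replicated layers each execute 1/c of the SUMMA panel loop, and the partial results are then summed. The grid/replication sizes are illustrative assumptions, and a real implementation would distribute these loops over processors.

    import numpy as np

    def matmul_25d(A, B, pgrid, c, b):
        """Serial sketch of 2.5D matmul with c-fold replication."""
        n = A.shape[0]
        s = n // pgrid                       # block size owned by each 'processor'
        panels = list(range(0, n, b))        # k-panels of width b
        per_layer = len(panels) // c
        C_layers = []
        for layer in range(c):               # step (1): A, B replicated on each layer
            Cl = np.zeros_like(A)
            for k in panels[layer * per_layer:(layer + 1) * per_layer]:   # step (2)
                for i in range(pgrid):
                    for j in range(pgrid):
                        Cl[i*s:(i+1)*s, j*s:(j+1)*s] += \
                            A[i*s:(i+1)*s, k:k+b] @ B[k:k+b, j*s:(j+1)*s]
            C_layers.append(Cl)
        return sum(C_layers)                 # step (3): sum-reduce along the k-axis

    n, pgrid, c, b = 16, 4, 2, 2
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d(A, B, pgrid, c, b), A @ B)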
2.5D Matmul on BG/P, 16K nodes / 64K cores
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Plots: execution-time comparison of 2.5D vs. 2D matmul; annotations: 12x faster, 2.7x faster]
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [ γT + βT / M^(1/2) + αT / (m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP { n³/(cP) · [ γE + βE / M^(1/2) + αE / (m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi γi + Fi βi / Mi^(1/2) + Fi αi / Mi^(3/2) = Fi [ γi + βi / Mi^(1/2) + αi / Mi^(3/2) ] = Fi ξi
  – Choose Fi so Σi Fi = n³ and minimizing T = maxi Ti
  – Answer: Fi = n³ (1/ξi) / Σj (1/ξj) and T = n³ / Σj (1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms… (a small numeric sketch of the optimal Fi follows below)
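A small Python sketch that evaluates the optimal work assignment above for three heterogeneous processors; all parameter values (gamma, beta, alpha, M, n) are hypothetical, chosen only to show that the formula balances finish times.

    import numpy as np

    # Hypothetical per-processor parameters: gamma_i (sec/flop), beta_i (sec/word),
    # alpha_i (sec/message), M_i (fast memory in words).
    gamma = np.array([1e-10, 2e-10, 5e-10])
    beta  = np.array([1e-9,  1e-9,  2e-9 ])
    alpha = np.array([1e-6,  2e-6,  2e-6 ])
    M     = np.array([1e8,   5e7,   2e7  ])

    n = 4096                                              # matmul dimension (illustrative)
    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5       # effective time per flop, xi_i
    F = n**3 * (1.0 / xi) / np.sum(1.0 / xi)              # optimal flop assignment F_i
    T = n**3 / np.sum(1.0 / xi)                           # resulting runtime

    assert np.allclose(F * xi, T)    # every processor finishes at the same time
    print(F / n**3, T)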
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n) · B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
[Diagram: C(i,j,k) = Σm A(i,j,m) · B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
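A NumPy sketch of the two-level TSQR factorization above; the block count (4), the test sizes, and the sign-insensitive check are illustrative assumptions.

    import numpy as np

    def tsqr(W):
        """Two-level TSQR on 4 row blocks; names mirror the slide.
        Returns the implicit Q factors and the final b x b factor R02."""
        W0, W1, W2, W3 = np.array_split(W, 4)
        # Leaves: independent local QR factorizations (parallel across 4 processors)
        Q00, R00 = np.linalg.qr(W0)
        Q10, R10 = np.linalg.qr(W1)
        Q20, R20 = np.linalg.qr(W2)
        Q30, R30 = np.linalg.qr(W3)
        # Level 1: QR of stacked pairs of b x b R factors
        Q01, R01 = np.linalg.qr(np.vstack([R00, R10]))
        Q11, R11 = np.linalg.qr(np.vstack([R20, R30]))
        # Root: one last small QR
        Q02, R02 = np.linalg.qr(np.vstack([R01, R11]))
        return (Q00, Q10, Q20, Q30, Q01, Q11, Q02), R02

    W = np.random.rand(1000, 5)                      # tall and skinny
    _, R_tsqr = tsqr(W)
    R_ref = np.linalg.qr(W)[1]
    # The triangular factors agree up to the sign of each row
    assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))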
TSQR: An Architecture-Dependent Algorithm

W = [ W0 ; W1 ; W2 ; W3 ]
[Diagrams of reduction trees:
  Parallel – binary tree: (W0..W3) → R00, R10, R20, R30 → R01, R11 → R02
  Sequential – flat tree: W0 → R00, then fold in W1 → R01, W2 → R02, W3 → R03
  Dual Core – hybrid of the parallel and sequential trees]

Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ; W3' ; W4' ] = [ P12·L12·U12 ; P34·L34·U34 ]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
(a small sketch of one such reduction tree follows below)
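A small NumPy sketch of one tournament-pivoting reduction tree. The helper select_pivot_rows is hypothetical: it runs ordinary Gaussian elimination with partial pivoting just to pick pivot rows; the 4-leaf tree and the matrix sizes are illustrative assumptions.

    import numpy as np

    def select_pivot_rows(W, b):
        """Return original indices of b pivot rows of W, chosen by GEPP on a copy."""
        A = W.astype(float).copy()
        rows = list(range(A.shape[0]))
        chosen = []
        for k in range(b):
            p = k + int(np.argmax(np.abs(A[k:, k])))    # largest entry in column k
            A[[k, p]] = A[[p, k]]
            rows[k], rows[p] = rows[p], rows[k]
            chosen.append(rows[k])
            A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
        return chosen

    def tournament_pivot_rows(W, b):
        """One binary reduction tree of tournament pivoting on 4 row blocks of W."""
        blocks = np.array_split(np.arange(W.shape[0]), 4)
        cand = [blk[select_pivot_rows(W[blk], b)] for blk in blocks]   # b candidates per leaf
        c12 = np.concatenate([cand[0], cand[1]])
        c34 = np.concatenate([cand[2], cand[3]])
        c12 = c12[select_pivot_rows(W[c12], b)]      # winners of the first round
        c34 = c34[select_pivot_rows(W[c34], b)]
        final = np.concatenate([c12, c34])
        return final[select_pivot_rows(W[final], b)] # b pivot rows for all of W

    W = np.random.rand(64, 8)
    pivots = tournament_pivot_rows(W, b=8)
    print(sorted(pivots))   # rows to move to the top before LU without pivoting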
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot: horizontal axis log2(p), vertical axis log2(n²/p) = log2(memory_per_proc); up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms
• Classical O(n³) matmul: #words_moved = Ω( M·(n/M^(1/2))³ / P )
• Strassen's O(n^lg 7) matmul: #words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
• Strassen-like O(n^ω) matmul: #words_moved = Ω( M·(n/M^(1/2))^ω / P )
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  vs.
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if
In practice, how to best interleave BFS and DFS is a "tuning parameter"
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Conventional vs CA-SBR
Conventional: touch all data 4 times
Communication-Avoiding: touch all data once
Many tuning parameters: right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)
2
Why avoid communication (12)
Algorithms have two costs (measured in time or energy)1 Arithmetic (FLOPS)2 Communication moving data between
ndash levels of a memory hierarchy (sequential case) ndash processors over a network (parallel case)
CPUCache
DRAM
CPUDRAM
CPUDRAM
CPUDRAM
CPUDRAM
Why avoid communication (23)
bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency
3
communication
bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]
bull Avoid communication to save time
Annual improvements
Time_per_flop Bandwidth Latency
Network 26 15
DRAM 23 559
Why Minimize Communication (22)
Source John Shalf LBL
Why Minimize Communication (22)
Source John Shalf LBL
Minimize communication to save energy
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Why avoid communication (23)
bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency
3
communication
bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]
bull Avoid communication to save time
Annual improvements
Time_per_flop Bandwidth Latency
Network 26 15
DRAM 23 559
Why Minimize Communication (22)
Source John Shalf LBL
Why Minimize Communication (22)
Source John Shalf LBL
Minimize communication to save energy
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance: Shared Memory
Square: m = k = n
[plot: performance (log scale) vs. problem size (linear scale); curves for MKL (double), CARMA (double), MKL (single), CARMA (single), with Peak (single) and Peak (double) reference lines]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory
Inner Product: m = n = 64
[plot: performance (log scale) vs. k (linear scale); curves for MKL (double), CARMA (double), MKL (single), CARMA (single)]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524,288)
[plot (linear scale): 97% fewer misses, 86% fewer misses]
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
– Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
– Processor Pij owns submatrices Aij, Bij and Cij
[figure: three P^(1/2) x P^(1/2) processor grids, P00 … P33, one each for C, A and B]
C = A * B
30
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
– Attains lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
• #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
– Can accommodate any processor grid, matrix dimensions & layout
– Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
– Cannon attains lower bound
– But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
31
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k)*B(k,j)
• summation over submatrices
• Need not be square processor grid
[figure: C(i,j) accumulates the product of block column A(i,k) and block row B(k,j)]
32
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
[figure: as above, with the broadcast panels labeled Acol and Brow]
For k = 0 to n/b-1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow
• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
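A serial Python emulation of the SUMMA schedule on a virtual P^(1/2) x P^(1/2) grid; slicing stands in for the row/column broadcasts of Acol and Brow, and the panel width b is taken as the full n/P^(1/2) block for brevity (a real code would pick b to trade latency against memory):

import numpy as np

def summa_emulated(A, B, p_sqrt):
    n = A.shape[0]
    nb = n // p_sqrt                 # each virtual processor owns an nb x nb block
    C = np.zeros((n, n))
    for k in range(p_sqrt):          # one outer step per block row/column of A, B
        for i in range(p_sqrt):
            Acol = A[i*nb:(i+1)*nb, k*nb:(k+1)*nb]      # "broadcast" along processor row i
            for j in range(p_sqrt):
                Brow = B[k*nb:(k+1)*nb, j*nb:(j+1)*nb]  # "broadcast" down processor column j
                # local update on virtual processor P(i,j)
                C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Acol @ Brow
    return C

n, p_sqrt = 256, 4                   # 16 virtual processors
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_emulated(A, B, p_sqrt), A @ B)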
Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
• For #words_moved: mostly, except nonsym. Eigenproblem
• For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds up to polylog(P) factors
• Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD
Can we do Better?
Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
– Lower bound still true if more memory
– Can we attain it?
– Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
• Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
• Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
– M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
[figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid]
Example: P = 32, c = 2
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
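A serial Python emulation of steps (1)-(3): the c replication layers are represented as slices of the k-sum, and the final sum over layers plays the role of the reduction along the k-axis (ownership and broadcast bookkeeping are omitted; n is assumed divisible by c):

import numpy as np

def matmul_2p5d_emulated(A, B, c):
    n = A.shape[0]
    w = n // c
    partial = []
    for r in range(c):                              # work done by replication layer r
        m0, m1 = r * w, (r + 1) * w
        partial.append(A[:, m0:m1] @ B[m0:m1, :])   # 1/c-th of Σ_m A(i,m)*B(m,j)
    return sum(partial)                             # "sum-reduce along the k-axis"

n, c = 240, 3
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(matmul_2p5d_emulated(A, B, c), A @ B)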
2.5D Matmul on BG/P, 16K nodes / 64K cores
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
[plot annotations: 12x faster, 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
– γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
– γ_E, β_E, α_E = joules for same operations
– δ_E = joules per word of memory used per sec
– ε_E = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
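The two model formulas transcribed directly into Python, with invented parameter values, make the claim easy to check numerically: T(cP) drops by a factor of c while E(cP) stays fixed.

def T_model(c, P, n, M, m, gT, bT, aT):
    # T(cP) = n^3/(cP) * [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ]
    return n**3 / (c * P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

def E_model(c, P, n, M, m, gE, bE, aE, dE, eE, T_cP):
    # E(cP) = cP * { n^3/(cP)*[gamma_E + beta_E/M^(1/2) + alpha_E/(m*M^(1/2))]
    #                + delta_E*M*T(cP) + eps_E*T(cP) }
    per_proc = (n**3 / (c * P)) * (gE + bE / M**0.5 + aE / (m * M**0.5)) \
               + dE * M * T_cP + eE * T_cP
    return c * P * per_proc

P, n = 64, 4096
M, m = 3 * n**2 // P, 1000                       # minimal-memory starting point
tpar = (1e-11, 1e-9, 1e-6)                       # gamma_T, beta_T, alpha_T (invented)
epar = (1e-10, 1e-9, 1e-7, 1e-12, 1e-3)          # gamma_E, beta_E, alpha_E, delta_E, eps_E (invented)
for c in (1, 2, 4):
    T = T_model(c, P, n, M, m, *tpar)
    E = E_model(c, P, n, M, m, *epar, T)
    print(c, T, E)                               # T halves as c doubles; E is unchanged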
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
• What is the minimum energy required for a computation?
• Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
• Given a maximum energy budget E, what is the minimum runtime T that we can attain?
• The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
• Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of P processors could differ
– γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is optimal assignment of work F_i to minimize time?
– T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[ γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2) ] = F_i·ξ_i
– Choose F_i so Σ_i F_i = n^3 and minimize T = max_i T_i
– Answer: F_i = n^3 · (1/ξ_i) / Σ_j (1/ξ_j) and T = n^3 / Σ_j (1/ξ_j)
• Optimal Algorithm for nxn matmul
– Recursively divide into 8 half-sized subproblems
– Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms…
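A small Python sketch of this work split (all processor parameters are invented); the point of the closed form is that every processor finishes at the same time T = F_i·ξ_i:

def optimal_split(n, gammas, betas, alphas, mems):
    # xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2)
    xis = [g + b / M**0.5 + a / M**1.5
           for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv_sum = sum(1.0 / xi for xi in xis)
    F = [n**3 * (1.0 / xi) / inv_sum for xi in xis]   # flops assigned to processor i
    T = n**3 / inv_sum                                # common finish time = F_i * xi_i
    return F, T

F, T = optimal_split(
    n=1024,
    gammas=[1e-9, 2e-9, 4e-9],     # sec/flop for three unlike processors
    betas=[1e-8, 1e-8, 2e-8],      # sec/word
    alphas=[1e-6, 1e-6, 1e-6],     # sec/message
    mems=[2**20, 2**22, 2**18],
)
print(F, T)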
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_mn A(i,j,m,n) · B(m,n,k)
– Communication lower bounds apply
• Complex symmetries possible
– Ex: B(m,n,k) = B(k,m,n) = …
– d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
– Ex: NWChem
• CTF: Cyclops Tensor Framework
– Exploits 2.5D algorithms, symmetries
– Solomonik, Hammond, Matthews
C(i,j,k) = Σ_m A(i,j,m) · B(m,k)
[figure: A 3-fold symm., B 2-fold symm., C 2-fold symm.]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_mn A(i,j,m,n) · B(m,n,k)
– Communication lower bounds apply
• Complex symmetries possible
– Ex: B(m,n,k) = B(k,m,n) = …
– d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
– Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
– Exploits 2.5D algorithms, symmetries
– Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
– Solomonik, Hammond, Matthews
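For reference, the contraction on this slide written with numpy's einsum (no symmetry is exploited here, which is precisely what CTF adds); the dimensions are arbitrary:

import numpy as np

ni, nj, nk, nm, nn = 4, 5, 6, 3, 2
A = np.random.rand(ni, nj, nm, nn)
B = np.random.rand(nm, nn, nk)
C = np.einsum('ijmn,mnk->ijk', A, B)   # C(i,j,k) = Σ_mn A(i,j,m,n) * B(m,n,k)
print(C.shape)                          # (4, 5, 6)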
TSQR: QR of a Tall, Skinny matrix
45
W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag( Q00, Q10, Q20, Q30 ) · [ R00 ; R10 ; R20 ; R30 ]
[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag( Q01, Q11 ) · [ R01 ; R11 ]
[ R01 ; R11 ] = Q02·R02
TSQR: QR of a Tall, Skinny matrix
46
(same tree of factorizations, collecting the result:)
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
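A Python/numpy sketch of TSQR with a binary reduction tree: np.linalg.qr does the local factorizations, only the final R is formed explicitly, and the local Q factors are kept as a list, mirroring the implicit output above. Sizes are illustrative.

import numpy as np

def tsqr(W, nblocks=4):
    local_Qs, Rs = [], []
    for Wi in np.array_split(W, nblocks, axis=0):   # leaves: W_i = Q_i0 * R_i0
        Q, R = np.linalg.qr(Wi)
        local_Qs.append(Q)
        Rs.append(R)
    while len(Rs) > 1:                              # pairwise merges up the tree
        nxt = []
        for a in range(0, len(Rs), 2):
            Q, R = np.linalg.qr(np.vstack(Rs[a:a+2]))
            local_Qs.append(Q)
            nxt.append(R)
        Rs = nxt
    return Rs[0], local_Qs                          # final R, implicit Q tree

W = np.random.rand(4000, 50)
R_tsqr, _ = tsqr(W)
R_ref = np.linalg.qr(W)[1]
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))   # R factors agree up to row signs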
TSQR: An Architecture-Dependent Algorithm
Parallel:   W = [ W0 ; W1 ; W2 ; W3 ] → (R00, R10, R20, R30) → (R01, R11) → R02   [binary reduction tree]
Sequential: W = [ W0 ; W1 ; W2 ; W3 ] → R00 → R01 → R02 → R03   [flat tree, one block at a time]
Dual Core:  W = [ W0 ; W1 ; W2 ; W3 ] → hybrid of the two trees above, ending in R03
Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
– Intel Clovertown – Up to 8x speedup (8 core, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH
  • Up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L
  • Up to 4x speedup (32 procs, 1M x 50)
– Tesla C 2050 / Fermi
  • Up to 13x (110,592 x 100)
– Grid – 4x on 4 cities (Dongarra et al)
– Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
– "Infinite speedup" for out-of-core on PowerPC laptop
  • As little as 2x slowdown vs (predicted) infinite DRAM
  • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"
49
W (n x b) = [ W1 ; W2 ; W3 ; W4 ], with W1 = P1·L1·U1, W2 = P2·L2·U2, W3 = P3·L3·U3, W4 = P4·L4·U4
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
[ W1' ; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'
[ W12' ; W34' ] = P1234·L1234·U1234 → choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
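A Python sketch of the pivot-row selection above, using scipy.linalg.lu (ordinary partial-pivoted LU) as the local chooser; block counts and sizes are illustrative, and the final step of moving the winners to the top of W and factoring without pivoting is not shown:

import numpy as np
from scipy.linalg import lu

def select_pivot_rows(block, b):
    # rows of `block` that partial pivoting moves into the top b positions
    P, L, U = lu(block)                  # block = P @ L @ U
    return np.argmax(P, axis=0)[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    # round 0: each row block nominates b candidate rows (as global indices)
    candidates = []
    for idx in np.array_split(np.arange(W.shape[0]), nblocks):
        candidates.append(idx[select_pivot_rows(W[idx], b)])
    # later rounds: merge candidate sets pairwise until b rows remain
    while len(candidates) > 1:
        merged = []
        for a in range(0, len(candidates), 2):
            idx = np.concatenate(candidates[a:a+2])
            merged.append(idx[select_pivot_rows(W[idx], b)])
        candidates = merged
    return candidates[0]                 # the b "tournament" pivot rows of W

W = np.random.rand(4096, 8)
print(sorted(tournament_pivot_rows(W, b=8)))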
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[plot: axes log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
– Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
– Usual approach, like Partial Pivoting
  • Put longest column first, update rest of matrix, repeat
  • Hard to do using BLAS3 at all, let alone hit lower bound
– Use Tournament Pivoting
  • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
  • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
– Idea extends to other pivoting schemes
  • Cholesky with diagonal pivoting
  • LU with complete pivoting
  • LDLT with complete pivoting
54
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
– Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Classical O(n^3) matmul:      #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:   #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:  #words_moved = Ω( M·(n/M^(1/2))^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if
In practice, how best to interleave BFS and DFS is a "tuning parameter"
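A serial Python sketch of the seven-multiply Strassen recursion that CAPS distributes (a BFS or DFS step decides how the seven recursive products are scheduled across processors; here they simply run one after another). n is assumed a power of two and the cutoff is illustrative:

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)     # the 7 recursive multiplies
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)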
57
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
– A = A^T is banded
– T is tridiagonal
– Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
– Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
– It can communicate even less
Conventional vs CA - SBR
Conventional vs Communication-Avoiding:
• Conventional: touch all data 4 times; Communication-Avoiding: touch all data once
• Many tuning parameters: right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)
Why Minimize Communication (22)
Source John Shalf LBL
Why Minimize Communication (22)
Source John Shalf LBL
Minimize communication to save energy
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Why Minimize Communication (22)
Source John Shalf LBL
Minimize communication to save energy
Goals
6
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
    • Large speed-ups possible
    • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Lower bound for all "direct" linear algebra
• Let M = "fast" memory size (per processor)
  words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  messages_sent ≥ words_moved / largest_message_size
  messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
• Parallel case: assume either load or memory balanced
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
SIAM SIAG/Linear Algebra Prize 2012: Ballard, D., Holtz, Schwartz
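To make the bounds above concrete, a small calculator for the dense matmul case (2n^3 flops split over P processors, M words of fast memory each); the specific numbers plugged in below are illustrative, not from the slides.

def dense_matmul_lower_bounds(n, P, M):
    # words_moved = Ω(#flops / M^(1/2)), messages = Ω(#flops / M^(3/2)),
    # with #flops = 2 n^3 / P per processor (load balanced).
    flops_per_proc = 2 * n**3 / P
    words_moved = flops_per_proc / M**0.5
    messages = flops_per_proc / M**1.5
    return words_moved, messages

# Illustrative numbers: n = 10^4, P = 1024 processors, M = n^2/P words each
n, P = 10_000, 1024
M = n * n / P
w, m = dense_matmul_lower_bounds(n, P, M)
print(f"words_moved >= ~{w:.3g}, messages >= ~{m:.3g} (up to constant factors)")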
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
Naïve Matrix Multiply

  {implements C = C + A*B}
  for i = 1 to n
     {read row i of A into fast memory}            … n^2 reads altogether
     for j = 1 to n
        {read C(i,j) into fast memory}             … n^2 reads altogether
        {read column j of B into fast memory}      … n^3 reads altogether
        for k = 1 to n
           C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}         … n^2 writes altogether

  (Picture: C(i,j) += A(i,:) · B(:,j))
  n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply

  Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is
  called the block size; assume 3 b-by-b blocks fit in fast memory.

  for i = 1 to n/b
     for j = 1 to n/b
        {read block C(i,j) into fast memory}       … b^2 x (n/b)^2 = n^2 reads
        for k = 1 to n/b
           {read block A(i,k) into fast memory}    … b^2 x (n/b)^3 = n^3/b reads
           {read block B(k,j) into fast memory}    … b^2 x (n/b)^3 = n^3/b reads
           C(i,j) = C(i,j) + A(i,k) * B(k,j)       {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}   … b^2 x (n/b)^2 = n^2 writes

  (Picture: block C(i,j) += block A(i,k) · block B(k,j), each b-by-b)
  2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
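A direct transcription of the blocked loop nest in Python/NumPy, with the block size b as an explicit parameter; a pedagogical sketch, not what a tuned BLAS actually does.

import numpy as np

def blocked_matmul(A, B, C, b):
    # C = C + A*B with n x n matrices viewed as (n/b) x (n/b) grids of
    # b x b blocks; each block of A and B is re-read (n/b) times, so
    # words moved ~ 2n^3/b + 2n^2, minimized by the largest b with 3b^2 <= M.
    n = A.shape[0]
    assert n % b == 0, "sketch assumes b divides n"
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cblk = C[i:i+b, j:j+b]          # stays in "fast memory"
            for k in range(0, n, b):
                Cblk += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 256, 32
A, B, C = (np.random.randn(n, n) for _ in range(3))
C_ref = C + A @ B
assert np.allclose(blocked_matmul(A, B, C.copy(), b), C_ref)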
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1, or n/2 x n/2

  func C = RMM (A, B, n)
    if n = 1, C = A * B, else
      C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
      C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
      C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
      C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
    return
Recursive Matrix Multiplication (RMM) (2/2)

  A(n) = # arithmetic operations in RMM( ., ., n)
       = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
       = 2n^3                        … same operations as usual, in a different order
  W(n) = # words moved between fast and slow memory by RMM( ., ., n)
       = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
       = O( n^3 / M^(1/2) + n^2 )    … same as blocked matmul

• "Cache oblivious": works for memory hierarchies, but not a panacea
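A cache-oblivious version of RMM in NumPy: the recursion never mentions M, yet the analysis above says it moves O(n^3/M^(1/2) + n^2) words at every level of the hierarchy. A sketch that assumes n is a power of two and uses a small base case for speed.

import numpy as np

def rmm(A, B, C):
    # C += A*B by recursive splitting; no block size or cache size appears.
    n = A.shape[0]
    if n <= 32:                      # small base case (assumption, for speed)
        C += A @ B
        return
    h = n // 2
    rmm(A[:h, :h], B[:h, :h], C[:h, :h]); rmm(A[:h, h:], B[h:, :h], C[:h, :h])
    rmm(A[:h, :h], B[:h, h:], C[:h, h:]); rmm(A[:h, h:], B[h:, h:], C[:h, h:])
    rmm(A[h:, :h], B[:h, :h], C[h:, :h]); rmm(A[h:, h:], B[h:, :h], C[h:, :h])
    rmm(A[h:, :h], B[:h, h:], C[h:, h:]); rmm(A[h:, h:], B[h:, h:], C[h:, h:])

n = 256
A, B = np.random.randn(n, n), np.random.randn(n, n)
C = np.zeros((n, n))
rmm(A, B, C)
assert np.allclose(C, A @ B)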
CARMA Performance: Shared Memory, Square (m = k = n)
(Plot, log and linear scales: MKL vs CARMA, single and double precision, against single- and double-precision peak.)
Intel Emerald: 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA
CARMA Performance: Shared Memory, Inner Product (m = n = 64)
(Plot, log and linear scales: MKL vs CARMA, single and double precision.)
Intel Emerald: 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524288)
(Plot: 97% fewer misses and 86% fewer misses than MKL.)
Parallel MatMul with 2D Processor Layout
• P processors in a P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
  (Picture: the 4 x 4 grid P00 … P33 overlaid on each of C, A, and B in C = A * B.)
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains the lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be a square processor grid
  (Picture: C(i,j) accumulates products of the row panel A(i,k) and the column panel B(k,j).)
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid

  For k = 0 to n/b-1
     for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
     for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
     Receive A(i,k) into Acol
     Receive B(k,j) into Brow
     C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum, n/P^(1/2)
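A compact sketch of the SUMMA loop with mpi4py, under stated assumptions (perfect-square P, illustrative local sizes and panel width; variable names are mine, not from PBLAS). Run with something like "mpiexec -n 4 python summa_sketch.py".

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()
q = int(round(P ** 0.5))            # q x q processor grid (assumes P is a perfect square)
assert q * q == P
i, j = divmod(comm.Get_rank(), q)   # my grid coordinates

row_comm = comm.Split(color=i, key=j)   # communicator for my processor row
col_comm = comm.Split(color=j, key=i)   # communicator for my processor column

nloc = 128                          # my n/q x n/q blocks of A, B, C (illustrative)
A = np.random.randn(nloc, nloc)
B = np.random.randn(nloc, nloc)
C = np.zeros((nloc, nloc))

b = 32                              # panel width: latency/bandwidth trade-off
for kk in range(0, q * nloc, b):
    k_owner = kk // nloc            # grid column owning this A panel / grid row owning this B panel
    koff = kk % nloc
    # Owner of the A panel broadcasts it across its processor row.
    Acol = A[:, koff:koff + b].copy() if j == k_owner else np.empty((nloc, b))
    row_comm.Bcast(Acol, root=k_owner)
    # Owner of the B panel broadcasts it down its processor column.
    Brow = B[koff:koff + b, :].copy() if i == k_owner else np.empty((b, nloc))
    col_comm.Bcast(Brow, root=k_owner)
    C += Acol @ Brow                # local rank-b update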
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2/P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD
Can we do Better?
Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  (Example: P = 32, c = 2)
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
  (3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
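The payoff of the c-fold replication is roughly a c^(1/2) reduction in words moved per processor; a tiny model comparison (numbers below are illustrative only).

def words_moved_2d(n, P):
    # SUMMA / Cannon with minimal memory: ~ n^2 / P^(1/2) per processor
    return n**2 / P**0.5

def words_moved_25d(n, P, c):
    # 2.5D with c replicas: ~ n^2 / (c*P)^(1/2) per processor
    return n**2 / (c * P)**0.5

n, P = 65536, 16384
for c in (1, 4, 16):
    print(c, words_moved_25d(n, P, c) / words_moved_2d(n, P))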
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
• Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
• Plot annotations: 12x faster, 2.7x faster
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
(A small numeric sketch of these two formulas follows.)
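The two model formulas translate directly into code; the parameter values below are placeholders, not measurements.

def time_model(n, P, M, m, gT, bT, aT):
    # T(P) = n^3/P * (gamma_T + beta_T/sqrt(M) + alpha_T/(m*sqrt(M)))
    return n**3 / P * (gT + bT / M**0.5 + aT / (m * M**0.5))

def energy_model(n, P, M, m, gE, bE, aE, dE, eE, T):
    # E(P) = P * ( n^3/P*(gamma_E + beta_E/sqrt(M) + alpha_E/(m*sqrt(M)))
    #              + delta_E*M*T + epsilon_E*T )
    return P * (n**3 / P * (gE + bE / M**0.5 + aE / (m * M**0.5))
                + dE * M * T + eE * T)

# Perfect strong scaling check: replace P by c*P, keep M and m fixed.
n, P = 2**14, 2**10
M, m = 3 * n**2 // P, 2**10
params_t = dict(gT=1e-9, bT=1e-8, aT=1e-6)          # placeholder constants
for c in (1, 2, 4):
    T = time_model(n, c * P, M, m, **params_t)
    E = energy_model(n, c * P, M, m, 1e-9, 1e-8, 1e-6, 1e-12, 1e-3, T)
    print(f"c={c}: T={T:.3e} (scales ~1/c), E={E:.3e} (stays ~constant)")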
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time? (see the sketch after this list)
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
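The closed-form assignment is only a few lines of code; the per-processor parameters below are made up for illustration.

def optimal_work_split(n, gammas, betas, alphas, mems):
    # xi_i = gamma_i + beta_i/sqrt(M_i) + alpha_i/M_i^(3/2);
    # F_i = n^3 * (1/xi_i) / sum_j (1/xi_j),  T = n^3 / sum_j (1/xi_j)
    xis = [g + b / M**0.5 + a / M**1.5
           for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv_sum = sum(1.0 / x for x in xis)
    F = [n**3 * (1.0 / x) / inv_sum for x in xis]
    T = n**3 / inv_sum
    return F, T

# Two fast/large processors and one slow/small one (placeholder numbers)
F, T = optimal_work_split(4096,
                          gammas=[1e-9, 1e-9, 4e-9],
                          betas=[1e-8, 1e-8, 2e-8],
                          alphas=[1e-6, 1e-6, 1e-6],
                          mems=[1e8, 1e8, 1e7])
print([f"{f:.3g}" for f in F], f"T = {T:.3g} s")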
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)   (see the einsum example after this list)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
  (Picture: a contraction C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B and C 2-fold symmetric.)
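The example contraction maps directly onto numpy.einsum; a toy-sized version, with dimensions chosen arbitrarily and without exploiting any symmetry.

import numpy as np

# C(i,j,k) = sum over m,n of A(i,j,m,n) * B(m,n,k)
I, J, K, M, N = 6, 5, 4, 3, 2
A = np.random.randn(I, J, M, N)
B = np.random.randn(M, N, K)
C = np.einsum('ijmn,mnk->ijk', A, B)

# Same contraction written as explicit loops, for comparison
C_ref = np.zeros((I, J, K))
for i in range(I):
    for j in range(J):
        for k in range(K):
            C_ref[i, j, k] = np.sum(A[i, j, :, :] * B[:, :, k])
assert np.allclose(C, C_ref)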
TSQR: QR of a Tall, Skinny matrix

  Split W into block rows: W = [ W0; W1; W2; W3 ].
  Step 1: factor each block, Wi = Qi0·Ri0, so
          W = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
  Step 2: stack pairs of R factors and factor again,
          [ R00; R10 ] = Q01·R01,  [ R20; R30 ] = Q11·R11, so
          [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]
  Step 3: [ R01; R11 ] = Q02·R02

  Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
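A serial NumPy sketch of this binary-tree TSQR, accumulating the Q factors explicitly so the result can be checked; a real implementation would keep the tree Q's implicit.

import numpy as np

def tsqr(W, nblocks=4):
    # TSQR with a binary reduction tree: QR each block row, then repeatedly
    # stack pairs of R factors and re-factor, folding the small tree Q's
    # into the accumulated Q of each node. Assumes nblocks is a power of 2.
    ncols = W.shape[1]
    blocks = np.array_split(W, nblocks, axis=0)
    Qs, Rs = map(list, zip(*(np.linalg.qr(b) for b in blocks)))
    while len(Rs) > 1:
        newQs, newRs = [], []
        for a in range(0, len(Rs), 2):
            Q, R = np.linalg.qr(np.vstack(Rs[a:a + 2]))
            top = Qs[a] @ Q[:ncols, :]          # child 0 gets the top half of Q
            bot = Qs[a + 1] @ Q[ncols:, :]      # child 1 gets the bottom half
            newQs.append(np.vstack([top, bot]))
            newRs.append(R)
        Qs, Rs = newQs, newRs
    return Qs[0], Rs[0]

W = np.random.randn(4000, 50)
Q, R = tsqr(W)
assert np.allclose(Q @ R, W) and np.allclose(Q.T @ Q, np.eye(50))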
TSQR: An Architecture-Dependent Algorithm

  (Figure: reduction trees over W = [ W0; W1; W2; W3 ].
   Parallel – binary tree: R00, R10, R20, R30 → R01, R11 → R02.
   Sequential – flat tree: R00 → R01 → R02 → R03.
   Dual Core – hybrid of the two.)

  Can choose reduction tree dynamically
  Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

  W (n x b) = [ W1; W2; W3; W4 ],  with  Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'.

  [ W1'; W2' ] = P12·L12·U12  →  choose b pivot rows, call them W12'
  [ W3'; W4' ] = P34·L34·U34  →  choose b pivot rows, call them W34'

  [ W12'; W34' ] = P1234·L1234·U1234  →  choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to the top, do LU without pivoting
  • Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
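A small NumPy/SciPy sketch of the row tournament (serial, binary tree), where scipy's LU with partial pivoting plays each local "game"; this illustrates only the pivot-row selection step, not the full CALU factorization.

import numpy as np
from scipy.linalg import lu

def pivot_rows(block_rows, b):
    # One "game": LU with partial pivoting on the stacked candidates,
    # keep the rows chosen as the first b pivots.
    P, _, _ = lu(block_rows)               # block_rows = P @ L @ U
    order = np.argmax(P, axis=0)           # column k of P marks the k-th pivot row
    return order[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    # Rows advance through a binary tree of local LU "games".
    idx_groups = np.array_split(np.arange(W.shape[0]), nblocks)
    groups = [g[pivot_rows(W[g], b)] for g in idx_groups]     # leaf games
    while len(groups) > 1:
        nxt = []
        for a in range(0, len(groups), 2):
            cand = np.concatenate(groups[a:a + 2])
            nxt.append(cand[pivot_rows(W[cand], b)])
        groups = nxt
    return groups[0]       # global indices of the b winning pivot rows

W = np.random.randn(1024, 8)
print(tournament_pivot_rows(W, b=8))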
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Plot over log2(p) and log2(n^2/p) = log2(memory_per_proc): up to 29x faster.)
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo
FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress
CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu
Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc
bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA
MAGMAbull Large speed-ups possible
bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[ γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2) ] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n³ and minimizing T = max_i T_i
  – Answer: F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j) and T = n³/Σ_j(1/ξ_j)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms…
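The optimal work split can be computed directly from the formulas above. The sketch below uses made-up parameter values and an illustrative function name; each processor i gets F_i = n³·(1/ξ_i)/Σ_j(1/ξ_j) flops, so that all finish at the same time T.

def split_work(n, gammas, betas, alphas, mems):
    # cost per flop on processor i: xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2)
    xis = [g + b / M**0.5 + a / M**1.5
           for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv_sum = sum(1.0 / xi for xi in xis)
    F = [n**3 * (1.0 / xi) / inv_sum for xi in xis]   # flops per processor
    T = n**3 / inv_sum                                # common finish time F_i * xi_i
    return F, T

F, T = split_work(1000, [1e-11, 2e-11], [1e-9, 4e-9], [1e-6, 1e-6], [1e8, 1e7])
print(F, T)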
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k); A: 3-fold symm, B: 2-fold symm, C: 2-fold symm]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
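For concreteness, the running contraction can be written in one line with einsum; the dimension sizes below are toy values chosen only to make the example run, and exploit none of the symmetries or the 2.5D distribution that CTF uses.

import numpy as np

# C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k), with made-up dimension sizes
i = j = k = m = n = 6
A = np.random.rand(i, j, m, n)
B = np.random.rand(m, n, k)
C = np.einsum('ijmn,mnk->ijk', A, B)
print(C.shape)   # (6, 6, 6)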
TSQR: QR of a Tall, Skinny matrix
45
W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ] = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]
[ R01; R11 ] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
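A minimal NumPy sketch of the binary reduction tree above, keeping only the R factors (a real implementation also stores the tree of Q factors so it can apply or form Q later). The function name and block count are illustrative assumptions.

import numpy as np

def tsqr_sketch(W_blocks):
    # level 0: local QR of each row block, W_i = Q_i0 * R_i0
    Rs = [np.linalg.qr(Wi)[1] for Wi in W_blocks]
    # reduce: stack pairs of R factors and re-factor until one R remains
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]            # the final R02

W = np.random.rand(4000, 50)
R_tree = tsqr_sketch(np.array_split(W, 4))
R_ref = np.linalg.qr(W)[1]
# R of a full-rank matrix is unique up to the signs of its rows
assert np.allclose(np.abs(R_tree), np.abs(R_ref))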
TSQR: An Architecture-Dependent Algorithm
Parallel: W = [ W0; W1; W2; W3 ]; local QRs give R00, R10, R20, R30; pairs combine to R01, R11; these combine to R02 (binary reduction tree)
Sequential: W = [ W0; W1; W2; W3 ]; W0 → R00, then fold in W1 → R01, W2 → R02, W3 → R03 (flat tree, one block in fast memory at a time)
Dual Core: a hybrid tree mixing the sequential and parallel patterns, ending in R03
Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
49
W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
Choose b pivot rows of W1, call them W1'; likewise choose b pivot rows of W2, W3, W4, call them W2', W3', W4'
[ W1'; W2' ] = P12·L12·U12, choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34, choose b pivot rows, call them W34'
[ W12'; W34' ] = P1234·L1234·U1234, choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
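A hedged sketch of the tournament, assuming SciPy is available: each block's b candidate pivot rows are selected with ordinary partial pivoting (scipy.linalg.lu_factor), used here as a stand-in for the per-block selection step, and candidates are combined pairwise up the tree. Function names and sizes are illustrative.

import numpy as np
from scipy.linalg import lu_factor

def pivot_rows(block, b):
    # Return the b rows of `block` that GEPP chooses as pivots.
    lu, piv = lu_factor(block)
    order = np.arange(block.shape[0])
    for i, p in enumerate(piv):          # replay the row swaps of partial pivoting
        order[i], order[p] = order[p], order[i]
    return block[order[:b]]

def tournament_pivoting_sketch(W, b, nblocks=4):
    # Per-block selection, then pairwise "playoffs" until b winners remain.
    cands = [pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
    while len(cands) > 1:
        cands = [pivot_rows(np.vstack(cands[i:i+2]), b)
                 for i in range(0, len(cands), 2)]
    return cands[0]

W = np.random.rand(4000, 8)
print(tournament_pivoting_sketch(W, b=8).shape)   # (8, 8): the winning pivot rows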
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p); y-axis log2(n²/p) = log2(memory_per_proc); up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
54
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Classical O(n³) matmul: words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg 7) matmul: words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  vs
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if
In practice, how to best interleave BFS and DFS is a "tuning parameter"
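The BFS/DFS choice sits at the top of each level of the Strassen recursion. Below is a plain serial Strassen sketch, not CAPS itself (which distributes the seven products over processors); a comment marks where a CAPS-style scheduler would make the decision. The cutoff value and the power-of-two size are illustrative assumptions.

import numpy as np

def strassen(A, B, cutoff=64):
    # Here CAPS would choose: BFS step (the 7 products in parallel on P/7
    # processors each, 7/4 memory) or DFS step (the 7 products one after
    # another on all P processors, 1/4 memory). This sketch just recurses.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)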
57
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Conventional vs CA-SBR
Conventional: touch all data 4 times | Communication-Avoiding: touch all data once
Many tuning parameters: right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
    • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
12
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c => total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^{1/2} + αT/(m·M^{1/2}) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^{1/2} + αE/(m·M^{1/2}) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^{1/3} (3D algorithm), if starting with 1 copy of inputs
(The sketch below renders these two formulas as plain Python functions.)
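A hedged sketch of the timing and energy models above; every machine constant passed in is a placeholder to be supplied by the reader, not a measured value.

def model_time(P, n, M, m, gamma_T, beta_T, alpha_T):
    flops = n**3 / P                                   # flops per processor
    return flops * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

def model_energy(P, n, M, m, gamma_E, beta_E, alpha_E, delta_E, eps_E, T):
    """T should be the runtime returned by model_time for the same P, n, M, m."""
    per_proc = (n**3 / P) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
    return P * (per_proc + delta_E * M * T + eps_E * T)

# Perfect strong scaling: replacing P by c*P (with M per processor fixed) divides model_time
# by c and, because T also shrinks by c, leaves model_energy unchanged.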
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^{1/2} + Fi·αi/Mi^{3/2} = Fi·[ γi + βi/Mi^{1/2} + αi/Mi^{3/2} ] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimize T = maxi Ti
  – Answer: Fi = n^3·(1/ξi) / Σj(1/ξj) and T = n^3 / Σj(1/ξj)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
(A small sketch of this work split follows below.)
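A small sketch of the optimal work split just derived; the per-processor parameters in the example call are made-up values for illustration only.

def work_split(n, gammas, betas, alphas, mems):
    xis = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv = [1.0 / xi for xi in xis]
    total_inv = sum(inv)
    F = [n**3 * w / total_inv for w in inv]        # flops assigned to each processor i
    T = n**3 / total_inv                           # resulting runtime (every Fi * xi_i equals T)
    return F, T

# Example: two fast processors and one 4x-slower one; the slow one gets proportionally less work.
F, T = work_split(1024, gammas=[1e-9, 1e-9, 4e-9], betas=[1e-8]*3,
                  alphas=[1e-6]*3, mems=[1e8]*3)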
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
TSQR: QR of a Tall, Skinny matrix
45
W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]
[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]
[ R01 ; R11 ] = Q02·R02
TSQR: QR of a Tall, Skinny matrix
46
(Same two-level reduction as above.)
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
(A numpy sketch of this reduction follows below.)
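A minimal numpy sketch of the reduction above, computing only the R factor of a tall-skinny W through a binary tree of local QR factorizations (the function and variable names here are ours, not the paper's; a serial stand-in for the parallel tree).

import numpy as np

def tsqr_r(W, leaves=4):
    """R factor of tall-skinny W via a binary reduction of local R factors."""
    blocks = np.array_split(W, leaves, axis=0)
    Rs = [np.linalg.qr(Wb, mode='r') for Wb in blocks]       # leaf QRs: Wi = Qi0 * Ri0
    while len(Rs) > 1:                                        # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

# Check: for W of size 1000 x 50, tsqr_r(W) matches np.linalg.qr(W, mode='r')
# up to the sign of each row (both are valid R factors).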
TSQR: An Architecture-Dependent Algorithm
W = [ W0 ; W1 ; W2 ; W3 ]
• Parallel (binary tree): {W0, W1, W2, W3} -> {R00, R10, R20, R30} -> {R01, R11} -> R02
• Sequential (flat tree): W0 -> R00; [R00; W1] -> R01; [R01; W2] -> R02; [R02; W3] -> R03
• Dual Core (hybrid tree): mixes the two, e.g. flat-tree steps within each core and pairwise merges between cores
Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
49
W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
[ W1' ; W2' ] = P12·L12·U12  – choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34  – choose b pivot rows, call them W34'
[ W12' ; W34' ] = P1234·L1234·U1234  – choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
(A rough Python sketch of this tournament follows below.)
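A rough sketch of the tournament above (our own simplified helpers; scipy's LU with partial pivoting is used to pick each group's b candidate rows).

import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(block, b):
    """Indices (into block) of the b rows that partial pivoting selects for an n x b block."""
    P, L, U = lu(block)                      # block = P @ L @ U
    perm = np.argmax(P, axis=0)              # row perm[i] of block becomes row i after pivoting
    return perm[:b]

def tournament_pivots(W, b, groups=4):
    idx = np.arange(W.shape[0])
    # leaves: each group of rows proposes b candidate pivot rows
    cand = [g[choose_pivot_rows(W[g], b)] for g in np.array_split(idx, groups)]
    while len(cand) > 1:                     # reduction tree: merge pairs of candidate sets
        merged = []
        for i in range(0, len(cand), 2):
            rows = np.concatenate(cand[i:i+2])
            merged.append(rows[choose_pivot_rows(W[rows], b)])
        cand = merged
    return cand[0]                           # b global pivot rows: move to top, then LU w/o pivoting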
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]
Up to 29x faster
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
(A sketch of one tournament round for column selection follows below.)
54
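A hedged sketch of one tournament round for this column-pivoting variant, using scipy's QR with column pivoting to pick the "best" b columns out of two groups of b candidates (helper name and interface are ours).

import numpy as np
from scipy.linalg import qr

def best_b_columns(candidates, b):
    """candidates: matrix whose 2b columns are the two competing groups; returns b winners."""
    _, _, piv = qr(candidates, mode='economic', pivoting=True)
    return piv[:b]                            # indices of the b columns chosen by RRQR-style pivoting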
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^{2/ω}
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Classical O(n^3) matmul: words_moved = Ω( M·(n/M^{1/2})^3 / P )
Strassen's O(n^{lg 7}) matmul: words_moved = Ω( M·(n/M^{1/2})^{lg 7} / P )
Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^{1/2})^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter"
(A serial sketch of the underlying Strassen recursion follows below.)
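For reference, a serial numpy sketch of the Strassen recursion that CAPS parallelizes; this stand-in runs the 7 multiplies sequentially (a DFS-only schedule), and the comments mark where a BFS step would instead fork them onto P/7 processor groups.

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff or n % 2:                 # fall back to classical matmul on small or odd sizes
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # the 7 multiplies: in a BFS step each would go to its own group of P/7 processors
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C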
57
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• Q·A·Q^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense -> Banded -> Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Conventional vs CA-SBR
• Conventional: touch all data 4 times; Communication-Avoiding: touch all data once
• Many tuning parameters; right choices reduce words_moved by a factor of M/bw, not just M^{1/2}
Lower bound for all ldquodirectrdquo linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
12
• Let M = "fast" memory size (per processor)
  words_moved (per processor) = Ω( #flops (per processor) / M^{1/2} )
  messages_sent (per processor) = Ω( #flops (per processor) / M^{3/2} )
• Parallel case: assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
13
• Let M = "fast" memory size (per processor)
  words_moved (per processor) = Ω( #flops (per processor) / M^{1/2} )
  messages_sent ≥ words_moved / largest_message_size
• Parallel case: assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
14
• Let M = "fast" memory size (per processor)
  words_moved (per processor) = Ω( #flops (per processor) / M^{1/2} )
  messages_sent (per processor) = Ω( #flops (per processor) / M^{3/2} )
• Parallel case: assume either load or memory balanced
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
15
16
Naïve Matrix Multiply
implements C = C + A*B
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) += (row i of A) · (column j of B)]
17
Naïve Matrix Multiply
implements C = C + A*B
for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
        {read C(i,j) into fast memory}
        {read column j of B into fast memory}
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}
18
Naïve Matrix Multiply
implements C = C + A*B
for i = 1 to n
    {read row i of A into fast memory}            … n^2 reads altogether
    for j = 1 to n
        {read C(i,j) into fast memory}            … n^2 reads altogether
        {read column j of B into fast memory}     … n^3 reads altogether
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}        … n^2 writes altogether
n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
    for j = 1 to n/b
        {read block C(i,j) into fast memory}
        for k = 1 to n/b
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i,j) = C(i,j) + A(i,k) * B(k,j)     {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}
[Figure: C(i,j) += A(i,k) * B(k,j), b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
for i = 1 to n/b
    for j = 1 to n/b
        {read block C(i,j) into fast memory}      … b^2 x (n/b)^2 = n^2 reads
        for k = 1 to n/b
            {read block A(i,k) into fast memory}  … b^2 x (n/b)^3 = n^3/b reads
            {read block B(k,j) into fast memory}  … b^2 x (n/b)^3 = n^3/b reads
            C(i,j) = C(i,j) + A(i,k) * B(k,j)     {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}  … b^2 x (n/b)^2 = n^2 writes
2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
(A runnable numpy version of this loop follows below.)
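A runnable Python/numpy version of the blocked loop above (assuming b divides n; in the real setting b is chosen so that three b-by-b blocks fit in fast memory).

import numpy as np

def blocked_matmul(A, B, b):
    n = A.shape[0]
    assert n % b == 0
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # "read block C(i,j) into fast memory"
            for k in range(0, n, b):
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]   # b-by-b block multiply
            # Cij is a view into C, so the update is already "written back"
    return C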
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^{1/2} + 2n^2
• Attains lower bound = Ω( flops / M^{1/2} )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
21
How hard is hand-tuning matmul, anyway?
22
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
How hard is hand-tuning matmul, anyway?
23
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ; A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij etc. is 1x1 or n/2 x n/2
24
func C = RMM (A, B, n)
    if n = 1
        C = A * B
    else
        C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
        C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
        C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
        C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
    return
Recursive Matrix Multiplication (RMM) (2/2)
25
func C = RMM (A, B, n)
    if n = 1
        C = A * B
    else
        C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
        C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
        C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
        C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
    return
A(n) = # arithmetic operations in RMM( . , . , n)
     = 8·A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast, slow memory by RMM( . , . , n)
     = 8·W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^{1/2} + n^2 ) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea
(A runnable numpy version of RMM follows below.)
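A runnable numpy rendering of the RMM pseudocode above (assumes n is a power of 2; a real implementation would stop the recursion at a block that fits in cache rather than at 1x1).

import numpy as np

def rmm(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty_like(A)
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C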
CARMA Performance: Shared Memory
Square: m = k = n
[Plot: performance of MKL vs CARMA, single and double precision, with single- and double-precision peak lines; log and linear scales]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory
Inner Product: m = n = 64
[Plot: performance of MKL vs CARMA, single and double precision; log and linear scales]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524,288)
[Plot, linear scale] 97% fewer misses; 86% fewer misses
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
13
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Lower bound for all ldquodirectrdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
14
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
- Example: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  - Communication lower bounds apply
- Complex symmetries are possible
  - Example: B(m,n,k) = B(k,m,n) = …
  - d-fold symmetry can save up to d-fold flops and memory
- Heavily used in electronic structure calculations
  - Example: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
- CTF = Cyclops Tensor Framework
  - Exploits 2.5D algorithms and symmetries
  - Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  - Solomonik, Hammond, Matthews
(Figure: contraction C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B and C 2-fold symmetric)
(see the einsum example below)
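For concreteness, the contraction C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k) can be written in one line with numpy.einsum. This is only a serial illustration of the operation itself, not of CTF; the dimensions are arbitrary.

import numpy as np

A = np.random.rand(4, 4, 5, 6)           # A(i,j,m,n)
B = np.random.rand(5, 6, 3)              # B(m,n,k)
C = np.einsum('ijmn,mnk->ijk', A, B)     # C(i,j,k) = sum over m and n

# A d-fold symmetry such as B(m,n,k) = B(k,m,n) = ... means only a fraction of the
# entries are independent, which is what CTF exploits to save flops and memory.
print(C.shape)   # (4, 4, 3)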
TSQR: QR of a Tall, Skinny matrix
- Split W by rows into four blocks and factor each block independently:
    W = [ W0 ; W1 ; W2 ; W3 ]
      = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
      = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]
- Stack the R factors in pairs and factor again:
    [ R00 ; R10 ] = Q01·R01,   [ R20 ; R30 ] = Q11·R11
    so [ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]
- One final factorization combines the last two R factors:
    [ R01 ; R11 ] = Q02·R02
- Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
  (sketch below)
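A minimal sketch of the same two-level reduction (my own illustration, not the tutorial's implementation), using numpy.linalg.qr for the local factorizations; a real TSQR would keep the whole tree of Q factors and run the levels in parallel or out-of-core.

import numpy as np

def tsqr_4blocks(W):
    """Return the final R factor of W via the 4-block, 2-level TSQR tree above."""
    blocks = np.array_split(W, 4, axis=0)                 # W0, W1, W2, W3
    R0 = [np.linalg.qr(Wb)[1] for Wb in blocks]           # R00, R10, R20, R30
    R01 = np.linalg.qr(np.vstack(R0[0:2]))[1]             # combine R00, R10
    R11 = np.linalg.qr(np.vstack(R0[2:4]))[1]             # combine R20, R30
    R02 = np.linalg.qr(np.vstack([R01, R11]))[1]          # final R factor
    return R02

W = np.random.rand(1000, 8)                               # tall and skinny
R = tsqr_4blocks(W)
R_ref = np.linalg.qr(W)[1]                                # direct QR of W
assert np.allclose(np.abs(R), np.abs(R_ref))              # R factors agree up to row signs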
TSQR: An Architecture-Dependent Algorithm
- Parallel: W = [W0;W1;W2;W3]; factor the four blocks independently (R00, R10, R20, R30), combine the R factors pairwise (R01, R11), then once more (R02) – a binary reduction tree
- Sequential: fold the blocks in one at a time (R00, then R01, R02, R03) – a flat reduction tree
- Dual core: a hybrid of the two trees
- Can choose the reduction tree dynamically
- Multicore, Multisocket, Multirack, Multisite, Out-of-core: choose the tree to match the architecture
TSQR Performance Results
- Parallel
  - Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  - Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  - Grid: 4x on 4 cities (Dongarra et al.)
  - Cloud: ~2 map-reduces (Gleich and Benson)
- Sequential
  - "Infinite speedup" for out-of-core on a PowerPC laptop
    - As little as 2x slowdown vs (predicted) infinite DRAM
    - LAPACK with virtual memory never finished
- SVD costs about the same
- Building block for QR of a general matrix
- Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
- Split W (n x b) by rows into W1, W2, W3, W4 and factor each block: Wi = Pi·Li·Ui
- From each Wi choose b pivot rows; call them Wi'
- Stack pairs and factor again:
    [ W1' ; W2' ] = P12·L12·U12, choose b pivot rows, call them W12'
    [ W3' ; W4' ] = P34·L34·U34, choose b pivot rows, call them W34'
- Factor [ W12' ; W34' ] = P1234·L1234·U1234 and choose the final b pivot rows
- Go back to W and use these b pivot rows:
  - Move them to the top, then do LU without pivoting
  - Extra work, but a lower-order term
- Theorem: as numerically stable as Partial Pivoting on a larger matrix
  (see the sketch below)
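A rough serial sketch of one tournament (my own illustration; it assumes SciPy is available and uses scipy.linalg.lu, i.e. LU with partial pivoting, to pick each group's b candidate rows):

import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(subW, b):
    """Indices (within subW) of the b rows that partial pivoting moves to the top."""
    P, L, U = lu(subW)                       # subW = P @ L @ U
    perm = np.argmax(P, axis=0)              # perm[i] = original row moved to position i
    return perm[:b]

def tournament_pivot_rows(W, b):
    """Pick b pivot rows of W by a two-level tournament over 4 row blocks."""
    blocks = np.array_split(np.arange(W.shape[0]), 4)
    cands = [blk[choose_pivot_rows(W[blk], b)] for blk in blocks]        # round 1: W1'..W4'
    semi = [np.concatenate(cands[0:2]), np.concatenate(cands[2:4])]
    semi = [s[choose_pivot_rows(W[s], b)] for s in semi]                 # round 2: W12', W34'
    final = np.concatenate(semi)
    return final[choose_pivot_rows(W[final], b)]                         # final b pivot rows

W = np.random.rand(64, 4)
print(tournament_pivot_rows(W, b=4))         # global indices of the chosen pivot rows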
LU Speedups from Tournament Pivoting and 2.5D
(Plot: 2.5D vs 2D LU, with and without pivoting)
Exascale Machine Parameters (Source: DOE Exascale Workshop)
- 2^20 ≈ 1,000,000 nodes
- 1024 cores/node (a billion cores)
- 100 GB/sec interconnect bandwidth
- 400 GB/sec DRAM bandwidth
- 1 microsec interconnect latency
- 50 nanosec memory latency
- 32 Petabytes of memory
- 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Plot: speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x faster)
Other CA algorithms
- The need for pivoting arises beyond LU, e.g. in QR
  - Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  - Usual approach, like Partial Pivoting:
    - Put the longest column first, update the rest of the matrix, repeat
    - Hard to do using BLAS3 at all, let alone hit the lower bound
  - Use Tournament Pivoting instead:
    - Each round of the tournament selects the best b columns from two groups of b columns, using either the usual approach or something better (Gu/Eisenstat)
    - Theorem: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  - The idea extends to other pivoting schemes:
    - Cholesky with diagonal pivoting
    - LU with complete pivoting
    - LDL^T with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms
- Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^{1/2})^3 / P )
- Strassen's O(n^{lg 7}) matmul: words_moved = Ω( M·(n/M^{1/2})^{lg 7} / P )
- Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^{1/2})^ω / P )
- Proof: graph expansion (different from classical matmul)
  - The Strassen-like DAG must be "regular" and connected
- Extends up to M = n^2 / P^{2/ω}
- Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
- Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)
- BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
- DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
- CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if
- In practice, how best to interleave BFS and DFS is a "tuning parameter"
  (see the sketch below)
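For reference, a plain serial Strassen recursion (my own sketch). In CAPS, the per-level choice of running the 7 products concurrently (BFS) or one after another (DFS) is exactly the memory/communication trade-off described above; here everything simply runs sequentially.

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff or n % 2:                 # fall back to classical matmul
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)   # the 7 recursive multiplies
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)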
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
- Speedups: 24%–184% (over previous Strassen-based algorithms)
- Invited to appear as a Research Highlight in CACM
Symmetric Band Reduction (SBR)
- Grey Ballard and Nick Knight
- A → Q·A·Q^T = T, where
  - A = A^T is banded
  - T is tridiagonal
  - A similar idea works for the SVD of a band matrix
- Use alone, or as the second phase when A is dense:
  - Dense → Banded → Tridiagonal
- Implemented in LAPACK's sytrd
- The algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  - It can communicate even less

Conventional vs CA-SBR
- Conventional: touch all data 4 times
- Communication-Avoiding: touch all data once
  - Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^{1/2}
Can we attain these lower bounds?
- Do conventional dense algorithms, as implemented in LAPACK and ScaLAPACK, attain these bounds?
  - Often not
- If not, are there other algorithms that do?
  - Yes, for much of dense linear algebra
  - New algorithms, with new numerical properties, new ways to encode answers, new data structures
  - Not just loop transformations
- Only a few sparse algorithms so far
- Lots of work in progress
- Case study: Matrix Multiply
Naïve Matrix Multiply

    {implements C = C + A·B}
    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) · B(k,j)

(Figure: C(i,j) = C(i,j) + A(i,:) · B(:,j))
Naïve Matrix Multiply, counting the data movement

    {implements C = C + A·B}
    for i = 1 to n
      {read row i of A into fast memory}          … n^2 reads altogether
      for j = 1 to n
        {read C(i,j) into fast memory}            … n^2 reads altogether
        {read column j of B into fast memory}     … n^3 reads altogether
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) · B(k,j)
        {write C(i,j) back to slow memory}        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates the 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply

    Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is the block size;
    assume 3 b-by-b blocks fit in fast memory.

    for i = 1 to n/b
      for j = 1 to n/b
        {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
        for k = 1 to n/b
          {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
          {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
          C(i,j) = C(i,j) + A(i,k) · B(k,j)       {do a matrix multiply on b-by-b blocks}
        {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
(see the runnable sketch below)
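The blocked loop nest above is easy to make runnable; here is a NumPy sketch (mine, not the tutorial's code), where each b-by-b block product stands in for the fast-memory working set.

import numpy as np

def blocked_matmul(A, B, C, b):
    """C += A @ B using b-by-b blocks (n must be a multiple of b)."""
    n = A.shape[0]
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]                            # "read block C(i,j)"
            for k in range(0, n, b):
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]     # b-by-b block multiply-add
            # Cij is a view into C, so the result is already "written back"
    return C

n, b = 256, 32
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
assert np.allclose(blocked_matmul(A, B, C, b), A @ B)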
Does blocked matmul attain the lower bound?
- Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
- Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 2n^3/(M/3)^{1/2} + 2n^2
- Attains the lower bound = Ω( flops / M^{1/2} )
- But what if we don't know M?
- Or if there are multiple levels of fast memory?
- How do we write the algorithm?
How hard is hand-tuning matmul, anyway?
- Results of 22 student teams trying to tune matrix-multiply in CS267, Spring 2009
- Students were given "blocked" code to start with (7x faster than naïve)
- Still hard to get close to vendor-tuned performance (ACML) (another 6x)
- For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
(A second slide with the same title shows the teams' performance plots)
Recursive Matrix Multiplication (RMM) (1/2)
- For simplicity: square matrices with n = 2^m
- C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]
- True when each Aij etc. is 1x1 or n/2 x n/2

    func C = RMM(A, B, n)
      if n = 1
        C = A · B
      else
        C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
        C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
        C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
        C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
      return
Recursive Matrix Multiplication (RMM) (2/2)
- A(n) = arithmetic operations in RMM(·, ·, n)
       = 8 · A(n/2) + 4·(n/2)^2 if n > 1, else 1
       = 2n^3 … same operations as usual, in a different order
- W(n) = words moved between fast and slow memory by RMM(·, ·, n)
       = 8 · W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
       = O( n^3 / M^{1/2} + n^2 ) … same as blocked matmul
- "Cache oblivious": works for memory hierarchies, but not a panacea
  (runnable version below)
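The RMM pseudocode translates directly into a runnable sketch (mine; slow in pure Python because of call overhead, and n is assumed to be a power of 2):

import numpy as np

def rmm(A, B):
    """Recursive matmul following the 2x2 block splitting above."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])   # C11
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])   # C12
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])   # C21
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])   # C22
    return C

A, B = np.random.rand(32, 32), np.random.rand(32, 32)
assert np.allclose(rmm(A, B), A @ B)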
CARMA Performance: Shared Memory, Square (m = k = n)
(Plot: MKL and CARMA, single and double precision, compared with machine peak)
Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA
CARMA Performance: Shared Memory, Inner Product (m = n = 64)
(Plot: MKL vs CARMA, single and double precision)
Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
- Shared-memory inner product (m = n = 64, k = 524288)
(Plot: CARMA incurs 97% fewer and 86% fewer L3 misses than MKL)
Parallel MatMul with a 2D Processor Layout
- P processors in a P^{1/2} x P^{1/2} grid
  - Processors communicate along rows and columns
- Each processor owns an n/P^{1/2} x n/P^{1/2} submatrix of A, B, and C
- Example: P = 16 processors, numbered P00 to P33
  - Processor Pij owns submatrices Aij, Bij, and Cij
(Figure: C = A · B, with each of C, A, B laid out as a 4x4 grid of blocks P00…P33)
SUMMA Algorithm
- SUMMA = Scalable Universal Matrix Multiply
- Attains the lower bounds:
  - Assume fast memory size M = O(n^2/P) per processor – 1 copy of the data
  - words_moved = Ω( flops / M^{1/2} ) = Ω( (n^3/P) / (n^2/P)^{1/2} ) = Ω( n^2 / P^{1/2} )
  - messages = Ω( flops / M^{3/2} ) = Ω( (n^3/P) / (n^2/P)^{3/2} ) = Ω( P^{1/2} )
- Can accommodate any processor grid, matrix dimensions & layout
- Used in practice in PBLAS = Parallel BLAS
  - www.netlib.org/lapack/lawns/lawn{96,100}.ps
- Comparison to Cannon's Algorithm:
  - Cannon attains the lower bound
  - But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on a P^{1/2} x P^{1/2} grid
- C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
- A(i,k) is an n/P^{1/2} x b submatrix of A
- B(k,j) is a b x n/P^{1/2} submatrix of B
- C(i,j) = C(i,j) + Σ_k A(i,k)·B(k,j)
  - summation over submatrices
  - need not be a square processor grid
(Figure: block row A(i,k) and block column B(k,j) contributing to C(i,j))
SUMMA – n x n matmul on a P^{1/2} x P^{1/2} grid

    for k = 0 to n/b - 1
      for all i = 1 to P^{1/2}
        owner of A(i,k) broadcasts it to the whole processor row (using a binary tree)
      for all j = 1 to P^{1/2}
        owner of B(k,j) broadcasts it to the whole processor column (using a binary tree)
      Receive A(i,k) into Acol
      Receive B(k,j) into Brow
      C_myproc = C_myproc + Acol · Brow

- Attains the bandwidth lower bound
- Attains the latency lower bound if b is near its maximum, n/P^{1/2}
  (simulated in the sketch below)
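A serial NumPy simulation of the SUMMA loop (my own sketch): the row and column "broadcasts" become array slices, and each grid cell (i,j) updates its local block of C.

import numpy as np

def summa_sim(A, B, p_sqrt, b):
    """Simulate SUMMA on a p_sqrt x p_sqrt grid with panel width b (n divisible by both)."""
    n = A.shape[0]
    blk = n // p_sqrt                                    # local block size per "processor"
    C = np.zeros((n, n))
    for k in range(0, n, b):                             # outer loop over panels of width b
        Acol = A[:, k:k+b]                               # "broadcast A(i,k) along each row"
        Brow = B[k:k+b, :]                               # "broadcast B(k,j) along each column"
        for i in range(p_sqrt):
            for j in range(p_sqrt):                      # each processor's local update
                rows = slice(i * blk, (i + 1) * blk)
                cols = slice(j * blk, (j + 1) * blk)
                C[rows, cols] += Acol[rows, :] @ Brow[:, cols]
    return C

n, p_sqrt, b = 64, 4, 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_sim(A, B, p_sqrt, b), A @ B)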
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
16
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n for j = 1 to n
for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)
= +
C(ij) A(i)
B(j)C(ij)
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
17
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory
= +
C(ij) A(i)
B(j)C(ij)
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
18
Naiumlve Matrix Multiply
implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether
= +
C(ij) A(i)
B(j)C(ij)
n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic
19
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
    for j = 1 to n/b
        read block C(i,j) into fast memory
        for k = 1 to n/b
            read block A(i,k) into fast memory
            read block B(k,j) into fast memory
            C(i,j) = C(i,j) + A(i,k) * B(k,j)        {do a matrix multiply on blocks}
        write block C(i,j) back to slow memory

[Diagram: C(i,j) = C(i,j) + A(i,k) · B(k,j), b-by-b blocks]
20
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
    for j = 1 to n/b
        read block C(i,j) into fast memory           … b^2 × (n/b)^2 = n^2 reads
        for k = 1 to n/b
            read block A(i,k) into fast memory       … b^2 × (n/b)^3 = n^3/b reads
            read block B(k,j) into fast memory       … b^2 × (n/b)^3 = n^3/b reads
            C(i,j) = C(i,j) + A(i,k) * B(k,j)        {do a matrix multiply on blocks}
        write block C(i,j) back to slow memory       … b^2 × (n/b)^2 = n^2 writes

[Diagram: C(i,j) = C(i,j) + A(i,k) · B(k,j), b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster! (see the Python sketch below)
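To make the loop nest above concrete, here is a minimal Python/NumPy sketch of blocked matmul; the choice of block size b and the use of a NumPy matmul for the b-by-b blocks are illustrative assumptions, not part of the slide.

    import numpy as np

    def blocked_matmul(A, B, C, b):
        """C = C + A*B, computed b-by-b block at a time.
        Slow memory = the full arrays; 'fast memory' = the three blocks held in the inner loops."""
        n = A.shape[0]
        assert n % b == 0, "for simplicity, assume the block size divides n"
        for i in range(0, n, b):
            for j in range(0, n, b):
                Cij = C[i:i+b, j:j+b].copy()      # read block C(i,j): n^2 reads in total
                for k in range(0, n, b):
                    Aik = A[i:i+b, k:k+b]         # read block A(i,k): n^3/b reads in total
                    Bkj = B[k:k+b, j:j+b]         # read block B(k,j): n^3/b reads in total
                    Cij += Aik @ Bkj              # b-by-b matrix multiply in "fast memory"
                C[i:i+b, j:j+b] = Cij             # write block C(i,j): n^2 writes in total
        return C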
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains the lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M? Or if there are multiple levels of fast memory? How do we write the algorithm?
21
How hard is hand-tuning matmul, anyway?
22
• Results of 22 student teams trying to tune matrix-multiply in CS267, Spr09
• Students were given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul, anyway?
23
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1×1 or n/2 × n/2

24

func C = RMM(A, B, n)
    if n = 1
        C = A * B
    else
        C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
        C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
        C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
        C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return
Recursive Matrix Multiplication (RMM) (2/2)
25
func C = RMM(A, B, n)
    if n = 1
        C = A * B
    else
        C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
        C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
        C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
        C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order
W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works across memory hierarchies, but is not a panacea.
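A cache-oblivious Python sketch of RMM for n a power of 2 (illustrative only; a practical version would stop the recursion at a base-case size larger than 1 and call a tuned kernel there):

    import numpy as np

    def rmm(A, B):
        """Recursive matmul: same 2n^3 flops as the usual algorithm, in a different order,
        and O(n^3/M^(1/2) + n^2) words moved without needing to know the fast-memory size M."""
        n = A.shape[0]
        if n == 1:
            return A * B                          # 1x1 base case
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        C11 = rmm(A11, B11) + rmm(A12, B21)
        C12 = rmm(A11, B12) + rmm(A12, B22)
        C21 = rmm(A21, B11) + rmm(A22, B21)
        C22 = rmm(A21, B12) + rmm(A22, B22)
        return np.block([[C11, C12], [C21, C22]])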
CARMA Performance: Shared Memory
[Plot: square case, m = k = n (log x-axis, linear y-axis); MKL vs CARMA in single and double precision, with single- and double-precision peak lines. Machine: Intel Emerald – 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA]
CARMA Performance: Shared Memory
[Plot: inner-product case, m = n = 64 (log x-axis, linear y-axis); MKL vs CARMA in single and double precision. Machine: Intel Emerald – 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA]
Why is CARMA Faster? L3 Cache Misses
[Plot: shared-memory inner product (m = n = 64, k = 524,288); 97% fewer misses and 86% fewer misses for CARMA vs MKL]
Parallel MatMul with 2D Processor Layout
• P processors in a P^(1/2) × P^(1/2) grid
  – Processors communicate along rows and columns
• Each processor owns an n/P^(1/2) × n/P^(1/2) submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

[Diagram: C = A · B, with each of C, A, and B laid out on the 4×4 processor grid P00 … P33]
30
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains the lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of the data
    • words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages  = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (lawn96, lawn100)
• Comparison to Cannon's algorithm
  – Cannon attains the lower bound
  – But Cannon is harder to generalize to other grids, dimensions, and layouts, and Cannon may use more memory
31
SUMMA – n × n matmul on a P^(1/2) × P^(1/2) grid
• C(i,j) is the n/P^(1/2) × n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) × b submatrix of A
• B(k,j) is a b × n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k)·B(k,j)
  – summation over submatrices
  – need not be a square processor grid

[Diagram: C(i,j) accumulates A(i,k)·B(k,j) as the panel index k sweeps across the block row of A and the block column of B]
32
SUMMA – n × n matmul on a P^(1/2) × P^(1/2) grid

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to the whole processor row (using a binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to the whole processor column (using a binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Attains the bandwidth lower bound
• Attains the latency lower bound if b is near its maximum, n/P^(1/2)
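The k-loop above can be emulated serially in NumPy over a logical √P × √P grid; the "broadcasts" are modeled by simply reading the owner's panel, and the grid size and panel width b are assumptions chosen for illustration.

    import numpy as np

    def summa(A, B, pgrid, b):
        """Serial emulation of SUMMA on a pgrid x pgrid logical processor grid, panel width b."""
        n = A.shape[0]
        assert n % pgrid == 0, "assume the grid dimension divides n"
        blk = n // pgrid                          # each "processor" owns a blk x blk tile of C
        C = np.zeros((n, n))
        for k in range(0, n, b):
            Acol = A[:, k:k+b]                    # broadcast of A(i,k) along each processor row
            Brow = B[k:k+b, :]                    # broadcast of B(k,j) along each processor column
            for pi in range(pgrid):               # every processor updates its own C tile:
                for pj in range(pgrid):
                    rows = slice(pi*blk, (pi+1)*blk)
                    cols = slice(pj*blk, (pj+1)*blk)
                    C[rows, cols] += Acol[rows, :] @ Brow[:, cols]   # C_myproc += Acol * Brow
        return C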
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n×n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall the lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  – For words_moved: mostly, except the nonsymmetric eigenproblem
  – For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  – Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD
Can we do Better?

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – The lower bound is still true if there is more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81]; Bernsten [89]; Agarwal, Chandra, Snir [90]; Johnson [93]; Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) × P^(1/3) × P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) × n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
    • Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid

[Diagram: (P/c)^(1/2) × (P/c)^(1/2) × c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) × n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce the partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
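A back-of-the-envelope sketch of why replication helps, using the Ω(n^2/(cP)^(1/2)) per-processor bandwidth scaling implied above (constants omitted; the numbers in the example are made up):

    def words_moved_2d(n, P):
        # 2D algorithms (c = 1) move on the order of n^2 / P^(1/2) words per processor
        return n**2 / P**0.5

    def words_moved_25d(n, P, c):
        # with c replicas, the 2.5D algorithm moves ~ n^2 / (c*P)^(1/2) words per processor,
        # i.e. a factor of c^(1/2) less (valid for c <= P^(1/3))
        return n**2 / (c * P)**0.5

    # Example: P = 16384 processors, c = 16 copies -> 4x less bandwidth per processor
    print(words_moved_2d(2**15, 16384) / words_moved_25d(2**15, 16384, 16))   # 4.0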
2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
[Plot annotations: 12x faster, 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for the timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for the energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (the 3D algorithm) if starting with 1 copy of the inputs
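The timing and energy model transcribes directly into Python; this is only a sketch of the formulas above, and any parameter values plugged in are the reader's own assumptions.

    def time_model(P, n, M, m, gamma_T, beta_T, alpha_T):
        # n^3/P flops per processor, each costing gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) secs
        return (n**3 / P) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    def energy_model(P, n, M, m, gamma_E, beta_E, alpha_E, delta_E, eps_E, T):
        # P processors, each doing n^3/P useful work, plus memory (delta_E*M) and leakage (eps_E)
        # joules per second over the whole runtime T
        per_proc = (n**3 / P) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
        return P * (per_proc + (delta_E * M + eps_E) * T)

    # With M fixed per processor, time_model(c*P, ...) = time_model(P, ...)/c,
    # and the corresponding energy_model(c*P, ...) equals energy_model(P, ...): perfect strong scaling.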
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of the P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize the time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[ γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2) ] = F_i·ξ_i
  – Choose the F_i so that Σ_i F_i = n^3 and T = max_i T_i is minimized
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j) (see the sketch below)
• Optimal algorithm for n×n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so that they add up to F_i flops
• Works for Strassen, other algorithms…
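The optimal work split is easy to compute from the formulas above; the following sketch uses made-up per-processor parameters purely for illustration.

    def optimal_split(n, gammas, betas, alphas, Ms):
        # xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2): effective seconds per flop on proc i
        xis = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, Ms)]
        inv = [1.0 / xi for xi in xis]
        F = [n**3 * w / sum(inv) for w in inv]    # F_i = n^3 * (1/xi_i) / sum_j(1/xi_j)
        T = n**3 / sum(inv)                       # every processor finishes at the same time T
        return F, T

    # Example: two fast processors and one 4x slower one (illustrative numbers)
    F, T = optimal_split(1024, gammas=[1e-9, 1e-9, 4e-9],
                         betas=[1e-8] * 3, alphas=[1e-6] * 3, Ms=[1e6] * 3)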
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Diagram: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
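For reference, the example contraction is a one-liner with numpy.einsum; this gives the mathematical semantics only, not CTF's communication-avoiding distributed implementation, and the sizes below are arbitrary.

    import numpy as np

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
    ni, nj, nk, nm, nn = 4, 5, 6, 3, 2            # arbitrary illustrative sizes
    A = np.random.rand(ni, nj, nm, nn)
    B = np.random.rand(nm, nn, nk)
    C = np.einsum('ijmn,mnk->ijk', A, B)          # reference contraction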
TSQR: QR of a Tall, Skinny matrix
45
W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02
TSQR: QR of a Tall, Skinny matrix
46
W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
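A two-level NumPy sketch of the reduction above: local QRs of the four row blocks, one level combining pairs of R factors, and a final QR at the root. It illustrates the data flow only and is not a tuned implementation; the block count is an assumption.

    import numpy as np

    def tsqr(W, nblocks=4):
        """Return the R factor of tall-skinny W via a two-level TSQR tree, plus the leaf Q factors."""
        blocks = np.array_split(W, nblocks, axis=0)
        Qs, Rs = zip(*[np.linalg.qr(Wi) for Wi in blocks])    # local QRs: W0..W3 -> (Q00,R00)..(Q30,R30)
        R01 = np.linalg.qr(np.vstack([Rs[0], Rs[1]]))[1]      # combine pairs of R factors
        R11 = np.linalg.qr(np.vstack([Rs[2], Rs[3]]))[1]
        R02 = np.linalg.qr(np.vstack([R01, R11]))[1]          # root of the tree: R factor of all of W
        return R02, Qs

    W = np.random.rand(1000, 50)
    R, _ = tsqr(W)        # R agrees with the R from np.linalg.qr(W) up to signs of its rows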
TSQR: An Architecture-Dependent Algorithm

Parallel:   W = [W0; W1; W2; W3] → {R00, R10, R20, R30} → {R01, R11} → R02   (binary reduction tree)
Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03                     (flat tree, one block at a time)
Dual Core:  W = [W0; W1; W2; W3] → a hybrid tree mixing the parallel and sequential patterns

Can choose the reduction tree dynamically
• Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al.)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
49
W (n×b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'; likewise choose W2', W3', W4'.

[W1'; W2'; W3'; W4'] = [P12·L12·U12; P34·L34·U34]
  Choose b pivot rows of the top half, call them W12'; likewise choose W34'.

[W12'; W34'] = P1234·L1234·U1234
  Choose the final b pivot rows.
• Go back to W and use these b pivot rows
  – Move them to the top, then do LU without pivoting
  – Extra work, but a lower-order term
• Theorem: as numerically stable as partial pivoting on a larger matrix (selection logic sketched below)
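A NumPy sketch of the selection logic: each group of rows nominates b candidate pivot rows using Gaussian elimination with partial pivoting (GEPP), and the candidates are merged in a final round. The tiny GEPP helper and the flat (rather than binary) tournament are simplifications introduced here for illustration.

    import numpy as np

    def gepp_pivot_rows(W, b):
        """Indices of the b pivot rows that GEPP would choose on the tall block W (n x b)."""
        W = W.astype(float).copy()
        rows = np.arange(W.shape[0])
        for col in range(b):
            p = col + np.argmax(np.abs(W[col:, col]))         # partial pivoting: largest entry in column
            W[[col, p]] = W[[p, col]]
            rows[[col, p]] = rows[[p, col]]
            W[col+1:, col:] -= np.outer(W[col+1:, col] / W[col, col], W[col, col:])
        return rows[:b]

    def tournament_pivot_rows(W, b, ngroups=4):
        """Select b pivot rows of tall-skinny W by a tournament over ngroups row blocks."""
        groups = np.array_split(np.arange(W.shape[0]), ngroups)
        # round 1: each group nominates its own b pivot rows (the W1', ..., W4' above)
        candidates = np.concatenate([g[gepp_pivot_rows(W[g], b)] for g in groups])
        # final round: GEPP on the stacked candidates picks the overall b winners
        return candidates[gepp_pivot_rows(W[candidates], b)]

    W = np.random.rand(64, 4)
    piv = tournament_pivot_rows(W, b=4)   # rows of W to move to the top before unpivoted LU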
LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting
19
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
20
Blocked (Tiled) Matrix Multiply
Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb
for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes
= +
C(ij) C(ij) A(ik)
B(kj)b-by-bblock
2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of
size M then readswrites = 2n3b + 2n2
bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2
bull Attains lower bound = Ω (flops M12 )
bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm
21
How hard is hand-tuning matmul anyway
22
bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)
bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
49
W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  – Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'
[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
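A serial sketch of the tournament (illustrative only: partial-pivoted elimination picks each block's b candidate rows, and pairs of candidate sets are pitted against each other until b global pivot rows remain; block count is assumed to be a power of two):

```python
import numpy as np

def pivot_rows_gepp(block, b):
    """Return the local indices of the first b pivot rows that Gaussian
    elimination with partial pivoting would select on this block."""
    A = block.astype(float).copy()
    order = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))        # partial pivoting
        A[[k, p]] = A[[p, k]]
        order[[k, p]] = order[[p, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return order[:b]

def tournament_pivoting(W, b, nblocks=4):
    """Sketch of TSLU's tournament: each block nominates b rows, then pairs of
    nominee sets compete in rounds until b global pivot rows remain."""
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = [blk[pivot_rows_gepp(W[blk], b)] for blk in blocks]
    while len(candidates) > 1:                     # pairwise tournament rounds
        merged = []
        for left, right in zip(candidates[0::2], candidates[1::2]):
            rows = np.concatenate([left, right])
            merged.append(rows[pivot_rows_gepp(W[rows], b)])
        candidates = merged
    return candidates[0]                           # b global pivot rows of W

W = np.random.rand(1024, 8)
print(tournament_pivoting(W, b=8))
```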
LU Speedups from Tournament Pivoting and 2.5D
2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis = log2(p), y-axis = log2(n^2/p) = log2(memory_per_proc); up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
54
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Classical O(n^3) matmul:     words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:  words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter"
57
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
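For reference, the seven half-sized products that a BFS or DFS step distributes are the standard Strassen recurrences; the sequential sketch below (illustrative, not the CAPS implementation) spells them out, with the BFS/DFS choice reduced to a comment.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Sequential Strassen sketch. The 7 recursive products below are the
    subproblems that CAPS either runs in parallel on P/7 processors each
    (BFS step) or one after another on all P processors (DFS step)."""
    n = A.shape[0]
    if n <= cutoff or n % 2:            # small or odd size: classical matmul
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```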
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR
• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once; many tuning parameters, and the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
How hard is hand-tuning matmul, anyway?
22
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naive)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
How hard is hand-tuning matmul, anyway?
23
Recursive Matrix Multiplication (RMM) (1/2)
24
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21,  A11·B12 + A12·B22;  A21·B11 + A22·B21,  A21·B12 + A22·B22]
• True when each Aij etc. is 1 x 1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)
25
A(n) = arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = words moved between fast and slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12·(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea
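A runnable NumPy rendering of the RMM pseudocode above (the base-case cutoff is an illustrative addition for speed; the pure n = 1 base case is correct but slow in Python):

```python
import numpy as np

def rmm(A, B):
    """Cache-oblivious recursive matmul, mirroring the RMM pseudocode:
    split each matrix into quadrants and recurse on the 8 half-sized products."""
    n = A.shape[0]
    if n <= 32:                       # cutoff instead of n == 1, for speed only
        return A @ B
    h = n // 2
    C = np.empty_like(A)
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])  # C11
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])  # C12
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])  # C21
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])  # C22
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(rmm(A, B), A @ B)
```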
CARMA Performance: Shared Memory
Square: m = k = n
[Plot: MKL vs CARMA, single and double precision, with single- and double-precision peak lines; axes labeled (log) and (linear). Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]
CARMA Performance: Shared Memory
Inner Product: m = n = 64
[Plot: MKL vs CARMA, single and double precision; axes labeled (log) and (linear). Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA]
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524288)
[Plot: 97% fewer misses; 86% fewer misses]
Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
[Figure: C = A · B, with C, A, and B each laid out as a 4 x 4 grid of blocks P00 … P33]
30
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps and lawn100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
31
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  – summation over submatrices
  – need not be a square processor grid
[Figure: block row i of A and block column j of B combine into C(i,j)]
32
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol · Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum, n/P^(1/2)
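A compact mpi4py rendering of this loop (a sketch under the assumptions of a square process grid that evenly divides n and a panel width b equal to the block size; not the PBLAS implementation):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()
q = int(round(P ** 0.5))              # assume a q x q process grid
i, j = divmod(comm.Get_rank(), q)
n, b = 1024, 1024 // q                # panel width b = block size, for simplicity

# Each process owns one (n/q) x (n/q) block of A, B, and C.
rng = np.random.default_rng(comm.Get_rank())
A_ij, B_ij = rng.random((b, b)), rng.random((b, b))
C_ij = np.zeros((b, b))

row_comm = comm.Split(color=i, key=j)   # processors in my grid row
col_comm = comm.Split(color=j, key=i)   # processors in my grid column

for k in range(n // b):
    # Owner of A(i,k) broadcasts its block along the processor row;
    # owner of B(k,j) broadcasts its block along the processor column.
    Acol = row_comm.bcast(A_ij if j == k else None, root=k)
    Brow = col_comm.bcast(B_ij if i == k else None, root=k)
    C_ij += Acol @ Brow                 # local rank-b update

# C_ij now holds the (i,j) block of C = A·B for the distributed A and B.
```

Run under mpirun with a square number of ranks, e.g. `mpirun -n 16 python summa_sketch.py`.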
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  – For words_moved: mostly, except nonsym. eigenproblem
  – For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds up to polylog(P) factors
  – Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD
Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
  – Not always that much memory available…
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: processor grid with dimensions (P/c)^(1/2) x (P/c)^(1/2) x c; Example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
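The essence of steps (2)–(3) is that each of the c layers computes a disjoint 1/c-th of the inner-dimension summation and the layers are then sum-reduced; the toy NumPy sketch below simulates that decomposition serially (layer count and sizes are illustrative, with no actual communication).

```python
import numpy as np

# Toy serial simulation of the 2.5D decomposition: each of c "layers" computes
# a disjoint 1/c-th of the sum over the inner dimension, then the partial
# results are sum-reduced (step 3) to form C.
n, c = 512, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)

chunks = np.array_split(np.arange(n), c)               # inner dimension split across layers
partials = [A[:, idx] @ B[idx, :] for idx in chunks]   # work of layer k = 0..c-1
C = np.sum(partials, axis=0)                           # sum-reduce along the k-axis

assert np.allclose(C, A @ B)
```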
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)
[Plot annotations: 12x faster; 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c → total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm) if starting with 1 copy of inputs
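These two formulas are easy to play with numerically; the sketch below (all constants are made-up placeholders, not measured machine parameters) evaluates T(cP) and E(cP) and illustrates the perfect-strong-scaling claims T(cP) = T(P)/c and E(cP) = E(P).

```python
# Sketch: evaluate the timing and energy models above for a matmul of size n
# on cP processors, each holding M words. All constants are illustrative.
def T(n, P, c, M, m, gT, bT, aT):
    return n**3 / (c * P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

def E(n, P, c, M, m, gE, bE, aE, dE, eE, gT, bT, aT):
    t = T(n, P, c, M, m, gT, bT, aT)
    flops_term = n**3 / (c * P) * (gE + bE / M**0.5 + aE / (m * M**0.5))
    return c * P * (flops_term + dE * M * t + eE * t)

n, P = 2**15, 2**10
M = 3 * n**2 // P                      # per-processor memory stays fixed
args_T = dict(m=M**0.5, gT=1e-11, bT=1e-9, aT=1e-6)
args_E = dict(gE=1e-10, bE=1e-9, aE=1e-6, dE=1e-12, eE=1e-3)

for c in (1, 2, 4, 8):                 # add processors, but keep using M on each
    t = T(n, P, c, M, **args_T)
    e = E(n, P, c, M, **args_E, **args_T)
    print(f"c={c}: T={t:.3e} s (= T(P)/{c}), E={e:.3e} J (constant)")
```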
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
How hard is hand-tuning matmul anyway
23
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m
bull C = = A middot B = middot middot
=
bull True when each Aij etc 1x1 or n2 x n2
24
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11middotB11 + A12middotB21 A11middotB12 + A12middotB22
A21middotB11 + A22middotB21 A21middotB12 + A22middotB22
func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Recursive Matrix Multiplication (RMM) (22)
25
func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return
A(n) = arithmetic operations in RMM( n)
= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1
= 2n3 hellip same operations as usual in different order
W(n) = words moved between fast slow memory by RMM( n)
= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2
= O( n3 M12 + n2 ) hellip same as blocked matmul
ldquoCache obliviousrdquo works for memory hierarchies but not panacea
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank-Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms
• Classical O(n^3) matmul: words_moved = Ω( M·(n/M^{1/2})^3 / P )
• Strassen's O(n^{lg 7}) matmul: words_moved = Ω( M·(n/M^{1/2})^{lg 7} / P )
• Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^{1/2})^ω / P )
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^{2/ω}
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable? (A numeric evaluation of the bounds follows.)
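For a sense of scale, the snippet below evaluates the three bounds for one made-up machine configuration (n, M, P chosen arbitrarily); the constants hidden inside Ω are ignored.

    from math import log2

    def words_moved_bound(n, M, P, exponent):
        # words_moved = M * (n / sqrt(M))^e / P, with e = 3, lg 7, or a general omega
        return M * (n / M ** 0.5) ** exponent / P

    n, M, P = 2 ** 15, 2 ** 23, 1024
    for name, e in [("classical (e = 3)", 3), ("Strassen (e = lg 7)", log2(7)), ("omega = 2.5", 2.5)]:
        print(f"{name:20s} {words_moved_bound(n, M, P, e):.3e}")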
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter" (a toy schedule sketch follows)
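The rule above can be read as a per-level scheduling decision. The toy sketch below walks down the Strassen recursion and records which step it would take; the per-processor memory estimate is a stand-in of our own, not the paper's exact threshold, and all of the inputs are invented.

    def caps_schedule(n, P, mem_words_per_proc, levels):
        schedule = []
        for _ in range(levels):
            # Rough per-processor footprint of a BFS step: 7/4 of the 3*n^2 operands, split over P
            bfs_footprint = 7 / 4 * 3 * n * n / P
            if P >= 7 and bfs_footprint <= mem_words_per_proc:
                schedule.append("BFS")
                P //= 7                  # each of the 7 subproblems gets P/7 processors
            else:
                schedule.append("DFS")   # all P processors do the 7 subproblems in turn
            n //= 2                      # subproblems are half-sized
        return schedule

    print(caps_schedule(n=94080, P=343, mem_words_per_proc=2 ** 26, levels=5))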
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
• Speedups: 24%-184% over previous Strassen-based algorithms
• Invited to appear as a Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Conventional vs CA-SBR
• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once
• Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^{1/2}
CARMA Performance: Shared Memory, Square (m = k = n)
[Plot: performance vs problem size (log x-axis, linear y-axis) for MKL and CARMA in single and double precision, against single- and double-precision peak]
Machine: Intel Emerald, 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory, Inner Product (m = n = 64)
[Plot: performance vs k (log x-axis, linear y-axis) for MKL and CARMA in single and double precision]
Machine: Intel Emerald, 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524,288)
[Bar chart of L3 misses: CARMA incurs 97% and 86% fewer misses than MKL]
Parallel MatMul with 2D Processor Layout
• P processors in a P^{1/2} x P^{1/2} grid
  – Processors communicate along rows and columns
• Each processor owns an n/P^{1/2} x n/P^{1/2} submatrix of A, B, and C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
[Diagram: three 4 x 4 processor grids, P00..P33, one each for C, A, and B, with C = A · B]
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains the lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor (1 copy of data)
    • words_moved = Ω( flops / M^{1/2} ) = Ω( (n^3/P) / (n^2/P)^{1/2} ) = Ω( n^2 / P^{1/2} )
    • messages = Ω( flops / M^{3/2} ) = Ω( (n^3/P) / (n^2/P)^{3/2} ) = Ω( P^{1/2} )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (LAWNs 96, 100)
• Comparison to Cannon's Algorithm
  – Cannon attains the lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on a P^{1/2} x P^{1/2} grid
• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k)·B(k,j)
  – summation over submatrices
  – need not be a square processor grid
[Diagram: C(i,j) accumulates the products of block column A(i,k) and block row B(k,j)]
SUMMA – n x n matmul on a P^{1/2} x P^{1/2} grid

For k = 0 to n/b − 1
    for all i = 1 to P^{1/2}
        owner of A(i,k) broadcasts it to its whole processor row (using a binary tree)
    for all j = 1 to P^{1/2}
        owner of B(k,j) broadcasts it to its whole processor column (using a binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol · Brow

• Attains the bandwidth lower bound
• Attains the latency lower bound if b is near its maximum, n/P^{1/2}
(A serial emulation of this loop follows.)
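Below is a minimal serial emulation of that loop, with one process playing every grid position; the grid size p and block size b are illustrative parameters, and the two array slices stand in for the row and column broadcasts.

    import numpy as np

    def summa(A, B, p=4, b=32):
        n = A.shape[0]
        s = n // p                                   # each "processor" owns an s x s block of C
        C = np.zeros((n, n))
        for k in range(0, n, b):                     # for k = 0 to n/b - 1
            Acol = A[:, k:k + b]                     # broadcast of A(i,k) along processor rows
            Brow = B[k:k + b, :]                     # broadcast of B(k,j) along processor columns
            for i in range(p):
                for j in range(p):                   # local update: C(i,j) += A(i,k) * B(k,j)
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += Acol[i*s:(i+1)*s] @ Brow[:, j*s:(j+1)*s]
        return C

    n = 256
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    print(np.allclose(summa(A, B), A @ B))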
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  – words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  – messages = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )
• Does ScaLAPACK attain these bounds?
  – For words_moved: mostly, except the nonsymmetric eigenproblem
  – For messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  – Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD
Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if there is more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^{2/3})
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – M = O(n^2/P^{2/3}) is P^{1/3} times the minimum
    • Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
[Diagram: a (P/c)^{1/2} x (P/c)^{1/2} grid replicated in c layers; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid, with axes i, j, k
• Initially, P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^{1/2} x n·(c/P)^{1/2}
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce the partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
(A serial emulation of these steps follows.)
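The three steps can be emulated serially as below: every "layer" k sees all of A and B (step 1), computes the part of the middle-index sum it owns (step 2), and the partial products are summed as the stand-in for the reduction along the k-axis (step 3). The layer count c is an assumed parameter, and each layer's share is computed with a plain matmul rather than SUMMA.

    import numpy as np

    def matmul_25d(A, B, c=4):
        n = A.shape[0]
        chunk = n // c
        # step (2): layer k computes the partial sum over its slice of the middle index
        partials = [A[:, k*chunk:(k+1)*chunk] @ B[k*chunk:(k+1)*chunk, :] for k in range(c)]
        # step (3): sum-reduce the partial sums along the k-axis back to layer 0
        return sum(partials)

    n = 240
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    print(np.allclose(matmul_25d(A, B, c=4), A @ B))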
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
[Plots: 2.5D vs 2D matmul performance, with annotated speedups of 12x and 2.7x]
• Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with the minimal number of processors: P·M = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for the timing model:
  – γ_T, β_T, α_T = seconds per flop, per word moved, per message of size m
  – T(cP) = n^3/(cP) · [ γ_T + β_T/M^{1/2} + α_T/(m·M^{1/2}) ] = T(P)/c
• Notation for the energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per second
  – ε_E = joules per second for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^{1/2} + α_E/(m·M^{1/2}) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Limit: c ≤ P^{1/3} (3D algorithm), if starting with 1 copy of the inputs
(A small calculator for these formulas follows.)
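A small calculator for T(cP) and E(cP) is sketched below; every machine constant in it is an invented placeholder, chosen only so that the perfect-scaling behavior (T drops by a factor of c while E stays flat) is visible.

    def time_and_energy(n, P, c, M, m,
                        gT=1e-11, bT=1e-9, aT=1e-6,     # gamma_T, beta_T, alpha_T (secs), invented
                        gE=1e-10, bE=1e-8, aE=1e-5,     # gamma_E, beta_E, alpha_E (joules), invented
                        dE=1e-9, eE=10.0):              # delta_E, epsilon_E, invented
        bracket_T = gT + bT / M ** 0.5 + aT / (m * M ** 0.5)
        bracket_E = gE + bE / M ** 0.5 + aE / (m * M ** 0.5)
        T = n ** 3 / (c * P) * bracket_T
        E = c * P * (n ** 3 / (c * P) * bracket_E + dE * M * T + eE * T)
        return T, E

    n, P = 2 ** 14, 1024
    M = 3 * n ** 2 // P                  # minimal-memory starting point: P*M = 3n^2
    for c in (1, 2, 4):                  # doubling c halves T and leaves E unchanged
        T, E = time_and_energy(n, P, c, M, m=2 ** 10)
        print(f"c={c}  T={T:.3e} s  E={E:.3e} J")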
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of the P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^{1/2} + F_i·α_i/M_i^{3/2} = F_i·[ γ_i + β_i/M_i^{1/2} + α_i/M_i^{3/2} ] = F_i·ξ_i
  – Choose F_i so that Σ_i F_i = n^3 and T = max_i T_i is minimized
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j) (see the numeric sketch below)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so that they add up to F_i flops
• Works for Strassen, other algorithms…
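A direct numeric sketch of that optimal split is below; the per-processor constants are invented, and the checks at the end confirm that the flops add up to n^3 and that every processor finishes at the same time T.

    def assign_work(n, procs):
        xi = [g + b / M ** 0.5 + a / M ** 1.5 for (g, b, a, M) in procs]
        total_inv = sum(1.0 / x for x in xi)
        F = [n ** 3 / (x * total_inv) for x in xi]   # F_i = n^3 (1/xi_i) / sum_j (1/xi_j)
        T = n ** 3 / total_inv                       # common finish time = F_i * xi_i for every i
        return F, T, xi

    procs = [(1e-11, 1e-9, 1e-6, 2 ** 27),           # (gamma_i, beta_i, alpha_i, M_i), invented
             (2e-11, 5e-10, 2e-6, 2 ** 29),
             (5e-12, 2e-9, 1e-6, 2 ** 26)]
    F, T, xi = assign_work(n=4096, procs=procs)
    print(abs(sum(F) - 4096 ** 3) / 4096 ** 3 < 1e-12,
          [abs(f * x - T) < 1e-9 * T for f, x in zip(F, xi)])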
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k) (a dense einsum version is sketched below)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
[Diagram: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
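For reference, the plain dense form of such a contraction in NumPy is just an einsum, shown below with arbitrary small dimensions; CTF's contribution is running this distributed, with 2.5D-style replication and symmetry exploitation, which the dense call does not capture.

    import numpy as np

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k), dense and serial
    I, J, K, M, N = 6, 6, 6, 8, 8
    A = np.random.randn(I, J, M, N)
    B = np.random.randn(M, N, K)
    C = np.einsum('ijmn,mnk->ijk', A, B)
    print(C.shape)   # (6, 6, 6)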
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Why is CARMA FasterL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Parallel MatMul with 2D Processor Layout
bull P processors in P12 x P12 gridndash Processors communicate along rows columns
bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33
ndash Processor Pij owns submatrices Aij Bij and Cij
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
C = A B
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
30
SUMMA Algorithm
bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds
bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )
ndash Can accommodate any processor grid matrix dimensions amp layout
ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps
bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more
memory
31
SUMMA ndash n x n matmul on P12 x P12 grid
bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)
bull summation over submatricesbull Need not be square processor grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF (Cyclops Tensor Framework)
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
[Figure: contraction C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
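For readers who want to see the index pattern of the example contraction, here is a tiny NumPy version using einsum; the dimensions are arbitrary and purely illustrative.

import numpy as np

I, J, K, M, N = 4, 5, 6, 3, 2
A = np.random.rand(I, J, M, N)
B = np.random.rand(M, N, K)
C = np.einsum('ijmn,mnk->ijk', A, B)   # C(i,j,k) = sum_{m,n} A(i,j,m,n)*B(m,n,k)
assert C.shape == (I, J, K)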
TSQR: QR of a Tall, Skinny matrix
W = [ W0; W1; W2; W3 ]
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
[ R00; R10 ] = Q01·R01 and [ R20; R30 ] = Q11·R11, so
[ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]
[ R01; R11 ] = Q02·R02
Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
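A serial NumPy sketch of the same binary reduction tree (4 block rows, depth 2). In the real algorithm each local QR runs on a different processor, and Q is kept in the implicit tree form listed in the output above.

import numpy as np

def tsqr_4blocks(W):
    W0, W1, W2, W3 = np.array_split(W, 4, axis=0)
    Q00, R00 = np.linalg.qr(W0); Q10, R10 = np.linalg.qr(W1)   # leaves: local QRs
    Q20, R20 = np.linalg.qr(W2); Q30, R30 = np.linalg.qr(W3)
    Q01, R01 = np.linalg.qr(np.vstack([R00, R10]))              # level 1: QR of stacked R's
    Q11, R11 = np.linalg.qr(np.vstack([R20, R30]))
    Q02, R02 = np.linalg.qr(np.vstack([R01, R11]))              # root
    return R02          # Q is represented implicitly by {Q00..Q30, Q01, Q11, Q02}

W = np.random.rand(10_000, 50)
R_tree = tsqr_4blocks(W)
R_ref = np.linalg.qr(W)[1]
assert np.allclose(np.abs(R_tree), np.abs(R_ref))   # R agrees up to row signs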
TSQR: An Architecture-Dependent Algorithm
W = [ W0; W1; W2; W3 ]
• Parallel: binary reduction tree – local R00, R10, R20, R30, then pairwise R01, R11, then R02
• Sequential: flat reduction tree – R00, then fold in one block at a time to get R01, R02, R03
• Dual core: a hybrid of the two trees
Can choose the reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using the same idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
[ W1'; W2' ] = P12·L12·U12; choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34; choose b pivot rows, call them W34'
[ W12'; W34' ] = P1234·L1234·U1234; choose b pivot rows
• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower-order term  (a small serial sketch of the tournament follows below)
• Thm: as numerically stable as Partial Pivoting on a larger matrix
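A serial sketch of one tournament on 4 block rows. Here partial_pivot_order is a toy stand-in (plain Gaussian elimination with partial pivoting) used only to nominate each group's b candidate rows; it is not the distributed TSLU kernel itself.

import numpy as np

def partial_pivot_order(block):
    # Gaussian elimination with partial pivoting; returns the original row indices
    # chosen as pivots, in order (a stand-in for a local LU factorization).
    A = block.astype(float).copy()
    m, b = A.shape
    rows = list(range(m))
    order = []
    for k in range(b):
        p = max(range(k, m), key=lambda r: abs(A[r, k]))
        A[[k, p]] = A[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        order.append(rows[k])
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return order

def tournament_pivot_rows(W, b):
    assert b == W.shape[1]                     # one pivot row per column of the panel
    blocks = np.array_split(np.arange(W.shape[0]), 4)
    # Round 1: each block row W1..W4 nominates its b best rows (W1'..W4')
    winners = [blk[partial_pivot_order(W[blk])] for blk in blocks]
    # Round 2: pairwise playoffs give W12' and W34'
    w12 = np.concatenate(winners[0:2]); w12 = w12[partial_pivot_order(W[w12])]
    w34 = np.concatenate(winners[2:4]); w34 = w34[partial_pivot_order(W[w34])]
    # Final: the b global pivot rows, to be moved to the top of W
    final = np.concatenate([w12, w34])
    return final[partial_pivot_order(W[final])]

W = np.random.rand(4096, 8)
print(tournament_pivot_rows(W, b=8))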
LU Speedups from Tournament Pivoting and 2.5D
[Plot: 2.5D vs 2D LU, with and without pivoting]
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: axes log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms
• Classical O(n^3) matmul: words_moved = Ω( M·(n/M^(1/2))^3 / P )
• Strassen's O(n^lg7) matmul: words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
• Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )
• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2/P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if enough memory and P ≥ 7, then BFS step, else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter"  (a small scheduling sketch follows below)
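A minimal sketch of the BFS/DFS rule stated above. It only records the schedule chosen at each level of the Strassen recursion (it does not perform the multiplications), and the memory bookkeeping and threshold are illustrative assumptions, not the CAPS paper's exact accounting.

def caps_schedule(n, P, mem_per_proc, threshold=64):
    """Return a list of ('BFS'|'DFS', n, P) decisions down the Strassen recursion."""
    steps = []
    while n > threshold:
        subproblem_words = 3 * (n // 2) ** 2          # rough footprint of one half-sized product
        enough_memory = mem_per_proc >= (7 * subproblem_words) // (4 * P)   # BFS needs ~7/4 as much
        if enough_memory and P >= 7:
            steps.append(('BFS', n, P))               # all 7 products in parallel, P/7 procs each
            P = max(P // 7, 1)
        else:
            steps.append(('DFS', n, P))               # 7 products one after another, all P procs
        n //= 2
    return steps

print(caps_schedule(n=9408, P=49, mem_per_proc=2_000_000))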
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as a Research Highlight in CACM
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less
Conventional vs CA-SBR
  Conventional:           touch all data 4 times
  Communication-Avoiding: touch all data once
Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) processor grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P(i,j)
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k)·B(k,j)
  • summation over submatrices
  • need not be a square processor grid
[Figure: C(i,j) accumulates products of the block row A(i,k) and block column B(k,j)]
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) processor grid
For k = 0 to n/b – 1
    for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to the whole processor row (using a binary tree)
    for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to the whole processor column (using a binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol · Brow
• Attains the bandwidth lower bound
• Attains the latency lower bound if b is near its maximum, n/P^(1/2)
(A serial sketch of this loop appears below.)
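A serial NumPy mock-up of the SUMMA loop above on a virtual √P x √P grid, where the row and column broadcasts become array slices; the function name and parameter choices are illustrative, not library API.

import numpy as np

def summa(A, B, P=16, b=8):
    n = A.shape[0]
    q = int(round(P ** 0.5))                 # q x q processor grid
    s = n // q                               # each processor owns an s x s block of C
    C = np.zeros((n, n))
    for k in range(0, n, b):                 # panel k of width b
        for i in range(q):
            Acol = A[i*s:(i+1)*s, k:k+b]     # "broadcast" along processor row i
            for j in range(q):
                Brow = B[k:k+b, j*s:(j+1)*s] # "broadcast" along processor column j
                C[i*s:(i+1)*s, j*s:(j+1)*s] += Acol @ Brow
    return C

A = np.random.rand(64, 64); B = np.random.rand(64, 64)
assert np.allclose(summa(A, B, P=16, b=8), A @ B)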
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For words_moved: mostly, except the nonsymmetric eigenproblem
  • For messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
32
SUMMAndash n x n matmul on P12 x P12 grid
=i
j
A(ik)
k
k
B(kj)
C(ij)
For k=0 to nb-1
for all i = 1 to P12
owner of A(ik) broadcasts it to whole processor row (using binary tree)
for all j = 1 to P12
owner of B(kj) broadcasts it to whole processor column (using bin tree)
Receive A(ik) into Acol
Receive B(kj) into Brow
C_myproc = C_myproc + Acol Brow
Brow
Acol
bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Summary of dense parallel algorithms attaining communication lower bounds
bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds
words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )
bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky
bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD
Can we do Better
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal
ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)
bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]
bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where
each submatrix is nP13 x nP13
ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matrix Multiplication
bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.
[Plot annotations: up to 12x faster; 2.7x faster]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm) if starting with 1 copy of inputs (a small numerical check of these formulas follows below)
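A minimal numerical check of the model, with made-up (assumed) machine constants: holding M and m fixed, c·T(cP) and E(cP) stay constant as c grows, i.e. perfect strong scaling in both time and energy.

def T(P, n, M, m, gT, bT, aT):
    return (n**3 / P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

def E(P, n, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
    t = T(P, n, M, m, gT, bT, aT)
    return P * ((n**3 / P) * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * t + eE * t)

time_args   = dict(n=2**12, M=2**20, m=2**10, gT=1e-9, bT=2e-9, aT=1e-6)
energy_args = dict(gE=1e-10, bE=2e-10, aE=1e-7, dE=1e-12, eE=1e-3, **time_args)
P0 = 48                                    # minimal processor count
for c in (1, 2, 4, 8):
    # Both printed quantities should be (numerically) independent of c.
    print(c, c * T(c * P0, **time_args), E(c * P0, **energy_args))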
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
Handling Heterogeneity
• Suppose each of P processors could differ:
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3 and T = maxi Ti is minimized
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj) (see the sketch below)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so they add up to Fi flops
• Works for Strassen, other algorithms…
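A sketch of the optimal heterogeneous work split, with hypothetical per-processor parameters (gamma, beta, alpha, M): each processor gets Fi = n^3·(1/ξi)/Σj(1/ξj) flops, and all of them finish at the same time T = n^3/Σj(1/ξj).

def heterogeneous_split(n, procs):
    # xi_i = gamma_i + beta_i / sqrt(M_i) + alpha_i / M_i**1.5
    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]
    inv_sum = sum(1.0 / x for x in xi)
    F = [(n**3) * (1.0 / x) / inv_sum for x in xi]
    return F, n**3 / inv_sum

procs = [(1e-9, 2e-9, 1e-6, 2**20),   # two "fast" processors
         (1e-9, 2e-9, 1e-6, 2**20),
         (4e-9, 8e-9, 4e-6, 2**18)]   # one "slow" processor
F, T = heterogeneous_split(n=4096, procs=procs)
print([f"{f:.3e}" for f in F], f"T = {T:.3e} s")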
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Example contraction with symmetries: C(i,j,k) = Σm A(i,j,m)·B(m,k), where A has 3-fold symmetry and B and C have 2-fold symmetry (an einsum version of the first contraction is sketched below)
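A minimal sketch (numpy assumed, no symmetry exploited) of the contraction on the slide, C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k), written with einsum:

import numpy as np

i, j, k, m, n = 6, 7, 8, 4, 5
A = np.random.rand(i, j, m, n)
B = np.random.rand(m, n, k)
C = np.einsum('ijmn,mnk->ijk', A, B)   # sum over the shared indices m, n
print(C.shape)                         # (6, 7, 8)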
TSQR: QR of a Tall, Skinny matrix

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 } (a numpy sketch of this reduction follows below)
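A sketch of the TSQR reduction (numpy assumed; the local Q factors are kept only implicitly here, and they compose into the Q of the full W): QR each block row, stack the R factors, and repeat pairwise until a single R remains.

import numpy as np

def tsqr_R(W, nblocks=4):
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(b, mode='reduced')[1] for b in blocks]   # leaf QRs
    while len(Rs) > 1:                                          # pairwise reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='reduced')[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 50)
R_tree = tsqr_R(W)
R_ref = np.linalg.qr(W, mode='reduced')[1]
# R is unique up to the signs of its rows, so compare absolute values.
print(np.allclose(np.abs(R_tree), np.abs(R_ref)))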
TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3] —
 Parallel: (R00, R10, R20, R30) → (R01, R11) → R02;
 Sequential: R00 → R01 → R02 → R03;
 Dual Core: a hybrid of the two]

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting" (a toy version is sketched below)

W_{n x b} = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'.

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
Choose b pivot rows, call them W12' and W34'.

[ W12'; W34' ] = P1234·L1234·U1234
Choose b pivot rows.
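A toy sketch of tournament pivoting for TSLU (numpy/scipy assumed; the helper names block_pivots and tournament_pivoting are illustrative): each block contributes the b pivot rows that ordinary partial pivoting picks on it, and winners are combined pairwise up a reduction tree.

import numpy as np
from scipy.linalg import lu

def block_pivots(W, rows, b):
    # Indices (into the full W) of the b pivot rows GEPP selects on W[rows].
    P, _, _ = lu(W[rows])                # W[rows] = P @ L @ U
    order = np.argmax(P, axis=0)         # order[i]: local row used as the i-th pivot
    return [rows[i] for i in order[:b]]

def tournament_pivoting(W, b, nblocks=4):
    groups = [list(g) for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    winners = [block_pivots(W, g, b) for g in groups]           # leaf round
    while len(winners) > 1:                                     # pairwise reduction tree
        winners = [block_pivots(W, winners[i] + winners[i + 1], b)
                   for i in range(0, len(winners), 2)]
    return winners[0]                                           # b global pivot rows

W = np.random.rand(4000, 8)
print(tournament_pivoting(W, b=8))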
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matmul on BGP 16K nodes 64K cores
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Perfect Strong Scaling ndash in Time and Energy (22)
bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including
bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum
energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime
T that we can attainbull The ratio P = ET gives us the average power required to run the
algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target
energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Resultsbull Parallel
ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)
ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)
ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)
ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)
ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)
bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop
bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished
bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer
others48
Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson
Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
49
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term
bull Thm As numerically stable as Partial Pivoting on a larger matrix
LU Speedups from Tournament Pivoting and 25D
25D vs 2D LUWith and Without Pivoting
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log 2 (
n2 p)
=
log 2 (
mem
ory_
per_
proc
)
Up to 29xfaster
Other CA algorithmsbull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo
57
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Symmetric Band Reduction
bull Grey Ballard and Nick Knightbull A QAQT = T where
ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix
bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal
bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound
theorem for applying orthogonal transformationsndash It can communicate even less
Conventional vs CA - SBR
Conventional Communication-Avoiding
Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12
Touch all data 4 times Touch all data once
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
TSQR QR of a Tall Skinny matrix
45
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
TSQR QR of a Tall Skinny matrix
46
W =
Q00 R00
Q10 R10
Q20 R20
Q30 R30
W0
W1
W2
W3
Q00
Q10
Q20
Q30
= =
R00
R10
R20
R30
R00
R10
R20
R30
=Q01 R01
Q11 R11
Q01
Q11
= R01
R11
R01
R11
= Q02 R02
Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
Sequential
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
49

W (n x b) = [ W1 ; W2 ; W3 ; W4 ],  with each  Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[ W1' ; W2' ] = P12·L12·U12   →  choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   →  choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234   →  choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
(a toy sketch of the pivot-row selection follows below)
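A toy NumPy/SciPy sketch of the pivot-row selection above (the helper names are hypothetical, not the CALU/TSLU library routines): each block contributes b candidate rows chosen by ordinary partial pivoting, and candidate sets are merged pairwise in successive rounds until b winners remain.

import numpy as np
from scipy.linalg import lu

def gepp_pivot_rows(W, rows, b):
    # Pivot rows chosen by LU with partial pivoting on the sub-block W[rows].
    P, L, U = lu(W[rows])                  # W[rows] = P @ L @ U
    local = np.argmax(P, axis=0)[:b]       # first b rows GEPP picked
    return rows[local]

def tournament_pivot_rows(W, b, blocks=4):
    # One candidate set per block, then pairwise "rounds" of the tournament.
    cands = [gepp_pivot_rows(W, r, b)
             for r in np.array_split(np.arange(W.shape[0]), blocks)]
    while len(cands) > 1:
        cands = [gepp_pivot_rows(W, np.concatenate(cands[i:i + 2]), b)
                 for i in range(0, len(cands), 2)]
    return cands[0]

W = np.random.randn(1000, 8)
print(tournament_pivot_rows(W, b=8))       # indices of the b selected pivot rows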
LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: speedup vs log2(p) on the horizontal axis and log2(n^2/p) = log2(memory_per_proc) on the vertical axis; up to 29x faster]
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of AP = QR span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
54
Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then
    for k = 1:n, for i = 1:n, for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (a runnable sketch follows below):
    D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
61
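A minimal serial NumPy sketch of the min-plus product and of the DC-APSP recursion above (the slide's semiring assignments such as D22 = D21 ⊗ D12 are written here as explicit min-accumulations), checked against Floyd-Warshall on a small random graph:

import numpy as np

def minplus(A, B):
    # Min-plus ("tropical") product: C[i,j] = min_k A[i,k] + B[k,j]
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def dc_apsp(D):
    n = D.shape[0]
    D = D.copy()
    if n == 1:
        return np.minimum(D, 0.0)                     # zero-length path to itself
    h = n // 2
    D[:h, :h] = dc_apsp(D[:h, :h])                                        # D11 = DC-APSP(D11)
    D[:h, h:] = minplus(D[:h, :h], D[:h, h:])                             # D12 = D11 (x) D12
    D[h:, :h] = minplus(D[h:, :h], D[:h, :h])                             # D21 = D21 (x) D11
    D[h:, h:] = np.minimum(D[h:, h:], minplus(D[h:, :h], D[:h, h:]))      # D22 accumulate
    D[h:, h:] = dc_apsp(D[h:, h:])                                        # D22 = DC-APSP(D22)
    D[h:, :h] = minplus(D[h:, h:], D[h:, :h])                             # D21 = D22 (x) D21
    D[:h, h:] = minplus(D[:h, h:], D[h:, h:])                             # D12 = D12 (x) D22
    D[:h, :h] = np.minimum(D[:h, :h], minplus(D[:h, h:], D[h:, :h]))      # D11 accumulate
    return D

n = 6
A = np.where(np.random.rand(n, n) < 0.5, np.random.rand(n, n), np.inf)
np.fill_diagonal(A, 0.0)
F = A.copy()
for k in range(n):                                    # reference: Floyd-Warshall
    F = np.minimum(F, F[:, [k]] + F[[k], :])
print(np.allclose(dc_apsp(A), F))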
Performance of 2.5D APSP using Kleene
62
[Plot: strong scaling on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
63
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d. (a small illustration follows below)
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost
64
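For a feel of the Erdos-Renyi setting above (each entry nonzero with probability d/n), a tiny SciPy sketch; it only illustrates that C = A·B has roughly d^2·n nonzeros with essentially no reuse of C entries, not the communication-optimal divide-and-conquer algorithm:

import scipy.sparse as sp

n, d = 4000, 8
A = sp.random(n, n, density=d / n, format="csr")   # ~d nonzeros per row
B = sp.random(n, n, density=d / n, format="csr")
C = A @ B
print(A.nnz, B.nnz, C.nnz)   # C.nnz is close to d^2 * n in expectation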
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra
Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
        C(i,j) += A(i,k)·B(k,j)
• "Blocked" code (b x b matmul on tiles; a runnable version follows below):
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
            i = i1+i2; j = j1+j2; k = k1+k2
            C(i,j) += A(i,k)·B(k,j)
• Thm: Picking b = M^(1/2) attains lower bound words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
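A runnable version of the blocked code above (NumPy is used only for the tile multiplies; picking b on the order of M^(1/2), so that three b x b tiles fit in fast memory, is the choice the theorem refers to):

import numpy as np

def blocked_matmul(A, B, b):
    # Tiled matmul: each iteration touches only b-by-b tiles of A, B, C.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 256, 32
A, B = np.random.randn(n, n), np.random.randn(n, n)
print(np.allclose(blocked_matmul(A, B, b), A @ B))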
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

            i  j  k
        Δ = 1  0  1    A
            0  1  1    B
            1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ·x ≤ 1  (a small LP sketch follows below)
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
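The small LP above can be checked directly; a sketch using scipy.optimize.linprog (maximizing 1^T x by minimizing its negation):

import numpy as np
from scipy.optimize import linprog

# Rows of Delta: one per array reference A(i,k), B(k,j), C(i,j); columns: i, j, k.
Delta = np.array([[1, 0, 1],    # A
                  [0, 1, 1],    # B
                  [1, 1, 0]])   # C
# maximize 1^T x  s.t.  Delta x <= 1, x >= 0  (linprog minimizes, so negate c)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
x, e = res.x, -res.fun
print(x, e)   # expect x = [1/2, 1/2, 1/2] and e = 3/2, so words_moved = Omega(n^3 / M^(e-1))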
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:
          i  j
          1  0    F
    Δ =   1  0    P(i)
          0  1    P(j)
• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω(n^2 / M^(e-1)) = Ω(n^2 / M^1); attained by block sizes M^xi, M^xj = M^1, M^1
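A sketch of the tiling the theorem suggests for the direct n-body loop, with blocks of roughly M particles for i and for j so each pair of blocks does about M^2 interactions per O(M) words moved (force and the data layout are placeholders from the slide, not a particular library):

    def blocked_nbody(P, F, force, b):
        # Tile both loops: load a block of b target particles and a block of
        # b source particles (b of order M), then do all b*b interactions.
        n = len(P)
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for i in range(i1, min(i1 + b, n)):
                    for j in range(j1, min(j1 + b, n)):
                        F[i] += force(P[i], P[j])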
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
[Figure: performance plot; 11.8x speedup]
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:
          i1 i2 i3 i4 i5 i6
          1  0  1  0  0  1    A1
          1  1  0  1  0  0    A2
    Δ =   0  1  1  0  1  0    A3
          0  0  1  1  0  1    A3, A4
          0  1  0  0  0  1    A5
          1  0  0  1  1  0    A6
• Solve LP for x = [x1, …, x6]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6 / M^(e-1)) = Ω(n^6 / M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
    => for (i,j,k) in S = subset of Z^3
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
    => for (i1, i2, …, ik) in S = subset of Z^k
    Access locations indexed by "projections", e.g.
      φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
      φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?
General Communication Bound
• Thm: Given a program with array references given by projections φ_j, there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σ_j e_j subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
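As a sanity check (a standard specialization, not taken from the slides): for matmul the three projections drop one index each, the exponents s_A = s_B = s_C = 1/2 give the classical Loomis-Whitney inequality, and the usual segment argument over a fast memory of size M recovers the bound quoted earlier:

    % phi_A(i,j,k) = (i,k),  phi_B(i,j,k) = (k,j),  phi_C(i,j,k) = (i,j)
    |S| \le |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2}
    % In a segment of execution that moves M words, each image has at most 2M points, so
    |S| \le (2M)^{3/2}
    \quad\Longrightarrow\quad
    \mathrm{words\_moved} \;\ge\; M \cdot \frac{n^3}{(2M)^{3/2}} \;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right)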
Is this bound attainable? (1/2)
• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g., when the subscripts are subsets of the indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Figure, shown over several animation steps: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, showing the dependencies among entries]
• Works for any "well-partitioned" A
• Sequential algorithm: Steps 1, 2, 3, 4 sweep across the blocks of columns, computing the entries of all k vectors for one block before moving on to the next
• Parallel algorithm: the columns are partitioned among Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with its neighbors
  – Each processor then works on an (overlapping) trapezoid of the diagram
Same idea works for general sparse matrices
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
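A small NumPy sketch of the parallel idea on the tridiagonal example: each processor fetches k ghost values from each neighbor once, then computes its piece of all k vectors with no further communication. This assumes an interior processor and a constant-coefficient symmetric tridiagonal A (assumptions made only to keep the sketch short; names are illustrative, not the general implementation):

    import numpy as np

    def local_matrix_powers(diag, off, x_ext, k):
        # x_ext holds this processor's m owned entries of x plus k ghost
        # entries from each neighbor, obtained in a single exchange.
        # Returns the owned slices of [x, A x, ..., A^k x], shape (k+1, m).
        m = len(x_ext) - 2 * k
        cur = np.asarray(x_ext, dtype=float)
        out = [cur[k:k + m].copy()]            # A^0 x restricted to owned entries
        for step in range(1, k + 1):
            nxt = np.zeros_like(cur)
            lo, hi = step, len(cur) - step     # valid region shrinks by 1 per step
            nxt[lo:hi] = (off * cur[lo - 1:hi - 1]
                          + diag * cur[lo:hi]
                          + off * cur[lo + 1:hi + 1])
            cur = nxt
            out.append(cur[k:k + m].copy())
        return np.vstack(out)

With k ghost cells on each side, one neighbor exchange replaces the k exchanges of the conventional approach, at the price of a little redundant computation in the overlap.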
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^kb} minimizing ||Ax-b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                … SpMV
    MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• Oops – W from the power method, precision lost!
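The TSQR step can be sketched with a one-level reduction: factor each processor's tall block locally, stack the small R factors, and factor once more, so only k x k blocks are ever combined. A minimal NumPy version (binary reduction tree and reconstruction of Q omitted; the result agrees with the R factor of the full W up to row signs):

    import numpy as np

    def tsqr_R(W_blocks):
        # Each processor factors its own tall, skinny block; only the small
        # R factors need to be combined, so one reduction replaces k dot products.
        R_locals = [np.linalg.qr(Wp, mode='r') for Wp in W_blocks]
        return np.linalg.qr(np.vstack(R_locals), mode='r')

    # e.g. for W split row-wise among 4 "processors":
    # R = tsqr_R(np.array_split(W, 4))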
• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
Speedups of GMRES on an 8-core Intel Clovertown [MHDY09]
[Figure: performance plot]
Requires Co-tuning Kernels
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

                       Naive     Monomial                               Newton         Chebyshev
  Replacement its.     74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication:

                            Indices
  Nonzero entries           Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))         CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))         Graph Laplacian         Stencils

• Explicit indices or nonzero entries cause most of the communication, along with the vectors
• Ex: With stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T·W or …}
  • Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
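A two-line illustration of the nonassociativity the slide refers to (ReproBLAS itself is a C library; this only shows why summation order matters):

    vals = [1e16, 1.0, -1e16]
    print(sum(vals))               # 0.0: the 1.0 is absorbed when added to 1e16 first
    print(sum([1e16, -1e16, 1.0])) # 1.0: same numbers, different order, different answer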
Summary
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)
Don't Communic…
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable, and a new one holds.
• Ex: A, B both diagonal: no communication in the parallel case.
• Ex: A, B both Erdős-Rényi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent, i.e. the assignment of data and work to processors is independent of the sparsity pattern (but zero entries need not be communicated or operated on).
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdős-Rényi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely.
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) ).
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost.
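As a quick, purely illustrative comparison (not from the slides; the problem and machine sizes below are assumed), the two bounds can be evaluated side by side; for very sparse matrices the new bound is much larger than the dense-style one, which is why the old bound is unattainable:

    # Illustrative sketch (assumed parameter values): compare the sparsity-independent
    # bound min(d*n/sqrt(P), d^2*n/P) with the general bound d^2*n/(P*sqrt(M)).
    from math import sqrt

    def erdos_renyi_bound(d, n, P):
        return min(d * n / sqrt(P), d**2 * n / P)

    def general_bound(d, n, P, M):
        return d**2 * n / (P * sqrt(M))

    n, d, P, M = 10**6, 16, 1024, 2**20   # hypothetical sizes
    print("Erdos-Renyi bound:", erdos_renyi_bound(d, n, P))
    print("general bound    :", general_bound(d, n, P, M))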
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice.
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1), but only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms: multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software: integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications:
  – Aquarius, with ANL and UT Austin, on IBM BG/Q and Cray XC30
  – Qbox, with LLNL and IBM, on IBM BG/Q
  – Q-Chem: work in progress
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n      … loop over b x b blocks
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1    … b x b matmul within a block
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)
• Thm: Picking b = M^(1/2) attains the lower bound words_moved = Ω(n^3 / M^(1/2)).
• Where does the exponent 1/2 come from?
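A minimal sketch of the blocked loop nest above, assuming NumPy (the function name and test sizes are mine, not from the tutorial); with block size b on the order of M^(1/2), each b x b block of A, B, and C fits in fast memory:

    # Sketch (assumed): cache-blocked matmul with block size b ~ M^(1/2).
    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # b x b (or smaller, at the edges) block update
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 512, 64                                   # example sizes, hypothetical
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)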
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
          [ 1  0  1 ]   A
    Δ  =  [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2)).
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2).
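A small sketch of solving this LP numerically, assuming SciPy is available (the helper name hbl_exponent is mine, not the tutorial's):

    # Sketch (assumed helper): maximize 1^T x subject to Delta x <= 1, x >= 0.
    # Rows of Delta are array references, columns are loop indices.
    import numpy as np
    from scipy.optimize import linprog

    def hbl_exponent(Delta):
        m, k = Delta.shape
        # linprog minimizes, so negate the objective
        res = linprog(c=-np.ones(k), A_ub=Delta, b_ub=np.ones(m), bounds=[(0, None)] * k)
        return res.x, -res.fun

    Delta_matmul = np.array([[1, 0, 1],    # A(i,k)
                             [0, 1, 1],    # B(k,j)
                             [1, 1, 0]])   # C(i,j)
    x, e = hbl_exponent(Delta_matmul)
    print(x, e)                            # expect x ~ [0.5, 0.5, 0.5], e ~ 1.5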
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
          [ 1  0 ]   F
    Δ  =  [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2 / M^(e-1)) = Ω(n^2 / M^1).
  Attained by block sizes M^xi, M^xj = M^1, M^1.
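An illustrative sketch of the tiling the theorem suggests (the toy 1-D "force" and tile size are assumed, not from the tutorial): process P in tiles of about M particles, so each pair of tiles costs O(M) data movement for O(M^2) force evaluations.

    # Sketch (assumed details) of M x M tiling of the direct n-body loop.
    import numpy as np

    def force(pi, pj):
        d = pj - pi                      # toy 1-D "force", illustration only
        return d / (abs(d) ** 3 + 1e-9)

    def tiled_forces(P, M):
        n = len(P)
        F = np.zeros(n)
        for i0 in range(0, n, M):        # tile of targets, kept in fast memory
            Pi = P[i0:i0+M]
            for j0 in range(0, n, M):    # tile of sources, loaded once per target tile
                Pj = P[j0:j0+M]
                for a, pi in enumerate(Pi):
                    for pj in Pj:
                        if pi != pj:
                            F[i0 + a] += force(pi, pj)
        return F

    P = np.random.rand(1000)
    F = tiled_forces(P, M=100)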
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
          [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
    Δ  =  [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6 / M^(e-1)) = Ω(n^6 / M^(8/7)).
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7).
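The same LP as in the matmul sketch above reproduces these numbers; a self-contained check, assuming SciPy:

    # Sketch (assumed, for checking the slide's numbers): the 6-index LP.
    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                      [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                      [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                      [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6), A4(i3,i4,i6)
                      [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                      [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
    res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6), bounds=[(0, None)] * 6)
    print(np.round(res.x * 7), -res.fun)    # expect [2. 3. 1. 2. 3. 4.] and ~2.1429 = 15/7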
Approach to generalizing lower bounds
• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  => for (i1, i2, …, ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
       φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), … ?
General Communication Bound
• Thm: Given a program with array refs given by projections φj, there is an e ≥ 1 such that
      words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
      minimize e = Σj ej subject to
      rank(H) ≤ Σj ej · rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|.
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the rank conditions above,
      |S| ≤ Πj |φj(S)|^sj
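A tiny brute-force illustration of the special case behind matmul (my example, not from the tutorial): on Z^3 with the three coordinate-pair projections and s = [1/2, 1/2, 1/2], the inequality is the Loomis-Whitney bound.

    # Brute-force check (assumed example) of |S| <= (|phi_ij(S)| * |phi_ik(S)| * |phi_kj(S)|)^(1/2).
    import random

    S = {(random.randrange(8), random.randrange(8), random.randrange(8)) for _ in range(100)}
    phi_ij = {(i, j) for (i, j, k) in S}
    phi_ik = {(i, k) for (i, j, k) in S}
    phi_kj = {(k, j) for (i, j, k) in S}

    bound = (len(phi_ij) * len(phi_ik) * len(phi_kj)) ** 0.5
    print(len(S), "<=", bound)
    assert len(S) <= bound + 1e-9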
Is this bound attainable? (1/2)
• But first: can we even write the LP down?
  – One inequality per subgroup H < Z^d, but still finitely many.
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q.
    • Could be undecidable: open question.
  – Thm (good news): Another LP with the same solution is decidable (but expensive so far).
    • Tarski-decidable to get a superset of constraints (may make s_HBL too large).
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of the indices).
  – Also easy to get upper/lower bounds on e.
Is this bound attainable? (2/2)
• Depends on the loop dependencies.
• Best case: none, or reductions (matmul).
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies); work in progress.
Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers
Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx].
• Example: A tridiagonal, n=32, k=3 (the slides animate the entries of x, A·x, A^2·x, A^3·x for indices 1, 2, 3, 4, …, 32).
• Works for any "well-partitioned" A.
• Sequential algorithm: the entries are computed block by block, in Steps 1 through 4.
• Parallel algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor works on an (overlapping) trapezoid and communicates once with its neighbors.
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
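A minimal sketch of the parallel idea for the tridiagonal example, assuming NumPy (function names and sizes are mine): the owner of a block of x fetches k ghost entries on each side once, then computes its pieces of A·x, …, A^k·x with no further communication, paying some redundant work in the ghost region.

    # Sketch (assumed, illustrative only) of the matrix powers kernel for tridiagonal A.
    import numpy as np

    def tridiag_apply(x, lo, d, up):
        # y = A x for tridiagonal A with sub/main/super diagonals lo, d, up
        y = d * x
        y[1:]  += lo[: len(x) - 1] * x[:-1]
        y[:-1] += up[: len(x) - 1] * x[1:]
        return y

    def local_powers(x, start, end, k, lo, d, up):
        n = len(x)
        g0, g1 = max(0, start - k), min(n, end + k)       # ghost-extended range, fetched once
        xloc = x[g0:g1].copy()
        out = []
        for _ in range(k):
            xloc = tridiag_apply(xloc, lo[g0:g1], d[g0:g1], up[g0:g1])  # ghost edges degrade...
            out.append(xloc[start - g0:end - g0].copy())  # ...but the owned middle stays exact
        return out    # [(A x)[start:end], (A^2 x)[start:end], ..., (A^k x)[start:end]]

    n, k = 32, 3
    rng = np.random.default_rng(0)
    lo, d, up, x = rng.random(n - 1), rng.random(n), rng.random(n - 1), rng.random(n)
    mine = local_powers(x, start=8, end=16, k=k, lo=lo, d=d, up=up)
    y = x.copy()
    for _ in range(k):
        y = tridiag_apply(y, lo, d, up)                   # straightforward global computation
    assert np.allclose(mine[-1], y[8:16])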
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^kb} minimizing || Ax-b ||_2
• Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)              … SpMV
      MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)               … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Oops – W from the power method, precision lost!
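A rough, self-contained sketch of one cycle of the communication-avoiding organization above (my simplification, not the tutorial's code), with NumPy's QR standing in for TSQR and the small least-squares problem built directly from R:

    # Sketch (assumed): one CA-GMRES-style cycle with a monomial basis.
    import numpy as np

    def ca_gmres_cycle(A, b, k):
        n = len(b)
        W = np.empty((n, k + 1))
        W[:, 0] = b / np.linalg.norm(b)
        for j in range(k):                   # matrix powers kernel, monomial basis
            W[:, j + 1] = A @ W[:, j]
        Q, R = np.linalg.qr(W)               # one reduction instead of k MGS sweeps
        # A @ W[:, :k] = W[:, 1:] = Q @ R[:, 1:], so the GMRES least-squares problem
        # collapses to a small (k+1) x k problem involving only R and Q^T b.
        y, *_ = np.linalg.lstsq(R[:, 1:], Q.T @ b, rcond=None)
        return W[:, :k] @ y                  # minimizer over span{b, Ab, ..., A^(k-1)b}

    A = np.diag(np.linspace(1.0, 10.0, 200)) + 0.01 * np.random.rand(200, 200)
    b = np.ones(200)
    x = ca_gmres_cycle(A, b, k=8)
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))

For larger k the monomial basis in this sketch runs into exactly the precision problem noted above.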
• Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
• "Monomial" basis [Ax, …, A^kx] fails to converge.
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge.
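A small sketch of the conditioning issue behind these two bullets (the matrix, shift choices, and sizes are assumed, purely for illustration): the monomial basis aligns with the dominant eigenvector and becomes numerically rank-deficient, while a shifted Newton-like basis stays far better conditioned.

    # Sketch (assumed example): condition number of monomial vs shifted basis.
    import numpy as np

    n, k = 100, 12
    A = np.diag(np.linspace(1.0, 100.0, n))          # toy matrix with known spectrum
    x = np.ones(n)

    def basis(shifts):
        W = [x / np.linalg.norm(x)]
        for s in shifts:
            v = A @ W[-1] - s * W[-1]                # p_{j+1}(A)x = (A - s_j I) p_j(A) x
            W.append(v / np.linalg.norm(v))
        return np.column_stack(W)

    monomial = basis([0.0] * k)                      # zero shifts give the monomial basis
    shifted  = basis(np.linspace(1.0, 100.0, k))     # shifts spread over the spectrum (ad hoc)
    print(np.linalg.cond(monomial), np.linalg.cond(shifted))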
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
• Requires co-tuning the kernels.
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

    Basis:            Naive     Monomial                               Newton         Chebyshev
    Replacement Its.  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)
Tuning space for Krylov Methods

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
    Nonzero entries
      explicit (O(nnz)):      CSR and variations          Vision, climate, AMR, …
      implicit (o(nnz)):      Graph Laplacian             Stencils
• Classifications of sparse operators for avoiding communication:
  – Explicit indices or nonzero entries cause most of the communication, along with the vectors.
  – Ex: With stencils (all implicit), all communication is for the vectors.
• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  – Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice.
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014: www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014: www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop.
• Floating point addition is nonassociative; summation order is not reproducible.
• First release of the ReproBLAS:
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
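A tiny demonstration of the underlying issue (my example, not from the slides): summing the same numbers in a different order, as different thread or processor counts would, usually changes the rounded result, while an exactly rounded summation is order-independent.

    # Sketch (assumed example): order-dependence of floating point summation.
    import math, random

    vals = [random.uniform(-1e16, 1e16) for _ in range(10**5)] + [1.0] * 10**5
    print(sum(vals) == sum(reversed(vals)))               # often False: rounding depends on order
    print(math.fsum(vals) == math.fsum(reversed(vals)))   # True: exactly rounded, order-independent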
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3
Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk
Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)
φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip
bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip
General Communication Bound
bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that
words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to
rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk
bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces
to Hilbertrsquos 10th problem over Q bull Could be undecidable open question
ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)
ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)
ndash Also easy to get upperlower bounds on e
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices
the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip
bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo
dependencies) work in progress
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
General Communication Bound
bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that
words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to
rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk
bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett
bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|
bull Thm (ChristTaoCarberyBennett) Given s1hellipsm
|S| le Πj |φj(S)|sj
Is this bound attainable (12)
bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces
to Hilbertrsquos 10th problem over Q bull Could be undecidable open question
ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)
ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)
ndash Also easy to get upperlower bounds on e
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices
the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip
bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo
dependencies) work in progress
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Is this bound attainable (12)
bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces
to Hilbertrsquos 10th problem over Q bull Could be undecidable open question
ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)
ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)
ndash Also easy to get upperlower bounds on e
bull Tarski-decidable to get superset of constraints (may get sHBL too large)
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices
the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip
bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo
dependencies) work in progress
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Is this bound attainable (22)
bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices
the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip
bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo
dependencies) work in progress
Ongoing Work
bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts
bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and
energy by using extra memorybull Have yet to find a case where we cannot attain
lower bound ndash can we prove thisbull Incorporate into compilers
Outlinebull ldquoDirectrdquo Linear Algebra
bull Lower bounds on communication bull New algorithms that attain these lower bounds
bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A
(Figure: the entries of x, A·x, A²·x, A³·x over indices 1, 2, 3, 4, …, 32, built up one dependency layer at a time.)

• Sequential Algorithm (Steps 1, 2, 3, 4): process one block of the index range at a time, reading each block (plus a small overlap) from slow memory once and computing all k vectors for it before moving on.
• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor works on an (overlapping) trapezoid of the dependency region and communicates once with its neighbors. (A minimal code sketch of this blocking follows below.)

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
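To make the picture above concrete, here is a minimal sequential sketch of the matrix powers kernel for a tridiagonal A. This is my own illustration, not the BEBOP/Trilinos implementation; the helper matrix_powers_tridiag and its arguments are invented for this sketch. Each block of x plus k "ghost" entries per side is fetched once, and all k products for that block are computed locally, at the price of some redundant work on the ghost entries.

```python
import numpy as np

def matrix_powers_tridiag(diag, lower, upper, x, k, block_size):
    """Compute [A@x, A@(A@x), ..., A^k@x] for tridiagonal A given by
    diag[i] = A[i,i], lower[i] = A[i,i-1], upper[i] = A[i,i+1],
    touching x one block (plus k ghost entries per side) at a time."""
    n = len(x)
    results = np.zeros((k, n))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        lo, hi = max(0, start - k), min(n, end + k)  # block + ghost zone
        local = x[lo:hi].copy()                      # the single data fetch
        for step in range(k):
            new = np.zeros_like(local)
            for i in range(len(local)):
                g = lo + i                           # global row index
                new[i] = diag[g] * local[i]
                if i > 0:
                    new[i] += lower[g] * local[i - 1]
                if i < len(local) - 1:
                    new[i] += upper[g] * local[i + 1]
            local = new
            # After step+1 products, ghost entries near the fetched boundary
            # are stale, but the block interior [start, end) is still exact.
            results[step, start:end] = local[start - lo:start - lo + (end - start)]
    return results

if __name__ == "__main__":
    n, k = 32, 3
    rng = np.random.default_rng(0)
    diag, lower, upper, x = (rng.random(n) for _ in range(4))
    A = np.diag(diag) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
    reference = [np.linalg.matrix_power(A, j + 1) @ x for j in range(k)]
    computed = matrix_powers_tridiag(diag, lower, upper, x, k, block_size=8)
    assert np.allclose(computed, reference)
    print("matrix powers kernel matches direct computation")
```

In the parallel version each processor owns one block, and the single ghost-zone fetch becomes the single exchange with neighbors noted above.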
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)              … SpMV
    MGS(w, v(0), …, v(i−1))     … Modified Gram–Schmidt
    update v(i), H
  endfor
  solve least-squares problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, Aᵏv ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR" (sketched below)
  build H from R
  solve least-squares problem with H

• Oops – W from the power method, precision lost!
• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
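The TSQR step above is the communication saver: the tall-skinny W is QR-factored by independent local QRs whose small R factors are combined up a reduction tree. Below is a minimal sketch of that idea, using NumPy's qr as a stand-in for the local factorizations; it is an illustration, not the tuned TSQR from the tutorial, and the name tsqr_R is invented here.

```python
import numpy as np

def tsqr_R(W, n_blocks=4):
    """R factor of tall-skinny W via local QRs plus a pairwise reduction
    tree over the small R factors (the implicit Q is not assembled here)."""
    # Leaf level: one independent QR per row block ("per processor").
    Rs = [np.linalg.qr(block, mode="r") for block in np.array_split(W, n_blocks)]
    # Reduction tree: stack pairs of R factors and QR them again.
    while len(Rs) > 1:
        paired = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
                  for i in range(0, len(Rs) - 1, 2)]
        if len(Rs) % 2:            # odd block out rides up to the next level
            paired.append(Rs[-1])
        Rs = paired
    return Rs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((1024, 5))        # a "tall skinny" Krylov basis
    R_tree = tsqr_R(W)
    R_direct = np.linalg.qr(W, mode="r")
    # For full-rank W, R is unique up to the signs of its rows.
    assert np.allclose(np.abs(R_tree), np.abs(R_direct))
    print("TSQR reduction tree reproduces the direct R factor")
```

In the parallel case the pairwise loop would follow the machine's actual reduction tree, which is what brings the message count down to O(log p).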
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires Co-tuning Kernels
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis             Naive     Monomial                                Newton         Chebyshev
  Replacement Its.  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)
Tuning space for Krylov Methods

                                Nonzero entries
  Indices                       Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))             CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))             Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one?
• Cotuning and/or interleaving
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New: lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible (small demo below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
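A tiny self-contained illustration of the problem ReproBLAS addresses (plain Python, not the ReproBLAS API): summing the same numbers in two different orders, as different processor counts or reduction trees would, generally gives slightly different answers.

```python
import random

random.seed(2013)
# Values of widely varying magnitude make rounding-order effects visible.
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
        for _ in range(100_000)]

in_order = sum(data)                  # one summation order
shuffled = list(data)
random.shuffle(shuffled)              # same numbers, another order
reordered = sum(shuffled)

print(in_order == reordered)          # almost certainly False
print(abs(in_order - reordered))      # small, but nonzero, difference
```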
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]
bull Example A tridiagonal n=32 k=3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid
Proc 1 Proc 2 Proc 3 Proc 4
Same idea works for general sparse matrices
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
Simple block-row partitioning (hyper)graph partitioning
Top-to-bottom processing Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2
Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H
Communication-avoiding GMRES
W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H
bullOops ndash W from power method precision lost92
Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k
ldquoMonomialrdquo basis [AxhellipAkx] fails to converge
Different polynomial basis [p1(A)xhellippk(A)x] does converge
93
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
94
Requires Co-tuning Kernels
95
CA-BiCGStab
Naive Monomial Newton Chebyshev
Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)
[67 98] (2) 68 (1)
With Residual Replacement (RR) a la Van der Vorst and Ye
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm
bull Example A tridiagonal n=32 k=3
Step 1 Step 2 Step 3 Step 4
1 2 3 4 hellip hellip 32
x
Amiddotx
A2middotx
A3middotx
Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]
bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm
bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors
Proc 1 Proc 2 Proc 3 Proc 4
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel algorithm
• Example: A tridiagonal, n = 32, k = 3
• Each processor works on an (overlapping) trapezoid
[Figure: same partition among Proc 1–4; to compute its rows of A³·x without further communication, each processor's region extends k = 3 extra columns into its neighbors at the x level, so the trapezoids overlap]
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• The same idea works for general sparse matrices:
  – simple block-row partitioning becomes (hyper)graph partitioning
  – top-to-bottom processing becomes a Traveling Salesman Problem
(A minimal sketch of this kernel for the tridiagonal example follows below.)
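Below is a minimal Python sketch (an illustration, not the tutorial's tuned implementation) of the parallel matrix powers idea for the tridiagonal example: each processor gathers the overlapping "ghost" rows it needs once, then computes its rows of all k vectors with purely local SpMVs. The function name, the `halo` parameter, and the use of SciPy are assumptions made for this sketch.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_block(A, x, lo, hi, k, halo):
    """Compute rows lo:hi of [A·x, A²·x, ..., A^k·x] after a single
    gather of the halo region [lo-halo, hi+halo): one exchange with
    neighbors instead of one per power.  halo = k suffices for a
    tridiagonal A; for a general sparse A the halo is the k-step
    neighborhood found by the (hyper)graph partitioning above."""
    n = A.shape[0]
    g_lo, g_hi = max(0, lo - halo), min(n, hi + halo)
    x_loc = x[g_lo:g_hi].copy()          # the only data received from neighbors
    A_loc = A[g_lo:g_hi, g_lo:g_hi]      # owned rows plus ghost rows of A
    out, v = [], x_loc
    for _ in range(k):                   # k local SpMVs, no further communication
        v = A_loc @ v                    # entries near the ghost boundary go stale,
        out.append(v[lo - g_lo:hi - g_lo])  # but rows lo:hi remain exact for k steps
    return np.vstack(out)

# The n = 32, k = 3 example: "Proc 2" owns rows 8..15 and computes its
# slice of [A·x, A²·x, A³·x] after one neighbor exchange.
n, k = 32, 3
A = sp.diags([-np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)], [-1, 0, 1], format="csr")
x = np.ones(n)
piece = matrix_powers_block(A, x, 8, 16, k, halo=k)   # shape (3, 8)
```

Stacking the slices from all processors reproduces [Ax, A²x, A³x]; the redundant flops on the ghost rows are the price paid for sending each neighbor message once instead of k times.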
Minimizing Communication of GMRES to solve Ax = b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … orthogonalize, update v(i) and H
  endfor
  solve least-squares (LSQ) problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]         … matrix powers kernel
  [Q, R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve least-squares (LSQ) problem with H

• Oops – W comes from the power method, precision lost!
• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
(A toy sketch of both variants follows below.)
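To make the restructuring concrete, here is a toy dense Python sketch of the two patterns above, with `numpy.linalg.qr` standing in for TSQR, a monomial basis, and no restarting; it shows where the communication-heavy steps move, and is not the CA-GMRES of Hoemmen, Mohiyuddin, and Yelick.

```python
import numpy as np

def arnoldi_basis_standard(A, v, k):
    """Standard pattern: k SpMVs, each followed by modified Gram-Schmidt
    against all previous basis vectors -- O(k) synchronizations in the
    parallel case, O(k) sweeps over the basis in the sequential case."""
    n = v.size
    V = np.zeros((n, k + 1)); H = np.zeros((k + 1, k))
    V[:, 0] = v / np.linalg.norm(v)
    for i in range(k):
        w = A @ V[:, i]                     # SpMV
        for j in range(i + 1):              # MGS(w, v(0), ..., v(i))
            H[j, i] = V[:, j] @ w
            w -= H[j, i] * V[:, j]
        H[i + 1, i] = np.linalg.norm(w)
        V[:, i + 1] = w / H[i + 1, i]
    return V, H

def arnoldi_basis_ca(A, v, k):
    """CA pattern: one matrix powers kernel builds W = [v, Av, ..., A^k v],
    then one tall-skinny QR replaces the k interleaved orthogonalizations
    (np.linalg.qr stands in for TSQR); H is rebuilt from R afterwards."""
    W = np.empty((v.size, k + 1))
    W[:, 0] = v
    for i in range(k):                      # in practice: the matrix powers kernel
        W[:, i + 1] = A @ W[:, i]
    Q, R = np.linalg.qr(W)
    return Q, R
```

As the slide warns, the monomial W behaves like the power method and loses precision; the better-conditioned bases on the next slide repair this.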
• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge (see the sketch below)
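For example, a Newton-style basis just shifts each multiplication. A small sketch follows; in a real solver the shifts would be Ritz values or Chebyshev points, so the `shifts` argument here is an assumed input.

```python
import numpy as np

def shifted_basis(A, x, shifts):
    """Build W = [x, (A - s1·I)x, (A - s2·I)(A - s1·I)x, ...]: the same
    number of SpMVs and the same communication pattern as the monomial
    basis, but far better conditioned, so the CA variants converge like
    their classical counterparts."""
    W = [x]
    for s in shifts:
        W.append(A @ W[-1] - s * W[-1])
    return np.column_stack(W)
```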
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires co-tuning kernels
CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye
• Replacement iterations (count in parentheses), by basis:
  – Naive:     74 (1)
  – Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
  – Newton:    [67, 98] (2)
  – Chebyshev: 68 (1)
(A toy residual-replacement sketch, shown on CG for brevity, follows below.)
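Residual replacement itself is a small change to the iteration: every so often the recursively updated residual is overwritten with the true residual b − A·x, which caps the drift that the s-step bases would otherwise amplify. The sketch below shows the idea on plain CG rather than BiCGStab to keep it short, and uses a fixed replacement interval; Van der Vorst and Ye choose the replacement steps adaptively, so the interval here is only an illustrative assumption.

```python
import numpy as np

def cg_with_residual_replacement(A, b, x0, tol=1e-10, max_it=200, replace_every=10):
    """Plain conjugate gradients (A assumed SPD) with periodic residual
    replacement: every `replace_every` steps the recursive residual r is
    replaced by the true residual b - A·x, keeping the two from drifting
    apart in finite precision."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for it in range(max_it):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        if (it + 1) % replace_every == 0:
            r = b - A @ x          # residual replacement: recompute true residual
        else:
            r -= alpha * Ap        # standard recursive update
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```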
Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication, by whether the indices and the nonzero entries are explicit (O(nnz) data) or implicit (o(nnz) data):
  – explicit indices, explicit entries: CSR and variations
  – implicit indices, explicit entries: vision, climate, AMR, …
  – explicit indices, implicit entries: graph Laplacian
  – implicit indices, implicit entries: stencils
• Explicit indices or nonzero entries cause most of the communication, along with the vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – return all vectors or just the last one
• Co-tuning and/or interleaving (a minimal sketch of the WᵀW option follows below)
  – W = [x, Ax, A²x, …, Aᵏx] and TSQR(W) or WᵀW or …
  – ditto, but throw away W
• Preconditioned versions
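The WᵀW interleaving option above corresponds to a CholeskyQR-style orthogonalization: form the small Gram matrix with one reduction, factor it locally, and solve. A minimal dense Python sketch, assuming W has full column rank:

```python
import numpy as np

def cholesky_qr(W):
    """Orthogonalize tall-skinny W via its Gram matrix: in parallel the
    only communication is one all-reduce to form G = Wᵀ·W; the small
    Cholesky factor is then computed redundantly everywhere, playing the
    role of the R from TSQR (with worse stability if W is ill conditioned)."""
    G = W.T @ W
    R = np.linalg.cholesky(G).T          # G = Rᵀ·R with R upper triangular
    Q = np.linalg.solve(R.T, W.T).T      # W = Q·R, Q has orthonormal columns
    return Q, R
```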
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible (see the example below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
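A two-line illustration of the nonassociativity (ordinary IEEE double precision in Python, nothing specific to ReproBLAS):

```python
>>> (0.1 + 0.2) + 0.3     # left-to-right summation
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)     # same addends, different order
0.6
```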
Summary
Don't Communic…
Time to redesign all linear algebra and n-body algorithms and software
(and compilers…)
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Tuning space for Krylov Methods
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors
bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one
bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W
bull Preconditioned versions
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Summary of Iterative Linear Algebra
bull New Lower bounds optimal algorithms big speedups in theory and practice
bull Lots of other progress open problemsndash Many different algorithms reorganized
bull More underway more to be done
ndash Need to recognize stable variants more easilyndash Preconditioning
bull Hierarchically Semiseparable Matrices
ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
For more details
bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course
ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel
ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors
ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Reproducible Floating Point Computation
bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop
bull Floating point addition is nonassociative summation order not reproducible
bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of
processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)
bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-
Summary
Donrsquot Communichellip
102
Time to redesign all linear algebra n-bodyhellip algorithms and software
(and compilershellip)
- Introduction to Communication-Avoiding Algorithms wwwcsberk
- Why avoid communication (12)
- Why avoid communication (23)
- Why Minimize Communication (22)
- Why Minimize Communication (22) (2)
- Goals
- Slide 7
- Collaborators and Supporters
- Summary of CA Algorithms
- Outline
- Outline (2)
- Lower bound for all ldquodirectrdquo linear algebra
- Lower bound for all ldquodirectrdquo linear algebra (2)
- Lower bound for all ldquodirectrdquo linear algebra (3)
- Can we attain these lower bounds
- Naiumlve Matrix Multiply
- Naiumlve Matrix Multiply (2)
- Naiumlve Matrix Multiply (3)
- Blocked (Tiled) Matrix Multiply
- Blocked (Tiled) Matrix Multiply (2)
- Does blocked matmul attain lower bound
- How hard is hand-tuning matmul anyway
- How hard is hand-tuning matmul anyway (2)
- Recursive Matrix Multiplication (RMM) (12)
- Recursive Matrix Multiplication (RMM) (22)
- CARMA Performance Shared Memory
- CARMA Performance Shared Memory (2)
- Why is CARMA Faster
- Parallel MatMul with 2D Processor Layout
- SUMMA Algorithm
- SUMMA ndash n x n matmul on P12 x P12 grid
- SUMMAndash n x n matmul on P12 x P12 grid
- Summary of dense parallel algorithms attaining communication l
- Can we do better
- 25D Matrix Multiplication
- 25D Matrix Multiplication (2)
- 25D Matmul on BGP 16K nodes 64K cores
- 25D Matmul on BGP 16K nodes 64K cores (2)
- Perfect Strong Scaling ndash in Time and Energy (12)
- Perfect Strong Scaling ndash in Time and Energy (22)
- Handling Heterogeneity
- Application to Tensor Contractions
- C(ijk) = Σm A(ijm)B(mk)
- Application to Tensor Contractions (2)
- TSQR QR of a Tall Skinny matrix
- TSQR QR of a Tall Skinny matrix (2)
- TSQR An Architecture-Dependent Algorithm
- TSQR Performance Results
- Using similar idea for TSLU as TSQR Use reduction tree to do
- LU Speedups from Tournament Pivoting and 25D
- 25D vs 2D LU With and Without Pivoting
- Exascale Machine Parameters Source DOE Exascale Workshop
- Exascale predicted speedups for Gaussian Elimination 2D CA
- Other CA algorithms
- Communication Lower Bounds for Strassen-like matmul algorithms
- vs
- Slide 57
- Symmetric Band Reduction
- Conventional vs CA - SBR
- Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
- What about sparse matrices (13)
- Performance of 25D APSP using Kleene
- What about sparse matrices (23)
- What about sparse matrices (33)
- Summary of Direct Linear Algebra
- Outline (3)
- Recall optimal sequential Matmul
- New Thm applied to Matmul
- New Thm applied to Direct N-Body
- N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
- New Thm applied to Random Code
- Approach to generalizing lower bounds
- General Communication Bound
- Is this bound attainable (12)
- Is this bound attainable (22)
- Ongoing Work
- Outline (4)
- Avoiding Communication in Iterative Linear Algebra
- Slide 79
- Slide 80
- Slide 81
- Slide 82
- Slide 83
- Slide 84
- Slide 85
- Slide 86
- Slide 87
- Slide 88
- Slide 89
- Slide 90
- Slide 91
- Minimizing Communication of GMRES to solve Ax=b
- Slide 93
- Speed ups of GMRES on 8-core Intel Clovertown
- Slide 95
- Slide 96
- Slide 97
- Tuning space for Krylov Methods
- Summary of Iterative Linear Algebra
- For more details
- Reproducible Floating Point Computation
- Summary
-