Introduction to Communication-Avoiding Algorithms
www.cs.berkeley.edu/~demmel/SC12_tutorial

Jim Demmel, EECS & Math Departments, UC Berkeley



Why avoid communication? (1/2)

Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPs)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)

[Figure: sequential case – CPU + cache connected to DRAM; parallel case – several CPU/DRAM nodes connected by a network]

Why avoid communication? (2/3)

• Running time of an algorithm is the sum of 3 terms:
   – #flops × time_per_flop
   – #words moved / bandwidth      } communication
   – #messages × latency           }

• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]

   Annual improvements:
                 time_per_flop   bandwidth   latency
     Network          59%           26%        15%
     DRAM                           23%        5.5%

• Avoid communication to save time
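In symbols (my own one-line summary of the model above, using the γ/β/α notation that reappears in the perfect-strong-scaling slides later):

$$T \;=\; \#\text{flops}\cdot\gamma \;+\; \#\text{words\_moved}\cdot\beta \;+\; \#\text{messages}\cdot\alpha, \qquad \gamma \ll \beta \ll \alpha,$$

where γ = seconds per flop, β = seconds per word moved (inverse bandwidth), and α = latency per message.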

Why Minimize Communication? (2/2)

Source: John Shalf, LBL
[Figure: energy cost per operation – a flop vs. moving data on-chip, off-chip, and across the system; data movement dominates, and the gap widens over time]

Minimize communication to save energy

Goals

• Redesign algorithms to avoid communication
   • Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
   • Current algorithms often far from lower bounds
   • Large speedups and energy savings possible

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:

"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."

FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)

Collaborators and Supporters

• Michael Christ, Jack Dongarra, Ioana Dumitriu, Armando Fox, David Gleich, Laura Grigori, Ming Gu, Mike Heroux, Mark Hoemmen, Olga Holtz, Kurt Keutzer, Julien Langou, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang, Kathy Yelick
• Michael Anderson, Grey Ballard, Austin Benson, Abhinav Bhatele, Aydin Buluc, Erin Carson, Maryam Dehnavi, Michael Driscoll, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Marghoob Mohiyuddin, Oded Schwartz, Edgar Solomonik
• Other members of the ParLab, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA, TOPS projects
• Thanks to NSF, DOE, UC Discovery, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary of CA Algorithms

• "Direct" Linear Algebra
   • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
   • New algorithms that attain these lower bounds
      • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
   • Large speed-ups possible
      • Autotuning to find optimal implementation
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Outline

• "Direct" Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Lower bound for all "direct" linear algebra

• Holds for
   – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
   – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., computing A^k)
   – Dense and sparse matrices (where #flops << n^3)
   – Sequential and parallel algorithms
   – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

• Let M = "fast" memory size (per processor). Then

   #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
   #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
   (equivalently, #messages_sent ≥ #words_moved / largest_message_size)

• Parallel case: assume either load or memory is balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
   – Often not
• If not, are there other algorithms that do?
   – Yes, for much of dense linear algebra
   – New algorithms, with new numerical properties, new ways to encode answers, new data structures
   – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

With data movement between slow and fast memory made explicit:

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}    … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
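A small runnable sketch of the blocked loop above (my own illustration in Python/NumPy, not from the tutorial; the function name blocked_matmul is mine): it multiplies b-by-b tiles and counts words moved between "slow" and "fast" memory, so the 2n^3/b + 2n^2 estimate can be checked numerically.

import numpy as np

def blocked_matmul(A, B, b):
    """C = A @ B using b-by-b tiles; returns (C, words_moved)."""
    n = A.shape[0]
    assert n % b == 0, "for simplicity, assume b divides n"
    C = np.zeros((n, n))
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b].copy()   # read block C(i,j): b^2 words
            words += b * b
            for k in range(0, n, b):
                Aik = A[i:i+b, k:k+b]      # read block A(i,k): b^2 words
                Bkj = B[k:k+b, j:j+b]      # read block B(k,j): b^2 words
                words += 2 * b * b
                Cij += Aik @ Bkj           # b-by-b matmul entirely in "fast memory"
            C[i:i+b, j:j+b] = Cij          # write block C(i,j): b^2 words
            words += b * b
    return C, words

n, b = 256, 32
A, B = np.random.rand(n, n), np.random.rand(n, n)
C, words = blocked_matmul(A, B, b)
assert np.allclose(C, A @ B)
print(words, 2*n**3//b + 2*n**2)   # measured vs. predicted words moved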

Does blocked matmul attain the lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) · 2n^3/M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [ A11·B11 + A12·B21    A11·B12 + A12·B22 ;
        A21·B11 + A22·B21    A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order

W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea

For big speedups, see the SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
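For concreteness, here is a minimal cache-oblivious RMM in Python/NumPy (my own sketch mirroring the pseudocode above; a real implementation would tune the base-case size, which I set to 32 rather than 1x1 purely for speed):

import numpy as np

def rmm(A, B):
    """Recursive (cache-oblivious) matmul for n-by-n matrices, n a power of 2."""
    n = A.shape[0]
    if n <= 32:                      # small base case instead of 1x1, for speed
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
    C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
    C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
    C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
    return C

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(rmm(A, B), A @ B)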

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267, Spring 2009
• Students were given "blocked" code to start with (7x faster than naïve)
   • Still hard to get close to vendor-tuned performance (ACML) – another 6x
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
[Figure: histogram of achieved performance across the 22 teams]

Parallel MatMul with 2D Processor Layout

• P processors in a P^(1/2) x P^(1/2) grid
   – Processors communicate along rows and columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
   – Processor Pij owns submatrices Aij, Bij, and Cij

[Figure: the 4x4 processor grid P00 … P33, shown once for each of C, A, and B in C = A * B]

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
   – Comes within a factor of log P of the lower bounds:
      • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
      • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
      • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
   – Can accommodate any processor grid, matrix dimensions & layout
   – Used in practice in PBLAS = Parallel BLAS
      • www.netlib.org/lapack/lawns/lawn96.ps and lawn100.ps
• Comparison to Cannon's Algorithm
   – Cannon attains the lower bound
   – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
   • summation over submatrices
   • Need not be a square processor grid

[Figure: C(i,j) computed from block row A(i,k) and block column B(k,j)]

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
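Below is a small single-process simulation of the SUMMA loop in Python/NumPy (my own sketch; the function name summa_simulated is mine, and a real implementation would use MPI row/column communicators so the "broadcasts" become actual messages). It mimics the loop above on a virtual √P x √P grid and checks the result.

import numpy as np

def summa_simulated(A, B, p_sqrt, b):
    """Simulate SUMMA on a p_sqrt x p_sqrt virtual processor grid."""
    n = A.shape[0]
    blk = n // p_sqrt                       # each processor owns a blk x blk tile of A, B, C
    C = np.zeros((n, n))
    for k in range(0, n, b):                # one outer step per panel of width b
        # "broadcast" A(i,k) along each processor row and B(k,j) along each column
        Acol = A[:, k:k+b]
        Brow = B[k:k+b, :]
        for pi in range(p_sqrt):            # local rank-b update on every processor
            for pj in range(p_sqrt):
                rows = slice(pi*blk, (pi+1)*blk)
                cols = slice(pj*blk, (pj+1)*blk)
                C[rows, cols] += Acol[rows, :] @ Brow[:, cols]
    return C

n, p_sqrt, b = 128, 4, 16
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_simulated(A, B, p_sqrt, b), A @ B)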

SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
       … #words = log P^(1/2) × b × n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
       … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P · n^2 / P^(1/2)
   • Within a factor of log P of the lower bound
   • (a more complicated implementation removes the log P factor)
• Total #messages = log P · n/b
   • Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)

Can we do better?

• The lower bound assumed 1 copy of data: M = O(n^2/P) per processor
• What if the matrix is small enough to fit c > 1 copies, so M = c·n^2/P?
   – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
   – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
   – Special case: "3D Matmul", c = P^(1/3)
      • Dekel, Nassimi, Sahni [81]; Bernsten [89]; Agarwal, Chandra, Snir [90]; Johnson [93]; Agarwal, Balle, Gustavson, Joshi, Palkar [95]
      • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
      • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
   – Not always that much memory available…

2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
   (Example: P = 32, c = 2)

• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
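To see what the extra memory buys, here is a tiny calculator (my own illustration; words_messages is a name I made up) that plugs sample values into the lower-bound formulas above. With c = 16 copies, the per-processor bandwidth bound drops by 4x and the latency bound by 64x relative to c = 1.

from math import sqrt

def words_messages(n, P, c=1):
    """Per-processor lower bounds: words ~ n^2/(sqrt(c*P)), messages ~ sqrt(P)/c^(3/2)."""
    M = c * n * n / P
    flops = n**3 / P
    return flops / sqrt(M), flops / M**1.5

n, P = 65536, 16384
for c in (1, 4, 16):
    w, m = words_messages(n, P, c)
    print(f"c={c:2d}: words_moved ~ {w:.3e}, messages ~ {m:.3e}")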

2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

[Figure: execution-time breakdown vs. the 2D algorithm, with annotations "12x faster" and "2.7x faster"]
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
   – γT, βT, αT = secs per flop, per word_moved, per message of size m
   – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
   – γE, βE, αE = joules for the same operations
   – δE = joules per word of memory used per sec
   – εE = joules per sec for leakage, etc.
   – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
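A quick sanity check of the claim T(cP) = T(P)/c (my own algebra, using the notation above): with M fixed per processor, the bracketed per-flop cost is independent of c, so

$$T(cP)=\frac{n^3}{cP}\left[\gamma_T+\frac{\beta_T}{\sqrt{M}}+\frac{\alpha_T}{m\sqrt{M}}\right]
=\frac{1}{c}\cdot\frac{n^3}{P}\left[\gamma_T+\frac{\beta_T}{\sqrt{M}}+\frac{\alpha_T}{m\sqrt{M}}\right]=\frac{T(P)}{c}.$$

Similarly in E(cP): each term inside cP·{…} scales either like 1/(cP) (the flop/word/message energies) or like T(cP) = T(P)/c (the δE and εE terms), so the product E(cP) is independent of c.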

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K
[Figure: measured strong scaling of 2.5D matmul matches the perfect-scaling model]

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
   • What is the minimum energy required for a computation?
   • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
   • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
   • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
   • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Extending to the rest of Direct Linear Algebra:
Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix
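A direct transcription into Python/NumPy (my sketch; naive_lu is my own name, and there is no pivoting, so it assumes nonzero pivots – the tournament-pivoting discussion later addresses stability):

import numpy as np

def naive_lu(A):
    """LU without pivoting: returns L (unit lower) and U packed into one matrix."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                       # scale a vector (column i of L)
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])           # rank-1 update of trailing matrix
    return A

n = 6
A = np.random.rand(n, n) + n * np.eye(n)    # diagonally dominant, so no pivoting is needed
LU = naive_lu(A)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
assert np.allclose(L @ U, A)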

Communication in sequential one-sided factorizations (LU, QR, …)

• Naive approach:
     for i = 1 to n-1: update column i, update trailing matrix
   • #words_moved = O(n^3)
• Blocked approach (LAPACK):
     for i = 1 to n/b - 1: update block i of b columns, update trailing matrix
   • #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
     func factor(A)
       if A has 1 column, update it
       else
         factor(left half of A); update right half of A; factor(right half of A)
   • #words_moved = O(n^3 / M^(1/2))

• None of these approaches
   • minimizes #messages
   • handles eig() or svd()
   • works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix W

• Split W into 4 row blocks and factor each:
     W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
       = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
• Stack the R factors pairwise and factor again:
     [ R00; R10 ] = Q01·R01,   [ R20; R30 ] = Q11·R11
     so [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]
• One more level:
     [ R01; R11 ] = Q02·R02

• Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
  (R02 is the R factor of W; the Q factor of W is represented implicitly as the product of these block-diagonal factors)
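A compact sequential sketch of the TSQR reduction tree in Python/NumPy (my own illustration; tsqr_r is my name, and in the parallel version each leaf lives on a different processor so the pairwise combines become messages):

import numpy as np

def tsqr_r(W, nblocks=4):
    """Return the R factor of a tall-skinny W via a binary reduction tree of local QRs."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]      # leaf QRs: R00, R10, R20, R30
    while len(Rs) > 1:                                       # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')   # e.g. [R00; R10] -> R01
              for i in range(0, len(Rs), 2)]
    return Rs[0]

m, n = 10000, 50
W = np.random.rand(m, n)
R_tree = tsqr_r(W)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to row signs; compare via the Gram matrix W^T W = R^T R
assert np.allclose(R_tree.T @ R_tree, R_ref.T @ R_ref)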

TSQR: An Architecture-Dependent Algorithm

• Parallel:   W = [W0; W1; W2; W3]; each Wi is reduced to Ri0 locally, then a binary tree of combines: (R00,R10)→R01, (R20,R30)→R11, (R01,R11)→R02
• Sequential: W = [W0; W1; W2; W3]; process blocks top to bottom: W0→R00, (R00,W1)→R01, (R01,W2)→R02, (R02,W3)→R03
• Dual Core:  a hybrid of the two, one chain of combines per core

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core, …

TSQR Performance Results

• Parallel
   – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
   – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
   – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
   – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
   – Grid: 4x on 4 cities (Dongarra et al)
   – Cloud: 1.6x slower than accessing the data twice (Gleich and Benson)
• Sequential
   – "Infinite speedup" for out-of-core on a PowerPC laptop
      • As little as 2x slowdown vs. (predicted) infinite DRAM
      • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
• Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: using a similar idea for TSLU as for TSQR – use a reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
   Choose b pivot rows of W1, call them W1'
   Choose b pivot rows of W2, call them W2'
   Choose b pivot rows of W3, call them W3'
   Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12  → choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34  → choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234  → choose b pivot rows

• Go back to W and use these b pivot rows
   • Move them to the top, then do LU without pivoting
   • Extra work, but a lower-order term
• Thm: As numerically stable as partial pivoting on a larger matrix
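A sequential sketch of tournament pivoting in Python/NumPy (my own illustration of the reduction tree above; select_pivot_rows and tournament_pivots are names I made up). Each leaf runs ordinary partial-pivoting elimination on its block to nominate b candidate rows, and pairs of candidate sets play off until b winners remain.

import numpy as np

def select_pivot_rows(W, b):
    """Indices of the b rows that Gaussian elimination with partial pivoting brings to the top."""
    A = W.astype(float).copy()
    order = np.arange(A.shape[0])
    for i in range(b):
        p = i + np.argmax(np.abs(A[i:, i]))          # partial pivoting: largest entry in column i
        A[[i, p]] = A[[p, i]]
        order[[i, p]] = order[[p, i]]
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return order[:b]

def tournament_pivots(W, b, nblocks=4):
    """Tournament pivoting: pick b pivot rows of a tall-skinny W via a binary reduction tree."""
    bounds = np.linspace(0, W.shape[0], nblocks + 1, dtype=int)
    cands = [bounds[i] + select_pivot_rows(W[bounds[i]:bounds[i+1]], b)
             for i in range(nblocks)]                # each block nominates b candidate rows
    while len(cands) > 1:                            # pairs of candidate sets play off
        nxt = []
        for i in range(0, len(cands), 2):
            idx = np.concatenate(cands[i:i+2])       # merge two candidate sets (at most 2b rows)
            nxt.append(idx[select_pivot_rows(W[idx], b)])
        cands = nxt
    return cands[0]                                  # global indices of the b chosen pivot rows

W = np.random.rand(1024, 8)
print(tournament_pivots(W, b=8))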

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs. ScaLAPACK-LU
[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]

2.5D vs. 2D LU, With and Without Pivoting
[Figure: performance comparison]

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use a (recursive) block layout
   • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious algorithms are underlined; green are ours; "?" is unknown / future work

Per algorithm: [two levels of memory — words moved | messages] ; [multiple levels of memory — words moved | messages]

• BLAS-3: usual blocked or recursive algorithms ; usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] ; (← same) | (← same)
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX 08] | [GDX 08], partial pivoting ; [Toledo,97], [GDX 08] | [GDX 08]
• QR, rank-revealing: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] | [Frens,Wise,03] (but 3x flops), [DGHL08] ; [Elmroth,Gustavson,98], [DGHL08] | [Frens,Wise,03], [DGHL08]
• Eig, SVD: Not LAPACK; [BDD11] randomized, but more flops; [BDK11] | [BDD11, BDK11] ; [BDD11, BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
     #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
     #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Conventional approach:

Algorithm        Reference      Factor exceeding lower       Factor exceeding lower
                                bound for #words_moved       bound for #messages
Matrix Multiply  [Cannon, 69]   1                            1
Cholesky         ScaLAPACK      log P                        log P
LU               ScaLAPACK      log P                        (n/P^(1/2)) · log P
QR               ScaLAPACK      log P                        (n/P^(1/2)) · log P
Sym Eig, SVD     ScaLAPACK      log P                        n/P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) · log P              n · log P

Better algorithms:

Algorithm        Reference      Factor exceeding lower       Factor exceeding lower
                                bound for #words_moved       bound for #messages
Matrix Multiply  [Cannon, 69]   1                            1
Cholesky         ScaLAPACK      log P                        log P
LU               [GDX10]        log P                        log P
QR               [DGHL08]       log P                        log^3 P
Sym Eig, SVD     [BDD11]        log P                        log^3 P
Nonsym Eig       [BDD11]        log P                        log^3 P

Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c · n^2 / P) per processor
• Increasing M reduces the lower bounds:
     #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
     #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference          Factor exceeding lower     Factor exceeding lower
                                    bound for #words_moved     bound for #messages
Matrix Multiply  [DS11, SBD11]      polylog P                  polylog P
Cholesky         [SD11, in prog.]   polylog P                  c^2 polylog P – optimal?
LU               [DS11, SBD11]      polylog P                  c^2 polylog P – optimal?
QR               via Cholesky QR    polylog P                  c^2 polylog P – optimal?
Sym Eig, SVD
Nonsym Eig

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
   – A = A^T is banded
   – T is tridiagonal
   – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense:
   – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• The algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
   – It can communicate even less

Successive Band Reduction (SBR)

• Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
• Each sweep applies an orthogonal transformation Q1 (and Q1^T on the other side, to preserve symmetry) that zeroes out the leading c columns of the band below the d-th subdiagonal; this creates a "bulge" outside the band, which is chased down the matrix by further transformations Q2, Q3, Q4, Q5, …
• Only need to zero out the leading parallelogram of each trapezoid

[Figure sequence: a band of semi-bandwidth b+1, the sweep parameters c, d, d+c, and the bulge-chasing steps 1–6 with transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T]

Conventional vs. CA-SBR

                 Conventional              Communication-Avoiding
Data reuse       Touch all data 4 times    Touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
   – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
   – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
   – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
   – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
   – Best sequential speedup vs MKL: 1.9x
   – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

   Classical O(n^3) matmul:      #words_moved = Ω( M·(n/M^(1/2))^3 / P )
   Strassen's O(n^lg7) matmul:   #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
   Strassen-like O(n^ω) matmul:  #words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
   – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11): Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
   – BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
   – DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  If EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
[Figure: strong-scaling performance of CAPS]
Speedups: 24%–184% (over previous Strassen-based algorithms)
For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
   – Algorithms:
      • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
      • All-pairs-shortest-paths, …
      • Both 2D (c=1) and 2.5D (c>1)
      • But only bandwidth may decrease with c>1, not latency
      • Sparse matrices
   – Platforms:
      • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
   – Software:
      • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
   – Qbox (with LLNL, IBM): molecular dynamics
   – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
   – Does k SpMVs with A and the starting vector
   – Many such "Krylov Subspace Methods"
• Goal: minimize communication
   – Assume the matrix is "well-partitioned"
   – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
   – Parallel implementation on p processors
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
   – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
   [Figure: the entries of x, A·x, A^2·x, A^3·x laid out as four rows of 32 points, with dependencies running between adjacent entries of consecutive rows]
• Works for any "well-partitioned" A

• Sequential algorithm: process the columns in blocks (Step 1, Step 2, Step 3, Step 4); each block (plus a small overlap) is read from slow memory once, and all k vectors are computed for it before moving on
• Parallel algorithm: Proc 1 … Proc 4 each own a contiguous block of columns; each processor communicates once with its neighbors (to obtain the boundary entries it needs), then works on an (overlapping) trapezoid of the computation with no further communication
• The same idea works for general sparse matrices:
   – Simple block-row partitioning → (hyper)graph partitioning
   – Top-to-bottom processing → Traveling Salesman Problem
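A runnable single-process sketch of the kernel for the tridiagonal example (my own illustration; matrix_powers_tridiag is my name). Each "processor" block grabs k ghost entries from each neighbor once, then computes all k vectors locally – exactly the overlapping-trapezoid idea above, traded for a little redundant work.

import numpy as np

def matrix_powers_tridiag(diag, off, x, k, nblocks=4):
    """Compute [A x, A^2 x, ..., A^k x] for symmetric tridiagonal A with one neighbor exchange per block.

    A has main diagonal `diag` and off-diagonal `off` (off[i] couples x[i] and x[i+1]).
    """
    n = len(x)
    V = np.zeros((k + 1, n))
    bounds = np.linspace(0, n, nblocks + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        glo, ghi = max(lo - k, 0), min(hi + k, n)    # ghost zone: k entries on each side
        v = np.zeros((k + 1, ghi - glo))
        v[0] = x[glo:ghi]
        for j in range(1, k + 1):                    # local SpMVs, no further communication
            w = diag[glo:ghi] * v[j - 1]
            w[:-1] += off[glo:ghi - 1] * v[j - 1][1:]
            w[1:]  += off[glo:ghi - 1] * v[j - 1][:-1]
            v[j] = w
        V[:, lo:hi] = v[:, lo - glo:hi - glo]        # keep only the owned columns
    return V[1:]

n, k = 32, 3
diag, off, x = np.random.rand(n), np.random.rand(n - 1), np.random.rand(n)
A = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
V = matrix_powers_tridiag(diag, off, x, k)
ref = [np.linalg.matrix_power(A, j) @ x for j in range(1, k + 1)]
assert np.allclose(V, ref)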

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                … SpMV
    MGS(w, v(0), …, v(i-1))       … modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k

• Oops – W is from the power method, precision is lost!
• The "monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
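A tiny numeric illustration (my own, not from the talk) of why the monomial basis is troublesome: even with normalized columns its condition number explodes with k, because every column aligns with the dominant eigenvector, while a shifted "Newton-like" basis with shifts spread across the spectrum stays far better conditioned. That is why CA-GMRES builds [p1(A)x, …, pk(A)x] instead of raw powers.

import numpy as np

n, k = 200, 12
A = np.diag(np.linspace(1, 100, n))          # simple SPD test matrix
v = np.random.rand(n)

# Monomial basis [v, Av, ..., A^k v], columns normalized
W = np.empty((n, k + 1))
W[:, 0] = v / np.linalg.norm(v)
for j in range(1, k + 1):
    W[:, j] = A @ W[:, j - 1]
    W[:, j] /= np.linalg.norm(W[:, j])
print("condition number of monomial basis:", np.linalg.cond(W))

# Shifted (Newton-like) basis [ (A - s1 I)v, (A - s2 I)(A - s1 I)v, ... ]
shifts = np.linspace(1, 100, k)
Wn = np.empty((n, k + 1))
Wn[:, 0] = v / np.linalg.norm(v)
for j in range(1, k + 1):
    Wn[:, j] = (A - shifts[j - 1] * np.eye(n)) @ Wn[:, j - 1]
    Wn[:, j] /= np.linalg.norm(Wn[:, j])
print("condition number of shifted basis:  ", np.linalg.cond(Wn))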

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
[Figure: measured speedups of CA-GMRES over standard GMRES]

Requires co-tuning kernels

CA-BiCGStab
[Figure: convergence of CA-BiCGStab with different bases]

With Residual Replacement (RR) à la Van der Vorst and Ye:

                  Naive    Monomial                           Newton         Chebyshev
Replacement its.  74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication
   • Explicit indices or nonzero entries cause most communication, along with the vectors
   • Example: with stencils (both implicit), all communication is for vectors

                               Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
  Nonzeros: Explicit (O(nnz))  CSR and variations           Vision, climate, AMR, …
  Nonzeros: Implicit (o(nnz))  Graph Laplacian              Stencils

• Operations
   • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
   • Number of columns in x
   • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
   • Return all vectors or just the last one
• Cotuning and/or interleaving
   • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
   • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
   – Many different algorithms reorganized
      • More underway, more to be done
   – Need to recognize stable variants more easily
   – Preconditioning
      • Hierarchically Semiseparable Matrices
   – Autotuning and synthesis
      • pOSKI for SpMV – available at bebop.cs.berkeley.edu
      • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
     for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
       for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
         i = i1+i2, j = j1+j2, k = k1+k2
         C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in a matrix Δ:

              i   j   k
        Δ = [ 1   0   1 ]   A
            [ 0   1   1 ]   B
            [ 1   1   0 ]   C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
   – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^(xi), M^(xj), M^(xk) = M^(1/2), M^(1/2), M^(1/2)
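The little LP can be checked mechanically; here is a sketch using scipy.optimize.linprog (my own illustration; linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

# Rows of Δ: which loop indices (i, j, k) appear in each array reference A(i,k), B(k,j), C(i,j)
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x subject to Δ x <= 1, x >= 0
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
print(res.x, -res.fun)   # expected: x = [0.5, 0.5, 0.5], s = 1.5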

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in a matrix Δ:

              i   j
        Δ = [ 1   0 ]   F
            [ 1   0 ]   P(i)
            [ 0   1 ]   P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
   – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^(xi), M^(xj) = M^1, M^1

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
[Figure: speedup of the communication-avoiding N-body algorithm – 11.8x speedup]
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
     A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record the array indices in a matrix Δ:

              i1  i2  i3  i4  i5  i6
        Δ = [ 1   0   1   0   0   1 ]   A1
            [ 1   1   0   1   0   0 ]   A2
            [ 0   1   1   0   1   0 ]   A3
            [ 0   0   1   1   0   1 ]   A3, A4
            [ 0   1   0   0   0   1 ]   A5
            [ 1   0   0   1   1   0 ]   A6

• Solve the LP for x = [x1,…,x6]^T: max 1^T x  s.t.  Δ x ≤ 1
   – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
       C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
       D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, φm, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
      for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm (Christ, Tao, Carbery, Bennett): Given s1, …, sm satisfying the HBL-LP,
      |S| ≤ Πj |φj(S)|^(sj)
• Thm: Given a program with array references given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
      #words_moved = Ω( #iterations / M^(s_HBL - 1) )
   – Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett

Is this bound attainable? (1/2)

• But first: can we even write it down?
   – Thm (bad news): it reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
   – Thm (good news): can write it down explicitly in many cases of interest (e.g., when all φj are subsets of the indices)
   – Thm (good news): easy to approximate
      • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
      • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subsets of indices, the dual of the HBL-LP gives optimal tile sizes:
      HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
      Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
   Then for a sequential algorithm, tile loop index ij by M^(xj)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
   – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
   – Live broadcast in Spring 2013
      • www.cs.berkeley.edu/~demmel
   – Prerecorded version planned in Spring 2013
      • www.xsede.org
      • Free supercomputer accounts to do the homework!

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 2: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

2

Why avoid communication (12)

Algorithms have two costs (measured in time or energy)1 Arithmetic (FLOPS)2 Communication moving data between

ndash levels of a memory hierarchy (sequential case) ndash processors over a network (parallel case)

CPUCache

DRAM

CPUDRAM

CPUDRAM

CPUDRAM

CPUDRAM

Why avoid communication (23)

bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

3

communication

bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

bull Avoid communication to save time

Annual improvements

Time_per_flop Bandwidth Latency

Network 26 15

DRAM 23 559

Why Minimize Communication (22)

Source John Shalf LBL

Why Minimize Communication (22)

Source John Shalf LBL

Minimize communication to save energy

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura

Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick

bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik

bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects

bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays

(eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c*n^2/P?
  – #words_moved = Ω( #flops / M^{1/2} ) = Ω( n^2 / ( c^{1/2} * P^{1/2} ) )
  – #messages = Ω( #flops / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )

• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^{1/3}
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^{1/3} x n/P^{1/3}
  – Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid

[Figure: processor grid with dimensions (P/c)^{1/2} x (P/c)^{1/2} x c; Example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^{1/2} x (P/c)^{1/2} x c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n*(c/P)^{1/2} x n*(c/P)^{1/2}

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
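A quick back-of-the-envelope sketch, in Python, of how replication lowers the per-processor communication bounds quoted above; constants are dropped, and the example values (P = 16384 nodes, c = 16, n = 65536) are taken from the BG/P experiment on the next slides.

def words_moved_per_proc(n, P, c=1):
    return n**2 / (c**0.5 * P**0.5)       # Ω( n^2 / (c^{1/2} P^{1/2}) )

def messages_per_proc(P, c=1):
    return P**0.5 / c**1.5                # Ω( P^{1/2} / c^{3/2} )

n, P, c = 65536, 16384, 16
print(words_moved_per_proc(n, P) / words_moved_per_proc(n, P, c))   # 4x fewer words
print(messages_per_proc(P) / messages_per_proc(P, c))               # 64x fewer messages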

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

27x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) * [ γT + βT/M^{1/2} + αT/(m*M^{1/2}) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
  – E(cP) = cP * { n^3/(cP) * [ γE + βE/M^{1/2} + αE/(m*M^{1/2}) ] + δE*M*T(cP) + εE*T(cP) } = E(P)
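A hedged sketch of the timing and energy model just stated, as two Python functions; the parameter values are placeholders, not measured machine constants, and M is the fixed memory per processor.

def time_model(n, P, c, M, m, gamma_t, beta_t, alpha_t):
    flops = n**3 / (c * P)
    return flops * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))

def energy_model(n, P, c, M, m, g_e, b_e, a_e, d_e, e_e, g_t, b_t, a_t):
    T = time_model(n, P, c, M, m, g_t, b_t, a_t)
    flops = n**3 / (c * P)
    per_proc = flops * (g_e + b_e / M**0.5 + a_e / (m * M**0.5)) + d_e * M * T + e_e * T
    return c * P * per_proc              # independent of c: E(cP) = E(P)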

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

39

Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                          … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)       … rank-1 update

for i = 1 to n-1
  update column i
  update trailing matrix
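A minimal numpy sketch of the naïve (unblocked, unpivoted) elimination above; it overwrites a copy of A with L below the diagonal (unit diagonal implied) and U on and above it. Illustrative only; real codes pivot.

import numpy as np

def lu_nopivot(A):
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    return A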

Communication in sequentialOne-sided Factorizations (LU QR hellip)

• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
• #words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
• #words moved = O(n^3 / M^{1/3})

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
• #words moved = O(n^3 / M^{1/2})

• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel

• Need more ideas

TSQR: QR of a Tall, Skinny matrix

41

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

42

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
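A serial numpy simulation of the binary TSQR reduction tree above: factor each row block, then repeatedly stack pairs of R factors and refactor. Only the final R is returned; the implicit Q is the product of the {Qij} in the tree. Function names are illustrative.

import numpy as np

def tsqr_r(W, nblocks=4):
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(b, mode='r') for b in blocks]          # leaf factorizations
    while len(Rs) > 1:                                         # tree levels
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(1000, 10)
# R agrees with a direct QR up to the signs of its rows
assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r')))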

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees on W = [W0; W1; W2; W3]:
  Parallel: binary tree, R00, R10, R20, R30 → R01, R11 → R02
  Sequential: flat tree, R00 → R01 → R02 → R03
  Dual Core: hybrid of the two]

Can choose reduction tree dynamically: Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown
    – Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12   … choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   … choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234   … choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
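A serial sketch of tournament pivoting on a tall-skinny panel W (n x b): at each node of the reduction tree, run ordinary partial-pivoting LU (via scipy) on the stacked candidate rows and keep the b rows it selects. Function and variable names here are illustrative, not from any library.

import numpy as np
from scipy.linalg import lu

def select_pivot_rows(block, b):
    p, _, _ = lu(block)                 # block = p @ l @ u, p a permutation matrix
    perm = np.argmax(p, axis=0)         # perm[i] = row of `block` used as i-th pivot
    return perm[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [blk[select_pivot_rows(W[blk], b)] for blk in blocks]   # tree leaves
    while len(cands) > 1:                                           # tree levels
        merged = []
        for i in range(0, len(cands), 2):
            idx = np.concatenate(cands[i:i+2])
            merged.append(idx[select_pivot_rows(W[idx], b)])
        cands = merged
    return cands[0]        # global indices of the b tournament pivot rows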

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages

Matrix Multiply
Cholesky
LU
QR
Sym Eig, SVD
Nonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages

Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^{1/2} · log P |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages

Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | ( n / P^{1/2} ) · log P
QR | ScaLAPACK | log P | ( n / P^{1/2} ) · log P
Sym Eig, SVD | ScaLAPACK | log P | n / P^{1/2}
Nonsym Eig | ScaLAPACK | P^{1/2} · log P | n · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / P^{1/2} )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages

Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Can we do even better

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^{1/2} ) = Ω( n^2 / (c^{1/2} · P^{1/2}) )
  #messages    = Ω( (n^3/P) / M^{3/2} ) = Ω( P^{1/2} / c^{3/2} )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages

Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 · polylog P – optimal
LU | [DS11, SBD11] | polylog P | c^2 · polylog P – optimal
QR | Via Cholesky QR | polylog P | c^2 · polylog P – optimal
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction

[Figure sequence: successive band reduction on a band of half-width b+1.
Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b.
Each sweep applies an orthogonal transformation Q1 (and Q1^T), then Q2, Q3, Q4, Q5, … to zero out a c-by-d parallelogram of the band; each annihilation creates a bulge of width d+c that is chased down the band in steps 1, 2, 3, 4, 5, 6, ….
Only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA - SBR

Conventional | Communication-Avoiding
Touch all data 4 times | Touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce #words_moved by factor M/bw, not just M^{1/2}.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^{2/ω}
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  #words_moved = Ω( M·(n/M^{1/2})^3 / P )

Strassen's O(n^{lg 7}) matmul:
  #words_moved = Ω( M·(n/M^{1/2})^{lg 7} / P )

Strassen-like O(n^ω) matmul:
  #words_moved = Ω( M·(n/M^{1/2})^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  If EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"
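A toy Python recursion mirroring the CAPS rule above: at each Strassen level it decides between a BFS step (the 7 subproblems split across the processors, more memory) and a DFS step (all processors do the 7 subproblems in turn, less memory). It only traces a schedule under rough, assumed memory estimates; it is not an implementation of parallel Strassen.

def caps_schedule(n, P, M, n_leaf=1024):
    steps = []
    while n > n_leaf:
        need = 7.0 / 4.0 * 3 * n * n / P      # rough per-proc memory for a BFS step
        if P >= 7 and need <= M:
            steps.append('BFS'); P //= 7      # each multiply gets P/7 processors
        else:
            steps.append('DFS')               # all P processors, 7 multiplies in sequence
        n //= 2
    return steps

print(caps_schedule(n=94080, P=2401, M=2**27))   # e.g. ['BFS', 'BFS', 'BFS', 'BFS', 'DFS', 'DFS', 'DFS']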

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al, 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure sequence: rows x, A·x, A^2·x, A^3·x over indices 1, 2, 3, 4, …, 32, showing the entries of the row below that each entry depends on.]

• Sequential Algorithm: Steps 1, 2, 3, 4 each compute the entries of [Ax, A^2x, A^3x] for one contiguous block of indices before moving to the next block
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of indices; each processor communicates once with neighbors, then works on an (overlapping) trapezoid

Same idea works for general sparse matrices

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem
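A serial numpy sketch of the matrix powers kernel for the tridiagonal example: a block owner grabs k "ghost" entries of x from each neighbor once, then computes its rows of A·x, …, A^k·x with no further communication, at the price of some redundant flops near the block edges. Here A has constant diagonal d and off-diagonal e; names are illustrative.

import numpy as np

def tridiag_matvec(d, e, x):
    y = d * x
    y[:-1] += e * x[1:]
    y[1:] += e * x[:-1]
    return y

def matrix_powers_block(d, e, x, lo, hi, k):
    g0, g1 = max(lo - k, 0), min(hi + k, len(x))   # owned rows plus k ghosts per side
    xloc = x[g0:g1].copy()
    out = []
    for _ in range(k):
        xloc = tridiag_matvec(d, e, xloc)          # outer ghost entries go stale,
        out.append(xloc[lo - g0: hi - g0].copy())  # but owned entries stay exact
    return out

n, k = 32, 3
x = np.random.rand(n); d, e = 2.0, -1.0
y = x.copy()
for _ in range(k):
    y = tridiag_matvec(d, e, y)
assert np.allclose(matrix_powers_block(d, e, x, lo=8, hi=16, k=k)[-1], y[8:16])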

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

88

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge
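A small numpy sketch of building the k-step basis with a shifted ("Newton"-style) recurrence instead of the monomial one; the shifts theta_j would normally come from Ritz value estimates, here they are just placeholder arguments.

import numpy as np

def newton_basis(A, v, k, shifts):
    W = [v / np.linalg.norm(v)]
    for j in range(k):
        W.append(A @ W[-1] - shifts[j] * W[-1])   # w_{j+1} = (A - theta_j I) w_j
    return np.column_stack(W)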

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Basis:              Naive  | Monomial                              | Newton       | Chebyshev
Replacement Its.:   74 (1) | [7, 15, 24, 31, …, 92, 97, 103] (17)  | [67, 98] (2) | 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Indices →                    Explicit (O(nnz))    | Implicit (o(nnz))
Nonzero entries ↓
  Explicit (O(nnz))          CSR and variations   | Vision, climate, AMR, …
  Implicit (o(nnz))          Graph Laplacian      | Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul
• Thm: Picking b = M^{1/2} attains lower bound: #words_moved = Ω(n^3 / M^{1/2})
• Where does 1/2 come from?
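A numpy version of the blocked loop above, choosing the block size so that three b x b blocks fit in a fast memory of M words (3b^2 ≤ M); ragged edges are handled by slicing.

import numpy as np

def blocked_matmul(A, B, M):
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C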

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        A  [  1  0  1 ]
  Δ =   B  [  0  1  1 ]
        C  [  1  1  0 ]

• Solve LP for x = [xi, xj, xk]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^{s-1}) = Ω(n^3 / M^{1/2})
  Attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2}
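The little LP above can be checked mechanically; a hedged sketch using scipy (linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1],     # A(i,k)
                  [0, 1, 1],     # B(k,j)
                  [1, 1, 0]])    # C(i,j)
res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)           # [0.5 0.5 0.5]  1.5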

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

                i  j
        F    [  1  0 ]
  Δ =   P(i) [  1  0 ]
        P(j) [  0  1 ]

• Solve LP for x = [xi, xj]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^{s-1}) = Ω(n^2 / M^1)
  Attained by block sizes M^{xi}, M^{xj} = M^1, M^1
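A direct n-body loop blocked the way the LP solution x = [1, 1] suggests: hold a block of roughly M targets in fast memory while streaming blocks of roughly M sources past it, doing all block-pair interactions before moving on. pair_force is a user-supplied function; all names are illustrative.

import numpy as np

def blocked_forces(P, M, pair_force):
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, M):                 # block of targets stays in fast memory
        for j0 in range(0, n, M):             # stream blocks of sources past it
            for i in range(i0, min(i0 + M, n)):
                for j in range(j0, min(j0 + M, n)):
                    if i != j:
                        F[i] += pair_force(P[i], P[j])
    return F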

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

                i1 i2 i3 i4 i5 i6
        A1    [  1  0  1  0  0  1 ]
        A2    [  1  1  0  1  0  0 ]
  Δ =   A3    [  0  1  1  0  1  0 ]
        A3,A4 [  0  0  1  1  0  1 ]
        A5    [  0  1  0  0  0  1 ]
        A6    [  1  0  0  1  1  0 ]

• Solve LP for x = [x1, …, x6]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^{s-1}) = Ω(n^6 / M^{8/7})
  Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
  Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
  Access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k,   rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
    #words_moved = Ω( #iterations / M^{s_HBL - 1} )
  – Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^{sj}

Is this bound attainable (12)

• But first: Can we write it down?
  – Thm: (bad news) Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm: (good news) Can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm: (good news) Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable (22)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, dual of HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then for sequential algorithm, tile ij by M^{xj}
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 3: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Why avoid communication (23)

bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

3

communication

bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

bull Avoid communication to save time

Annual improvements

Time_per_flop Bandwidth Latency

Network 26 15

DRAM 23 559

Why Minimize Communication (22)

Source John Shalf LBL

Why Minimize Communication (22)

Source John Shalf LBL

Minimize communication to save energy

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura

Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick

bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik

bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects

bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays

(eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3 (picture the entries of x, A·x, A^2·x, A^3·x laid out over indices 1, 2, 3, 4, …, 32, each entry depending on its neighbors one level below)
• Works for any "well-partitioned" A
• Sequential Algorithm: sweep through the dependency diagram in blocks (Step 1, Step 2, Step 3, Step 4), computing all k levels for one block before moving to the next, so each block's data moves from slow to fast memory only once
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous range of indices; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the diagram

Same idea works for general sparse matrices:
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
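As an illustration of the parallel idea, here is a minimal Python/NumPy sketch (not the tuned kernel from the talk) for a tridiagonal A: each processor's block of indices is extended by k "ghost" entries on each side, so a single neighbor exchange provides everything needed to compute its rows of Ax, A^2x, …, A^kx, at the price of some redundant arithmetic near the block boundaries.

import numpy as np

def matrix_powers_block(lower, main, upper, x, k, lo, hi):
    # Compute rows [lo, hi) of A·x, A^2·x, ..., A^k·x for tridiagonal A stored as
    # three length-n diagonals (A[i,i-1]=lower[i], A[i,i]=main[i], A[i,i+1]=upper[i]),
    # using only x[lo-k : hi+k] -- the data one neighbor exchange would provide.
    # Entries near the ghost boundary become wrong after each multiply, but that
    # pollution never reaches the owned rows within k steps (redundant computation).
    n = len(main)
    glo, ghi = max(lo - k, 0), min(hi + k, n)        # ghost-extended index range
    xg = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        yg = np.zeros_like(xg)
        for i in range(glo, ghi):
            j = i - glo
            s = main[i] * xg[j]
            if j > 0:
                s += lower[i] * xg[j - 1]
            if j < ghi - glo - 1:
                s += upper[i] * xg[j + 1]
            yg[j] = s
        xg = yg
        out.append(xg[lo - glo:hi - glo].copy())
    return out                                       # [A·x, ..., A^k·x], rows lo..hi-1

n, k = 32, 3
lower, main, upper = np.full(n, -1.0), np.full(n, 2.0), np.full(n, -1.0)
x = np.random.randn(n)
blk = matrix_powers_block(lower, main, upper, x, k, 8, 16)   # "Proc 2" owns indices 8..15
A = np.diag(main) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
assert np.allclose(blk[2], (np.linalg.matrix_power(A, 3) @ x)[8:16])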

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k·b} minimizing ||Ax−b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
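A quick way to see the precision problem (a hedged toy illustration, not the experiment behind the convergence plots): the columns of the monomial basis W = [x, Ax, …, A^kx] all turn toward the dominant eigenvector, so cond(W) grows roughly geometrically in k, while a shifted, normalized (Newton-style) basis stays far better conditioned. The test matrix and shift choices below are invented for the demo.

import numpy as np

np.random.seed(0)
n, kmax = 200, 12
A = np.diag(np.linspace(1.0, 10.0, n))        # toy SPD matrix with spectrum in [1, 10]
x = np.random.randn(n)

# Monomial (power-method) basis: condition number explodes with k
W = np.empty((n, kmax + 1)); W[:, 0] = x
for k in range(1, kmax + 1):
    W[:, k] = A @ W[:, k - 1]
for k in (4, 8, 12):
    print("monomial, k =", k, " cond =", np.linalg.cond(W[:, :k + 1]))

# Newton-style basis: multiply by (A - shift_k I) and normalize each column
shifts = np.linspace(1.0, 10.0, kmax)         # crude shifts spread over the spectrum
V = np.empty_like(W); V[:, 0] = x / np.linalg.norm(x)
for k in range(1, kmax + 1):
    v = A @ V[:, k - 1] - shifts[k - 1] * V[:, k - 1]
    V[:, k] = v / np.linalg.norm(v)
print("newton,   k =", kmax, " cond =", np.linalg.cond(V))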

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye

Basis:              Naive     Monomial                          Newton        Chebyshev
Replacement its.:   74 (1)    [7,15,24,31,…,92,97,103] (17)     [67,98] (2)   68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication, by how the nonzero entries and the indices are represented:
  – Entries explicit (O(nnz)), indices explicit (O(nnz)): CSR and variations
  – Entries explicit (O(nnz)), indices implicit (o(nnz)): vision, climate, AMR, …
  – Entries implicit (o(nnz)), indices explicit (O(nnz)): graph Laplacian
  – Entries implicit (o(nnz)), indices implicit (o(nnz)): stencils
• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: with stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n
     for j1 = 1:b:n
      for k1 = 1:b:n
       for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … inner three loops = one b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
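The same blocked loop nest as a runnable Python/NumPy sketch (illustrative only; in practice b would be chosen so that three b-by-b blocks fit in fast memory, b ≈ (M/3)^(1/2)):

import numpy as np

def blocked_matmul(A, B, b):
    # C = A @ B computed tile by tile: each (i1, j1, k1) iteration touches only
    # three b-by-b blocks, which is what keeps words_moved near n^3 / M^(1/2).
    n = A.shape[0]
    C = np.zeros_like(A)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

n, b = 300, 64                      # n need not be a multiple of b; slicing handles the fringe
A, B = np.random.randn(n, n), np.random.randn(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)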

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
        [   1  0  1 ]   A
    Δ = [   0  1  1 ]   B
        [   1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^Tx  s.t.  Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^Tx = 3/2 = s
• Thm: words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
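For concreteness, the same LP handed to an off-the-shelf solver (this assumes SciPy is available; linprog minimizes, so the objective is negated) recovers x = [1/2, 1/2, 1/2] and s = 3/2:

import numpy as np
from scipy.optimize import linprog

# Rows of Delta record which loop indices (i, j, k) each array reference uses
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  subject to  Delta x <= 1,  x >= 0
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # -> [0.5 0.5 0.5], s = 1.5, so words_moved = Ω(n^3/M^(1/2))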

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
        [   1  0 ]   F
    Δ = [   1  0 ]   P(i)
        [   0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^Tx  s.t.  Δx ≤ 1
  – Result: x = [1, 1], 1^Tx = 2 = s
• Thm: words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
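The tiling this suggests, M "target" particles against M "source" particles at a time, looks roughly like the sketch below (a toy 1D example with a made-up 1/r force, not the IBM BG/P code of the next slide): each pair of blocks is loaded once, so about n^2/M words move in total, matching the bound.

import numpy as np

def blocked_forces(P, M):
    # Accumulate F(i) = sum_j force(P(i), P(j)) in M-by-M blocks of particles.
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, M):
        Pi = P[i0:i0+M]                              # block of "target" particles
        for j0 in range(0, n, M):
            Pj = P[j0:j0+M]                          # block of "source" particles
            d = Pi[:, None] - Pj[None, :]            # pairwise separations
            f = np.zeros_like(d)
            np.divide(1.0, d, out=f, where=(d != 0.0))   # toy 1/r "force", skipping self-pairs
            F[i0:i0+M] += f.sum(axis=1)
    return F

P = np.random.randn(1024)
assert np.allclose(blocked_forces(P, 128), blocked_forces(P, 1024))  # block size doesn't change the answer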

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
        [   1  0  1  0  0  1 ]   A1
        [   1  1  0  1  0  0 ]   A2
    Δ = [   0  1  1  0  1  0 ]   A3
        [   0  0  1  1  0  1 ]   A3, A4
        [   0  1  0  0  0  1 ]   A5
        [   1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]^T: max 1^Tx  s.t.  Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^Tx = 15/7 = s
• Thm: words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1,…,sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σ_j sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k and group homomorphisms φ1, φ2, …: bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm satisfying the HBL-LP,
    |S| ≤ Π_j |φj(S)|^sj
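To connect the two theorems (a brief informal restatement, not taken from the slides): partition the execution into segments that each move M words; within a segment only O(M) entries of each array are available, so taking S = the set of iterations executed in that segment gives |S| ≤ O(M^s_HBL), hence at least #iterations/O(M^s_HBL) segments, and words_moved ≥ M · #segments = Ω(#iterations/M^(s_HBL − 1)). In LaTeX:

\[
\operatorname{rank}(H) \;\le\; \sum_{j} s_j \,\operatorname{rank}\!\big(\varphi_j(H)\big)
\ \ \forall\, H \le \mathbb{Z}^k
\quad\Longrightarrow\quad
|S| \;\le\; \prod_{j} \big|\varphi_j(S)\big|^{s_j},
\]
\[
\text{words\_moved} \;=\; \Omega\!\left(\frac{\#\text{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\right),
\qquad s_{\mathrm{HBL}} \;=\; \min \textstyle\sum_j s_j \ \text{subject to the HBL-LP}.
\]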

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^Ts  s.t.  s^TΔ ≥ 1^T
    Dual-HBL-LP:  maximize 1^Tx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 6: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura

Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick

bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik

bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects

bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays

(eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid, with axes i, j, k

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n*(c/P)^(1/2) x n*(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
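For illustration only (hypothetical helper, not the BG/P code), a serial numpy sketch of the three steps above on a logical (P/c)^(1/2) x (P/c)^(1/2) x c grid:

import numpy as np

# Serial sketch of 2.5D matmul's data flow. Assumption: P/c is a perfect square
# and the grid dimension divides n. Replication (step 1) is implicit here; each
# level k computes its 1/c-th of the m-summation (step 2), then the partial
# sums are reduced along the k-axis (step 3).
def matmul_25d(A, B, P=8, c=2):
    n = A.shape[0]
    q = int(round((P // c) ** 0.5))          # grid is q x q x c
    assert q * q * c == P and n % q == 0
    blk = n // q
    partial = np.zeros((c, n, n))
    for k in range(c):
        for i in range(q):
            for j in range(q):
                for m in range(k * q // c, (k + 1) * q // c):   # 1/c-th of the m's
                    partial[k, i*blk:(i+1)*blk, j*blk:(j+1)*blk] += (
                        A[i*blk:(i+1)*blk, m*blk:(m+1)*blk]
                        @ B[m*blk:(m+1)*blk, j*blk:(j+1)*blk])
    return partial.sum(axis=0)               # sum-reduce along the k-axis

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(matmul_25d(A, B), A @ B)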

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

12x faster

2.7x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for timing model:
 – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) * [ γT + βT/M^(1/2) + αT/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
 – γE, βE, αE = joules for same operations
 – δE = joules per word of memory used per sec
 – εE = joules per sec for leakage, etc.
• E(cP) = cP * { n^3/(cP) * [ γE + βE/M^(1/2) + αE/(m*M^(1/2)) ] + δE*M*T(cP) + εE*T(cP) } = E(P)
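A small numeric rendering of these formulas (illustrative only; the machine constants below are made-up placeholders and the helper name is hypothetical), showing that T(cP) drops by a factor of c while E(cP) stays constant:

# Timing/energy model from the slide: per-processor memory M stays fixed as
# P grows, and m is the message size.
def model(n, P, M, m,
          gT=1e-9, bT=1e-8, aT=1e-6,        # gamma_T, beta_T, alpha_T (made up)
          gE=1e-9, bE=1e-8, aE=1e-7,        # gamma_E, beta_E, alpha_E (made up)
          dE=1e-12, eE=1e-3):               # delta_E, epsilon_E (made up)
    flops = n**3 / P
    T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
    E = P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T)
    return T, E

n, m, P0 = 1 << 14, 1 << 10, 64
M0 = 3 * n**2 / P0                           # start with minimal memory: P*M = 3n^2
for c in (1, 2, 4, 8):
    T, E = model(n, c * P0, M0, m)
    print(c, T, E)                           # T scales like 1/c, E stays constant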

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
 • What is the minimum energy required for a computation?
 • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
 • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
 • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
 • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

39

Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

for i = 1 to n-1
   update column i
   update trailing matrix
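A direct numpy transcription of the loop above (illustration only; no pivoting, so it assumes nonzero pivots):

import numpy as np

def naive_lu(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.random.rand(5, 5) + 5 * np.eye(5)   # diagonally dominant, so no pivoting needed
L, U = naive_lu(A)
assert np.allclose(L @ U, A)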

Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive Approach:
   for i = 1 to n-1
      update column i
      update trailing matrix
• #words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
   for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
• #words moved = O(n^3 / M^(1/3))

• Recursive Approach:
   func factor(A)
      if A has 1 column, update it
      else
         factor(left half of A)
         update right half of A
         factor(right half of A)
• #words moved = O(n^3 / M^(1/2))

• None of these approaches
 • minimizes #messages
 • handles eig() or svd()
 • works in parallel
• Need more ideas!

TSQR QR of a Tall Skinny matrix

41

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

TSQR QR of a Tall Skinny matrix

42

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
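A minimal numpy sketch of the binary reduction tree above on 4 row-blocks (illustration only; a real TSQR keeps the local Q factors implicit and runs the tree across processors):

import numpy as np

def tsqr(W_blocks):
    # Leaves: local QR of each block, Wi = Qi0 * Ri0
    Rs = [np.linalg.qr(Wi)[1] for Wi in W_blocks]
    # Internal level: stack pairs of R factors and factor again
    R01 = np.linalg.qr(np.vstack(Rs[0:2]))[1]
    R11 = np.linalg.qr(np.vstack(Rs[2:4]))[1]
    # Root: one more QR gives the final R02
    return np.linalg.qr(np.vstack([R01, R11]))[1]

W = np.random.rand(4000, 50)
R = tsqr(np.array_split(W, 4))
# R agrees with a direct QR of W up to the sign of each row
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W)[1]))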

TSQR An Architecture-Dependent Algorithm

[Figure: different TSQR reduction trees on W = [W0; W1; W2; W3] –
 Parallel: binary tree, (R00, R10, R20, R30) → (R01, R11) → R02
 Sequential: flat/chain tree, R00 → R01 → R02 → R03
 Dual Core: a hybrid of the two]

Can choose reduction tree dynamically:
Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?

TSQR Performance Results
• Parallel
 – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
 – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
 – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
 – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
 – Grid – 4x on 4 cities (Dongarra et al)
 – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
 – "Infinite speedup" for out-of-core on PowerPC laptop
  • As little as 2x slowdown vs (predicted) infinite DRAM
  • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
   Choose b pivot rows of W1, call them W1'
   Choose b pivot rows of W2, call them W2'
   Choose b pivot rows of W3, call them W3'
   Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12   →  choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   →  choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234   →  choose b pivot rows

• Go back to W and use these b pivot rows
 • Move them to top, do LU without pivoting
 • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
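A rough Python sketch of one round of tournament pivoting as in the tree above (hypothetical helper names; each block proposes b candidate rows via ordinary partial pivoting, and candidate sets play off pairwise until b winners remain):

import numpy as np
import scipy.linalg as sla

def candidate_rows(block_rows, A, b):
    # LU with partial pivoting on the tall block; its first b pivot rows
    # are this block's candidates.
    P, L, U = sla.lu(A[block_rows])
    order = P.T.argmax(axis=1)            # permutation, as original row positions
    return [block_rows[i] for i in order[:b]]

def tournament_pivot_rows(A, b, nblocks=4):
    n = A.shape[0]
    groups = [list(r) for r in np.array_split(np.arange(n), nblocks)]
    cands = [candidate_rows(g, A, b) for g in groups]
    while len(cands) > 1:                 # binary reduction tree over candidates
        merged = []
        for i in range(0, len(cands), 2):
            merged.append(candidate_rows(cands[i] + cands[i + 1], A, b))
        cands = merged
    return cands[0]                       # b pivot rows for this panel

A = np.random.rand(64, 8)
print(tournament_pivot_rows(A, b=8))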

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: predicted speedup contours; x-axis = log2(p), y-axis = log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
 • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory: #Words Moved | and #Messages | Multiple Levels of Memory: #Words Moved | and #Messages
BLAS-3 | Usual blocked or recursive algorithms | | Usual blocked algorithms (nested) or recursive [Gustavson,97] |
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same) | (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08] | [GDX 08] partial pivoting | [Toledo,97], [GDX 08]? | [GDX 08]?
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08] | [Frens,Wise,03] (but 3x flops), [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]? | [Frens,Wise,03]?, [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] | [BDD11, BDK11] | [BDD11, BDK11] |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) log P |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n/P^(1/2)) log P
QR | ScaLAPACK | log P | (n/P^(1/2)) log P
Sym Eig, SVD | ScaLAPACK | log P | n/P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
   #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Can we do even better

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c*n^2 / P) per processor
• Increasing M reduces lower bounds:
   #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
   #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P – optimal?
LU | [DS11, SBD11] | polylog P | c^2 polylog P – optimal?
QR | Via Cholesky QR | polylog P | c^2 polylog P – optimal?
Sym Eig, SVD | ? | ? | ?
Nonsym Eig | ? | ? | ?

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
 – A = A^T is banded
 – T tridiagonal
 – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
 – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
 – It can communicate even less

Successive Band Reduction

[Figure sequence (animation): a symmetric band matrix with bandwidth b+1; each sweep Qi zeroes out a block of c columns and d diagonals, creating a (d+c)-wide bulge (labeled 1, 2, 3, …) that is chased down the band by further two-sided transformations Qi, Qi^T.]

b = bandwidth
c = #columns
d = #diagonals
Constraint: c + d ≤ b

Only need to zero out leading parallelogram of each trapezoid!

Conventional vs CA - SBR

Conventional vs Communication-Avoiding:
 • Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.
 • Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
 • Right choices reduce #words_moved by factor M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 1.7x on Intel Gainestown, vs MKL 10.0
 – n=12000, b=500, 8 threads
• Up to 1.2x on Intel Westmere, vs MKL 10.3
 – n=12000, b=200, 10 threads
• Up to 2.5x on AMD Budapest, vs ACML 4.4
 – n=9000, b=500, 4 threads
• Up to 3.0x on AMD Magny-Cours, vs ACML 4.4
 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
 – Best sequential speedup vs MKL: 1.9x
 – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
 – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
 #words_moved = Ω( M*(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
 #words_moved = Ω( M*(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
 #words_moved = Ω( M*(n/M^(1/2))^ω / P )

BFS step vs DFS step:
 BFS: Runs all 7 multiplies in parallel, each on P/7 processors. Needs 7/4 as much memory.
 DFS: Runs all 7 multiplies sequentially, each on all P processors. Needs 1/4 as much memory.

CAPS:
   if EnoughMemory and P ≥ 7 then
      BFS step
   else
      DFS step
   end if

Communication-Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 2.4x – 18.4x (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
 – Algorithms:
  • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
  • All-pairs-shortest-path, …
  • Both 2D (c=1) and 2.5D (c>1)
  • But only bandwidth may decrease with c>1, not latency
  • Sparse matrices
 – Platforms:
  • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
 – Software:
  • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
 – Qbox (with LLNL, IBM): molecular dynamics
 – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
 • Lower bounds on communication
 • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods"
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
 – Parallel implementation on p processors
  • Conventional: O(k log p) messages (k SpMV calls, dot prods)
  • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
 – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure sequence (animation): components 1, 2, 3, 4, …, 32 of the vectors x, A·x, A^2·x, A^3·x drawn as rows of a dependency diagram, filled in step by step.]

• Sequential Algorithm: Steps 1, 2, 3, 4, each computing one chunk of all k vectors before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
 – Each processor communicates once with neighbors
 – Each processor works on (overlapping) trapezoid

Same idea works for general sparse matrices:
 Simple block-row partitioning → (hyper)graph partitioning
 Top-to-bottom processing → Traveling Salesman Problem
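For reference, a plain (non-communication-avoiding) version of the kernel is just k SpMVs; the CA versions compute the same vectors but block them over well-partitioned pieces of A so each piece of all k vectors is produced with one read of that piece (sequential) or one neighbor exchange (parallel). The sketch below is illustrative only:

import numpy as np
from scipy.sparse import diags

# Reference matrix powers kernel: given sparse A and vector x,
# return [A@x, A^2@x, ..., A^k@x].
def matrix_powers(A, x, k):
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])
    return V[1:]

n, k = 32, 3
A = diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], [-1, 0, 1], format="csr")
x = np.random.rand(n)
Ax, A2x, A3x = matrix_powers(A, x, k)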

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^kv ]
   [Q, R] = TSQR(W)                  … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost!

88

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge
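A toy numpy sketch of one CA-GMRES cycle with the (ill-conditioned) monomial basis, using an ordinary QR in place of TSQR and a direct least-squares solve in place of building H from R; illustration only, with hypothetical helper names. A real CA-GMRES uses the matrix powers kernel, TSQR, and a Newton or Chebyshev basis to avoid the precision loss noted above.

import numpy as np

def ca_gmres_cycle(A, b, k):
    W = np.empty((len(b), k + 1))
    W[:, 0] = b / np.linalg.norm(b)
    for j in range(k):                       # matrix powers: w_{j+1} = A w_j
        W[:, j + 1] = A @ W[:, j]
    Q, R = np.linalg.qr(W)                   # stands in for TSQR(W)
    # Minimize || A (Q[:, :k] y) - b ||_2 over the k-dimensional Krylov subspace
    y, *_ = np.linalg.lstsq(A @ Q[:, :k], b, rcond=None)
    return Q[:, :k] @ y

n, k = 200, 8
A = np.eye(n) + 0.1 * np.random.rand(n, n)
b = np.random.rand(n)
x = ca_gmres_cycle(A, b, k)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # residual shrinks as k grows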

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Basis:               Naive    Monomial                               Newton        Chebyshev
Replacement Its.:    74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

With Residual Replacement (RR) à la Van der Vorst and Ye

Tuning space for Krylov Methods

                           Indices:
Nonzero entries:           Explicit (O(nnz))      Implicit (o(nnz))
Explicit (O(nnz))          CSR and variations     Vision, climate, AMR, …
Implicit (o(nnz))          Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
 • Explicit indices or nonzero entries cause most communication, along with vectors
 • Ex: With stencils (all implicit), all communication is for vectors

• Operations:
 • [x, Ax, A^2x, …, A^kx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
 • Number of columns in x
 • [x, Ax, A^2x, …, A^kx ] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky ], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky ]
 • Return all vectors or just the last one

• Cotuning and/or interleaving:
 • W = [x, Ax, A^2x, …, A^kx ] and {TSQR(W) or W^TW or …}
 • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
 – Many different algorithms reorganized
  • More underway, more to be done
 – Need to recognize stable variants more easily
 – Preconditioning
  • Hierarchically Semiseparable Matrices
 – Autotuning and synthesis
  • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
 • Lower bounds on communication
 • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naïve code:
   for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
   for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
         i = i1+i2, j = j1+j2, k = k1+k2
         C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
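A direct numpy rendering of the blocked loop nest above (block size b stands in for M^(1/2); the values here are just illustrative):

import numpy as np

def blocked_matmul(A, B, b):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # One b x b matmul per (i1, j1, k1): three blocks in fast memory
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

n, b = 64, 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)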

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        [     1  0  1   ]   A
  Δ =   [     0  1  1   ]   B
        [     1  1  0   ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
 – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
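A quick check of this LP with scipy (added for illustration; not part of the tutorial code):

import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
# linprog minimizes, so minimize -1^T x subject to Delta x <= 1, x >= 0
res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=[(0, None)] * 3)
print(res.x, -res.fun)          # -> [0.5, 0.5, 0.5], s = 1.5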

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
        [     1  0   ]   F
  Δ =   [     1  0   ]   P(i)
        [     0  1   ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
 – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
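An illustrative blocked version of this loop (hypothetical helper; M here is just a small tile size standing in for the fast-memory bound), loading each pair of particle tiles once and reusing it for all pairwise interactions:

import numpy as np

def blocked_forces(P, M, force=lambda p, q: q - p):
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, M):
        Pi = P[i0:i0+M]                  # tile of "target" particles
        for j0 in range(0, n, M):
            Pj = P[j0:j0+M]              # tile of "source" particles
            # All-pairs interactions between the two resident tiles
            F[i0:i0+M] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
    return F

P = np.random.rand(1024)
F = blocked_forces(P, M=64)
assert np.allclose(F, (P[None, :] - P[:, None]).sum(axis=1))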

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
     A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

             i1 i2 i3 i4 i5 i6
        [     1  0  1  0  0  1   ]   A1
        [     1  1  0  1  0  0   ]   A2
  Δ =   [     0  1  1  0  1  0   ]   A3
        [     0  0  1  1  0  1   ]   A3, A4
        [     0  1  0  0  0  1   ]   A5
        [     1  0  0  1  1  0   ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
 – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
   for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
   => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
   for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
   => for (i1,i2,…,ik) in S = subset of Z^k,
      access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations (= #points in S) given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
   for all subgroups H < Z^k:  rank(H) ≤ Σj sj * rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
   #words_moved = Ω( #iterations / M^(s_HBL - 1) )
 – Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
   |S| ≤ Πj |φj(S)|^sj
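As a worked instance (added here for illustration): for matmul the projections are the coordinate maps φ_C(i,j,k) = (i,j), φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), and s = (1/2, 1/2, 1/2) is feasible for the HBL-LP, so the theorem specializes to the Loomis-Whitney inequality

|S| \le |\varphi_C(S)|^{1/2} \, |\varphi_A(S)|^{1/2} \, |\varphi_B(S)|^{1/2}

Hence a segment of execution that touches at most M entries of each of C, A, B performs at most M^(3/2) loop iterations, recovering #words_moved = Ω(n^3 / M^(1/2)).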

Is this bound attainable? (1/2)

• But first: Can we write it down?
 – Thm: (bad news) Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
 – Thm: (good news) Can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
 – Thm: (good news) Easy to approximate
  • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
  • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
   HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
   Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
   Then, for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
 – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
 – Live broadcast in Spring 2013
  • www.cs.berkeley.edu/~demmel
 – Prerecorded version planned in Spring 2013
  • www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 7: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull Michael Christ Jack Dongarra Ioana Dumitriu Armando Fox David Gleich Laura

Grigori Ming Gu Mike Heroux Mark Hoemmen Olga Holtz Kurt Keutzer Julien Langou Tom Scanlon Michelle Strout Sam Williams Hua Xiang Kathy Yelick

bull Michael Anderson Grey Ballard Austin Benson Abhinav Bhatele Aydin Buluc Erin Carson Maryam Dehnavi Michael Driscoll Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Marghoob Mohiyuddin Oded Schwartz Edgar Solomonik

bull Other members of ParLab BEBOP CACHE EASI FASTMath MAGMA PLASMA TOPS projects

bull Thanks to NSF DOE UC Discovery Intel Microsoft Mathworks National Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays

(eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2·x, …, A^k·x]

• Replace k iterations of y = A·x with [Ax, A^2·x, …, A^k·x]
• Example: A tridiagonal, n=32, k=3 (figure: values of x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32)
• Works for any "well-partitioned" A


Sequential Algorithm (same example, A tridiagonal, n=32, k=3): the figure is traversed in Steps 1, 2, 3, 4, one block of columns at a time; each step reads one block of A and of x and computes its part of A·x, A^2·x, A^3·x before moving to the next block.

Parallel Algorithm (same example, A tridiagonal, n=32, k=3): the rows are split among Proc 1, Proc 2, Proc 3, Proc 4; each processor communicates once with its neighbors, and each processor works on an (overlapping) trapezoid of the figure.

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2·x, …, A^k·x]

Same idea works for general sparse matrices:
– Simple block-row partitioning → (hyper)graph partitioning
– Top-to-bottom processing → Traveling Salesman Problem
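A minimal sequential sketch of the kernel for a banded matrix, using SciPy: each block of rows is extended by k·(bandwidth) redundant "ghost" rows, that extended block of A and of x is read once, and all k local products are formed before moving on. The block partitioning, the helper name, and the use of SciPy CSR matrices are assumptions of this illustration, not the tuned Akx kernel from the talk.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_blocked(A, x, k, nblocks, bw):
    """Compute [A@x, A^2@x, ..., A^k@x], reading each block of A and x once.

    A: n-by-n matrix (CSR) with bandwidth bw.  Extending each block of rows
    by k*bw "ghost" rows makes every owned entry of every power exact.
    """
    n = A.shape[0]
    out = np.zeros((k, n))
    cuts = np.linspace(0, n, nblocks + 1, dtype=int)
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        glo, ghi = max(0, lo - k * bw), min(n, hi + k * bw)  # ghost region
        A_loc = A[glo:ghi, glo:ghi]      # read this block of A once
        v = x[glo:ghi].copy()            # read this block of x once
        for j in range(k):               # k local SpMVs, no further slow-memory traffic
            v = A_loc @ v
            out[j, lo:hi] = v[lo - glo:hi - glo]
    return out

# Check against k plain SpMVs on a random tridiagonal A (n=32, k=3, as in the figure)
n, k = 32, 3
A = sp.diags([np.random.rand(n - 1), np.random.rand(n), np.random.rand(n - 1)],
             [-1, 0, 1], format="csr")
x = np.random.rand(n)
ref, v = [], x
for _ in range(k):
    v = A @ v
    ref.append(v)
assert np.allclose(matrix_powers_blocked(A, x, k, nblocks=4, bw=1), np.array(ref))
```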

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k·b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                … SpMV
    MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt; update v(i), H
  endfor
  solve least-squares problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2·v, …, A^k·v ]
  [Q,R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve least-squares problem with H

• Oops – W from power method, precision lost!
• Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
• "Monomial" basis [Ax, …, A^k·x] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge.
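The two building blocks named here, the Krylov basis W and TSQR, can be mocked up in a few lines of NumPy. This is only a sketch of the reduction idea behind TSQR (QR each row block, stack the small R factors, QR once more), with a made-up diagonal A standing in for a sparse matrix; it is not the communication-optimal implementation from the talk.

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall-skinny W via one reduction level:
    QR each row block, stack the small R factors, QR the stack."""
    Rs = [np.linalg.qr(B, mode="r") for B in np.array_split(W, nblocks, axis=0)]
    return np.linalg.qr(np.vstack(Rs), mode="r")

rng = np.random.default_rng(0)
n, k = 1024, 5
d = np.linspace(1.0, 2.0, n)        # A = diag(d), a stand-in for a sparse matrix
v = rng.standard_normal(n)

# Monomial basis W = [v, Av, ..., A^k v]; CA-GMRES uses a Newton or Chebyshev
# basis instead, because the monomial basis loses precision, as noted above.
W = np.empty((n, k + 1))
W[:, 0] = v
for j in range(k):
    W[:, j + 1] = d * W[:, j]

R = tsqr_r(W)
assert np.allclose(R.T @ R, W.T @ W)   # R is a genuine R factor of W (R^T R = W^T W)
```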

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR) à la Van der Vorst and Ye:

Basis:             Naive     Monomial                               Newton        Chebyshev
Replacement Its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                        Nonzero entries:
  Indices:              Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian        Stencils

  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors, or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall: optimal sequential Matmul
• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (b x b matmul on the inner blocks):
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does the 1/2 come from?
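To make the counting concrete, here is a small instrumented NumPy version of the blocked loop. The traffic model (a b×b block of A, B, or C costs b² words each time it is read from or written to slow memory) mirrors the count on the earlier blocked-matmul slide; the sizes are made-up illustrative values, not a performance experiment.

```python
import numpy as np

def blocked_matmul(A, B, b):
    """C = A @ B by b-by-b blocks; also count words moved between slow and
    fast memory, charging b*b words per block of A, B, or C read or written."""
    n = A.shape[0]
    assert n % b == 0
    C = np.zeros((n, n))
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b].copy(); words += b * b     # read C(i,j)
            for kk in range(0, n, b):
                words += 2 * b * b                           # read A(i,k), B(k,j)
                Cij += A[i:i+b, kk:kk+b] @ B[kk:kk+b, j:j+b]
            C[i:i+b, j:j+b] = Cij; words += b * b            # write C(i,j)
    return C, words

n = 256
M = 3 * 64 * 64                 # fast memory that holds three 64x64 blocks (example)
b = int((M / 3) ** 0.5)         # b = (M/3)^(1/2)
A, B = np.random.rand(n, n), np.random.rand(n, n)
C, words = blocked_matmul(A, B, b)
assert np.allclose(C, A @ B)
print(words, "==", 2 * n**3 // b + 2 * n**2)   # 2n^3/b + 2n^2 words moved
```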

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in a matrix Δ:

           i  j  k
         [ 1  0  1 ]   A
    Δ =  [ 0  1  1 ]   B
         [ 1  1  0 ]   C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
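This little LP can be solved mechanically. The sketch below uses SciPy's linprog (an assumption of this illustration, not something from the talk) with the Δ matrices transcribed from these slides, and recovers x = [1/2, 1/2, 1/2] for matmul and x = [1, 1] for the direct n-body problem treated next.

```python
import numpy as np
from scipy.optimize import linprog

def hbl_exponents(delta):
    """Maximize 1^T x subject to delta @ x <= 1, x >= 0 (linprog minimizes, so negate c)."""
    delta = np.asarray(delta, dtype=float)
    res = linprog(c=-np.ones(delta.shape[1]), A_ub=delta, b_ub=np.ones(delta.shape[0]),
                  bounds=[(0, None)] * delta.shape[1])
    return res.x, res.x.sum()        # tile exponents x and s = 1^T x

delta_matmul = [[1, 0, 1],           # A(i,k)
                [0, 1, 1],           # B(k,j)
                [1, 1, 0]]           # C(i,j)
delta_nbody = [[1, 0],               # F(i) and P(i)
               [0, 1]]               # P(j)

print(hbl_exponents(delta_matmul))   # -> x ~ [0.5, 0.5, 0.5], s = 1.5
print(hbl_exponents(delta_nbody))    # -> x ~ [1.0, 1.0], s = 2.0
```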

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in a matrix Δ:

           i  j
         [ 1  0 ]   F
    Δ =  [ 1  0 ]   P(i)
         [ 0  1 ]   P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1). Attained by block sizes M^xi, M^xj = M^1, M^1
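In sequential terms, "block sizes M^1, M^1" means: hold one tile of targets and one tile of sources (O(M) words in total) in fast memory and perform all the tile² interactions between them before loading the next tile. The sketch below illustrates that tiling with a toy 1-D "force"; the sizes, the force law, and the function name are made-up for the example.

```python
import numpy as np

def forces_tiled(P, tile):
    """F(i) += force(P(i), P(j)) over all i, j, processed in tile-by-tile blocks,
    so only O(tile) particles are resident while tile**2 interactions run."""
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, tile):
        Pi = P[i0:i0 + tile]                   # load a tile of targets
        Fi = np.zeros_like(Pi)
        for j0 in range(0, n, tile):
            Pj = P[j0:j0 + tile]               # load a tile of sources
            Fi += (Pi[:, None] - Pj[None, :]).sum(axis=1)   # toy "force"
        F[i0:i0 + tile] = Fi                   # write the tile of forces back
    return F

P = np.random.rand(2048)
assert np.allclose(forces_tiled(P, tile=256),
                   (P[:, None] - P[None, :]).sum(axis=1))
```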

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record the array indices in a matrix Δ:

           i1 i2 i3 i4 i5 i6
         [ 1  0  1  0  0  1 ]   A1
         [ 1  1  0  1  0  0 ]   A2
    Δ =  [ 0  1  1  0  1  0 ]   A3
         [ 0  0  1  1  0  1 ]   A3, A4
         [ 0  1  0  0  0  1 ]   A5
         [ 1  0  0  1  1  0 ]   A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
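A quick check that the stated solution is feasible and gives s = 15/7 (Δ transcribed from the slide; the A5 row follows from the subscripts of A5(i2,i6)):

```python
import numpy as np

delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
x = np.array([2, 3, 1, 2, 3, 4]) / 7
assert np.allclose(delta @ x, 1)        # every constraint is tight
assert np.isclose(x.sum(), 15 / 7)      # s_HBL = 15/7, so the exponent is s-1 = 8/7
```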

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3; access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k; access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 9: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing arrays

(eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all "direct" linear algebra
• Let M = "fast" memory size (per processor). Then
    #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
    #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
  (the message bound follows since #messages_sent ≥ #words_moved / largest_message_size, and a message holds at most M words)
• Parallel case: assume either load or memory balanced
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

(picture: C(i,j) += row A(i,:) times column B(:,j))

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 x (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 x (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 x (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    … b^2 x (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
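A minimal numpy sketch of the blocked algorithm above, with the tile size b chosen so three b-by-b blocks fit in a fast memory of M words; the function name and the way M is passed in are illustrative, not from the slides.

    import numpy as np

    def blocked_matmul(A, B, C, M):
        # choose b so that 3*b^2 <= M, i.e. three b-by-b tiles fit in fast memory
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # one b-by-b "matrix multiply on blocks"
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C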

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)
A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not panacea
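A runnable numpy version of the RMM pseudocode above; a sketch only, assuming n is a power of two (a practical implementation would stop the recursion at a tuned cutoff rather than at 1x1 blocks).

    import numpy as np

    def rmm(A, B):
        n = A.shape[0]                 # assumes n is a power of 2
        if n == 1:
            return A * B
        k = n // 2
        C = np.empty_like(A)
        C[:k, :k] = rmm(A[:k, :k], B[:k, :k]) + rmm(A[:k, k:], B[k:, :k])
        C[:k, k:] = rmm(A[:k, :k], B[:k, k:]) + rmm(A[:k, k:], B[k:, k:])
        C[k:, :k] = rmm(A[k:, :k], B[:k, :k]) + rmm(A[k:, k:], B[k:, :k])
        C[k:, k:] = rmm(A[k:, :k], B[:k, k:]) + rmm(A[k:, k:], B[k:, k:])
        return C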

For big speedups, see the SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm

How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P=16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij
• C = A * B, with C, A, and B each laid out on the same 4x4 grid:
      P00 P01 P02 P03
      P10 P11 P12 P13
      P20 P21 P22 P23
      P30 P31 P32 P33

SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (lawns 96, 100)
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) * b * n/P^(1/2),  #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

° Total #words = log P * n^2 / P^(1/2)
  ° Within factor of log P of lower bound
  ° (more complicated implementation removes log P factor)
° Total #messages = log P * n/b
  ° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
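A minimal sketch of the SUMMA loop above using mpi4py (not the PBLAS implementation). It assumes P is a perfect square, that sqrt(P) divides n, and that the panel width b divides n/sqrt(P) so each panel has a single owner; all names are illustrative.

    from mpi4py import MPI
    import numpy as np

    def summa(A_loc, B_loc, n, b, comm=MPI.COMM_WORLD):
        q = int(round(comm.Get_size() ** 0.5))   # q x q process grid
        i, j = divmod(comm.Get_rank(), q)        # my grid coordinates
        row = comm.Split(color=i, key=j)         # communicator for my processor row
        col = comm.Split(color=j, key=i)         # communicator for my processor column
        nloc = n // q
        C_loc = np.zeros((nloc, nloc))
        for k0 in range(0, n, b):                # global panel of A columns / B rows
            owner = k0 // nloc                   # grid index owning this panel
            off = k0 - owner * nloc
            Acol = np.empty((nloc, b))
            Brow = np.empty((b, nloc))
            if j == owner:
                Acol[:] = A_loc[:, off:off + b]
            if i == owner:
                Brow[:] = B_loc[off:off + b, :]
            row.Bcast(Acol, root=owner)          # broadcast A panel along my row
            col.Bcast(Brow, root=owner)          # broadcast B panel along my column
            C_loc += Acol @ Brow                 # local rank-b update
        return C_loc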

Can we do better?
• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c*n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) * P^(1/2)) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…

2.5D Matrix Multiplication
• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
  (Example: P = 32, c = 2)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Strong-scaling plot; annotations: "12x faster", "2.7x faster" than 2D]
Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) * [ γT + βT/M^(1/2) + αT/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
  – E(cP) = cP * { n^3/(cP) * [ γE + βE/M^(1/2) + αE/(m*M^(1/2)) ] + δE*M*T(cP) + εE*T(cP) } = E(P)
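A direct transcription of the timing and energy model above as Python functions; the coefficients (γ, β, α, δ, ε) and the message size m are inputs the user must supply — nothing here is measured data, and the function names are illustrative.

    def T_model(n, c, P, M, gamma_T, beta_T, alpha_T, m):
        # T(cP) = n^3/(cP) * [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ]
        return (n**3 / (c * P)) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    def E_model(n, c, P, M, gamma_E, beta_E, alpha_E, delta_E, eps_E, m, T):
        # E(cP) = cP * { n^3/(cP)*[gamma_E + beta_E/M^(1/2) + alpha_E/(m*M^(1/2))]
        #               + delta_E*M*T(cP) + eps_E*T(cP) },  with T = T_model(...)
        per_proc = (n**3 / (c * P)) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5)) \
                   + delta_E * M * T + eps_E * T
        return c * P * per_proc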

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Extending to rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

i.e., for i = 1 to n-1: update column i, then update trailing matrix
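The same naïve LU written as a small numpy routine (in-place, no pivoting) — a sketch that mirrors the pseudocode above rather than a library-quality factorization; assumes A holds floats.

    import numpy as np

    def naive_lu(A):
        # overwrites A with L (unit lower triangle, stored below the diagonal) and U
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update of trailing matrix
        return A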

Communication in sequential one-sided factorizations (LU, QR, …)
• Naive approach:
    for i = 1 to n-1: update column i, update trailing matrix
  • #words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b - 1: update block i of b columns, update trailing matrix
  • #words moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A); update right half of A; factor(right half of A)
  • #words moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]              … independent QR of each block row
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01      [ R20 ; R30 ] = Q11·R11       … QR of stacked pairs of R factors

[ R01 ; R11 ] = Q02·R02                                    … final QR at the root of the tree

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
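A small numpy sketch of the TSQR reduction above: local QRs at the leaves, then pairwise QRs of stacked R factors up a binary tree. It returns only the final R (the Q factors are implicitly the tree of local Q's); the block count and function name are illustrative.

    import numpy as np

    def tsqr_R(W, nblocks=4):
        # leaf stage: QR of each block row, keep only the R factors
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks, axis=0)]
        # reduction tree: repeatedly stack pairs of R's and take their QR
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]   # equals the R factor of W (up to column signs)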

TSQR: An Architecture-Dependent Algorithm
• Parallel: W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → combine pairs to get R01, R11 → combine to get R02 (binary reduction tree)
• Sequential: W = [W0; W1; W2; W3] → R00 from W0 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03 (flat tree, one block in fast memory at a time)
• Dual Core: a hybrid of the two (two chains, merged at the end)
• Can choose reduction tree dynamically
• Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
• Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[ W1' ; W2' ] = P12·L12·U12        [ W3' ; W4' ] = P34·L34·U34
  Choose b pivot rows, call them W12' and W34'

[ W12' ; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
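A numpy/scipy sketch of the tournament above: each round runs LU with partial pivoting on a stack of candidate rows and keeps the b winners. The block splitting, helper names, and use of scipy.linalg.lu are illustrative choices, not the TSLU implementation.

    import numpy as np
    from scipy.linalg import lu

    def block_pivot_rows(W, b):
        # indices (into W) of the first b pivot rows chosen by partial pivoting
        P, L, U = lu(W)                      # W = P @ L @ U
        return np.argmax(P, axis=0)[:b]      # original row moved to position k, for k < b

    def tournament_pivot_rows(W, b, nblocks=4):
        # leaves: candidate pivot rows from each block row of the tall matrix W
        cand = [blk[block_pivot_rows(W[blk], b)]
                for blk in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(cand) > 1:                 # pairwise "playoffs" up the reduction tree
            nxt = []
            for i in range(0, len(cand), 2):
                idx = np.concatenate(cand[i:i+2])
                nxt.append(idx[block_pivot_rows(W[idx], b)])
            cand = nxt
        return cand[0]                       # b pivot rows to move to the top of W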

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot of predicted speedup; horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU: With and Without Pivoting
[Performance plots]

Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious algorithms are underlined; green are ours; ? is unknown/future work

• BLAS-3
  – 2 levels of memory: Usual blocked or recursive algorithms
  – Multiple levels of memory: Usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky
  – 2 levels: #words – LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; #messages – [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
  – Multiple levels: (←same), (←same)
• LU with pivoting
  – 2 levels: #words – LAPACK (sometimes), [Toledo,97], [GDX 08]; #messages – [GDX 08] with partial pivoting
  – Multiple levels: #words – [Toledo,97], [GDX 08]; #messages – [GDX 08]
• QR, rank-revealing
  – 2 levels: #words – LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops); #messages – [DGHL08]
  – Multiple levels: #words – [Elmroth,Gustavson,98], [DGHL08]; #messages – [Frens,Wise,03], [DGHL08]
• Eig, SVD
  – 2 levels: #words – Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]; #messages – [BDD11, BDK11]
  – Multiple levels: [BDD11, BDK11]

Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2/P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Conventional approach – factor exceeding lower bound for (#words_moved, #messages):
  Matrix Multiply   [Cannon, 69]    (1, 1)
  Cholesky          ScaLAPACK       (log P, log P)
  LU                ScaLAPACK       (log P, n·log P / P^(1/2))
  QR                ScaLAPACK       (log P, n·log P / P^(1/2))
  Sym Eig, SVD      ScaLAPACK       (log P, n / P^(1/2))
  Nonsym Eig        ScaLAPACK       (P^(1/2)·log P, n·log P)

Better – factor exceeding lower bound for (#words_moved, #messages):
  Matrix Multiply   [Cannon, 69]    (1, 1)
  Cholesky          ScaLAPACK       (log P, log P)
  LU                [GDX10]         (log P, log P)
  QR                [DGHL08]        (log P, log^3 P)
  Sym Eig, SVD      [BDD11]         (log P, log^3 P)
  Nonsym Eig        [BDD11]         (log P, log^3 P)

Can we do even better?
• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2/P) per processor
• Increasing M reduces lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2)·P^(1/2)) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Factor exceeding lower bound for (#words_moved, #messages):
  Matrix Multiply   [DS11, SBD11]      (polylog P, polylog P)
  Cholesky          [SD11, in prog.]   (polylog P, c^2·polylog P – optimal?)
  LU                [DS11, SBD11]      (polylog P, c^2·polylog P – optimal?)
  QR                Via Cholesky QR    (polylog P, c^2·polylog P – optimal?)
  Sym Eig, SVD      ?
  Nonsym Eig        ?

Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction
(b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b)
[Sequence of band-matrix pictures: each step applies an orthogonal transformation Qi from the left and Qi^T from the right to the (b+1)-wide band, annihilating a c-column-wide, (d+1)-deep parallelogram (step 1); this creates a (d+c)-wide bulge that is chased down the band by subsequent transformations Q2, Q3, Q4, Q5, … (steps 2–6)]
Only need to zero out leading parallelogram of each trapezoid

Conventional vs CA-SBR
• Conventional: touch all data 4 times.  Communication-avoiding: touch all data once.
• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce #words_moved by factor M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

  Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
  Strassen's O(n^lg 7) matmul:   words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
  Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

  CAPS:
    If EnoughMemory and P ≥ 7
      then BFS step
      else DFS step
    end if

• In practice, how to best interleave BFS and DFS is a "tuning parameter"
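For reference, a plain sequential numpy Strassen recursion: the seven products M1…M7 below are exactly the multiplies a CAPS BFS step would run in parallel on P/7 processors each, and a DFS step would run one after another on all P processors. This is a sketch assuming the matrix dimension is a power of two times the cutoff; it is not the CAPS code.

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:                      # switch to classical matmul on small blocks
            return A @ B
        k = n // 2
        A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)   # the 7 recursive multiplies
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:k, :k] = M1 + M4 - M5 + M7
        C[:k, k:] = M3 + M5
        C[k:, :k] = M2 + M4
        C[k:, k:] = M1 - M2 + M3 + M6
        return C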

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
• Speedups: 24%–184% (over previous Strassen-based algorithms)
• For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al, 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra
• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Picture: the 32 entries of x, A·x, A^2·x, A^3·x; entry i of A^(j+1)·x depends on entries i-1, i, i+1 of A^j·x]
• Works for any "well-partitioned" A
• Sequential algorithm: Steps 1–4 process one block of columns at a time, so each block of A and x is moved from slow to fast memory only once
• Parallel algorithm: Proc 1, …, Proc 4 each own a contiguous block of columns
  – Each processor communicates once with neighbors, then works on its (overlapping) trapezoid
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
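A small scipy sketch of the two ingredients above: the kernel itself (k SpMVs kept as columns), and, for the parallel version, the set of "ghost" rows of x a processor must receive up front so its k local steps need no further communication. The partitioning and helper names are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        # columns x, Ax, ..., A^k x (naive version: k SpMVs)
        V = np.empty((x.size, k + 1)); V[:, 0] = x
        for i in range(k):
            V[:, i + 1] = A @ V[:, i]
        return V

    def ghost_rows(A, my_rows, k):
        # rows of x (outside my_rows) needed to take k steps with no communication
        pattern = (A != 0).tocsr()
        need, frontier = set(my_rows), set(my_rows)
        for _ in range(k):
            reached = set(pattern[sorted(frontier)].indices)  # one more hop in the graph of A
            frontier = reached - need
            need |= reached
        return sorted(need - set(my_rows))

    # tridiagonal example from the slides: n=32, k=3; the owner of rows 8..15
    # must also fetch rows 5,6,7 and 16,17,18 -- its overlapping "trapezoid"
    A = sp.diags([1, 2, 1], [-1, 0, 1], shape=(32, 32), format='csr')
    print(ghost_rows(A, range(8, 16), k=3))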

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q, R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!
• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
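A tiny numpy experiment (synthetic data, not from the slides) showing why the monomial basis loses precision: after a dozen SpMVs the columns of W all point in nearly the same direction, so its condition number explodes before TSQR can orthogonalize it. This is the motivation for switching to a better-conditioned polynomial basis (e.g. Newton or Chebyshev, with shifts taken from eigenvalue estimates).

    import numpy as np

    n, k = 200, 12
    rng = np.random.default_rng(0)
    A = np.diag(np.linspace(1, 100, n))       # simple stand-in for a sparse operator
    x = rng.standard_normal(n)

    W = np.empty((n, k + 1)); W[:, 0] = x
    for i in range(k):
        W[:, i + 1] = A @ W[:, i]             # monomial basis: x, Ax, ..., A^k x

    print(np.linalg.cond(W))                  # enormous: basis is numerically rank-deficient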

Speed ups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye – iterations at which a replacement occurs, per basis:
  Basis:             Naive     Monomial                                Newton         Chebyshev
  Replacement its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit) all communication is for vectors

                             Nonzero entries:
  Indices:                   Explicit (O(nnz))        Implicit (o(nnz))
  Explicit (O(nnz))          CSR and variations       Vision, climate, AMR, …
  Implicit (o(nnz))          Graph Laplacian          Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:
               i  j  k
          [    1  0  1  ]   A
    Δ  =  [    0  1  1  ]   B
          [    1  1  0  ]   C
• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω( n^3 / M^(s-1) ) = Ω( n^3 / M^(1/2) )
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
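The LP above is small enough to solve directly; here is a scipy.optimize sketch (the Δ below is the matmul example from this slide — for another code you would swap in its index matrix).

    import numpy as np
    from scipy.optimize import linprog

    # rows of Delta: array refs A(i,k), B(k,j), C(i,j); columns: loop indices i, j, k
    Delta = np.array([[1, 0, 1],
                      [0, 1, 1],
                      [1, 1, 0]])
    # maximize 1^T x subject to Delta @ x <= 1, x >= 0 (linprog minimizes, so negate c)
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
    print(res.x, -res.fun)   # x = [0.5, 0.5, 0.5], s = 1.5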

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:
               i  j
          [    1  0  ]   F
    Δ  =  [    1  0  ]   P(i)
          [    0  1  ]   P(j)
• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω( n^2 / M^(s-1) ) = Ω( n^2 / M^1 )
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
• 11.8x speedup
• K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
               i1 i2 i3 i4 i5 i6
          [    1  0  1  0  0  1  ]   A1
          [    1  1  0  1  0  0  ]   A2
    Δ  =  [    0  1  1  0  1  0  ]   A3
          [    0  0  1  1  0  1  ]   A3, A4
          [    0  1  0  0  0  1  ]   A5
          [    1  0  0  1  1  0  ]   A6
• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω( n^6 / M^(s-1) ) = Ω( n^6 / M^(8/7) )
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
    Access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound
• Thm (Christ, Tao, Carbery, Bennett): Given S a subset of Z^k, group homomorphisms φ1, φ2, …, and exponents s1, …, sm satisfying the condition below, then
      |S| ≤ Π_j |φ_j(S)|^(s_j)
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
      for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by φ_j, choose s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
      #words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett

Is this bound attainable? (1/2)
• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g. all φ_j = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
      HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
      Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 10: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Conventional approach – factor by which each algorithm exceeds the lower bounds:
  Algorithm        Reference       words_moved       messages
  Matrix Multiply  [Cannon, 69]    1                 1
  Cholesky         ScaLAPACK       log P             log P
  LU               ScaLAPACK       log P             (n/P^(1/2)) log P
  QR               ScaLAPACK       log P             (n/P^(1/2)) log P
  Sym Eig, SVD     ScaLAPACK       log P             n/P^(1/2)
  Nonsym Eig       ScaLAPACK       P^(1/2) log P     n log P

Better (communication-avoiding) algorithms – factor exceeding the lower bounds:
  Algorithm        Reference       words_moved       messages
  Matrix Multiply  [Cannon, 69]    1                 1
  Cholesky         ScaLAPACK       log P             log P
  LU               [GDX10]         log P             log P
  QR               [DGHL08]        log P             log^3 P
  Sym Eig, SVD     [BDD11]         log P             log^3 P
  Nonsym Eig       [BDD11]         log P             log^3 P

Can we do even better?
• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2/P) per processor
• Increasing M reduces lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

  Algorithm        Reference          words_moved    messages
  Matrix Multiply  [DS11, SBD11]      polylog P      polylog P
  Cholesky         [SD11, in prog.]   polylog P      c^2 polylog P – optimal!
  LU               [DS11, SBD11]      polylog P      c^2 polylog P – optimal!
  QR               Via Cholesky QR    polylog P      c^2 polylog P – optimal!
  Sym Eig, SVD
  Nonsym Eig

Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction (SBR)

(Figure sequence, one diagram per step: a symmetric band matrix with bandwidth b+1 is reduced by
annihilating one c-column-wide, d-diagonal-deep parallelogram at a time. Applying Q1 (and Q1^T on
the other side, to preserve symmetry) zeroes block 1 but creates a (d+c)-sized bulge; Q2, Q3, Q4,
Q5 chase the bulge down the band, with the affected blocks numbered 1-6 in the order they are
eliminated or chased.)

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
Only need to zero out the leading parallelogram of each trapezoid.

Conventional vs CA-SBR
  Conventional: touch all data 4 times.
  Communication-avoiding: touch all data once. Many tuning parameters (number of "sweeps",
  #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to
  chase each bulge); the right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

  Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
  Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
  Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)
  BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
  DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.
  CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step.
  In practice, how best to interleave BFS and DFS is a "tuning parameter".

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
  Speedups: 24%-184% (over previous Strassen-based algorithms)
  For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, ...
    • All-pairs-shortest-path, ...
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, ...
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, ...
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A2x, ..., Akx]
• Replace k iterations of y = A·x with [Ax, A2x, ..., Akx]
• Example: A tridiagonal, n=32, k=3
  (Figure, animated over several slides: the entries of x, A·x, A2·x, A3·x drawn as rows of a grid;
  each entry of Aj·x depends only on nearby entries of A(j-1)·x.)
• Works for any "well-partitioned" A
• Sequential algorithm: Steps 1-4 each compute a contiguous chunk of all k vectors, so data moves
  from slow to fast memory O(1) times instead of O(k) times
• Parallel algorithm: Procs 1-4 each own a block of x and compute the (overlapping) trapezoid of
  entries they need; each processor communicates once with its neighbors instead of once per SpMV
• Same idea works for general sparse matrices
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
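A minimal NumPy sketch of the matrix powers kernel for the tridiagonal example above: each block of rows reads its piece of x plus k "ghost" entries on each side once, then computes all k products locally, instead of streaming the whole vector k times. The function name, block size, and diagonal layout are illustrative, not taken from any particular library.

```python
import numpy as np

def akx_tridiag(diags, x, k, block=8):
    """diags = (lower, main, upper) of a tridiagonal A; returns [A x, A^2 x, ..., A^k x]."""
    lo, d, up = diags                       # A[i,i-1]=lo[i], A[i,i]=d[i], A[i,i+1]=up[i]
    n = len(x)
    out = np.zeros((k, n))
    for start in range(0, n, block):
        stop = min(start + block, n)
        g0, g1 = max(0, start - k), min(n, stop + k)     # ghost region of width k
        v = x[g0:g1].copy()                              # the only read of x for this block
        for j in range(k):
            w = d[g0:g1] * v
            w[1:]  += lo[g0 + 1:g1] * v[:-1]
            w[:-1] += up[g0:g1 - 1] * v[1:]
            # after j+1 products, only the outermost j+1 ghost entries are wrong,
            # so the owned range [start, stop) is still exact
            out[j, start:stop] = w[start - g0:stop - g0]
            v = w
    return out

n, k = 32, 3
rng = np.random.default_rng(0)
lo, d, up = rng.random(n), rng.random(n), rng.random(n)
lo[0] = up[-1] = 0.0
A = np.diag(d) + np.diag(lo[1:], -1) + np.diag(up[:-1], 1)
x = rng.random(n)
V = akx_tridiag((lo, d, up), x, k)
assert np.allclose(V[-1], np.linalg.matrix_power(A, k) @ x)
```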

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, ..., A^k b} minimizing || Ax-b ||2

Standard GMRES:
  for i = 1 to k
     w = A · v(i-1)                 ... SpMV
     MGS(w, v(0), ..., v(i-1))      ... Modified Gram-Schmidt to update v(i), H
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A2v, ..., Akv ]
  [Q, R] = TSQR(W)                  ... "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
• "Monomial" basis [Ax, ..., Akx] fails to converge
• A different polynomial basis [p1(A)x, ..., pk(A)x] does converge
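A small numerical illustration (not from the slides) of why the naive monomial basis loses precision: its columns grow like powers of the largest eigenvalue and point increasingly in the same direction, so W becomes ill-conditioned and numerically rank-deficient well before k gets large. The test matrix and sizes are arbitrary choices for the demo; CA-GMRES avoids this with Newton or Chebyshev polynomial bases.

```python
import numpy as np

n, k = 200, 12
rng = np.random.default_rng(1)
A = np.diag(np.linspace(1.0, 100.0, n)) + 0.01 * rng.standard_normal((n, n))
x = rng.standard_normal(n)

W = np.empty((n, k + 1))
W[:, 0] = x
for j in range(k):
    W[:, j + 1] = A @ W[:, j]      # monomial basis: each column is A times the previous one

print("condition number of W:", np.linalg.cond(W))                 # astronomically large
print("numerical rank of W  :", np.linalg.matrix_rank(W), "of", k + 1)
```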

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
  (Plot.)

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:
  Basis:            Naive     Monomial                                  Newton        Chebyshev
  Replacement its:  74 (1)    [7, 15, 24, 31, ..., 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication
  – Nonzero entries and indices can each be explicit (O(nnz) data) or implicit (o(nnz) data):
      entries explicit, indices explicit:  CSR and variations
      entries explicit, indices implicit:  vision, climate, AMR, ...
      entries implicit, indices explicit:  graph Laplacian
      entries implicit, indices implicit:  stencils
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations
  – [x, Ax, A2x, ..., Akx] or [x, p1(A)x, p2(A)x, ..., pk(A)x]
  – Number of columns in x
  – [x, Ax, A2x, ..., Akx] and [y, A^T y, (A^T)2 y, ..., (A^T)k y], or [y, A^T A y, (A^T A)2 y, ..., (A^T A)k y]
  – Return all vectors or just the last one
• Cotuning and/or interleaving
  – W = [x, Ax, A2x, ..., Akx] and {TSQR(W) or W^T W or ...}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
       for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
          i = i1+i2, j = j1+j2, k = k1+k2
          C(i,j) += A(i,k)*B(k,j)        ... b x b matmul
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:
            i  j  k
  Δ =     [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C
• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
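The LP on this slide is small enough to hand straight to SciPy; a quick check (not part of the slides) that it recovers x = [1/2, 1/2, 1/2]:

```python
import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
# maximize 1^T x subject to Delta x <= 1, x >= 0 (linprog minimizes, so negate the objective)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # -> [0.5 0.5 0.5], s = 1.5, hence words_moved = Ω(n^3 / M^(s-1))
```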

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:
            i  j
  Δ =     [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)
• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
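A tiny sketch (not from the slides) of the blocked n-body loop the LP result suggests: hold a block of Θ(M) particles P(i) and a block of Θ(M) particles P(j) in fast memory and do all their interactions before moving more data. The 1-D positions and the toy 1/r "force" are illustrative stand-ins for a real kernel.

```python
import numpy as np

def forces_blocked(pos, b):
    n = len(pos)
    F = np.zeros_like(pos)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Pi, Pj = pos[i0:i0+b], pos[j0:j0+b]          # two blocks held in fast memory
            d = Pi[:, None] - Pj[None, :]                # pairwise displacements
            f = np.divide(1.0, d, out=np.zeros_like(d), where=(d != 0))   # toy 1/r force
            F[i0:i0+b] += f.sum(axis=1)
    return F

rng = np.random.default_rng(0)
pos = rng.random(1024)
# Same answer for any block size; only the amount of data traffic changes.
np.testing.assert_allclose(forces_blocked(pos, 64), forces_blocked(pos, 1024), atol=1e-8)
```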

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles
  11.8x speedup
  K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, ..., for i6=1:n
     A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
     A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
            i1 i2 i3 i4 i5 i6
  Δ =     [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
          [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6
• Solve LP for x = [x1, ..., x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, ..., for ik = i3:i4
       C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, ...), B(pnt(3*i4)), ...)
       D(something else) = func(something else), ...
  => for (i1,i2,...,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,...,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,...,ik) = (i2+3*i4, i1, i2, i1+i2, ...), ...
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), ...?

General Communication Bound
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, ..., sm:
     for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
     words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, ..., bound |S| in terms of |φ1(S)|, |φ2(S)|, ..., |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, ..., sm,
     |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)
• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
     HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
     Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
  – www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software
(and compilers...)

Page 11: Introduction to Communication-Avoiding Algorithms, www.cs.berkeley.edu/~demmel/SC12_tutorial

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Lower bound for all "direct" linear algebra
• Let M = "fast" memory size (per processor)
     #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
     #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
  – more generally, #messages_sent ≥ #words_moved / largest_message_size
• Parallel case: assume either load or memory balanced
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, ...
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
• SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

Naïve Matrix Multiply
implements C = C + A*B
  for i = 1 to n
     {read row i of A into fast memory}           ... n^2 reads altogether
     for j = 1 to n
        {read C(i,j) into fast memory}            ... n^2 reads altogether
        {read column j of B into fast memory}     ... n^3 reads altogether
        for k = 1 to n
           C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}        ... n^2 writes altogether
  (Figure: C(i,j) = C(i,j) + A(i,:)·B(:,j).)
n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory
  for i = 1 to n/b
     for j = 1 to n/b
        {read block C(i,j) into fast memory}      ... b^2 × (n/b)^2 = n^2 reads
        for k = 1 to n/b
           {read block A(i,k) into fast memory}   ... b^2 × (n/b)^3 = n^3/b reads
           {read block B(k,j) into fast memory}   ... b^2 × (n/b)^3 = n^3/b reads
           C(i,j) = C(i,j) + A(i,k) * B(k,j)      ... do a matrix multiply on blocks
        {write block C(i,j) back to slow memory}  ... b^2 × (n/b)^2 = n^2 writes
2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3/M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
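A directly runnable version of the blocked algorithm above, as a sketch (Python stands in for the pseudocode; a tuned implementation would be in C/Fortran or would simply call BLAS). It assumes the fast-memory size M is known, and picks b = (M/3)^(1/2) so that three b-by-b blocks fit.

```python
import numpy as np

def blocked_matmul(A, B, M=3 * 64 * 64):
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))                      # three b-by-b blocks fit in M words
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]                        # block of C kept "in fast memory"
            for k in range(0, n, b):
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b] # b-by-b matmul
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```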

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [ C11 C12 ] = A · B = [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]           [ A21 A22 ]   [ B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij etc. is 1x1 or n/2 x n/2

  func C = RMM(A, B, n)
     if n = 1, C = A * B, else
        C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
        C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
        C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
        C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
     return

Recursive Matrix Multiplication (RMM) (2/2)
  A(n) = # arithmetic operations in RMM(·, ·, n)
       = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
       = 2n^3 ... same operations as usual, in different order
  W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
       = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
       = O( n^3/M^(1/2) + n^2 ) ... same as blocked matmul
"Cache oblivious": works for memory hierarchies, but not a panacea
For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm

How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
  – Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P=16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij
  (Figure: three 4x4 processor grids for C = A * B.)

SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within a factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages    = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  – summation over submatrices
  – need not be a square processor grid
  (Figure: block row i of A and block column j of B contribute to block C(i,j).)

  For k = 0 to n/b-1
     for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
     for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
     Receive A(i,k) into Acol
     Receive B(k,j) into Brow
     C_myproc = C_myproc + Acol * Brow

SUMMA Communication Costs
  For k = 0 to n/b-1
     for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
        ... #words = log P^(1/2) * b*n/P^(1/2),  #messages = log P^(1/2)
     for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
        ... same #words and #messages
     Receive A(i,k) into Acol; Receive B(k,j) into Brow
     C_myproc = C_myproc + Acol * Brow

° Total #words = log P * n^2/P^(1/2)
° Within a factor of log P of the lower bound
° (a more complicated implementation removes the log P factor)
° Total #messages = log P * n/b
° Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
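A single-process NumPy "simulation" of SUMMA (a sketch, not an MPI implementation): array slicing stands in for the row and column broadcasts, but the loop structure (one panel of A and one panel of B per step of k, each processor updating only its own block of C) is the same as in the pseudocode above. The grid sizes and panel width are illustrative.

```python
import numpy as np

def summa(A, B, pr=4, pc=4, b=16):
    n = A.shape[0]
    C = np.zeros((n, n))
    rows = np.array_split(np.arange(n), pr)     # block rows owned by each processor row
    cols = np.array_split(np.arange(n), pc)     # block columns owned by each processor column
    for k in range(0, n, b):
        Acol = A[:, k:k+b]                      # "broadcast" panel A(i,k) along processor rows
        Brow = B[k:k+b, :]                      # "broadcast" panel B(k,j) along processor columns
        for I in rows:                          # each processor updates its own C block
            for J in cols:
                C[np.ix_(I, J)] += Acol[I, :] @ Brow[:, J]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(summa(A, B), A @ B)
```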

Can we do better?
• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if the matrix is small enough to fit c>1 copies, so M = c·n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available...

2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  (Example: P = 32, c = 2.)

• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
  (3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
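A minimal NumPy check (a sketch, not the distributed code) of the arithmetic behind steps (2)-(3): with c replicas, layer k of the grid computes only 1/c of the sum over m, and the partial products are then summed ("reduced") across layers. The real algorithm additionally distributes each layer's work over the (P/c)^(1/2) x (P/c)^(1/2) processors of that layer.

```python
import numpy as np

n, c = 96, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)

m_ranges = np.array_split(np.arange(n), c)          # split summation index m among c layers
partial = [A[:, m] @ B[m, :] for m in m_ranges]     # step (2): each layer does 1/c of SUMMA
C = np.sum(partial, axis=0)                         # step (3): sum-reduce along the k-axis

assert np.allclose(C, A @ B)
```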

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
  (Plot: 2.5D vs 2D matmul; annotations "12x faster" and "2.7x faster".)
  Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) = E(P)

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
  (Plot.)

Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, ...
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives the average power required to run the algorithm; can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Extending to rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU
  for i = 1 to n-1
     A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                          ... scale a vector
     A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)       ... rank-1 update
  (i.e., for i = 1 to n-1: update column i, then update the trailing matrix)

Communication in sequential One-sided Factorizations (LU, QR, ...)
• Naive approach:
    for i = 1 to n-1
       update column i, update trailing matrix
  #words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b - 1
       update block i of b columns, update trailing matrix
  #words_moved = O(n^3/M^(1/3))
• Recursive approach:
    func factor(A)
       if A has 1 column, update it
       else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  #words_moved = O(n^3/M^(1/2))
• None of these approaches
  – minimizes #messages
  – handles eig() or svd()
  – works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

  W = [ W0 ]   [ Q00·R00 ]                             [ R00 ]
      [ W1 ] = [ Q10·R10 ] = diag(Q00,Q10,Q20,Q30) ·   [ R10 ]
      [ W2 ]   [ Q20·R20 ]                             [ R20 ]
      [ W3 ]   [ Q30·R30 ]                             [ R30 ]

  [ R00 ]   [ Q01·R01 ]                    [ R01 ]
  [ R10 ] = [         ] = diag(Q01,Q11) ·  [ R11 ]
  [ R20 ]   [ Q11·R11 ]
  [ R30 ]

  [ R01 ] = Q02·R02
  [ R11 ]

  Output: { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }

TSQR: An Architecture-Dependent Algorithm

  Parallel:   W = [W0; W1; W2; W3] – leaf QRs give R00, R10, R20, R30; pairs are combined
              into R01 and R11, which are combined into R02 (binary reduction tree).
  Sequential: W = [W0; W1; W2; W3] – blocks folded in left to right: R00 → R01 → R02 → R03
              (flat tree).
  Dual Core:  a mixture of the two trees.

  Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core.
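A NumPy sketch of TSQR with a binary reduction tree (the "Parallel" tree above, executed sequentially here): each leaf factors its block, then R factors are stacked pairwise and refactored up the tree. Only the final R is formed explicitly; the per-node Q factors (the slides' Q00, Q01, ...) would be kept implicitly in a full implementation. The block count is an illustrative parameter.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, nblocks)]    # leaf QRs
    while len(Rs) > 1:                                                 # tree reduction
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.randn(4096, 8)
R = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]
# R agrees with the R of a standard QR of W, up to the signs of its rows.
assert np.allclose(np.abs(R), np.abs(R_ref))
```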

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing the data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 12: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how
    individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

12

• Let M = "fast" memory size (per processor)

    words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

    messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how
    individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

13

• Let M = "fast" memory size (per processor)

    words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

    messages_sent ≥ words_moved / largest_message_size

• Parallel case: assume either load or memory balanced

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how
    individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

14

• Let M = "fast" memory size (per processor)

    words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

    messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK
  attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers,
    new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster
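
A runnable version of the blocked loop above (a sketch in NumPy; M here is a stand-in for the fast-memory capacity in words, a toy value you would tune per machine). Choosing b ≈ (M/3)^(1/2) keeps the three b-by-b blocks resident, giving roughly 2n^3/b words moved:

    import numpy as np

    def blocked_matmul(A, B, C, M):
        """C += A*B with b-by-b tiles, b chosen so 3 blocks fit in M words."""
        n = A.shape[0]
        b = max(1, int((M // 3) ** 0.5))        # 3*b^2 <= M
        for i in range(0, n, b):
            for j in range(0, n, b):
                Cij = C[i:i+b, j:j+b]           # read block C(i,j) (a view, updated in place)
                for k in range(0, n, b):
                    # read blocks A(i,k), B(k,j); do a matrix multiply on blocks
                    Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    n, M = 512, 3 * 64 * 64                     # toy sizes
    A, B, C = (np.random.rand(n, n) for _ in range(3))
    C_ref = C + A @ B
    assert np.allclose(blocked_matmul(A, B, C.copy(), M), C_ref)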

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M,
  then #reads/writes = 2n^3/b + 2n^2

• Make b as large as possible: 3b^2 ≤ M, so
  #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2

• Attains lower bound = Ω( #flops / M^(1/2) )

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ]  =  A · B  =  [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]               [ A21 A22 ]   [ B21 B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1x1 or n/2 x n/2

22

    func C = RMM (A, B, n)
       if n = 1
          C = A * B
       else
          C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
          C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
          C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
          C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
       return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3                           … same operations as usual, in different order

W(n) = # words moved between fast and slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 )       … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
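
The RMM pseudocode above, written out as a runnable (if slow) sketch; the only liberty taken is cutting the recursion off at a small block rather than at 1x1, purely to keep Python's recursion overhead tolerable:

    import numpy as np

    def rmm(A, B):
        """Recursive matrix multiply, C = A*B, for square n = 2^m."""
        n = A.shape[0]
        if n <= 64:                      # base case (1x1 on the slide; a block here for speed)
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        C = np.empty_like(A)
        C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
        C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
        C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
        C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
        return C

    A, B = np.random.rand(256, 256), np.random.rand(256, 256)
    assert np.allclose(rmm(A, B), A @ B)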

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

• P processors in a P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^(1/2) x n/P^(1/2) submatrix of A, B, C
• Example: P=16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

    [figure: three 4x4 processor grids (P00 … P33), one each for C, A, and B]

    C = A * B

27

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Comes within a factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts,
    and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log(P^(1/2)) * b*n/P^(1/2),  #messages = log(P^(1/2))
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P * n^2 / P^(1/2)
  • Within a factor of log P of lower bound
  • (a more complicated implementation removes the log P factor)
• Total #messages = log P * n/b
  • Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
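
A small sketch of the cost accounting above, handy for seeing how far a given panel width b sits from the lower bounds (all quantities are per processor, the formulas are the ones on the slide, and the machine sizes are assumed toy values):

    from math import log2, sqrt

    def summa_costs(n, P, b):
        """Per-processor words and messages for SUMMA on a sqrt(P) x sqrt(P) grid."""
        rootP = sqrt(P)
        steps = n / b                                     # outer iterations over k-panels
        words = 2 * steps * log2(rootP) * b * n / rootP   # A-panel + B-panel broadcasts
        msgs  = 2 * steps * log2(rootP)
        return words, msgs

    def lower_bounds(n, P):
        M = n * n / P                                     # one copy of the data
        flops = n**3 / P
        return flops / sqrt(M), flops / M**1.5            # Omega(n^2/sqrt(P)), Omega(sqrt(P))

    n, P = 2**14, 1024
    for b in (1, 64, n // int(sqrt(P))):                  # b = n/sqrt(P) approaches the bound
        print(b, summa_costs(n, P, b), lower_bounds(n, P))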

31

Can we do better?

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if the matrix is small enough to fit c > 1 copies, so M = c*n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) * P^(1/2)) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain the new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90],
      Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where
      each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  [figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
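
The effect of replication on the bounds, as a sketch: with c copies the per-processor memory is M = c*n^2/P, so the words bound drops by c^(1/2) and the messages bound by c^(3/2); the 2.5D algorithm above attains both up to low-order terms (toy values of n, P, c below are assumptions for illustration):

    from math import sqrt

    def bounds_with_copies(n, P, c):
        """Lower bounds on per-processor words and messages when M = c*n^2/P."""
        M = c * n * n / P
        flops = n**3 / P
        words = flops / sqrt(M)        # = n^2 / (sqrt(c) * sqrt(P))
        msgs  = flops / M**1.5         # = sqrt(P) / c^(3/2)
        return words, msgs

    n, P = 2**14, 1024
    for c in (1, 2, 4, 16):            # c = 1 is the 2D case; c = P^(1/3) would be "3D"
        print(c, bounds_with_copies(n, P, c))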

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

2.7x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  • T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  • E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
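
The timing and energy model above as code (a sketch; the Greek-letter machine parameters and the message size m are placeholders you would measure for a real machine). With memory per processor held at M, runtime scales as T(P)/c and energy stays at E(P) when P grows by c:

    def runtime(P, n, M, m, gamma_t, beta_t, alpha_t):
        """T(P): flop + bandwidth + latency terms per processor."""
        return (n**3 / P) * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))

    def energy(P, n, M, m, gamma_e, beta_e, alpha_e, delta_e, eps_e,
               gamma_t, beta_t, alpha_t):
        """E(P): dynamic terms + memory-residency term + leakage, summed over P processors."""
        t = runtime(P, n, M, m, gamma_t, beta_t, alpha_t)
        per_proc = (n**3 / P) * (gamma_e + beta_e / M**0.5 + alpha_e / (m * M**0.5)) \
                   + delta_e * M * t + eps_e * t
        return P * per_proc

    # With M fixed per processor: runtime(c*P, ...) == runtime(P, ...) / c
    # and energy(c*P, ...) == energy(P, ...): perfect strong scaling in time and energy.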

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm.
    Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy
    efficiency (GFLOPS/W), can we determine a set of architectural parameters to
    describe a conforming computer architecture?

39

Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

    for i = 1 to n-1
       A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                       … scale a vector
       A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)    … rank-1 update

    for i = 1 to n-1
       update column i
       update trailing matrix
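
The two-line elimination above, runnable (a sketch; no pivoting, so it assumes the leading principal minors are nonsingular, e.g. a diagonally dominant A):

    import numpy as np

    def lu_nopivot(A):
        """Naive Gaussian elimination A = L*U (unit-diagonal L stored below, U above)."""
        A = A.copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                                 # scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # rank-1 update
        return A

    n = 200
    A = np.random.rand(n, n) + n * np.eye(n)                      # diagonally dominant
    LU = lu_nopivot(A)
    L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
    assert np.allclose(L @ U, A)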

Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive Approach:
    for i = 1 to n-1
       update column i
       update trailing matrix
  • #words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
       update block i of b columns
       update trailing matrix
  • #words moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
       if A has 1 column, update it
       else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  • #words moved = O(n^3 / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

41

    W = [ W0; W1; W2; W3 ]
      = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
      = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

    [ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ]
                           = diag(Q01, Q11) · [ R01; R11 ]

    [ R01; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

42

    W = [ W0; W1; W2; W3 ]
      = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
      = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

    [ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ]
                           = diag(Q01, Q11) · [ R01; R11 ]

    [ R01; R11 ] = Q02·R02

    Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
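
A two-level TSQR sketch in NumPy (binary reduction tree over four row blocks, matching the picture above; only the R factor is returned, and in the parallel case each tree level would cost one message per pair of processors):

    import numpy as np

    def tsqr(W, nblocks=4):
        """Tall-skinny QR by a reduction tree; returns only R (b x b)."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]           # leaves: local QRs
        while len(Rs) > 1:                                           # combine pairwise
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(10000, 50)
    R = tsqr(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R agrees with a flat QR of W up to the signs of its rows
    assert np.allclose(np.abs(R), np.abs(R_ref))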

TSQR: An Architecture-Dependent Algorithm

    Parallel:    W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
                 (binary reduction tree, one local QR per node)

    Sequential:  W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03
                 (flat tree: fold in one block at a time)

    Dual Core:   a hybrid of the two trees: each core folds its own blocks
                 sequentially, then the two cores' R factors are combined

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core, ?

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

45

    W_nxb = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
    Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

    [ W1'; W2' ] = P12·L12·U12,   [ W3'; W4' ] = P34·L34·U34
    Choose b pivot rows of each, call them W12' and W34'

    [ W12'; W34' ] = P1234·L1234·U1234
    Choose b pivot rows

• Go back to W and use these b pivot rows
  – Move them to top, do LU without pivoting
  – Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
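
A sketch of tournament pivoting in NumPy (the helper names are hypothetical, not from any library): Gaussian elimination with partial pivoting on each block picks b candidate rows, winners are stacked pairwise and played off again until b rows remain; those are the pivot rows used for the panel.

    import numpy as np

    def gepp_pivot_rows(W, b):
        """Indices of the b rows partial-pivoted Gaussian elimination would pick."""
        W = W.astype(float)
        idx = np.arange(W.shape[0])
        for k in range(b):
            p = k + np.argmax(np.abs(W[k:, k]))              # pivot search in column k
            W[[k, p]], idx[[k, p]] = W[[p, k]], idx[[p, k]]   # swap rows
            W[k+1:, k:] -= np.outer(W[k+1:, k] / W[k, k], W[k, k:])
        return idx[:b]

    def tournament_pivot_rows(W, b, nblocks=4):
        """Tournament pivoting: reduce candidate rows pairwise down to b winners."""
        cand = list(np.array_split(np.arange(W.shape[0]), nblocks))
        cand = [rows[gepp_pivot_rows(W[rows], b)] for rows in cand]   # b winners per block
        while len(cand) > 1:
            merged = []
            for i in range(0, len(cand), 2):
                rows = np.concatenate(cand[i:i+2])
                merged.append(rows[gepp_pivot_rows(W[rows], b)])
            cand = merged
        return cand[0]            # global indices of the b chosen pivot rows

    W = np.random.rand(4096, 8)
    print(tournament_pivot_rows(W, b=8))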

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[heat-map figure: predicted speedup as a function of log2(p) on the x-axis and
log2(n^2/p) = log2(memory_per_proc) on the y-axis]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference   Factor exceeding lower     Factor exceeding lower
                             bound for #words_moved     bound for #messages
Matrix Multiply
Cholesky
LU
QR
Sym Eig, SVD
Nonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix Multiply  [Cannon, 69]   1
Cholesky         ScaLAPACK      log P
LU               ScaLAPACK      log P
QR               ScaLAPACK      log P
Sym Eig, SVD     ScaLAPACK      log P
Nonsym Eig       ScaLAPACK      P^(1/2) · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix Multiply  [Cannon, 69]   1                          1
Cholesky         ScaLAPACK      log P                      log P
LU               ScaLAPACK      log P                      (n · log P) / P^(1/2)
QR               ScaLAPACK      log P                      (n · log P) / P^(1/2)
Sym Eig, SVD     ScaLAPACK      log P                      n / P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) · log P            n · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix Multiply  [Cannon, 69]   1                          1
Cholesky         ScaLAPACK      log P                      log P
LU               [GDX10]        log P                      log P
QR               [DGHL08]       log P                      log^3 P
Sym Eig, SVD     [BDD11]        log P                      log^3 P
Nonsym Eig       [BDD11]        log P                      log^3 P

Can we do even better

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference          Factor exceeding lower     Factor exceeding lower
                                    bound for #words_moved     bound for #messages
Matrix Multiply  [DS11, SBD11]      polylog P                  polylog P
Cholesky         [SD11, in prog.]   polylog P                  c^2 · polylog P – optimal?
LU               [DS11, SBD11]      polylog P                  c^2 · polylog P – optimal?
QR               Via Cholesky QR    polylog P                  c^2 · polylog P – optimal?
Sym Eig, SVD
Nonsym Eig

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying
  orthogonal transformations
  – It can communicate even less

[figure: symmetric matrix with bandwidth b+1]

Successive Band Reduction

[sequence of figures: b = bandwidth, c = #columns, d = #diagonals, constraint c+d ≤ b.
Each step applies an orthogonal transformation Qi (as Qi·A·Qi^T) that zeroes out a
parallelogram of c columns and d diagonals, creating a bulge that is chased down the
band by Q2, Q3, …; only the leading parallelogram of each trapezoid needs to be
zeroed out.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of
parallelograms, #bulges chased at one time, how far to chase each bulge.
Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 19x
  – Best sequential speedup vs ACML: 85x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

    Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )

    Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

    Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs.

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS:
      if EnoughMemory and P ≥ 7
         then BFS step
         else DFS step
      end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how best to interleave BFS and DFS steps is a "tuning parameter"

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al, 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k·log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [sequence of figures: dependency diagram with rows x, A·x, A^2·x, A^3·x over
  columns 1, 2, 3, 4, …, 32]
• Works for any "well-partitioned" A

• Sequential Algorithm: process one block of columns at a time (Step 1, Step 2,
  Step 3, Step 4), computing all k levels of the block (plus the entries it
  depends on) before moving to the next block

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of columns
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
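
A sketch of the blocked kernel for the tridiagonal example (here the operator is the stencil A = tridiag(-1, 2, -1); each block fetches a ghost region of width k once and then computes all k products locally, instead of communicating before every SpMV). Boundary blocks are handled by zero-padding, an assumption made only to keep the sketch short:

    import numpy as np

    def powers_kernel(x, k, nblocks=4):
        """Return [A x, A^2 x, ..., A^k x] for A = tridiag(-1, 2, -1), block by block."""
        n = len(x)
        xp = np.concatenate([np.zeros(k), x, np.zeros(k)])       # zero padding at the ends
        out = [np.empty(n) for _ in range(k)]
        bounds = np.linspace(0, n, nblocks + 1).astype(int)
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            cur = xp[lo : hi + 2*k].copy()                       # owned rows plus ghost width k
            for j in range(1, k + 1):
                cur = 2*cur[1:-1] - cur[:-2] - cur[2:]           # one local stencil application
                out[j-1][lo:hi] = cur[k-j : k-j + (hi-lo)]       # keep only the owned part
        return out

    # check against k explicit SpMVs
    n, k = 64, 3
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.rand(n)
    ref, y = [], x
    for _ in range(k):
        y = A @ y
        ref.append(y)
    for a, b in zip(powers_kernel(x, k), ref):
        assert np.allclose(a, b)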

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
    for i = 1 to k
       w = A · v(i-1)                    … SpMV
       MGS(w, v(0), …, v(i-1))           … Modified Gram-Schmidt; update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)                      … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!

88

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

                     Naive     Monomial                         Newton        Chebyshev
Replacement Its.     74 (1)    [7,15,24,31,…,92,97,103] (17)    [67,98] (2)   68 (1)

With Residual Replacement (RR), a la Van der Vorst and Ye

Tuning space for Krylov Methods

                                  Nonzero entries:          Nonzero entries:
    Indices:                      Explicit (O(nnz))         Implicit (o(nnz))
    Explicit (O(nnz))             CSR and variations        Vision, climate, AMR, …
    Implicit (o(nnz))             Graph Laplacian           Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T·W or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
       for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
          i = i1+i2, j = j1+j2, k = k1+k2
          C(i,j) += A(i,k)*B(k,j)          … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 13: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

45

W(n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
    Choose b pivot rows of W1, call them W1'
    Choose b pivot rows of W2, call them W2'
    Choose b pivot rows of W3, call them W3'
    Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12   →  choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   →  choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234   →  choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
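Not from the slides: a small NumPy/SciPy sketch of the tournament; function and parameter names are mine, and it only returns the chosen pivot-row indices (the real TSLU then moves those rows to the top of W and factors without further pivoting):

    import numpy as np
    from scipy.linalg import lu_factor

    def best_rows(rows, W, b):
        # Run partial-pivoting LU on W[rows] and return the b rows it selects as pivots.
        _, piv = lu_factor(W[rows])              # piv is in LAPACK interchange form
        order = np.arange(len(rows))
        for i, p in enumerate(piv[:b]):
            order[i], order[p] = order[p], order[i]
        return rows[order[:b]]

    def tournament_pivot_rows(W, b, nblocks=4):
        # Indices of b candidate pivot rows of tall-skinny W (n x b),
        # chosen by a reduction tree over row blocks ("tournament pivoting").
        winners = [best_rows(idx, W, b)
                   for idx in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(winners) > 1:                  # play off pairs of winners up the tree
            winners = [best_rows(np.concatenate(winners[i:i+2]), W, b)
                       for i in range(0, len(winners), 2)]
        return winners[0]

    W = np.random.rand(64, 8)
    print(tournament_pivot_rows(W, b=8))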

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

(Contour plot: horizontal axis log2(p); vertical axis log2(n²/p) = log2(memory_per_proc).)

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Table: for each algorithm, references attaining the bounds on Words Moved and #Messages,
for 2 levels of memory and for multiple levels of memory:

• BLAS-3: usual blocked or recursive algorithms; usual blocked algorithms (nested) or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson,97], [AhmedPingali,00], [BDHS09]; (←same) for multiple levels
• LU with pivoting: LAPACK (sometimes), [Toledo,97], [GDX 08] ([GDX 08] uses tournament rather than partial pivoting)
• QR, rank-revealing: LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08], [FrensWise,03] (but 3x flops)
• Eig, SVD: not LAPACK; [BDD11] randomized, but more flops; [BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor: M = O(n² / P)
• Recall lower bounds:
    #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
    #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Table columns: Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Rows: Matrix Multiply, Cholesky, LU, QR, Sym Eig & SVD, Nonsym Eig (entries filled in on the following slides)

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n² / P)
• Recall lower bounds:
    #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
    #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower bound for #words_moved
Matrix Multiply  [Cannon, 69]   1
Cholesky         ScaLAPACK      log P
LU               ScaLAPACK      log P
QR               ScaLAPACK      log P
Sym Eig, SVD     ScaLAPACK      log P
Nonsym Eig       ScaLAPACK      P^(1/2) · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n² / P)
• Recall lower bounds:
    #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
    #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding bound (words_moved)   Factor exceeding bound (messages)
Matrix Multiply  [Cannon, 69]   1                                      1
Cholesky         ScaLAPACK      log P                                  log P
LU               ScaLAPACK      log P                                  (n log P) / P^(1/2)
QR               ScaLAPACK      log P                                  (n log P) / P^(1/2)
Sym Eig, SVD     ScaLAPACK      log P                                  n / P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) · log P                        n · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor: M = O(n² / P)
• Recall lower bounds:
    #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
    #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding bound (words_moved)   Factor exceeding bound (messages)
Matrix Multiply  [Cannon, 69]   1                                      1
Cholesky         ScaLAPACK      log P                                  log P
LU               [GDX10]        log P                                  log P
QR               [DGHL08]       log P                                  log³ P
Sym Eig, SVD     [BDD11]        log P                                  log³ P
Nonsym Eig       [BDD11]        log P                                  log³ P

Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n² / P) per processor
• Increasing M reduces lower bounds:
    #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / (c^(1/2) · P^(1/2)) )
    #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference          Factor exceeding bound (words_moved)   Factor exceeding bound (messages)
Matrix Multiply  [DS11, SBD11]      polylog P                               polylog P
Cholesky         [SD11, in prog.]   polylog P                               c² polylog P – optimal?
LU               [DS11, SBD11]      polylog P                               c² polylog P – optimal?
QR               Via Cholesky QR    polylog P                               c² polylog P – optimal?
Sym Eig, SVD     ?
Nonsym Eig       ?

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQᵀ = T, where
  – A = Aᵀ is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

Successive Band Reduction

(Sequence of figures: a symmetric matrix of half-bandwidth b+1 is reduced toward tridiagonal form.
Notation on the figures: b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b.
Each sweep applies orthogonal transformations Q1, Q1ᵀ, Q2, Q2ᵀ, …, Q5, Q5ᵀ to annihilate a c-column
block of d diagonals; this creates a (d+c) x (d+c) bulge further down the band, which is chased off
the end of the matrix by subsequent transformations — the regions labeled 1–6 in the frames.
Only need to zero out the leading parallelogram of each trapezoid.)

Conventional vs CA-SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms,
#bulges chased at one time, how far to chase each bulge.
Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n³) matmul:          #words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:      #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:     #words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
    if EnoughMemory and P ≥ 7 then
        BFS step
    else
        DFS step
    end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

71

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al, 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
  (Figures: the entries of x, A·x, A²·x, A³·x laid out over columns 1, 2, 3, 4, …, 32,
   showing the dependencies between successive vectors.)
• Works for any "well-partitioned" A

• Sequential Algorithm: Steps 1, 2, 3, 4
  (Each step reads one block of A and x and computes the corresponding entries of all of
   Ax, A²x, A³x before moving on, so the data is moved from slow to fast memory only once.)

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  • Each processor communicates once with neighbors
  • Each processor works on (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
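For illustration (not from the slides), a small self-contained sketch of the parallel idea for the tridiagonal example; the names and ghost-zone bookkeeping are mine. Each "processor" owns a contiguous index range, copies k ghost entries from each neighbor once, and then computes its rows of Ax, A²x, …, Aᵏx with no further communication:

    import numpy as np

    def local_matrix_powers(x, a, b, c, lo, hi, k):
        # Rows lo:hi of [A·x, A²·x, ..., A^k·x] for tridiagonal A with subdiagonal a,
        # diagonal b, superdiagonal c: copy k ghost entries per side, then iterate locally.
        n = len(x)
        lo_e, hi_e = max(lo - k, 0), min(hi + k, n)     # extended block incl. ghost zones
        v = x[lo_e:hi_e].copy()
        out = np.zeros((k, hi - lo))
        for step in range(k):
            w = b[lo_e:hi_e] * v
            w[1:]  += a[lo_e + 1:hi_e] * v[:-1]
            w[:-1] += c[lo_e:hi_e - 1] * v[1:]
            v = w                                       # entries near the ghost edge go stale,
            out[step] = v[lo - lo_e : hi - lo_e]        # but rows lo:hi stay correct for k steps
        return out

    n, k = 32, 3
    a, b, c, x = (np.random.rand(n) for _ in range(4))
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    blocks = [(0, 8), (8, 16), (16, 24), (24, 32)]      # Proc 1..4, one exchange each
    got = np.hstack([local_matrix_powers(x, a, b, c, lo, hi, k) for lo, hi in blocks])
    ref = [np.linalg.matrix_power(A, s + 1) @ x for s in range(k)]
    print(all(np.allclose(got[s], ref[s]) for s in range(k)))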

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing || Ax-b ||₂

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)                  … SpMV
        MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt
        update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A²v, …, Aᵏv ]          … matrix powers kernel
    [Q, R] = TSQR(W)                    … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!   88

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, Aᵏx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge

89
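Not from the slides: a minimal NumPy sketch of the communication-avoiding basis-building step. np.linalg.qr stands in for TSQR, building H from R is omitted, and the shifted basis is only meant to illustrate the polynomial-basis fix mentioned above:

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        # W = [v, p1(A)v, ..., pk(A)v]: monomial basis if shifts is None,
        # otherwise a shifted (Newton-like) basis p_{j+1}(A)v = (A - shifts[j]·I)·p_j(A)v.
        W = np.empty((len(v), k + 1))
        W[:, 0] = v
        for j in range(k):
            W[:, j + 1] = A @ W[:, j] - (0 if shifts is None else shifts[j] * W[:, j])
        return W

    n, k = 200, 8
    A = np.diag(np.linspace(1, 10, n)) + 0.01 * np.random.rand(n, n)
    v = np.random.rand(n)
    W = krylov_basis(A, v, k)          # one matrix-powers call, no dot products
    Q, R = np.linalg.qr(W)             # one (TS)QR replaces k rounds of MGS
    # The monomial basis becomes ill-conditioned quickly (the "Oops" above);
    # a shifted basis is typically much better conditioned:
    print(np.linalg.cond(W),
          np.linalg.cond(krylov_basis(A, v, k, shifts=np.linspace(1, 10, k))))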

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR) à la Van der Vorst and Ye — iterations at which replacement occurred (count in parentheses), per basis:

    Naive        74 (1)
    Monomial     [7, 15, 24, 31, …, 92, 97, 103] (17)
    Newton       [67, 98] (2)
    Chebyshev    68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

                               Nonzero entries:
                               Explicit (O(nnz))       Implicit (o(nnz))
    Indices: Explicit (O(nnz)) CSR and variations      Vision, climate, AMR, …
    Indices: Implicit (o(nnz)) Graph Laplacian         Stencils

• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one?
• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (b x b matmul in the inner loops):
    for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
      for i2 = 0:b-1,  for j2 = 0:b-1,  for k2 = 0:b-1
        i = i1+i2,  j = j1+j2,  k = k1+k2
        C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
    Δ =  [ 1  0  1 ]   A
         [ 0  1  1 ]   B
         [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
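A tiny sketch (not from the slides) of solving this LP with SciPy; linprog minimizes, so the objective is negated, and Δ is the matmul matrix above:

    import numpy as np
    from scipy.optimize import linprog

    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=delta, b_ub=[1, 1, 1], bounds=(0, None))
    print(res.x, -res.fun)          # x = [0.5, 0.5, 0.5], 1ᵀx = 1.5 = s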

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
    Δ =  [ 1  0 ]   F
         [ 1  0 ]   P(i)
         [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
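For illustration (not the tuned code behind the speedups below), a blocked direct n-body loop in the shape the bound suggests; `force` is a user-supplied pairwise function and the block size plays the role of M:

    import numpy as np

    def blocked_forces(P, force, M):
        # Tile both loops by ~M so each (i-block, j-block) pair of particles is
        # reused from fast memory, matching the Ω(n²/M) words-moved bound.
        n = len(P)
        b = max(1, int(M))
        F = np.zeros_like(P, dtype=float)
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                Pi, Pj = P[i0:i0 + b], P[j0:j0 + b]      # two blocks of particles
                for i in range(len(Pi)):
                    for j in range(len(Pj)):
                        F[i0 + i] += force(Pi[i], Pj[j])
        return F

    P = np.random.rand(64)
    print(blocked_forces(P, force=lambda p, q: q - p, M=8).sum())   # ~0 for this toy force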

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
      A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
      A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

           i1 i2 i3 i4 i5 i6
    Δ =  [ 1  0  1  0  0  1 ]   A1
         [ 1  1  0  1  0  0 ]   A2
         [ 0  1  1  0  1  0 ]   A3
         [ 0  0  1  1  0  1 ]   A3, A4
         [ 0  1  0  0  0  1 ]   A5
         [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
        C(i,j) += A(i,k)*B(k,j)
  ⇒  for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
        D(something else) = func(something else)
        …
  ⇒  for (i1,i2,…,ik) in S = subset of Zᵏ,
      access locations indexed by "projections", e.g.
        φC(i1,i2,…,ik) = (i1+2*i3-i7)
        φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations (= #points in S) given bounds on the #points in its images φC(S), φA(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
      for all subgroups H < Zᵏ:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize
  sHBL = Σj sj subject to the HBL-LP. Then
      words_moved = Ω( #iterations / M^(sHBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,

      |S| ≤ Πj |φj(S)|^sj
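As a worked instance (not on the slide; it is just a sanity check of the inequality in the classical Loomis–Whitney case, which is exactly the matmul setting above):

    % Matmul: S = [1,n]^3 \subset \mathbb{Z}^3, with projections
    % \phi_C(i,j,k)=(i,j), \ \phi_A(i,j,k)=(i,k), \ \phi_B(i,j,k)=(k,j),
    % and s_1=s_2=s_3=\tfrac12 feasible for the HBL-LP. The bound holds with equality:
    |S| \;=\; n^3 \;\le\; |\phi_C(S)|^{1/2}\,|\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}
          \;=\; (n^2)^{3/2} \;=\; n^3 .
    % With s_{HBL} = 3/2 this reproduces words\_moved = \Omega(n^3/M^{1/2}) for matmul.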

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. sHBL too small), but still
      worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get sHBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
      HBL-LP:       minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
      Dual-HBL-LP:  maximize 1ᵀx  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 14: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods


Page 15: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Can we attain these lower bounds

• Do conventional dense algorithms, as implemented in LAPACK and ScaLAPACK, attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

15

16

Naïve Matrix Multiply

{implements C = C + A·B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) · B(:,j)]

17

Naïve Matrix Multiply

{implements C = C + A·B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) · B(:,j)]

18

Naïve Matrix Multiply

{implements C = C + A·B}
for i = 1 to n
  {read row i of A into fast memory}              … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}                … n^2 reads altogether
    {read column j of B into fast memory}         … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) · B(k,j)
    {write C(i,j) back to slow memory}            … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + A(i,:) · B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), b-by-b blocks]

20

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}          … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}        … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}        … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) · B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}      … b^2 × (n/b)^2 = n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
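To make the tiling concrete, here is a minimal runnable sketch of the blocked algorithm above in Python/NumPy; the block size b is a hypothetical tuning parameter, to be chosen so that three b-by-b blocks fit in fast memory.

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """C += A*B using b-by-b tiles (the blocked algorithm above)."""
    n = A.shape[0]                             # assume square n-by-n, b divides n
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # "read block C(i,j) into fast memory"
            for k in range(0, n, b):
                # "read blocks A(i,k) and B(k,j)", then multiply on blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            C[i:i+b, j:j+b] = Cij              # "write block C(i,j) back"

# usage sketch
n, b = 256, 64
A, B, C = (np.random.rand(n, n) for _ in range(3))
C_ref = C + A @ B
blocked_matmul(A, B, C, b)
print(np.allclose(C, C_ref))
```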

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) · n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]
• True when each Aij etc. is 1x1 or n/2 x n/2

22

func C = RMM(A, B, n)
  if n = 1, C = A · B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM(A, B, n)
  if n = 1, C = A · B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
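A minimal cache-oblivious sketch of the RMM recursion above in Python/NumPy, assuming n is a power of 2; the base-case cutoff `leaf` is a hypothetical threshold at which a block fits in fast memory.

```python
import numpy as np

def rmm(A, B, leaf=64):
    """Recursive matmul C = A*B; recursion stops once blocks are small."""
    n = A.shape[0]                       # assume square, n a power of 2
    if n <= leaf:
        return A @ B                     # base case: blocks fit in fast memory
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = rmm(A11, B11, leaf) + rmm(A12, B21, leaf)
    C[:h, h:] = rmm(A11, B12, leaf) + rmm(A12, B22, leaf)
    C[h:, :h] = rmm(A21, B11, leaf) + rmm(A22, B21, leaf)
    C[h:, h:] = rmm(A21, B12, leaf) + rmm(A22, B22, leaf)
    return C
```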

How hard is hand-tuning matmul anyway

24

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

[Figure: 4x4 processor grids P00 … P33 for C, A and B:  C = A · B]

27

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Comes within a factor of log P of lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Figure: C(i,j) = Σk A(i,k) · B(k,j) over the processor grid]

29

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

[Figure: A(i,k) broadcast along processor row i into Acol; B(k,j) broadcast along processor column j into Brow; C(i,j) updated locally]

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol · Brow

30

SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log(P^(1/2)) · b·n/P^(1/2),  #messages = log(P^(1/2))
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol · Brow

° Total #words = log P · n^2 / P^(1/2)
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total #messages = log P · n/b
° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
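As a rough illustration only (not the PBLAS implementation), here is a serial Python/NumPy simulation of SUMMA on a virtual sqrt(P) x sqrt(P) grid; the "broadcasts" become array slices, but the panel structure and the rank-b update each processor performs per step are the same.

```python
import numpy as np

def summa_simulated(A, B, p_sqrt, b):
    """Simulate SUMMA on a p_sqrt x p_sqrt virtual processor grid."""
    n = A.shape[0]
    s = n // p_sqrt                       # each processor owns an s x s block of C
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]                # "broadcast" A(i,k) along each processor row
        Brow = B[k:k+b, :]                # "broadcast" B(k,j) along each processor column
        for i in range(p_sqrt):           # every processor does its local rank-b update
            for j in range(p_sqrt):
                C[i*s:(i+1)*s, j*s:(j+1)*s] += Acol[i*s:(i+1)*s, :] @ Brow[:, j*s:(j+1)*s]
    return C

n, p_sqrt, b = 64, 4, 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
print(np.allclose(summa_simulated(A, B, p_sqrt, b), A @ B))
```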

31

Can we do better

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc
• What if matrix small enough to fit c > 1 copies, so M = c·n^2/P?
  – #words_moved = Ω( #flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  – #messages    = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
  – Special case: "3D Matmul", c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…

25D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; Example: P = 32, c = 2]

25D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

39

Extending to rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) · ( 1 / A(i,i) )                        … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) · A(i, i+1:n)     … rank-1 update

for i = 1 to n-1
  update column i
  update trailing matrix
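A minimal runnable version of the naïve (unblocked, no pivoting) elimination loop above, in Python/NumPy, just to fix notation; it returns L and U so the factorization can be checked.

```python
import numpy as np

def naive_lu(A):
    """Unblocked Gaussian elimination without pivoting: A = L*U."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.random.rand(5, 5) + 5 * np.eye(5)   # diagonally dominant, so no pivoting needed
L, U = naive_lu(A)
print(np.allclose(L @ U, A))
```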

Communication in sequentialOne-sided Factorizations (LU QR hellip)

• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  • #words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  • #words moved = O(n^3 / M^(1/3))

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words moved = O(n^3 / M^(1/2))

• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!

TSQR QR of a Tall Skinny matrix

41

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10] = Q01·R01,   [R20; R30] = Q11·R11
  so [R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

TSQR QR of a Tall Skinny matrix

42

(Same reduction, one more time:)
W = [W0; W1; W2; W3] = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]
[R00; R10; R20; R30] = diag(Q01, Q11) · [R01; R11]
[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }

TSQR An Architecture-Dependent Algorithm

Parallel:    W = [W0; W1; W2; W3] → {R00, R10, R20, R30} → {R01, R11} → R02
Sequential:  W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03   (fold in one block at a time)
Dual Core:   a hybrid of the parallel and sequential reduction trees

Can choose reduction tree dynamically:
Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?
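A minimal sketch of parallel-style TSQR in Python/NumPy, assuming the number of row blocks is a power of two: local QRs at the leaves, then a binary reduction tree over stacked R factors. Only the final R is assembled here (the implicitly represented Q factors are discarded), which is already enough for uses like the CA-GMRES kernel later in the deck.

```python
import numpy as np

def tsqr_R(W_blocks):
    """Binary-tree TSQR: R factor of the tall-skinny matrix formed by
    stacking W_blocks (len(W_blocks) assumed to be a power of 2)."""
    # leaves: local QR of each block
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in W_blocks]
    # reduction tree: stack pairs of R factors and re-factor
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack((Rs[i], Rs[i+1])), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

m, n, p = 1 << 12, 10, 4
W = np.random.rand(m, n)
R = tsqr_R(np.array_split(W, p))
# R agrees with the usual QR up to the signs of its rows
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r'))))
```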

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

45

W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12      Choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34      Choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234      Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
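A rough sketch of one tournament-pivoting round in Python/SciPy, under the assumption that "choose b pivot rows" means: run ordinary partial-pivoting LU on a block and keep the b rows it promotes to the top. This only selects the panel's pivot rows; the final LU on the permuted matrix is not shown.

```python
import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(W, b):
    """Return the b rows of W that partial-pivoting LU uses as pivots."""
    P, L, U = lu(W)               # W = P @ L @ U, so P.T @ W has pivot rows on top
    return (P.T @ W)[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    """One TSLU-style tournament: local selections, then pairwise play-offs."""
    cand = [choose_pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
    while len(cand) > 1:
        cand = [choose_pivot_rows(np.vstack((cand[i], cand[i+1])), b)
                for i in range(0, len(cand), 2)]
    return cand[0]                # b "winning" rows, used as the panel's pivot rows

W = np.random.rand(1000, 8)
print(tournament_pivot_rows(W, b=8).shape)   # (8, 8)
```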

Exascale Machine Parameters
(Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup; horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc)]

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined; Green are ours; ? is unknown/future work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower bound for #words_moved   Factor exceeding lower bound for #messages
Matrix Multiply  [Cannon, 69]   1                                               1
Cholesky         ScaLAPACK      log P                                           log P
LU               [GDX10]        log P                                           log P
QR               [DGHL08]       log P                                           log^3 P
Sym Eig, SVD     [BDD11]        log P                                           log^3 P
Nonsym Eig       [BDD11]        log P                                           log^3 P

Can we do even better

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm        Reference         Factor exceeding lower bound for #words_moved   Factor exceeding lower bound for #messages
Matrix Multiply  [DS11, SBD11]     polylog P                                       polylog P
Cholesky         [SD11, in prog.]  polylog P                                       c^2 polylog P – optimal?
LU               [DS11, SBD11]     polylog P                                       c^2 polylog P – optimal?
QR               Via Cholesky QR   polylog P                                       c^2 polylog P – optimal?
Sym Eig, SVD
Nonsym Eig

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

[Figure: symmetric band matrix of bandwidth b+1]

Successive Band Reduction

[Figure, repeated over several animation steps: a symmetric band matrix of bandwidth b+1; sweeps apply orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T to zero out c columns at a time, creating bulges of size d+c that are chased down the band in numbered steps 1–6]

b = bandwidth, c = #columns, d = #diagonals
Constraint: c + d ≤ b

Only need to zero out the leading parallelogram of each trapezoid

Conventional vs CA - SBR

• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)
• Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym Band Reductionvs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)
In practice, how to best interleave BFS and DFS is a "tuning parameter"

71

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure, repeated over several animation steps: the entries of x, A·x, A^2·x, A^3·x for i = 1, 2, 3, 4, …, 32, with the dependencies between levels highlighted]

• Sequential Algorithm: proceed in a few steps (Step 1, Step 2, Step 3, Step 4); each step computes a contiguous block of all k vectors before moving on
• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor communicates once with its neighbors, then works on its (overlapping) trapezoid of the dependency region

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
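A small sketch of the parallel idea in Python/NumPy for the tridiagonal example: each (simulated) processor grabs its block of x plus k ghost entries on each side, then computes its piece of x, Ax, …, A^kx with no further communication. The specific matrix A = tridiag(1, 2, 1) and the block/ghost sizes are assumptions chosen to mirror the picture above.

```python
import numpy as np

def local_matrix_powers(x, k, lo, hi):
    """Compute rows lo:hi of [x, Ax, ..., A^k x] for A = tridiag(1, 2, 1),
    using only one exchange of k ghost entries per side."""
    n = len(x)
    glo, ghi = max(0, lo - k), min(n, hi + k)     # ghost region, received once
    v = x[glo:ghi].copy()
    levels = [x[lo:hi].copy()]
    for _ in range(k):
        w = np.zeros_like(v)
        w[1:-1] = v[:-2] + 2 * v[1:-1] + v[2:]    # interior of the local A*v
        if glo == 0:  w[0]  = 2 * v[0] + v[1]     # true boundary rows of A
        if ghi == n:  w[-1] = v[-2] + 2 * v[-1]
        v = w
        levels.append(v[lo - glo: hi - glo].copy())
    return levels                                  # [x, Ax, ..., A^k x] restricted to lo:hi

# usage: 4 "processors", n = 32, k = 3, as in the slides
n, k, p = 32, 3, 4
x = np.random.rand(n)
blocks = [local_matrix_powers(x, k, i * n // p, (i + 1) * n // p) for i in range(p)]
```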

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

88

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge
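A toy sketch of the communication pattern only (not a full CA-GMRES): build the Krylov basis W with one pass of SpMVs, then orthogonalize it with a single TSQR-style factorization (e.g. the tsqr_R sketch earlier), instead of k interleaved SpMV + MGS steps. Recovering H, solving the least-squares problem, and switching to a Newton/Chebyshev basis for stability are omitted.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis(A, v, k):
    """W = [v, Av, A^2 v, ..., A^k v] as columns (monomial basis)."""
    W = np.empty((len(v), k + 1))
    W[:, 0] = v
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    return W

n, k = 1000, 8
A = sp.diags([1.0, 2.0, 1.0], [-1, 0, 1], shape=(n, n), format='csr')
v = np.random.rand(n)
W = krylov_basis(A, v, k)
Q, R = np.linalg.qr(W)      # stand-in for TSQR(W); Q spans the Krylov subspace
print(np.linalg.cond(W))    # monomial basis: condition number grows fast with k
```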

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab, with Residual Replacement (RR) a la Van der Vorst and Ye

Basis:              Naive    Monomial                               Newton        Chebyshev
Replacement Its.:   74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

                                     Indices
                                     Explicit (O(nnz))     Implicit (o(nnz))
Nonzero    Explicit (O(nnz))         CSR and variations    Vision, climate, AMR, …
entries    Implicit (o(nnz))         Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
      Δ = [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
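The little LP above can be checked mechanically; a sketch using scipy.optimize.linprog (which minimizes, so the objective is negated) for the matmul Δ recovers x = [1/2, 1/2, 1/2] and s = 3/2, and the same call works for the N-body and "random code" examples by swapping in their Δ matrices.

```python
import numpy as np
from scipy.optimize import linprog

def hbl_exponents(delta):
    """Solve max 1^T x  s.t.  delta @ x <= 1, x >= 0 (the dual-HBL LP)."""
    m, k = delta.shape
    res = linprog(c=-np.ones(k),                 # maximize 1^T x
                  A_ub=delta, b_ub=np.ones(m),
                  bounds=[(0, None)] * k)
    return res.x, -res.fun                       # block-size exponents x, and s = 1^T x

# matmul: rows = A(i,k), B(k,j), C(i,j); columns = i, j, k
delta_matmul = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0]])
x, s = hbl_exponents(delta_matmul)
print(x, s)   # [0.5 0.5 0.5] 1.5 -> words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
```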

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
      Δ = [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
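A small sketch of what "attained by block sizes M^1, M^1" means in practice: tile both loops by a block of size b ≈ M, so each pair of blocks is loaded once and reused for b^2 interactions, giving O(n^2/b) = O(n^2/M) words moved. The 1D "1/d" force below is only an assumption for illustration.

```python
import numpy as np

def blocked_forces(P, b):
    """F(i) += sum_j force(P(i), P(j)), tiled so each block of P is reused b times."""
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, b):
        Pi = P[i0:i0+b]                      # block of "targets", kept in fast memory
        for j0 in range(0, n, b):
            Pj = P[j0:j0+b]                  # block of "sources", read once per block pair
            d = Pi[:, None] - Pj[None, :]    # toy pairwise "force" 1/d
            if i0 == j0:
                np.fill_diagonal(d, np.inf)  # skip self-interactions
            F[i0:i0+b] += (1.0 / d).sum(axis=1)
    return F

P = np.random.rand(1024)
F = blocked_forces(P, b=128)
```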

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
      Δ = [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
          [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
    access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k,  rank(H) ≤ Σj sj·rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    #words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊂ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj
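For readability, the same statements in LaTeX (a restatement of the slide, with s_HBL written out):

```latex
% HBL-LP: choose s_1,\dots,s_m \ge 0 minimizing s_{HBL}=\sum_j s_j subject to
\mathrm{rank}(H) \;\le\; \sum_{j} s_j \,\mathrm{rank}\big(\varphi_j(H)\big)
\qquad \text{for all subgroups } H < \mathbb{Z}^k .

% Christ--Tao--Carbery--Bennett inequality, and the resulting communication lower bound:
|S| \;\le\; \prod_{j} |\varphi_j(S)|^{\,s_j},
\qquad
\#\text{words\_moved} \;=\; \Omega\!\left(\frac{\#\text{iterations}}{M^{\,s_{HBL}-1}}\right).
```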

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile loop index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 16: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= + C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= + C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= + C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm | Two Levels of Memory: #Words Moved; #Messages | Multiple Levels of Memory: #Words Moved; #Messages
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same); (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] partial pivoting | [Toledo,97], [GDX 08]; [GDX 08]
QR, Rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops, [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK, [BDD11] randomized but more flops, [BDK11]; [BDD11,BDK11] | [BDD11,BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
(rows to be filled in below: Matrix Multiply, Cholesky, LU, QR, Sym Eig/SVD, Nonsym Eig)

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) · log P |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n log P) / P^(1/2)
QR | ScaLAPACK | log P | (n log P) / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) · log P | n · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n² / P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log³ P
Sym Eig, SVD | [BDD11] | log P | log³ P
Nonsym Eig | [BDD11] | log P | log³ P

Can we do even better

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n² / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² polylog P – optimal
LU | [DS11, SBD11] | polylog P | c² polylog P – optimal
QR | Via Cholesky QR | polylog P | c² polylog P – optimal
Sym Eig, SVD | | |
Nonsym Eig | | |

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

(Figure sequence: Successive Band Reduction. A symmetric band matrix of bandwidth b+1 is reduced by a sequence of orthogonal sweeps Q1, Q2, Q3, … applied as Qi^T·A·Qi; each sweep zeroes a parallelogram of c columns and d diagonals, and the bulges it creates (regions 1, 2, 3, …) are chased down the band. Legend: b = bandwidth, c = #columns, d = #diagonals, constraint c+d ≤ b. Only need to zero out the leading parallelogram of each trapezoid.)

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n³) matmul:        words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg 7) matmul:   words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

BFS step vs DFS step:

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

Communication Avoiding Parallel Strassen (CAPS):
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

In practice, how best to interleave BFS and DFS is a "tuning parameter".
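For reference, a plain sequential Strassen in NumPy (n a power of 2). In CAPS, the seven recursive products M1–M7 below are exactly what a BFS step runs in parallel on P/7 processors each, or what a DFS step runs one after another on all P processors; this sketch simply runs them sequentially and is not the parallel CAPS code:

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                       # classical base case
        k = n // 2
        A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:k, :k] = M1 + M4 - M5 + M7
        C[:k, k:] = M3 + M5
        C[k:, :k] = M2 + M4
        C[k:, k:] = M1 - M2 + M3 + M6
        return C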

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Example: A tridiagonal, n=32, k=3 (figure: the entries of x, A·x, A²·x, A³·x over columns 1…32, with dependencies between levels)
• Works for any "well-partitioned" A
• Sequential algorithm: process the levels in a few blocks (Steps 1, 2, 3, 4), reading each block of A and x once instead of k times
• Parallel algorithm (Proc 1 … Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the dependency region
• Same idea works for general sparse matrices
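What the kernel computes, in a naive SciPy sketch (the communication-avoiding versions produce the same vectors but reorganize the sweeps over A into the blocks/trapezoids described above; the tridiagonal example matches the slides):

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        """Return [A·x, A²·x, …, A^k·x] by k ordinary SpMVs."""
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return V[1:]

    # Example: tridiagonal A, n = 32, k = 3
    n, k = 32, 3
    A = sp.diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], [-1, 0, 1], format='csr')
    Ax, A2x, A3x = matrix_powers(A, np.ones(n), k)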

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]

For general sparse matrices:
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, A^k v ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!
"Monomial" basis [Ax, …, A^k x] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge.
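A NumPy sketch of the basis construction CA-GMRES starts from (the shifts argument is an assumed illustration of a Newton-type basis, e.g. with shifts taken from Ritz values; shifts=None gives the monomial basis that loses precision as noted above):

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        """W = [v, p1(A)v, ..., pk(A)v]; monomial if shifts is None."""
        W = np.empty((len(v), k + 1))
        W[:, 0] = v
        for j in range(k):
            w = A @ W[:, j]
            if shifts is not None:
                w = w - shifts[j] * W[:, j]      # Newton-basis shift
            W[:, j + 1] = w
        return W

    # CA-GMRES then orthogonalizes W all at once (e.g., with a TSQR routine)
    # instead of k rounds of MGS, giving the factor-of-k savings above.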

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Basis:            Naive | Monomial | Newton | Chebyshev
Replacement Its.: 74 (1) | [7, 15, 24, 31, …, 92, 97, 103] (17) | [67, 98] (2) | 68 (1)

With Residual Replacement (RR) à la Van der Vorst and Ye

Tuning space for Krylov Methods

                              Indices: Explicit (O(nnz)) | Indices: Implicit (o(nnz))
Nonzero entries: Explicit (O(nnz)) | CSR and variations  | Vision, climate, AMR, …
Nonzero entries: Implicit (o(nnz)) | Graph Laplacian     | Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A²x, …, A^k x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, A^k x] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)^k y], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)^k y]
  • return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, A^k x] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code (the innermost three loops do a b x b matmul):
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)
• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from? (A runnable version of the blocked code follows.)
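The blocked code above, written out as a NumPy sketch (b is the tile size; with b ≈ (M/3)^(1/2) the three b x b tiles fit in fast memory, which is where the M^(1/2) comes from):

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # one b x b tile update; slicing handles a ragged last tile
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C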

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

        i  j  k
        1  0  1    A
  Δ =   0  1  1    B
        1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
  (A small LP check follows.)
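The LP can be checked mechanically; a sketch with scipy.optimize.linprog, using the Δ above (the same recipe applied to the six-loop "random code" later in this section gives x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7]):

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    # maximize 1^T x  <=>  minimize -1^T x, subject to Delta @ x <= 1, x >= 0
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    x = res.x            # -> [0.5, 0.5, 0.5]
    s = x.sum()          # -> 1.5, so words_moved = Ω(n³ / M^(s-1)) = Ω(n³ / M^(1/2))
    print(x, s)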

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

        i  j
        1  0    F
  Δ =   1  0    P(i)
        0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹). Attained by block sizes M^xi, M^xj = M¹, M¹

N-Body Speedups on IBM-BG/P (Intrepid): 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
        1  0  1  0  0  1    A1
        1  1  0  1  0  0    A2
  Δ =   0  1  1  0  1  0    A3
        0  0  1  1  0  1    A3, A4
        0  1  0  0  0  1    A5
        1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj
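Worked instance (a standard specialization, stated here for orientation): for matmul, S is the set of iterations (i,j,k) executed in one phase and φ_A(S), φ_B(S), φ_C(S) are the sets of entries of A, B, C they touch. Taking s = (1/2, 1/2, 1/2), which satisfies the HBL-LP, the theorem becomes the Loomis-Whitney inequality

    |S| ≤ |φ_A(S)|^(1/2) · |φ_B(S)|^(1/2) · |φ_C(S)|^(1/2) ≤ M^(3/2)

whenever each operand set fits in fast memory (size ≤ M). Splitting the n³ iterations into phases of at most M^(3/2) iterations, each moving at least M words, recovers words_moved = Ω(n³/M^(1/2)).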

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile loop index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework!

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 18: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= + C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Successive Band Reduction

[Figure, shown on the slides as a sequence of animation frames: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b. Orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T annihilate c columns of d diagonals at a time (the (d+1)-by-c parallelogram in the figure) and chase the resulting bulges of size about d+c, numbered 1 through 6, down the band. Only the leading parallelogram of each trapezoid needs to be zeroed out.]

Conventional vs. CA-SBR

• Conventional: touch all data 4 times. Communication-avoiding: touch all data once.
• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
• Right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:      words_moved = Ω( M · (n/M^(1/2))^3 / P )
Strassen's O(n^lg 7) matmul:  words_moved = Ω( M · (n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:  words_moved = Ω( M · (n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

At each level of Strassen's recursion, choose a BFS step or a DFS step:
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

In practice, how best to interleave BFS and DFS steps is a "tuning parameter".

Performance benchmarking, strong scaling plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)
For details see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Works for any "well-partitioned" A
• Example: A tridiagonal, n = 32, k = 3
  [Figure, shown as a sequence of frames: the entries of x, A·x, A^2·x, A^3·x are laid out on a 1 … 32 grid, with the dependencies between levels highlighted.]
• Sequential algorithm: process the grid in blocks (Steps 1, 2, 3, 4), computing all k levels for one block before moving to the next, so the data is read from slow memory O(1) times instead of O(k) times.
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous block; each processor communicates once with its neighbors and then works on an (overlapping) trapezoid of the computation.
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
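
The trapezoid idea can be written down concretely. Below is a minimal sequential sketch for the tridiagonal example above; the function name, the (diag, off) storage for a symmetric tridiagonal A, and the local window [lo, hi) with ghost width k are illustrative assumptions, not code from the tutorial.

```python
import numpy as np

def local_matrix_powers(diag, off, x, lo, hi, k):
    """Sketch of the matrix powers kernel for a tridiagonal example:
    compute rows lo..hi-1 of [A·x, A^2·x, ..., A^k·x] after fetching the
    local block of x plus k "ghost" entries on each side ONCE, instead of
    exchanging a halo on every one of the k SpMVs.
    A is symmetric tridiagonal: diag[i] = A[i,i], off[i] = A[i,i+1] = A[i+1,i]."""
    n = len(x)
    g_lo, g_hi = max(0, lo - k), min(n, hi + k)   # ghost region: one fetch
    xa = np.array(x[g_lo:g_hi], dtype=float)      # local + ghost values
    results = []
    for _ in range(k):
        ya = diag[g_lo:g_hi] * xa
        ya[:-1] += off[g_lo:g_hi - 1] * xa[1:]    # superdiagonal contribution
        ya[1:]  += off[g_lo:g_hi - 1] * xa[:-1]   # subdiagonal contribution
        # Entries near the ghost boundary become stale (they would need data
        # outside the ghost region), but the error moves inward only one index
        # per step, so rows lo..hi-1 stay exact for all k steps.
        results.append(ya[lo - g_lo:hi - g_lo].copy())
        xa = ya
    return results
```

In the parallel version each processor owns one such window, and the single exchange of k ghost values per neighbor replaces the k per-step halo exchanges of the conventional loop, at the price of some redundant flops in the overlap.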

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]     … matrix powers kernel
  [Q, R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
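
To make the kernel swap concrete, here is a rough sketch (my own illustration, not the tutorial's implementation) of the CA basis construction: one pass of SpMVs builds the k-step basis and a single QR call stands in for TSQR; recovering the Hessenberg matrix H from R and solving the least-squares problem are omitted. The optional shifts give a Newton-style basis, the kind of fix referred to above for the monomial-basis precision loss.

```python
import numpy as np

def ca_krylov_basis(A, v, k, shifts=None):
    """Sketch of the CA-GMRES basis step.  One pass of SpMVs builds
    W = [v, p1(A)v, ..., pk(A)v]; one QR call (standing in for TSQR) replaces
    the k Modified Gram-Schmidt sweeps of standard GMRES.
    shifts=None gives the monomial basis, which loses precision for large k;
    Newton-style shifts give a better-conditioned basis."""
    n = v.shape[0]
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ W[:, j]                      # the only place A is touched
        if shifts is not None:
            w = w - shifts[j] * W[:, j]      # p_{j+1}(A)v = (A - shift_j·I) p_j(A)v
        W[:, j + 1] = w
    Q, R = np.linalg.qr(W)                   # one reduction over the whole basis
    return Q, R
```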

89

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR), a la Van der Vorst and Ye, the replacement iterations per basis are:
  Naive:     74 (1 replacement)
  Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17 replacements)
  Newton:    [67, 98] (2 replacements)
  Chebyshev: 68 (1 replacement)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                        Indices:
  Nonzero entries:      Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))     CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))     Graph Laplacian        Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n,
      C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n,
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1,
        i = i1+i2,  j = j1+j2,  k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … innermost loops do a b x b matmul
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does the 1/2 come from?
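
For reference, a runnable version of the blocked loop (a sketch: the fast-memory size M is just a parameter here, and the inner b-by-b multiply is delegated to numpy rather than written out as the two innermost loop nests).

```python
import numpy as np

def blocked_matmul(A, B, M):
    """Blocked matmul with block size b chosen so that three b-by-b blocks fit
    in a fast memory of M words, i.e. b ~ (M/3)^(1/2), which attains
    words_moved = Theta(n^3 / M^(1/2))."""
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one b-by-b block of C updated with a b-by-b matmul
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```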

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
        [  1  0  1 ]   A
  Δ  =  [  0  1  1 ]   B
        [  1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)),
  attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
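
The LP on this slide is small enough to solve directly; below is a sketch using scipy's linprog (maximizing 1^T x by minimizing its negation). The same few lines handle the Δ matrices on the N-body and "random code" slides that follow, and, per the dual-LP slide later on, the resulting x also gives the optimal tile sizes M^xj.

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) appear in each array reference.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  s.t.  Delta x <= 1   <=>   minimize -1^T x
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
x = res.x        # -> [1/2, 1/2, 1/2]
s = -res.fun     # -> 3/2, so words_moved = Omega(n^3 / M^(s-1)) = Omega(n^3 / M^(1/2))
print(x, s)
```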

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
        [  1  0 ]   F
  Δ  =  [  1  0 ]   P(i)
        [  0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1),
  attained by block sizes M^xi, M^xj = M^1, M^1
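
A sketch of the tiling this prescribes (the function name and the pairwise force kernel are placeholders, not the tutorial's code): both loops are blocked by roughly the fast-memory capacity in particles, so each tile of sources is read once per tile of targets rather than once per target.

```python
import numpy as np

def blocked_forces(P, force, b):
    """Blocked direct n-body loop: i and j are both tiled by b ~ M particles,
    so each pair of tiles is loaded once and reused b times
    (words_moved = Theta(n^2 / M)).  'force' is any pairwise kernel returning
    a vector; P is an (n, d) array of particle states."""
    n = P.shape[0]
    F = np.zeros_like(P, dtype=float)
    for i0 in range(0, n, b):
        Pi = P[i0:i0 + b]                    # tile of targets, kept in fast memory
        for j0 in range(0, n, b):
            Pj = P[j0:j0 + b]                # tile of sources
            for i in range(Pi.shape[0]):
                for j in range(Pj.shape[0]):
                    if i0 + i != j0 + j:     # skip self-interaction
                        F[i0 + i] += force(Pi[i], Pj[j])
    return F
```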

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n,
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
        [  1  0  1  0  0  1 ]   A1
        [  1  1  0  1  0  0 ]   A2
  Δ  =  [  0  1  1  0  1  0 ]   A3
        [  0  0  1  1  0  1 ]   A3, A4
        [  0  1  0  0  0  1 ]   A5
        [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7)),
  attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4,
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7),  φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Given S, a subset of Z^k, and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): given s1, …, sm,
    |S| ≤ Π_j |φj(S)|^(sj)
• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j sj · rank(φj(H))
• Thm: given a program with array references given by the φj, choose sj to minimize s_HBL = Σ_j sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett above
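
As a sanity check (my own worked instantiation, not from the slides), plugging matmul into the bound recovers the earlier Ω(n^3 / M^(1/2)) result:

```latex
% Matmul instance of the bound: S = \{1,\dots,n\}^3,
%   \varphi_A(i,j,k) = (i,k), \quad \varphi_B(i,j,k) = (k,j), \quad \varphi_C(i,j,k) = (i,j),
% and s_A = s_B = s_C = 1/2 satisfies the HBL-LP, so
|S| \le |\varphi_A(S)|^{1/2}\,|\varphi_B(S)|^{1/2}\,|\varphi_C(S)|^{1/2}
     = (n^2)^{3/2} = n^3,
\qquad
\mathrm{words\_moved} = \Omega\!\left(\frac{n^3}{M^{\,3/2 - 1}}\right)
                      = \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
```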

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index ij by M^(xj)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound; can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267, Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Don't Communic…

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 21: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 312n3M12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

22

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n) = 8 middot A(n2) + 4(n2)2 if n gt 1 else 1 = 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n) = 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2 = O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

For big speedups see SC12 poster on ldquoBeating MKL and ScaLAPACK at Rectangular Matmulrdquo 1113 at 515-7pm

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout
• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

        C        =        A        *        B

  P00 P01 P02 P03    P00 P01 P02 P03    P00 P01 P02 P03
  P10 P11 P12 P13    P10 P11 P12 P13    P10 P11 P12 P13
  P20 P21 P22 P23    P20 P21 P22 P23    P20 P21 P22 P23
  P30 P31 P32 P33    P30 P31 P32 P33    P30 P31 P32 P33

SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Comes within factor of log P of lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^(1/2)) = Ω((n³/P) / (n²/P)^(1/2)) = Ω(n² / P^(1/2))
    • #messages = Ω(#flops / M^(3/2)) = Ω((n³/P) / (n²/P)^(3/2)) = Ω(P^(1/2))
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96.ps, lawn100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C(i, j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Figure: block row i of A, block column j of B, updating C(i,j)]

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: Acol is a broadcast block column of A, Brow a broadcast block row of B, updating C(i,j)]

SUMMA Communication Costs

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … #words = log P^(1/2) * b*n/P^(1/2), #messages = log P^(1/2)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
      … same #words and #messages
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Total #words = log P * n² / P^(1/2)
  • Within factor of log P of lower bound
  • (more complicated implementation removes log P factor)
• Total #messages = log P * n/b
  • Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
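The structure of the SUMMA loop – a sequence of rank-b updates built from a block column of A and a block row of B – can be emulated serially in NumPy. This is only an illustration of what data each step touches, with the row/column broadcasts elided; it is not a parallel implementation.

import numpy as np

def summa_serial(A, B, b=32):
    """Serial emulation of SUMMA's outer loop: each step uses one block column
    of A (what a processor row would broadcast) and one block row of B
    (what a processor column would broadcast) for a rank-b update."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for k in range(0, n, b):
        Acol = A[:, k:k+b]        # broadcast along processor rows
        Brow = B[k:k+b, :]        # broadcast along processor columns
        C += Acol @ Brow          # local C_myproc += Acol * Brow
    return C

if __name__ == "__main__":
    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(summa_serial(A, B), A @ B)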

Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c*n²/P?
  – #words_moved = Ω(#flops / M^(1/2)) = Ω(n² / (c^(1/2) * P^(1/2)))
  – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain new lower bound?
  – Special case: "3D Matmul": c = P^(1/3)
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – Not always that much memory available…

2.5D Matrix Multiplication
• Assume can fit c*n²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2]

2.5D Matrix Multiplication
• Assume can fit c*n²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis, so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

[Plot annotations: "12x faster", "2.7x faster" than 2D]

Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) * [ γT + βT/M^(1/2) + αT/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP { n³/(cP) * [ γE + βE/M^(1/2) + αE/(m*M^(1/2)) ] + δE*M*T(cP) + εE*T(cP) } = E(P)
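A tiny sketch of the timing model above, just to show why T(cP) = T(P)/c when the per-processor memory M is held fixed; the machine parameters below are made-up placeholders, not measurements, and the "2D" comparator simply lets M shrink as 1/P to contrast imperfect scaling.

def T(P, M, n, gamma, beta, alpha, m):
    """T = (n^3/P) * (gamma + beta/M^0.5 + alpha/(m*M^0.5))."""
    return (n**3 / P) * (gamma + beta / M**0.5 + alpha / (m * M**0.5))

n, m = 2**15, 2**10                       # matrix dimension, words per message
gamma, beta, alpha = 1e-11, 1e-9, 1e-6    # assumed secs per flop / word / message
M0 = 2**24                                # words of memory per processor
P0 = 3 * n**2 // M0                       # minimal processor count: P0*M0 = 3n^2
for c in (1, 2, 4, 8):
    t_ca = T(c * P0, M0, n, gamma, beta, alpha, m)                    # keep M fixed: T(P0)/c
    t_2d = T(c * P0, 3 * n**2 / (c * P0), n, gamma, beta, alpha, m)   # 1 copy: M shrinks with P
    print(c, t_ca, t_2d)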

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Extending to rest of Direct Linear Algebra
Naïve Gaussian Elimination: A = LU

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) * (1 / A(i,i))                            … scale a vector
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)       … rank-1 update

i.e., for i = 1 to n-1: update column i, then update trailing matrix
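A minimal NumPy sketch of the naïve elimination loop above (no pivoting, so it assumes the leading minors are nonsingular; illustration only).

import numpy as np

def lu_naive(A):
    """In-place naive Gaussian elimination: returns L (unit lower) and U packed
    into one matrix, following the column-then-trailing-update loop."""
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                 # scale a vector (column i of L)
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # rank-1 update of trailing matrix
    return A

if __name__ == "__main__":
    n = 6
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant, so no pivoting needed
    LU = lu_naive(A)
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    assert np.allclose(L @ U, A)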

Communication in sequential One-sided Factorizations (LU, QR, …)
• Naive Approach:
    for i = 1 to n-1
      update column i
      update trailing matrix
  • #words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n³ / M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n³ / M^(1/2))
• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

W = [ W0; W1; W2; W3 ]
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01,  [ R20; R30 ] = Q11·R11
so [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
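A minimal sequential sketch of this binary-tree TSQR reduction, using NumPy's local QR; it returns only the final R (the implicit Q factors are discarded here) and assumes the number of row blocks is a power of 2. Illustration only.

import numpy as np

def tsqr_R(W, nblocks=4):
    """Binary-tree TSQR sketch: local QR on each row block, then repeatedly
    stack pairs of R factors and re-factor until one R remains."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]       # leaf QRs
    while len(Rs) > 1:                                        # tree reduction
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

if __name__ == "__main__":
    W = np.random.rand(10000, 10)                             # tall and skinny
    R = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique only up to signs of rows; compare via W^T W = R^T R
    assert np.allclose(R.T @ R, R_ref.T @ R_ref)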

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3] –
  Parallel: binary tree (R00, R10, R20, R30 → R01, R11 → R02)
  Sequential: flat tree (R00 → R01 → R02 → R03)
  Dual Core: hybrid of the two trees]

Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W_nxb = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12,  [ W3'; W4' ] = P34·L34·U34
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
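A rough sketch of one tournament-pivoting round, using SciPy's LU with partial pivoting as the local pivot-row selector. This is only an illustration of the reduction-tree idea; the block count, tree shape, and helper names are choices made here, not the TSLU implementation.

import numpy as np
from scipy.linalg import lu

def pivot_rows(W, b):
    """Indices of the b rows of W chosen as pivots by GEPP (W = P·L·U)."""
    P, L, U = lu(W)
    perm = np.argmax(P, axis=0)       # perm[j] = original row moved to position j
    return perm[:b]

def tournament_pivoting(W, b, nblocks=4):
    """Binary-tree tournament: pick b candidate pivot rows per block,
    then repeatedly merge pairs of candidate sets and re-select."""
    m = W.shape[0]
    blocks = np.array_split(np.arange(m), nblocks)
    cand = [idx[pivot_rows(W[idx], b)] for idx in blocks]     # global row indices
    while len(cand) > 1:
        merged = []
        for a, c in zip(cand[0::2], cand[1::2]):
            idx = np.concatenate([a, c])
            merged.append(idx[pivot_rows(W[idx], b)])
        cand = merged
    return cand[0]                    # b pivot rows for the whole panel

if __name__ == "__main__":
    W = np.random.rand(1024, 8)
    print(tournament_pivoting(W, b=8))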

Exascale Machine Parameters
Source: DOE Exascale Workshop

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Heatmap: predicted speedup as a function of log2(p) (x-axis) and log2(n²/p) = log2(memory_per_proc) (y-axis)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Each row lists #Words Moved and #Messages, first for 2 levels of memory, then for multiple levels of memory:

BLAS-3:
  2 levels: Usual blocked or recursive algorithms
  Multiple levels: Usual blocked algorithms (nested), or recursive [Gustavson,97]

Cholesky:
  2 levels, words: LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]
  2 levels, messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
  Multiple levels: (←same), (←same)

LU with pivoting:
  2 levels, words: LAPACK (sometimes), [Toledo,97], [GDX 08]
  2 levels, messages: [GDX 08], partial pivoting
  Multiple levels: [Toledo, 97], [GDX 08]; [GDX 08]

QR, Rank-revealing:
  2 levels, words: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]
  2 levels, messages: [Frens,Wise,03] but 3x flops, [DGHL08]
  Multiple levels: [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]

Eig, SVD:
  2 levels, words: Not LAPACK; [BDD11] randomized, but more flops; [BDK11]
  2 levels, messages: [BDD11, BDK11]
  Multiple levels: [BDD11, BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) log P |

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n/P^(1/2)) log P
QR | ScaLAPACK | log P | (n/P^(1/2)) log P
Sym Eig, SVD | ScaLAPACK | log P | n/P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c*n²/P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / (c^(1/2) P^(1/2)) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c² polylog P – optimal!
LU | [DS11, SBD11] | polylog P | c² polylog P – optimal!
QR | Via Cholesky QR | polylog P | c² polylog P – optimal!
Sym Eig, SVD | ? | |
Nonsym Eig | ? | |

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b

[Figure sequence: one sweep of successive band reduction on a banded symmetric matrix.
Q1 eliminates a c-column, d-diagonal parallelogram of the band, creating a bulge of size (d+c) x (d+c);
Q2, Q3, Q4, Q5, … chase the bulge down the band until it falls off the end.
Only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA-SBR

Conventional: touch all data 4 times.
Communication-Avoiding: touch all data once.
Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
Right choices reduce #words_moved by factor M/bw, not just M^(1/2).

[Animations of the conventional and communication-avoiding sweeps]

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 1.7x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 1.2x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 2.5x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 3.0x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n³) matmul:        #words_moved = Ω( M (n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:    #words_moved = Ω( M (n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M (n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS:
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"
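For reference, a minimal sequential NumPy sketch of Strassen's 7-multiply recursion, the recursion that CAPS parallelizes by choosing a BFS or DFS step at each level. It assumes n is a power of 2 and is an illustration only, not the CAPS implementation.

import numpy as np

def strassen(A, B, base=64):
    """Strassen's recursion: 7 half-size multiplies instead of 8."""
    n = A.shape[0]
    if n <= base:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, base)
    M2 = strassen(A21 + A22, B11, base)
    M3 = strassen(A11, B12 - B22, base)
    M4 = strassen(A22, B21 - B11, base)
    M5 = strassen(A11 + A12, B22, base)
    M6 = strassen(A21 - A11, B11 + B12, base)
    M7 = strassen(A12 - A22, B21 + B22, base)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

if __name__ == "__main__":
    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(strassen(A, B), A @ B)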

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al, 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure, repeated over several slides: dependency diagram for computing x, A·x, A²·x, A³·x on entries 1, 2, 3, 4, …, 32 of a tridiagonal matrix]

• Sequential Algorithm: process the diagram in Steps 1, 2, 3, 4, reading each block of A and x once and computing all k powers for that block before moving on.

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous block of rows.
  • Each processor communicates once with neighbors
  • Each processor works on an (overlapping) trapezoid of the diagram

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
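A small NumPy sketch of the parallel matrix powers idea for a tridiagonal A: each block fetches k extra "ghost" entries of x from each neighbor once, then computes its piece of [Ax, A^2x, …, A^kx] with no further communication (redundant work near the block edges). This is an illustration under the slide's assumptions (well-partitioned, tridiagonal); the block layout and ghost-zone width are choices made here.

import numpy as np

def matrix_powers_block(A, x, lo, hi, k):
    """Compute rows lo:hi of [A·x, A²·x, ..., A^k·x] for tridiagonal A,
    using only the ghost region lo-k : hi+k of x (one exchange with neighbors)."""
    n = len(x)
    glo, ghi = max(0, lo - k), min(n, hi + k)       # ghost zone, fetched once
    Aloc = A[glo:ghi, glo:ghi]                      # local piece of A (plus ghosts)
    v = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        v = Aloc @ v                                # entries near ghost edges go stale,
        out.append(v[lo - glo:hi - glo].copy())     # but the owned rows stay correct
    return out                                      # k vectors of length hi-lo

if __name__ == "__main__":
    n, k, p = 32, 3, 4
    A = np.diag(2*np.ones(n)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
    x = np.random.rand(n)
    ref = [np.linalg.matrix_power(A, j) @ x for j in range(1, k+1)]
    for b in range(p):                              # each "processor" owns n/p rows
        lo, hi = b*n//p, (b+1)*n//p
        blk = matrix_powers_block(A, x, lo, hi, k)
        assert all(np.allclose(blk[j], ref[j][lo:hi]) for j in range(k))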

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^kb} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)            … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, A^kv ]
  [Q,R] = TSQR(W)             … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

Basis:            Naive    Monomial                               Newton        Chebyshev
Replacement Its.: 74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

Nonzero entries vs. Indices:
Indices \ Nonzero entries | Explicit (O(nnz))   | Implicit (o(nnz))
Explicit (O(nnz))         | CSR and variations  | Vision, climate, AMR, …
Implicit (o(nnz))         | Graph Laplacian     | Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit) all communication is for vectors

• Operations
  • [x, Ax, A²x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, A^kx] and [y, A^Ty, (A^T)²y, …, (A^T)^ky], or [y, A^TAy, (A^TA)²y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … inner b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
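A runnable sketch of the blocked loop nest above in Python/NumPy; the inner b x b multiply is delegated to NumPy rather than written as the three innermost scalar loops, and the block size b ≈ M^(1/2) is the tuning knob. Illustration only.

import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matmul: each (i1, j1, k1) step multiplies one
    b-by-b block of A by one b-by-b block of B, as in the pseudocode."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

if __name__ == "__main__":
    n, b = 256, 32                      # ideally b ~ sqrt(fast memory size M)
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)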

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
      Δ = 1  0  1    A
          0  1  1    B
          1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^Tx = 3/2 = s
• Thm: #words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
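A quick check of that LP with scipy.optimize.linprog (linprog minimizes, so the objective is negated); the Δ below is the matmul index matrix from the slide.

import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) appear in A(i,k), B(k,j), C(i,j).
Delta = np.array([[1, 0, 1],   # A
                  [0, 1, 1],   # B
                  [1, 1, 0]])  # C

# max 1^T x  s.t.  Delta x <= 1, x >= 0   (solved as min -1^T x)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # -> [0.5 0.5 0.5], s = 1.5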

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
      Δ = 1  0    F
          1  0    P(i)
          0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^Tx = 2 = s
• Thm: #words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
          1  0  1  0  0  1    A1
          1  1  0  1  0  0    A2
      Δ = 0  1  1  0  1  0    A3
          0  0  1  1  0  1    A3, A4
          0  1  0  0  0  1    A5
          1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^Tx = 15/7 = s
• Thm: #words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
    Access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
    #words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, dual of HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^Ts  s.t.  s^TΔ ≥ 1^T
    Dual-HBL-LP: maximize 1^Tx  s.t.  Δx ≤ 1
  Then for sequential algorithm, tile ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework!

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)


Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 23: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Recursive Matrix Multiplication (RMM) (22)

23

func C = RMM(A, B, n)
   if n = 1, C = A * B
   else
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
   return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast and slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious" works for memory hierarchies, but is not a panacea

For big speedups, see SC12 poster on "Beating MKL and ScaLAPACK at Rectangular Matmul", 11/13 at 5:15-7pm
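A minimal NumPy sketch of the recursive algorithm above (assuming n is a power of two; names are illustrative, not the tutorial's code):

import numpy as np

def rmm(A, B):
    # Recursive (cache-oblivious) matmul: same 2n^3 flops, different order.
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    # Each quadrant of C is the sum of two half-size recursive products.
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(rmm(A, B), A @ B)
# In practice the recursion is cut off at a base block size that fits in cache.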

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
• # words_moved = Ω( # flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
• # messages    = Ω( # flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P(i,j)
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)·B(k,j)
   • summation over submatrices
   • Need not be square processor grid

[Figure: C(i,j) accumulates the products of column panel A(i,k) and row panel B(k,j)]

29

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b - 1
   for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
   for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using binary tree)
   Receive A(i,k) into Acol
   Receive B(k,j) into Brow
   C_myproc = C_myproc + Acol * Brow

30
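The loop structure above can be checked with a small sequential NumPy simulation (panel width b is a tuning parameter; this ignores the processor grid and only shows the sum of rank-b outer products that the broadcast steps contribute):

import numpy as np

def summa_panels(A, B, b):
    # Sequential model of SUMMA: C is built as a sum of rank-b products,
    # one per broadcast step k; on a real grid each processor keeps only its block.
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]      # column panel broadcast along processor rows
        Brow = B[k:k+b, :]      # row panel broadcast along processor columns
        C += Acol @ Brow
    return C

A = np.random.rand(64, 64); B = np.random.rand(64, 64)
assert np.allclose(summa_panels(A, B, 8), A @ B)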

SUMMA Communication Costs

For k = 0 to n/b - 1
   for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      … # words = log P^(1/2) · b·n/P^(1/2),  # messages = log P^(1/2)
   for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      … same # words and # messages
   Receive A(i,k) into Acol
   Receive B(k,j) into Brow
   C_myproc = C_myproc + Acol * Brow

° Total # words = log P · n^2 / P^(1/2)
° Within factor of log P of lower bound
° (more complicated implementation removes log P factor)
° Total # messages = log P · n/b
° Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)

31

Can we do better

• Lower bound assumed 1 copy of data: M = O(n^2/P) per proc.
• What if matrix small enough to fit c > 1 copies, so M = c·n^2/P ?
   – words_moved = Ω( # flops / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
   – messages    = Ω( # flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed (i,j,k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
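A toy NumPy model of steps (1)-(3), with the c layers simulated as a list (a real implementation would use MPI broadcasts and a reduce along the k-axis; names here are illustrative):

import numpy as np

def matmul_2_5d_layers(A, B, c):
    # Each of the c layers computes 1/c of the summation over m (step 2);
    # the final sum over layers plays the role of the reduction along the k-axis (step 3).
    n = A.shape[0]
    w = n // c
    partial = [A[:, k*w:(k+1)*w] @ B[k*w:(k+1)*w, :] for k in range(c)]
    return sum(partial)             # sum-reduce along k-axis

A = np.random.rand(32, 32); B = np.random.rand(32, 32)
assert np.allclose(matmul_2_5d_layers(A, B, 2), A @ B)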

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
   – γT, βT, αT = secs per flop, per word_moved, per message of size m
   T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
   – γE, βE, αE = joules for same operations
   – δE = joules per word of memory used per sec
   – εE = joules per sec for leakage, etc.
   E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
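As a sanity check, the model can be coded directly; the grouping of the energy terms is reconstructed as noted above, and any parameter values plugged in would be placeholders, not measurements:

def T(P, n, M, m, gamma_t, beta_t, alpha_t):
    # Time model from the slide: flops + bandwidth + latency terms.
    return (n**3 / P) * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))

def E(P, n, M, m, g_e, b_e, a_e, d_e, e_e, g_t, b_t, a_t):
    # Energy model: per-processor dynamic terms plus memory and leakage terms,
    # summed over all P processors (term grouping reconstructed, see text above).
    t = T(P, n, M, m, g_t, b_t, a_t)
    per_proc = (n**3 / P) * (g_e + b_e / M**0.5 + a_e / (m * M**0.5)) + d_e * M * t + e_e * t
    return P * per_proc

# With per-processor memory M held fixed, T(c*P, ...) == T(P, ...)/c
# while E(c*P, ...) == E(P, ...): perfect strong scaling in time and energy.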

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                        … scale a vector
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)     … rank-1 update

… i.e., for i = 1 to n-1: update column i, then update trailing matrix
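A direct NumPy transcription of the loop above (no pivoting, illustration only; assumes no zero pivots are encountered):

import numpy as np

def lu_nopivot(A):
    # Overwrites a copy of A with L (strictly below diagonal, unit diagonal implied) and U (upper triangle).
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                  # scale a vector (column of L)
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])      # rank-1 update of trailing matrix
    return A

A = np.random.rand(5, 5) + 5 * np.eye(5)   # diagonally dominant, so pivots stay nonzero
LU = lu_nopivot(A)
L = np.tril(LU, -1) + np.eye(5); U = np.triu(LU)
assert np.allclose(L @ U, A)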

Communication in sequentialOne-sided Factorizations (LU QR hellip)

• Naive Approach:
   for i = 1 to n-1
      update column i
      update trailing matrix
  # words_moved = O(n^3)

40

• Blocked Approach (LAPACK):
   for i = 1 to n/b - 1
      update block i of b columns
      update trailing matrix
  # words_moved = O(n^3 / M^(1/3))

• Recursive Approach:
   func factor(A)
      if A has 1 column, update it
      else
         factor(left half of A)
         update right half of A
         factor(right half of A)
  # words_moved = O(n^3 / M^(1/2))

• None of these approaches
   • minimizes # messages
   • handles eig() or svd()
   • works in parallel

• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

45

W (n x b) = [ W1 ; W2 ; W3 ; W4 ],  each Wi = Pi·Li·Ui
   Choose b pivot rows of W1, call them W1'
   Choose b pivot rows of W2, call them W2'
   Choose b pivot rows of W3, call them W3'
   Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12  →  choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34  →  choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234  →  choose final b pivot rows

• Go back to W and use these b pivot rows
   • Move them to top, do LU without pivoting
   • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
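A rough NumPy/SciPy sketch of one tournament round: pick b candidate pivot rows from each block with ordinary partial pivoting, then play the winners off against each other (illustrative only; the real TSLU applies this up a reduction tree and then redoes the panel LU on the chosen rows):

import numpy as np
from scipy.linalg import lu

def candidate_rows(block, b):
    # Rows that LU with partial pivoting selects as pivots for this block.
    P, L, U = lu(block)             # block = P @ L @ U
    perm = np.argmax(P, axis=0)     # perm[j] = original row placed in pivoted position j
    return perm[:b]

def tournament_pivot_rows(W, b, nblocks=4):
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    # Round 1: b candidate rows from each block (indices into W).
    cands = [blk[candidate_rows(W[blk], b)] for blk in blocks]
    # Final round: stack all candidates and pick the b winners.
    stacked = np.concatenate(cands)
    return stacked[candidate_rows(W[stacked], b)]

W = np.random.rand(1000, 8)
print(tournament_pivot_rows(W, 8))   # indices of 8 pivot rows chosen for the panel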

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: contour plot of predicted speedup; horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

# words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
# messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

# words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
# messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply   [Cannon, 69]   1
Cholesky          ScaLAPACK      log P
LU                ScaLAPACK      log P
QR                ScaLAPACK      log P
Sym Eig, SVD      ScaLAPACK      log P
Nonsym Eig        ScaLAPACK      P^(1/2) · log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

# words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
# messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply   [Cannon, 69]   1                 1
Cholesky          ScaLAPACK      log P             log P
LU                ScaLAPACK      log P             (n/P^(1/2)) · log P
QR                ScaLAPACK      log P             (n/P^(1/2)) · log P
Sym Eig, SVD      ScaLAPACK      log P             n/P^(1/2)
Nonsym Eig        ScaLAPACK      P^(1/2) · log P   n · log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

# words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
# messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply   [Cannon, 69]   1       1
Cholesky          ScaLAPACK      log P   log P
LU                [GDX10]        log P   log P
QR                [DGHL08]       log P   log^3 P
Sym Eig, SVD      [BDD11]        log P   log^3 P
Nonsym Eig        [BDD11]        log P   log^3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

# words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) · P^(1/2)) )
# messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply           [DS11, SBD11]      polylog P   polylog P
Cholesky                  [SD11, in prog.]   polylog P   c^2 polylog P – optimal
LU                        [DS11, SBD11]      polylog P   c^2 polylog P – optimal
QR                        Via Cholesky QR    polylog P   c^2 polylog P – optimal
Sym Eig, SVD, Nonsym Eig  ?

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
   – A = A^T is banded
   – T tridiagonal
   – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
   – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
   – It can communicate even less

[Figure: symmetric matrix with bandwidth b+1]

Successive Band Reduction

b = bandwidth, c = # columns, d = # diagonals; constraint: c + d ≤ b

[Animation frames 1-6 consolidated: Q1 zeroes out a c-column-wide, d-diagonal-deep parallelogram just below the band; applying Q1 from the other side creates a (d+c) x (d+c) bulge further down the band; Q2, Q2^T, Q3, Q3^T, Q4, Q4^T, Q5, Q5^T, … chase the bulge off the bottom of the matrix, and the sweep repeats]

Only need to zero out the leading parallelogram of each trapezoid

Conventional vs CA-SBR

Conventional: touch all data 4 times
Communication-Avoiding: touch all data once

Many tuning parameters: # of "sweeps", # diagonals cleared per sweep, sizes of parallelograms, # bulges chased at one time, how far to chase each bulge
Right choices reduce words_moved by factor M/bw, not just M^(1/2)

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
   – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

BFS step vs DFS step:
   BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
   DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication-Avoiding Parallel Strassen (CAPS):
   if EnoughMemory and P ≥ 7
      then BFS step
      else DFS step
   end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure: vectors x, A·x, A^2·x, A^3·x drawn as rows, components 1, 2, 3, 4, …, 32 as columns; arrows show which entries each new entry depends on]

• Sequential Algorithm (animation Steps 1, 2, 3, 4 consolidated): process the columns in blocks from left to right; each step reads the part of A and x it needs once and computes its piece of all k vectors, so the matrix and vectors move between slow and fast memory only once

• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4):
   • Each processor communicates once with neighbors
   • Each processor then works on an (overlapping) trapezoid of the dependency region

• Same idea works for general sparse matrices
   • Simple block-row partitioning → (hyper)graph partitioning
   • Top-to-bottom processing → Traveling Salesman Problem
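A sketch of the parallel idea for the tridiagonal example, in NumPy: each "processor" grabs its rows of x plus k ghost rows from each neighbor up front, then computes its piece of all k vectors with no further communication (purely illustrative; the ghost entries carry redundant work):

import numpy as np
from scipy.sparse import diags

n, k, p = 32, 3, 4                       # matrix size, number of powers, number of "processors"
A = diags([1.0, 2.0, 1.0], [-1, 0, 1], shape=(n, n)).tocsr()   # tridiagonal example
x = np.random.rand(n)
blk = n // p

V = np.zeros((k + 1, n))                 # rows will hold x, A·x, A^2·x, A^3·x
for proc in range(p):
    lo, hi = proc * blk, (proc + 1) * blk
    glo, ghi = max(0, lo - k), min(n, hi + k)    # one-time exchange of k "ghost" rows per side
    W = np.zeros((k + 1, ghi - glo))
    W[0] = x[glo:ghi]
    for j in range(1, k + 1):
        W[j] = A[glo:ghi, glo:ghi] @ W[j - 1]    # local SpMV: ghost entries go stale,
                                                 # but the owned entries [lo, hi) stay exact
    V[:, lo:hi] = W[:, lo - glo:hi - glo]

ref = [x]
for j in range(k):
    ref.append(A @ ref[-1])
assert np.allclose(V, np.array(ref))     # each processor got its rows of all k vectors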

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                … SpMV
      MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^kv ]
   [Q,R] = TSQR(W)                  … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost!
88

Sequential case: # words moved decreases by a factor of k
Parallel case: # messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
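A small NumPy experiment showing why the basis choice matters: the monomial basis [x, Ax, …, A^kx] becomes numerically rank-deficient (it is the power method in disguise), while a shifted, Newton-like basis stays far better conditioned; the matrix and shifts below are placeholders, not the tuned values from the papers:

import numpy as np

n, k = 200, 12
A = np.diag(np.linspace(1, 10, n))          # toy SPD matrix with known eigenvalue range
x = np.random.rand(n)

def basis(shifts):
    V = [x / np.linalg.norm(x)]
    for s in shifts:
        w = A @ V[-1] - s * V[-1]            # next vector = (A - s·I) · previous vector
        V.append(w / np.linalg.norm(w))
    return np.column_stack(V)

mono   = basis([0.0] * k)                    # monomial basis: all shifts zero
newton = basis(np.linspace(1, 10, k))        # crude Newton-like shifts spread over the spectrum

print(np.linalg.cond(mono), np.linalg.cond(newton))
# cond(mono) is typically many orders of magnitude larger than cond(newton),
# which is why CA-GMRES replaces [x, Ax, ..., A^k x] by [p1(A)x, ..., pk(A)x].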

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Basis:               Naive      Monomial                               Newton        Chebyshev
Replacement Its.:    74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

                            Nonzero entries:
Indices:                    Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))         CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))         Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
   • Explicit indices or nonzero entries cause most communication, along with vectors
   • Ex: With stencils (all implicit) all communication is for vectors
• Operations
   • [x, Ax, A^2x, …, A^kx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
   • Number of columns in x
   • [x, Ax, A^2x, …, A^kx ] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky ], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky ]
   • return all vectors or just last one
• Cotuning and/or interleaving
   • W = [x, Ax, A^2x, …, A^kx ] and TSQR(W) or W^TW or …
   • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmul
• Naïve code:
   for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
   for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
         i = i1+i2, j = j1+j2, k = k1+k2
         C(i,j) += A(i,k)*B(k,j)            … b x b matmul on blocks
• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
      Δ = [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x  s.t.  Δ·x ≤ 1
   – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
   Attained by block sizes M^(xi), M^(xj), M^(xk) = M^(1/2), M^(1/2), M^(1/2)
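The LP is tiny and can be checked directly with SciPy (linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) each array reference touches.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  subject to  Delta x <= 1  (and x >= 0)
res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
print(res.x, -res.fun)   # -> [0.5 0.5 0.5], s = 1.5, so words_moved = Ω(n^3 / M^(1/2))

# The same recipe with the N-body Delta [[1,0],[1,0],[0,1]] gives x = [1,1], s = 2.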

New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
      Δ = [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T·x  s.t.  Δ·x ≤ 1
   – Result: x = [1, 1], 1^T·x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
   Attained by block sizes M^(xi), M^(xj) = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
      Δ = [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
          [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T·x  s.t.  Δ·x ≤ 1
   – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
   Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
     for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
        D(something else) = func(something else)
        …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
        φC(i1,i2,…,ik) = (i1+2*i3-i7)
        φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
        …
• Can we bound # loop_iterations = # points in S, given bounds on # points in its images φC(S), φA(S), … ?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
     for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize sHBL = Σj sj subject to HBL-LP. Then
     words_moved = Ω( # iterations / M^(sHBL - 1) )
   – Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
     |S| ≤ Πj |φj(S)|^(sj)

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, dual of HBL-LP gives optimal tile sizes:
     HBL-LP:      minimize 1^T·s  s.t.  s^T·Δ ≥ 1^T
     Dual-HBL-LP: maximize 1^T·x  s.t.  Δ·x ≤ 1
  Then, for sequential algorithm, tile loop index i_j by M^(xj)
• Ex: Matmul: s = [ 1/2, 1/2, 1/2 ]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
   – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
   – Prerecorded version planned in Spring 2013
• www.xsede.org
   – Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 24: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

How hard is hand-tuning matmul anyway

24

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm         Reference          Factor exceeding lower bound    Factor exceeding lower bound
                                     for #words_moved                for #messages
Matrix Multiply   [DS11, SBD11]      polylog P                       polylog P
Cholesky          [SD11, in prog.]   polylog P                       c^2 polylog P – optimal
LU                [DS11, SBD11]      polylog P                       c^2 polylog P – optimal
QR                Via Cholesky QR    polylog P                       c^2 polylog P – optimal
Sym Eig, SVD
Nonsym Eig
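To make the bounds concrete, here is a small illustrative Python sketch (not from the tutorial) that plugs sample values of n, P, and c into the two formulas above; the machine sizes are made up:

    import math

    def lower_bounds(n, P, c=1):
        """Asymptotic communication lower bounds (constants omitted) for
        classical O(n^3) dense linear algebra on P processors, each holding
        M = c*n^2/P words of memory."""
        M = c * n**2 / P
        flops_per_proc = n**3 / P
        words = flops_per_proc / math.sqrt(M)   # = n^2 / sqrt(c*P)
        messages = flops_per_proc / M**1.5      # = sqrt(P) / c^1.5
        return words, messages

    # Example with hypothetical sizes: n = 2^17, P = 2^20 nodes
    for c in (1, 4, 16):
        w, m = lower_bounds(n=2**17, P=2**20, c=c)
        print(f"c={c:2d}: words_moved >= {w:.3e}, messages >= {m:.3e}")

Increasing the replication factor c lowers both bounds, which is exactly what the 2.5D algorithms in the table exploit.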

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

[Figure sequence: Successive Band Reduction. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Successive orthogonal transformations Q1, Q1^T, Q2, Q2^T, Q3, Q3^T, … act on blocks of the band (dimensions labeled b+1, d+1, c, d+c) in steps 1 through 6; only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce #words_moved by factor M/bw, not just M^(1/2).

Speedups of Sym Band Reductionvs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
    #words_moved = Ω( M · (n/M^(1/2))^3 / P )
Strassen's O(n^(lg 7)) matmul:
    #words_moved = Ω( M · (n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:
    #words_moved = Ω( M · (n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"
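As an illustration of that tuning choice, here is a hypothetical Python sketch that plans a BFS/DFS interleaving from the rule above; the memory model (3 matrices of n^2/P words per processor, a BFS step multiplying the per-processor need by 7/4 and a DFS step dividing it by 4) is a simplifying assumption for illustration, not the actual CAPS implementation:

    def caps_schedule(n, P, mem_words, min_n=1024):
        """Plan a sequence of BFS/DFS steps for Strassen on P processors.

        Illustrative only: tracks an approximate per-processor working set
        (3 matrices of n^2/P words) and applies the rule
        'if enough memory and P >= 7 then BFS else DFS'.
        """
        schedule = []
        need = 3 * n * n / P                 # rough per-processor working set
        while n > min_n:
            if P >= 7 and 7 * need / 4 <= mem_words:
                schedule.append("BFS")       # 7 subproblems in parallel on P/7 procs
                P //= 7
                need *= 7 / 4
            else:
                schedule.append("DFS")       # 7 subproblems one after another on all P procs
                need /= 4
            n //= 2                          # each Strassen level halves the dimension
        return schedule

    # Example: hypothetical machine with 2^28 words per processor
    print(caps_schedule(n=94080, P=7**4, mem_words=2**28))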

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32]
• Works for any "well-partitioned" A
• Sequential Algorithm: proceeds in Steps 1, 2, 3, 4, one block of the figure at a time
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on (overlapping) trapezoid

Same idea works for general sparse matrices

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem
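As a concrete illustration of the parallel version of this kernel, here is a small Python/NumPy sketch for the tridiagonal example (an illustrative toy, not the tutorial's code; the block size, function names, and self-check are assumptions): each "processor" copies k ghost entries from each neighbor once, then computes its rows of Ax, A^2x, …, A^kx with no further communication, at the price of some redundant flops near its boundary.

    import numpy as np

    def tridiag_matvec(d, e, x):
        """y = A*x for symmetric tridiagonal A with diagonal d, off-diagonal e."""
        y = d * x
        y[:-1] += e * x[1:]
        y[1:] += e * x[:-1]
        return y

    def local_matrix_powers(d, e, x, lo, hi, k):
        """Rows lo:hi of [A*x, A^2*x, ..., A^k*x] for tridiagonal A.
        Mimics one processor: fetch k ghost entries on each side once
        (one message per neighbor), then work independently."""
        glo, ghi = max(lo - k, 0), min(hi + k, len(x))   # overlapping trapezoid
        xs = x[glo:ghi].copy()
        dd, ee = d[glo:ghi], e[glo:ghi - 1]
        out = []
        for _ in range(k):                               # redundant flops near edges
            xs = tridiag_matvec(dd, ee, xs)
            out.append(xs[lo - glo:hi - glo].copy())
        return out

    # Toy check: n=32, k=3, 4 "processors" of 8 rows each
    n, k, rng = 32, 3, np.random.default_rng(0)
    d, e, x = rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n)
    blocks = [local_matrix_powers(d, e, x, p * 8, (p + 1) * 8, k) for p in range(4)]
    full = x.copy()
    for _ in range(k):
        full = tridiag_matvec(d, e, full)
    assert np.allclose(blocks[0][k - 1], full[0:8])      # local A^3*x matches global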

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax − b ||_2

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)                  … SpMV
        MGS(w, v(0), …, v(i-1))         … update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)                     … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words_moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
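A few lines of NumPy sketch how the CA-GMRES inner loop is regrouped (illustrative only; the function name, the stand-in np.linalg.qr for TSQR, and the optional Newton-basis shifts are assumptions, not the tutorial's implementation):

    import numpy as np

    def ca_gmres_basis(A, v, k, shifts=None):
        """Build the s-step basis W = [v, p1(A)v, ..., pk(A)v] and factor it.
        shifts=None gives the monomial basis [v, Av, ..., A^k v] (the variant
        that loses precision); a Newton basis would use shifts chosen from
        Ritz values: p_{j+1}(A)v = (A - shifts[j] I) p_j(A)v."""
        n = len(v)
        W = np.empty((n, k + 1))
        W[:, 0] = v
        for j in range(k):
            W[:, j + 1] = A @ W[:, j]
            if shifts is not None:
                W[:, j + 1] -= shifts[j] * W[:, j]
        # One QR of the tall-skinny block replaces k rounds of MGS;
        # in parallel this would be TSQR, here np.linalg.qr stands in for it.
        Q, R = np.linalg.qr(W)
        return Q, R    # H is then assembled from R (and the shifts)

    # Toy usage on a small dense matrix standing in for a sparse A
    rng = np.random.default_rng(1)
    A = np.diag(rng.uniform(1, 2, 100)) + 0.01 * rng.standard_normal((100, 100))
    Q, R = ca_gmres_basis(A, rng.standard_normal(100), k=8)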

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

Basis:              Naive     Monomial                                Newton         Chebyshev
Replacement Its.:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

                                    Indices explicit (O(nnz))    Indices implicit (o(nnz))
Nonzero entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit (o(nnz)):  Graph Laplacian              Stencils

• Classification of sparse operators for avoiding communication
• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul
• Naive code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
       for i2 = 0:b-1,  for j2 = 0:b-1,  for k2 = 0:b-1
          i = i1+i2,  j = j1+j2,  k = k1+k2
          C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
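For concreteness, a runnable NumPy rendering of the blocked loop nest above, purely illustrative (the block size b is a made-up stand-in for M^(1/2) on a real machine):

    import numpy as np

    def blocked_matmul(A, B, b):
        """C = A*B computed one b-by-b block of A, B, C at a time, so each
        inner block multiply touches about 3*b^2 words of fast memory."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    # Quick check against the library routine
    n, b = 256, 32                      # b ~ sqrt(M) for fast-memory size M (illustrative)
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)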

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

                i  j  k
          A  [  1  0  1 ]
    Δ =   B  [  0  1  1 ]
          C  [  1  1  0 ]

• Solve LP for x = [xi, xj, xk]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^Tx = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

                  i  j
          F    [  1  0 ]
    Δ =   P(i) [  1  0 ]
          P(j) [  0  1 ]

• Solve LP for x = [xi, xj]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^Tx = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
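These two little LPs can be solved mechanically; the sketch below (illustrative, assuming SciPy is available) feeds each Δ to scipy.optimize.linprog, which maximizes 1^T x subject to Δx ≤ 1 by minimizing −1^T x and recovers the exponents quoted above:

    import numpy as np
    from scipy.optimize import linprog

    def hbl_exponents(delta):
        """Solve max 1^T x s.t. delta @ x <= 1, x >= 0 (the dual HBL LP);
        returns the optimal tile exponents x and s = sum(x)."""
        m = delta.shape[1]
        res = linprog(c=-np.ones(m), A_ub=delta, b_ub=np.ones(delta.shape[0]),
                      bounds=[(0, None)] * m, method="highs")
        return res.x, -res.fun

    # Matmul: rows A, B, C over loop indices (i, j, k)
    print(hbl_exponents(np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])))  # x=[.5,.5,.5], s=1.5
    # Direct N-body: rows F, P(i), P(j) over loop indices (i, j)
    print(hbl_exponents(np.array([[1, 0], [1, 0], [0, 1]])))           # x=[1,1],     s=2

Swapping in any other Δ (e.g. the one on the "Random Code" slide below) reproduces the corresponding exponents.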

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

                  i1 i2 i3 i4 i5 i6
          A1   [  1  0  1  0  0  1 ]
          A2   [  1  1  0  1  0  0 ]
    Δ =   A3   [  0  1  1  0  1  0 ]
          A3,A4[  0  0  1  1  0  1 ]
          A5   [  0  1  0  0  0  1 ]
          A6   [  1  0  0  1  1  0 ]

• Solve LP for x = [x1, …, x6]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^Tx = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
       C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
       C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
       D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k,  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by φ_j, choose s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    #words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
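For the matmul projections, where each φ_j keeps a subset of the indices and s_j = 1/2, this is the classical Loomis-Whitney inequality; a toy numerical check (an illustration of mine, not from the tutorial) that |S| ≤ (|φ_C(S)|·|φ_A(S)|·|φ_B(S)|)^(1/2) for a random set of lattice points:

    import math, random

    random.seed(0)
    # Random set S of (i,j,k) lattice points in a 6x6x6 box
    S = {(random.randrange(6), random.randrange(6), random.randrange(6)) for _ in range(80)}

    proj_C = {(i, j) for (i, j, k) in S}   # phi_C keeps (i,j)
    proj_A = {(i, k) for (i, j, k) in S}   # phi_A keeps (i,k)
    proj_B = {(k, j) for (i, j, k) in S}   # phi_B keeps (k,j)

    bound = math.sqrt(len(proj_C) * len(proj_A) * len(proj_B))
    print(len(S), "<=", bound)             # always holds; tight when S is a brick
    assert len(S) <= bound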

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm: (bad news) Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm: (good news) Can write it down explicitly in many cases of interest (e.g. all φ_j = subset of indices)
  – Thm: (good news) Easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 25: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

How hard is hand-tuning matmul anyway

25

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 26: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Performance plots, with annotations: 12x faster, 2.7x faster]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n³/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
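A small sketch that simply evaluates the T(cP) and E(cP) formulas above to show the claim numerically: with M held fixed per processor, runtime drops like 1/c while energy stays constant. All machine parameters below are placeholder values, not measured constants.

```python
# Evaluate the perfect-strong-scaling timing and energy models.
def time_model(n, P, c, M, m, gamma_t, beta_t, alpha_t):
    flops = n**3 / (c * P)
    return flops * (gamma_t + beta_t / M**0.5 + alpha_t / (m * M**0.5))

def energy_model(n, P, c, M, m, T, gamma_e, beta_e, alpha_e, delta_e, eps_e):
    flops = n**3 / (c * P)
    per_proc = flops * (gamma_e + beta_e / M**0.5 + alpha_e / (m * M**0.5))
    return c * P * (per_proc + delta_e * M * T + eps_e * T)

n, P, M, m = 2**16, 2**10, 2**22, 2**10     # illustrative sizes only
for c in (1, 2, 4, 8):
    T = time_model(n, P, c, M, m, gamma_t=1e-9, beta_t=1e-8, alpha_t=1e-6)
    E = energy_model(n, P, c, M, m, T, gamma_e=1e-9, beta_e=1e-8, alpha_e=1e-6,
                     delta_e=1e-12, eps_e=1e-3)
    print(f"c={c}: T={T:.3e} s (drops like 1/c), E={E:.3e} J (roughly constant)")
```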

Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n = 64K

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm.
    Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency
    (GFLOPS/W), can we determine a set of architectural parameters to describe a
    conforming computer architecture?

39

Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimination, A = LU

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) * ( 1 / A(i,i) )                       … scale a vector
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)    … rank-1 update

i.e., for i = 1 to n-1: update column i, then update the trailing matrix
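A runnable NumPy version of the naïve (unpivoted) LU loop above, for illustration only: column scaling followed by a rank-1 update of the trailing matrix. Real codes add pivoting for stability; the diagonally dominant test matrix below is an assumption made so that no pivoting is needed.

```python
# Naïve LU (no pivoting): scale a column, then a rank-1 trailing update.
import numpy as np

def naive_lu(A):
    A = A.copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                                 # scale a vector
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # rank-1 update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.random.rand(6, 6) + 6 * np.eye(6)   # diagonally dominant => unpivoted LU is safe
L, U = naive_lu(A)
assert np.allclose(L @ U, A)
```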

Communication in sequential One-sided Factorizations (LU, QR, …)

• Naive Approach:
    for i = 1 to n-1
        update column i
        update trailing matrix
  • #words_moved = O(n³)

40

• Blocked Approach (LAPACK):
    for i = 1 to n/b - 1
        update block i of b columns
        update trailing matrix
  • #words_moved = O(n³/M^(1/3))

• Recursive Approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  • #words_moved = O(n³/M^(1/2))

• None of these approaches
  • minimizes #messages
  • handles eig() or svd()
  • works in parallel
• Need more ideas!

TSQR: QR of a Tall, Skinny matrix

41

W = [ W0 ]   [ Q00·R00 ]   [ Q00              ]   [ R00 ]
    [ W1 ] = [ Q10·R10 ] = [      Q10         ] · [ R10 ]
    [ W2 ]   [ Q20·R20 ]   [           Q20    ]   [ R20 ]
    [ W3 ]   [ Q30·R30 ]   [               Q30]   [ R30 ]

[ R00 ]   [ Q01·R01 ]   [ Q01      ]   [ R01 ]
[ R10 ] = [          ] = [         ] · [     ]
[ R20 ]   [ Q11·R11 ]   [      Q11 ]   [ R11 ]
[ R30 ]

[ R01 ] = Q02·R02
[ R11 ]

TSQR: QR of a Tall, Skinny matrix

42

[Same reduction tree as above, carried to completion]

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
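A sequential NumPy sketch of the parallel TSQR reduction tree drawn above: local QRs of the four row blocks, pairwise QRs of stacked R factors, and a final QR at the root. Only the R factors are combined here; the Q factors are kept implicitly, as in the slide's output list. The block count p is an illustrative choice.

```python
# TSQR reduction tree, returning only the final R factor.
import numpy as np

def tsqr_R(W, p=4):
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]            # leaf QRs: R00..R30
    while len(Rs) > 1:                                     # pairwise combines: R01, R11, ...
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]                                           # root: R02

W = np.random.rand(1000, 8)
R = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]
# R factors agree up to the signs of their rows.
assert np.allclose(np.abs(R), np.abs(R_ref))
```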

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees over W = [W0; W1; W2; W3]]
  Parallel:   binary tree — R00, R10, R20, R30 → R01, R11 → R02
  Sequential: flat (left-to-right) tree — R00 → R01 → R02 → R03
  Dual Core:  hybrid of the two — R00, R01 → R01, R11 → R02, R11 → R03

Can choose the reduction tree dynamically:
Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    – Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – 1.6x slower than accessing data twice (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

44

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Back to LU: Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

45

W(n x b) = [ W1 ]   [ P1·L1·U1 ]    Choose b pivot rows of W1, call them W1'
           [ W2 ] = [ P2·L2·U2 ]    Choose b pivot rows of W2, call them W2'
           [ W3 ]   [ P3·L3·U3 ]    Choose b pivot rows of W3, call them W3'
           [ W4 ]   [ P4·L4·U4 ]    Choose b pivot rows of W4, call them W4'

[ W1' ]   [ P12·L12·U12 ]    Choose b pivot rows, call them W12'
[ W2' ] =
[ W3' ]   [ P34·L34·U34 ]    Choose b pivot rows, call them W34'
[ W4' ]

[ W12' ] = P1234·L1234·U1234    Choose b pivot rows
[ W34' ]

• Go back to W and use these b pivot rows
  • Move them to the top, do LU without pivoting
  • Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
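A NumPy/SciPy sketch of the tournament in the picture above: each block nominates b candidate pivot rows via LU with partial pivoting, and winners are combined pairwise up the tree. This only illustrates how the b pivot rows are chosen; the final "move to top, LU without pivoting" step is omitted, and the block count p is an assumption for illustration.

```python
# Tournament pivoting: pick b pivot rows of a tall-skinny W via a reduction tree.
import numpy as np
from scipy.linalg import lu

def candidate_rows(rows, W, b):
    """Pick b pivot-row indices (into W) from the given candidate set, using GEPP."""
    piv, _, _ = lu(W[rows, :])             # piv is a permutation matrix
    order = np.argmax(piv, axis=0)         # candidate rows in pivot order
    return [rows[i] for i in order[:b]]

def tournament_pivot_rows(W, b, p=4):
    groups = [list(idx) for idx in np.array_split(np.arange(W.shape[0]), p)]
    winners = [candidate_rows(g, W, b) for g in groups]          # W1'..W4'
    while len(winners) > 1:                                      # pairwise rounds
        winners = [candidate_rows(winners[i] + winners[i+1], W, b)
                   for i in range(0, len(winners), 2)]
    return winners[0]                                            # final b pivot rows

W = np.random.rand(64, 4)
print("chosen pivot rows:", tournament_pivot_rows(W, b=4))
```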

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Speedup contour plot: horizontal axis log2(p), vertical axis log2(n²/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  • Cache-oblivious are underlined, Green are ours, ? is unknown / future work

Algorithm          | 2 Levels of Memory: #Words Moved / #Messages                 | Multiple Levels of Memory: #Words Moved / #Messages
BLAS-3             | Usual blocked or recursive algorithms                         | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky           | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (←same) / (←same)
LU with pivoting   | LAPACK (sometimes), [Toledo,97], [GDX 08] / [GDX 08] partial pivoting | [Toledo,97], [GDX 08] / [GDX 08]
QR, Rank-revealing | LAPACK (sometimes), [ElmrothGustavson,98], [DGHL08] / [FrensWise,03] but 3x flops, [DGHL08] | [ElmrothGustavson,98], [DGHL08] / [FrensWise,03], [DGHL08]
Eig, SVD           | Not LAPACK; [BDD11] randomized, but more flops; [BDK11] / [BDD11,BDK11] | [BDD11,BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm         Reference     Factor exceeding lower       Factor exceeding lower
                                bound for #words_moved       bound for #messages
Matrix Multiply
Cholesky
LU
QR
Sym Eig, SVD
Nonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm         Reference      Factor exceeding lower       Factor exceeding lower
                                 bound for #words_moved       bound for #messages
Matrix Multiply   [Cannon, 69]   1
Cholesky          ScaLAPACK      log P
LU                ScaLAPACK      log P
QR                ScaLAPACK      log P
Sym Eig, SVD      ScaLAPACK      log P
Nonsym Eig        ScaLAPACK      P^(1/2) · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm         Reference      Factor exceeding lower       Factor exceeding lower
                                 bound for #words_moved       bound for #messages
Matrix Multiply   [Cannon, 69]   1                            1
Cholesky          ScaLAPACK      log P                        log P
LU                ScaLAPACK      log P                        ( n log P ) / P^(1/2)
QR                ScaLAPACK      log P                        ( n log P ) / P^(1/2)
Sym Eig, SVD      ScaLAPACK      log P                        n / P^(1/2)
Nonsym Eig        ScaLAPACK      P^(1/2) · log P              n · log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors (better)
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm         Reference      Factor exceeding lower       Factor exceeding lower
                                 bound for #words_moved       bound for #messages
Matrix Multiply   [Cannon, 69]   1                            1
Cholesky          ScaLAPACK      log P                        log P
LU                [GDX10]        log P                        log P
QR                [DGHL08]       log P                        log³ P
Sym Eig, SVD      [BDD11]        log P                        log³ P
Nonsym Eig        [BDD11]        log P                        log³ P

Can we do even better?

• Assume nxn matrices on P processors
• Use c copies of data: M = O(c·n²/P) per processor
• Increasing M reduces the lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n² / (c^(1/2) · P^(1/2)) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm         Reference           Factor exceeding lower      Factor exceeding lower
                                      bound for #words_moved      bound for #messages
Matrix Multiply   [DS11, SBD11]       polylog P                   polylog P
Cholesky          [SD11, in prog.]    polylog P                   c² polylog P – optimal!
LU                [DS11, SBD11]       polylog P                   c² polylog P – optimal!
QR                Via Cholesky QR     polylog P                   c² polylog P – optimal!
Sym Eig, SVD
Nonsym Eig

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·QT = T, where
  – A = AT is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying
  orthogonal transformations
  – It can communicate even less

[Figure: a symmetric band matrix with band width b+1]

Successive Band Reduction

b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

[Sequence of figures: starting from a band of width b+1, each sweep applies an orthogonal
transformation Qi from the left and QiT from the right to annihilate a parallelogram of
c columns and d diagonals; the resulting (d+c) x (d+c) fill-in ("bulge") is chased down
the band by further transformations Q1, Q2, …, Q5, in steps numbered 1 through 6.]

Only need to zero out the leading parallelogram of each trapezoid!

Conventional vs CA - SBR

Conventional:            Touch all data 4 times
Communication-Avoiding:  Touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of
parallelograms, #bulges chased at one time, how far to chase each bulge.
Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).
Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:        #words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
   vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
    if EnoughMemory and P ≥ 7 then
        BFS step
    else
        DFS step
    end if

In practice, how best to interleave BFS and DFS steps is a "tuning parameter"

71

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Akx]

• Replace k iterations of y = A·x with [Ax, A²x, …, Akx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure: rows x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32, showing the dependencies
between entries of successive vectors]

• Sequential Algorithm: process the columns left to right in chunks (Steps 1, 2, 3, 4),
  computing all k vectors for one chunk before moving on, so the matrix and vectors are
  read from slow memory only once.

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of columns
  • Each processor communicates once with its neighbors
  • Each processor works on an (overlapping) trapezoid of the dependency diagram

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
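A NumPy sketch of the parallel matrix-powers idea for a tridiagonal A: a "processor" that owns a block of indices grabs k ghost entries of x on each side once, then computes its pieces of A·x, …, A^k·x with no further communication, redundantly recomputing values in the shrinking overlap (the trapezoid). The sizes n, k, and the block boundaries are illustrative choices.

```python
# Matrix powers kernel with one exchange of k ghost cells per block.
import numpy as np

n, k = 32, 3
A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
x = np.random.rand(n)

def local_powers(lo, hi):
    """Entries lo..hi-1 of A^j x for j=1..k, using only a local window of x and A."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # the "trapezoid" of needed entries
    v = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        v = A[glo:ghi, glo:ghi] @ v             # local (partly redundant) SpMV
        out.append(v[lo - glo: hi - glo].copy())
    return out

# Check against the straightforward global computation.
ref, v = [], x.copy()
for _ in range(k):
    v = A @ v
    ref.append(v)
pieces = [local_powers(lo, lo + 8) for lo in range(0, n, 8)]   # 4 "processors"
for j in range(k):
    assert np.allclose(np.concatenate([p[j] for p in pieces]), ref[j])
```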

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, Akb} minimizing || Ax-b ||2

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)              … SpMV
        MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
        update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A2v, …, Akv ]
    [Q,R] = TSQR(W)                 … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!       88

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, Akx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge

89
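A small NumPy experiment illustrating the "precision lost" remark: columns of the monomial Krylov basis W = [v, Av, …, A^k v] line up with the dominant eigenvector, so cond(W) explodes with k, while a shifted (Newton-like) basis usually stays much better conditioned. The test matrix, shift values, and k values are illustrative assumptions, not the tuned Newton/Chebyshev bases used in practice.

```python
# Conditioning of monomial vs shifted Krylov bases.
import numpy as np

n = 200
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1, 10, n)) + 0.01 * rng.standard_normal((n, n))
v = rng.standard_normal(n)

def monomial_basis(A, v, k):
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])
    return np.column_stack(W)

def newton_basis(A, v, k, shifts):
    W = [v]
    for j in range(k):
        W.append(A @ W[-1] - shifts[j] * W[-1])    # apply (A - θ_j I) to the last vector
    return np.column_stack(W)

shifts = np.linspace(1, 10, 20)                    # rough eigenvalue estimates
for k in (5, 10, 20):
    print(k, np.linalg.cond(monomial_basis(A, v, k)),
             np.linalg.cond(newton_basis(A, v, k, shifts)))
```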

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

                    Naive     Monomial                                Newton         Chebyshev
Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

With Residual Replacement (RR) à la Van der Vorst and Ye

Tuning space for Krylov Methods

Nonzero entries \ Indices      Explicit (O(nnz))      Implicit (o(nnz))
Explicit (O(nnz))              CSR and variations     Vision, climate, AMR, …
Implicit (o(nnz))              Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A2x, …, Akx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
  • Number of columns in x
  • [x, Ax, A2x, …, Akx ] and [y, ATy, (AT)2y, …, (AT)ky ], or [y, ATAy, (ATA)2y, …, (ATA)ky ]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A2x, …, Akx ] and {TSQR(W) or W·TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
            i = i1+i2, j = j1+j2, k = k1+k2
            C(i,j) += A(i,k)*B(k,j)          … a b x b matmul per block triple

• Thm: Picking b = M^(1/2) attains the lower bound: #words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
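A runnable NumPy version of the blocked loop above; the block size b plays the role of M^(1/2), since each b x b block of A, B, and C is meant to fit in fast memory together. The values of n and b are illustrative.

```python
# Blocked matmul: three nested block loops, one b x b multiply per block triple.
import numpy as np

def blocked_matmul(A, B, b):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # the three blocks touched here are the ones held in fast memory
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

n, b = 64, 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)
```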

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        Δ = [ 1  0  1 ]   A
            [ 0  1  1 ]   B
            [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]T: max 1Tx  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]T, 1Tx = 3/2 = sHBL
• Thm: #words_moved = Ω(n³/M^(sHBL-1)) = Ω(n³/M^(1/2)),
  attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
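A small sketch (assuming SciPy is available) that solves the LP above with scipy.optimize.linprog and recovers x = [1/2, 1/2, 1/2], i.e. sHBL = 3/2. linprog minimizes, so the objective is negated to maximize 1ᵀx.

```python
# Solve the matmul HBL-LP numerically.
import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
print("x =", res.x, " sHBL =", res.x.sum())   # x = [0.5 0.5 0.5], sHBL = 1.5
```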

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
        Δ = [ 1  0 ]   F
            [ 1  0 ]   P(i)
            [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]T: max 1Tx  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1Tx = 2 = sHBL
• Thm: #words_moved = Ω(n²/M^(sHBL-1)) = Ω(n²/M¹),
  attained by block sizes M^xi, M^xj = M¹, M¹

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

[Plot: up to 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
      A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
      A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
        Δ = [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
            [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]T: max 1Tx  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1Tx = 15/7 = sHBL
• Thm: #words_moved = Ω(n⁶/M^(sHBL-1)) = Ω(n⁶/M^(8/7)),
  attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z³
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
        D(something else) = func(something else)
        …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φC(i1,i2,…,ik) = (i1+2*i3-i7)
       φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
       …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its
  images φC(S), φA(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize
  sHBL = Σj sj subject to the HBL-LP. Then
    #words_moved = Ω( #iterations / M^(sHBL-1) )
  – The proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of
  |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q
    (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest
    (e.g. all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. sHBL too small),
      but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get sHBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1Ts  s.t.  sTΔ ≥ 1T
    Dual-HBL-LP: maximize 1Tx  s.t.  Δx ≤ 1
  Then, for the sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [ 1/2, 1/2, 1/2 ]T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework!

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 27: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

27

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Comes within factor of log P of lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layoutndash Used in practice in PBLAS = Parallel BLAS

bull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions

layouts and Cannon may use more memory

28

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

kk

B(kj)

C(ij)

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend “perfect scaling” results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (1/2)
  • Why avoid communication (2/3)
  • Why Minimize Communication (2/2)
  • Why Minimize Communication (2/2) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all “direct” linear algebra
  • Lower bound for all “direct” linear algebra (2)
  • Lower bound for all “direct” linear algebra (3)
  • Can we attain these lower bounds
  • Naïve Matrix Multiply
  • Naïve Matrix Multiply (2)
  • Naïve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (1/2)
  • Recursive Matrix Multiplication (RMM) (2/2)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
  • SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
  • SUMMA Communication Costs
  • Can we do better
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BGP 16K nodes 64K cores
  • 2.5D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling – in Time and Energy (1/2)
  • Perfect Strong Scaling in Time of 2.5D Matmul on BGP n=64K
  • Perfect Strong Scaling – in Time and Energy (2/2)
  • Extending to rest of Direct Linear Algebra Naïve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR …
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable? (1/2)
  • Is this bound attainable? (2/2)
  • Ongoing Work
  • For more details
  • Summary

Page 29: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

29

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

kk

B(kj)

C(ij)

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

Brow

Acol

30

SUMMA Communication Costs

For k=0 to nb-1 for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree) hellip words = log P12 bnP12 messages = log P12

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree) hellip same words and messages Receive A(ik) into Acol Receive B(kj) into Brow C_myproc = C_myproc + Acol Brow

deg Total words = log P n2 P12

deg Within factor of log P of lower bounddeg (more complicated implementation removes log P factor)

deg Total messages = log P nbdeg Choose b close to maximum nP12 to approach lower bound P12

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax-b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                … SpMV
    MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, Aᵏv ]      … matrix powers kernel
  [Q,R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
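To make the reorganization concrete, here is a minimal NumPy sketch of one CA-GMRES restart cycle under simplifying assumptions (dense A, monomial basis, plain np.linalg.qr standing in for TSQR); names are illustrative, not the tutorial's code:

    import numpy as np

    def ca_gmres_cycle(A, b, k):
        # One restart cycle in the communication-avoiding style:
        # (1) matrix powers kernel builds the whole Krylov basis up front,
        # (2) a single QR (stand-in for TSQR) replaces k rounds of MGS.
        # Monomial basis only: as noted below, it loses precision for larger k;
        # a Newton or Chebyshev basis fixes that.
        n = len(b)
        v = b / np.linalg.norm(b)
        W = np.empty((n, k + 1))
        W[:, 0] = v
        for j in range(k):                 # the k SpMVs, no dot products in between
            W[:, j + 1] = A @ W[:, j]
        Q, R = np.linalg.qr(W)             # one reduction instead of k
        # From A @ W[:, :k] = W[:, 1:] and W = Q R we get A @ Q[:, :k] = Q @ H with
        H = R[:, 1:] @ np.linalg.inv(R[:k, :k])
        # GMRES least-squares problem in the Q basis (b lies in range(Q))
        y, *_ = np.linalg.lstsq(H, Q.T @ b, rcond=None)
        return Q[:, :k] @ y                # approximate solution (x0 = 0)

    # toy usage on a diagonally dominant random matrix (illustrative only)
    rng = np.random.default_rng(1)
    A = np.eye(200) * 4 + rng.standard_normal((200, 200)) / np.sqrt(200)
    b = rng.standard_normal(200)
    x = ca_gmres_cycle(A, b, 8)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))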

"Monomial" basis [Ax, …, Aᵏx] fails to converge
Different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye
(iterations at which the residual is replaced, and their count):

                    Naive    Monomial                          Newton        Chebyshev
  Replacement Its   74 (1)   [7,15,24,31,…,92,97,103] (17)     [67,98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication (by how the nonzero entries and the indices are represented):

                          Nonzero entries
  Indices                 Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))       CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian         Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning (e.g., Hierarchically Semiseparable Matrices)
  – Autotuning and synthesis (pOSKI for SpMV – available at bebop.cs.berkeley.edu)
  – Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (b × b matmul in the inner loops):
    for i1 = 1:b:n
      for j1 = 1:b:n
        for k1 = 1:b:n
          for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
            i = i1+i2, j = j1+j2, k = k1+k2
            C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
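A runnable NumPy version of the blocked code (added for illustration; in practice b is chosen so the three b×b blocks fit in fast memory, i.e. 3b² ≤ M, which the slide abbreviates as b = M^(1/2)):

    import numpy as np

    def blocked_matmul(A, B, b):
        # C = A @ B by b-by-b blocks: the inner update touches three b x b
        # blocks, so with 3*b^2 <= M they stay in fast memory and the total
        # traffic is O(n^3 / b) = O(n^3 / M^(1/2)) words instead of O(n^3).
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    # quick check against the library product
    n = 96
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, 32), A @ B)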

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
          [ 1  0  1 ]   A
      Δ = [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
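If you want to check the LP numerically, the following sketch uses scipy.optimize.linprog (SciPy is an assumption of this sketch, not something the tutorial relies on); swapping in the Δ of the N-body or "random code" examples below reproduces their exponents the same way:

    import numpy as np
    from scipy.optimize import linprog

    # Dual HBL LP for matmul: maximize 1^T x subject to Delta x <= 1, x >= 0.
    # linprog minimizes, so negate the objective.
    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, -res.fun)          # -> [0.5 0.5 0.5] and s = 1.5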

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
          [ 1  0 ]   F
      Δ = [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1]ᵀ, 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
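As a hedged illustration of the tiling this exponent suggests (the tile parameter, names, and the toy force are made up for the example), the loop below keeps one i-tile and one j-tile resident and reuses them across b² interactions:

    import numpy as np

    def tiled_nbody(P, b, force):
        # Direct n-body with i- and j-tiles of b particles (b = Theta(M)):
        # each pair of tiles is loaded once, so words moved ~ 2*(n/b)^2*b
        # = 2*n^2/b = O(n^2/M), matching the Omega(n^2/M) bound above.
        n = len(P)
        F = np.zeros_like(P)
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                Pi, Pj = P[i0:i0+b], P[j0:j0+b]      # the two resident tiles
                for di, pi in enumerate(Pi):
                    for pj in Pj:
                        F[i0 + di] += force(pi, pj)
        return F

    # toy usage: 1-D positions with an illustrative inverse-square interaction
    P = np.sort(np.random.rand(256))
    F = tiled_nbody(P, 64, lambda a, c: 0.0 if a == c else (c - a) / abs(c - a) ** 3)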

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
          [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
      Δ = [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z³
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Zᵏ
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
       …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H ≤ Zᵏ:  rank(H) ≤ Σⱼ sⱼ · rank(φⱼ(H))
• Thm: Given a program with array refs given by the φⱼ, choose the sⱼ to minimize s_HBL = Σⱼ sⱼ subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:

• Given S ⊆ Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Πⱼ |φⱼ(S)|^sⱼ
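To see the machinery on the matmul example, here is the instance written out (a worked illustration added here; the general statements are exactly those above):

    % Matmul: S \subseteq \mathbb{Z}^3, \phi_A(i,j,k)=(i,k), \phi_B(i,j,k)=(k,j), \phi_C(i,j,k)=(i,j).
    % The HBL-LP holds with s_A = s_B = s_C = 1/2; e.g. for the subgroup H = \{(i,0,0)\}:
    %   \mathrm{rank}(H) = 1 \le \tfrac12\big(\mathrm{rank}\,\phi_A(H)+\mathrm{rank}\,\phi_B(H)+\mathrm{rank}\,\phi_C(H)\big) = \tfrac12(1+0+1).
    % The Christ/Tao/Carbery/Bennett inequality then gives the Loomis--Whitney bound
    \[
      |S| \le |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2},
    \]
    % so a segment of the computation touching at most O(M) entries of each of
    % A, B, C performs at most O(M^{3/2}) iterations, and therefore
    \[
      \mathrm{words\_moved} = \Omega\!\left(\frac{n^3}{M^{\,s_{\mathrm{HBL}}-1}}\right)
                            = \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
    \]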

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φⱼ = subset of indices)
  – Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φⱼ = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP:  maximize 1ᵀx  s.t.  Δx ≤ 1
  Then for the sequential algorithm, tile index iⱼ by M^xⱼ
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
• Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 31: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

31

Can we do better

bull Lower bound assumed 1 copy of data M = O(n2P) per procbull What if matrix small enough to fit cgt1 copies so M = cn2P

ndash words_moved = Ω( flops M12 ) = Ω( n2 ( c12 P12 ))ndash messages = Ω( flops M32 ) = Ω( P12 c32)

bull Can we attain new lower boundndash Special case ldquo3D Matmulrdquo c = P13

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3 (the figure plots entries 1, 2, 3, 4, …, 32 of x, A·x, A^2·x, A^3·x)
• Works for any "well-partitioned" A
• Sequential algorithm: proceed in Steps 1, 2, 3, 4, one block of columns at a time, computing all k vectors for that block before moving on
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of rows; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the dependencies
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
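As a concrete illustration of the tridiagonal example above, here is a small NumPy sketch (my own code, not the tutorial's); the function names, the block size, and the ghost-region bookkeeping are assumptions, and a real implementation handles general well-partitioned sparse matrices.

    # Sketch of the matrix powers kernel [A·x, A^2·x, ..., A^k·x] for a
    # tridiagonal A, processed one block of columns at a time so that each
    # block of A and x is moved from slow to fast memory only once.
    import numpy as np

    def tridiag_matvec(d, l, u, v):
        """y = A·v where A has main diagonal d, subdiagonal l, superdiagonal u."""
        y = d * v
        y[1:] += l * v[:-1]
        y[:-1] += u * v[1:]
        return y

    def matrix_powers(d, l, u, x, k, block):
        """Return V with V[j] = A^(j+1)·x, using overlapping 'trapezoids'."""
        n = len(x)
        V = np.zeros((k, n))
        for lo in range(0, n, block):
            hi = min(lo + block, n)
            glo, ghi = max(lo - k, 0), min(hi + k, n)   # ghost region of width k
            v = x[glo:ghi].copy()
            dd, ll, uu = d[glo:ghi], l[glo:ghi - 1], u[glo:ghi - 1]
            for j in range(k):                          # redundant flops near edges
                v = tridiag_matvec(dd, ll, uu, v)
                V[j, lo:hi] = v[lo - glo:hi - glo]      # owned entries are exact
        return V

    # Check against k separate SpMVs (n=32, k=3 as in the example)
    rng = np.random.default_rng(0)
    n, k = 32, 3
    d, l, u, x = rng.random(n), rng.random(n - 1), rng.random(n - 1), rng.random(n)
    V = matrix_powers(d, l, u, x, k, block=8)
    y = x.copy()
    for _ in range(k):
        y = tridiag_matvec(d, l, u, y)
    assert np.allclose(V[-1], y)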

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^kb} minimizing ||Ax - b||_2

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)                 … SpMV
        MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
        update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)                    … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
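The TSQR call above can be sketched as a reduction tree over block rows: factor each block independently, stack the small R factors, and factor the stack. The NumPy version below is my sequential sketch of that idea (not the tutorial's code); a parallel TSQR would do the local QRs on separate processors and combine along a tree.

    # Sketch of TSQR ("Tall Skinny QR"): QR of each block row, then QR of
    # the stacked R factors.  Returns only R; the implicit Q is the product
    # of the per-block Q's with the second-level Q.
    import numpy as np

    def tsqr_r(W, nblocks):
        """R factor of a tall-skinny W via a one-level reduction tree."""
        blocks = np.array_split(W, nblocks, axis=0)
        local_Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # independent QRs
        return np.linalg.qr(np.vstack(local_Rs), mode="r")       # combine the R's

    rng = np.random.default_rng(1)
    W = rng.standard_normal((10_000, 10))
    R = tsqr_r(W, nblocks=4)
    R_ref = np.linalg.qr(W, mode="r")
    # Same R as ordinary QR, up to signs of the rows
    assert np.allclose(np.abs(R), np.abs(R_ref))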

• Oops – W from power method, precision lost!
• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k

• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge


Speedups of GMRES on 8-core Intel Clovertown [MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye

Replacement iterations (count), by basis:
    Naive:     74 (1)
    Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
    Newton:    [67, 98] (2)
    Chebyshev: 68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication (rows: nonzero entries; columns: indices):

                                Indices explicit (O(nnz))   Indices implicit (o(nnz))
    Entries explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
    Entries implicit (o(nnz))   Graph Laplacian              Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: with stencils (all implicit), all communication is for vectors

• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
        C(i,j) += A(i,k) * B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
            i = i1+i2; j = j1+j2; k = k1+k2
            C(i,j) += A(i,k) * B(k,j)        … innermost loops do a b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound words_moved = Ω(n^3 / M^(1/2))
• Where does the 1/2 come from?
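A runnable NumPy version of the blocked loop nest above (a sketch; in practice b would be tuned so a few b x b blocks fit in M words of fast memory):

    # Blocked (tiled) matmul: each (i1, j1, k1) iteration multiplies one b x b
    # block of A by one b x b block of B and accumulates into a b x b block of C.
    # Picking b on the order of M^(1/2) attains words_moved = Omega(n^3 / M^(1/2)).
    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]                      # assume square n x n, b divides n
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
    assert np.allclose(blocked_matmul(A, B, b=16), A @ B)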

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        A  [  1  0  1 ]
    Δ = B  [  0  1  1 ]
        C  [  1  1  0 ]

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)),
  attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
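This little LP can be checked mechanically; below is a sketch using SciPy (variable names are mine). scipy.optimize.linprog minimizes, so the objective is negated.

    # Solve the LP from the slide: maximize xi + xj + xk subject to Delta·x <= 1
    # (one row per array A, B, C).
    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
    print(res.x, -res.fun)          # -> [0.5 0.5 0.5]  1.5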

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n: F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

                 i  j
        F     [  1  0 ]
    Δ = P(i)  [  1  0 ]
        P(j)  [  0  1 ]

• Solve LP for x = [xi, xj]^T: max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1),
  attained by block sizes M^xi, M^xj = M^1, M^1
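The corresponding blocking of the N-body loop nest, sketched in NumPy with both loops tiled (my illustration; the 1-D "particles" and the softened inverse-square force law are toy assumptions, chosen only to make the sketch runnable):

    # Direct N-body with both loops tiled: each (i-block, j-block) pair is
    # processed with the two blocks of particles held in fast memory,
    # matching the M^1 x M^1 block sizes from the LP above.
    import numpy as np

    def forces_tiled(P, block):
        n = len(P)
        F = np.zeros(n)
        for i0 in range(0, n, block):
            Pi = P[i0:i0+block]
            for j0 in range(0, n, block):
                Pj = P[j0:j0+block]
                d = Pi[:, None] - Pj[None, :]                  # pairwise separations
                F[i0:i0+block] += np.sum(d / (np.abs(d)**3 + 1e-6), axis=1)
        return F

    P = np.random.default_rng(0).random(1024)
    F = forces_tiled(P, block=128)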

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
      A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
      A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

                  i1 i2 i3 i4 i5 i6
        A1     [  1  0  1  0  0  1 ]
        A2     [  1  1  0  1  0  0 ]
    Δ = A3     [  0  1  1  0  1  0 ]
        A3,A4  [  0  0  1  1  0  1 ]
        A5     [  0  1  0  0  0  1 ]
        A6     [  1  0  0  1  1  0 ]

• Solve LP for x = [x1, …, x6]^T: max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7)),
  attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n
        C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3;
  access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
        D(something else) = func(something else)
        …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k;
  access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
    φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
    …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σ_j sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Π_j |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T·s  s.t.  s^T·Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T·x  s.t.  Δ·x ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)




25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:
  words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^(lg 7)) matmul:
  words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:
  words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter".
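To make the BFS/DFS choice concrete, here is a toy recursive sketch (my own illustration, not the tuned CAPS code: the memory test, the processor bookkeeping, and the base-case size are simplified assumptions, and the "parallel" BFS branch is only conceptual):

import numpy as np

def caps_strassen(A, B, P, M, n_min=64):
    # Toy CAPS control flow: at each level of Strassen's recursion take a
    # BFS step (all 7 products conceptually in parallel, each on P//7
    # processors) when memory and processors allow, else a DFS step
    # (7 products one after another, each on all P processors).
    # Assumes n is n_min times a power of 2.
    n = A.shape[0]
    if n <= n_min:
        return A @ B                              # base case: classical matmul
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    subproblems = [                               # Strassen's 7 products
        (A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
        (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
        (A12 - A22, B21 + B22),
    ]
    enough_memory = 7 * h * h / P <= M            # crude stand-in for the real test
    if enough_memory and P >= 7:
        # BFS step: each product would run on its own group of P//7 processors
        results = [caps_strassen(X, Y, max(P // 7, 1), M, n_min) for X, Y in subproblems]
    else:
        # DFS step: products run sequentially, each using all P processors
        results = [caps_strassen(X, Y, P, M, n_min) for X, Y in subproblems]
    M1, M2, M3, M4, M5, M6, M7 = results
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(caps_strassen(A, B, P=49, M=2e4), A @ B)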


Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

(A sequence of animation frames repeats the same picture: the vectors x, A·x, A²·x, A³·x plotted against column indices 1, 2, 3, 4, …, 32.)

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A
• Sequential algorithm: the frames fill in the region in Steps 1, 2, 3, 4, one block of columns at a time
• Parallel algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
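To make the kernel concrete, here is a small sketch of the parallel idea for a tridiagonal A (my own illustration, not code from the tutorial; the helper names and the constant-coefficient tridiagonal matvec are assumptions chosen for brevity). Each block of x is extended by k ghost entries per side; after that one exchange, [Ax, A²x, …, Aᵏx] restricted to the block can be computed with no further communication, at the price of some redundant work near the block edges:

import numpy as np

def tridiag_matvec(v, diag=2.0, off=-1.0):
    # y = A v for a tridiagonal A with constant main diagonal and off-diagonals.
    y = diag * v
    y[:-1] += off * v[1:]
    y[1:] += off * v[:-1]
    return y

def local_matrix_powers(x, lo, hi, k):
    # Compute rows lo..hi-1 of [A x, A^2 x, ..., A^k x] after one ghost exchange:
    # fetch k extra entries on each side of the owned block, then apply the
    # matvec k times locally.  Entries that go wrong near the truncated edges
    # stay inside the ghost region, so the owned rows remain exact.
    n = len(x)
    glo, ghi = max(0, lo - k), min(n, hi + k)
    v = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        v = tridiag_matvec(v)
        out.append(v[lo - glo:hi - glo].copy())
    return out

# Check the four-processor example (n = 32, k = 3) against the global computation.
n, k = 32, 3
x = np.random.rand(n)
v, full = x.copy(), []
for _ in range(k):
    v = tridiag_matvec(v)
    full.append(v.copy())
for lo, hi in [(0, 8), (8, 16), (16, 24), (24, 32)]:
    local = local_matrix_powers(x, lo, hi, k)
    assert all(np.allclose(local[j], full[j][lo:hi]) for j in range(k))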

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, Aᵏv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
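The flavor of the reorganization, in a toy sketch (heavily simplified and entirely my own illustration: NumPy's dense qr stands in for TSQR, the Newton-like shifts are arbitrary, and the least-squares solve below is not how CA-GMRES actually builds H from R). The structural point is that k SpMVs with no dot products in between, followed by a single QR, replace k interleaved SpMV + MGS steps:

import numpy as np

def krylov_basis(A, v, k, shifts=None):
    # W = [v, p1(A)v, ..., pk(A)v]; shifts=None gives the monomial basis,
    # otherwise a Newton-like basis (A - shift_j I) v_j for better conditioning.
    W = [v / np.linalg.norm(v)]
    for j in range(k):
        w = A @ W[-1]
        if shifts is not None:
            w = w - shifts[j] * W[-1]
        W.append(w)
    return np.column_stack(W)

rng = np.random.default_rng(0)
n, k = 200, 8
A = np.diag(np.linspace(1.0, 10.0, n)) + 0.01 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

W = krylov_basis(A, b, k, shifts=np.linspace(1.0, 10.0, k))  # k SpMVs, nothing in between
Q, _ = np.linalg.qr(W)                          # stand-in for TSQR: one reduction, not k MGS sweeps
y, *_ = np.linalg.lstsq(A @ Q, b, rcond=None)   # minimize ||A(Qy) - b|| over the Krylov subspace
x = Q @ y
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))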

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels

CA-BiCGStab
With Residual Replacement (RR), à la Van der Vorst and Ye

Basis       Replacement iterations (count)
Naive       74 (1)
Monomial    [7, 15, 24, 31, …, 92, 97, 103] (17)
Newton      [67, 98] (2)
Chebyshev   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication (indices × nonzero entries):

  Indices \ Nonzero entries   Explicit (O(nnz))     Implicit (o(nnz))
  Explicit (O(nnz))           CSR and variations    Vision, climate, AMR, …
  Implicit (o(nnz))           Graph Laplacian       Stencils

  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2,  j = j1+j2,  k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
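A runnable version of the blocked loop nest above (a plain NumPy sketch for clarity; in practice the inner b×b multiply would be a tuned kernel, and b would be chosen so that three b×b blocks fit in fast memory, roughly b ≈ (M/3)^(1/2)):

import numpy as np

def blocked_matmul(A, B, b):
    # C = A @ B computed block by block: each (i1, j1, k1) iteration touches
    # only three b x b blocks, so b ~ sqrt(M/3) keeps the working set in fast
    # memory and attains words_moved = O(n^3 / sqrt(M)).
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B, 64), A @ B)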

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        A     1  0  1
  Δ =   B     0  1  1
        C     1  1  0

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
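For readers who want to check the LP result, a short sketch using SciPy's linprog (maximizing 1ᵀx by minimizing −1ᵀx; the solver call and bounds are my own choices, not part of the slides):

import numpy as np
from scipy.optimize import linprog

# Rows of Delta: A(i,k), B(k,j), C(i,j); columns: (i, j, k).
Delta = np.array([[1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 0]])

# maximize 1^T x  subject to  Delta @ x <= 1, x >= 0
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
print(res.x)      # -> [0.5, 0.5, 0.5]
print(-res.fun)   # -> 1.5, i.e. s = 3/2, hence words_moved = Omega(n^3 / M^(1/2))

Swapping in the Δ from the n-body or the "random code" example below reproduces s = 2 and s = 15/7 the same way.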

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

                i  j
        F       1  0
  Δ =   P(i)    1  0
        P(j)    0  1

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
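The schedule that attains this bound just tiles both loops: load a block of targets, stream blocks of sources past it, and do block² interactions per block of data moved. A small sketch (my own illustration; the 1D signed inverse-square "force" is an arbitrary stand-in for force(P(i), P(j))):

import numpy as np

def blocked_nbody(P, block):
    # F(i) += force(P(i), P(j)), one (target block, source block) pair at a
    # time: each pair does block^2 interactions per ~2*block words loaded,
    # which attains the Omega(n^2 / M) bound when block ~ M.
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, block):
        Pi = P[i0:i0 + block]                 # targets kept in fast memory
        for j0 in range(0, n, block):
            Pj = P[j0:j0 + block]             # sources streamed through
            d = Pi[:, None] - Pj[None, :]
            with np.errstate(divide="ignore", invalid="ignore"):
                f = np.where(d != 0, np.sign(d) / d**2, 0.0)
            F[i0:i0 + block] += f.sum(axis=1)
    return F

P = np.random.rand(1024)
assert np.allclose(blocked_nbody(P, 128), blocked_nbody(P, 1024))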

N-Body Speedups on IBM-BGP (Intrepid): 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

                 i1 i2 i3 i4 i5 i6
        A1        1  0  1  0  0  1
        A2        1  1  0  1  0  0
  Δ =   A3        0  1  1  0  1  0
        A3,A4     0  0  1  1  0  1
        A5        0  1  0  0  0  1
        A6        1  0  0  1  1  0

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z³
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Zᵏ
    Access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1,…,sm:
    for all subgroups H < Zᵏ,  rank(H) ≤ Σⱼ sⱼ·rank(φⱼ(H))
• Thm: Given a program with array refs given by the φⱼ, choose the sⱼ to minimize s_HBL = Σⱼ sⱼ subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm,
    |S| ≤ Πⱼ |φⱼ(S)|^(sⱼ)
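As a worked instance (my own illustration of how the theorem specializes; it is just the classical Loomis–Whitney inequality), take matmul with k = 3, φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), φ_C(i,j,k) = (i,j), and s_A = s_B = s_C = 1/2:

\[
  |S| \;\le\; |\varphi_A(S)|^{1/2}\,|\varphi_B(S)|^{1/2}\,|\varphi_C(S)|^{1/2}.
\]

In a segment of execution that moves O(M) words, each image has size O(M), so the segment performs at most O(M^{3/2}) iterations; splitting the n³ iterations into such segments gives

\[
  \text{words\_moved} \;=\; \Omega\!\Bigl(\frac{n^3}{M^{3/2}}\cdot M\Bigr)
                      \;=\; \Omega\!\Bigl(\frac{n^3}{M^{1/2}}\Bigr),
\]

i.e. s_HBL = 3/2 recovers the classical matmul bound.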

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP: maximize 1ᵀx  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)




25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Successive Band Reduction (SBR)

[Figure: a sequence of SBR sweeps on the band. Annotations: b = bandwidth, c = #columns, d = #diagonals, constraint c+d ≤ b. Orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T successively eliminate blocks 1, 2, 3, 4, 5, 6, chasing the resulting fill (bulges) down the band; the annotated block dimensions are b+1, d+1, c, and d+c. Only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA - SBR

Conventional | Communication-Avoiding
Touch all data 4 times | Touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. Right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:      words_moved = Ω( M (n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:   words_moved = Ω( M (n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:  words_moved = Ω( M (n/M^(1/2))^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how best to interleave BFS and DFS is a "tuning parameter". A sequential sketch of the underlying Strassen recursion follows.
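For reference, below is a plain sequential version of the Strassen recursion that CAPS reorganizes (a sketch: numpy assumed, power-of-two matrix sizes only; this is not the parallel CAPS code, which chooses at each level whether the seven recursive products run as a BFS step or a DFS step):

import numpy as np

def strassen(A, B, cutoff=64):
    # Classical Strassen recursion (n a power of two). CAPS performs these same
    # seven recursive products, distributing them across processor groups (BFS)
    # or running them one after another on all processors (DFS).
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:k, :k] = M1 + M4 - M5 + M7
    C[:k, k:] = M3 + M5
    C[k:, :k] = M2 + M4
    C[k:, k:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))   # True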

71

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, showing which entries each computed entry depends on.]

• Sequential algorithm: proceed in Steps 1, 2, 3, 4 – each step reads one block of x and computes the corresponding entries of all of Ax, A^2x, A^3x before moving on.
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 – each processor communicates once with its neighbors, then works on an (overlapping) trapezoid; see the sketch after this list.
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
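Below is a minimal sequential sketch of the kernel for the tridiagonal example above (numpy assumed; the function name matrix_powers_block and the block boundaries are illustrative choices, not code from the tutorial). Each block fetches its k "ghost" entries once and then computes its piece of all k vectors locally, which is the trapezoid idea in the figure:

import numpy as np

def matrix_powers_block(x, lo, hi, k, diag=2.0, off=-1.0):
    # Entries lo..hi-1 of A@x, A^2@x, ..., A^k@x for A = tridiag(off, diag, off).
    # The block fetches k ghost entries on each side once (one message per
    # neighbor) instead of exchanging halos before every SpMV.
    n = len(x)
    g_lo, g_hi = max(lo - k, 0), min(hi + k, n)   # ghost region
    v = x[g_lo:g_hi].copy()
    out = []
    for _ in range(k):
        w = diag * v
        w[:-1] += off * v[1:]
        w[1:] += off * v[:-1]
        v = w                                     # valid except within k of a ghost edge
        out.append(v[lo - g_lo:hi - g_lo].copy())
    return out

# Check the 4-processor, n=32, k=3 example against k plain SpMVs.
n, k = 32, 3
A = (np.diag(2.0 * np.ones(n)) + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))
x = np.random.rand(n)
ref, y = [], x
for _ in range(k):
    y = A @ y
    ref.append(y)
for lo, hi in [(0, 8), (8, 16), (16, 24), (24, 32)]:
    local = matrix_powers_block(x, lo, hi, k)
    assert all(np.allclose(local[j], ref[j][lo:hi]) for j in range(k))
print("blocks of [Ax, A^2x, A^3x] match the straightforward computation")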

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
  (a small TSQR sketch follows)
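The TSQR reduction used by CA-GMRES can be sketched in a few lines (a hedged illustration: numpy assumed, one fixed reduction level, function name tsqr_r is ours; the real implementation picks the reduction tree to match the machine and keeps Q in factored form):

import numpy as np

def tsqr_r(W, nblocks=4):
    # One reduction level of TSQR: local QR of each row block, then QR of the
    # stacked R factors. Returns the R factor of W; Q stays implicitly represented.
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode="r") for Wi in blocks]
    return np.linalg.qr(np.vstack(Rs), mode="r")

W = np.random.rand(4096, 8)                      # tall and skinny
R1, R2 = tsqr_r(W), np.linalg.qr(W, mode="r")
print(np.allclose(np.abs(R1), np.abs(R2)))       # R agrees up to row signs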

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR) à la Van der Vorst and Ye:

Basis            | Naive  | Monomial                             | Newton       | Chebyshev
Replacement Its. | 74 (1) | [7, 15, 24, 31, …, 92, 97, 103] (17) | [67, 98] (2) | 68 (1)

Tuning space for Krylov Methods

                               Indices
Nonzero entries      Explicit (O(nnz))     Implicit (o(nnz))
Explicit (O(nnz))    CSR and variations    Vision, climate, AMR, …
Implicit (o(nnz))    Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from? (a runnable blocked version follows)
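A runnable version of the blocked loop nest, as a sketch (numpy assumed; b should be chosen near (M/3)^(1/2) so that one tile of each of A, B, C fits in fast memory):

import numpy as np

def blocked_matmul(A, B, b):
    # C = A @ B using b x b tiles; with b ~ (M/3)^(1/2) one tile of each of A, B, C
    # fits in fast memory, attaining words_moved = Theta(n^3 / M^(1/2)).
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i + b, j:j + b] += A[i:i + b, k:k + b] @ B[k:k + b, j:j + b]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
print(np.allclose(blocked_matmul(A, B, 32), A @ B))   # True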

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
           1  0  1    A
      Δ =  0  1  1    B
           1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
  (a small LP-solving sketch follows)
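The LP itself is tiny and can be solved mechanically; here is a sketch using scipy.optimize.linprog (scipy is an assumed tool, not part of the slides). With the matmul Δ above it returns x = [1/2, 1/2, 1/2] and s = 3/2; the other Δ matrices in this section can be substituted in the same way:

import numpy as np
from scipy.optimize import linprog

# Rows of Delta = array subscripts, columns = loop indices (i, j, k) for matmul.
delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, 1))
x = res.x            # -> [0.5, 0.5, 0.5]
s = -res.fun         # -> 1.5, so words_moved = Omega(n^3 / M^(s-1)) = Omega(n^3 / M^(1/2))
print(np.round(x, 3), round(s, 3))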

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
           1  0    F
      Δ =  1  0    P(i)
           0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
  (a blocked sketch follows)
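A sketch of the corresponding blocked n-body loop (numpy assumed; the 1-D inverse-square "force" is a toy stand-in for force(P(i), P(j)), not from the tutorial). Streaming blocks of b ≈ M particles at a time reads O(n^2/M) words, matching the bound:

import numpy as np

def blocked_forces(P, b):
    # F[i] = sum_j force(P[i], P[j]), processed in b x b blocks of (i, j) pairs.
    # With b on the order of M, each block of particles is read once per pass.
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, b):
        Pi = P[i0:i0 + b]
        for j0 in range(0, n, b):
            Pj = P[j0:j0 + b]
            d = Pj[None, :] - Pi[:, None]
            with np.errstate(divide="ignore", invalid="ignore"):
                f = np.where(d != 0.0, np.sign(d) / d**2, 0.0)   # skip self-pairs
            F[i0:i0 + b] += f.sum(axis=1)
    return F

P = np.random.rand(64)
F_direct = np.zeros_like(P)
for i in range(len(P)):
    for j in range(len(P)):
        d = P[j] - P[i]
        if d != 0.0:
            F_direct[i] += np.sign(d) / d**2
print(np.allclose(blocked_forces(P, b=16), F_direct))   # True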

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
            1  0  1  0  0  1     A1
            1  1  0  1  0  0     A2
      Δ =   0  1  1  0  1  0     A3
            0  0  1  1  0  1     A3, A4
            0  1  0  0  0  1     A5
            1  0  0  1  1  0     A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
       …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
  (a tiny numeric check of this inequality follows)
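As a sanity check of the inequality for the matmul projections (s = (1/2, 1/2, 1/2), i.e. the Loomis-Whitney case), one can count projections of a random set of index triples (numpy assumed; this only illustrates the inequality, it is not part of the proof):

import numpy as np

rng = np.random.default_rng(1)
S = {tuple(p) for p in rng.integers(0, 8, size=(200, 3))}   # random set of (i, j, k)
proj = lambda keep: {tuple(p[t] for t in keep) for p in S}
# Matmul projections: A ~ (i,k), B ~ (k,j), C ~ (i,j); exponents s_j = 1/2 each.
bound = (len(proj((0, 2))) * len(proj((2, 1))) * len(proj((0, 1)))) ** 0.5
print(len(S), "<=", round(bound, 1), ":", len(S) <= bound)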

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm: (bad news) Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm: (good news) Can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm: (good news) Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all "direct" linear algebra
  • Lower bound for all "direct" linear algebra (2)
  • Lower bound for all "direct" linear algebra (3)
  • Can we attain these lower bounds
  • Naïve Matrix Multiply
  • Naïve Matrix Multiply (2)
  • Naïve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
  • SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
  • SUMMA Communication Costs
  • Can we do better
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
  • Perfect Strong Scaling – in Time and Energy (1/2)
  • Perfect Strong Scaling in Time of 2.5D Matmul on BG/P, n=64K
  • Perfect Strong Scaling – in Time and Energy (2/2)
  • Extending to rest of Direct Linear Algebra: Naïve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable? (1/2)
  • Is this bound attainable? (2/2)
  • Ongoing Work
  • For more details
  • Summary
Page 36: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 37: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum energy E

needed to achieve itbull Given a maximum energy budget E what is the minimum runtime T

that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Successive Band Reduction

[Sequence of figures: starting from a band of width b+1, each step zeroes a
parallelogram of c columns and d+1 diagonals with an orthogonal transform Q1,
applied on both sides (Q1 … Q1T); this creates a bulge of size (d+c) x (d+c),
which transforms Q2, Q3, Q4, Q5, … chase down the band in steps 1, 2, 3, 4, 5, 6.
Only need to zero out the leading parallelogram of each trapezoid.]

b = bandwidth, c = #columns, d = #diagonals. Constraint: c + d <= b

Conventional vs CA - SBR

  Conventional:            Touch all data 4 times
  Communication-Avoiding:  Touch all data once

Many tuning parameters: number of "sweeps", #diagonals cleared per sweep,
sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
Right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

  Classical O(n^3) matmul:        words_moved = Ω( M·(n/M^(1/2))^3 / P )
  Strassen's O(n^lg7) matmul:     words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
  Strassen-like O(n^ω) matmul:    words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

vs.

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors;
  needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors;
  needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS)

  CAPS: If EnoughMemory and P >= 7
          then BFS step
          else DFS step
        end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"
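As a concrete reference point for what CAPS schedules, here is a minimal serial sketch of Strassen's recursion in Python/NumPy, added for illustration only; CAPS distributes these seven recursive multiplies across processors via the BFS/DFS choice above, and the cutoff value here is an arbitrary placeholder.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen's recursion: 7 half-size multiplies instead of 8.
        CAPS decides, at each recursion level, whether the 7 subproblems run
        in parallel on P/7 processors each (BFS step) or one after another on
        all P processors (DFS step)."""
        n = A.shape[0]
        if n <= cutoff or n % 2:          # small or odd: fall back to classical matmul
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C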

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen"
by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Figures show the dependency pattern of x, A·x, A^2·x, A^3·x over entries
  1, 2, 3, 4, …, 32]
• Works for any "well-partitioned" A
• Sequential algorithm: process one block of columns at a time (Steps 1–4),
  computing all k vectors for that block before moving on, so the matrix and
  vectors move between slow and fast memory only once
• Parallel algorithm: Proc 1, …, Proc 4 each own a block of entries; each
  processor communicates once with its neighbors, then works on an
  (overlapping) trapezoid of redundant entries

Same idea works for general sparse matrices:
  Simple block-row partitioning -> (hyper)graph partitioning
  Top-to-bottom processing      -> Traveling Salesman Problem
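A minimal sketch of the parallel idea for the tridiagonal example, written in Python/NumPy for illustration (not the tutorial's implementation): each "processor" is simulated as a block of indices that fetches its k ghost entries from each neighbor once, then computes its portion of x, Ax, …, A^k x with no further communication. It assumes the block count divides n evenly.

    import numpy as np

    def matrix_powers_tridiag(main, off, x, k, n_blocks):
        """Compute [x, Ax, ..., A^k x] for symmetric tridiagonal A (diagonal
        `main`, off-diagonal `off`), simulating the communication-avoiding
        parallel kernel: each block grabs k ghost entries from each neighbor
        up front, then performs k local SpMVs independently."""
        n = len(x)
        bs = n // n_blocks                       # block size (assume it divides n)
        V = np.zeros((k + 1, n))
        for b in range(n_blocks):
            lo, hi = b * bs, (b + 1) * bs
            glo, ghi = max(lo - k, 0), min(hi + k, n)   # owned block + ghost region
            local = [x[glo:ghi].copy()]                  # one "message" per neighbor
            for _ in range(k):                           # k local SpMVs, no communication
                v = local[-1]
                y = main[glo:ghi] * v
                y[:-1] += off[glo:ghi - 1] * v[1:]
                y[1:]  += off[glo:ghi - 1] * v[:-1]
                local.append(y)
            for j in range(k + 1):                       # keep only the owned entries
                V[j, lo:hi] = local[j][lo - glo:hi - glo]
        return V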

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

  Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)            … SpMV
      MGS(w, v(0), …, v(i-1))   … Modified Gram-Schmidt orthogonalization
      update v(i), H
    endfor
    solve LSQ problem with H

  Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)             … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
    (Oops – W from power method, precision lost!)

Sequential case: #words_moved decreases by a factor of k
Parallel case:   #messages decreases by a factor of k
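TSQR is the kernel that makes the communication-avoiding step above work. Below is a minimal sequential sketch of the reduction-tree idea in Python/NumPy, for illustration only, using a flat two-level tree rather than the architecture-dependent trees the tutorial discusses.

    import numpy as np

    def tsqr(W, n_blocks=4):
        """Tall-Skinny QR via a two-level reduction tree.
        Each block Wi is factored locally; only the small R factors are
        combined, so the tall matrix is read once."""
        m, n = W.shape
        blocks = np.array_split(W, n_blocks, axis=0)
        Qs, Rs = zip(*(np.linalg.qr(Wi) for Wi in blocks))   # local QRs (parallel in practice)
        Q1, R = np.linalg.qr(np.vstack(Rs))                  # combine the small R factors
        # Assemble the implicit Q explicitly (optional; usually kept implicit)
        Q = np.vstack([Qi @ Q1[i * n:(i + 1) * n] for i, Qi in enumerate(Qs)])
        return Q, R

    # Quick check: W = QR and Q has orthonormal columns
    W = np.random.rand(1000, 10)
    Q, R = tsqr(W)
    print(np.allclose(Q @ R, W), np.allclose(Q.T @ Q, np.eye(10)))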

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
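For example, a better-conditioned Krylov basis can be built with shifted ("Newton-like") polynomials. The small sketch below is illustrative only: in practice the shifts would be Ritz values in a Leja ordering, whereas here they are simple placeholders, and the toy matrix is an arbitrary diagonal example.

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        """Build [p0(A)v, p1(A)v, ..., pk(A)v].
        shifts=None gives the monomial basis v, Av, A^2 v, ...;
        otherwise the shifted basis v, (A-s1 I)v, (A-s2 I)(A-s1 I)v, ...,
        which is typically much better conditioned for modest k."""
        V = [v / np.linalg.norm(v)]
        for j in range(k):
            w = A @ V[-1]
            if shifts is not None:
                w = w - shifts[j] * V[-1]
            V.append(w)
        return np.column_stack(V)

    n, k = 200, 8
    A = np.diag(np.linspace(1, 100, n))         # toy SPD matrix
    v = np.random.rand(n)
    mono = krylov_basis(A, v, k)
    newton = krylov_basis(A, v, k, shifts=np.linspace(1, 100, k))  # placeholder shifts
    print(np.linalg.cond(mono), np.linalg.cond(newton))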

Speed ups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

  Basis                    Naive     Monomial                                Newton        Chebyshev
  Replacement iterations   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors

                                     Nonzero entries:
                                     Explicit (O(nnz))       Implicit (o(nnz))
  Indices: Explicit (O(nnz))         CSR and variations      Vision, climate, AMR, …
  Indices: Implicit (o(nnz))         Graph Laplacian         Stencils

• Operations
  – [x, Ax, A^2x, …, A^kx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx ] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky ],
    or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky ]
  – Return all vectors or just the last one?
• Cotuning and/or interleaving
  – W = [x, Ax, A^2x, …, A^kx ] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (a b x b matmul on each innermost block):
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
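A runnable version of the blocked loop nest, written with NumPy for illustration; the fast-memory size M is a placeholder, and b ≈ sqrt(M/3) leaves room for one b-by-b block of each of A, B, and C at once.

    import numpy as np

    def blocked_matmul(A, B, M=3 * 64 * 64):
        """Tiled C = A*B. With tile size b ~ sqrt(M/3), one block of each of
        A, B, C fits in fast memory together, attaining words_moved = O(n^3 / sqrt(M))."""
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # one block-multiply: reads 3 blocks, does ~2*b^3 flops
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    A = np.random.rand(300, 300); B = np.random.rand(300, 300)
    print(np.allclose(blocked_matmul(A, B), A @ B))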

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
          [   1  0  1  ]   A
    Δ  =  [   0  1  1  ]   B
          [   1  1  0  ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
          [   1  0  ]   F
    Δ  =  [   1  0  ]   P(i)
          [   0  1  ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
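The blocking the theorem prescribes is simple to write down. Here is an illustrative NumPy sketch (the pairwise force function and the fast-memory size M are placeholders, and the particles are one-dimensional for brevity) that processes M-by-M blocks of interactions so each particle block is loaded once per block-row.

    import numpy as np

    def blocked_nbody_forces(P, M=1024):
        """1D toy direct n-body: accumulate F(i) += force(P(i), P(j)) over
        M x M blocks of (i, j) pairs, matching the M^xi = M^xj = M block sizes."""
        n = len(P)
        b = min(M, n)                          # a block of M particles fits in fast memory
        F = np.zeros(n)
        for i in range(0, n, b):
            Pi = P[i:i+b]
            for j in range(0, n, b):
                Pj = P[j:j+b]
                d = Pi[:, None] - Pj[None, :]
                # placeholder pairwise force: signed inverse-square of separation
                f = np.sign(d) / (d * d + (d == 0))
                F[i:i+b] += f.sum(axis=1)
        return F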

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
          [   1  0  1  0  0  1  ]   A1
          [   1  1  0  1  0  0  ]   A2
    Δ  =  [   0  1  1  0  1  0  ]   A3
          [   0  0  1  1  0  1  ]   A3, A4
          [   0  1  0  0  0  1  ]   A5
          [   1  0  0  1  1  0  ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
       Access locations indexed by (i,j), (i,k), (k,j)

• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
       Access locations indexed by "projections", e.g.
         φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
         φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …

• Can we bound #loop_iterations = #points in S, given bounds on #points in its
  images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array references given by the φ_j, choose the s_j
  to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by
    Christ/Tao/Carbery/Bennett:
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms
  of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
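A quick numerical sanity check of this inequality for the matmul projections (added here purely for illustration): for the full index cube S = {0,…,n-1}^3, each image φ(S) has n^2 points, and with s = (1/2, 1/2, 1/2) the bound n^3 ≤ (n^2)^(3/2) holds with equality.

    from itertools import product

    n = 8
    S = list(product(range(n), repeat=3))              # all (i, j, k) iterations
    phi = {"C": lambda p: (p[0], p[1]),                # C(i,j)
           "A": lambda p: (p[0], p[2]),                # A(i,k)
           "B": lambda p: (p[2], p[1])}                # B(k,j)
    s = {"C": 0.5, "A": 0.5, "B": 0.5}                 # optimal HBL exponents for matmul
    bound = 1.0
    for name, f in phi.items():
        bound *= len({f(p) for p in S}) ** s[name]
    print(len(S), "<=", bound)                         # 512 <= 512.0 (tight for the cube)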

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q
    (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest
    (e.g., all φ_j = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL
      too small), but it is still worth trying to attain, because your
      algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal
  tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
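The dual LP is small enough to solve directly. The sketch below, an illustration only (it uses SciPy's linprog, not any tool from the tutorial), recovers the tile exponents for the matmul and n-body Δ matrices shown earlier.

    import numpy as np
    from scipy.optimize import linprog

    def tile_exponents(Delta):
        """Solve the Dual-HBL-LP: maximize 1^T x  s.t.  Delta x <= 1, x >= 0.
        Returns the exponents x_j; tile loop index i_j by M**x_j."""
        m, k = Delta.shape
        res = linprog(c=-np.ones(k),                  # linprog minimizes, so negate
                      A_ub=Delta, b_ub=np.ones(m),
                      bounds=[(0, None)] * k)
        return res.x, -res.fun                        # x and s = 1^T x

    matmul = np.array([[1, 0, 1],                     # A(i,k)
                       [0, 1, 1],                     # B(k,j)
                       [1, 1, 0]])                    # C(i,j)
    nbody  = np.array([[1, 0],                        # F(i)
                       [1, 0],                        # P(i)
                       [0, 1]])                       # P(j)
    print(tile_exponents(matmul))   # -> x = [0.5, 0.5, 0.5], s = 1.5
    print(tile_exponents(nbody))    # -> x = [1.0, 1.0],      s = 2.0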

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)




39

Extending to rest of Direct Linear AlgebraNaiumlve Gaussian Elimination A =LU

for i = 1 to n-1 A(i+1ni) = A(i+1ni) ( 1 A(ii) ) hellip scale a vector A(i+1ni+1n) = A(i+1n i+1n ) - A(i+1n i) A(i i+1n) hellip rank-1 update

for i=1 to n-1 update column i update trailing matrix

Communication in sequentialOne-sided Factorizations (LU QR hellip)

bull Naive Approach for i=1 to n-1 update column i update trailing matrixbull words_moved = O(n3)

40

bull Blocked Approach (LAPACK) for i=1 to nb - 1 update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approachesbull minimizes messagesbull handles eig() or svd()bull works in parallel

bull Need more ideas

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional: touch all data 4 times.
Communication-Avoiding: touch all data once; many tuning parameters (number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge). Right choices reduce words_moved by a factor of M/bw, not just M^(1/2).
Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg 7) matmul:   words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS:
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter".
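For reference, the seven products that a BFS step runs in parallel (and a DFS step runs one after another) are just the Strassen subproblems. Below is a minimal sequential Strassen in Python/NumPy, an illustration only and not CAPS itself; the `cutoff` switch-over point is an arbitrary choice for the demo.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Sequential Strassen for n x n matrices, n a power of 2.
    The 7 recursive products M1..M7 are the subproblems that CAPS either
    spreads over 7 processor groups (BFS step) or runs in sequence (DFS step)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```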

71

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm.

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

74

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: entries of x, A·x, A^2·x, A^3·x over indices 1, 2, 3, 4, …, 32, with the dependencies between successive vectors shown.]

• Sequential Algorithm: the indices are processed in blocks (Step 1, Step 2, Step 3, Step 4); each step computes its block of all of Ax, A^2x, …, A^kx before moving to the next, so data moves from slow to fast memory O(1) times instead of O(k).
• Parallel Algorithm: each processor (Proc 1, Proc 2, Proc 3, Proc 4) works on an (overlapping) trapezoid of the dependency diagram and communicates once with its neighbors.

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
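A minimal sequential sketch of this idea (an illustration under simple assumptions, not the tutorial's implementation): for a tridiagonal A, each block of indices loads its slice of x plus k "ghost" entries on each side and computes all k vectors locally, paying a few redundant flops so each block of data is touched only once.

```python
import numpy as np

def local_matvec(main, off, v):
    """y = T v for a (truncated) symmetric tridiagonal T with main diagonal
    `main` and off-diagonal `off` (len(off) == len(main) - 1)."""
    y = main * v
    y[:-1] += off * v[1:]
    y[1:] += off * v[:-1]
    return y

def matrix_powers_blocked(main, off, x, k, block):
    """Return V with V[j] = A^(j+1) x, computed one index block at a time.
    Each block reads its slice of x plus k ghost entries on each side and
    computes all k levels locally, so its slow-memory data is read once."""
    n = len(x)
    V = np.zeros((k, n))
    for start in range(0, n, block):
        stop = min(start + block, n)
        lo, hi = max(0, start - k), min(n, stop + k)       # ghost region
        v = x[lo:hi].copy()
        for j in range(k):
            v = local_matvec(main[lo:hi], off[lo:hi - 1], v)
            V[j, start:stop] = v[start - lo:stop - lo]     # owned entries are exact
    return V

# Tiny check against k explicit SpMVs (n = 32, k = 3, as in the slides)
n, k = 32, 3
rng = np.random.default_rng(0)
main, off, x = rng.standard_normal(n), rng.standard_normal(n - 1), rng.standard_normal(n)
ref, v = [], x
for _ in range(k):
    v = local_matvec(main, off, v)
    ref.append(v)
assert np.allclose(matrix_powers_blocked(main, off, x, k, block=8), np.array(ref))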

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt: update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2·v, …, A^k·v ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: words moved decreases by a factor of k.
Parallel case: messages decreases by a factor of k.

• Oops – W from power method, precision lost!

88

"Monomial" basis [Ax, …, A^k x] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge
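The loss of precision in the monomial basis is easy to see numerically. Below is a small illustration (not from the tutorial; the matrix, shifts, and sizes are made up for the demo): powers of A align with the dominant eigenvector, so the monomial basis becomes ill-conditioned, while a Newton-style basis built with spread-out shifts (stand-ins for the Ritz values a real implementation would use) stays far better conditioned.

```python
import numpy as np

n, k = 200, 12
A = np.diag(np.linspace(1.0, 10.0, n))          # simple SPD test matrix
x = np.ones(n)

monomial = [x]                                  # [x, Ax, ..., A^k x]
for _ in range(k):
    monomial.append(A @ monomial[-1])

shifts = np.linspace(1.0, 10.0, k)              # crude eigenvalue estimates
newton = [x]                                    # p_j(A)x = (A - s_j I) p_{j-1}(A)x
for s in shifts:
    newton.append(A @ newton[-1] - s * newton[-1])

print("cond(monomial basis):", np.linalg.cond(np.column_stack(monomial)))
print("cond(Newton basis):  ", np.linalg.cond(np.column_stack(newton)))
```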

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab with Residual Replacement (RR), a la Van der Vorst and Ye

Basis:             Naive     Monomial                                Newton         Chebyshev
Replacement Its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

Nonzero entries \ Indices     Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz))             CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz))             Graph Laplacian         Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one?
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
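A runnable toy version of the blocked loop above (an illustration, not the tutorial's code): `M_fast` is an assumed fast-memory size in words, and b = sqrt(M_fast/3) so that one block each of A, B, and C fits, matching the slide's b ≈ M^(1/2) up to the constant.

```python
import numpy as np

def blocked_matmul(A, B, M_fast):
    """C = A @ B with b x b blocking; b chosen so three b x b blocks fit in
    a fast memory of M_fast words."""
    n = A.shape[0]
    b = max(1, int(np.sqrt(M_fast / 3)))
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b x b (or smaller, at the edges) block multiply
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(96, 96); B = np.random.rand(96, 96)
assert np.allclose(blocked_matmul(A, B, M_fast=3 * 32 * 32), A @ B)
```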

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
  Δ =  [  1  0  1 ]   A
       [  0  1  1 ]   B
       [  1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
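A quick way to reproduce this LP solution with SciPy (not part of the tutorial; just a check of the numbers quoted above):

```python
import numpy as np
from scipy.optimize import linprog

# maximize 1^T x subject to Δ x <= 1, x >= 0, for matmul's Δ
# (rows = arrays A, B, C; columns = loop indices i, j, k)
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=-np.ones(3),                  # linprog minimizes, so negate
              A_ub=Delta, b_ub=np.ones(3),
              bounds=[(0, None)] * 3)
print(res.x, -res.fun)                        # -> [0.5 0.5 0.5], s = 1.5
```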

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
  Δ =  [  1  0 ]   F
       [  1  0 ]   P(i)
       [  0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
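A toy sketch of the blocking this theorem suggests (block both loops so a block of P(i) and a block of P(j) stay in fast memory while all their interactions are computed); the 1D positions and the pairwise `force` are made up just to have something runnable.

```python
import numpy as np

def blocked_forces(P, M_fast, force=lambda pi, pj: pj - pi):
    """Toy 1D direct n-body: block both loops so ~M_fast words of P(i) and
    P(j) are reused for a whole block of interactions. `force` is a
    stand-in pairwise interaction, not a physical law."""
    n = len(P)
    b = max(1, M_fast // 2)              # half of fast memory per block
    F = np.zeros(n)
    for i0 in range(0, n, b):
        Pi = P[i0:i0+b]                  # block of targets, read once
        for j0 in range(0, n, b):
            Pj = P[j0:j0+b]              # block of sources
            # accumulate all pairwise interactions between the two blocks
            F[i0:i0+b] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
    return F

P = np.random.rand(64)
assert np.allclose(blocked_forces(P, M_fast=16),
                   (P[None, :] - P[:, None]).sum(axis=1))
```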

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

         i1 i2 i3 i4 i5 i6
  Δ =  [  1  0  1  0  0  1 ]   A1
       [  1  1  0  1  0  0 ]   A2
       [  0  1  1  0  1  0 ]   A3
       [  0  0  1  1  0  1 ]   A3, A4
       [  0  1  0  0  0  1 ]   A5
       [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1,…,sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm (Christ/Tao/Carbery/Bennett): Given S a subset of Z^k, group homomorphisms φ1, φ2, …, and s1,…,sm satisfying the HBL-LP, |S| is bounded in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|:
    |S| ≤ Πj |φj(S)|^sj
• Thm: Given a program with array refs given by the φj, choose sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on the recent result in pure mathematics by Christ/Tao/Carbery/Bennett above
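A quick sanity check (not on the slide): for matmul, S = {(i,j,k)} has |S| = n^3 and each of the three projections has an image of size n^2, so with s = (1/2, 1/2, 1/2) the bound reads
    n^3 = |S| ≤ |φ_A(S)|^(1/2) · |φ_B(S)|^(1/2) · |φ_C(S)|^(1/2) = (n^2)^(3/2) = n^3,
i.e. the inequality is tight in this case.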

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)


  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 41: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

TSQR QR of a Tall Skinny matrix

41

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

74

[Figure: vectors x, A·x, A^2·x, A^3·x over matrix columns 1, 2, 3, 4, …, 32]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A


[Figure: the same picture, now computed block by block in Steps 1–4]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Sequential Algorithm: Step 1, Step 2, Step 3, Step 4
• Example: A tridiagonal, n=32, k=3

[Figure: the same picture, partitioned among Proc 1, Proc 2, Proc 3, Proc 4]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Parallel Algorithm
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with neighbors
• Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem
(A small numerical sketch of the block/ghost-zone idea follows.)
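To make the picture concrete, here is a minimal Python/NumPy sketch (an assumed illustration, not an actual distributed implementation) of the kernel for the slide's tridiagonal example: each block-row owner fetches the k "ghost" entries of x on each side of its block once, then computes its rows of A·x, A^2·x, A^3·x locally, at the cost of some redundant work near the block edges and no further communication.

    import numpy as np

    n, k = 32, 3
    rng = np.random.default_rng(0)
    off = rng.random(n - 1)
    A = np.diag(rng.random(n)) + np.diag(off, 1) + np.diag(off, -1)  # tridiagonal
    x = rng.random(n)

    def local_powers(lo, hi):
        """Rows lo:hi of [A·x, A^2·x, ..., A^k·x], using only x[lo-k : hi+k]."""
        glo, ghi = max(0, lo - k), min(n, hi + k)     # one-time ghost-zone fetch
        v = x[glo:ghi].copy()
        out = []
        for _ in range(k):
            v = A[glo:ghi, glo:ghi] @ v               # redundant flops near edges
            out.append(v[lo - glo:hi - glo].copy())   # keep only the owned rows
        return np.array(out)

    # Four "processors", each owning 8 rows, as in the slide's picture.
    blocks = [local_powers(i, i + 8) for i in range(0, n, 8)]
    result = np.concatenate(blocks, axis=1)           # rows are A·x, A^2·x, A^3·x

    # Check against the straightforward "k separate SpMVs" computation.
    ref = np.array([np.linalg.matrix_power(A, j) @ x for j in range(1, k + 1)])
    print(np.allclose(result, ref))                   # True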

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES

  W = [ v, Av, A^2 v, …, A^k v ]
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from the power method, precision lost!

Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.

The "monomial" basis [Ax, …, A^k x] fails to converge.

A different polynomial basis [p1(A)x, …, pk(A)x] does converge (a small sketch of the s-step basis and the TSQR step follows).
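A small Python/NumPy sketch of the two building blocks above (an assumed illustration, not a full CA-GMRES): build the s-step basis W in one pass, orthogonalize it with a single QR (np.linalg.qr standing in for TSQR), and compare the conditioning of the monomial basis with a Newton-style basis. The shifts used are placeholders for the Ritz values a real implementation would compute, and the Hessenberg/least-squares steps are omitted.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 200, 12
    A = np.diag(np.linspace(1.0, 100.0, n)) + 0.01 * rng.standard_normal((n, n))
    v = rng.random(n)

    def monomial_basis(A, v, k):
        W = [v / np.linalg.norm(v)]
        for _ in range(k):
            W.append(A @ W[-1])                   # [v, Av, A^2 v, ..., A^k v]
        return np.column_stack(W)

    def newton_basis(A, v, shifts):
        W = [v / np.linalg.norm(v)]
        for theta in shifts:
            w = A @ W[-1] - theta * W[-1]         # apply (A - theta*I) each step
            W.append(w / np.linalg.norm(w))
        return np.column_stack(W)

    W_mono = monomial_basis(A, v, k)
    W_newt = newton_basis(A, v, np.linspace(1.0, 100.0, k))  # placeholder shifts

    Q, R = np.linalg.qr(W_newt)                   # one QR of the tall-skinny basis
    print("cond(monomial basis):", np.linalg.cond(W_mono))   # huge: precision lost
    print("cond(Newton basis):  ", np.linalg.cond(W_newt))   # far better behaved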

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR) à la Van der Vorst and Ye:

Basis:             Naive    Monomial                                Newton        Chebyshev
Replacement Its.:  74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

                               Indices:
Nonzero entries:               Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))            CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))            Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  • Return all vectors or just the last one

• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T W or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … innermost loops form a b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound #words_moved = Ω(n^3/M^(1/2))
• Where does the 1/2 come from? (A runnable version of the blocked loop follows.)
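A runnable NumPy version of the blocked loop nest above (an illustration under the stated assumption that fast memory holds M words, so one block each of A, B, and C fits when b ≈ (M/3)^(1/2); the slide's b = M^(1/2) ignores the constant):

    import numpy as np

    def blocked_matmul(A, B, M=3 * 64 * 64):
        """C = A @ B using b-by-b blocks with 3*b^2 <= M (fast-memory words)."""
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b block multiply: reads ~2*b^2 words, updates b^2
                    C[i1:i1 + b, j1:j1 + b] += A[i1:i1 + b, k1:k1 + b] @ B[k1:k1 + b, j1:j1 + b]
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    print(np.allclose(blocked_matmul(A, B), A @ B))   # True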

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in the matrix Δ:

             i  j  k
           [ 1  0  1 ]   A
    Δ  =   [ 0  1  1 ]   B
           [ 1  1  0 ]   C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2)), attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
  (A small numerical check of this LP follows.)
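The little LP above is easy to check numerically; the sketch below (assuming SciPy is available) feeds the matmul Δ into scipy.optimize.linprog, which minimizes, so the objective is negated:

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k): touches loop indices i and k
                      [0, 1, 1],    # B(k,j): touches j and k
                      [1, 1, 0]])   # C(i,j): touches i and j

    # maximize 1^T x subject to Delta @ x <= 1, x >= 0 (linprog minimizes, hence -1)
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    print("x =", res.x, " s =", res.x.sum())   # expect x = [0.5 0.5 0.5], s = 1.5
    # So #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)); tile each loop index by M^(x_j).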

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in the matrix Δ:

             i  j
           [ 1  0 ]   F
    Δ  =   [ 1  0 ]   P(i)
           [ 0  1 ]   P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1), attained by block sizes M^xi, M^xj = M^1, M^1
  (A runnable blocked version follows.)
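A runnable NumPy sketch of the blocked direct n-body loop implied by these block sizes (the toy force function and the choice b = M/2 are illustrative assumptions, not the slide's code; M here is a notional fast-memory size in particles):

    import numpy as np

    def force(pi, pj):
        d = pj - pi
        r2 = np.sum(d * d, axis=-1) + 1e-9       # softened, so self-pairs are harmless
        return d / r2[..., None]                  # toy 1/r^2-style interaction

    def blocked_nbody(P, M=1024):
        n = len(P)
        b = max(1, M // 2)                        # two blocks of b particles fit in "M"
        F = np.zeros_like(P)
        for i1 in range(0, n, b):
            Pi = P[i1:i1 + b]                     # load a block of targets
            for j1 in range(0, n, b):
                Pj = P[j1:j1 + b]                 # load a block of sources
                # b x b force evaluations per pair of loaded blocks, no further reads
                F[i1:i1 + b] += force(Pi[:, None, :], Pj[None, :, :]).sum(axis=1)
        return F

    P = np.random.rand(4096, 3)
    print(blocked_nbody(P).shape)                 # (4096, 3)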

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record the array indices in the matrix Δ:

             i1 i2 i3 i4 i5 i6
           [ 1  0  1  0  0  1 ]   A1
           [ 1  1  0  1  0  0 ]   A2
    Δ  =   [ 0  1  1  0  1  0 ]   A3
           [ 0  0  1  1  0  1 ]   A3, A4
           [ 0  1  0  0  0  1 ]   A5
           [ 1  0  0  1  1  0 ]   A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1,…,sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))

• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    #words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:

• Given S, a subset of Z^k, and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|

• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
  (A worked matmul instance of this inequality follows.)
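As a quick sanity check of the inequality using only quantities already on these slides: for matmul, take S = {(i,j,k) : 1 ≤ i,j,k ≤ n} with the three projections onto (i,j), (i,k), and (k,j). Each image has size n^2, and s1 = s2 = s3 = 1/2 is feasible for the HBL-LP, so the theorem gives |S| = n^3 ≤ (n^2)^(1/2) · (n^2)^(1/2) · (n^2)^(1/2) = n^3, i.e., the bound is tight for matmul.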

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for the sequential algorithm, tile i_j by M^(x_j)
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

108

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 42: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

TSQR QR of a Tall Skinny matrix

42

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 43: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01 R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamicallyMulticore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash 16x slower than accessing data twice (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer others

44

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

                      Naive      Monomial                               Newton         Chebyshev
    Replacement Its   74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                                       Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
    Nonzero entries: Explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
    Nonzero entries: Implicit (o(nnz))   Graph Laplacian              Stencils

  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one

• Co-tuning and/or interleaving:
  • W = [x, Ax, A²x, …, Aᵏx] and TSQR(W), or WᵀW, or …
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
        C(i,j) += A(i,k) * B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
            i = i1+i2;  j = j1+j2;  k = k1+k2
            C(i,j) += A(i,k) * B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
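A minimal runnable sketch of the blocked code above (the function name and the use of NumPy slices for the b x b block multiplies are illustrative):

    import numpy as np

    def blocked_matmul(A, B, b):
        # C = A @ B using b-by-b blocks, as in the "blocked" loop nest above.
        # Each block of A, B, C is moved between slow and fast memory O(n/b) times;
        # with b ~ sqrt(M), this attains the Omega(n^3 / sqrt(M)) words-moved bound.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b matmul on the blocks starting at (i1, j1, k1)
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 256, 32
    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    assert np.allclose(blocked_matmul(A, B, b), A @ B)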

New Thm applied to Matmul

• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

                i  j  k
        Δ  =  [ 1  0  1 ]   A
              [ 0  1  1 ]   B
              [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ:  max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^(s-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
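As a quick check (not from the slides), this LP can be solved mechanically; here is a small SciPy sketch with the matmul Δ. The same call with the N-body Δ below gives s = 2, and with the six-loop Δ further down gives s = 15/7, matching the slides.

    import numpy as np
    from scipy.optimize import linprog

    # Rows of Delta: which loop indices (i, j, k) each array reference touches.
    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)

    # max 1^T x  s.t.  Delta x <= 1  (linprog minimizes, so negate the objective)
    res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, -res.fun)          # expected: x = [0.5, 0.5, 0.5], s = 1.5

This is exactly the Dual-HBL-LP that appears later ("Is this bound attainable? (2/2)"): its solution x gives the optimal tile sizes M^xi, M^xj, M^xk.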

New Thm applied to Direct N-Body

• for i = 1:n, for j = 1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

                i  j
        Δ  =  [ 1  0 ]   F
              [ 1  0 ]   P(i)
              [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ:  max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^(s-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
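A minimal sketch of the blocked loop the bound says is optimal (names and the toy force are illustrative): load one block of particles for i and one block for j, and do all the interactions between them before touching slow memory again.

    import numpy as np

    def blocked_forces(P, M, force):
        # Accumulate F(i) += force(P(i), P(j)) over all (i, j) pairs.
        # Each pass keeps one i-block and one j-block (O(M) words) in fast memory
        # and performs O(M^2) interactions, matching the Omega(n^2 / M) bound.
        n = len(P)
        F = np.zeros_like(P)
        for i0 in range(0, n, M):
            for j0 in range(0, n, M):
                Pi, Pj = P[i0:i0+M], P[j0:j0+M]
                for a in range(len(Pi)):
                    for b in range(len(Pj)):
                        F[i0 + a] += force(Pi[a], Pj[b])
        return F

    # toy 1-D "force", purely illustrative
    P = np.linspace(0.0, 1.0, 64)
    F = blocked_forces(P, M=8, force=lambda p, q: q - p)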

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

• 11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
      A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
      A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

               i1 i2 i3 i4 i5 i6
        Δ  = [ 1  0  1  0  0  1 ]   A1
             [ 1  1  0  1  0  0 ]   A2
             [ 0  1  1  0  1  0 ]   A3
             [ 0  0  1  1  0  1 ]   A3, A4
             [ 0  1  0  0  0  1 ]   A5
             [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ:  max 1ᵀx  s.t.  Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^(s-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n
        C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³
    Access locations indexed by (i,j), (i,k), (k,j)

• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
        D(something else) = func(something else)
        …
  ⇒ for (i1, i2, …, ik) in S = subset of Zᵏ
    Access locations indexed by "projections", e.g.
        φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
        φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …)
        …

• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
      for all subgroups H < Zᵏ:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
      words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
      |S| ≤ Πj |φj(S)|^sj
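The same statements in standard notation, as a LaTeX transcription of the slide's formulas (the only addition is making explicit that the exponents s_j must satisfy the HBL-LP):

    % Christ/Tao/Carbery/Bennett inequality and the resulting lower bound
    \text{HBL-LP: } \operatorname{rank}(H) \le \sum_j s_j \operatorname{rank}(\phi_j(H))
        \quad \text{for all subgroups } H \le \mathbb{Z}^k
    \text{If } (s_1,\dots,s_m) \text{ satisfies the HBL-LP, then }
        |S| \le \prod_j |\phi_j(S)|^{s_j} \quad \text{for all finite } S \subseteq \mathbb{Z}^k
    \text{With } s_{\mathrm{HBL}} = \min \sum_j s_j:\quad
        \#\text{words\_moved} = \Omega\!\left(\#\text{iterations} / M^{s_{\mathrm{HBL}}-1}\right)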

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
  • Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
      HBL-LP:       minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
      Dual-HBL-LP:  maximize 1ᵀx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^xj
• Ex: Matmul – s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 45: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

45

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 46: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1,024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot of predicted speedup; horizontal axis: log2(p), vertical axis: log2(n^2/p) = log2(memory_per_proc)]

2.5D vs 2D LU, With and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
  – Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
  – Cache-oblivious are underlined, green are ours, ? is unknown/future work

Algorithm | 2 Levels of Memory (#Words Moved and #Messages) | Multiple Levels of Memory (#Words Moved and #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested) or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson 97]; [BDHS09] | [Gustavson,97], [Ahmed,Pingali,00]; [BDHS09] | (←same); (←same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX 08]; [GDX 08] with partial pivoting | [Toledo 97], [GDX 08]; [GDX 08]
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] but 3x flops, [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03], [DGHL08]
Eig, SVD | Not LAPACK; [BDD11] randomized but more flops; [BDK11] | [BDD11,BDK11]; [BDD11,BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
     #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
     #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | (n log P) / P^(1/2)
QR | ScaLAPACK | log P | (n log P) / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum Memory per processor: M = O(n^2 / P)
• Recall lower bounds:
     #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
     #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c · n^2 / P) per processor
• Increasing M reduces the lower bounds:
     #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2) P^(1/2)) )
     #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [DS11, SBD11] | polylog P | polylog P
Cholesky | [SD11, in prog.] | polylog P | c^2 polylog P – optimal
LU | [DS11, SBD11] | polylog P | c^2 polylog P – optimal
QR | Via Cholesky QR | polylog P | c^2 polylog P – optimal
Sym Eig, SVD | |
Nonsym Eig | |

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction

[Figure, originally shown as a sequence of animation frames: orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T successively eliminate d diagonals, c columns at a time, and chase the resulting bulges down the band; frame labels: b+1, d+1, c, d+c. Key: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.
Communication-Avoiding: touch all data once. Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:        #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg 7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:    #words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS): BFS vs DFS

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS:  if EnoughMemory and P ≥ 7  then  BFS step  else  DFS step  end if

In practice, how best to interleave BFS and DFS is a "tuning parameter".
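For reference, here is a plain sequential sketch (my own illustration, not the CAPS code) of the Strassen recursion that CAPS parallelizes; a BFS step runs the seven recursive products below on disjoint processor subsets, while a DFS step runs them one after another on all processors.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen's 7-multiply recursion; falls back to classical matmul
        below `cutoff`. Assumes square matrices of power-of-two size."""
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 products; CAPS runs these in parallel (BFS) or in sequence (DFS).
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n), dtype=M1.dtype)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C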

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24% to 184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
     • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
     • All-pairs-shortest-path, …
     • Both 2D (c=1) and 2.5D (c>1)
     • But only bandwidth may decrease with c>1, not latency
     • Sparse matrices
  – Platforms:
     • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
     • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
     • Conventional: O(k) moves of data from slow to fast memory
     • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
     • Conventional: O(k log p) messages (k SpMV calls, dot prods)
     • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^k x]
• Example: A tridiagonal, n=32, k=3 (figure shows rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32; each entry of A^j·x depends only on neighboring entries of A^(j-1)·x)
• Works for any "well-partitioned" A
• Sequential algorithm: Step 1, Step 2, Step 3, Step 4 – each step reads one block of A and x from slow memory and computes the corresponding entries of all k+1 vectors before moving on
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 – each processor communicates once with its neighbors, then works on its (overlapping) trapezoid
• Same idea works for general sparse matrices
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
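A minimal serial sketch of the matrix powers idea for the tridiagonal example (my own illustration, with hypothetical names): each block of rows is extended by k "ghost" rows on each side, so all k powers restricted to that block can be computed from a single read of the local data.

    import numpy as np

    def matrix_powers_tridiag(A, x, k, block=8):
        """Compute V[j] = A^j x for j = 0..k, block by block.

        For a tridiagonal A, entry i of A^j x depends only on entries
        i-j .. i+j of x, so a block of rows plus k ghost rows on each side
        is enough data to compute all k powers for that block locally.
        """
        n = x.size
        V = np.zeros((k + 1, n))
        V[0] = x
        for b0 in range(0, n, block):
            lo, hi = max(0, b0 - k), min(n, b0 + block + k)   # block + ghosts
            Aloc = A[lo:hi, lo:hi]                            # local piece of A
            vloc = x[lo:hi].copy()
            for j in range(1, k + 1):
                vloc = Aloc @ vloc
                # interior entries of vloc are exact entries of A^j x for this block
                V[j, b0:b0 + block] = vloc[b0 - lo : b0 - lo + min(block, n - b0)]
        return V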

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^k v ]
   [Q,R] = TSQR(W)                   … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, …, A^k x] fails to converge
  – A different polynomial basis [p1(A)x, …, pk(A)x] does converge
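Since CA-GMRES builds on TSQR, here is a tiny sketch of the reduction idea behind it (my own illustration): factor each block of rows of the tall-skinny W independently, stack the small R factors, and factor once more; only the small R factors need to move between processors or memory levels.

    import numpy as np

    def tsqr_r(W, nblocks=4):
        """Return the R factor of a tall-skinny W via a one-level TSQR reduction.

        Each row block is factored independently (in parallel in the real
        algorithm); the stacked R factors are factored once more. Only the
        small R factors move between processors / memory levels.
        """
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # local QRs
        return np.linalg.qr(np.vstack(Rs), mode='r')         # combine R factors

    # Usage: R agrees with the R of a direct QR of W, up to the signs of its rows.
    W = np.random.rand(1000, 5)
    print(np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r'))))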

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

Iterations at which a residual replacement occurred (number of replacements in parentheses), with Residual Replacement (RR) a la Van der Vorst and Ye:

   Naive:      74 (1)
   Monomial:   [7, 15, 24, 31, …, 92, 97, 103] (17)
   Newton:     [67, 98] (2)
   Chebyshev:  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors

                                        Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
Nonzero entries: Explicit (O(nnz))      CSR and variations           Vision, climate, AMR, …
Nonzero entries: Implicit (o(nnz))      Graph Laplacian              Stencils

• Operations:
  – [x, Ax, A^2x, …, A^k x]  or  [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^k x] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  – Return all vectors or just the last one?
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^k x] and {TSQR(W) or W^T W or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
     • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
     • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
     • pOSKI for SpMV – available at bebop.cs.berkeley.edu
     • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• "Blocked" code:
     for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
           i = i1+i2, j = j1+j2, k = k1+k2
           C(i,j) += A(i,k)·B(k,j)          … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
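The same blocked loop nest written as a runnable sketch (my own illustration); choosing b on the order of M^(1/2), small enough that a b x b block of each of A, B and C fits in fast memory, attains the Ω(n^3 / M^(1/2)) bound.

    import numpy as np

    def blocked_matmul(A, B, b):
        """C = A @ B computed b x b block by b x b block (the loop nest above)."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b block of each of A, B, C in fast memory at a time
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    print(np.allclose(blocked_matmul(A, B, 32), A @ B))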

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

            i  j  k
            1  0  1    A
   Δ  =     0  1  1    B
            1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)). Attained by block sizes M^(xi), M^(xj), M^(xk) = M^(1/2), M^(1/2), M^(1/2)


  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 47: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 48: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

25D vs 2D LUWith and Without Pivoting

Summary of dense sequential algorithms attaining communication lower bounds

bull Algorithms shown minimizing Messages use (recursive) block layoutbull Not possible with columnwise or rowwise layouts

bull Many references (see reports) only some shown plus oursbull Cache-oblivious are underlined Green are ours is unknownfuture work

Algorithm 2 Levels of Memory Multiple Levels of Memory

Words Moved and Messages Words Moved and Messages

BLAS-3 Usual blocked or recursive algorithms Usual blocked algorithms (nested) or recursive [Gustavson97]

Cholesky LAPACK (with b = M12) [Gustavson 97]

[BDHS09]

[Gustavson97] [AhmedPingali00]

[BDHS09]

(larrsame) (larrsame)

LU withpivoting

LAPACK (sometimes)[Toledo97] [GDX 08]

[GDX 08] partial pivoting

[Toledo 97][GDX 08]

[GDX 08]

QRRank-revealing

LAPACK (sometimes) [ElmrothGustavson98]

[DGHL08]

[FrensWise03]but 3x flops

[DGHL08]

[ElmrothGustavson98]

[DGHL08]

[FrensWise03] [DGHL08]

Eig SVD Not LAPACK [BDD11] randomized but more flops [BDK11]

[BDD11BDK11]

[BDD11BDK11]

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processorsbull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix MultiplyCholeskyLUQRSym Eig SVDNonsym Eig

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1Cholesky ScaLAPACK log PLU ScaLAPACK log PQR ScaLAPACK log PSym Eig SVD ScaLAPACK log PNonsym Eig ScaLAPACK P12 log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (conventional approach)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU ScaLAPACK log P n log P P12

QR ScaLAPACK log P n log P P12

Sym Eig SVD ScaLAPACK log P n P12

Nonsym Eig ScaLAPACK P12 log P n log P

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

(Figure sequence, "Successive Band Reduction": an animation showing sweeps 1–6 of bulge chasing, applying Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T to the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Only need to zero out the leading parallelogram of each trapezoid.)

Conventional vs CA - SBR

Conventional: touch all data 4 times.
Communication-avoiding: touch all data once. Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

  Classical O(n^3) matmul:       #words_moved = Ω( M · (n / M^(1/2))^3     / P )
  Strassen's O(n^lg 7) matmul:   #words_moved = Ω( M · (n / M^(1/2))^(lg 7) / P )
  Strassen-like O(n^ω) matmul:   #words_moved = Ω( M · (n / M^(1/2))^ω     / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
  vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

CAPS:
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

In practice, how best to interleave BFS and DFS steps is a "tuning parameter".
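To make the "7 multiplies" structure concrete, here is a small sequential Strassen sketch (an illustration added here, not the CAPS code): CAPS decides at each recursion level whether these 7 recursive products run simultaneously on P/7 processor groups (a BFS step) or one after another on all P processors (a DFS step).

    # Sequential Strassen recursion (illustrates the 7 subproblems only; not CAPS itself).
    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff or n % 2:            # fall back to classical matmul
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 recursive multiplies; CAPS runs these breadth-first or depth-first.
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    print(np.allclose(strassen(A, B), A @ B))   # True (up to roundoff)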


Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3; works for any "well-partitioned" A
• (Figure sequence: the entries of x, A·x, A^2·x, A^3·x laid out over columns 1, 2, 3, 4, …, 32, with the dependence pattern highlighted frame by frame.)
• Sequential algorithm: Steps 1–4; each step computes its block of all k vectors before moving to the next block
• Parallel algorithm: Proc 1 – Proc 4; each processor works on an (overlapping) trapezoid and communicates once with its neighbors

Same idea works for general sparse matrices:
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
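The sketch below (an illustration added here, not from the slides) emulates the parallel matrix powers kernel for the tridiagonal example: each block fetches its ghost rows once, then computes its rows of Ax, A^2x, …, A^kx with no further communication, and the result is checked against k conventional SpMVs.

    # Matrix powers kernel sketch for tridiagonal A (serial emulation of the parallel idea).
    import numpy as np

    n, k, p = 32, 3, 4
    rng = np.random.default_rng(0)
    diag, off = rng.random(n), rng.random(n - 1)   # main diagonal and off-diagonal of A
    x = rng.random(n)

    def spmv(v):                                   # reference SpMV using the full vector
        y = diag * v
        y[:-1] += off * v[1:]
        y[1:]  += off * v[:-1]
        return y

    ref = [x.copy()]                               # k conventional SpMVs (halo exchanged each step)
    for _ in range(k):
        ref.append(spmv(ref[-1]))

    result = [x.copy()] + [np.empty(n) for _ in range(k)]
    for b in range(p):                             # each "processor" owns rows [lo, hi)
        lo, hi = b * n // p, (b + 1) * n // p
        glo, ghi = max(lo - k, 0), min(hi + k, n)  # owned rows plus k ghost rows per side
        w = x[glo:ghi].copy()                      # the only remote data this block ever receives
        for j in range(1, k + 1):                  # valid window shrinks by one row per power
            y = diag[glo:ghi] * w
            y[:-1] += off[glo:ghi - 1] * w[1:]
            y[1:]  += off[glo:ghi - 1] * w[:-1]
            w = y
            result[j][lo:hi] = w[lo - glo:hi - glo]

    print(all(np.allclose(r, s) for r, s in zip(ref, result)))   # True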

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … orthogonalize (Modified Gram-Schmidt)
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q, R] = TSQR(W)                 … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k.
Parallel case: #messages decreases by a factor of k.

• Oops – W from power method, precision lost!
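A minimal sketch of the TSQR step used above (an illustrative two-level version added here; the production algorithm is architecture-dependent): split the tall-skinny W into row blocks, factor each block locally, then factor the stacked small R factors.

    # Two-level TSQR sketch: QR of a tall-skinny matrix via local QRs of row blocks.
    import numpy as np

    def tsqr_r(W, nblocks=4):
        """R factor of W computed block-row-wise (the communication-avoiding shape)."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # local QRs, no communication
        return np.linalg.qr(np.vstack(Rs), mode='r')         # combine the small R factors

    W = np.random.rand(10000, 8)
    R_tsqr = tsqr_r(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to the signs of its rows, so compare absolute values.
    print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))        # True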

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge


Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye – replacement iterations per basis:
  Naive:     74 (1)
  Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:    [67, 98] (2)
  Chebyshev: 68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication (nonzero entries vs. indices):
  – Entries explicit (O(nnz)), indices explicit (O(nnz)): CSR and variations
  – Entries explicit (O(nnz)), indices implicit (o(nnz)): vision, climate, AMR, …
  – Entries implicit (o(nnz)), indices explicit (O(nnz)): graph Laplacian
  – Entries implicit (o(nnz)), indices implicit (o(nnz)): stencils
• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and TSQR(W), or W^TW, or …
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New: lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g. n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k) * B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k) * B(k,j)        … innermost loops do a b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
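A runnable NumPy transcription of the blocked loop nest above (illustrative; in practice b would be chosen so that three b x b blocks fit in fast memory, roughly b ≈ (M/3)^(1/2)):

    # Blocked matmul: same loop structure as the slide, with NumPy doing each b x b block.
    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # C block += A block * B block   (the b x b matmul)
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 256, 64
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    print(np.allclose(blocked_matmul(A, B, b), A @ B))   # True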

New Thm applied to Matmul

• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

            i  j  k
        [   1  0  1  ]   A
  Δ =   [   0  1  1  ]   B
        [   1  1  0  ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
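This little LP can be checked mechanically. The snippet below (illustrative, added here) solves max 1^T x subject to Δx ≤ 1 with SciPy and recovers x = [1/2, 1/2, 1/2], s = 3/2.

    # HBL linear program for matmul: maximize 1'x subject to delta @ x <= 1, x >= 0.
    import numpy as np
    from scipy.optimize import linprog

    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    # linprog minimizes, so negate the objective.
    res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, -res.fun)          # [0.5 0.5 0.5]  1.5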

New Thm applied to Direct N-Body

• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
        [   1  0  ]   F
  Δ =   [   1  0  ]   P(i)
        [   0  1  ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
        [   1  0  1  0  0  1  ]   A1
        [   1  1  0  1  0  0  ]   A2
  Δ =   [   0  1  1  0  1  0  ]   A3
        [   0  0  1  1  0  1  ]   A3, A4
        [   0  1  0  0  0  1  ]   A5
        [   1  0  0  1  1  0  ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
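The same SciPy check as before (illustrative, added here; the A5 row of Δ is taken from its subscripts (i2, i6)) reproduces 1^T x = 15/7 ≈ 2.143:

    # HBL LP for the "random code": maximize 1'x subject to delta @ x <= 1.
    import numpy as np
    from scipy.optimize import linprog

    delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                      [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                      [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                      [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
                      [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                      [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
    res = linprog(c=-np.ones(6), A_ub=delta, b_ub=np.ones(6), bounds=(0, None))
    print(-res.fun)                          # 2.142857... = 15/7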

Approach to generalizing lower bounds

• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k) * B(k,j)
    => for (i,j,k) in S = subset of Z^3
  Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
    => for (i1,i2,…,ik) in S = subset of Z^k
  Access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
    φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    #words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ / Tao / Carbery / Bennett
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ / Tao / Carbery / Bennett): Given s1, …, sm as above,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
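For reference, the same statements in compact notation (a restatement of the bullets above, nothing new):

    % HBL-LP and the resulting communication lower bound
    \begin{align*}
    \text{HBL-LP:}\;& \mathrm{rank}(H) \le \sum_j s_j\,\mathrm{rank}(\varphi_j(H))
                      \quad \text{for all subgroups } H < \mathbb{Z}^k,\\
    \text{(CTCB):}\;& |S| \le \prod_j |\varphi_j(S)|^{s_j}
                      \quad \text{for finite } S \subseteq \mathbb{Z}^k,\\
    \text{Bound:}\;& \#\mathrm{words\_moved} = \Omega\!\Big(\tfrac{\#\mathrm{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\Big),
                     \qquad s_{\mathrm{HBL}} = \min \textstyle\sum_j s_j \ \text{s.t. HBL-LP}.
    \end{align*}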

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g. all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)


Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)


Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  Algorithm        Reference      Factor exceeding lower      Factor exceeding lower
                                  bound for #words_moved      bound for #messages
  Matrix Multiply  [Cannon, 69]   1                           1
  Cholesky         ScaLAPACK      log P                       log P
  LU               ScaLAPACK      log P                       n·log P / P^(1/2)
  QR               ScaLAPACK      log P                       n·log P / P^(1/2)
  Sym Eig, SVD     ScaLAPACK      log P                       n / P^(1/2)
  Nonsym Eig       ScaLAPACK      P^(1/2)·log P               n·log P

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  Algorithm        Reference      Factor exceeding lower      Factor exceeding lower
                                  bound for #words_moved      bound for #messages
  Matrix Multiply  [Cannon, 69]   1                           1
  Cholesky         ScaLAPACK      log P                       log P
  LU               [GDX10]        log P                       log P
  QR               [DGHL08]       log P                       log^3 P
  Sym Eig, SVD     [BDD11]        log P                       log^3 P
  Nonsym Eig       [BDD11]        log P                       log^3 P

Can we do even better?

• Assume n x n matrices on P processors
• Use c copies of data: M = O(c·n^2 / P) per processor
• Increasing M reduces lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / (c^(1/2)·P^(1/2)) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )

  Algorithm        Reference          Factor exceeding lower     Factor exceeding lower
                                      bound for #words_moved     bound for #messages
  Matrix Multiply  [DS11, SBD11]      polylog P                  polylog P
  Cholesky         [SD11, in prog.]   polylog P                  c^2 polylog P – optimal
  LU               [DS11, SBD11]      polylog P                  c^2 polylog P – optimal
  QR               Via Cholesky QR    polylog P                  c^2 polylog P – optimal
  Sym Eig, SVD
  Nonsym Eig
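A minimal sketch (my own illustration, not from the tutorial) of the two cost formulas above, assuming M = c·n^2/P, to show how c extra copies of the data lower both bounds:

    # Evaluate words_moved = Omega((n^3/P)/M^(1/2)) and messages = Omega((n^3/P)/M^(3/2))
    # for per-processor memory M = c*n^2/P.
    def comm_lower_bounds(n, P, c=1):
        M = c * n**2 / P                  # memory per processor
        flops_per_proc = n**3 / P
        words = flops_per_proc / M**0.5   # = n^2 / (c^(1/2) * P^(1/2))
        msgs = flops_per_proc / M**1.5    # = P^(1/2) / c^(3/2)
        return words, msgs

    for c in (1, 4, 16):
        w, m = comm_lower_bounds(n=2**14, P=1024, c=c)
        print(f"c={c:2d}: words_moved >= {w:.3e}, messages >= {m:.3e}")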

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Successive Band Reduction

(The original slides step through a figure of the band matrix over several frames; only the annotations are recoverable here.)

• b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b
• The frames apply orthogonal transformations Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T, clearing d diagonals from c columns at a time; the fill-in blocks (labeled 1–6 in the frames) are chased down the band by later transformations
• Only need to zero out leading parallelogram of each trapezoid
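A small sketch (my own illustration, not the CA-SBR implementation) of the symmetric banded eigenproblem this reduction targets, using SciPy's conventional banded solver, which reduces the band matrix to tridiagonal form internally via LAPACK:

    import numpy as np
    from scipy.linalg import eigvals_banded

    n, b = 500, 5                        # matrix size and bandwidth (number of superdiagonals)
    rng = np.random.default_rng(0)
    A = np.zeros((n, n))
    for d in range(b + 1):               # fill diagonals 0..b symmetrically
        vals = rng.standard_normal(n - d)
        A += np.diag(vals, d)
        if d:
            A += np.diag(vals, -d)

    # LAPACK-style upper band storage: band[b - d, d:] holds diagonal d of A
    band = np.zeros((b + 1, n))
    for d in range(b + 1):
        band[b - d, d:] = np.diag(A, d)

    w_band = eigvals_banded(band)        # band -> tridiagonal -> eigenvalues (LAPACK)
    w_dense = np.linalg.eigvalsh(A)      # dense reference
    print(np.allclose(w_band, w_dense))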

Conventional vs CA-SBR

  Conventional:            Touch all data 4 times
  Communication-Avoiding:  Touch all data once

• Many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

  Classical O(n^3) matmul:         #words_moved = Ω( M·(n/M^(1/2))^3 / P )
  Strassen's O(n^(lg 7)) matmul:   #words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
  Strassen-like O(n^ω) matmul:     #words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

  BFS step: runs all 7 multiplies in parallel, each on P/7 processors;
            needs 7/4 as much memory
  DFS step: runs all 7 multiplies sequentially, each on all P processors;
            needs 1/4 as much memory

  CAPS: if EnoughMemory and P ≥ 7
          then BFS step
          else DFS step
        end if

  In practice, how to best interleave BFS and DFS is a "tuning parameter"
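A rough sketch (my own toy model, not the CAPS implementation; the 3n^2/P memory estimate per recursion level is an assumption) of how the decision rule above produces an interleaved BFS/DFS schedule:

    def caps_schedule(n, P, M, max_depth=8):
        """Walk the Strassen recursion and pick BFS or DFS at each level.
        n: matrix dimension, P: processors, M: words of memory per processor."""
        steps = []
        for _ in range(max_depth):
            if n <= 1:
                break
            dfs_need = 3 * n * n / P       # rough local footprint of A, B, C at this level
            bfs_need = 7 / 4 * dfs_need    # a BFS step needs 7/4 as much memory (see above)
            if M >= bfs_need and P >= 7:
                steps.append("BFS")        # 7 subproblems in parallel, on P/7 processors each
                P //= 7
            else:
                steps.append("DFS")        # 7 subproblems one after another on all P processors
            n //= 2
        return steps

    print(caps_schedule(n=2**15, P=7**3, M=4e6))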

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

(The original slides build up a dependency figure over several frames: rows x, A·x, A^2·x, A^3·x over entries 1, 2, 3, 4, …, 32, showing which entries of each level are needed to compute the next.)

• Sequential Algorithm: the figure is covered in Steps 1–4, one trapezoid of entries at a time, so each piece of x is read from slow memory once and all k levels that depend on it are computed before moving on.

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Parallel Algorithm; example: A tridiagonal, n=32, k=3, block rows owned by Proc 1, Proc 2, Proc 3, Proc 4
• Each processor communicates once with neighbors, then works on its (overlapping) trapezoid of the figure
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
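A minimal sketch (my own toy code for the tridiagonal example above, not the tuned kernel): each block owner copies k "ghost" entries from each neighbor once, then computes its rows of [Ax, A^2x, …, A^kx] with no further communication, at the price of some redundant flops near the block boundaries:

    import numpy as np

    def apply_tridiag(v):
        # y = A v for the stencil A = tridiag(-1, 2, -1), zero outside the ends
        y = 2.0 * v
        y[:-1] -= v[1:]
        y[1:] -= v[:-1]
        return y

    def matrix_powers_block(x, lo, hi, k):
        """Rows lo:hi of [A x, A^2 x, ..., A^k x], using only x[lo-k : hi+k]."""
        gl, gh = max(lo - k, 0), min(hi + k, len(x))
        w = x[gl:gh].copy()               # owned block plus ghost zone: the one "communication"
        out = []
        for _ in range(k):
            w = apply_tridiag(w)          # redundant work near the ghost edges, no messages
            out.append(w[lo - gl:hi - gl].copy())
        return out

    n, k, p = 32, 3, 4
    x = np.random.default_rng(1).standard_normal(n)
    pieces = [matrix_powers_block(x, i * n // p, (i + 1) * n // p, k) for i in range(p)]

    # check each level against k global SpMVs
    ref = x.copy()
    for j in range(k):
        ref = apply_tridiag(ref)
        got = np.concatenate([pieces[i][j] for i in range(p)])
        print("A^%d x ok:" % (j + 1), np.allclose(got, ref))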

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2

  Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H

  Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q,R] = TSQR(W)                  … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!
• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
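A small sketch (my own illustration, not the CA-GMRES code): build the Krylov basis W in one pass, then orthogonalize it with a two-level TSQR instead of k rounds of Gram-Schmidt; the block-row QR plus one QR of the stacked R factors stands in for the TSQR reduction tree:

    import numpy as np

    def tsqr_r(W, blocks=4):
        """R factor of a tall-skinny QR: local QR per block row, then QR of stacked R's."""
        rows = np.array_split(np.arange(W.shape[0]), blocks)
        Rs = [np.linalg.qr(W[r], mode="r") for r in rows]    # one local QR per "processor"
        return np.linalg.qr(np.vstack(Rs), mode="r")         # one-level reduction tree

    rng = np.random.default_rng(0)
    n, k = 1000, 6
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)

    W = np.empty((n, k + 1))
    W[:, 0] = rng.standard_normal(n)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]       # in CA-GMRES this is one matrix powers kernel call

    R_tsqr = np.abs(tsqr_r(W))          # |R| removes the sign ambiguity of QR
    R_ref = np.abs(np.linalg.qr(W, mode="r"))
    print("max relative difference:", np.max(np.abs(R_tsqr - R_ref)) / np.max(R_ref))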

• "Monomial" basis [Ax, …, A^kx] fails to converge
• A different polynomial basis [p_1(A)x, …, p_k(A)x] does converge
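A quick numerical illustration (my own toy demo, not from the tutorial) of why the monomial basis loses precision: the columns of [x, Ax, …, A^kx] all line up with the dominant eigenvectors, so the basis matrix becomes numerically rank-deficient as k grows:

    import numpy as np

    n = 200
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    w = np.random.default_rng(2).standard_normal(n)

    cols = [w]
    for k in range(1, 13):
        w = A @ w
        cols.append(w)
        print(f"cond of [x, Ax, ..., A^{k} x] = {np.linalg.cond(np.column_stack(cols)):.2e}")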

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels


CA-BiCGStab with Residual Replacement (RR), a la Van der Vorst and Ye

  Basis:             Naive     Monomial                                Newton         Chebyshev
  Replacement Its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit) all communication is for vectors

                            Nonzero entries:
  Indices:                  Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))         CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))         Graph Laplacian         Stencils

• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p_1(A)x, p_2(A)x, …, p_k(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just last one?
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
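A runnable sketch (my own illustration of the blocked loop nest above, with an assumed fast-memory size M) that checks the blocked code against the naive product:

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b-by-b block multiply: the three blocks fit in fast memory
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, M = 192, 3 * 64 * 64           # pretend fast memory holds M words
    b = int((M // 3) ** 0.5)          # b ~ sqrt(M/3) so blocks of A, B, C fit at once
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    print(b, np.allclose(blocked_matmul(A, B, b), A @ B))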

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        Δ = [ 1  0  1 ]   A
            [ 0  1  1 ]   B
            [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
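A small sketch (my own illustration) solving this LP with SciPy; linprog minimizes, so the objective is negated:

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A is indexed by (i, k)
                      [0, 1, 1],    # B is indexed by (k, j)
                      [1, 1, 0]])   # C is indexed by (i, j)

    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, -res.fun)          # expect x = [0.5, 0.5, 0.5], s = 1.5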

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
        Δ = [ 1  0 ]   F
            [ 1  0 ]   P(i)
            [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
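A hedged sketch (my own toy code; the pairwise "force" is an arbitrary stand-in) of the blocked direct n-body loop the theorem says is optimal: process M-sized blocks of P(i) against M-sized blocks of P(j), so each pass over fast memory does about M^2 interactions per M words moved:

    import numpy as np

    def force(pi, pj):
        # toy pairwise force: softened inverse-square along pi - pj
        d = pi - pj
        return d / (np.abs(d) ** 3 + 1e-9)

    def blocked_nbody(P, M):
        n = len(P)
        F = np.zeros(n)
        for i0 in range(0, n, M):
            Pi = P[i0:i0+M]                   # block of targets held in fast memory
            for j0 in range(0, n, M):
                Pj = P[j0:j0+M]               # stream one block of sources at a time
                F[i0:i0+M] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
        return F

    P = np.random.default_rng(3).standard_normal(512)
    F_blocked = blocked_nbody(P, M=64)
    F_naive = force(P[:, None], P[None, :]).sum(axis=1)
    print(np.allclose(F_blocked, F_naive))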

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
        Δ = [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
            [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose s_j to minimize
  s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Π_j |φ_j(S)|^(s_j)
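A tiny sketch (my own toy check, not from the tutorial) of the inequality above for the matmul projections, where s = (1/2, 1/2, 1/2): it verifies |S| ≤ Π_j |φ_j(S)|^(1/2) on random subsets S of Z^3:

    import itertools, random

    def holds(S):
        phiC = {(i, j) for (i, j, k) in S}
        phiA = {(i, k) for (i, j, k) in S}
        phiB = {(k, j) for (i, j, k) in S}
        bound = (len(phiC) * len(phiA) * len(phiB)) ** 0.5   # each s_j = 1/2
        return len(S) <= bound + 1e-9

    random.seed(0)
    cube = list(itertools.product(range(6), repeat=3))
    print(all(holds(set(random.sample(cube, 40))) for _ in range(1000)))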

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013: www.xsede.org
• Free supercomputer accounts to do homework

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Don't Communic…




Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors (better)bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply [Cannon 69] 1 1Cholesky ScaLAPACK log P log PLU [GDX10] log P log PQR [DGHL08] log P log3 PSym Eig SVD [BDD11] log P log3 PNonsym Eig [BDD11] log P log3 P

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance benchmarking, strong scaling plot: Franklin (Cray XT4), n = 94080.

Speedups: 24%-184% (over previous Strassen-based algorithms).

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm.

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
– Algorithms:
  • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
  • All-pairs-shortest-path, …
  • Both 2D (c=1) and 2.5D (c>1)
  • But only bandwidth may decrease with c>1, not latency
  • Sparse matrices
– Platforms:
  • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
– Software:
  • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
– Qbox (with LLNL, IBM): molecular dynamics
– CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
– Lower bounds on communication
– New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and the starting vector
– Many such "Krylov Subspace Methods"
• Goal: minimize communication
– Assume matrix "well-partitioned"
– Serial implementation:
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
– Parallel implementation on p processors:
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3 (the figures show the entries of x, A·x, A²·x, A³·x for indices 1, 2, 3, 4, …, 32)
• Works for any "well-partitioned" A
• Sequential algorithm: proceed in Steps 1, 2, 3, 4 across blocks of the matrix, computing the needed entries of all k vectors for one block before moving to the next, so the data is read from slow memory only once
• Parallel algorithm: Proc 1, …, Proc 4 each own a block row; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the entries

Same idea works for general sparse matrices:
– Simple block-row partitioning → (hyper)graph partitioning
– Top-to-bottom processing → Traveling Salesman Problem
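A minimal numpy sketch of the idea (illustrative, not the tutorial's optimized kernel): each block of rows is extended by k "ghost" rows on each side, so one read/communication suffices to compute that block's entries of all k vectors, at the price of some redundant work near the block edges. The blocking and helper names are assumptions for illustration; the final check only holds because A is tridiagonal (one extra row of reach per step).

import numpy as np

def matrix_powers_blocked(A, x, k, nblocks):
    # Compute [Ax, A^2 x, ..., A^k x] block by block, with k ghost rows per side.
    n = len(x)
    out = np.zeros((k, n))
    bounds = np.linspace(0, n, nblocks + 1, dtype=int)
    for b in range(nblocks):
        lo, hi = bounds[b], bounds[b + 1]
        glo, ghi = max(0, lo - k), min(n, hi + k)   # ghost region for a tridiagonal A
        v = x[glo:ghi].copy()
        Ablk = A[glo:ghi, glo:ghi]
        for j in range(k):
            v = Ablk @ v                            # redundant work near block edges
            out[j, lo:hi] = v[lo - glo: hi - glo]
    return out

n, k = 32, 3
rng = np.random.default_rng(0)
A = np.diag(rng.random(n)) + np.diag(rng.random(n - 1), 1) + np.diag(rng.random(n - 1), -1)
x = rng.random(n)
ref = np.vstack([np.linalg.matrix_power(A, j + 1) @ x for j in range(k)])
print(np.allclose(matrix_powers_blocked(A, x, k, nblocks=4), ref))   # True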

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from the power method, precision lost!

Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.

"Monomial" basis [Ax, …, Aᵏx] fails to converge.

A different polynomial basis [p₁(A)x, …, p_k(A)x] does converge.
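A small numpy experiment (illustrative only, not from the tutorial's code) showing why the monomial basis loses precision: the columns of W = [x, Ax, …, Aᵏx] become nearly linearly dependent in floating point, while a shifted, Newton-style basis [x, (A−s₁I)x, (A−s₂I)(A−s₁I)x, …] is typically much better conditioned. The test matrix, the value of k, and the evenly spaced shifts (a real implementation would use Ritz values) are all assumptions for illustration.

import numpy as np

n, k = 200, 12
rng = np.random.default_rng(1)
A = np.diag(np.linspace(1.0, 100.0, n)) + 0.01 * rng.standard_normal((n, n))
x = rng.standard_normal(n)

def krylov_basis(shifts):
    # Columns v0, (A - s1 I) v0, (A - s2 I)(A - s1 I) v0, ..., each normalized.
    V = [x / np.linalg.norm(x)]
    for s in shifts:
        w = A @ V[-1] - s * V[-1]
        V.append(w / np.linalg.norm(w))
    return np.column_stack(V)

monomial = krylov_basis([0.0] * k)                  # all shifts zero -> monomial basis
newton = krylov_basis(np.linspace(1.0, 100.0, k))   # crude spread-out shifts
print("cond(monomial basis):   ", np.linalg.cond(monomial))
print("cond(Newton-like basis):", np.linalg.cond(newton))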


Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab, with Residual Replacement (RR) à la Van der Vorst and Ye

Basis:             Naive     Monomial                                Newton        Chebyshev
Replacement its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication (rows: nonzero entries, columns: indices):

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):  Graph Laplacian             Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, p_k(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
– Many different algorithms reorganized
  • More underway, more to be done
– Need to recognize stable variants more easily
– Preconditioning
  • Hierarchically Semiseparable Matrices
– Autotuning and synthesis
  • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
– Lower bounds on communication
– New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
  for i = 1:n, for j = 1:n, for k = 1:n
    C(i,j) += A(i,k) * B(k,j)
• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2;  j = j1+j2;  k = k1+k2
      C(i,j) += A(i,k) * B(k,j)        … innermost loops do a b x b matmul
• Thm: picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
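For reference, a runnable Python/numpy version of the blocked loop above (illustrative; in the sequential model the block size b would be chosen near M^(1/2), where M is the fast-memory size in words):

import numpy as np

def blocked_matmul(A, B, b):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # b x b (smaller at the edges) block multiply-accumulate
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

n, b = 256, 32
rng = np.random.default_rng(0)
A, B = rng.random((n, n)), rng.random((n, n))
print(np.allclose(blocked_matmul(A, B, b), A @ B))   # True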

New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n:  C(i,j) += A(i,k) * B(k,j)
• Record array indices in matrix Δ:

            i  j  k
            1  0  1    A
      Δ =   0  1  1    B
            1  1  0    C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s−1)) = Ω(n^3 / M^(1/2)); attained by block sizes M^xi × M^xj × M^xk = M^(1/2) × M^(1/2) × M^(1/2)
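As a sanity check, the little LP above can be solved directly; a sketch with scipy (assuming scipy is available — linprog minimizes, so the objective is negated and x ≥ 0 bounds are added):

import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=[-1, -1, -1], A_ub=delta, b_ub=[1, 1, 1], bounds=[(0, None)] * 3)
print(res.x, -res.fun)          # -> [0.5 0.5 0.5] and s = 1.5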

New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n:  F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
            1  0    F
      Δ =   1  0    P(i)
            0  1    P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s−1)) = Ω(n^2 / M^1); attained by block sizes M^xi × M^xj = M^1 × M^1
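The attained blocking is easy to picture in code: hold a block of targets in fast memory, stream blocks of sources past it, and do all the pairwise interactions for each pair of blocks. A small numpy sketch (the 1-D positions and the inverse-square force() kernel are placeholders, not the tutorial's kernel):

import numpy as np

def force(pi, pj):
    d = pj - pi
    d2 = d * d
    return np.divide(np.sign(d), d2, out=np.zeros_like(d2), where=d2 != 0)

def blocked_nbody(P, block):
    n, F = len(P), np.zeros(len(P))
    for i0 in range(0, n, block):          # block of "targets" held in fast memory
        Pi = P[i0:i0 + block]
        for j0 in range(0, n, block):      # stream blocks of "sources"
            Pj = P[j0:j0 + block]
            F[i0:i0 + block] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
    return F

P = np.random.default_rng(0).random(1024)
print(np.allclose(blocked_nbody(P, 128), blocked_nbody(P, 1024)))   # same result for any block size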

N-Body speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
            1  0  1  0  0  1     A1
            1  1  0  1  0  0     A2
      Δ =   0  1  1  0  1  0     A3
            0  0  1  1  0  1     A3, A4
            0  1  0  0  0  1     A5
            1  0  0  1  1  0     A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
– Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s−1)) = Ω(n^6 / M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul: for i = 1:n, for j = 1:n, for k = 1:n:  C(i,j) += A(i,k) * B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3; access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k; access locations indexed by "projections", e.g.
    φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
  for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
  words_moved = Ω( #iterations / M^(s_HBL − 1) )
– Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
  |S| ≤ Πj |φj(S)|^sj
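A concrete check of this inequality for the matmul projections, where it reduces to the classical Loomis-Whitney inequality with s = [1/2, 1/2, 1/2]: take any finite set S ⊂ Z^3 and compare |S| with the product of the sizes of its three 2-D projections. The random test set below is just an illustration, not part of the tutorial.

import random

def loomis_whitney_check(S):
    # Projections phi_C(i,j,k)=(i,j), phi_A(i,j,k)=(i,k), phi_B(i,j,k)=(k,j)
    pC = {(i, j) for (i, j, k) in S}
    pA = {(i, k) for (i, j, k) in S}
    pB = {(k, j) for (i, j, k) in S}
    bound = (len(pA) * len(pB) * len(pC)) ** 0.5    # product of |phi_j(S)|^(1/2)
    return len(S), bound

random.seed(0)
S = {tuple(random.randrange(8) for _ in range(3)) for _ in range(200)}
size, bound = loomis_whitney_check(S)
print(size, "<=", round(bound, 1))    # |S| <= |phi_A(S)|^(1/2) * |phi_B(S)|^(1/2) * |phi_C(S)|^(1/2)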

Is this bound attainable? (1/2)

• But first: can we write it down?
– Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
– Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
– Thm (good news): Easy to approximate:
  • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
  • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
  HBL-LP:      minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
  Dual-HBL-LP: maximize 1ᵀx  s.t.  Δx ≤ 1
  Then for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
– Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
– Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
– Prerecorded version planned in Spring 2013
• www.xsede.org
– Free supercomputer accounts to do homework

Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 54: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Can we do even better

bull Assume nxn matrices on P processors bull Use c copies of data M = O(cn2 P) per processorbull Increasing M reduces lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 (c12 P12 ) ) messages = ( (n3 P) M32 ) = ( P12 c32 )

Algorithm Reference Factor exceeding lower bound for words_moved

Factor exceeding lower bound for messages

Matrix Multiply

[DS11SBD11] polylog P polylog P

Cholesky [SD11 in prog] polylog P c2 polylog P ndash optimal

LU [DS11SBD11] polylog P c2 polylog P ndash optimal

QR Via Cholesky QR polylog P c2 polylog P ndash optimal

Sym Eig SVD Nonsym Eig

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 55: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

b+1

b+1

Successive Band Reduction

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 56: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

[Figure sequence, Slides 56–66: Successive Band Reduction (bulge chasing). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b. Each step applies an orthogonal transform Q1, Q1^T, Q2, Q2^T, Q3, Q3^T, … to the (b+1)-wide band, clearing d diagonals of the leading c columns and then chasing the resulting (d+c)-wide bulges down the band. Only need to zero out the leading parallelogram of each trapezoid.]

Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

CA-SBR has many tuning parameters: number of "sweeps", #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge. The right choices reduce #words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads

• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads

• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads

• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Classical O(n^3) matmul: words_moved = Ω(M·(n/M^(1/2))^3 / P)
• Strassen's O(n^(lg 7)) matmul: words_moved = Ω(M·(n/M^(1/2))^(lg 7) / P)
• Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Strassen: BFS vs DFS steps

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.

Communication Avoiding Parallel Strassen (CAPS):

  if EnoughMemory and P ≥ 7
      then BFS step
      else DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter".

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24–184 (over previous Strassen-based algorithms)

For details, see SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

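To make the BFS/DFS trade-off concrete, here is a minimal sequential sketch of Strassen's 7-multiply recursion (my own NumPy illustration, not the CAPS code; the function name strassen and the cutoff parameter are made up for this sketch). A BFS step would run the 7 recursive products concurrently on P/7 processors each; a DFS step would run them one after another with all P processors on each.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 half-size multiplies instead of 8.
    Assumes square matrices with n a power of 2 (pad otherwise)."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to classical matmul
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 products: CAPS runs these concurrently (BFS step, P/7 procs each)
    # or one at a time (DFS step, all P procs on each).
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```
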
Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3 (figure shows entries 1, 2, 3, 4, …, 32 of the vectors x, A·x, A^2·x, A^3·x and their dependencies)
• Works for any "well-partitioned" A

• Sequential Algorithm: process one block of entries at a time (Steps 1, 2, 3, 4), computing all k vectors for that block before moving on, so each block is moved from slow to fast memory only once

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of entries
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem

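Below is a minimal sequential sketch of this kernel for the tridiagonal example, assuming A is stored as three diagonals (dl, d, du); it is my own illustration of the idea, not the tuned kernel from the tutorial. Each block of the vector, together with a k-wide halo of neighboring entries, is brought in once, and all of [Ax, A^2x, …, A^kx] is computed for that block, at the price of some redundant work on the halo.

```python
import numpy as np

def tridiag_matvec(dl, d, du, v):
    """y = A @ v for tridiagonal A with sub-, main-, super-diagonals dl, d, du."""
    y = d * v
    y[1:] += dl * v[:-1]
    y[:-1] += du * v[1:]
    return y

def matrix_powers_blocked(dl, d, du, x, k, block=8):
    """Return [A x, A^2 x, ..., A^k x], computed one block of entries at a time.
    Each block plus a k-wide halo on each side is touched once (O(1) passes over
    the data instead of k); the halo work is the redundant computation."""
    n = len(x)
    out = np.zeros((k, n))
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        glo, ghi = max(lo - k, 0), min(hi + k, n)   # halo shrinks by 1 per product
        v = x[glo:ghi].copy()
        for j in range(k):
            v = tridiag_matvec(dl[glo:ghi-1], d[glo:ghi], du[glo:ghi-1], v)
            out[j, lo:hi] = v[lo-glo:hi-glo]        # interior entries are exact
    return out

# sanity check against explicit powers of A (n=32, k=3 as on the slide)
n, k = 32, 3
dl = np.random.rand(n - 1); d = np.random.rand(n); du = np.random.rand(n - 1)
x = np.random.rand(n)
A = np.diag(d) + np.diag(dl, -1) + np.diag(du, 1)
ref = np.array([np.linalg.matrix_power(A, j) @ x for j in range(1, k + 1)])
assert np.allclose(matrix_powers_blocked(dl, d, du, x, k), ref)
```
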
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax − b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
  Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, …, A^kx] fails to converge
  – A different polynomial basis [p1(A)x, …, pk(A)x] does converge

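A toy sketch of the two pieces above (a hedged illustration under simplifying assumptions, not the CA-GMRES implementation being benchmarked): the monomial basis W is built in one pass, then orthogonalized with a simple two-level TSQR (QR each row block, stack the small R factors, QR again) instead of k rounds of modified Gram-Schmidt. A real TSQR uses a reduction tree matched to the architecture, and a real CA-GMRES would switch to a better-conditioned polynomial basis, as noted above.

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Tall Skinny QR via a one-level reduction:
    QR each row block, stack the small R factors, QR again."""
    blocks = np.array_split(W, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(b) for b in blocks))
    Q2, R = np.linalg.qr(np.vstack(Rs))             # combine the R factors
    offsets = np.cumsum([0] + [r.shape[0] for r in Rs])
    Q = np.vstack([Qs[i] @ Q2[offsets[i]:offsets[i+1], :] for i in range(len(Qs))])
    return Q, R

def monomial_basis(A, v, k):
    """W = [v, Av, A^2 v, ..., A^k v] -- the matrix powers kernel output.
    Ill-conditioned for large k, which is why Newton/Chebyshev bases are used."""
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])
    return np.column_stack(W)

n, k = 200, 8
A = np.random.rand(n, n) / n
v = np.random.rand(n)
W = monomial_basis(A, v, k)
Q, R = tsqr(W)
assert np.allclose(Q @ R, W) and np.allclose(Q.T @ Q, np.eye(k + 1))
```
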
Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR), à la Van der Vorst and Ye:

                      Naive     Monomial                                Newton        Chebyshev
  Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication, by whether the nonzero entries and the indices are stored explicitly or implicitly:

                            Indices
  Nonzero entries           Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))         CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))         Graph Laplacian        Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: With stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one

• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W

• Preconditioned versions

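To make the table concrete, here is a small sketch (my own illustration) of the two extreme corners: CSR, where both indices and nonzero values are stored explicitly, versus a 3-point stencil, where both are implicit and the only data moved is the vectors.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """Explicit indices and explicit nonzeros (CSR): both arrays must be read,
    so they dominate data movement along with the vectors."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y

def apply_3pt_stencil(x, c=(1.0, -2.0, 1.0)):
    """Implicit indices and implicit values (a stencil): no matrix is stored,
    so all communication is for the vectors."""
    y = c[1] * x.copy()
    y[1:] += c[0] * x[:-1]
    y[:-1] += c[2] * x[1:]
    return y

# sanity check: the 1D Laplacian stored in CSR matches its stencil form
n = 10
indptr, indices, data = [0], [], []
for i in range(n):
    for j, v in ((i - 1, 1.0), (i, -2.0), (i + 1, 1.0)):
        if 0 <= j < n:
            indices.append(j); data.append(v)
    indptr.append(len(indices))
x = np.random.rand(n)
assert np.allclose(spmv_csr(indptr, indices, data, x), apply_3pt_stencil(x))
```
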
Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … b x b matmul
• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?

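A direct transcription of the blocked loop nest above into Python/NumPy (illustrative; assumes b divides n):

```python
import numpy as np

def blocked_matmul(A, B, b):
    """C = A @ B computed one b x b tile at a time.
    With b ~ sqrt(M/3), the three b x b tiles fit in fast memory of size M,
    so words_moved = O(n^3 / sqrt(M)), matching the lower bound."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # read tiles of A and B, update one tile of C:
                # 2*b^3 flops per 3*b^2 words moved
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, 32), A @ B)
```
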
New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
          1  0  1    A
   Δ =    0  1  1    B
          1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^(xi), M^(xj), M^(xk) = M^(1/2), M^(1/2), M^(1/2)

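The LP is tiny and can be checked mechanically; here is a sketch using scipy.optimize.linprog (my own check, not part of the tutorial): maximize 1^T x subject to Δx ≤ 1, x ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta: one per array reference; columns are the loop indices i, j, k.
Delta = np.array([[1, 0, 1],   # A(i,k)
                  [0, 1, 1],   # B(k,j)
                  [1, 1, 0]])  # C(i,j)

# linprog minimizes, so negate the objective to maximize 1^T x (x >= 0 by default bounds).
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)   # expect x = [0.5, 0.5, 0.5], s = 1.5
```
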
New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
          1  0    F
   Δ =    1  0    P(i)
          0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^(xi), M^(xj) = M^1, M^1

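A sketch of the corresponding blocked n-body loop (illustrative; the pairwise force below is a made-up softened inverse-square force, and the block size b stands in for the Θ(M)-sized blocks of the theorem): each pair of particle blocks is brought into fast memory once, giving O(b^2) interactions per O(b) words moved.

```python
import numpy as np

def force(pi, pj):
    """Toy pairwise force (assumed form, for illustration only):
    softened inverse-square attraction along the line from pi to pj."""
    d = pj - pi
    r2 = np.dot(d, d) + 1e-12            # softening avoids division by zero
    return d / r2**1.5

def blocked_nbody(P, b):
    """F(i) += force(P(i), P(j)) computed block-pair by block-pair.
    With b ~ M/2 both particle blocks fit in fast memory, so
    words_moved = O(n^2 / M), matching the lower bound."""
    n = len(P)
    F = np.zeros_like(P)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            Pi = P[i1:i1+b]               # block of "targets"
            Pj = P[j1:j1+b]               # block of "sources"
            for i, pi in enumerate(Pi):
                for pj in Pj:
                    F[i1 + i] += force(pi, pj)
    return F

# sanity check against the unblocked double loop
P = np.random.rand(64, 3)
F_ref = np.zeros_like(P)
for i in range(64):
    for j in range(64):
        F_ref[i] += force(P[i], P[j])
assert np.allclose(blocked_nbody(P, 16), F_ref)
```
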
N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ (the A5 row follows from A5(i2,i6)):

          i1 i2 i3 i4 i5 i6
           1  0  1  0  0  1    A1
           1  1  0  1  0  0    A2
   Δ =     0  1  1  0  1  0    A3
           0  0  1  1  0  1    A3, A4
           0  1  0  0  0  1    A5
           1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

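The same three-line linprog check (again my own illustrative sketch) applies to this Δ; the optimal value should come out to 15/7 ≈ 2.14, matching s above.

```python
import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6), bounds=(0, None))
print(-res.fun)   # 15/7 ~ 2.142857; slide's solution x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7]
```
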
Approach to generalizing lower bounds
• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations (= #points in S), given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σ_j sj subject to the HBL-LP. Then

    #words_moved = Ω(#iterations / M^(s_HBL − 1))

  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,

    |S| ≤ Π_j |φj(S)|^(sj)

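For reference, the hypothesis and the two conclusions above can be collected into a single display (restated in LaTeX; no new content):

```latex
\text{If } \operatorname{rank}(H) \le \sum_{j=1}^{m} s_j \,\operatorname{rank}\!\big(\phi_j(H)\big)
\ \text{for all subgroups } H \le \mathbb{Z}^k, \text{ then}
\quad |S| \le \prod_{j=1}^{m} \big|\phi_j(S)\big|^{s_j}
\ \text{for all finite } S \subseteq \mathbb{Z}^k,
\quad\text{and}\quad
\#\text{words\_moved} = \Omega\!\left(\frac{\#\text{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\right).
```
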
Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile index ij by M^(xj)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 57: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 58: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1Q1

b+1

b+1

d+1

c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

12

Q1

b+1

b+1

d+1

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)


Successive Band Reduction (SBR) – figure sequence, steps 1 through 6. b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b. Each step applies orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T to successive blocks of the band (blocks of size b+1, d+1, c, d+c in the figure), chasing the created bulges down the band. Only need to zero out the leading parallelogram of each trapezoid.

Conventional vs CA - SBR

Conventional: touch all data 4 times. Communication-avoiding: touch all data once.

Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, bulges chased at one time, how far to chase each bulge. The right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul: words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul: words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

In practice, how to best interleave BFS and DFS is a "tuning parameter".
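Not from the slides: to make "all 7 multiplies" concrete, here is a minimal sequential Strassen sketch in Python/NumPy, assuming square matrices whose size stays divisible by 2 down to the cutoff. The seven products M1-M7 are what a CAPS BFS step runs in parallel on P/7 processors each, and what a DFS step runs one after another on all P processors.

import numpy as np

def strassen(A, B, cutoff=64):
    """Minimal sequential Strassen sketch (n divisible by 2 down to the cutoff)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive products: a CAPS BFS step runs these in parallel on
    # P/7 processors each (more memory); a DFS step runs them one at a
    # time on all P processors (less memory).
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)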

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

For details, see the SC12 talk on "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms: multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software: integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3 (the figure sequence shows how entries of x, A·x, A^2·x, A^3·x over indices 1, 2, 3, 4, …, 32 depend on one another)
• Works for any "well-partitioned" A
• Sequential algorithm: process the entries in a few blocked steps (Steps 1-4 in the example), so each block of A and x moves from slow to fast memory only once
• Parallel algorithm: each processor works on an (overlapping) trapezoid and communicates only once with its neighbors
• Same idea works for general sparse matrices:
  – Simple block-row partitioning becomes (hyper)graph partitioning
  – Top-to-bottom processing becomes a Traveling Salesman Problem
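An added illustration (not from the slides): a NumPy sketch of the per-block idea for a banded A, where one block of x plus a ghost region of width k is read once, after which k local SpMVs run with no further data movement. The block size 8 and the tridiagonal test matrix are arbitrary choices for the demo.

import numpy as np

def matrix_powers_block(A, x, k, lo, hi):
    """One block's part of the matrix powers kernel [Ax, ..., A^k x].

    Assumes A is "well-partitioned": here, tridiagonal, so rows lo..hi-1 of
    A^j x depend only on entries of x within distance j of [lo, hi). Fetching
    that halo once replaces k separate neighbor exchanges.
    """
    n = len(x)
    g_lo, g_hi = max(0, lo - k), min(n, hi + k)   # owned block plus ghost (halo) region
    xg = x[g_lo:g_hi].copy()                      # the one "message": halo of x
    Ag = A[g_lo:g_hi, g_lo:g_hi]                  # local part of A, read once
    V = [xg]
    for _ in range(k):                            # k local SpMVs, no further communication
        V.append(Ag @ V[-1])
    s = lo - g_lo
    # Return only the owned rows; outer rows of the later powers are the
    # redundant work traded for the saved messages.
    return [v[s:s + (hi - lo)] for v in V[1:]]

n, k = 32, 3
A = np.diag(2*np.ones(n)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
x = np.random.rand(n)
blocks = [matrix_powers_block(A, x, k, lo, lo + 8) for lo in range(0, n, 8)]
Vref = [x]
for _ in range(k):
    Vref.append(A @ Vref[-1])
assert np.allclose(np.concatenate([b[k-1] for b in blocks]), Vref[k])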

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from the power method, precision lost!
• Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
• The "monomial" basis [Ax, …, A^kx] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge.
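An added sketch of the communication-avoiding step, assuming NumPy; a single in-memory QR stands in for the distributed TSQR, and the optional shifts illustrate why a non-monomial (Newton-like) basis is used. The toy diagonal matrix and shift values are placeholders just to make it runnable.

import numpy as np

def krylov_basis(A, v, k, shifts=None):
    """Build W = [v, p1(A)v, ..., pk(A)v] with k SpMVs and no dot products.

    shifts=None gives the monomial basis [v, Av, ..., A^k v], which becomes
    ill-conditioned quickly (the "precision lost" problem). A Newton-like
    basis uses (A - shift_j * I) at step j; in practice the shifts would be
    Ritz values of A (hypothetical placeholders here).
    """
    W = [v]
    for j in range(k):
        w = A @ W[-1]
        if shifts is not None:
            w = w - shifts[j] * W[-1]
        W.append(w)
    return np.column_stack(W)

# One CA-GMRES "outer step" (sketch): k SpMVs, then one QR of the
# tall-skinny basis stands in for TSQR; H would be assembled from R and
# the basis-change coefficients before solving the least-squares problem.
n, k = 1000, 8
A = np.diag(np.linspace(1, 10, n))            # toy matrix (assumption)
b = np.random.rand(n)
W = krylov_basis(A, b / np.linalg.norm(b), k)
Q, R = np.linalg.qr(W)                        # TSQR computes this with O(log p) messages
s = np.linalg.svd(W, compute_uv=False)
print(s[0] / s[-1])                           # monomial basis: conditioning degrades fast with k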

Speedups of GMRES on 8-core Intel Clovertown [MHDY09] (performance plot)

Requires co-tuning kernels (performance plot)

CA-BiCGStab with Residual Replacement (RR), a la Van der Vorst and Ye. Replacement iterations by basis:
• Naive: 74 (1)
• Monomial: [7, 15, 24, 31, …, 92, 97, 103] (17)
• Newton: [67, 98] (2)
• Chebyshev: 68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication (indices vs. nonzero entries):
  – Indices explicit (O(nnz)), entries explicit (O(nnz)): CSR and variations
  – Indices explicit (O(nnz)), entries implicit (o(nnz)): vision, climate, AMR, …
  – Indices implicit (o(nnz)), entries explicit (O(nnz)): graph Laplacian
  – Indices implicit (o(nnz)), entries implicit (o(nnz)): stencils
  – Explicit indices or nonzero entries cause most communication, along with the vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: Hierarchically Semiseparable Matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (b x b matmul on blocks):
    for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
      for i2 = 0:b-1,  for j2 = 0:b-1,  for k2 = 0:b-1
        i = i1+i2;  j = j1+j2;  k = k1+k2
        C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
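An added, runnable NumPy version of the blocked loop nest; M is a stand-in for the fast-memory size in words (an assumption for illustration), and b is chosen so that three b x b tiles fit at once, which is still Θ(M^(1/2)).

import numpy as np

def blocked_matmul(A, B, M=4096):
    """Tiled matmul illustrating the b = Θ(M^(1/2)) choice."""
    n = A.shape[0]
    b = max(1, int(np.sqrt(M // 3)))     # three b*b tiles (A, B, C) fit in fast memory
    C = np.zeros_like(A, dtype=float)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one tile-sized unit of work: C-tile += A-tile * B-tile
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(300, 300)
B = np.random.rand(300, 300)
assert np.allclose(blocked_matmul(A, B), A @ B)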

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ (rows = array references, columns = loop indices i, j, k):

          i  j  k
    Δ = [ 1  0  1 ]   A
        [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^Tx = 3/2 = s
• Thm: words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
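An added quick check of this LP with scipy.optimize.linprog (linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

# LP from the slide: maximize 1^T x subject to Delta x <= 1, x >= 0.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # expect x = [0.5, 0.5, 0.5] and 1^T x = 1.5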

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ (columns = loop indices i, j):

          i  j
    Δ = [ 1  0 ]   F
        [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^Tx = 2 = s
• Thm: words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1). Attained by block sizes M^xi, M^xj = M^1, M^1
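An added sketch of the blocking those exponents prescribe: both loops tiled by b = M, so each pair of particle blocks is processed while resident together in fast memory. The inverse-square force is a toy kernel just to make the sketch runnable; P and M are placeholder data and size.

import numpy as np

def blocked_forces(P, M=1024):
    """Direct n-body with both loops tiled by b = M (the optimal M^1, M^1 blocking)."""
    n = P.shape[0]
    b = M                                   # block size M^1 for both i and j
    F = np.zeros_like(P)
    for i0 in range(0, n, b):
        Pi = P[i0:i0+b]
        for j0 in range(0, n, b):
            Pj = P[j0:j0+b]                 # one block of "sources" read per block pair
            d = Pi[:, None, :] - Pj[None, :, :]        # pairwise displacements
            r2 = (d * d).sum(-1) + 1e-9                # softened squared distance
            F[i0:i0+b] += (d / r2[..., None] ** 1.5).sum(axis=1)
    return F

F = blocked_forces(np.random.rand(3000, 3), M=512)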

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles: 11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ (columns = i1 … i6):

          i1 i2 i3 i4 i5 i6
    Δ = [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
        [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  0  1  1  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^Tx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^Tx = 15/7 = s
• Thm: words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
  => for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …)
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array references given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
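The same statements in display math, for readability (a restatement of the slide, not new material):

\begin{align*}
&\text{HBL-LP: for all subgroups } H \le \mathbb{Z}^k:\quad
  \operatorname{rank}(H) \;\le\; \sum_j s_j \,\operatorname{rank}(\varphi_j(H)),\\
&\text{(Christ--Tao--Carbery--Bennett)}\quad
  |S| \;\le\; \prod_j |\varphi_j(S)|^{s_j},\\
&\text{with } s_{\mathrm{HBL}} = \min \textstyle\sum_j s_j \text{ subject to the HBL-LP:}\quad
  \text{words\_moved} \;=\; \Omega\!\left(\frac{\#\text{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\right).
\end{align*}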

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φ_j = subset of indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^Ts  s.t.  s^TΔ ≥ 1^T
    Dual-HBL-LP:  maximize 1^Tx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul – s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
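A tiny added helper showing how a dual-LP solution x maps to tile sizes (the function name is made up for illustration):

def tile_sizes(x, M):
    """Tile for loop i_j is M**x[j] (rounded), per the theorem above."""
    return [max(1, int(round(M ** xj))) for xj in x]

print(tile_sizes([0.5, 0.5, 0.5], 2**20))   # matmul: three tiles of M^(1/2) = 1024
print(tile_sizes([1.0, 1.0], 2**20))        # n-body: both loops tiled by M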

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
  – Prerecorded version planned in Spring 2013
• www.cs.berkeley.edu/~demmel
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 60: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 61: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

Successive Band Reduction

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)


[Figure sequence: Successive Band Reduction (bulge chasing). The animation steps apply sweeps Q1, Q1ᵀ, Q2, Q2ᵀ, Q3, Q3ᵀ, … to a symmetric band matrix. Notation: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b. Only the leading parallelogram of each trapezoid needs to be zeroed out.]

Conventional vs CA-SBR
• Conventional: touches all data 4 times. Communication-avoiding: touches all data once.
• Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, bulges chased at one time, how far to chase each bulge.
• The right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Symmetric Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown vs MKL 10.0 (n = 12000, b = 500, 8 threads)
• Up to 12x on Intel Westmere vs MKL 10.3 (n = 12000, b = 200, 10 threads)
• Up to 25x on AMD Budapest vs ACML 4.4 (n = 9000, b = 500, 4 threads)
• Up to 30x on AMD Magny-Cours vs ACML 4.4 (n = 12000, b = 500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

  Classical O(n³) matmul:        words_moved = Ω( M·(n/M^(1/2))³ / P )
  Strassen's O(n^lg 7) matmul:   words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
  Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

BFS vs. DFS steps
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS):
  if enough memory and P ≥ 7 then BFS step, else DFS step
In practice, how best to interleave BFS and DFS is a "tuning parameter".
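To make the BFS/DFS discussion concrete, here is a minimal sequential Strassen sketch in Python (an illustration, not the CAPS implementation; the `cutoff` parameter and the power-of-two size restriction are assumptions for brevity). The seven recursive products M1–M7 are the units CAPS schedules: a BFS step would run all seven on disjoint processor groups, a DFS step would run them one after another on all processors.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Sequential Strassen for n x n matrices, n a power of 2.
    The 7 recursive products below are what a BFS step would run in
    parallel (on P/7 groups, 7/4 x memory) and a DFS step would run
    sequentially (on all P processors, 1/4 x memory)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```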

Performance benchmarking, strong-scaling plot: Franklin (Cray XT4), n = 94080
Speedups: 2.4x–18.4x over previous Strassen-based algorithms
For details, see the SC12 talk "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms: LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …; all-pairs shortest paths, …; both 2D (c = 1) and 2.5D (c > 1), but only bandwidth may decrease with c > 1, not latency; sparse matrices
  – Platforms: multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software: integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax = b or Ax = λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3 (the slides animate the entries of x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32)
• Works for any "well-partitioned" A
• Sequential algorithm: process the dependency region in blocks (Step 1, Step 2, Step 3, Step 4), computing all k levels of each block before moving to the next, so data moves from slow to fast memory only once
• Parallel algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the dependency region
• The same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
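As a concrete illustration of the kernel (a minimal sketch, not the tutorial's implementation), the following Python assumes A is tridiagonal and stored as three diagonal arrays; one block owner reads x once on a ghost zone of width k and then computes its rows of Ax, …, Aᵏx with no further data movement:

```python
import numpy as np

def tridiag_matvec_rows(dl, d, du, v, lo, hi):
    """Rows lo..hi-1 of A @ v for tridiagonal A with sub-, main and
    super-diagonals dl (length n-1), d (length n), du (length n-1)."""
    n = len(d)
    out = np.zeros(hi - lo)
    for i in range(lo, hi):
        s = d[i] * v[i]
        if i > 0:
            s += dl[i - 1] * v[i - 1]
        if i < n - 1:
            s += du[i] * v[i + 1]
        out[i - lo] = s
    return out

def matrix_powers_block(dl, d, du, x, lo, hi, k):
    """Owner of rows lo..hi-1 computes its rows of [A x, A^2 x, ..., A^k x]
    after reading x once on the ghost zone lo-k .. hi+k-1 (one extra entry
    per power, since A is tridiagonal). No further reads of x are needed."""
    n = len(d)
    glo, ghi = max(0, lo - k), min(n, hi + k)
    v = np.zeros(n)
    v[glo:ghi] = x[glo:ghi]              # the only data movement for this block
    levels = []
    for j in range(1, k + 1):
        # each power shrinks the region that can still be computed correctly by 1
        clo, chi = max(0, lo - (k - j)), min(n, hi + (k - j))
        w = np.zeros(n)
        w[clo:chi] = tridiag_matvec_rows(dl, d, du, v, clo, chi)
        levels.append(w[lo:hi].copy())   # the owned rows of A^j x
        v = w
    return levels

# Check against a dense computation for one block (rows 8..15), n = 32, k = 3.
n, k = 32, 3
rng = np.random.default_rng(0)
dl, d, du, x = rng.random(n - 1), rng.random(n), rng.random(n - 1), rng.random(n)
A = np.diag(dl, -1) + np.diag(d) + np.diag(du, 1)
levels = matrix_powers_block(dl, d, du, x, 8, 16, k)
assert np.allclose(levels[2], (A @ A @ A @ x)[8:16])
```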

Minimizing Communication of GMRES to solve Ax = b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)                  … SpMV
    MGS(w, v(0), …, v(i−1))         … update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, Aᵏv ]
  [Q, R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k.
Parallel case: #messages decreases by a factor of k.

• Oops – W from the power method, precision lost!
• The "monomial" basis [Ax, …, Aᵏx] fails to converge; a different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge.
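Since CA-GMRES leans on TSQR, here is a minimal two-level TSQR sketch (an illustration under the assumption of a one-step reduction tree, not the architecture-dependent algorithm from the tutorial): factor each row block independently, then do one small QR on the stacked R factors.

```python
import numpy as np

def tsqr_two_level(W, nblocks=4):
    """Two-level TSQR sketch for tall-skinny W: independent QRs of the row
    blocks (no communication), then one QR of the stacked R factors (one
    small reduction step). Returns R; the implicit Q is the product of the
    local Qs with the combining Q and is not formed here."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]
    return np.linalg.qr(np.vstack(Rs), mode='r')

# Usage: R from TSQR matches the R of a direct QR of W up to row signs.
W = np.random.rand(1024, 8)
R1 = tsqr_two_level(W)
R2 = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R1), np.abs(R2))
```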

Speedups of GMRES on an 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels

CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye – iterations at which a residual replacement occurs (count in parentheses), by basis:
  – Naive: 74 (1)
  – Monomial: [7, 15, 24, 31, …, 92, 97, 103] (17)
  – Newton: [67, 98] (2)
  – Chebyshev: 68 (1)

Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication (indices vs. nonzero entries, each either explicit (O(nnz) data) or implicit (o(nnz) data)):
  – Explicit indices, explicit entries: CSR and variations
  – Explicit indices, implicit entries: vision, climate, AMR, …
  – Implicit indices, explicit entries: graph Laplacians
  – Implicit indices, implicit entries: stencils
  Explicit indices or nonzero entries cause most of the communication, along with the vectors; e.g., with stencils (all implicit), all communication is for vectors.
• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions
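The monomial-versus-polynomial-basis entry in this tuning space can be illustrated with a small sketch (assumptions: the shifts playing the role of Ritz values are supplied by the caller; this is not the tutorial's code):

```python
import numpy as np

def krylov_basis(A, x, k, shifts=None):
    """Build [q_0, q_1, ..., q_k] with q_0 = x and q_{j+1} = (A - theta_j I) q_j.
    shifts=None gives the monomial basis [x, Ax, ..., A^k x]; supplying
    estimates of A's eigenvalues (e.g. Ritz values) as shifts gives a
    Newton-style basis, which is typically far better conditioned."""
    Q = np.zeros((len(x), k + 1))
    Q[:, 0] = x
    if shifts is None:
        shifts = np.zeros(k)
    for j in range(k):
        Q[:, j + 1] = A @ Q[:, j] - shifts[j] * Q[:, j]
    return Q

# Usage: compare conditioning of the two bases on a small SPD example.
n, k = 200, 10
A = np.diag(np.linspace(1.0, 100.0, n))
x = np.ones(n)
theta = np.linspace(1.0, 100.0, k)        # crude stand-ins for Ritz values
cond_monomial = np.linalg.cond(krylov_basis(A, x, k))
cond_newton = np.linalg.cond(krylov_basis(A, x, k, theta))
```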

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning: hierarchically semiseparable matrices
  – Autotuning and synthesis: pOSKI for SpMV – available at bebop.cs.berkeley.edu
  – Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
  for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)·B(k,j)
• "Blocked" code (a b × b matmul at the innermost level):
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b−1, for j2 = 0:b−1, for k2 = 0:b−1
      i = i1+i2; j = j1+j2; k = k1+k2
      C(i,j) += A(i,k)·B(k,j)
• Thm: picking b = M^(1/2) attains the lower bound words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
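A directly runnable version of the blocked loop nest above (illustrative Python only; a real implementation would of course call tuned BLAS) might look like:

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matmul: each b x b block of C is updated from
    b x b blocks of A and B, so roughly 3*b^2 words are touched per
    2*b^3 flops. Choosing b ~ sqrt(M/3) for fast-memory size M attains
    the Omega(n^3 / M^(1/2)) lower bound on words moved."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, 32), A @ B)
```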

New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ (rows = array references, columns = loop indices):

        i  j  k
  Δ = [ 1  0  1 ]  A
      [ 0  1  1 ]  B
      [ 1  1  0 ]  C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s_HBL
• Thm: words_moved = Ω(n³/M^(s_HBL−1)) = Ω(n³/M^(1/2)), attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
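The small LP can be checked mechanically; a sketch using scipy.optimize.linprog (not part of the tutorial) recovers x = [1/2, 1/2, 1/2] and s_HBL = 3/2:

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) each array reference uses.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  s.t.  Delta x <= 1, x >= 0  (linprog minimizes, so negate c)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, res.x.sum())       # -> [0.5 0.5 0.5], s_HBL = 1.5
```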

New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

        i  j
  Δ = [ 1  0 ]  F
      [ 1  0 ]  P(i)
      [ 0  1 ]  P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s_HBL
• Thm: words_moved = Ω(n²/M^(s_HBL−1)) = Ω(n²/M¹), attained by block sizes M^xi, M^xj = M¹, M¹
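A sketch of the blocking that the M¹ × M¹ result suggests (illustrative only: 1-D particles and a hypothetical 1/r pairwise force, not the tutorial's kernel):

```python
import numpy as np

def blocked_forces(P, b):
    """Direct n-body with M x M blocking: load a block of b targets and a
    block of b sources, do all b^2 interactions, then move on. Reusing
    each loaded particle ~b times matches the Omega(n^2 / M) bound."""
    n = len(P)
    F = np.zeros(n)
    for i1 in range(0, n, b):
        Pi = P[i1:i1 + b]
        for j1 in range(0, n, b):
            Pj = P[j1:j1 + b]
            d = Pi[:, None] - Pj[None, :]      # all pairwise separations
            if i1 == j1:
                np.fill_diagonal(d, np.inf)    # no self-interaction
            F[i1:i1 + b] += (1.0 / d).sum(axis=1)
    return F

F = blocked_forces(np.linspace(0.0, 1.0, 64), b=16)
```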

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles: 11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ (columns i1 … i6; note that A3 appears with two different subscript patterns, and the A5(i2,i6) row follows from the code):

        i1 i2 i3 i4 i5 i6
  Δ = [ 1  0  1  0  0  1 ]  A1
      [ 1  1  0  1  0  0 ]  A2
      [ 0  1  1  0  1  0 ]  A3
      [ 0  0  1  1  0  1 ]  A3, A4
      [ 0  1  0  0  0  1 ]  A5
      [ 1  0  0  1  1  0 ]  A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s_HBL
• Thm: words_moved = Ω(n⁶/M^(s_HBL−1)) = Ω(n⁶/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
  for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)·B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³; access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2·i3−i7) = func(A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1, i2, …, ik) in S = subset of Zᵏ; access locations indexed by "projections", e.g.
    φ_C(i1, i2, …, ik) = (i1+2·i3−i7), φ_A(i1, i2, …, ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound
• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
  for all subgroups H < Zᵏ: rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array references given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
  words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
  |S| ≤ Πj |φj(S)|^sj
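As a sanity check (not on the slides), specializing the inequality to matmul with s_A = s_B = s_C = 1/2 gives exactly the Loomis–Whitney bound used in the classical lower-bound proof:

```latex
% Matmul: S \subset \mathbb{Z}^3,\quad \phi_A(i,j,k)=(i,k),\ \phi_B(i,j,k)=(k,j),\ \phi_C(i,j,k)=(i,j).
% With s_A = s_B = s_C = \tfrac12 (which satisfy the HBL-LP), the theorem gives
|S| \;\le\; |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2} \;\le\; M^{3/2}
% whenever at most M entries of each of A, B, C are touched; splitting the n^3
% iterations into such segments yields words\_moved = \Omega\!\left(n^3/M^{1/2}\right).
```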

Is this bound attainable? (1/2)
• But first: can we write the LP down?
  – Thm (bad news): writing down all the constraints reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., when every φj selects a subset of the indices)
  – Thm (good news): easy to approximate:
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when every φj selects a subset of the indices, the dual of the HBL-LP gives the optimal tile sizes:
  – HBL-LP: minimize 1ᵀs s.t. sᵀΔ ≥ 1ᵀ
  – Dual-HBL-LP: maximize 1ᵀx s.t. Δx ≤ 1
  – Then, for the sequential algorithm, tile index ij by M^xj
• Ex: Matmul – s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of the indices

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013: www.xsede.org
• Free supercomputer accounts to do the homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 63: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

Successive Band Reduction

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 64: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)


1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA-SBR

  Conventional: touch all data 4 times.
  Communication-avoiding: touch all data once.

• Many tuning parameters: number of "sweeps", diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge.
• Right choices reduce #words_moved by a factor of M/bw, not just M^{1/2}.

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Communication Lower Bounds for Strassen-like matmul algorithms

• Classical O(n³) matmul: words_moved = Ω( M·(n/M^{1/2})³ / P )
• Strassen's O(n^{lg 7}) matmul: words_moved = Ω( M·(n/M^{1/2})^{lg 7} / P )
• Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^{1/2})^ω / P )
• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n² / P^{2/ω}
• Best Paper Prize (SPAA'11): Ballard, D., Holtz, Schwartz; to appear in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

• BFS step vs DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.
• In practice, how to best interleave BFS and DFS is a "tuning parameter".
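A minimal sketch of that interleaving decision, assuming a rough 3n²/P working-set estimate per processor; caps_schedule, its parameters, and the memory model are illustrative assumptions, not the authors' implementation:

    def caps_schedule(n, P, mem_per_proc, levels):
        """Schematic CAPS rule: choose a BFS or DFS step at each Strassen level."""
        schedule = []
        words_needed = 3.0 * n * n / P            # A, B, C blocks spread over P procs
        for _ in range(levels):
            enough_memory = 7.0 / 4.0 * words_needed <= mem_per_proc
            if enough_memory and P >= 7:
                schedule.append("BFS")
                P //= 7                           # each subproblem gets P/7 processors
                words_needed *= 7.0 / 4.0         # BFS grows the per-processor footprint
            else:
                schedule.append("DFS")
                words_needed /= 4.0               # quarter-size subproblem on all P procs
            n //= 2                               # Strassen halves the matrix dimension
        return schedule

    # Example: caps_schedule(n=94080, P=7**3, mem_per_proc=2**30, levels=5)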

Performance benchmarking, strong scaling plot: Franklin (Cray XT4), n = 94080.
Speedups: 24% – 184% (over previous Strassen-based algorithms).
For details, see the SC12 talk "Communication-Avoiding Parallel Strassen" by Grey Ballard et al., 11/15 at 4pm.

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q):
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
  [Figure: rows of entries 1, 2, 3, 4, …, 32 for x, A·x, A²·x, A³·x; for tridiagonal A, entry i of Aʲ⁺¹·x depends on entries i−1, i, i+1 of Aʲ·x]
• Works for any "well-partitioned" A
• Sequential algorithm (Steps 1–4): process one block of entries at a time, computing its part of all k vectors before moving on, so each block of A and x is read once
• Parallel algorithm (Proc 1 – Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the dependency region
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
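A minimal NumPy sketch of the sequential kernel for the tridiagonal example above: each block of the output vectors is computed from a ghost-extended slice of x, trading a little redundant computation for a single pass over the data. The function names, the blocking, and the k ghost entries per side are illustrative assumptions, not a tuned implementation:

    import numpy as np

    def tridiag_matvec(sub, diag, sup, v):
        # y[i] = sub[i]*v[i-1] + diag[i]*v[i] + sup[i]*v[i+1] (entries outside the range count as 0)
        y = diag * v
        y[1:] += sub[1:] * v[:-1]
        y[:-1] += sup[:-1] * v[1:]
        return y

    def matrix_powers_blocked(sub, diag, sup, x, k, block):
        """Compute [A·x, A²·x, …, Aᵏ·x] one block of entries at a time."""
        n = len(x)
        out = np.zeros((k, n))
        for lo in range(0, n, block):
            hi = min(lo + block, n)
            glo, ghi = max(lo - k, 0), min(hi + k, n)      # ghost-extended range
            v = x[glo:ghi].copy()
            for j in range(k):
                v = tridiag_matvec(sub[glo:ghi], diag[glo:ghi], sup[glo:ghi], v)
                # the block's own entries are still exact after j+1 local products
                out[j, lo:hi] = v[lo - glo:hi - glo]
        return out

    # Check against k explicit SpMVs (n=32, k=3 as in the example):
    # rng = np.random.default_rng(0)
    # sub, diag, sup, x = (rng.standard_normal(32) for _ in range(4))
    # v, ref = x, []
    # for _ in range(3):
    #     v = tridiag_matvec(sub, diag, sup, v); ref.append(v)
    # assert np.allclose(matrix_powers_blocked(sub, diag, sup, x, 3, 8), ref)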

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂
• Standard GMRES:
    for i = 1 to k
       w = A · v(i-1)                … SpMV
       MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt
       update v(i), H
    endfor
    solve LSQ problem with H
• Communication-avoiding GMRES:
    W = [ v, A·v, A²·v, …, Aᵏ·v ]
    [Q,R] = TSQR(W)                  … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H
• Sequential case: #words_moved decreases by a factor of k
  Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, …, Aᵏx] fails to converge
  – A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
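A minimal NumPy sketch of one restart cycle assembled from the two kernels above, using the naive monomial basis exactly as in the "Oops" bullet (np.linalg.qr stands in for TSQR; ca_gmres_cycle and its interface are illustrative assumptions, not Hoemmen's implementation):

    import numpy as np

    def ca_gmres_cycle(A, b, x0, s):
        """One cycle: one matrix-powers sweep, one QR, one small dense solve."""
        r0 = b - A @ x0
        # Matrix powers kernel: W = [r0, A·r0, …, A^s·r0] (one communication phase)
        W = np.empty((len(b), s + 1))
        W[:, 0] = r0
        for j in range(s):
            W[:, j + 1] = A @ W[:, j]
        Q, R = np.linalg.qr(W)                     # TSQR would do this with one reduction
        # Since A·W[:, :s] = W[:, 1:], minimizing ||b - A(x0 + W[:, :s] y)||_2
        # reduces to the small least-squares problem below.
        y, *_ = np.linalg.lstsq(R[:, 1:], Q.T @ r0, rcond=None)
        return x0 + W[:, :s] @ y

    # Example: A = np.diag(np.arange(1.0, 101.0)); b = np.ones(100)
    # x = ca_gmres_cycle(A, b, np.zeros(100), s=8)

In practice the monomial basis is replaced by a better-conditioned Newton or Chebyshev basis, as the convergence results above and below indicate.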

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires co-tuning kernels

CA-BiCGStab

With Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis       Replacement iterations (count)
  Naive       74 (1)
  Monomial    [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton      [67, 98] (2)
  Chebyshev   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication (indices × nonzero entries):

                          Nonzero entries:
  Indices:                Explicit (O(nnz))        Implicit (o(nnz))
  Explicit (O(nnz))       CSR and variations       Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian          Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall: optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
       C(i,j) += A(i,k)*B(k,j)
• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
       for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
          i = i1+i2, j = j1+j2, k = k1+k2
          C(i,j) += A(i,k)*B(k,j)          … b x b matmul
• Thm: picking b = M^{1/2} attains the lower bound: words_moved = Ω(n³/M^{1/2})
• Where does 1/2 come from?
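A runnable sketch of the blocked code above (Python/NumPy, 0-based indexing; blocked_matmul and the block-slice updates are illustrative, with b to be chosen near M^{1/2} for a fast memory of M words):

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b (or smaller, at the edges) block update
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    # Example: A = np.random.rand(6, 6); B = np.random.rand(6, 6)
    # assert np.allclose(blocked_matmul(A, B, 2), A @ B)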

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
      Δ =  1  0  1      A(i,k)
           0  1  1      B(k,j)
           1  1  0      C(i,j)

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: words_moved = Ω(n³/M^{s−1}) = Ω(n³/M^{1/2})
  Attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2}
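A quick numerical check of this LP, as a sketch (scipy.optimize.linprog is used purely for illustration, it is not part of the tutorial's software; linprog minimizes, so we maximize 1ᵀx by minimizing −1ᵀx):

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],     # A(i,k)
                      [0, 1, 1],     # B(k,j)
                      [1, 1, 0]])    # C(i,j)
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, -res.fun)           # expected: [0.5 0.5 0.5] and s = 1.5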

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
      Δ =  1  0      F(i)
           1  0      P(i)
           0  1      P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: words_moved = Ω(n²/M^{s−1}) = Ω(n²/M¹)
  Attained by block sizes M^{xi}, M^{xj} = M¹, M¹
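A schematic blocked version of the loop nest above, tiling both loops by M as the LP suggests (blocked_nbody, the toy force function, and treating M directly as the tile size are illustrative assumptions):

    import numpy as np

    def blocked_nbody(P, force, M):
        n = len(P)
        F = np.zeros_like(P)
        for i0 in range(0, n, M):
            for j0 in range(0, n, M):
                # one tile: O(M) particles read, O(M^2) interactions computed,
                # so words_moved ~ n^2 / M over the whole loop nest
                for i in range(i0, min(i0 + M, n)):
                    for j in range(j0, min(j0 + M, n)):
                        F[i] += force(P[i], P[j])
        return F

    # Example with 1-D "particles" and a toy force:
    # P = np.random.rand(100)
    # F = blocked_nbody(P, force=lambda p, q: p - q, M=16)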

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles: 11.8x speedup
(K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik)

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
     A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

           i1 i2 i3 i4 i5 i6
      Δ =  1  0  1  0  0  1      A1
           1  1  0  1  0  0      A2
           0  1  1  0  1  0      A3
           0  0  1  1  0  1      A3, A4
           0  1  0  0  0  1      A5
           1  0  0  1  1  0      A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: words_moved = Ω(n⁶/M^{s−1}) = Ω(n⁶/M^{8/7})
  Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}
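The same LP check works for this six-loop example (again a sketch with scipy.optimize.linprog; the row for A5 follows its subscripts (i2, i6)):

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1, 0, 0, 1],    # A1(i1,i3,i6)
                      [1, 1, 0, 1, 0, 0],    # A2(i1,i2,i4)
                      [0, 1, 1, 0, 1, 0],    # A3(i2,i3,i5)
                      [0, 0, 1, 1, 0, 1],    # A3, A4 (i3,i4,i6)
                      [0, 1, 0, 0, 0, 1],    # A5(i2,i6)
                      [1, 0, 0, 1, 1, 0]])   # A6(i1,i4,i5)
    res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6), bounds=(0, None))
    print(res.x * 7, -res.fun)               # expected: [2 3 1 2 3 4] (i.e. x·7) and s = 15/7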

Approach to generalizing lower bounds

• Matmul: for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³
  Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
       C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
       D(something else) = func(something else)
       …
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ
  Access locations indexed by "projections", e.g.
     φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
     φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
     for all subgroups H < Zᵏ:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by the φj, choose sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
     words_moved = Ω( #iterations / M^{s_HBL − 1} )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
     |S| ≤ Πj |φj(S)|^{sj}
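The same statement in display form (LaTeX), with the notation used above:

    \[
      \mathrm{rank}(H) \;\le\; \sum_j s_j\,\mathrm{rank}\bigl(\phi_j(H)\bigr)
      \ \ \text{for all subgroups } H < \mathbb{Z}^k
      \quad\Longrightarrow\quad
      |S| \;\le\; \prod_j |\phi_j(S)|^{s_j},
    \]
    \[
      \#\mathrm{words\_moved} \;=\; \Omega\!\left(\frac{\#\mathrm{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\right),
      \qquad s_{\mathrm{HBL}} \;=\; \min \textstyle\sum_j s_j \ \text{subject to the HBL-LP.}
    \]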

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): easy to approximate
• If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
     HBL-LP:       minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
     Dual-HBL-LP:  maximize 1ᵀx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^{x_j}
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013: www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Don't Communic…

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 66: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

Successive Band Reduction

Only need to zero out leadingparallelogram of each trapezoid

2

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 67: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Conventional vs CA - SBR

Conventional Communication-AvoidingMany tuning parameters Number of ldquosweepsrdquo diagonals cleared per sweep sizes of parallelograms bulges chased at one time how far to chase each bulgeRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 68: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"

• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal

• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: entries of x, A·x, A^2·x, A^3·x plotted against column indices 1, 2, 3, 4, …, 32, showing which entries of x each computed value depends on]

• Sequential Algorithm: process the blocks one at a time (Steps 1, 2, 3, 4), so each block of A and x is read from slow memory only once
• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4):
  – Each processor communicates once with its neighbors
  – Each processor then works on an (overlapping) trapezoid of the dependency region

Same idea works for general sparse matrices:
  – Simple block-row partitioning becomes (hyper)graph partitioning
  – Top-to-bottom processing becomes a Traveling Salesman Problem
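A small, self-contained sketch of the idea for the tridiagonal example (an illustration under the stated assumptions, not the tuned kernel from the tutorial): each block-row copies k "ghost" rows from its neighbors once, then computes its pieces of Ax, A^2x, A^3x with no further communication, at the price of some redundant flops near the block edges.

    import numpy as np

    # Sketch: matrix powers kernel for a block-row partition of a tridiagonal A.
    # A block owning rows [lo, hi) also copies k ghost rows on each side, since
    # entry i of A^j x depends only on entries i-j .. i+j of x (tridiagonal case).
    def local_matrix_powers(A, x, lo, hi, k):
        n = len(x)
        glo, ghi = max(0, lo - k), min(n, hi + k)       # ghost-extended row range
        xg = x[glo:ghi].copy()                          # the only data fetched from neighbors
        Ag = A[glo:ghi, glo:ghi]
        out = []
        for _ in range(k):
            xg = Ag @ xg                                # redundant flops near the block edges
            out.append(xg[lo - glo : hi - glo].copy())  # keep only the owned rows
        return out                                      # [(A x)|block, (A^2 x)|block, ...]

    n, k = 32, 3
    A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
    x = np.random.rand(n)
    blocks = [(0, 8), (8, 16), (16, 24), (24, 32)]      # "Proc 1" .. "Proc 4"
    Akx = [np.concatenate([local_matrix_powers(A, x, lo, hi, k)[j] for lo, hi in blocks])
           for j in range(k)]
    assert np.allclose(Akx[k - 1], np.linalg.matrix_power(A, k) @ x)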

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES

  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: # words moved decreases by a factor of k. Parallel case: # messages decreases by a factor of k.

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge
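The precision loss can be seen numerically. The following small experiment (an illustration with an assumed diagonal test matrix, not one of the tutorial's test problems) compares the conditioning of the monomial basis with a Newton-style basis whose shifts are spread over the spectrum; the monomial basis becomes nearly rank-deficient as k grows.

    import numpy as np

    np.random.seed(0)
    n, k = 200, 12
    A = np.diag(np.linspace(1.0, 2.0, n))       # assumed test spectrum in [1, 2]
    x = np.random.rand(n)

    # Monomial basis: w_{j+1} = A w_j  (this is just the power method applied k times)
    W = np.empty((n, k + 1)); W[:, 0] = x
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    print("cond(monomial basis)    =", np.linalg.cond(W))

    # Newton-style basis: v_{j+1} = (A - theta_j I) v_j, shifts spread over the spectrum
    thetas = np.linspace(1.0, 2.0, k)
    V = np.empty((n, k + 1)); V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j] - thetas[j] * V[:, j]
    print("cond(Newton-like basis) =", np.linalg.cond(V))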


Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab

Replacement iterations, with Residual Replacement (RR) à la Van der Vorst and Ye:
  – Naive:     74 (1)
  – Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
  – Newton:    [67, 98] (2)
  – Chebyshev: 68 (1)

Tuning space for Krylov Methods

                                          Indices:
                                   Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero entries:
    Explicit (O(nnz))              CSR and variations     Vision, climate, AMR, …
    Implicit (o(nnz))              Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  • Return all vectors or just the last one

• Co-tuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T·W or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New: lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … innermost loops do a b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does the 1/2 come from?
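A runnable version of the blocked loop nest, as a sketch (M_words is an assumed fast-memory size; a real implementation would of course call an optimized BLAS):

    import numpy as np
    from math import isqrt

    # Blocked matmul: b ~ sqrt(M/3), so one b-by-b block of each of A, B, C fits in fast memory.
    def blocked_matmul(A, B, M_words=3 * 64 * 64):
        n = A.shape[0]
        b = max(1, isqrt(M_words // 3))
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b (or smaller, at the edges) block update
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    A = np.random.rand(200, 200); B = np.random.rand(200, 200)
    assert np.allclose(blocked_matmul(A, B), A @ B)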

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = s

• Thm: words_moved = Ω(n^3/M^(s-1)) = Ω(n^3/M^(1/2)); attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)

          i  j  k
        [ 1  0  1 ]   A
Δ =     [ 0  1  1 ]   B
        [ 1  1  0 ]   C
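The LP is small enough to check directly. A sketch using scipy (nothing here beyond the Δ on the slide; linprog minimizes, so the objective is negated):

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],     # A(i,k) touches indices i, k
                      [0, 1, 1],     # B(k,j) touches indices j, k
                      [1, 1, 0]])    # C(i,j) touches indices i, j
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
    print(res.x, -res.fun)           # expect x = [0.5, 0.5, 0.5] and 1^T x = 1.5

By the theorem above, the same x also gives the optimal tile size M^(1/2) in each of i, j, k.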

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ

• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = s

• Thm: words_moved = Ω(n^2/M^(s-1)) = Ω(n^2/M^1); attained by block sizes M^xi, M^xj = M^1, M^1

          i  j
        [ 1  0 ]   F
Δ =     [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)
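As with matmul, the optimal blocking can be written directly from the loop nest. A pure-Python sketch (force stands for whatever pairwise interaction the application defines, and b plays the role of M):

    # Cache-blocked direct n-body, tiling both i and j by a block size b ~ M,
    # as the Ω(n^2/M) bound suggests. Mirrors the two-loop nest on the slide.
    def forces_blocked(P, force, b):
        n = len(P)
        F = [0.0] * n
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                # one b x b tile of interactions: sources P[j0:j0+b] reused for b targets
                for i in range(i0, min(i0 + b, n)):
                    F[i] += sum(force(P[i], P[j]) for j in range(j0, min(j0 + b, n)))
        return F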

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ

• Solve LP for x = [x1,…,x6]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = s

• Thm: words_moved = Ω(n^6/M^(s-1)) = Ω(n^6/M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

          i1 i2 i3 i4 i5 i6
        [  1  0  1  0  0  1 ]   A1
        [  1  1  0  1  0  0 ]   A2
Δ =     [  0  1  1  0  1  0 ]   A3
        [  0  0  1  1  0  1 ]   A3, A4
        [  0  1  0  0  0  1 ]   A5
        [  1  0  0  1  1  0 ]   A6

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3,
     access locations indexed by (i,j), (i,k), (k,j)

• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φC(i1,i2,…,ik) = (i1+2*i3-i7)
       φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …

• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1,…,sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj·rank(φj(H))

• Thm: Given a program with array refs given by the φj, choose the sj to minimize s_HBL = Σj sj subject to the HBL-LP. Then
    words_moved = Ω( #iterations / M^(s_HBL - 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:

• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|

• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm,
    |S| ≤ Πj |φj(S)|^sj
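For reference, the statement can be written out in standard notation (this is only a restatement of the two bullets above, with the matmul specialization, which is the classical Loomis-Whitney inequality):

    % Discrete HBL inequality (Christ/Tao/Carbery/Bennett), as used above.
    \[
      |S| \;\le\; \prod_{j=1}^{m} \bigl|\varphi_j(S)\bigr|^{s_j}
      \quad\text{whenever}\quad
      \operatorname{rank}(H) \;\le\; \sum_{j=1}^{m} s_j \,\operatorname{rank}\bigl(\varphi_j(H)\bigr)
      \ \text{for all subgroups } H < \mathbb{Z}^k .
    \]
    % Matmul special case (Loomis-Whitney): with
    %   \varphi_A(i,j,k) = (i,k), \varphi_B(i,j,k) = (k,j), \varphi_C(i,j,k) = (i,j),
    %   s = (1/2, 1/2, 1/2):
    \[
      |S| \;\le\; |\varphi_A(S)|^{1/2}\, |\varphi_B(S)|^{1/2}\, |\varphi_C(S)|^{1/2}.
    \]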

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g. all φj = subset of indices)
  – Thm (good news): Easy to approximate

• If you miss a constraint, the lower bound may be too large (i.e. s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less

• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1^T·s  s.t.  s^T·Δ ≥ 1^T
    Dual-HBL-LP:  maximize 1^T·x  s.t.  Δ·x ≤ 1
  Then for the sequential algorithm, tile index i_j by M^(x_j)

• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts

• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
• www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 69: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz to appear in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 70: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 71: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

71

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

For details see SC12 talk onldquoCommunication-Avoiding Parallel Strassenrdquo by Grey Ballard et al 1115 at 4pm

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and

practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration into applications (on IBM BGQ)ndash Qbox (with LLNL IBM) molecular dynamicsndash CTF (with ANL) symmetric tensor contractions

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation

74

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmul
• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

        i  j  k
      [ 1  0  1 ]   A
  Δ = [ 0  1  1 ]   B
      [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = s
• Thm: #words_moved = Ω(n³/M^{s−1}) = Ω(n³/M^{1/2}), attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2} (checked numerically in the sketch below)
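The LP is small enough to check directly; a quick sketch (using SciPy, not part of the tutorial) that recovers x = [1/2, 1/2, 1/2] and s = 3/2 from the Δ above (linprog minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # x ≈ [0.5, 0.5, 0.5], 1ᵀx = s = 1.5
```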

New Thm applied to Direct N-Body
• for i = 1:n, for j = 1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

        i  j
      [ 1  0 ]   F
  Δ = [ 1  0 ]   P(i)
      [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = s
• Thm: #words_moved = Ω(n²/M^{s−1}) = Ω(n²/M), attained by block sizes M^{xi}, M^{xj} = M, M (a toy blocked loop follows below)
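The resulting blocking is the familiar tiling of the force loop over blocks of particles; a toy sketch (illustrative only, with a made-up 1-D "force" P(j) − P(i) so the result is easy to check):

```python
import numpy as np

def blocked_forces(P, b):
    """F(i) = sum_j (P(j) - P(i)), computed one (i-block, j-block) pair at a time."""
    F = np.zeros_like(P)
    for i0 in range(0, len(P), b):                # i-block of particles held in fast memory
        Pi = P[i0:i0 + b]
        for j0 in range(0, len(P), b):            # stream the j-blocks against it
            Pj = P[j0:j0 + b]
            F[i0:i0 + b] += (Pj[None, :] - Pi[:, None]).sum(axis=1)
    return F

P = np.linspace(0.0, 1.0, 64)
assert np.allclose(blocked_forces(P, b=8), P.sum() - len(P) * P)
```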

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

[Speedup plot: 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
      [ 1  0  1  0  0  1 ]   A1
      [ 1  1  0  1  0  0 ]   A2
  Δ = [ 0  1  1  0  1  0 ]   A3
      [ 0  0  1  1  0  1 ]   A3, A4
      [ 0  1  0  0  0  1 ]   A5
      [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = s
• Thm: #words_moved = Ω(n⁶/M^{s−1}) = Ω(n⁶/M^{8/7}), attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7} (the LP is checked in the sketch below)
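The same SciPy check as for matmul, now with the six-row Δ above (an illustrative sketch); the optimum should come out as s = 15/7 ≈ 2.14:

```python
import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
res = linprog(c=-np.ones(6), A_ub=delta, b_ub=np.ones(6), bounds=(0, None))
print(res.x, -res.fun)                  # ≈ [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 15/7
```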

Approach to generalizing lower bounds
• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³; access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ; access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Zᵏ:  rank(H) ≤ Σⱼ sⱼ · rank(φⱼ(H))
• Thm: Given a program with array refs given by the φⱼ, choose the sⱼ to minimize s_HBL = Σⱼ sⱼ subject to the HBL-LP. Then
    #words_moved = Ω(#iterations / M^{s_HBL − 1})
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S a subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πⱼ |φⱼ(S)|^{sⱼ}
  (a tiny numeric instance follows below)
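For matmul's three projections this specializes to the discrete Loomis–Whitney inequality; a tiny numeric instance (illustrative only) with s = (1/2, 1/2, 1/2):

```python
import random

def project(points, dims):
    return {tuple(p[d] for d in dims) for p in points}

random.seed(0)
S = {tuple(random.randrange(6) for _ in range(3)) for _ in range(100)}   # random S ⊂ Z³
bound = (len(project(S, (0, 1)))      # |φ_C(S)|, projection onto (i,j)
         * len(project(S, (0, 2)))    # |φ_A(S)|, projection onto (i,k)
         * len(project(S, (1, 2)))    # |φ_B(S)|, projection onto (j,k)
         ) ** 0.5
assert len(S) <= bound                # |S| ≤ Π_j |φ_j(S)|^(1/2)
```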

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φⱼ = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φⱼ = subset of indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:       minimize 1ᵀs  s.t.  sᵀΔ ≥ 1ᵀ
    Dual-HBL-LP:  maximize 1ᵀx  s.t.  Δx ≤ 1
  Then, for a sequential algorithm, tile index iⱼ by M^{xⱼ}
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)


Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration into applications (on IBM BG/Q)
  – Qbox (with LLNL, IBM): molecular dynamics
  – CTF (with ANL): symmetric tensor contractions

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure, built up over a sequence of animation slides: the vectors x, A·x, A²·x, A³·x over entries 1 … 32]

• Sequential Algorithm: the figure is filled in over Steps 1, 2, 3, 4, one block of the vectors at a time
• Parallel Algorithm (Proc 1 – Proc 4): each processor communicates once with neighbors, then works on an (overlapping) trapezoid





1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – each processor works on an (overlapping) trapezoid of the computation
  – each processor communicates once with neighbors

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
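
To make the kernel concrete, below is a minimal sketch of the sequential matrix powers idea for the tridiagonal example, in Python/NumPy: each block of vector entries is computed with k extra "ghost" entries on each side, so every part of A and x is brought into fast memory essentially once. The function name, the three-array diagonal storage, and the block size are illustrative assumptions, not the tutorial's tuned implementation.

    import numpy as np

    def matrix_powers_tridiag(diags, x, k, block=8):
        # Compute [A·x, A^2·x, ..., A^k·x] for tridiagonal A, one block of
        # vector entries at a time.  Each block carries k "ghost" entries on
        # each side, so each part of A and x is read essentially once
        # (up to the k-wide ghost overlap).
        lo, d, up = diags              # sub-, main-, super-diagonal (lo[0], up[-1] unused)
        n = len(x)
        V = np.zeros((k + 1, n))
        V[0] = x
        for start in range(0, n, block):
            end = min(start + block, n)
            g0, g1 = max(0, start - k), min(n, end + k)   # block plus ghost region
            W = np.zeros((k + 1, g1 - g0))
            W[0] = x[g0:g1]
            for j in range(1, k + 1):                     # j-th power: local tridiagonal SpMV
                for i in range(g0, g1):
                    s = d[i] * W[j - 1, i - g0]
                    if i > g0:
                        s += lo[i] * W[j - 1, i - 1 - g0]
                    if i < g1 - 1:
                        s += up[i] * W[j - 1, i + 1 - g0]
                    W[j, i - g0] = s
            # Entries within distance j of a ghost boundary are wrong, but the
            # k-wide ghost region guarantees entries start..end-1 are correct.
            V[1:, start:end] = W[1:, start - g0:end - g0]
        return V[1:]

    # Check against three ordinary SpMV sweeps on the slide's example (n = 32, k = 3).
    n, k = 32, 3
    lo, d, up = (np.random.rand(n) for _ in range(3))
    A = np.diag(d) + np.diag(lo[1:], -1) + np.diag(up[:-1], 1)
    x = np.random.rand(n)
    V = matrix_powers_tridiag((lo, d, up), x, k)
    assert np.allclose(V[2], A @ (A @ (A @ x)))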

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ‖Ax − b‖₂

• Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)               … SpMV
      MGS(w, v(0), …, v(i-1))      … Modified Gram–Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H

• Communication-avoiding GMRES:
    W = [ v, A·v, A²·v, …, Aᵏ·v ]
    [Q,R] = TSQR(W)                … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost
• Sequential case: # words moved decreases by a factor of k
  Parallel case: # messages decreases by a factor of k
• "Monomial" basis [Ax, …, Aᵏx] fails to converge
• Different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge
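
The precision problem and its fix can be seen in a small experiment. The sketch below is an illustration under stated assumptions, not Hoemmen's CA-GMRES: it builds the s-step basis W with the monomial recurrence and with a Newton-like shifted recurrence, then one QR of W (standing in for TSQR) replaces k steps of MGS. The test matrix, the shift choice, and the conditioning comparison are all illustrative.

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        # W = [v, p1(A)v, ..., pk(A)v].  shifts=None gives the monomial basis
        # [v, Av, ..., A^k v]; Newton-like shifts give a better-conditioned W.
        n = len(v)
        W = np.zeros((n, k + 1))
        W[:, 0] = v
        for j in range(1, k + 1):
            w = A @ W[:, j - 1]
            if shifts is not None:
                w -= shifts[j - 1] * W[:, j - 1]   # p_j(A)v = (A - theta_j I) p_{j-1}(A)v
            W[:, j] = w
        return W

    rng = np.random.default_rng(0)
    n, k = 200, 12
    A = np.diag(np.linspace(1.0, 100.0, n)) + 0.01 * rng.standard_normal((n, n))
    v = rng.standard_normal(n)

    W_mono = krylov_basis(A, v, k)                    # "monomial" basis
    theta = np.linspace(1.0, 100.0, k)                # crude spectral estimates for the shifts
    W_newt = krylov_basis(A, v, k, shifts=theta)      # Newton-like polynomial basis

    # One QR of the whole basis replaces k steps of MGS; in CA-GMRES this QR is done by TSQR.
    for name, W in [("monomial", W_mono), ("Newton", W_newt)]:
        Q, R = np.linalg.qr(W)
        print(name, "cond(W) = %.1e" % np.linalg.cond(W))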


Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye

                      Naive     Monomial                               Newton         Chebyshev
  Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

                                Indices
  Nonzero entries               Explicit (O(nnz))        Implicit (o(nnz))
  Explicit (O(nnz))             CSR and variations       Vision, climate, AMR, …
  Implicit (o(nnz))             Graph Laplacian          Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one?

• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)            … b x b matmul on the blocks

• Thm: Picking b = M^(1/2) attains lower bound #words_moved = Ω(n³/M^(1/2))
• Where does the 1/2 come from?
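
For reference, here is the blocked code transcribed into runnable (0-based) Python/NumPy; b plays the role of M^(1/2) and the inner update is the b-by-b matmul in the picture. The concrete sizes are only for illustration.

    import numpy as np

    def blocked_matmul(A, B, b):
        # (n/b)^3 block triples, each moving about 3*b^2 words, i.e. ~ 3*n^3/b words;
        # choosing b ~ (M/3)^(1/2) so all three blocks fit in fast memory attains
        # the Omega(n^3 / M^(1/2)) lower bound up to a constant factor.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 96, 16                   # illustrative; in practice pick b ~ sqrt(M/3)
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)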

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n
    C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

             i  j  k
           [ 1  0  1 ]   A
    Δ  =   [ 0  1  1 ]   B
           [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀ·x = 3/2 = s
• Thm: #words_moved = Ω(n³/M^(s−1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
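
The little LP can be checked mechanically. The sketch below assumes SciPy is available and feeds Δ to scipy.optimize.linprog (which minimizes, so the objective is negated); it returns x = [1/2, 1/2, 1/2] and s = 3/2 as stated above.

    import numpy as np
    from scipy.optimize import linprog

    # Rows of Delta: which loop indices (i, j, k) appear in each array reference.
    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)

    # maximize 1'x  s.t.  Delta x <= 1, x >= 0
    res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
    x, s = res.x, res.x.sum()       # x = [0.5, 0.5, 0.5], s = 1.5
    print("x =", x, " s =", s)      # so words_moved = Omega(n^3 / M^(1/2))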

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n
    F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

             i  j
           [ 1  0 ]   F
    Δ  =   [ 1  0 ]   P(i)
           [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1ᵀ·x = 2 = s
• Thm: #words_moved = Ω(n²/M^(s−1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
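
As an illustration of "attained by block sizes M¹, M¹": the sketch below tiles both loops by a block of roughly M particles and does all block×block interactions before moving on, so about 2n²/M words are moved. The toy 1-D force function and the sizes are assumptions for the example.

    import numpy as np

    def force(pi, pj):
        # Toy pairwise force standing in for the slide's force(P(i), P(j)).
        return pj - pi

    def blocked_nbody(P, block):
        # F(i) += force(P(i), P(j)) with both loops tiled by `block` (~ M):
        # each pair of blocks is loaded once and all block^2 interactions are
        # done before moving on, so ~ 2*(n/block)^2*block = 2*n^2/block words move.
        n = len(P)
        F = np.zeros(n)
        for i0 in range(0, n, block):
            Pi = P[i0:i0 + block]                # block of targets kept in fast memory
            for j0 in range(0, n, block):
                Pj = P[j0:j0 + block]            # block of sources streamed in
                F[i0:i0 + block] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
        return F

    P = np.random.rand(1000)
    naive = sum(force(P, P[j]) for j in range(len(P)))
    assert np.allclose(blocked_nbody(P, block=64), naive)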

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
            [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
    Δ   =   [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀ·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀ·x = 15/7 = s
• Thm: #words_moved = Ω(n⁶/M^(s−1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  ⇒ for (i1,i2,…,ik) in S = subset of Zᵏ
    Access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder–Brascamp–Lieb Linear Program (HBL-LP) for s1, …, sm:
      for all subgroups H < Zᵏ:  rank(H) ≤ Σj sj · rank(φj(H))
• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
      #words_moved = Ω( #iterations / M^(s_HBL − 1) )
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Zᵏ and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
      |S| ≤ Πj |φj(S)|^sj
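
As a worked special case, for matmul the Christ/Tao/Carbery/Bennett inequality reduces to the classical Loomis–Whitney inequality, and the usual segment argument recovers the Ω(n³/M^(1/2)) bound; the derivation below is a sketch consistent with s = 3/2 computed earlier.

    % Matmul: phi_A(i,j,k) = (i,k), phi_B(i,j,k) = (k,j), phi_C(i,j,k) = (i,j),
    % and s = (1/2, 1/2, 1/2) is feasible for the HBL-LP, giving Loomis--Whitney:
    \[
      |S| \;\le\; |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2}.
    \]
    % Segment argument: in any segment of the execution that moves O(M) words,
    % each image has size O(M), so at most O(M^{3/2}) iterations occur per segment:
    \[
      \#\text{words\_moved}
        \;=\; \Omega\!\left(\frac{n^3}{M^{3/2}}\cdot M\right)
        \;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right)
        \;=\; \Omega\!\left(\frac{\#\text{iterations}}{M^{\,s_{\mathrm{HBL}}-1}}\right).
    \]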

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of HBL-LP gives optimal tile sizes:
      HBL-LP:       minimize 1ᵀ·s  s.t.  sᵀ·Δ ≥ 1ᵀ
      Dual-HBL-LP:  maximize 1ᵀ·x  s.t.  Δ·x ≤ 1
  Then for a sequential algorithm, tile index ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
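
A small sketch of reading the theorem's tile sizes off the dual LP solution, applied to the "random code" Δ from the earlier slide; the helper name and the choice M = 2^15 are assumptions for the example (SciPy assumed available).

    import numpy as np
    from scipy.optimize import linprog

    def tile_sizes(delta, M):
        # Solve the dual HBL-LP (maximize 1'x s.t. Delta x <= 1, x >= 0) and
        # return the per-loop-index tile sizes M**x_j suggested by the theorem.
        res = linprog(c=-np.ones(delta.shape[1]), A_ub=delta,
                      b_ub=np.ones(delta.shape[0]), bounds=(0, None))
        return np.floor(M ** res.x).astype(int)

    # Rows are the index sets of A1..A6 (A3 appears with two different index sets).
    delta_random = np.array([
        [1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
        [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
        [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
        [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6), A4(i3,i4,i6)
        [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
        [1, 0, 0, 1, 1, 0],   # A6(i1,i4,i5)
    ])
    # x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], so tiles ~ M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
    print(tile_sizes(delta_random, M=1 << 15))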

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 76: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 77: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 78: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 79: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A2x, …, Akx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A2x, …, Akx] and [y, ATy, (AT)2y, …, (AT)ky], or [y, ATAy, (ATA)2y, …, (ATA)ky]
  • Return all vectors or just the last one

• Cotuning and/or interleaving
  • W = [x, Ax, A2x, …, Akx] and {TSQR(W) or WTW or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for "Iterative" Linear Algebra
• Ditto (work in progress) for programs accessing arrays (e.g., n-body)

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … the inner three loops do a b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does the 1/2 come from?
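A runnable sketch of the blocked code in Python/NumPy (illustrative; n and b are assumed values, and in practice b would be chosen near M^(1/2) from the fast-memory size):

import numpy as np

def blocked_matmul(A, B, b):
    # Tile all three loops by b; each inner update is one b x b matmul that
    # touches one block each of A, B, and C (about 3*b^2 words of fast memory).
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

n, b = 512, 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)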

New Thm applied to Matmul

• for i = 1:n, for j = 1:n, for k = 1:n: C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
          Δ = 1  0  1     A
              0  1  1     B
              1  1  0     C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s
• Thm: #words_moved = Ω(n^3 / M^(s-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
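The LP is small enough to check mechanically; an illustrative SciPy sketch (not part of the tutorial), maximizing 1^T x by minimizing its negative:

import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
print("x =", res.x)              # [0.5 0.5 0.5]
print("s = 1^T x =", -res.fun)   # 1.5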

New Thm applied to Direct N-Body

• for i = 1:n, for j = 1:n: F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
          Δ = 1  0     F
              1  0     P(i)
              0  1     P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = s
• Thm: #words_moved = Ω(n^2 / M^(s-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
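A toy sketch of the corresponding blocked loop nest (1D positions, a softened inverse-square force, and the block size are assumptions for illustration); the point is that both i and j are tiled so a block of P stays in fast memory:

import numpy as np

def blocked_forces(P, block):
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, block):
        Pi = P[i0:i0+block]                      # block of targets, reused across j-tiles
        for j0 in range(0, n, block):
            Pj = P[j0:j0+block]                  # block of sources
            d = Pi[:, None] - Pj[None, :]        # pairwise displacements within the tile
            f = d / (d * d + 1e-6) ** 1.5        # softened 1/r^2 kernel (toy force law)
            F[i0:i0+block] += f.sum(axis=1)
    return F

P = np.random.rand(8192)
F = blocked_forces(P, block=1024)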

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
              1  0  1  0  0  1      A1
              1  1  0  1  0  0      A2
          Δ = 0  1  1  0  1  0      A3
              0  0  1  1  0  1      A3, A4
              0  1  0  0  0  1      A5
              1  0  0  1  1  0      A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s
• Thm: #words_moved = Ω(n^6 / M^(s-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
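The same LP machinery reproduces the quoted exponent (an illustrative SciPy sketch; the Δ rows follow the array references above, with the A5 row taken from A5(i2,i6)):

import numpy as np
from scipy.optimize import linprog

delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                  [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                  [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                  [0, 0, 1, 1, 0, 1],   # A3(i3,i4,i6) and A4(i3,i4,i6)
                  [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                  [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)

res = linprog(c=-np.ones(6), A_ub=delta, b_ub=np.ones(6), bounds=(0, None))
print("x =", res.x)              # one optimum: [2/7, 3/7, 1/7, 2/7, 3/7, 4/7]
print("s = 1^T x =", -res.fun)   # 15/7 ~= 2.142857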

Approach to generalizing lower bounds

• Matmul:
    for i = 1:n, for j = 1:n, for k = 1:n: C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)

• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …

• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σj sj · rank(φj(H))

• Thm: Given a program with array refs given by φj, choose sj to minimize s_HBL = Σj sj subject to HBL-LP. Then
    #words_moved = Ω(#iterations / M^(s_HBL - 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:

• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
    |S| ≤ Πj |φj(S)|^sj
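For matmul the φj are the three coordinate-pair projections and all sj = 1/2, which is the discrete Loomis-Whitney inequality; a tiny numerical spot-check (a sketch with a randomly generated S, not from the tutorial):

import math, random

random.seed(1)
S = {(random.randrange(8), random.randrange(8), random.randrange(8)) for _ in range(200)}

S_ij = {(i, j) for (i, j, k) in S}     # phi_C(S)
S_ik = {(i, k) for (i, j, k) in S}     # phi_A(S)
S_kj = {(k, j) for (i, j, k) in S}     # phi_B(S)

# |S| <= |phi_C(S)|^(1/2) * |phi_A(S)|^(1/2) * |phi_B(S)|^(1/2)
assert len(S) <= math.sqrt(len(S_ij) * len(S_ik) * len(S_kj))
print(len(S), "<=", math.sqrt(len(S_ij) * len(S_ik) * len(S_kj)))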

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – Thm (bad news): Reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): Can write it down explicitly in many cases of interest (e.g., all φj = subset of indices)
  – Thm (good news): Easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small), but still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = subset of indices, the dual of HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then for a sequential algorithm, tile loop i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of indices
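Turning the dual-LP solution into tile sizes is then mechanical; a tiny sketch (M is an assumed fast-memory size in words, and the x vectors are the LP solutions quoted above):

M = 1 << 20                              # assumed fast memory: 1M words

matmul_x = (0.5, 0.5, 0.5)               # dual-LP solution for matmul
nbody_x = (1.0, 1.0)                     # dual-LP solution for the direct n-body loop

print([round(M ** xi) for xi in matmul_x])   # [1024, 1024, 1024]: sqrt(M) x sqrt(M) blocks
print([round(M ** xi) for xi in nbody_x])    # [1048576, 1048576]: tiles of ~M particles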

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013: www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
• www.xsede.org
  – Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 80: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 81: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 82: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 83: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)


Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 85: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 86: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 87: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 88: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost88

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 89: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

89

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 90: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

90

Requires Co-tuning Kernels

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 91: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

91

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 92: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 93: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Def Houmllder-Brascamp-Lieb Linear Program (HBL-LP) for s1hellipsm for all subgroups H lt Zk rank(H) le Σj sjrank(φj(H))

bull Thm Given a program with array refs given by φj choose sj to minimize sHBL = Σj sj subject to HBL-LP Then

words_moved = Ω (iterationsMsHBL-1)ndash Proof depends on recent result in pure mathematics by

ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 94: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be donendash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matricesndash Autotuning and synthesis

bull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for ldquoIterativerdquo Linear Algebra bull Ditto (work in progress) for programs accessing

arrays (eg n-body)

Recall optimal sequential Matmulbull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = s

bull Thm words_moved = Ω(n3MS-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = s

bull Thm words_moved = Ω(n2MS-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = s

bull Thm words_moved = Ω(n6MS-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7) φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

• Def: Hölder-Brascamp-Lieb Linear Program (HBL-LP) for s1, …, sm:
    for all subgroups H < Z^k:  rank(H) ≤ Σ_j s_j · rank(φ_j(H))
• Thm: Given a program with array refs given by the φ_j, choose the s_j to minimize
  s_HBL = Σ_j s_j subject to the HBL-LP. Then
    words_moved = Ω(#iterations / M^(s_HBL − 1))
  – Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
    given S a subset of Z^k and group homomorphisms φ1, φ2, …, it bounds |S| in terms of
    |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
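As a worked instance connecting back to matmul, here is one way to spell the bound out in LaTeX (the "phase" argument below is the standard route to such bounds and is added for illustration, not text from the slides):

  % Matmul: \phi_C(i,j,k)=(i,j), \phi_A(i,j,k)=(i,k), \phi_B(i,j,k)=(k,j),
  % and s = (1/2, 1/2, 1/2) is feasible for the HBL-LP, so for any set S of iterations
  \[
     |S| \;\le\; |\phi_C(S)|^{1/2}\,|\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}.
  \]
  % Between consecutive groups of M words moved, each image has O(M) points, so a phase
  % executes O(M^{3/2}) iterations while moving O(M) words; over n^3 iterations,
  \[
     \mathrm{words\_moved} \;=\; \Omega\!\left(\frac{n^3}{M^{3/2}}\cdot M\right)
                           \;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
  \]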

Is this bound attainable? (1/2)

• But first: can we write it down?
  – Thm (bad news): reduces to Hilbert's 10th problem over Q (conjectured to be undecidable)
  – Thm (good news): can write it down explicitly in many cases of interest
    (e.g., when all φ_j = subsets of the indices)
  – Thm (good news): easy to approximate
    • If you miss a constraint, the lower bound may be too large (i.e., s_HBL too small),
      but it is still worth trying to attain, because your algorithm will still communicate less
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φ_j = subsets of the indices, the dual of the HBL-LP gives optimal tile sizes:
    HBL-LP:      minimize 1^T s  s.t.  s^T Δ ≥ 1^T
    Dual-HBL-LP: maximize 1^T x  s.t.  Δ x ≤ 1
  Then, for a sequential algorithm, tile index i_j by M^(x_j)
• Ex: Matmul: s = [1/2, 1/2, 1/2]^T = x
• Extends to unimodular transforms of the indices

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2013
    • www.cs.berkeley.edu/~demmel
  – Prerecorded version planned in Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)


Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 103: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Is this bound attainable (12)

bull But first Can we write it downndash Thm (bad news) Reduces to Hilbertrsquos 10th problem over Q

(conjectured to be undecidable)ndash Thm (good news) Can write it down explicitly in many cases

of interest (eg all φj = subset of indices)ndash Thm (good news) Easy to approximate

bull If you miss a constraint the lower bound may be too large (ie sHBL too small) but still worth trying to attain because your algorithm will still communicate less

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 104: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all φj = subset of indices dual of HBL-LP

gives optimal tile sizes HBL-LP minimize 1Ts st sTΔ ge 1T

Dual-HBL-LP maximize 1Tx st Δx le 1 Then for sequential algorithm tile ij by Mxj

bull Ex Matmul s = [ 12 12 12 ]T = xbull Extends to unimodular transforms of indices

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 105: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 106: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2013bull wwwcsberkeleyedu~demmel

ndash Prerecorded version planned in Spring 2013bull wwwxsedeorgbull Free supercomputer accounts to do homework

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary
Page 107: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC12_tutorial

Summary

Donrsquot Communichellip

108

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • SUMMA Communication Costs
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling in Time of 25D Matmul on BGP n=64K
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Extending to rest of Direct Linear Algebra Naiumlve Gaussian Elimi
  • Communication in sequential One-sided Factorizations (LU QR hellip
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Summary of dense sequential algorithms attaining communication
  • Summary of dense parallel algorithms attaining communication l
  • Summary of dense parallel algorithms attaining communication l (2)
  • Summary of dense parallel algorithms attaining communication l (3)
  • Summary of dense parallel algorithms attaining communication l (4)
  • Can we do even better
  • Symmetric Band Reduction
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 71
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 89
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 91
  • Slide 92
  • Slide 93
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • Outline (4)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • For more details
  • Summary