communicaon-avoiding algorithms for linear algebra and beyond
TRANSCRIPT
Communica)on-AvoidingAlgorithmsforLinearAlgebraandBeyond
JimDemmelEECS&MathDepartments
UCBerkeley
2
Whyavoidcommunica)on?(1/3)Algorithmshavetwocosts(measuredin)meorenergy):1. Arithme)c(FLOPS)
2. Communica)on:movingdatabetween– levelsofamemoryhierarchy(sequen)alcase)
– processorsoveranetwork(parallelcase).
CPUCache
DRAM
CPUDRAM
CPUDRAM
CPUDRAM
CPUDRAM
Whyavoidcommunica)on?(2/3)• Running)meofanalgorithmissumof3terms:
– #flops*)me_per_flop– #wordsmoved/bandwidth– #messages*latency
3
communica)on
• Time_per_flop<<1/bandwidth<<latency
• Gapsgrowingexponen)allywith)me[FOSC]
• Avoidcommunica)ontosave)me
• Goal:reorganizealgorithmstoavoidcommunica)on
Annualimprovements
Time_per_flop Bandwidth Latency
Network 26% 15%
DRAM 23% 5%59%
WhyMinimizeCommunica)on?(3/3)
PicoJoules
Source:JohnShalf,LBL
WhyMinimizeCommunica)on?(3/3)
PicoJoules
Source:JohnShalf,LBL
Minimizecommunica)ontosaveenergy
Goals
6
• Redesignalgorithmstoavoidcommunica)on• Betweenallmemoryhierarchylevels
• L1L2DRAMnetwork,etc
• Afainlowerboundsifpossible• Currentalgorithmsogenfarfromlowerbounds
• Largespeedupsandenergysavingspossible
SampleSpeedups• Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
• Up to 3x faster for tensor contractions on 2K core Cray XE/6
• Up to 6.2x faster for APSP on 24K core Cray CE6
• Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K core IBM BG/P
• Up to 13x faster for TSQR on Tesla C2050 Fermi NVIDIA GPU • Up to 6.7x faster for symeig (band A) on 10 core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K core Cray XT4
• Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)
– 2.5x / 1.5x for combustion simulation code
7
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
SummaryofCALinearAlgebra• “Direct”LinearAlgebra
• Lowerboundsoncommunica)onforlinearalgebraproblemslikeAx=b,leastsquares,Ax=λx,SVD,etc
• Mostlynotafainedbyalgorithmsinstandardlibraries
• Newalgorithmsthatafaintheselowerbounds• Beingaddedtolibraries:Sca/LAPACK,PLASMA,MAGMA
• Largespeed-upspossible• Autotuningtofindop)malimplementa)on
• Difofor“Itera)ve”LinearAlgebra
Lowerboundforall“n3-like”linearalgebra
• Holdsfor– Matmul,BLAS,LU,QR,eig,SVD,tensorcontrac)ons,…– Somewholeprograms(sequencesoftheseopera)ons,nomaferhowindividualopsareinterleaved,egAk)
– Denseandsparsematrices(where#flops<<n3)– Sequen)alandparallelalgorithms– Somegraph-theore)calgorithms(egFloyd-Warshall)
11
• LetM=“fast”memorysize(perprocessor)
#words_moved(perprocessor)=Ω(#flops(perprocessor)/M1/2)
#messages_sent(perprocessor)=Ω(#flops(perprocessor)/M3/2)
• Parallelcase:assumeeitherloadormemorybalanced
Lowerboundforall“n3-like”linearalgebra
• Holdsfor– Matmul,BLAS,LU,QR,eig,SVD,tensorcontrac)ons,…– Somewholeprograms(sequencesoftheseopera)ons,nomaferhowindividualopsareinterleaved,egAk)
– Denseandsparsematrices(where#flops<<n3)– Sequen)alandparallelalgorithms– Somegraph-theore)calgorithms(egFloyd-Warshall)
12
• LetM=“fast”memorysize(perprocessor)
#words_moved(perprocessor)=Ω(#flops(perprocessor)/M1/2)
#messages_sent≥#words_moved/largest_message_size
• Parallelcase:assumeeitherloadormemorybalanced
Lowerboundforall“n3-like”linearalgebra
• Holdsfor– Matmul,BLAS,LU,QR,eig,SVD,tensorcontrac)ons,…– Somewholeprograms(sequencesoftheseopera)ons,nomaferhowindividualopsareinterleaved,egAk)
– Denseandsparsematrices(where#flops<<n3)– Sequen)alandparallelalgorithms– Somegraph-theore)calgorithms(egFloyd-Warshall)
13
• LetM=“fast”memorysize(perprocessor)
#words_moved(perprocessor)=Ω(#flops(perprocessor)/M1/2)
#messages_sent(perprocessor)=Ω(#flops(perprocessor)/M3/2)
• Parallelcase:assumeeitherloadormemorybalanced
SIAMSIAG/LinearAlgebraPrize,2012Ballard,D.,Holtz,Schwartz
Canweafaintheselowerbounds?• Doconven)onaldensealgorithmsasimplementedinLAPACKandScaLAPACKafainthesebounds?– Ogennot
• Ifnot,arethereotheralgorithmsthatdo?– Yes,formuchofdenselinearalgebra,APSP– Newalgorithms,withnewnumericalproper)es,newwaystoencodeanswers,newdatastructures
– Notjustlooptransforma)ons(needthosetoo!)
• Onlyafewsparsealgorithmssofar– Ex:SparseCholeskyofmatriceswith“large”separators– Ex:Sparse*Densematrixmul)ply
• Lotsofworkinprogress14
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
ReviewofMatmulalgorithms
• Naïvesequen)alalgorithm– fori=1:n,forj=1:n,fork=1:n,C(i,j)+=A(i,k)*B(k,j)– Pessimal:O(n3)wordsmovedCache!DRAM
• Blockedsequen)alalgorithm– Fori=1:n/b,forj=1:n/b,fork=1:n/b,C[i,j]+=A[i,k]*B[k,j]– A[i,k]*B[k,j]isab-by-bblockmatrixmul)ply– Ifb~M1/2,M=Cachesize,thenafainslowerbound:O(n3/M1/2)wordsmoved
– Ifdon’tknowM,ormul)plecachelevels,cacheoblivious(recursive)algorithmworks
17
SUMMA–nxnmatmulonP1/2xP1/2grid(nearly)op)malusingminimummemoryM=O(n2/P)
For k=0 to n/b-1 … b = block size = #cols in A(i,k) = #rows in B(k,j) for all i = 1 to P1/2
* = i
j
A(i,k)
kkB(k,j)
C(i,j)
Brow
Acol
Summaryofdenseparallelalgorithmsafainingcommunica)onlowerbounds
• AssumenxnmatricesonPprocessors• MinimumMemoryperprocessor=M=O(n2/P)• Recalllowerbounds:
#words_moved=Ω((n3/P)/M1/2)=Ω(n2/P1/2)#messages=Ω((n3/P)/M3/2)=Ω(P1/2)
• DoesScaLAPACKafainthesebounds?• For#words_moved:mostly,exceptnonsym.Eigenproblem• For#messages:asympto)callyworse,exceptCholesky
• Newalgorithmsafainallbounds,uptopolylog(P)factors• Cholesky,LU,QR,Sym.andNonsymeigenproblems,SVD
• Needtoreplacepar)alpivo)nginLU• Needrandomiza)onforNonsymeigenproblem(sofar)
CanwedoBefer?
Canwedobefer?
• Aren’twealreadyop)mal?• WhyassumeM=O(n2/p),i.e.minimal?
– Lowerbounds)lltrueifmorememory– Canweafainit?
• Specialcase:“3DMatmul”– UsesM=O(n2/p2/3)– Dekel,Nassimi,Sahni[81],Bernsten[89],Agarwal,Chandra,Snir[90],Johnson[93],Agarwal,Balle,Gustavson,Joshi,Palkar[95]
• Notalwaysp1/3)mesasmuchmemoryavailable…
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
2.5DMatrixMul)plica)on
• Assumecanfitcn2/Pdataperprocessor,c>1• Processorsform(P/c)1/2x(P/c)1/2xcgrid
c
(P/c)1/2
Example:P=32,c=2
2.5DMatrixMul)plica)on
• Assumecanfitcn2/Pdataperprocessor,c>1• Processorsform(P/c)1/2x(P/c)1/2xcgrid
k
j
Ini)allyP(i,j,0)ownsA(i,j)andB(i,j)eachofsizen(c/P)1/2xn(c/P)1/2
(1)P(i,j,0)broadcastsA(i,j)andB(i,j)toP(i,j,k)(2)Processorsatlevelkperform1/c-thofSUMMA,i.e.1/c-thofΣmA(i,m)*B(m,j)(3)Sum-reducepar)alsumsΣmA(i,m)*B(m,j)alongk-axissoP(i,j,0)ownsC(i,j)
2.5DMatmulonBG/P,16Knodes/64Kcores
0
20
40
60
80
100
8192 131072
Pe
rce
nta
ge
of
ma
chin
e p
ea
k
n
Matrix multiplication on 16,384 nodes of BG/P
12X faster
2.7X faster
Using c=16 matrix copies
2D MM2.5D MM
2.5DMatmulonBG/P,16Knodes/64Kcores
0
0.2
0.4
0.6
0.8
1
1.2
1.4
n=8192, 2D
n=8192, 2.5D
n=131072, 2D
n=131072, 2.5D
Exe
cutio
n tim
e n
orm
aliz
ed b
y 2D
Matrix multiplication on 16,384 nodes of BG/P
95% reduction in comm computationidle
communication
c=16copies
DisOnguishedPaperAward,EuroPar’11(Solomonik,D.)SC’11paperbySolomonik,Bhatele,D.
12xfaster
2.7xfaster
IdeasadoptedbyNervana,“deeplearning”startup,acquiredbyIntelinAugust2016
PerfectStrongScaling–inTimeandEnergy• Every)meyouaddaprocessor,youshoulduseitsmemoryMtoo• Startwithminimalnumberofprocs:PM=3n2• IncreasePbyafactorofc"totalmemoryincreasesbyafactorofc• Nota)onfor)mingmodel:
– γT,βT,αT=secsperflop,perword_moved,permessageofsizem• T(cP)=n3/(cP)[γT+βT/M1/2+αT/(mM1/2)]=T(P)/c• Nota)onforenergymodel:
– γE,βE,αE=joulesforsameopera)ons– δE=joulesperwordofmemoryusedpersec– εE=joulespersecforleakage,etc.
• E(cP)=cP{n3/(cP)[γE+βE/M1/2+αE/(mM1/2)]+δEMT(cP)+εET(cP)}=E(P)• Limit:c≤P1/3(3Dalgorithm),ifstar)ngwith1copyofinputs• ExtendstoN-body,Strassen,…• Canprovelowerboundsonneedednetwork(eg3Dtorusformatmul)
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
TSQR:QRofaTall,Skinnymatrix
27
W=
Q00R00Q10R10Q20R20Q30R30
W0
W1
W2W3
Q00
Q10
Q20Q30
= = .
R00R10R20R30
R00R10R20R30
=Q01R01Q11R11
Q01
Q11= . R01
R11
R01R11
= Q02R02
TSQR:QRofaTall,Skinnymatrix
28
W=
Q00R00Q10R10Q20R20Q30R30
W0
W1
W2W3
Q00
Q10
Q20Q30
= = .
R00R10R20R30
R00R10R20R30
=Q01R01Q11R11
Q01
Q11= . R01
R11
R01R11
= Q02R02
Output={Q00,Q10,Q20,Q30,Q01,Q11,Q02,R02}
TSQR:AnArchitecture-DependentAlgorithm
W=
W0
W1W2
W3
R00R10R20R30
R01
R11
R02Parallel:
W=
W0
W1W2
W3
R01R02
R00
R03Sequen)al:
W=
W0
W1W2
W3
R00R01
R01R11
R02
R11R03
DualCore:
Canchoosereduc)ontreedynamically
Mul)core/Mul)socket/Mul)rack/Mul)site/Out-of-core:?
TSQRPerformanceResults• Parallel
– IntelClovertown– Upto8xspeedup(8core,dualsocket,10Mx10)
– Pen)umIIIcluster,DolphinInterconnect,MPICH• Upto6.7xspeedup(16procs,100Kx200)
– BlueGene/L• Upto4xspeedup(32procs,1Mx50)
– TeslaC2050/Fermi• Upto13x(110,592x100)
– Grid–4xon4ci)esvs1city(Dongarra,Langouetal)– Cloud–1.6xslowerthanjustaccessingdatatwice(GleichandBenson)
• Sequen)al– “Infinitespeedup”forout-of-coreonPowerPClaptop
• Aslifleas2xslowdownvs(predicted)infiniteDRAM• LAPACKwithvirtualmemoryneverfinished
• SVDcostsaboutthesame• JointworkwithGrigori,Hoemmen,Langou,Anderson,Ballard,Keutzer,others
30DatafromGreyBallard,MarkHoemmen,LauraGrigori,JulienLangou,JackDongarra,MichaelAnderson
TSQRPerformanceResults• Parallel
– IntelClovertown– Upto8xspeedup(8core,dualsocket,10Mx10)
– Pen)umIIIcluster,DolphinInterconnect,MPICH• Upto6.7xspeedup(16procs,100Kx200)
– BlueGene/L• Upto4xspeedup(32procs,1Mx50)
– TeslaC2050/Fermi• Upto13x(110,592x100)
– Grid–4xon4ci)esvs1city(Dongarra,Langouetal)– Cloud–1.6xslowerthanjustaccessingdatatwice(GleichandBenson)
• Sequen)al– “Infinitespeedup”forout-of-coreonPowerPClaptop
• Aslifleas2xslowdownvs(predicted)infiniteDRAM• LAPACKwithvirtualmemoryneverfinished
• SVDcostsaboutthesame• JointworkwithGrigori,Hoemmen,Langou,Anderson,Ballard,Keutzer,others
31DatafromGreyBallard,MarkHoemmen,LauraGrigori,JulienLangou,JackDongarra,MichaelAnderson
SIAGonSupercompuOngBestPaperPrize,2016
Other/OngoingWork• AllresultsextendtoStrassen-likealgorithmswithcomplexityO(nω)
– LowerboundgoesfromO(n3/M1/2)=O((n/M1/2)3M)toO((n/M1/2)ωM)• Lotsmoreworkon
– Algorithms:• BLAS,LDLT,QRwithpivo)ng,otherpivo)ngschemes,eigenproblems,…• Sparsematrices,structuredmatrices• All-pairs-shortest-path,…• Both2D(c=1)and2.5D(c>1)• Butonlybandwidthmaydecreasewithc>1,notlatency(egLU)
– Pla�orms:• Mul)core,cluster,GPU,cloud,heterogeneous,low-energy,…
– Sogware:• Integra)onintoSca/LAPACK,PLASMA,MAGMA,…
• Integra)onintoapplica)ons– CTF(withANL):symmetrictensorcontrac)ons– Spark:TSQR
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
Recallop)malsequen)alMatmul(1/2)
• Naïvecodefori=1:n,forj=1:n,fork=1:n,C(i,j)+=A(i,k)*B(k,j)
• “Blocked”codefori=1:n/b,forj=1:n/b,fork=1:n/bC[i,j]+=A[i,k]*B[k,j]…bxbmatmul
• Thm:Pickingb=M1/2afainslowerbound:#words_moved=Ω(n3/M1/2)• Wheredoes1/2comefrom?
Recallop)malsequen)alMatmul(2/2)
• “Blocked”codewithblocksizesMxi,Mxj,Mxkfori=1:n/Mxi,forj=1:n/Mxj,fork=1:n/Mxk
C[i,j]+=A[i,k]*B[k,j]…(MxixMxk)*(MxkxMxj)matmul• Pickop)malblocksizes
– Afitsincache:MxixMxk=Mxi+xk≤M=>xi+xk≤1– Bfitsincache:xk+xj≤1,Cfitsincache:xi+xj≤1– Maximize=MxixMxjxMxk=Mxi+xj+xk!Maximizexi+xj+xk
• Linearprogram,solu)onxi=xj=xk=1/2,xi+xj+xk=3/2• Memorytraffic=(n/M1/2)3*M=n3/M1/2
• Thm:This)lingminimizesmemorytraffic,overallreorderings– Includessequen)al,parallel,2.5Dcases
Applica)ontoDirectN-Body• fori=1:n,forj=1:n,F(i)+=force(P(i),P(j))• UseblocksizesMxiforiandMxjforj
• MaximizeMxixMxj=Mxi+xjs.t.Mxi≤MandMxj≤M– Maximizexi+xjs.t.xi≤1andxj≤1– Solu)on:xi=xj=1
• Memorytraffic=(n/M)2*M=n2/M
• Thm:This)lingminimizesmemorytraffic,overallreorderings– Includessequen)al,parallel,2.5Dcases
N-BodySpeedupsonIBM-BG/P(Intrepid)8Kcores,32Kpar)cles
11.8xspeedup
K.Yelick,E.Georganas,M.Driscoll,P.Koanantakool,E.Solomonik
04/07/2015 CS267 Lecture 21
Some Applications • Gravity, Turbulence, Molecular Dynamics (3-way
interactions, 42x speedups), Plasma Simulation, Electron-Beam Lithography Device Simulation
• Hair ... – www.fxguide.com/featured/brave-new-hair/ – graphics.pixar.com/library/CurlyHairA/paper.pdf
38
Approachtogeneralizinglowerbounds• Matmul
fori=1:n,forj=1:n,fork=1:n,
C(i,j)+=A(i,k)*B(k,j)
=>for(i,j,k)inS=subsetofZ3
Accessloca)onsindexedby(i,j),(i,k),(k,j)
• Generalcase
fori1=1:n,fori2=i1:m,…forik=i3:i4
C(i1+2*i3-i7)=func(A(i2+3*i4,i1,i2,i1+i2,…),B(pnt(3*i4)),…)
D(somethingelse)=func(somethingelse),…=>for(i1,i2,…,ik)inS=subsetofZk
Accessloca)onsindexedby“projec)ons”,eg
φC(i1,i2,…,ik)=(i1+2*i3-i7)
φA(i1,i2,…,ik)=(i2+3*i4,i1,i2,i1+i2,…),…
• Goal:Communica)onlowerbounds,op)malalgorithmsforanyprogramthatlookslikethis
GeneralCommunica)onBound
• Thm:Givenaprogramwitharrayrefsgivenbyprojec)onsφj,thenthereisane≥1suchthat
#words_moved=Ω(#itera)ons/Me-1)
whereeisthethevalueofalinearprogram:minimizee=Σjejsubjecttorank(H)≤Σjej*rank(φj(H))forallsubgroupsH<Zk
• Proofdependsonrecentresultinpuremathema)csbyChrist/Tao/Carbery/Bennef
Isthisboundafainable?
• Butfirst:Canwewriteitdown?• OneinequalitypersubgroupH<Zk,buts)llfinitelymany!
• Thm:(badnews)Wri)ngdownallinequali)esinLPreducestoHilbert’s10thproblemoverQ– Conjecturedtobeundecidable
• Thm:(goodnews)AnotherLPwithsamesolu)onisdecidable
• Afainabilitydependsonloopdependencies– Bestcase:none,orreduc)ons(likematmul)
• Thm:Wecanalwaysconstructanop)mal)ling,thatafainsthebound(dependenciespermi�ng)
OngoingWork
• Extend“perfectscaling”resultsfor)meandenergybyusingextramemory– “n.5Dalgorithms”
• Dealwithdependencies• Implement/improvealgorithmstogeneratelowerbounds,op)malalgorithms
• Incorporateintocompilers
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
AvoidingCommunica)oninItera)veLinearAlgebra
• k-stepsofitera)vesolverforsparseAx=borAx=λx– DoeskSpMVswithAandstar)ngvector– Manysuch“KrylovSubspaceMethods”
• ConjugateGradients(CG),GMRES,Lanczos,Arnoldi,…
• Goal:minimizecommunica)on– Assumematrix“well-par))oned”– Serialimplementa)on
• Conven)onal:O(k)movesofdatafromslowtofastmemory• New:O(1)movesofdata–opOmal
– Parallelimplementa)ononpprocessors• Conven)onal:O(klogp)messages(kSpMVcalls,dotprods)• New:O(logp)messages-opOmal
• Lotsofspeeduppossible(modeledandmeasured)– Price:someredundantcomputa)on– Challenges:Poorpar))oning,Precondi)oning,Num.Stability44
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
• Worksforany“well-par))oned”A
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]
• Example:Atridiagonal,n=32,k=3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen)alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen)alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen)alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen)alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3 Step4
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• ParallelAlgorithm
• Example:Atridiagonal,n=32,k=3
• Eachprocessorcommunicatesoncewithneighbors
Proc1 Proc2 Proc3 Proc4
1234… …32
x
A·x
A2·x
A3·x
Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]• ParallelAlgorithm
• Example:Atridiagonal,n=32,k=3
• Eachprocessorworkson(overlapping)trapezoid
Proc1 Proc2 Proc3 Proc4
TheMatrixPowersKernel:[Ax,A2x,…,Akx]onageneralmatrix(nearestkneighborsonagraph)
Sameideaforgeneralsparsematrices:k-wideneighboringregion57
Simpleblock-rowpar))oning"(hyper)graphpar))oning
Top-to-bofomprocessing"# TravelingSalesmanProblem
MinimizingCommunica)onofGMREStosolveAx=b
• GMRES:findxinspan{b,Ab,…,Akb}minimizing||Ax-b||2
StandardGMRESfori=1tokw=A·v(i-1)…SpMVMGS(w,v(0),…,v(i-1))updatev(i),HendforsolveLSQproblemwithH
Communica)on-avoidingGMRESW=[v,Av,A2v,…,Akv][Q,R]=TSQR(W)…“TallSkinnyQR”buildHfromRsolveLSQproblemwithH
• Oops–Wfrompowermethod,precisionlost!58
Sequen)alcase:#wordsmoveddecreasesbyafactorofkParallelcase:#messagesdecreasesbyafactorofk
MatrixPowersKernel+TSQRinGMRES
0 200 400 600 800 1000
Iteration count
10�5
10�4
10�3
10�2
10�1
100R
elat
ive
norm
ofre
sidu
alA
x�
bOriginal GMRES
0 200 400 600 800 1000
Iteration count
10�5
10�4
10�3
10�2
10�1
100R
elat
ive
norm
ofre
sidu
alA
x�
bOriginal GMRESCA-GMRES (Monomial basis)
0 200 400 600 800 1000
Iteration count
10�5
10�4
10�3
10�2
10�1
100R
elat
ive
norm
ofre
sidu
alA
x�
bOriginal GMRESCA-GMRES (Monomial basis)CA-GMRES (Newton basis)
59
SpeedupsofGMRESon8-coreIntelClovertown
[MHDY09]
60
RequiresCo-tuningKernels
SpeedupsforGMGw/CA-KSMBofomSolve
61
• ComparedBICGSTABvs.CA-BICGSTABwiths=4(monomialbasis)
• HopperatNERSC(CrayXE6),weakscaling:Upto4096MPIprocesses(1perchip,24,576corestotal)
• SpeedupsforminiGMGbenchmark(HPGMGbenchmarkpredecessor)– 4.2xinbofomsolve,2.5xoverallGMGsolve
• Implementedasasolverop)oninBoxLibandCHOMBOAMRframeworks
• SpeedupsfortwoBoxLibapplica)ons:– 3DLMC(alow-machnumbercombus)oncode)
• 2.5xinbofomsolve,1.5xoverallGMGsolve– 3DNyx(anN-bodyandgasdynamicscode)
• 2xinbofomsolve,1.15xoverallGMGsolve
SummaryofItera)veLinearAlgebra
• Newlowerbounds,op)malalgorithms,bigspeedupsintheoryandprac)ce
• Lotsofotherprogress,openproblems– Manydifferentalgorithmsreorganized
• Moreunderway,moretobedone
– Extendstomachinelearning(coordinatedescent)– Needtorecognizestablevariantsmoreeasily– Precondi)oning
• HierarchicallySemiseparableMatrices
– Autotuningandsynthesis• Differentkindsof“sparsematrices”
Outline
• SurveystateoftheartofCA(Comm-Avoiding)algorithms– ReviewpreviousMatmulalgorithms– CAO(n3)2.5DMatmul– TSQR:Tall-SkinnyQR
• Beyondlinearalgebra– Extendinglowerboundstoanyalgorithmwitharrays– Communica)on-op)malN-bodyalgorithm
• CA-Krylovmethods• RelatedTopics
– Write-AvoidingAlgorithms– Reproducilibity
Formoredetails
• Bebop.cs.berkeley.edu– 155pagesurveyinActaNumerica
• CS267–Berkeley’sParallelCompu)ngCourse– LivebroadcastinSpring2016,plannedSpring2017
• www.cs.berkeley.edu/~demmel• Allslides,videoavailable
– AvailableasSPOCfromXSEDE• www.xsede.org• Freesupercomputeraccountstodohomework• Freeautogradingofhomework
CollaboratorsandSupporters• JamesDemmel,KathyYelick,MichaelAnderson,GreyBallard,ErinCarson,Aditya
Devarakonda,MichaelDriscoll,DavidEliahu,AndrewGearhart,EvangelosGeorganas,NicholasKnight,PenpornKoanantakool,BenLipshitz,OdedSchwartz,EdgarSolomonik,OmerSpillinger
• Aus)nBenson,MaryamDehnavi,MarkHoemmen,ShoaibKamil,MarghoobMohiyuddin
• AbhinavBhatele,AydinBuluc,MichaelChrist,IoanaDumitriu,ArmandoFox,DavidGleich,MingGu,JeffHammond,MikeHeroux,OlgaHoltz,KurtKeutzer,JulienLangou,DevinMafhews,TomScanlon,MichelleStrout,SamWilliams,HuaXiang
• JackDongarra,DulceneiaBecker,IchitaroYamazaki
• SivanToledo,AlexDruinsky,InonPeled
• LauraGrigori,Sebas)enCayrols,SimpliceDonfack,MathiasJacquelin,AmalKhabou,SophieMoufawad,MikolajSzydlarski
• MembersofFASTMath,ParLab,ASPIRE,BEBOP,CACHE,EASI,MAGMA,PLASMA
• ThankstoDOE,NSF,UCDiscovery,INRIA,Intel,Microsog,Mathworks,Na)onalInstruments,NEC,Nokia,NVIDIA,Samsung,Oracle
• bebop.cs.berkeley.edu
Summary
Don’tCommunic…
66
Timetoredesignalllinearalgebra,n-body,…algorithmsandsogware
(andcompilers)