TRANSCRIPT
Communication-Avoiding Algorithms for Linear Algebra and Beyond
Jim Demmel, EECS & Math Departments
UC Berkeley
Why avoid communication? (1/3)
Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)
[Figure: sequential case, a CPU–Cache–DRAM hierarchy; parallel case, CPU–DRAM nodes connected over a network]
Why avoid communication? (2/3)
• Running time of an algorithm is the sum of 3 terms:
  – #flops * time_per_flop
  – #words moved / bandwidth
  – #messages * latency
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
• Goal: reorganize algorithms to avoid communication
  • Between all memory hierarchy levels
  • L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Very large speedups possible
• Energy savings too!
Annual improvements:
            Time_per_flop   Bandwidth   Latency
  (flops)       59%
  Network                      26%        15%
  DRAM                         23%         5%
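The three-term running-time model above can be sketched directly; the function name and the sample machine parameters below are illustrative, not measured values from the talk.

```python
def run_time(flops, words, messages, time_per_flop, bandwidth, latency):
    """Running time = #flops * time_per_flop + #words moved / bandwidth
    + #messages * latency (the three-term cost model)."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# Toy numbers: with a fast FPU, the words/bandwidth and messages*latency
# terms can dominate even when the flop count is a thousand times larger.
t = run_time(flops=1e9, words=1e6, messages=1e3,
             time_per_flop=1e-9, bandwidth=1e9, latency=1e-5)
```

Because the three terms simply add, halving communication helps exactly as much as its share of the total, which is why the growing gaps make avoiding communication the priority.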
Why Minimize Communication? (3/3)
[Figure: energy cost per operation in picojoules (log scale, 1 to 10000), now vs. projected 2018. Source: John Shalf, LBL]
Minimize communication to save energy
Goals
• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels
  • L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible
Sample Speedups
• Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
• Up to 3x faster for tensor contractions on 2K core Cray XE6
• Up to 6.2x faster for APSP on 24K core Cray XE6
• Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K core IBM BG/P
• Up to 13x faster for TSQR on Tesla C2050 Fermi NVIDIA GPU
• Up to 6.7x faster for symeig (band A) on 10 core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K core Cray XT4
• Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)
  – 2.5x / 1.5x for combustion simulation code
“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.” FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); “Tall-Skinny” QR (Grigori, Hoemmen, Langou, JD)
Outline
• Survey state of the art of CA (Comm-Avoiding) algorithms
  – Review previous Matmul algorithms
  – CA O(n^3) 2.5D Matmul and LU
  – TSQR: Tall-Skinny QR
  – CA Strassen Matmul
• Beyond linear algebra
  – Extending lower bounds to any algorithm with arrays
  – Communication-optimal N-body algorithm
• CA-Krylov methods
• Related Topics
Summary of CA Linear Algebra
• “Direct” Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax=λx, SVD, etc.
  • Mostly not attained by algorithms in standard libraries
  • New algorithms that attain these lower bounds
    • Being added to libraries: Sca/LAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for “Iterative” Linear Algebra
Lower bound for all “n^3-like” linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = “fast” memory size (per processor)
  #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
  #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
  (since #messages_sent ≥ #words_moved / largest_message_size)
• Parallel case: assume either load or memory balanced
SIAM SIAG / Linear Algebra Prize, 2012: Ballard, Demmel, Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra, APSP
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too!)
• Only a few sparse algorithms so far
  – Ex: Matmul of “random” sparse matrices
  – Ex: Sparse Cholesky of matrices with “large” separators
• Lots of work in progress
Naïve Matrix Multiply {implements C = C + A*B}

for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    … b^2 × (n/b)^2 = n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
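The tiled loop above can be made executable in a few lines of plain Python; the word counter tallies exactly the block reads/writes annotated in the pseudocode (a sketch, assuming n is divisible by b).

```python
def naive_matmul(A, B):
    """Reference triple loop for checking results."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def blocked_matmul(A, B, b):
    """Tiled C = A*B over plain lists; counts words moved between slow
    and fast memory, assuming 3 b-by-b blocks fit in fast memory."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += b * b                 # read block C(i,j)
            for k in range(0, n, b):
                words += 2 * b * b         # read blocks A(i,k) and B(k,j)
                for ii in range(i, i + b):
                    for jj in range(j, j + b):
                        C[ii][jj] += sum(A[ii][kk] * B[kk][jj]
                                         for kk in range(k, k + b))
            words += b * b                 # write block C(i,j)
    return C, words                        # words = 2n^3/b + 2n^2
```

Running it confirms the count 2n^3/b + 2n^2 from the slide while producing the same C as the naïve loop.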
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don’t know M?
• Or if there are multiple levels of fast memory?
• Can use “Cache Oblivious” algorithm
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
          = [A11·B11 + A12·B21   A11·B12 + A12·B22;
             A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O(n^3 / M^(1/2) + n^2) … same as blocked matmul

“Cache oblivious”: works for memory hierarchies, but not a panacea.
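The RMM recursion above can be sketched in Python with plain lists (the helper names `quad` and `add` are mine, not from the slide):

```python
def rmm(A, B):
    """Recursive matrix multiply C = A*B for n-by-n matrices, n a power of 2.
    Same 2n^3 operations as the usual loop, in a cache-oblivious order."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]  # h-by-h submatrix
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

No block size appears anywhere in the code: that is the "cache oblivious" property, since the recursion eventually produces subproblems that fit in every level of the hierarchy.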
For big speedups, see SC12 poster on “Beating MKL and ScaLAPACK at Rectangular Matmul”, 11/13 at 5:15-7pm
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid
(nearly) optimal using minimum memory M = O(n^2/P)

For k = 0 to n/b-1   … b = block size = #cols in A(i,k) = #rows in B(k,j)
  for all i = 1 to P^(1/2)   … in parallel
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)   … in parallel
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: processor grid; A(i,k) broadcast along processor row i, B(k,j) broadcast along processor column j, accumulating into C(i,j)]
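A serial simulation of the SUMMA ownership pattern makes the broadcast structure concrete (a sketch, not MPI; one iteration of the outer loop per broadcast step, with the block size equal to n/p):

```python
def summa_sim(A, B, p):
    """Simulate SUMMA on a p-by-p grid: at step k the owners of block
    column k of A and block row k of B 'broadcast', and every processor
    (i,j) updates its own block of C."""
    n = len(A)
    b = n // p
    blk = lambda M, i, j: [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
    C = [[0.0] * n for _ in range(n)]
    for k in range(p):
        Acol = [blk(A, i, k) for i in range(p)]   # broadcast along each processor row
        Brow = [blk(B, k, j) for j in range(p)]   # broadcast along each processor column
        for i in range(p):                        # every processor updates its C block
            for j in range(p):
                for ii in range(b):
                    for jj in range(b):
                        C[i * b + ii][j * b + jj] += sum(
                            Acol[i][ii][kk] * Brow[j][kk][jj] for kk in range(b))
    return C
```

Each "processor" only ever touches its own C block plus the two broadcast blocks, which is why the per-processor memory stays at O(n^2/P).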
Using more than the minimum memory
• What if the matrix is small enough to fit c > 1 copies, so M = cn^2/P?
  – #words_moved = Ω(#flops / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain the new lower bound?
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, sym. and nonsym. eigenproblems, SVD
Can we do better?
• Aren’t we already optimal?
• Why assume M = O(n^2/p), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
• Special case: “3D Matmul”
  – Uses M = O(n^2/p^(2/3))
  – Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
  – Processors arranged in p^(1/3) x p^(1/3) x p^(1/3) grid
  – Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/p^(1/3) x n/p^(1/3)
• Not always p^(1/3) times as much memory available…
2.5D Matrix Multiplication
• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: example processor grid with P = 32, c = 2]
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so P(i,j,0) owns C(i,j)
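The payoff of the c replicated copies is visible directly in the bandwidth lower bound; the helper below is an illustrative calculator, not code from the talk.

```python
from math import sqrt

def words_lower_bound(n, P, c=1):
    """Per-processor bandwidth lower bound Omega(n^2 / sqrt(c*P)) for matmul
    when each processor holds M = c*n^2/P words (c replicated copies)."""
    return n * n / sqrt(c * P)

# With c = 16 copies the bound, and hence the attainable communication
# volume, drops by a factor of sqrt(16) = 4 relative to 2D (c = 1).
```

This square-root dependence on c is why 2.5D matmul with c = 16 can cut communication so sharply on BG/P (next slide).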
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: percentage of machine peak for matrix multiplication on 16,384 nodes of BG/P, n = 8192 and n = 131072, 2D MM vs 2.5D MM using c = 16 matrix copies; speedups of 12x and 2.7x over 2D]
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: execution time normalized by 2D, broken into computation / idle / communication, for n = 8192 and n = 131072, 2D vs 2.5D with c = 16 copies; 95% reduction in communication; 12x and 2.7x faster]
Distinguished Paper Award, EuroPar ’11 (Solomonik, Demmel); SC ’11 paper by Solomonik, Bhatele, Demmel
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP { n^3/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Extends to N-body, Strassen, …
• Can prove lower bounds on needed network (e.g. 3D torus for matmul)
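The identity T(cP) = T(P)/c can be checked directly from the timing model; the symbol names mirror the slide and the numeric values below are arbitrary.

```python
from math import sqrt

def T(P, n, M, gamma, beta, alpha, m):
    """Timing model T(P) = n^3/P * (gamma + beta/M^(1/2) + alpha/(m*M^(1/2))),
    with the per-processor memory M held fixed as P grows (each added
    processor brings, and uses, its own memory)."""
    return (n ** 3 / P) * (gamma + beta / sqrt(M) + alpha / (m * sqrt(M)))
```

Because every bracketed term depends only on M, which stays fixed, the whole expression scales as 1/P: perfect strong scaling in time.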
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• E(cP) = cP { n^3/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
2.5D vs 2D LU, With and Without Pivoting
[Figure: time (sec) for LU on 16,384 nodes of BG/P (n = 131,072), broken into compute / idle / communication, for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, CA-pivot 2.5D; 2.5D is 2x faster in both cases]
Thm: Perfect Strong Scaling impossible, because Latency · Bandwidth = Ω(n^2)
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = {Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02}
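The reduction above can be sketched in pure Python: factor each row block, stack the small R factors, factor again. Modified Gram-Schmidt below is a stand-in for the Householder QR used in practice; since an R factor with positive diagonal is unique for a full-rank matrix, the tree produces the same R as a direct QR of W.

```python
def mgs_qr(A):
    """Thin QR of an m-by-n matrix (list of rows, m >= n) by modified
    Gram-Schmidt; R has a positive diagonal for full-rank A."""
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(W, nblocks):
    """R factor of tall-skinny W via one TSQR reduction level."""
    rows = len(W) // nblocks
    stacked = []
    for p in range(nblocks):          # these local QRs are independent (parallel)
        _, R = mgs_qr(W[p * rows:(p + 1) * rows])
        stacked.extend(R)
    _, R = mgs_qr(stacked)            # one reduction combines the local Rs
    return R
```

The local factorizations need no communication at all; only the small n-by-n R factors travel up the tree, which is where the O(log p) message count comes from.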
TSQR: An Architecture-Dependent Algorithm
• Parallel: W = [W0; W1; W2; W3] → {R00, R10, R20, R30} → {R01, R11} → R02 (binary reduction tree)
• Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03 (flat tree, one block at a time)
• Dual core: a hybrid of the two trees
• Can choose reduction tree dynamically
• Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  – Cloud: 1.6x slower than just accessing data twice (Gleich and Benson)
• Sequential
  – “Infinite speedup” for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
SIAG on Supercomputing Best Paper Prize, 2016
Back to LU: Using a similar idea for TSLU as TSQR: Use a reduction tree, to do “Tournament Pivoting”

W (n x b) = [W1; W2; W3; W4]
  W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1’
  W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2’
  W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3’
  W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4’
[W1’; W2’] = P12·L12·U12 → choose b pivot rows, call them W12’
[W3’; W4’] = P34·L34·U34 → choose b pivot rows, call them W34’
[W12’; W34’] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
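A small Python sketch of the tournament: Gaussian elimination with partial pivoting (GEPP) on each block nominates b candidate rows, and the winners play off pairwise up the tree. The helper names and the toy data are mine, and the blocks here are taken contiguously for simplicity.

```python
def gepp_pivots(rows, b):
    """Indices (into rows) of the b pivot rows chosen by Gaussian
    elimination with partial pivoting on the first b columns."""
    M = [r[:b] for r in rows]
    idx = list(range(len(M)))
    for k in range(b):
        p = max(range(k, len(M)), key=lambda i: abs(M[i][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        idx[k], idx[p] = idx[p], idx[k]
        for i in range(k + 1, len(M)):
            f = M[i][k] / M[k][k]
            for j in range(k, b):
                M[i][j] -= f * M[k][j]
    return idx[:b]

def tournament_pivot(W, b, nblocks):
    """Choose b pivot rows of W: one local GEPP per block, then a
    binary reduction tree of pairwise play-offs."""
    rows = len(W) // nblocks
    cand = [[p * rows + i for i in gepp_pivots(W[p * rows:(p + 1) * rows], b)]
            for p in range(nblocks)]          # local selection, independent per block
    while len(cand) > 1:                       # play-offs up the tree
        nxt = []
        for a in range(0, len(cand), 2):
            pair = cand[a] + (cand[a + 1] if a + 1 < len(cand) else [])
            nxt.append([pair[i] for i in gepp_pivots([W[r] for r in pair], b)])
        cand = nxt
    return cand[0]
```

Only the b candidate rows move up the tree at each level, so the pivot search costs O(log p) messages instead of one pivot exchange per column.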
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]
Ongoing Work
• Lots more work on
  – Algorithms:
    • BLAS, LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • Sparse matrices, structured matrices
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency (e.g. LU)
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
    • Integration into applications
      – CTF (with ANL): symmetric tensor contractions
Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n^3) matmul:       #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen’s O(n^lg7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be “regular” [and connected]
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA ’11), Ballard, Demmel, Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
Communication Avoiding Parallel Strassen (CAPS)

CAPS:
  If Enough Memory and P ≥ 7
    then BFS step
    else DFS step
  endif

Best way to interleave BFS and DFS is a tuning parameter
[Figure: performance benchmarking, strong scaling plot on Franklin (Cray XT4), n = 94080]
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Recall optimal sequential Matmul
• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• “Blocked” code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)   … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:
        i  j  k
  Δ = [ 1  0  1 ]  A
      [ 0  1  1 ]  B
      [ 1  1  0 ]  C
• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = sHBL
• Thm: #words_moved = Ω(n^3 / M^(sHBL−1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:
        i  j
  Δ = [ 1  0 ]  F
      [ 1  0 ]  P(i)
      [ 0  1 ]  P(j)
• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = sHBL
• Thm: #words_moved = Ω(n^2 / M^(sHBL−1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
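The M^1 x M^1 tiling from the LP can be demonstrated on a toy 1-D "force" law; the force function and the load counter below are illustrative, not from the talk.

```python
def blocked_nbody(pos, M):
    """Toy direct n-body with both loops tiled by block size M, as the LP
    solution x = [1, 1] prescribes; counts particle loads into fast
    memory: n (target blocks) + n^2/M (source blocks), i.e. Theta(n^2/M)."""
    n = len(pos)
    F = [0.0] * n
    loads = 0
    for i0 in range(0, n, M):
        loads += M                    # load a block of targets P(i), F(i)
        for j0 in range(0, n, M):
            loads += M                # load a block of sources P(j)
            for i in range(i0, i0 + M):
                for j in range(j0, j0 + M):
                    if i != j:
                        F[i] += pos[j] - pos[i]   # toy pairwise "force"
    return F, loads
```

With M words of fast memory the blocked loop reuses each loaded source block against M targets, matching the Ω(n^2/M) lower bound.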
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
        i1 i2 i3 i4 i5 i6
  Δ = [ 1  0  1  0  0  1 ]  A1
      [ 1  1  0  1  0  0 ]  A2
      [ 0  1  1  0  1  0 ]  A3
      [ 0  0  1  1  0  1 ]  A3, A4
      [ 0  1  0  0  0  1 ]  A5
      [ 1  0  0  1  1  0 ]  A6
• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = sHBL
• Thm: #words_moved = Ω(n^6 / M^(sHBL−1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
Where do lower and matching upper bounds on communication come from? (1/3)
• Originally for C = A*B by Irony/Tiskin/Toledo (2004)
• Proof idea
  – Suppose we can bound #useful_operations ≤ G doable with data in fast memory of size M
  – So to do F = #total_operations, we need to fill fast memory F/G times, and so #words_moved ≥ M·F/G
• Hard part: finding G
• Attaining the lower bound
  – Need to “block” all operations to perform ~G operations on every chunk of M words of data
Proof of communication lower bound (2/3)
[Figure: the (i,j,k) iteration space of C = A*B (Irony/Tiskin/Toledo 2004); each cube (i,j,k) represents C(i,j) += A(i,k)·B(k,j), projecting onto squares on the “A face”, “B face”, and “C face”]
• If we have at most M “A squares”, M “B squares”, and M “C squares”, how many cubes G can we have?
Proof of communication lower bound (3/3)
[Figure: black box with side lengths x, y and z inside the iteration space, with its “A shadow”, “B shadow”, and “C shadow” projections]

G = # cubes in black box with side lengths x, y and z
  = volume of black box = x·y·z = (xz · zy · yx)^(1/2)
  = (#A□s · #B□s · #C□s)^(1/2) ≤ M^(3/2)

(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949):
G = # cubes in 3D set = volume of 3D set
  ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2) ≤ M^(3/2)

#words_moved = Ω( M·(n^3/P) / M^(3/2) ) = Ω( n^3 / (P·M^(1/2)) )
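The Loomis-Whitney inequality behind this bound is easy to sanity-check by brute force on small 3-D sets; this is an illustration of the statement, not its proof.

```python
import random

def loomis_whitney_holds(S):
    """Check |S| <= sqrt(|A shadow| * |B shadow| * |C shadow|) for S in Z^3,
    where the shadows are the projections onto the (i,k), (j,k), (i,j) planes.
    Compared as squared quantities to stay in exact integer arithmetic."""
    A = {(i, k) for (i, j, k) in S}   # A shadow
    B = {(j, k) for (i, j, k) in S}   # B shadow
    C = {(i, j) for (i, j, k) in S}   # C shadow
    return len(S) ** 2 <= len(A) * len(B) * len(C)

random.seed(0)
cube = [(i, j, k) for i in range(4) for j in range(4) for k in range(4)]
# Equality holds for the full cube: 64^2 == 16 * 16 * 16.
```

The full cube achieves equality, which is why a box of volume M^(3/2) (shadows of area M each) is the extremal shape in the proof.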
Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2=i1:m, … for ik=i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
    access locations indexed by group homomorphisms, e.g.
    φC(i1,i2,…,ik) = (i1+2*i3-i7)
    φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Goal: communication lower bounds and optimal algorithms for any program that looks like this
• Can we bound #loop_iterations (= |S|) given bounds on the #points in its images, i.e. bounds on |φC(S)|, |φA(S)|, …?
General Communication Bound
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Def: Hölder-Brascamp-Lieb LP (HBL-LP) for s1, …, sm:
  for all subgroups H < Z^k,  rank(H) ≤ Σj sj · rank(φj(H))
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
  |S| ≤ Πj |φj(S)|^sj
• Thm: Given a program with array refs given by φj, choose sj to minimize sHBL = Σj sj subject to the HBL-LP. Then
  #words_moved = Ω(#iterations / M^(sHBL−1))
Is this bound attainable?
• But first: Can we write it down?
  • Thm (bad news): HBL-LP reduces to Hilbert’s 10th problem over Q (conjectured to be undecidable)
  • Thm (good news): Another LP with the same solution is decidable
• Depends on loop dependencies
  – Best case: none, or reductions (like matmul)
• Thm: In many cases, the solution x of the dual HBL-LP gives the optimal tiling
  – Ex: For Matmul, x = [1/2, 1/2, 1/2], so the optimal tiling is M^(1/2) x M^(1/2) x M^(1/2)
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
Variable Loop Bounds are More Complicated
• Redundancy in the n-body calculation: f(i,j), f(j,i)
• k-way n-body problems (“k-body”) have even more
[Figure: speedup plot (garbled in extraction); Penporn Koanantakool and K. Yelick]
• Can achieve both communication- and computation- (symmetry exploiting) optimal
04/07/2015 CS267 Lecture 21
Some Applications
• Gravity, Turbulence, Molecular Dynamics, Plasma Simulation, …
• Electron-Beam Lithography Device Simulation
• Hair …
  – www.fxguide.com/featured/brave-new-hair/
  – graphics.pixar.com/library/CurlyHairA/paper.pdf
Is this bound attainable (1/2)?
• But first: Can we write it down?
• Thm (bad news): HBL-LP reduces to Hilbert’s 10th problem over Q (conjectured to be undecidable)
• Thm (good news): Another LP with the same solution is decidable (but expensive, so far)
• Thm (better news): Easy to write down the LP explicitly in many cases of interest (e.g. all φj = {subset of indices})
• Thm (good news): Easy to approximate, i.e. get upper or lower bounds on sHBL
  • If you miss a constraint, the lower bound may be too large (i.e. sHBL too small) but still worth trying to attain
  • Tarski-decidable to get a superset of constraints (may get sHBL too large)
Is this bound attainable (2/2)?
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = {subset of indices}, the dual of the HBL-LP gives optimal tile sizes:
  HBL-LP:      minimize 1ᵀ·s  s.t.  sᵀ·Δ ≥ 1ᵀ
  Dual-HBL-LP: maximize 1ᵀ·x  s.t.  Δ·x ≤ 1
  Then for a sequential algorithm, tile ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
Ongoing Work
• Implement/improve algorithms to generate lower bounds and optimal algorithms
• Accelerate decision procedure for lower bounds
  – Ex: at most 3 arrays, or 4 loop nests
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Extend “perfect scaling” results for time and energy by using extra memory – “n.5D algorithms”
• Incorporate into compilers
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such “Krylov Subspace Methods”
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any “well-partitioned” A
• Sequential algorithm: sweep through the dependency region in steps 1–4, one block of columns at a time
• Parallel algorithm (Proc 1–4): each processor works on an (overlapping) trapezoid of the region; each processor communicates once with its neighbors
[Figure: entries 1 … 32 of x, A·x, A²·x, A³·x with their dependencies, shown for sequential steps 1–4 and for parallel processors 1–4]
The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] on a general matrix (nearest k neighbors on a graph)
• Same idea for general sparse matrices: k-wide neighboring region
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
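The parallel idea can be sketched for the tridiagonal example: each of p chunks fetches k ghost entries per side once, then computes its whole trapezoid locally. The function names and the boundary handling below are mine, assuming a constant stencil and n divisible by p.

```python
def tridiag_apply(x, lo, d, hi):
    """y = A*x for tridiagonal A with constant stencil (lo, d, hi)."""
    n = len(x)
    return [(x[i - 1] * lo if i > 0 else 0.0) + x[i] * d +
            (x[i + 1] * hi if i < n - 1 else 0.0) for i in range(n)]

def matrix_powers(x, lo, d, hi, k, p):
    """[A x, ..., A^k x] with each of p chunks communicating once:
    grab k ghost entries per side up front, then work on the local trapezoid."""
    n = len(x)
    chunk = n // p
    out = [[0.0] * n for _ in range(k)]
    for proc in range(p):
        s, e = proc * chunk, (proc + 1) * chunk
        gs, ge = max(0, s - k), min(n, e + k)      # the single ghost exchange
        local = {g: x[g] for g in range(gs, ge)}
        for lev in range(k):
            new = {}
            for g in range(gs, ge):
                if g not in local:
                    continue
                left, right = local.get(g - 1), local.get(g + 1)
                ok_l = (g == 0) or (left is not None)
                ok_r = (g == n - 1) or (right is not None)
                if ok_l and ok_r:   # computable inside the shrinking trapezoid
                    new[g] = ((left or 0.0) * lo + local[g] * d
                              + (right or 0.0) * hi)
            for g in range(s, e):   # owned entries stay valid for all k levels
                out[lev][g] = new[g]
            local = new
    return out
```

The valid window shrinks by one entry per side at each level, which is exactly why k ghost entries per side suffice for k powers: the redundant flops in the overlap buy the factor-k reduction in messages.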
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A·v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  end for
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)            … “Tall Skinny QR”
  build H from R
  solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from the power method, precision lost!
Matrix Powers Kernel + TSQR in GMRES
[Figure: relative norm of residual ||Ax−b|| (10^-5 to 10^0) vs. iteration count (0 to 1000), for Original GMRES, CA-GMRES (Monomial basis), and CA-GMRES (Newton basis)]
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires Co-tuning Kernels
CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis:              Naive    Monomial                         Newton       Chebyshev
  Replacement Its.:   74 (1)   [7,15,24,31,…,92,97,103] (17)    [67,98] (2)  68 (1)
Speedups for GMG with CA-KSM Bottom Solve

• Compared BICGSTAB vs. CA-BICGSTAB with s = 4 (monomial basis)
• Hopper at NERSC (Cray XE6), weak scaling: up to 4096 MPI processes (1 per chip, 24,576 cores total)
• Speedups for miniGMG benchmark (HPGMG benchmark predecessor)
  – 4.2x in bottom solve, 2.5x in overall GMG solve
• Implemented as a solver option in BoxLib and CHOMBO AMR frameworks
• Speedups for two BoxLib applications:
  – 3D LMC (a low-Mach-number combustion code): 2.5x in bottom solve, 1.5x in overall GMG solve
  – 3D Nyx (an N-body and gas dynamics code): 2x in bottom solve, 1.15x in overall GMG solve
Communication-Avoiding Machine Learning

• Kernels: dot products and axpys; with blocking, also GEMM

Communication-Avoiding Coordinate Descent

CA-CD algorithm – until convergence do:
  1. Randomly select s data points
  2. Compute Gram matrix
  3. Solve minimization problem for all data points
  4. Update solution vector

• We expect the 1st (flops) term to dominate
• MPI: choose s to balance cost; Spark: choose large s to minimize rounds
• Numerically stable for (very) large s; parallel implementations in progress
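The four-step loop above can be sketched in code. This is an illustrative stand-in, not the talk's CA-CD: it runs block coordinate descent on a plain least-squares objective (the talk targets ML objectives), and all names and sizes are assumptions.

```python
import numpy as np

def block_cd_lstsq(A, b, s=5, iters=500, seed=0):
    """Block coordinate descent on min ||Ax - b||^2, s coordinates per round.

    Each round: (1) randomly select s coordinates, (2) form the s-by-s Gram
    matrix of those columns, (3) solve the small subproblem exactly, and
    (4) update the solution vector. Grouping s updates per round replaces
    many separate dot products with one GEMM-sized step, cutting latency."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    r = b.copy()                                    # residual b - A x
    for _ in range(iters):
        S = rng.choice(n, size=s, replace=False)    # 1. select s coordinates
        As = A[:, S]
        G = As.T @ As                               # 2. Gram matrix (GEMM)
        delta = np.linalg.solve(G, As.T @ r)        # 3. small s-by-s solve
        x[S] += delta                               # 4. update solution
        r -= As @ delta                             #    and residual
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
x = block_cd_lstsq(A, b)
```

For a well-conditioned A this converges to the least-squares solution; the communication-avoiding trade-off is the choice of s, exactly as in the MPI vs. Spark bullet above.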
[Convergence plot: relative error (log scale) vs. iterations H (up to 10⁶), comparing DCD with CA-DCD for s = 500, 1000, and 2000.]
Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning (e.g., Hierarchically Semiseparable Matrices)
  – Autotuning and synthesis
  – Different kinds of “sparse matrices”
Outline

• Survey state of the art of CA (Comm-Avoiding) algorithms
  – Review previous Matmul algorithms
  – CA O(n³) 2.5D Matmul and LU
  – TSQR: Tall-Skinny QR
  – CA Strassen Matmul
• Beyond linear algebra
  – Extending lower bounds to any algorithm with arrays
  – Communication-optimal N-body algorithm
• CA-Krylov methods
• Related topics
  – Write-Avoiding Algorithms
  – Reproducibility
Write-Avoiding Algorithms

• What if writes are more expensive than reads?
  – Nonvolatile memory (Flash, PCM, …)
  – Saving intermediates to disk in the cloud (e.g. Spark)
  – Extra coherency traffic in shared memory
• Can we design “write-avoiding (WA)” algorithms?
  – Goal: find and attain a better lower bound for writes
  – Thm: For classical matmul, it is possible to do asymptotically fewer writes than reads to a given layer of the memory hierarchy
  – Thm: Cache-oblivious algorithms cannot be write-avoiding
  – Thm: Strassen and FFT cannot be write-avoiding
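The first theorem's flavor can be made concrete with a counting sketch (an illustration, not the proof): in classically blocked matmul with the C block held resident in fast memory, each C entry is written back to slow memory once (n² writes) while A and B blocks are re-read for every (i, j) pair (≈ 2n³/b reads for block size b). A hypothetical traffic counter:

```python
def blocked_matmul_traffic(n, b):
    """Slow-memory words moved by blocked matmul C = A*B (n divisible by b).

    The C[i][j] block stays in fast memory for the entire k loop, so it is
    written back exactly once; A and B blocks are re-fetched every round."""
    reads = writes = 0
    nb = n // b
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                reads += 2 * b * b    # fetch blocks A[i][k] and B[k][j]
            writes += b * b           # write block C[i][j] back once
    return reads, writes

reads, writes = blocked_matmul_traffic(1024, 64)
# reads = 2*n^3/b grows with the problem; writes = n^2 stays at the minimum
```

With b as large as fast memory allows, writes stay at the n² minimum while reads scale as n³/b, so writes are asymptotically fewer than reads, as the theorem asserts for a given memory level.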
Measured L3–DRAM traffic on Intel Nehalem Xeon-7560 (optimal #DRAM reads = O(n³/M^(1/2)), optimal #DRAM writes = n²). Event counts vs. problem size, for three implementations:

Cache-Oblivious Matmul (blocking – L3: CO, L2: CO, L1: CO): #DRAM reads close to optimal, #DRAM writes much larger
  size:                  128    256    512    1K   2K   4K    8K    16K    32K
  L3_VICTIMS.M           1.9    2.1    1.8    2.3  4.8  9.8   19.5  39.6   78.5
  L3_VICTIMS.E           0.4    0.8    1.6    4.2  8.8  17.9  36.6  75.4   147.5
  LLC_S_FILLS.E          2.4    2.8    3.8    6.9  14   28.1  56.5  115.5  226.6
  Misses on ideal cache  2.512  3.024  4.048  6    12   24    48    96     192
  Write lower bound      2      2      2      2    2    2     2     2      2

Write-Avoiding Matmul (blocking – L3: 700, L2: MKL, L1: MKL): total L3 misses close to optimal, total DRAM writes much larger
  size:                  128    256    512    1K   2K    4K    8K    16K    32K
  L3_VICTIMS.M           2      2      1.9    2    2.3   2.7   3     3.6    4.4
  L3_VICTIMS.E           0.4    0.9    1.9    4.2  11.8  25.3  50.7  101.7  203.2
  LLC_S_FILLS.E          2.5    3      4.1    6.5  14.5  28.3  54    105.7  208.2
  Write lower bound      2      2      2      2    2     2     2     2      2

Intel MKL Matmul (blocking – L3: MKL, L2: MKL, L1: MKL): #DRAM reads > 2x optimal, #DRAM writes much larger
  size:                  128    256    512    1K   2K    4K    8K     16K    32K
  L3_VICTIMS.M           2.1    2      4.1    8.4  17    34.2  68.5   137.2  274.4
  L3_VICTIMS.E           0.8    1.3    2.7    5.3  10.7  21.6  43.5   86.5   172.9
  LLC_S_FILLS.E          2.9    3.6    7      14   27.9  56    112.3  224.1  447.8
  Misses on ideal cache  2.512  3.024  4.048  6    12    24    48     96     192
  Write lower bound      2      2      2      2    2     2     2      2      2
Reproducibility

• Want bit-wise identical results from different runs of the same program on the same machine, possibly with different available resources
  – Needed for testing, debugging, correctness
  – Requested by users (e.g. FEM, climate modeling, …)
• Hard just for summation, since floating point arithmetic is not associative because of roundoff
  – (1e-16 + 1) – 1 ≠ 1e-16 + (1 – 1)
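The non-associativity example can be checked directly in any IEEE 754 double arithmetic, e.g. in Python:

```python
# Floating point addition is not associative: 1e-16 is below half an ulp of
# 1.0, so it vanishes when added to 1.0 first, but survives on its own.
left = (1e-16 + 1) - 1    # 1e-16 + 1 rounds to exactly 1.0, so this is 0.0
right = 1e-16 + (1 - 1)   # 1 - 1 is exactly 0.0, so this is 1e-16
print(left, right, left == right)
```

This is why the sum of the same numbers in a different order (different #processors, different reduction tree) can give bitwise different answers.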
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), who otherwise don’t believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: “What?! How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Reproducible Floating Point Computation

Intel MKL non-reproducibility: [Histograms of dot-product variation across thread counts – absolute error for random vectors (variation up to ~50 ulps) and relative error for orthogonal vectors (relative error up to 2: results of the same magnitude but opposite signs, so even the sign is not reproducible).]

Experiment: vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

• Intel MKL 11.2 with CBWR: reproducible only for a fixed number of threads and using the same instruction set (SSE2/AVX)
Goals/Approaches for Reproducible Sum

• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Our approach (Nguyen, Ahrens, D.), oversimplified:
    1. Compute A = maximum absolute value (reproducible)
    2. Round all summands to one ulp in A
    3. Compute the sum; the rounding makes the summands act like fixed point numbers
  – Possible with one pass over the data and one reduction operation, using just a 6-word “reproducible accumulator”
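The three oversimplified steps can be modeled in a few lines of Python. This is a toy, not ReproBLAS: it uses exact big-integer arithmetic for the pre-rounded summands instead of the real 6-word accumulator, and the `bits` parameter (the user-chosen accuracy) is an assumption.

```python
import math

def reproducible_sum(xs, bits=40):
    """Order-independent float sum via pre-rounding, per the 3 steps above.

    1. Compute M = maximum absolute value (the same for any order).
    2. Round every summand to a multiple of one "ulp" relative to M.
    3. Sum: the rounded summands behave like fixed point numbers, so the
       total is exact and bitwise identical for any summation order."""
    m = max((abs(x) for x in xs), default=0.0)
    if m == 0.0:
        return 0.0
    quantum = math.ldexp(1.0, math.frexp(m)[1] - bits)   # M's exponent - bits
    total = sum(int(round(x / quantum)) for x in xs)     # exact integer sum
    return total * quantum

xs = [0.1 * i for i in range(1, 1000)]
s1 = reproducible_sum(xs)
s2 = reproducible_sum(list(reversed(xs)))   # any order: bitwise identical
```

Each summand loses at most one quantum (≈ M·2⁻ᵇⁱᵗˢ) to the pre-rounding, which is the accuracy/reproducibility trade-off the user controls.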
[Bar chart: time per algorithm (dasum, Algo 3, Algo 2, Algo 6), normalized by dasum time, on 1 to 1024 processors; each bar broken into local sum, absolute max, communication, and all-reduce components, with per-bar slowdowns ranging from about 1.2 to 5.0.]

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M
Reproducible Software and Hardware

• Software for Reproducible BLAS1 available at bebop.cs.berkeley.edu/reproblas
  – BLAS2, 3 under development
• Used Chisel (a hardware design language) to design one new instruction that would make reproducible summation as fast as (and more accurate than) standard summation
• IEEE 754 Floating Point Standards Committee is considering adding a new instruction that would accelerate this
• BLAS Standard Committee is also considering adding these
• Industrial interest
For more details

• bebop.cs.berkeley.edu
  – 155-page survey in Acta Numerica
• CS267 – Berkeley’s Parallel Computing Course
  – Live broadcast in Spring 2016: www.cs.berkeley.edu/~demmel (all slides and video available)
  – Available as a SPOC from XSEDE: www.xsede.org
    • Free supercomputer accounts to do homework
    • Free autograding of homework
Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of FASTMath, ParLab, ASPIRE, BEBOP, CACHE, EASI, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)