TRANSCRIPT
Communication-Avoiding Algorithms for Linear Algebra and Beyond
Jim Demmel, EECS & Math Departments
UC Berkeley
Why avoid communication? (1/3)
Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)
[Figure: sequential case, a CPU–Cache–DRAM hierarchy; parallel case, CPU–DRAM nodes connected over a network]
Why avoid communication? (2/3)
• Running time of an algorithm is the sum of 3 terms:
  – #flops * time_per_flop
  – #words moved / bandwidth
  – #messages * latency
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
• Goal: reorganize algorithms to avoid communication
  • Between all memory hierarchy levels
  • L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Very large speedups possible
• Energy savings too!
Annual improvements:
            Time_per_flop   Bandwidth   Latency
  (flops)       59%
  Network                      26%        15%
  DRAM                         23%         5%
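The three-term running-time model above can be sketched directly; the function name and the sample machine parameters below are illustrative, not measured values from the talk.

```python
def run_time(flops, words, messages, time_per_flop, bandwidth, latency):
    """Running time = #flops * time_per_flop + #words moved / bandwidth
    + #messages * latency (the three-term cost model)."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# Toy numbers: with a fast FPU, the words/bandwidth and messages*latency
# terms can dominate even when the flop count is a thousand times larger.
t = run_time(flops=1e9, words=1e6, messages=1e3,
             time_per_flop=1e-9, bandwidth=1e9, latency=1e-5)
```

Because the three terms simply add, halving communication helps exactly as much as its share of the total, which is why the growing gaps make avoiding communication the priority.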
Why Minimize Communication? (3/3)
[Figure: energy cost per operation in picojoules (log scale, 1 to 10000), now vs. projected 2018. Source: John Shalf, LBL]
Minimize communication to save energy
Goals
• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels
  • L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible
Sample Speedups
• Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
• Up to 3x faster for tensor contractions on 2K core Cray XE6
• Up to 6.2x faster for APSP on 24K core Cray XE6
• Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K core IBM BG/P
• Up to 13x faster for TSQR on Tesla C2050 Fermi NVIDIA GPU
• Up to 6.7x faster for symeig (band A) on 10 core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K core Cray XT4
• Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)
  – 2.5x / 1.5x for combustion simulation code
“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.” FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:
CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); “Tall-Skinny” QR (Grigori, Hoemmen, Langou, JD)
Outline
• Survey state of the art of CA (Comm-Avoiding) algorithms
  – Review previous Matmul algorithms
  – CA O(n^3) 2.5D Matmul and LU
  – TSQR: Tall-Skinny QR
  – CA Strassen Matmul
• Beyond linear algebra
  – Extending lower bounds to any algorithm with arrays
  – Communication-optimal N-body algorithm
• CA-Krylov methods
• Related Topics
Summary of CA Linear Algebra
• “Direct” Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax=λx, SVD, etc.
  • Mostly not attained by algorithms in standard libraries
  • New algorithms that attain these lower bounds
    • Being added to libraries: Sca/LAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for “Iterative” Linear Algebra
Lower bound for all “n^3-like” linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = “fast” memory size (per processor)
  #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
  #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
  (since #messages_sent ≥ #words_moved / largest_message_size)
• Parallel case: assume either load or memory balanced
SIAM SIAG / Linear Algebra Prize, 2012: Ballard, Demmel, Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra, APSP
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too!)
• Only a few sparse algorithms so far
  – Ex: Matmul of “random” sparse matrices
  – Ex: Sparse Cholesky of matrices with “large” separators
• Lots of work in progress
Naïve Matrix Multiply {implements C = C + A*B}

for i = 1 to n
  {read row i of A into fast memory}            … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              … n^2 reads altogether
    {read column j of B into fast memory}       … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}      … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    … b^2 × (n/b)^2 = n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
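The tiled loop above can be made executable in a few lines of plain Python; the word counter tallies exactly the block reads/writes annotated in the pseudocode (a sketch, assuming n is divisible by b).

```python
def naive_matmul(A, B):
    """Reference triple loop for checking results."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def blocked_matmul(A, B, b):
    """Tiled C = A*B over plain lists; counts words moved between slow
    and fast memory, assuming 3 b-by-b blocks fit in fast memory."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += b * b                 # read block C(i,j)
            for k in range(0, n, b):
                words += 2 * b * b         # read blocks A(i,k) and B(k,j)
                for ii in range(i, i + b):
                    for jj in range(j, j + b):
                        C[ii][jj] += sum(A[ii][kk] * B[kk][jj]
                                         for kk in range(k, k + b))
            words += b * b                 # write block C(i,j)
    return C, words                        # words = 2n^3/b + 2n^2
```

Running it confirms the count 2n^3/b + 2n^2 from the slide while producing the same C as the naïve loop.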
Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 3^(1/2) n^3 / M^(1/2) + 2n^2
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don’t know M?
• Or if there are multiple levels of fast memory?
• Can use “Cache Oblivious” algorithm
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
          = [A11·B11 + A12·B21   A11·B12 + A12·B22;
             A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return
Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order
W(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O(n^3 / M^(1/2) + n^2) … same as blocked matmul

“Cache oblivious”: works for memory hierarchies, but not a panacea.
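The RMM recursion above can be sketched in Python with plain lists (the helper names `quad` and `add` are mine, not from the slide):

```python
def rmm(A, B):
    """Recursive matrix multiply C = A*B for n-by-n matrices, n a power of 2.
    Same 2n^3 operations as the usual loop, in a cache-oblivious order."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]  # h-by-h submatrix
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(rmm(A11, B11), rmm(A12, B21))
    C12 = add(rmm(A11, B12), rmm(A12, B22))
    C21 = add(rmm(A21, B11), rmm(A22, B21))
    C22 = add(rmm(A21, B12), rmm(A22, B22))
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

No block size appears anywhere in the code: that is the "cache oblivious" property, since the recursion eventually produces subproblems that fit in every level of the hierarchy.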
For big speedups, see SC12 poster on “Beating MKL and ScaLAPACK at Rectangular Matmul”, 11/13 at 5:15-7pm
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid
(nearly) optimal using minimum memory M = O(n^2/P)

For k = 0 to n/b-1   … b = block size = #cols in A(i,k) = #rows in B(k,j)
  for all i = 1 to P^(1/2)   … in parallel
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)   … in parallel
    owner of B(k,j) broadcasts it to whole processor column (using binary tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: processor grid; A(i,k) broadcast along processor row i, B(k,j) broadcast along processor column j, accumulating into C(i,j)]
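A serial simulation of the SUMMA ownership pattern makes the broadcast structure concrete (a sketch, not MPI; one iteration of the outer loop per broadcast step, with the block size equal to n/p):

```python
def summa_sim(A, B, p):
    """Simulate SUMMA on a p-by-p grid: at step k the owners of block
    column k of A and block row k of B 'broadcast', and every processor
    (i,j) updates its own block of C."""
    n = len(A)
    b = n // p
    blk = lambda M, i, j: [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
    C = [[0.0] * n for _ in range(n)]
    for k in range(p):
        Acol = [blk(A, i, k) for i in range(p)]   # broadcast along each processor row
        Brow = [blk(B, k, j) for j in range(p)]   # broadcast along each processor column
        for i in range(p):                        # every processor updates its C block
            for j in range(p):
                for ii in range(b):
                    for jj in range(b):
                        C[i * b + ii][j * b + jj] += sum(
                            Acol[i][ii][kk] * Brow[j][kk][jj] for kk in range(b))
    return C
```

Each "processor" only ever touches its own C block plus the two broadcast blocks, which is why the per-processor memory stays at O(n^2/P).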
Using more than the minimum memory
• What if the matrix is small enough to fit c > 1 copies, so M = cn^2/P?
  – #words_moved = Ω(#flops / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2)))
  – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain the new lower bound?
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^(1/2)) = Ω(n^2 / P^(1/2))
  #messages = Ω((n^3/P) / M^(3/2)) = Ω(P^(1/2))
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, sym. and nonsym. eigenproblems, SVD
Can we do better?
• Aren’t we already optimal?
• Why assume M = O(n^2/p), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
• Special case: “3D Matmul”
  – Uses M = O(n^2/p^(2/3))
  – Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
  – Processors arranged in p^(1/3) x p^(1/3) x p^(1/3) grid
  – Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/p^(1/3) x n/p^(1/3)
• Not always p^(1/3) times as much memory available…
2.5D Matrix Multiplication
• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: example processor grid with P = 32, c = 2]
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so P(i,j,0) owns C(i,j)
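The payoff of the c replicated copies is visible directly in the bandwidth lower bound; the helper below is an illustrative calculator, not code from the talk.

```python
from math import sqrt

def words_lower_bound(n, P, c=1):
    """Per-processor bandwidth lower bound Omega(n^2 / sqrt(c*P)) for matmul
    when each processor holds M = c*n^2/P words (c replicated copies)."""
    return n * n / sqrt(c * P)

# With c = 16 copies the bound, and hence the attainable communication
# volume, drops by a factor of sqrt(16) = 4 relative to 2D (c = 1).
```

This square-root dependence on c is why 2.5D matmul with c = 16 can cut communication so sharply on BG/P (next slide).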
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: percentage of machine peak for matrix multiplication on 16,384 nodes of BG/P, n = 8192 and n = 131072, 2D MM vs 2.5D MM using c = 16 matrix copies; speedups of 12x and 2.7x over 2D]
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: execution time normalized by 2D, broken into computation / idle / communication, for n = 8192 and n = 131072, 2D vs 2.5D with c = 16 copies; 95% reduction in communication; 12x and 2.7x faster]
Distinguished Paper Award, EuroPar ’11 (Solomonik, Demmel); SC ’11 paper by Solomonik, Bhatele, Demmel
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP { n^3/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Extends to N-body, Strassen, …
• Can prove lower bounds on needed network (e.g. 3D torus for matmul)
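The identity T(cP) = T(P)/c can be checked directly from the timing model; the symbol names mirror the slide and the numeric values below are arbitrary.

```python
from math import sqrt

def T(P, n, M, gamma, beta, alpha, m):
    """Timing model T(P) = n^3/P * (gamma + beta/M^(1/2) + alpha/(m*M^(1/2))),
    with the per-processor memory M held fixed as P grows (each added
    processor brings, and uses, its own memory)."""
    return (n ** 3 / P) * (gamma + beta / sqrt(M) + alpha / (m * sqrt(M)))
```

Because every bracketed term depends only on M, which stays fixed, the whole expression scales as 1/P: perfect strong scaling in time.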
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))] = T(P)/c
• E(cP) = cP { n^3/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
2.5D vs 2D LU, With and Without Pivoting
[Figure: time (sec) for LU on 16,384 nodes of BG/P (n = 131,072), broken into compute / idle / communication, for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, CA-pivot 2.5D; 2.5D is 2x faster in both cases]
Thm: Perfect Strong Scaling impossible, because Latency · Bandwidth = Ω(n^2)
TSQR: QR of a Tall, Skinny matrix

W = [W0; W1; W2; W3]
  = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = {Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02}
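The reduction above can be sketched in pure Python: factor each row block, stack the small R factors, factor again. Modified Gram-Schmidt below is a stand-in for the Householder QR used in practice; since an R factor with positive diagonal is unique for a full-rank matrix, the tree produces the same R as a direct QR of W.

```python
def mgs_qr(A):
    """Thin QR of an m-by-n matrix (list of rows, m >= n) by modified
    Gram-Schmidt; R has a positive diagonal for full-rank A."""
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(W, nblocks):
    """R factor of tall-skinny W via one TSQR reduction level."""
    rows = len(W) // nblocks
    stacked = []
    for p in range(nblocks):          # these local QRs are independent (parallel)
        _, R = mgs_qr(W[p * rows:(p + 1) * rows])
        stacked.extend(R)
    _, R = mgs_qr(stacked)            # one reduction combines the local Rs
    return R
```

The local factorizations need no communication at all; only the small n-by-n R factors travel up the tree, which is where the O(log p) message count comes from.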
TSQR: An Architecture-Dependent Algorithm
• Parallel: W = [W0; W1; W2; W3] → {R00, R10, R20, R30} → {R01, R11} → R02 (binary reduction tree)
• Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03 (flat tree, one block at a time)
• Dual core: a hybrid of the two trees
• Can choose reduction tree dynamically
• Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  – Cloud: 1.6x slower than just accessing data twice (Gleich and Benson)
• Sequential
  – “Infinite speedup” for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
SIAG on Supercomputing Best Paper Prize, 2016
Back to LU: Using a similar idea for TSLU as TSQR: Use a reduction tree, to do “Tournament Pivoting”

W (n x b) = [W1; W2; W3; W4]
  W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1’
  W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2’
  W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3’
  W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4’
[W1’; W2’] = P12·L12·U12 → choose b pivot rows, call them W12’
[W3’; W4’] = P34·L34·U34 → choose b pivot rows, call them W34’
[W12’; W34’] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
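A small Python sketch of the tournament: Gaussian elimination with partial pivoting (GEPP) on each block nominates b candidate rows, and the winners play off pairwise up the tree. The helper names and the toy data are mine, and the blocks here are taken contiguously for simplicity.

```python
def gepp_pivots(rows, b):
    """Indices (into rows) of the b pivot rows chosen by Gaussian
    elimination with partial pivoting on the first b columns."""
    M = [r[:b] for r in rows]
    idx = list(range(len(M)))
    for k in range(b):
        p = max(range(k, len(M)), key=lambda i: abs(M[i][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        idx[k], idx[p] = idx[p], idx[k]
        for i in range(k + 1, len(M)):
            f = M[i][k] / M[k][k]
            for j in range(k, b):
                M[i][j] -= f * M[k][j]
    return idx[:b]

def tournament_pivot(W, b, nblocks):
    """Choose b pivot rows of W: one local GEPP per block, then a
    binary reduction tree of pairwise play-offs."""
    rows = len(W) // nblocks
    cand = [[p * rows + i for i in gepp_pivots(W[p * rows:(p + 1) * rows], b)]
            for p in range(nblocks)]          # local selection, independent per block
    while len(cand) > 1:                       # play-offs up the tree
        nxt = []
        for a in range(0, len(cand), 2):
            pair = cand[a] + (cand[a + 1] if a + 1 < len(cand) else [])
            nxt.append([pair[i] for i in gepp_pivots([W[r] for r in pair], b)])
        cand = nxt
    return cand[0]
```

Only the b candidate rows move up the tree at each level, so the pivot search costs O(log p) messages instead of one pivot exchange per column.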
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc)]
Ongoing Work
• Lots more work on
  – Algorithms:
    • BLAS, LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • Sparse matrices, structured matrices
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency (e.g. LU)
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
    • Integration into applications
      – CTF (with ANL): symmetric tensor contractions
Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n^3) matmul:       #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen’s O(n^lg7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be “regular” [and connected]
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA ’11), Ballard, Demmel, Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
Communication Avoiding Parallel Strassen (CAPS)

CAPS:
  If Enough Memory and P ≥ 7
    then BFS step
    else DFS step
  endif

Best way to interleave BFS and DFS is a tuning parameter
[Figure: performance benchmarking, strong scaling plot on Franklin (Cray XT4), n = 94080]
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Recall optimal sequential Matmul
• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• “Blocked” code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)   … b x b matmul
• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:
        i  j  k
  Δ = [ 1  0  1 ]  A
      [ 0  1  1 ]  B
      [ 1  1  0 ]  C
• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = sHBL
• Thm: #words_moved = Ω(n^3 / M^(sHBL−1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:
        i  j
  Δ = [ 1  0 ]  F
      [ 1  0 ]  P(i)
      [ 0  1 ]  P(j)
• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = sHBL
• Thm: #words_moved = Ω(n^2 / M^(sHBL−1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
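The M^1 x M^1 tiling from the LP can be demonstrated on a toy 1-D "force" law; the force function and the load counter below are illustrative, not from the talk.

```python
def blocked_nbody(pos, M):
    """Toy direct n-body with both loops tiled by block size M, as the LP
    solution x = [1, 1] prescribes; counts particle loads into fast
    memory: n (target blocks) + n^2/M (source blocks), i.e. Theta(n^2/M)."""
    n = len(pos)
    F = [0.0] * n
    loads = 0
    for i0 in range(0, n, M):
        loads += M                    # load a block of targets P(i), F(i)
        for j0 in range(0, n, M):
            loads += M                # load a block of sources P(j)
            for i in range(i0, i0 + M):
                for j in range(j0, j0 + M):
                    if i != j:
                        F[i] += pos[j] - pos[i]   # toy pairwise "force"
    return F, loads
```

With M words of fast memory the blocked loop reuses each loaded source block against M targets, matching the Ω(n^2/M) lower bound.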
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:
        i1 i2 i3 i4 i5 i6
  Δ = [ 1  0  1  0  0  1 ]  A1
      [ 1  1  0  1  0  0 ]  A2
      [ 0  1  1  0  1  0 ]  A3
      [ 0  0  1  1  0  1 ]  A3, A4
      [ 0  1  0  0  0  1 ]  A5
      [ 1  0  0  1  1  0 ]  A6
• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = sHBL
• Thm: #words_moved = Ω(n^6 / M^(sHBL−1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
Where do lower and matching upper bounds on communication come from? (1/3)
• Originally for C = A*B by Irony/Tiskin/Toledo (2004)
• Proof idea
  – Suppose we can bound #useful_operations ≤ G doable with data in fast memory of size M
  – So to do F = #total_operations, we need to fill fast memory F/G times, and so #words_moved ≥ M·F/G
• Hard part: finding G
• Attaining the lower bound
  – Need to “block” all operations to perform ~G operations on every chunk of M words of data
Proof of communication lower bound (2/3)
[Figure: the (i,j,k) iteration space of C = A*B (Irony/Tiskin/Toledo 2004); each cube (i,j,k) represents C(i,j) += A(i,k)·B(k,j), projecting onto squares on the “A face”, “B face”, and “C face”]
• If we have at most M “A squares”, M “B squares”, and M “C squares”, how many cubes G can we have?
Proof of communication lower bound (3/3)
[Figure: black box with side lengths x, y and z inside the iteration space, with its “A shadow”, “B shadow”, and “C shadow” projections]

G = # cubes in black box with side lengths x, y and z
  = volume of black box = x·y·z = (xz · zy · yx)^(1/2)
  = (#A□s · #B□s · #C□s)^(1/2) ≤ M^(3/2)

(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949):
G = # cubes in 3D set = volume of 3D set
  ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2) ≤ M^(3/2)

#words_moved = Ω( M·(n^3/P) / M^(3/2) ) = Ω( n^3 / (P·M^(1/2)) )
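The Loomis-Whitney inequality behind this bound is easy to sanity-check by brute force on small 3-D sets; this is an illustration of the statement, not its proof.

```python
import random

def loomis_whitney_holds(S):
    """Check |S| <= sqrt(|A shadow| * |B shadow| * |C shadow|) for S in Z^3,
    where the shadows are the projections onto the (i,k), (j,k), (i,j) planes.
    Compared as squared quantities to stay in exact integer arithmetic."""
    A = {(i, k) for (i, j, k) in S}   # A shadow
    B = {(j, k) for (i, j, k) in S}   # B shadow
    C = {(i, j) for (i, j, k) in S}   # C shadow
    return len(S) ** 2 <= len(A) * len(B) * len(C)

random.seed(0)
cube = [(i, j, k) for i in range(4) for j in range(4) for k in range(4)]
# Equality holds for the full cube: 64^2 == 16 * 16 * 16.
```

The full cube achieves equality, which is why a box of volume M^(3/2) (shadows of area M each) is the extremal shape in the proof.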
Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2=i1:m, … for ik=i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
    access locations indexed by group homomorphisms, e.g.
    φC(i1,i2,…,ik) = (i1+2*i3-i7)
    φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Goal: communication lower bounds and optimal algorithms for any program that looks like this
• Can we bound #loop_iterations (= |S|) given bounds on the #points in its images, i.e. bounds on |φC(S)|, |φA(S)|, …?
General Communication Bound
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Def: Hölder-Brascamp-Lieb LP (HBL-LP) for s1, …, sm:
  for all subgroups H < Z^k,  rank(H) ≤ Σj sj · rank(φj(H))
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the HBL-LP,
  |S| ≤ Πj |φj(S)|^sj
• Thm: Given a program with array refs given by φj, choose sj to minimize sHBL = Σj sj subject to the HBL-LP. Then
  #words_moved = Ω(#iterations / M^(sHBL−1))
Is this bound attainable?
• But first: Can we write it down?
  • Thm (bad news): HBL-LP reduces to Hilbert’s 10th problem over Q (conjectured to be undecidable)
  • Thm (good news): Another LP with the same solution is decidable
• Depends on loop dependencies
  – Best case: none, or reductions (like matmul)
• Thm: In many cases, the solution x of the dual HBL-LP gives the optimal tiling
  – Ex: For Matmul, x = [1/2, 1/2, 1/2], so the optimal tiling is M^(1/2) x M^(1/2) x M^(1/2)
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
11.8x speedup
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik
Variable Loop Bounds are More Complicated
• Redundancy in the n-body calculation: f(i,j), f(j,i)
• k-way n-body problems (“k-body”) have even more
[Figure: speedup plot (garbled in extraction); Penporn Koanantakool and K. Yelick]
• Can achieve both communication- and computation- (symmetry exploiting) optimal
04/07/2015 CS267 Lecture 21
Some Applications
• Gravity, Turbulence, Molecular Dynamics, Plasma Simulation, …
• Electron-Beam Lithography Device Simulation
• Hair …
  – www.fxguide.com/featured/brave-new-hair/
  – graphics.pixar.com/library/CurlyHairA/paper.pdf
Is this bound attainable (1/2)?
• But first: Can we write it down?
• Thm (bad news): HBL-LP reduces to Hilbert’s 10th problem over Q (conjectured to be undecidable)
• Thm (good news): Another LP with the same solution is decidable (but expensive, so far)
• Thm (better news): Easy to write down the LP explicitly in many cases of interest (e.g. all φj = {subset of indices})
• Thm (good news): Easy to approximate, i.e. get upper or lower bounds on sHBL
  • If you miss a constraint, the lower bound may be too large (i.e. sHBL too small) but still worth trying to attain
  • Tarski-decidable to get a superset of constraints (may get sHBL too large)
Is this bound attainable (2/2)?
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all φj = {subset of indices}, the dual of the HBL-LP gives optimal tile sizes:
  HBL-LP:      minimize 1ᵀ·s  s.t.  sᵀ·Δ ≥ 1ᵀ
  Dual-HBL-LP: maximize 1ᵀ·x  s.t.  Δ·x ≤ 1
  Then for a sequential algorithm, tile ij by M^xj
• Ex: Matmul: s = [1/2, 1/2, 1/2]ᵀ = x
• Extends to unimodular transforms of indices
Ongoing Work
• Implement/improve algorithms to generate lower bounds and optimal algorithms
• Accelerate decision procedure for lower bounds
  – Ex: at most 3 arrays, or 4 loop nests
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Extend “perfect scaling” results for time and energy by using extra memory – “n.5D algorithms”
• Incorporate into compilers
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such “Krylov Subspace Methods”
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any “well-partitioned” A
• Sequential algorithm: sweep through the dependency region in steps 1–4, one block of columns at a time
• Parallel algorithm (Proc 1–4): each processor works on an (overlapping) trapezoid of the region; each processor communicates once with its neighbors
[Figure: entries 1 … 32 of x, A·x, A²·x, A³·x with their dependencies, shown for sequential steps 1–4 and for parallel processors 1–4]
The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] on a general matrix (nearest k neighbors on a graph)
• Same idea for general sparse matrices: k-wide neighboring region
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
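The parallel idea can be sketched for the tridiagonal example: each of p chunks fetches k ghost entries per side once, then computes its whole trapezoid locally. The function names and the boundary handling below are mine, assuming a constant stencil and n divisible by p.

```python
def tridiag_apply(x, lo, d, hi):
    """y = A*x for tridiagonal A with constant stencil (lo, d, hi)."""
    n = len(x)
    return [(x[i - 1] * lo if i > 0 else 0.0) + x[i] * d +
            (x[i + 1] * hi if i < n - 1 else 0.0) for i in range(n)]

def matrix_powers(x, lo, d, hi, k, p):
    """[A x, ..., A^k x] with each of p chunks communicating once:
    grab k ghost entries per side up front, then work on the local trapezoid."""
    n = len(x)
    chunk = n // p
    out = [[0.0] * n for _ in range(k)]
    for proc in range(p):
        s, e = proc * chunk, (proc + 1) * chunk
        gs, ge = max(0, s - k), min(n, e + k)      # the single ghost exchange
        local = {g: x[g] for g in range(gs, ge)}
        for lev in range(k):
            new = {}
            for g in range(gs, ge):
                if g not in local:
                    continue
                left, right = local.get(g - 1), local.get(g + 1)
                ok_l = (g == 0) or (left is not None)
                ok_r = (g == n - 1) or (right is not None)
                if ok_l and ok_r:   # computable inside the shrinking trapezoid
                    new[g] = ((left or 0.0) * lo + local[g] * d
                              + (right or 0.0) * hi)
            for g in range(s, e):   # owned entries stay valid for all k levels
                out[lev][g] = new[g]
            local = new
    return out
```

The valid window shrinks by one entry per side at each level, which is exactly why k ghost entries per side suffice for k powers: the redundant flops in the overlap buy the factor-k reduction in messages.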
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂

Standard GMRES:
  for i = 1 to k
    w = A·v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  end for
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)            … “Tall Skinny QR”
  build H from R
  solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from the power method, precision lost!
Matrix Powers Kernel + TSQR in GMRES
[Figure: relative norm of residual ||Ax−b|| (10^-5 to 10^0) vs. iteration count (0 to 1000), for Original GMRES, CA-GMRES (Monomial basis), and CA-GMRES (Newton basis)]
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires Co-tuning Kernels
CA-BiCGStab with Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis:              Naive    Monomial                         Newton       Chebyshev
  Replacement Its.:   74 (1)   [7,15,24,31,…,92,97,103] (17)    [67,98] (2)  68 (1)
Speedups for GMG with CA-KSM Bottom Solve

• Compared BICGSTAB vs. CA-BICGSTAB with s = 4 (monomial basis)
• Hopper at NERSC (Cray XE6), weak scaling: up to 4096 MPI processes (1 per chip, 24,576 cores total)
• Speedups for miniGMG benchmark (HPGMG benchmark predecessor)
  – 4.2x in bottom solve, 2.5x in overall GMG solve
• Implemented as a solver option in BoxLib and CHOMBO AMR frameworks
• Speedups for two BoxLib applications:
  – 3D LMC (a low-Mach-number combustion code): 2.5x in bottom solve, 1.5x in overall GMG solve
  – 3D Nyx (an N-body and gas dynamics code): 2x in bottom solve, 1.15x in overall GMG solve
Communication-Avoiding Machine Learning

• Kernels: dot products and axpys; with blocking, also GEMM

Communication-Avoiding Coordinate Descent

CA-CD algorithm – until convergence do:
  1. Randomly select s data points
  2. Compute Gram matrix
  3. Solve minimization problem for all data points
  4. Update solution vector

• We expect the 1st (flops) term to dominate
• MPI: choose s to balance cost; Spark: choose large s to minimize rounds
• Numerically stable for (very) large s; parallel implementations in progress
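The four-step loop above can be sketched in code. This is an illustrative stand-in, not the talk's CA-CD: it runs block coordinate descent on a plain least-squares objective (the talk targets ML objectives), and all names and sizes are assumptions.

```python
import numpy as np

def block_cd_lstsq(A, b, s=5, iters=500, seed=0):
    """Block coordinate descent on min ||Ax - b||^2, s coordinates per round.

    Each round: (1) randomly select s coordinates, (2) form the s-by-s Gram
    matrix of those columns, (3) solve the small subproblem exactly, and
    (4) update the solution vector. Grouping s updates per round replaces
    many separate dot products with one GEMM-sized step, cutting latency."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    r = b.copy()                                    # residual b - A x
    for _ in range(iters):
        S = rng.choice(n, size=s, replace=False)    # 1. select s coordinates
        As = A[:, S]
        G = As.T @ As                               # 2. Gram matrix (GEMM)
        delta = np.linalg.solve(G, As.T @ r)        # 3. small s-by-s solve
        x[S] += delta                               # 4. update solution
        r -= As @ delta                             #    and residual
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
x = block_cd_lstsq(A, b)
```

For a well-conditioned A this converges to the least-squares solution; the communication-avoiding trade-off is the choice of s, exactly as in the MPI vs. Spark bullet above.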
[Convergence plot: relative error (log scale) vs. iterations H (up to 10⁶), comparing DCD with CA-DCD for s = 500, 1000, and 2000.]
Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning (e.g., Hierarchically Semiseparable Matrices)
  – Autotuning and synthesis
  – Different kinds of “sparse matrices”
Outline

• Survey state of the art of CA (Comm-Avoiding) algorithms
  – Review previous Matmul algorithms
  – CA O(n³) 2.5D Matmul and LU
  – TSQR: Tall-Skinny QR
  – CA Strassen Matmul
• Beyond linear algebra
  – Extending lower bounds to any algorithm with arrays
  – Communication-optimal N-body algorithm
• CA-Krylov methods
• Related topics
  – Write-Avoiding Algorithms
  – Reproducibility
Write-Avoiding Algorithms

• What if writes are more expensive than reads?
  – Nonvolatile memory (Flash, PCM, …)
  – Saving intermediates to disk in the cloud (e.g. Spark)
  – Extra coherency traffic in shared memory
• Can we design “write-avoiding (WA)” algorithms?
  – Goal: find and attain a better lower bound for writes
  – Thm: For classical matmul, it is possible to do asymptotically fewer writes than reads to a given layer of the memory hierarchy
  – Thm: Cache-oblivious algorithms cannot be write-avoiding
  – Thm: Strassen and FFT cannot be write-avoiding
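The first theorem's flavor can be made concrete with a counting sketch (an illustration, not the proof): in classically blocked matmul with the C block held resident in fast memory, each C entry is written back to slow memory once (n² writes) while A and B blocks are re-read for every (i, j) pair (≈ 2n³/b reads for block size b). A hypothetical traffic counter:

```python
def blocked_matmul_traffic(n, b):
    """Slow-memory words moved by blocked matmul C = A*B (n divisible by b).

    The C[i][j] block stays in fast memory for the entire k loop, so it is
    written back exactly once; A and B blocks are re-fetched every round."""
    reads = writes = 0
    nb = n // b
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                reads += 2 * b * b    # fetch blocks A[i][k] and B[k][j]
            writes += b * b           # write block C[i][j] back once
    return reads, writes

reads, writes = blocked_matmul_traffic(1024, 64)
# reads = 2*n^3/b grows with the problem; writes = n^2 stays at the minimum
```

With b as large as fast memory allows, writes stay at the n² minimum while reads scale as n³/b, so writes are asymptotically fewer than reads, as the theorem asserts for a given memory level.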
Measured L3–DRAM traffic on Intel Nehalem Xeon-7560 (optimal #DRAM reads = O(n³/M^(1/2)), optimal #DRAM writes = n²). Event counts vs. problem size, for three implementations:

Cache-Oblivious Matmul (blocking – L3: CO, L2: CO, L1: CO): #DRAM reads close to optimal, #DRAM writes much larger
  size:                  128    256    512    1K   2K   4K    8K    16K    32K
  L3_VICTIMS.M           1.9    2.1    1.8    2.3  4.8  9.8   19.5  39.6   78.5
  L3_VICTIMS.E           0.4    0.8    1.6    4.2  8.8  17.9  36.6  75.4   147.5
  LLC_S_FILLS.E          2.4    2.8    3.8    6.9  14   28.1  56.5  115.5  226.6
  Misses on ideal cache  2.512  3.024  4.048  6    12   24    48    96     192
  Write lower bound      2      2      2      2    2    2     2     2      2

Write-Avoiding Matmul (blocking – L3: 700, L2: MKL, L1: MKL): total L3 misses close to optimal, total DRAM writes much larger
  size:                  128    256    512    1K   2K    4K    8K    16K    32K
  L3_VICTIMS.M           2      2      1.9    2    2.3   2.7   3     3.6    4.4
  L3_VICTIMS.E           0.4    0.9    1.9    4.2  11.8  25.3  50.7  101.7  203.2
  LLC_S_FILLS.E          2.5    3      4.1    6.5  14.5  28.3  54    105.7  208.2
  Write lower bound      2      2      2      2    2     2     2     2      2

Intel MKL Matmul (blocking – L3: MKL, L2: MKL, L1: MKL): #DRAM reads > 2x optimal, #DRAM writes much larger
  size:                  128    256    512    1K   2K    4K    8K     16K    32K
  L3_VICTIMS.M           2.1    2      4.1    8.4  17    34.2  68.5   137.2  274.4
  L3_VICTIMS.E           0.8    1.3    2.7    5.3  10.7  21.6  43.5   86.5   172.9
  LLC_S_FILLS.E          2.9    3.6    7      14   27.9  56    112.3  224.1  447.8
  Misses on ideal cache  2.512  3.024  4.048  6    12    24    48     96     192
  Write lower bound      2      2      2      2    2     2     2      2      2
Reproducibility

• Want bit-wise identical results from different runs of the same program on the same machine, possibly with different available resources
  – Needed for testing, debugging, correctness
  – Requested by users (e.g. FEM, climate modeling, …)
• Hard just for summation, since floating point arithmetic is not associative because of roundoff
  – (1e-16 + 1) – 1 ≠ 1e-16 + (1 – 1)
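The non-associativity example can be checked directly in any IEEE 754 double arithmetic, e.g. in Python:

```python
# Floating point addition is not associative: 1e-16 is below half an ulp of
# 1.0, so it vanishes when added to 1.0 first, but survives on its own.
left = (1e-16 + 1) - 1    # 1e-16 + 1 rounds to exactly 1.0, so this is 0.0
right = 1e-16 + (1 - 1)   # 1 - 1 is exactly 0.0, so this is 1e-16
print(left, right, left == right)
```

This is why the sum of the same numbers in a different order (different #processors, different reduction tree) can give bitwise different answers.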
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), who otherwise don’t believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: “What?! How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Reproducible Floating Point Computation

Intel MKL non-reproducibility: [Histograms of dot-product variation across thread counts – absolute error for random vectors (variation up to ~50 ulps) and relative error for orthogonal vectors (relative error up to 2: results of the same magnitude but opposite signs, so even the sign is not reproducible).]

Experiment: vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

• Intel MKL 11.2 with CBWR: reproducible only for a fixed number of threads and using the same instruction set (SSE2/AVX)
Goals/Approaches for Reproducible Sum

• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Our approach (Nguyen, Ahrens, D.), oversimplified:
    1. Compute A = maximum absolute value (reproducible)
    2. Round all summands to one ulp in A
    3. Compute the sum; the rounding makes the summands act like fixed point numbers
  – Possible with one pass over the data and one reduction operation, using just a 6-word “reproducible accumulator”
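The three oversimplified steps can be modeled in a few lines of Python. This is a toy, not ReproBLAS: it uses exact big-integer arithmetic for the pre-rounded summands instead of the real 6-word accumulator, and the `bits` parameter (the user-chosen accuracy) is an assumption.

```python
import math

def reproducible_sum(xs, bits=40):
    """Order-independent float sum via pre-rounding, per the 3 steps above.

    1. Compute M = maximum absolute value (the same for any order).
    2. Round every summand to a multiple of one "ulp" relative to M.
    3. Sum: the rounded summands behave like fixed point numbers, so the
       total is exact and bitwise identical for any summation order."""
    m = max((abs(x) for x in xs), default=0.0)
    if m == 0.0:
        return 0.0
    quantum = math.ldexp(1.0, math.frexp(m)[1] - bits)   # M's exponent - bits
    total = sum(int(round(x / quantum)) for x in xs)     # exact integer sum
    return total * quantum

xs = [0.1 * i for i in range(1, 1000)]
s1 = reproducible_sum(xs)
s2 = reproducible_sum(list(reversed(xs)))   # any order: bitwise identical
```

Each summand loses at most one quantum (≈ M·2⁻ᵇⁱᵗˢ) to the pre-rounding, which is the accuracy/reproducibility trade-off the user controls.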
[Bar chart: time per algorithm (dasum, Algo 3, Algo 2, Algo 6), normalized by dasum time, on 1 to 1024 processors; each bar broken into local sum, absolute max, communication, and all-reduce components, with per-bar slowdowns ranging from about 1.2 to 5.0.]

Performance results on a 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M
Reproducible Software and Hardware

• Software for Reproducible BLAS1 available at bebop.cs.berkeley.edu/reproblas
  – BLAS2, 3 under development
• Used Chisel (a hardware design language) to design one new instruction that would make reproducible summation as fast as (and more accurate than) standard summation
• IEEE 754 Floating Point Standards Committee is considering adding a new instruction that would accelerate this
• BLAS Standard Committee is also considering adding these
• Industrial interest
For more details

• bebop.cs.berkeley.edu
  – 155-page survey in Acta Numerica
• CS267 – Berkeley’s Parallel Computing Course
  – Live broadcast in Spring 2016: www.cs.berkeley.edu/~demmel (all slides and video available)
  – Available as a SPOC from XSEDE: www.xsede.org
    • Free supercomputer accounts to do homework
    • Free autograding of homework
Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of FASTMath, ParLab, ASPIRE, BEBOP, CACHE, EASI, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)