communicaon-avoiding algorithms for linear algebra and beyond

Communica)on-AvoidingAlgorithmsforLinearAlgebraandBeyond

JimDemmelEECS&MathDepartments

UCBerkeley

2

Whyavoidcommunica)on?(1/3)Algorithmshavetwocosts(measuredin)meorenergy):1. Arithme)c(FLOPS)

2. Communica)on:movingdatabetween–  levelsofamemoryhierarchy(sequen)alcase)

–  processorsoveranetwork(parallelcase).

CPUCache

DRAM

CPUDRAM

CPUDRAM

CPUDRAM

CPUDRAM

Whyavoidcommunica)on?(2/3)•  Running)meofanalgorithmissumof3terms:

–  #flops*)me_per_flop–  #wordsmoved/bandwidth–  #messages*latency

3

communica)on

•  Time_per_flop<<1/bandwidth<<latency

•  Gapsgrowingexponen)allywith)me[FOSC]

•  Avoidcommunica)ontosave)me

•  Goal:reorganizealgorithmstoavoidcommunica)on

Annualimprovements

Time_per_flop Bandwidth Latency

Network 26% 15%

DRAM 23% 5%59%

WhyMinimizeCommunica)on?(3/3)

PicoJoules

Source:JohnShalf,LBL

WhyMinimizeCommunica)on?(3/3)

PicoJoules

Source:JohnShalf,LBL

Minimizecommunica)ontosaveenergy

Goals

6

•  Redesignalgorithmstoavoidcommunica)on•  Betweenallmemoryhierarchylevels

•  L1L2DRAMnetwork,etc

• Afainlowerboundsifpossible•  Currentalgorithmsogenfarfromlowerbounds

•  Largespeedupsandenergysavingspossible

SampleSpeedups•  Up to 12x faster for 2.5D matmul on 64K core IBM BG/P

•  Up to 3x faster for tensor contractions on 2K core Cray XE/6

•  Up to 6.2x faster for APSP on 24K core Cray CE6

•  Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P

•  Up to 11.8x faster for direct N-body on 32K core IBM BG/P

•  Up to 13x faster for TSQR on Tesla C2050 Fermi NVIDIA GPU •  Up to 6.7x faster for symeig (band A) on 10 core Intel Westmere

•  Up to 2x faster for 2.5D Strassen on 38K core Cray XT4

•  Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)

–  2.5x / 1.5x for combustion simulation code

7

Outline

•  SurveystateoftheartofCA(Comm-Avoiding)algorithms–  ReviewpreviousMatmulalgorithms–  CAO(n3)2.5DMatmul–  TSQR:Tall-SkinnyQR

•  Beyondlinearalgebra–  Extendinglowerboundstoanyalgorithmwitharrays–  Communica)on-op)malN-bodyalgorithm

•  CA-Krylovmethods•  RelatedTopics

SummaryofCALinearAlgebra•  “Direct”LinearAlgebra

•  Lowerboundsoncommunica)onforlinearalgebraproblemslikeAx=b,leastsquares,Ax=λx,SVD,etc

•  Mostlynotafainedbyalgorithmsinstandardlibraries

•  Newalgorithmsthatafaintheselowerbounds• Beingaddedtolibraries:Sca/LAPACK,PLASMA,MAGMA

•  Largespeed-upspossible• Autotuningtofindop)malimplementa)on

•  Difofor“Itera)ve”LinearAlgebra

Lowerboundforall“n3-like”linearalgebra

•  Holdsfor– Matmul,BLAS,LU,QR,eig,SVD,tensorcontrac)ons,…–  Somewholeprograms(sequencesoftheseopera)ons,nomaferhowindividualopsareinterleaved,egAk)

–  Denseandsparsematrices(where#flops<<n3)–  Sequen)alandparallelalgorithms–  Somegraph-theore)calgorithms(egFloyd-Warshall)

11

• LetM=“fast”memorysize(perprocessor)

#words_moved(perprocessor)=Ω(#flops(perprocessor)/M1/2)

#messages_sent(perprocessor)=Ω(#flops(perprocessor)/M3/2)

• Parallelcase:assumeeitherloadormemorybalanced




12



#messages_sent≥#words_moved/largest_message_size





13



#messages_sent(perprocessor)=Ω(#flops(perprocessor)/M3/2)


SIAMSIAG/LinearAlgebraPrize,2012Ballard,D.,Holtz,Schwartz

Canweafaintheselowerbounds?•  Doconven)onaldensealgorithmsasimplementedinLAPACKandScaLAPACKafainthesebounds?– Ogennot

•  Ifnot,arethereotheralgorithmsthatdo?–  Yes,formuchofdenselinearalgebra,APSP– Newalgorithms,withnewnumericalproper)es,newwaystoencodeanswers,newdatastructures

– Notjustlooptransforma)ons(needthosetoo!)

•  Onlyafewsparsealgorithmssofar–  Ex:SparseCholeskyofmatriceswith“large”separators–  Ex:Sparse*Densematrixmul)ply

•  Lotsofworkinprogress14

Outline




ReviewofMatmulalgorithms

•  Naïvesequen)alalgorithm–  fori=1:n,forj=1:n,fork=1:n,C(i,j)+=A(i,k)*B(k,j)–  Pessimal:O(n3)wordsmovedCache!DRAM

•  Blockedsequen)alalgorithm–  Fori=1:n/b,forj=1:n/b,fork=1:n/b,C[i,j]+=A[i,k]*B[k,j]– A[i,k]*B[k,j]isab-by-bblockmatrixmul)ply–  Ifb~M1/2,M=Cachesize,thenafainslowerbound:O(n3/M1/2)wordsmoved

–  Ifdon’tknowM,ormul)plecachelevels,cacheoblivious(recursive)algorithmworks

17

SUMMA–nxnmatmulonP1/2xP1/2grid(nearly)op)malusingminimummemoryM=O(n2/P)

For k=0 to n/b-1 … b = block size = #cols in A(i,k) = #rows in B(k,j) for all i = 1 to P1/2

* = i

j

A(i,k)

kkB(k,j)

C(i,j)

Brow

Acol

Summaryofdenseparallelalgorithmsafainingcommunica)onlowerbounds

•  AssumenxnmatricesonPprocessors•  MinimumMemoryperprocessor=M=O(n2/P)•  Recalllowerbounds:

#words_moved=Ω((n3/P)/M1/2)=Ω(n2/P1/2)#messages=Ω((n3/P)/M3/2)=Ω(P1/2)

•  DoesScaLAPACKafainthesebounds?•  For#words_moved:mostly,exceptnonsym.Eigenproblem•  For#messages:asympto)callyworse,exceptCholesky

•  Newalgorithmsafainallbounds,uptopolylog(P)factors•  Cholesky,LU,QR,Sym.andNonsymeigenproblems,SVD

•  Needtoreplacepar)alpivo)nginLU•  Needrandomiza)onforNonsymeigenproblem(sofar)

CanwedoBefer?

Canwedobefer?

•  Aren’twealreadyop)mal?•  WhyassumeM=O(n2/p),i.e.minimal?

–  Lowerbounds)lltrueifmorememory–  Canweafainit?

•  Specialcase:“3DMatmul”– UsesM=O(n2/p2/3)– Dekel,Nassimi,Sahni[81],Bernsten[89],Agarwal,Chandra,Snir[90],Johnson[93],Agarwal,Balle,Gustavson,Joshi,Palkar[95]

•  Notalwaysp1/3)mesasmuchmemoryavailable…

Outline




2.5DMatrixMul)plica)on

•  Assumecanfitcn2/Pdataperprocessor,c>1•  Processorsform(P/c)1/2x(P/c)1/2xcgrid

c

(P/c)1/2

Example:P=32,c=2

2.5DMatrixMul)plica)on

•  Assumecanfitcn2/Pdataperprocessor,c>1•  Processorsform(P/c)1/2x(P/c)1/2xcgrid

k

j

Ini)allyP(i,j,0)ownsA(i,j)andB(i,j)eachofsizen(c/P)1/2xn(c/P)1/2

(1)P(i,j,0)broadcastsA(i,j)andB(i,j)toP(i,j,k)(2)Processorsatlevelkperform1/c-thofSUMMA,i.e.1/c-thofΣmA(i,m)*B(m,j)(3)Sum-reducepar)alsumsΣmA(i,m)*B(m,j)alongk-axissoP(i,j,0)ownsC(i,j)

2.5DMatmulonBG/P,16Knodes/64Kcores

0

20

40

60

80

100

8192 131072

Pe

rce

nta

ge

of

ma

chin

e p

ea

k

n

Matrix multiplication on 16,384 nodes of BG/P

12X faster

2.7X faster

Using c=16 matrix copies

2D MM2.5D MM

2.5DMatmulonBG/P,16Knodes/64Kcores

0

0.2

0.4

0.6

0.8

1

1.2

1.4

n=8192, 2D

n=8192, 2.5D

n=131072, 2D

n=131072, 2.5D

Exe

cutio

n tim

e n

orm

aliz

ed b

y 2D

Matrix multiplication on 16,384 nodes of BG/P

95% reduction in comm computationidle

communication

c=16copies

DisOnguishedPaperAward,EuroPar’11(Solomonik,D.)SC’11paperbySolomonik,Bhatele,D.

12xfaster

2.7xfaster

IdeasadoptedbyNervana,“deeplearning”startup,acquiredbyIntelinAugust2016

PerfectStrongScaling–inTimeandEnergy•  Every)meyouaddaprocessor,youshoulduseitsmemoryMtoo•  Startwithminimalnumberofprocs:PM=3n2•  IncreasePbyafactorofc"totalmemoryincreasesbyafactorofc•  Nota)onfor)mingmodel:

–  γT,βT,αT=secsperflop,perword_moved,permessageofsizem•  T(cP)=n3/(cP)[γT+βT/M1/2+αT/(mM1/2)]=T(P)/c•  Nota)onforenergymodel:

–  γE,βE,αE=joulesforsameopera)ons–  δE=joulesperwordofmemoryusedpersec–  εE=joulespersecforleakage,etc.

•  E(cP)=cP{n3/(cP)[γE+βE/M1/2+αE/(mM1/2)]+δEMT(cP)+εET(cP)}=E(P)•  Limit:c≤P1/3(3Dalgorithm),ifstar)ngwith1copyofinputs•  ExtendstoN-body,Strassen,…•  Canprovelowerboundsonneedednetwork(eg3Dtorusformatmul)

Outline




TSQR:QRofaTall,Skinnymatrix

27

W=

Q00R00Q10R10Q20R20Q30R30

W0

W1

W2W3

Q00

Q10

Q20Q30

= = .

R00R10R20R30

R00R10R20R30

=Q01R01Q11R11

Q01

Q11= . R01

R11

R01R11

= Q02R02

TSQR:QRofaTall,Skinnymatrix

28

W=

Q00R00Q10R10Q20R20Q30R30

W0

W1

W2W3

Q00

Q10

Q20Q30

= = .

R00R10R20R30

R00R10R20R30

=Q01R01Q11R11

Q01

Q11= . R01

R11

R01R11

= Q02R02

Output={Q00,Q10,Q20,Q30,Q01,Q11,Q02,R02}

TSQR:AnArchitecture-DependentAlgorithm

W=

W0

W1W2

W3

R00R10R20R30

R01

R11

R02Parallel:

W=

W0

W1W2

W3

R01R02

R00

R03Sequen)al:

W=

W0

W1W2

W3

R00R01

R01R11

R02

R11R03

DualCore:

Canchoosereduc)ontreedynamically

Mul)core/Mul)socket/Mul)rack/Mul)site/Out-of-core:?

TSQRPerformanceResults•  Parallel

–  IntelClovertown–  Upto8xspeedup(8core,dualsocket,10Mx10)

–  Pen)umIIIcluster,DolphinInterconnect,MPICH•  Upto6.7xspeedup(16procs,100Kx200)

–  BlueGene/L•  Upto4xspeedup(32procs,1Mx50)

–  TeslaC2050/Fermi•  Upto13x(110,592x100)

–  Grid–4xon4ci)esvs1city(Dongarra,Langouetal)–  Cloud–1.6xslowerthanjustaccessingdatatwice(GleichandBenson)

•  Sequen)al–  “Infinitespeedup”forout-of-coreonPowerPClaptop

•  Aslifleas2xslowdownvs(predicted)infiniteDRAM•  LAPACKwithvirtualmemoryneverfinished

•  SVDcostsaboutthesame•  JointworkwithGrigori,Hoemmen,Langou,Anderson,Ballard,Keutzer,others

30DatafromGreyBallard,MarkHoemmen,LauraGrigori,JulienLangou,JackDongarra,MichaelAnderson

TSQRPerformanceResults•  Parallel

–  IntelClovertown–  Upto8xspeedup(8core,dualsocket,10Mx10)

–  Pen)umIIIcluster,DolphinInterconnect,MPICH•  Upto6.7xspeedup(16procs,100Kx200)

–  BlueGene/L•  Upto4xspeedup(32procs,1Mx50)

–  TeslaC2050/Fermi•  Upto13x(110,592x100)

–  Grid–4xon4ci)esvs1city(Dongarra,Langouetal)–  Cloud–1.6xslowerthanjustaccessingdatatwice(GleichandBenson)

•  Sequen)al–  “Infinitespeedup”forout-of-coreonPowerPClaptop

•  Aslifleas2xslowdownvs(predicted)infiniteDRAM•  LAPACKwithvirtualmemoryneverfinished

•  SVDcostsaboutthesame•  JointworkwithGrigori,Hoemmen,Langou,Anderson,Ballard,Keutzer,others

31DatafromGreyBallard,MarkHoemmen,LauraGrigori,JulienLangou,JackDongarra,MichaelAnderson

SIAGonSupercompuOngBestPaperPrize,2016

Other/OngoingWork•  AllresultsextendtoStrassen-likealgorithmswithcomplexityO(nω)

–  LowerboundgoesfromO(n3/M1/2)=O((n/M1/2)3M)toO((n/M1/2)ωM)•  Lotsmoreworkon

–  Algorithms:•  BLAS,LDLT,QRwithpivo)ng,otherpivo)ngschemes,eigenproblems,…•  Sparsematrices,structuredmatrices•  All-pairs-shortest-path,…•  Both2D(c=1)and2.5D(c>1)•  Butonlybandwidthmaydecreasewithc>1,notlatency(egLU)

–  Pla�orms:•  Mul)core,cluster,GPU,cloud,heterogeneous,low-energy,…

–  Sogware:•  Integra)onintoSca/LAPACK,PLASMA,MAGMA,…

•  Integra)onintoapplica)ons–  CTF(withANL):symmetrictensorcontrac)ons–  Spark:TSQR

Outline




Recallop)malsequen)alMatmul(1/2)

•  Naïvecodefori=1:n,forj=1:n,fork=1:n,C(i,j)+=A(i,k)*B(k,j)

•  “Blocked”codefori=1:n/b,forj=1:n/b,fork=1:n/bC[i,j]+=A[i,k]*B[k,j]…bxbmatmul

•  Thm:Pickingb=M1/2afainslowerbound:#words_moved=Ω(n3/M1/2)•  Wheredoes1/2comefrom?

Recallop)malsequen)alMatmul(2/2)

•  “Blocked”codewithblocksizesMxi,Mxj,Mxkfori=1:n/Mxi,forj=1:n/Mxj,fork=1:n/Mxk

C[i,j]+=A[i,k]*B[k,j]…(MxixMxk)*(MxkxMxj)matmul•  Pickop)malblocksizes

–  Afitsincache:MxixMxk=Mxi+xk≤M=>xi+xk≤1–  Bfitsincache:xk+xj≤1,Cfitsincache:xi+xj≤1– Maximize=MxixMxjxMxk=Mxi+xj+xk!Maximizexi+xj+xk

•  Linearprogram,solu)onxi=xj=xk=1/2,xi+xj+xk=3/2•  Memorytraffic=(n/M1/2)3*M=n3/M1/2

•  Thm:This)lingminimizesmemorytraffic,overallreorderings–  Includessequen)al,parallel,2.5Dcases

Applica)ontoDirectN-Body•  fori=1:n,forj=1:n,F(i)+=force(P(i),P(j))•  UseblocksizesMxiforiandMxjforj

•  MaximizeMxixMxj=Mxi+xjs.t.Mxi≤MandMxj≤M– Maximizexi+xjs.t.xi≤1andxj≤1– Solu)on:xi=xj=1

•  Memorytraffic=(n/M)2*M=n2/M

•  Thm:This)lingminimizesmemorytraffic,overallreorderings–  Includessequen)al,parallel,2.5Dcases

N-BodySpeedupsonIBM-BG/P(Intrepid)8Kcores,32Kpar)cles

11.8xspeedup

K.Yelick,E.Georganas,M.Driscoll,P.Koanantakool,E.Solomonik

04/07/2015 CS267 Lecture 21

Some Applications •  Gravity, Turbulence, Molecular Dynamics (3-way

interactions, 42x speedups), Plasma Simulation, Electron-Beam Lithography Device Simulation

•  Hair ... – www.fxguide.com/featured/brave-new-hair/ – graphics.pixar.com/library/CurlyHairA/paper.pdf

38

Approachtogeneralizinglowerbounds•  Matmul

fori=1:n,forj=1:n,fork=1:n,

C(i,j)+=A(i,k)*B(k,j)

=>for(i,j,k)inS=subsetofZ3

Accessloca)onsindexedby(i,j),(i,k),(k,j)

•  Generalcase

fori1=1:n,fori2=i1:m,…forik=i3:i4

C(i1+2*i3-i7)=func(A(i2+3*i4,i1,i2,i1+i2,…),B(pnt(3*i4)),…)

D(somethingelse)=func(somethingelse),…=>for(i1,i2,…,ik)inS=subsetofZk

Accessloca)onsindexedby“projec)ons”,eg

φC(i1,i2,…,ik)=(i1+2*i3-i7)

φA(i1,i2,…,ik)=(i2+3*i4,i1,i2,i1+i2,…),…

•  Goal:Communica)onlowerbounds,op)malalgorithmsforanyprogramthatlookslikethis

GeneralCommunica)onBound

•  Thm:Givenaprogramwitharrayrefsgivenbyprojec)onsφj,thenthereisane≥1suchthat

#words_moved=Ω(#itera)ons/Me-1)

whereeisthethevalueofalinearprogram:minimizee=Σjejsubjecttorank(H)≤Σjej*rank(φj(H))forallsubgroupsH<Zk

•  Proofdependsonrecentresultinpuremathema)csbyChrist/Tao/Carbery/Bennef

Isthisboundafainable?

•  Butfirst:Canwewriteitdown?•  OneinequalitypersubgroupH<Zk,buts)llfinitelymany!

•  Thm:(badnews)Wri)ngdownallinequali)esinLPreducestoHilbert’s10thproblemoverQ–  Conjecturedtobeundecidable

•  Thm:(goodnews)AnotherLPwithsamesolu)onisdecidable

•  Afainabilitydependsonloopdependencies–  Bestcase:none,orreduc)ons(likematmul)

•  Thm:Wecanalwaysconstructanop)mal)ling,thatafainsthebound(dependenciespermi�ng)

OngoingWork

•  Extend“perfectscaling”resultsfor)meandenergybyusingextramemory– “n.5Dalgorithms”

•  Dealwithdependencies•  Implement/improvealgorithmstogeneratelowerbounds,op)malalgorithms

•  Incorporateintocompilers

Outline




AvoidingCommunica)oninItera)veLinearAlgebra

•  k-stepsofitera)vesolverforsparseAx=borAx=λx– DoeskSpMVswithAandstar)ngvector– Manysuch“KrylovSubspaceMethods”

•  ConjugateGradients(CG),GMRES,Lanczos,Arnoldi,…

•  Goal:minimizecommunica)on– Assumematrix“well-par))oned”–  Serialimplementa)on

•  Conven)onal:O(k)movesofdatafromslowtofastmemory•  New:O(1)movesofdata–opOmal

–  Parallelimplementa)ononpprocessors•  Conven)onal:O(klogp)messages(kSpMVcalls,dotprods)•  New:O(logp)messages-opOmal

•  Lotsofspeeduppossible(modeledandmeasured)–  Price:someredundantcomputa)on–  Challenges:Poorpar))oning,Precondi)oning,Num.Stability44

1234… …32

x

A·x

A2·x

A3·x

Communica)onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]

•  Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]

•  Example:Atridiagonal,n=32,k=3

•  Worksforany“well-par))oned”A

1234… …32

x

A·x

A2·x

A3·x


•  Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]


1234… …32

x

A·x

A2·x

A3·x


•  Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]•  Sequen)alAlgorithm


Step1

1234… …32

x

A·x

A2·x

A3·x




Step1 Step2

1234… …32

x

A·x

A2·x

A3·x




Step1 Step2 Step3

1234… …32

x

A·x

A2·x

A3·x




Step1 Step2 Step3 Step4

1234… …32

x

A·x

A2·x

A3·x


•  Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]•  ParallelAlgorithm


•  Eachprocessorcommunicatesoncewithneighbors

Proc1 Proc2 Proc3 Proc4

1234… …32

x

A·x

A2·x

A3·x


•  Replacekitera)onsofy=A⋅xwith[Ax,A2x,…,Akx]•  ParallelAlgorithm


•  Eachprocessorworkson(overlapping)trapezoid

Proc1 Proc2 Proc3 Proc4

TheMatrixPowersKernel:[Ax,A2x,…,Akx]onageneralmatrix(nearestkneighborsonagraph)

Sameideaforgeneralsparsematrices:k-wideneighboringregion57

Simpleblock-rowpar))oning"(hyper)graphpar))oning

Top-to-bofomprocessing"# TravelingSalesmanProblem

MinimizingCommunica)onofGMREStosolveAx=b

•  GMRES:findxinspan{b,Ab,…,Akb}minimizing||Ax-b||2

StandardGMRESfori=1tokw=A·v(i-1)…SpMVMGS(w,v(0),…,v(i-1))updatev(i),HendforsolveLSQproblemwithH

Communica)on-avoidingGMRESW=[v,Av,A2v,…,Akv][Q,R]=TSQR(W)…“TallSkinnyQR”buildHfromRsolveLSQproblemwithH

• Oops–Wfrompowermethod,precisionlost!58

Sequen)alcase:#wordsmoveddecreasesbyafactorofkParallelcase:#messagesdecreasesbyafactorofk

MatrixPowersKernel+TSQRinGMRES

0 200 400 600 800 1000

Iteration count

10�5

10�4

10�3

10�2

10�1

100R

elat

ive

norm

ofre

sidu

alA

x�

bOriginal GMRES

0 200 400 600 800 1000

Iteration count

10�5

10�4

10�3

10�2

10�1

100R

elat

ive

norm

ofre

sidu

alA

x�

bOriginal GMRESCA-GMRES (Monomial basis)

0 200 400 600 800 1000

Iteration count

10�5

10�4

10�3

10�2

10�1

100R

elat

ive

norm

ofre

sidu

alA

x�

bOriginal GMRESCA-GMRES (Monomial basis)CA-GMRES (Newton basis)

59

SpeedupsofGMRESon8-coreIntelClovertown

[MHDY09]

60

RequiresCo-tuningKernels

SpeedupsforGMGw/CA-KSMBofomSolve

61

•  ComparedBICGSTABvs.CA-BICGSTABwiths=4(monomialbasis)

• HopperatNERSC(CrayXE6),weakscaling:Upto4096MPIprocesses(1perchip,24,576corestotal)

•  SpeedupsforminiGMGbenchmark(HPGMGbenchmarkpredecessor)– 4.2xinbofomsolve,2.5xoverallGMGsolve

•  Implementedasasolverop)oninBoxLibandCHOMBOAMRframeworks

•  SpeedupsfortwoBoxLibapplica)ons:– 3DLMC(alow-machnumbercombus)oncode)

•  2.5xinbofomsolve,1.5xoverallGMGsolve– 3DNyx(anN-bodyandgasdynamicscode)

•  2xinbofomsolve,1.15xoverallGMGsolve

SummaryofItera)veLinearAlgebra

•  Newlowerbounds,op)malalgorithms,bigspeedupsintheoryandprac)ce

•  Lotsofotherprogress,openproblems–  Manydifferentalgorithmsreorganized

•  Moreunderway,moretobedone

–  Extendstomachinelearning(coordinatedescent)– Needtorecognizestablevariantsmoreeasily–  Precondi)oning

•  HierarchicallySemiseparableMatrices

– Autotuningandsynthesis•  Differentkindsof“sparsematrices”

Outline




– Write-AvoidingAlgorithms–  Reproducilibity

Formoredetails

•  Bebop.cs.berkeley.edu–  155pagesurveyinActaNumerica

•  CS267–Berkeley’sParallelCompu)ngCourse–  LivebroadcastinSpring2016,plannedSpring2017

•  www.cs.berkeley.edu/~demmel•  Allslides,videoavailable

– AvailableasSPOCfromXSEDE•  www.xsede.org•  Freesupercomputeraccountstodohomework•  Freeautogradingofhomework

CollaboratorsandSupporters•  JamesDemmel,KathyYelick,MichaelAnderson,GreyBallard,ErinCarson,Aditya

Devarakonda,MichaelDriscoll,DavidEliahu,AndrewGearhart,EvangelosGeorganas,NicholasKnight,PenpornKoanantakool,BenLipshitz,OdedSchwartz,EdgarSolomonik,OmerSpillinger

•  Aus)nBenson,MaryamDehnavi,MarkHoemmen,ShoaibKamil,MarghoobMohiyuddin

•  AbhinavBhatele,AydinBuluc,MichaelChrist,IoanaDumitriu,ArmandoFox,DavidGleich,MingGu,JeffHammond,MikeHeroux,OlgaHoltz,KurtKeutzer,JulienLangou,DevinMafhews,TomScanlon,MichelleStrout,SamWilliams,HuaXiang

•  JackDongarra,DulceneiaBecker,IchitaroYamazaki

•  SivanToledo,AlexDruinsky,InonPeled

•  LauraGrigori,Sebas)enCayrols,SimpliceDonfack,MathiasJacquelin,AmalKhabou,SophieMoufawad,MikolajSzydlarski

•  MembersofFASTMath,ParLab,ASPIRE,BEBOP,CACHE,EASI,MAGMA,PLASMA

•  ThankstoDOE,NSF,UCDiscovery,INRIA,Intel,Microsog,Mathworks,Na)onalInstruments,NEC,Nokia,NVIDIA,Samsung,Oracle

•  bebop.cs.berkeley.edu

Summary

Don’tCommunic…

66

Timetoredesignalllinearalgebra,n-body,…algorithmsandsogware

(andcompilers)

communicaon-avoiding algorithms for linear algebra and beyond

Documents