HPC: Linear Algebra Challenges (polaris.imag.fr/arnaud.legrand/teaching/2013/PC_11_linalg.pdf)



Page 1:

HPC: Linear Algebra Challenges

Arnaud Legrand, CNRS, University of Grenoble

LIG laboratory, [email protected]

December 16, 2013

1 / 120

Page 2:

Reducing communication time

There are four main techniques for improving completion time:

• Tuning (overlap communication and computation)

• Ghosting (duplicate computation when there are dependencies)

• Scheduling (cache-aware/cache-oblivious algorithms, data distribution, ...)

• Changing the essence of the algorithm (e.g. Strassen O(n^2.81) or Coppersmith-Winograd O(n^2.376)), but this may be numerically harmful, and beware of the constants hidden in the O(·)

2 / 120
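Where the O(n^2.81) comes from (a standard identity, added here for reference, not on the slide): one level of Strassen's construction replaces the 8 half-size products of the block formula with 7, so T(n) = 7·T(n/2) + O(n²) = O(n^(log₂ 7)) ≈ O(n^2.807):

\[
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_2 &= (A_{21}+A_{22})B_{11}\\
M_3 &= A_{11}(B_{12}-B_{22})          & M_4 &= A_{22}(B_{21}-B_{11})\\
M_5 &= (A_{11}+A_{12})B_{22}          & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}) & &\\
C_{11} &= M_1+M_4-M_5+M_7             & C_{12} &= M_3+M_5\\
C_{21} &= M_2+M_4                     & C_{22} &= M_1-M_2+M_3+M_6
\end{aligned}
\]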

Page 3:

Outline

1 Communication “Avoiding” Algorithms
    Cache oblivious algorithm
    Parallel Algorithm

2 Synchronization-reducing algorithms
    Moving to Scheduling DAGs
    DAG generation
    Granularity and Hybrid Computing

3 Auto-tuning
    Block Size
    Compiler Optimization
    Portability, Performance, Power, ...

4 Reproducibility and Mixed-Precision Methods

3 / 120

Page 4:

Outline

1 Communication “Avoiding” Algorithms
    Cache oblivious algorithm
    Parallel Algorithm

2 Synchronization-reducing algorithms
    Moving to Scheduling DAGs
    DAG generation
    Granularity and Hybrid Computing

3 Auto-tuning
    Block Size
    Compiler Optimization
    Portability, Performance, Power, ...

4 Reproducibility and Mixed-Precision Methods

4 / 120

Page 5:

Why avoid communication? (1/2)

Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)

[Figure: sequential case, a CPU with a cache above DRAM; parallel case, several CPU+DRAM nodes connected by a network]


Courtesy of James Demmel

5 / 120

Page 6:

Why avoid communication? (2/3)

• Running time of an algorithm is the sum of 3 terms:
   – #flops * time_per_flop
   – #words moved / bandwidth      (communication)
   – #messages * latency           (communication)

• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]:

  Annual improvements:
    time_per_flop: 59%
    bandwidth: network 26%, DRAM 23%
    latency:   network 15%, DRAM 5%

• Avoid communication to save time
• Goal: reorganize algorithms to avoid communication
   • between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
   • very large speedups possible
   • energy savings too!


Courtesy of James Demmel

6 / 120

Page 7:

Why Minimize Communication? (2/2)

[Figure: energy per operation, in picojoules on a log scale from 1 to 10000, now vs. projected 2018. Source: John Shalf, LBL]


Courtesy of James Demmel

7 / 120

Page 8:

Why Minimize Communication? (2/2)

[Same figure as the previous slide: energy per operation in picojoules, now vs. 2018. Source: John Shalf, LBL]

Minimize communication to save energy


Courtesy of James Demmel

8 / 120

Page 9:

Goals

• Redesign algorithms to avoid communication
   • between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible


Courtesy of James Demmel

9 / 120

Page 10:

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) is updated from row A(i,:) and column B(:,j)]


Courtesy of James Demmel

10 / 120
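As a concrete, runnable version of the slide's pseudocode, here is a minimal C sketch (0-based indexing, row-major n-by-n arrays; the function name is mine):

/* Naive matrix multiply: C = C + A*B, all n-by-n, row-major.
 * For each C(i,j), one row of A and one column of B are streamed
 * through fast memory, which is what the next slides count. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}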

Page 11:

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) is updated from row A(i,:) and column B(:,j)]


Courtesy of James Demmel

11 / 120

Page 12:

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}            ... n² reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}              ... n² reads altogether
    {read column j of B into fast memory}       ... n³ reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}          ... n² writes altogether

n³ + 3n² reads/writes altogether: this dominates the 2n³ arithmetic


Courtesy of James Demmel

12 / 120

Page 13:

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: block C(i,j) is updated from blocks A(i,k) and B(k,j), each b-by-b]


Courtesy of James Demmel

13 / 120

Page 14:

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}        ... b² × (n/b)² = n² reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}      ... b² × (n/b)³ = n³/b reads
      {read block B(k,j) into fast memory}      ... b² × (n/b)³ = n³/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}    ... b² × (n/b)² = n² writes

2n³/b + 2n² reads/writes << 2n³ arithmetic: faster!


Courtesy of James Demmel

14 / 120
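The tiled loop nest, again as a minimal C sketch (assuming, for simplicity, that the block size b divides n; the innermost triple loop is the on-block multiply):

/* Blocked (tiled) matmul: C = C + A*B with b-by-b tiles.
 * When 3*b*b doubles fit in fast memory, slow-memory traffic
 * drops to about 2n^3/b + 2n^2 words, as counted above. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += b)
        for (int j = 0; j < n; j += b)
            for (int k = 0; k < n; k += b)
                /* C(i,j) = C(i,j) + A(i,k) * B(k,j) on b-by-b blocks */
                for (int ii = i; ii < i + b; ii++)
                    for (int jj = j; jj < j + b; jj++)
                        for (int kk = k; kk < k + b; kk++)
                            C[ii*n + jj] += A[ii*n + kk] * B[kk*n + jj];
}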

Page 15:

Does blocked matmul attain the lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so #reads/writes ≥ 3^(1/2)·n³/M^(1/2) + 2n²
• Attains lower bound = Ω(#flops / M^(1/2))

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?


Courtesy of James Demmel

15 / 120

Page 16:

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [ C11  C12 ]  =  A · B  =  [ A11  A12 ] · [ B11  B12 ]
      [ C21  C22 ]               [ A21  A22 ]   [ B21  B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. is 1×1 or n/2 × n/2

func C = RMM(A, B, n)
  if n = 1, C = A * B, else {
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  }
  return


Courtesy of James Demmel

16 / 120
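A compact C rendering of RMM, written as C = C + A·B so the four "+" combinations become accumulation. The leading dimension ld and the small base case are practical details of mine, not in the slides (the analysis still assumes n is a power of 2):

/* Cache-oblivious recursive matmul on n-by-n submatrices stored
 * with leading dimension ld. The recursive splitting reuses every
 * level of the memory hierarchy without ever knowing the size M. */
enum { RMM_BASE = 32 };          /* n = 1 also works, just slowly */

void rmm(int n, int ld, const double *A, const double *B, double *C)
{
    if (n <= RMM_BASE) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double       *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    rmm(h, ld, A11, B11, C11);  rmm(h, ld, A12, B21, C11);   /* C11 */
    rmm(h, ld, A11, B12, C12);  rmm(h, ld, A12, B22, C12);   /* C12 */
    rmm(h, ld, A21, B11, C21);  rmm(h, ld, A22, B21, C21);   /* C21 */
    rmm(h, ld, A21, B12, C22);  rmm(h, ld, A22, B22, C22);   /* C22 */
}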

Page 17:

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/


Courtesy of James Demmel

17 / 120

Page 18:

How hard is hand-tuning matmul, anyway?


Courtesy of James Demmel

18 / 120

Page 19:

Solving science problems faster

Parallel computers can solve bigger problems
• weak scaling

Parallel computers can also solve a fixed problem faster
• strong scaling

Obstacles to strong scaling
• may increase relative cost of communication
• may hurt load balance

Edgar Solomonik and James Demmel, 2.5D algorithms, 3/36


Courtesy of James Demmel

19 / 120

Page 20:

Achieving strong scaling

How to reduce communication and maintain load balance?
• reduce communication along the critical path

Communicate less
• avoid unnecessary communication

Communicate smarter
• know your network topology

Edgar Solomonik and James Demmel, 2.5D algorithms, 4/36


Courtesy of James Demmel

20 / 120

Page 21:

Strong scaling matrix multiplication

[Figure: matrix multiplication on BG/P (n = 65,536), percentage of machine peak (0 to 100) vs. #nodes (256 to 2048), comparing 2.5D MM and 2D MM]

Edgar Solomonik and James Demmel, 2.5D algorithms, 5/36


Courtesy of James Demmel

21 / 120

Page 22:

Let’s compute together the amount of operations and data movements:

• for a 1D distribution:    Flops n³/p    Bytes p·n²        Memory 3n²/p

• for a 2D distribution:    Flops n³/p    Bytes √p·n²       Memory 3n²/p

• for a 3D distribution:    Flops n³/p    Bytes ∛p·n²       Memory 3n²/p^(2/3)

Not always that much memory available...

• for a 2.5D distribution:  Flops n³/p    Bytes √(p/c)·n²   Memory 3c·n²/p

22 / 120
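These byte counts are consistent with the communication lower bound quoted later, #words = Ω(#flops / M^(1/2)) per processor. A quick check for the 2.5D row (my derivation, reading the Bytes column as the total over all p processors, which also matches the 2D and 3D rows):

\[
M = \frac{c\,n^2}{p}
\;\Rightarrow\;
W_{\mathrm{proc}} = \Omega\!\left(\frac{n^3/p}{\sqrt{c\,n^2/p}}\right)
                  = \Omega\!\left(\frac{n^2}{\sqrt{c\,p}}\right)
\;\Rightarrow\;
W_{\mathrm{total}} = p\,W_{\mathrm{proc}} = \Omega\!\left(\sqrt{p/c}\;n^2\right).
\]

Setting c = 1 recovers the 2D row (√p·n²) and c = p^(1/3) recovers the 3D row (∛p·n²).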


Page 29:

Blocking matrix multiplication

[Figure: A and B split into blocks; pairs of blocks of A and B combine into blocks of A·B]

Edgar Solomonik and James Demmel, 2.5D algorithms, 6/36


Courtesy of James Demmel

23 / 120

Page 30:

2D matrix multiplication
[Cannon 69], [Van De Geijn and Watts 97]

[Figure: the blocked A and B distributed over 16 CPUs arranged in a 4x4 grid]

Edgar Solomonik and James Demmel, 2.5D algorithms, 7/36


Courtesy of James Demmel

24 / 120

Page 31:

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
• Slightly less efficient than Cannon, but simpler and easier to generalize
• Presentation from van de Geijn and Watts: www.netlib.org/lapack/lawns/lawn96.ps
• Similar ideas appeared many times
• Used in practice in PBLAS = Parallel BLAS: www.netlib.org/lapack/lawns/lawn100.ps


Courtesy of James Demmel

25 / 120

Page 32:

SUMMA

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j), with k a column of A and a row of B]

• i, j represent all rows, columns owned by a processor
• k is a single row or column (or a block of b rows or columns)
• C(i,j) = C(i,j) + Σk A(i,k) * B(k,j)
• Assume a pr by pc processor grid (pr = pc = 4 above); need not be square


Courtesy of James Demmel

26 / 120

Page 33:

SUMMA

For k = 0 to n-1            ... or n/b-1, where b is the block size
                            ... = # cols in A(i,k) and # rows in B(k,j)
  for all i = 1 to pr       ... in parallel
    owner of A(i,k) broadcasts it to whole processor row
  for all j = 1 to pc       ... in parallel
    owner of B(k,j) broadcasts it to whole processor column
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j)]


Courtesy of James Demmel

27 / 120
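One way to realize this loop in MPI: a minimal C sketch of my own (not the PBLAS code), assuming a square s-by-s process grid, block size nb = n/s, and each rank holding one nb-by-nb block of A, B and C:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* One SUMMA sweep: Clocal = Clocal + A*B on an s-by-s process grid. */
void summa(int nb, int s, const double *Alocal, const double *Blocal,
           double *Clocal, MPI_Comm grid)
{
    int rank;
    MPI_Comm_rank(grid, &rank);
    int myrow = rank / s, mycol = rank % s;

    MPI_Comm rowc, colc;                 /* processor-row / -column comms */
    MPI_Comm_split(grid, myrow, mycol, &rowc);
    MPI_Comm_split(grid, mycol, myrow, &colc);

    double *Acol = malloc((size_t)nb * nb * sizeof *Acol);
    double *Brow = malloc((size_t)nb * nb * sizeof *Brow);

    for (int k = 0; k < s; k++) {        /* block step k */
        if (mycol == k) memcpy(Acol, Alocal, (size_t)nb * nb * sizeof *Acol);
        if (myrow == k) memcpy(Brow, Blocal, (size_t)nb * nb * sizeof *Brow);
        MPI_Bcast(Acol, nb*nb, MPI_DOUBLE, k, rowc);  /* A(i,k) along row i */
        MPI_Bcast(Brow, nb*nb, MPI_DOUBLE, k, colc);  /* B(k,j) along col j */
        for (int i = 0; i < nb; i++)     /* C_myproc += Acol * Brow */
            for (int j = 0; j < nb; j++)
                for (int kk = 0; kk < nb; kk++)
                    Clocal[i*nb + j] += Acol[i*nb + kk] * Brow[kk*nb + j];
    }
    free(Acol); free(Brow);
    MPI_Comm_free(&rowc); MPI_Comm_free(&colc);
}

In a real code the inner triple loop would be a dgemm call, and the broadcasts can be pipelined over smaller panels (the b of the slide) to overlap communication with computation.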

Page 34:

SUMMA performance

For k = 0 to n/b-1
  for all i = 1 to s        ... s = sqrt(p)
    owner of A(i,k) broadcasts it to whole processor row
      ... time = log s * (α + β * b*n/s), using a tree
  for all j = 1 to s
    owner of B(k,j) broadcasts it to whole processor column
      ... time = log s * (α + β * b*n/s), using a tree
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow
      ... time = 2*(n/s)² * b

• Total time = 2*n³/p + α * log p * n/b + β * log p * n²/s
• To simplify analysis only, assume s = sqrt(p)


Courtesy of James Demmel

28 / 120
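The total on the slide above follows by summing over the n/b steps of the loop; a quick check (using 2 log s = log p when s = √p):

\[
\frac{n}{b}\left[\,2\left(\frac{n}{s}\right)^{2} b \;+\; 2\log s\,\left(\alpha + \beta\,\frac{b\,n}{s}\right)\right]
\;=\; \frac{2n^3}{p} \;+\; \alpha\,\log p\,\frac{n}{b} \;+\; \beta\,\log p\,\frac{n^2}{s}.
\]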

Page 35:

SUMMA performance

• Total time = 2*n³/p + α * log p * n/b + β * log p * n²/s
• Parallel Efficiency = 1 / (1 + α * log p * p/(2*b*n²) + β * log p * s/(2*n))
• ~Same β term as Cannon, except for the log p factor; log p grows slowly, so this is OK
• Latency (α) term can be larger, depending on b:
   when b = 1, we get α * log p * n; as b grows to n/s, the term shrinks to α * log p * s (log p times Cannon)
• Temporary storage grows like 2*b*n/s
• Can change b to trade off latency cost against memory


Courtesy of James Demmel

29 / 120

Page 36:

3D matrix multiplication
[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]

[Figure: the blocked A and B distributed over 64 CPUs arranged in a 4x4x4 grid; 4 copies of the matrices]

Edgar Solomonik and James Demmel, 2.5D algorithms, 8/36


Courtesy of James Demmel

30 / 120

Page 37:

2.5D matrix multiplication

[Figure: the blocked A and B distributed over 32 CPUs arranged in a 4x4x2 grid; 2 copies of the matrices]

Edgar Solomonik and James Demmel, 2.5D algorithms, 9/36


Courtesy of James Demmel

31 / 120

Page 38:

Can we do better?

• Lower bound assumed 1 copy of data: M = O(n²/P) per processor
• What if the matrix is small enough to fit c > 1 copies, so M = cn²/P?
   – #words_moved = Ω(#flops / M^(1/2)) = Ω(n² / (c^(1/2) P^(1/2)))
   – #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain the new lower bound?
   – Special case: “3D Matmul”: c = P^(1/3)
      • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
      • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
      • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
   – Not always that much memory available...


Courtesy of James Demmel

32 / 120

Page 39:

2.5D Matrix Multiplication

• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: the processor grid, with the (P/c)^(1/2) and c dimensions labeled. Example: P = 32, c = 2]


Courtesy of James Demmel

33 / 120

Page 40:

2.5D Matrix Multiplication

• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)


Courtesy of James Demmel

34 / 120

Page 41:

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
• must satisfy 1 ≤ c ≤ p^(1/3)
• special case: c = 1 yields the 2D algorithm
• special case: c = p^(1/3) yields the 3D algorithm

cost(2.5D MM(p, c)) = O(n³/p) flops
                    + O(n²/√(cp)) words moved
                    + O(√(p/c³)) messages*

*ignoring log(p) factors

Edgar Solomonik and James Demmel, 2.5D algorithms, 10/36


Courtesy of James Demmel

35 / 120

Page 42:

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
• must satisfy 1 ≤ c ≤ p^(1/3)
• special case: c = 1 yields the 2D algorithm
• special case: c = p^(1/3) yields the 3D algorithm

cost(2D MM(p)) = O(n³/p) flops
               + O(n²/√p) words moved
               + O(√p) messages*
               = cost(2.5D MM(p, 1))

*ignoring log(p) factors

Edgar Solomonik and James Demmel, 2.5D algorithms, 11/36


Courtesy of James Demmel

36 / 120

Page 43:

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
• must satisfy 1 ≤ c ≤ p^(1/3)
• special case: c = 1 yields the 2D algorithm
• special case: c = p^(1/3) yields the 3D algorithm

cost(2.5D MM(c·p, c)) = O(n³/(c·p)) flops
                      + O(n²/(c·√p)) words moved
                      + O(√p/c) messages
                      = cost(2D MM(p)) / c

perfect strong scaling

Edgar Solomonik and James Demmel, 2.5D algorithms, 12/36


Courtesy of James Demmel

37 / 120

Page 44:

2.5D MM on 65,536 cores

[Figure: matrix multiplication on 16,384 nodes of BG/P, percentage of machine peak vs. n (8192 to 131072), using c = 16 matrix copies; 2.5D MM is 12X faster than 2D MM at one end of the range and 2.7X faster at the other]

Edgar Solomonik and James Demmel, 2.5D algorithms, 13/36


Courtesy of James Demmel

38 / 120

Page 45:

Outline

1 Communication “Avoiding” Algorithms
    Cache oblivious algorithm
    Parallel Algorithm

2 Synchronization-reducing algorithms
    Moving to Scheduling DAGs
    DAG generation
    Granularity and Hybrid Computing

3 Auto-tuning
    Block Size
    Compiler Optimization
    Portability, Performance, Power, ...

4 Reproducibility and Mixed-Precision Methods

39 / 120

Page 46:


ScaLAPACK Parallel Library


Courtesy of James Demmel

40 / 120

Page 47:



Courtesy of James Demmel

41 / 120

Page 48:

Looking at the Gordon Bell Prize

1 GFlop/s; 1988; Cray Y-MP; 8 processors
  Static finite element analysis

1 TFlop/s; 1998; Cray T3E; 1024 processors
  Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method

1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors
  Superconductive materials

1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads)


Courtesy of Jack Dongarra

42 / 120

Page 49:

• Must rethink the design of our software
   Another disruptive technology, similar to what happened with cluster computing and message passing

• Rethink and rewrite the applications, algorithms, and software

• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this


Courtesy of Jack Dongarra

43 / 120

Page 50:

Software/Algorithms follow hardware evolution in time:

LINPACK (70's): vector operations. Rely on Level-1 BLAS operations.

LAPACK (80's): blocking, cache friendly. Rely on Level-3 BLAS operations.

ScaLAPACK (90's): distributed memory. Rely on PBLAS message passing.

PLASMA (00's): new algorithms (many-core friendly). Rely on a DAG/scheduler, block data layout, and some extra kernels.


Courtesy of Jack Dongarra

44 / 120


Page 54:

Courtesy of Jack Dongarra

48 / 120

Page 55:

Courtesy of Jack Dongarra

49 / 120

Page 56:

[Figure: thread execution trace of the LU factorization kernels DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM; threads, with no lookahead]


Courtesy of Jack Dongarra

50 / 120

Page 57:

Reorganizing algorithms to use this approach


Courtesy of Jack Dongarra

51 / 120

Page 58:

Courtesy of Jack Dongarra

52 / 120

Page 59:

Courtesy of Jack Dongarra

53 / 120

Page 60:

Courtesy of Jack Dongarra

54 / 120

Page 61: HPC: Linear Algebra Challengespolaris.imag.fr/arnaud.legrand/teaching/2013/PC_11_linalg.pdf · HPC: Linear Algebra Challenges A. Legrand Communication \Avoiding" Algorithms Cache

• Asynchronicity: avoid fork-join (bulk-synchronous design)
• Dynamic scheduling: out-of-order execution
• Fine granularity: independent block operations
• Locality of reference: data storage in a Block Data Layout
Courtesy of Jack Dongarra
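These principles can be made concrete. Below is a minimal sketch (Python/NumPy, my illustration rather than anything from the slides) of converting a row-major matrix to a block data layout, so that each tile is contiguous in memory and a task operating on tile (i, j) touches a single contiguous buffer:

import numpy as np

def to_block_layout(A, nb):
    # Reorder an n x n row-major matrix (n divisible by nb) so that each
    # nb x nb tile is stored contiguously: result[i, j] is tile (i, j).
    n = A.shape[0]
    t = n // nb
    return np.ascontiguousarray(A.reshape(t, nb, t, nb).transpose(0, 2, 1, 3))

A = np.arange(36.0).reshape(6, 6)
tiles = to_block_layout(A, 2)   # tiles[1, 2] is the contiguous 2x2 tile (1, 2)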


PLASMA Dynamic Task Scheduler
[Figure: worker threads pick work from a shared task pool, one task slice at a time]
Courtesy of Jack Dongarra


• Ideally, we would generate the whole DAG, find the critical path, and execute along it.
• The DAG is too large to generate ahead of time:
  – do not generate it explicitly;
  – dynamically generate the DAG as we go.
• Machines will have a large number of cores in a distributed fashion:
  – we will have to engage in message passing;
  – management must be distributed;
  – locally, have a runtime system.
Courtesy of Jack Dongarra


• Here is the DAG for a factorization on a 20 x 20 matrix [figure].
• For a large matrix, say of order 10^6, the DAG is huge.
• Many challenges for the software.
Courtesy of Jack Dongarra


Execution of the DAG by a Sliding Window
Tile LU factorization, 10 x 10 tiles, 300 tasks total, 100-task window. [Animation: the window of instantiated tasks slides over the DAG as tasks complete.]
Courtesy of Jack Dongarra
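The sliding-window mechanism can be sketched in a few lines (a simplified illustration, not PLASMA's actual code). Tasks are produced in the sequential (topological) order of the algorithm, at most `window` of them are instantiated at once, and any instantiated task whose dependencies are complete may run, in any order:

def execute_with_window(task_gen, deps_of, run, window=100):
    # task_gen yields task ids in sequential (topological) order, so a
    # task's dependencies are always generated before the task itself
    window_set, done, exhausted = set(), set(), False
    gen = iter(task_gen)
    while window_set or not exhausted:
        while len(window_set) < window and not exhausted:
            try:
                window_set.add(next(gen))        # slide the window forward
            except StopIteration:
                exhausted = True
        for t in [t for t in window_set if all(d in done for d in deps_of(t))]:
            run(t)                               # out-of-order execution
            done.add(t)
            window_set.remove(t)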


Need to rewrite all algorithms as DAGs? How to do online (and distributed?) DAG generation?

1. Bound the number of instantiated tasks and execute the sequential task flow with fake kernel calls to discover the dependencies. Doing so, you trade memory for scheduling opportunities. Although this approach guarantees compatibility with the sequential execution from a semantic point of view, it also biases the execution and forces it to stay close to the sequential order [QUARK/StarPU/MORSE].
2. Put the compiler in. The compiler creates the DAG at compilation time, but in a compact symbolic way (i.e., a cyclic dependency graph). This makes it possible to determine, for any task, its children and ancestors, which helps for fault tolerance: one can reproduce any piece of data and track down what needs to be recomputed.
3. Non-affine loops (e.g., a reduction) that do not fit in the polyhedral model are written by hand.
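The first approach can be illustrated with a toy dependency-inference routine (my sketch; QUARK and StarPU expose this through task-insertion calls with read/write access modes). Given tasks in sequential order with declared read and write sets, the runtime derives the read-after-write, write-after-read, and write-after-write edges of the DAG:

def infer_dependencies(tasks):
    # tasks: list of (reads, writes) pairs, in sequential program order;
    # returns DAG edges (producer_index, consumer_index)
    last_writer, readers, edges = {}, {}, []
    for t, (reads, writes) in enumerate(tasks):
        for d in reads:
            if d in last_writer:
                edges.append((last_writer[d], t))    # read-after-write
            readers.setdefault(d, []).append(t)
        for d in writes:
            if d in last_writer:
                edges.append((last_writer[d], t))    # write-after-write
            for r in readers.get(d, []):
                if r != t:
                    edges.append((r, t))             # write-after-read
            last_writer[d] = t
            readers[d] = []
    return edges

# tile LU fragment: GETRF writes A00; TRSM reads A00 and writes A01
edges = infer_dependencies([(set(), {"A00"}), ({"A00"}, {"A01"})])
# [(0, 1)]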


Gaussian Elimination
[Figure: three ways to organize the elimination step]
• Standard way: subtract a multiple of a row.
• LINPACK: apply a sequence of operations to a column.
• LAPACK: apply the sequence to a panel of nb columns (a2 = L^-1 a2), then apply the nb updates to the rest of the matrix (a3 = a3 - a1*a2).
Slide source: Dongarra. Courtesy of James Demmel.


Gaussian Elimination via a Recursive Algorithm (F. Gustavson and S. Toledo)

LU algorithm:
1. Split the matrix into two rectangles (m x n/2); if there is only 1 column, scale it by the reciprocal of the pivot and return.
2. Apply the LU algorithm to the left part.
3. Apply the transformations to the right part (triangular solve A12 = L^-1 A12, then matrix multiplication A22 = A22 - A21*A12).
4. Apply the LU algorithm to the right part.

[Figure: block partition (L, A12; A21, A22)]
Most of the work is in the matrix multiply, on matrices of size n/2, n/4, n/8, ...
Slide source: Dongarra. Courtesy of James Demmel.
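These four steps transcribe almost directly into code. A NumPy sketch (illustrative only: it omits the pivoting a real implementation performs):

import numpy as np

def recursive_lu(A):
    # In-place recursive LU of an m x n panel (m >= n), no pivoting.
    # On return, the strict lower part of A holds L (unit diagonal)
    # and the upper part holds U.
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                       # scale by 1/pivot and return
        return
    k = n // 2
    recursive_lu(A[:, :k])                        # factor the left part
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # A12 = L^-1 A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]            # A22 = A22 - A21*A12
    recursive_lu(A[k:, k:])                       # factor the right part

Most of the flops end up in the `@` product on matrices of halving size, which is what makes the recursion cache-oblivious.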


An ideal solution?

I Such dynamic/work-stealing techniques always have trouble with data management. Although it is possible to estimate communication costs, and although optimized computation kernels have stable performance, we end up with a greedy strategy.
I Regarding data-movement optimization, we sometimes know statically that some sub-DAGs could be executed in an efficient way.
I They are looking at how to deal with such things. Obviously, when the algorithm is expressed recursively, adaptive computing is much easier; starting from a classical sequential description, it is trickier.


How to pick the tile size?

I When tiles are too small, kernel efficiency is bad; when they are too large, you do not have enough tiles, hence not enough parallelism.
I The tile size depends on the hardware, but with both GPUs and CPUs in the machine, this choice should be made at runtime, making opportunistic scheduling choices (MAGMA, StarPU, ...).


Match algorithmic requirements to the architectural strengths of the hybrid components:
• Multicore: small tasks/tiles
• Accelerator: large data-parallel tasks
E.g., split the computation into tasks; define a critical path that "clears" the way for the other large data-parallel tasks; properly schedule the task execution. Design algorithms with a well-defined "search space" to facilitate auto-tuning.
Courtesy of Jack Dongarra


• Algorithms (in particular LU) for multicore + GPU systems
• Challenges: how to split the computation, software development, tuning
Courtesy of Jack Dongarra


Needed: tuned parameters and a tuned DGEMM for "rectangular" matrices.
Courtesy of Jack Dongarra


Outline

1. Communication "Avoiding" Algorithms: cache-oblivious algorithm; parallel algorithm
2. Synchronization-reducing algorithms: moving to scheduling DAGs; DAG generation; granularity and hybrid computing
3. Auto-tuning: block size; compiler optimization; portability, performance, power, ...
4. Reproducibility and Mixed-Precision Methods


Goal 3 – Automate Performance Tuning
• Widely used in performance tuning of kernels:
  – ATLAS (PhiPAC) – BLAS – www.netlib.org/atlas
  – FFTW – Fast Fourier Transform – www.fftw.org
  – Spiral – signal processing – www.spiral.net
  – OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski
Courtesy of Jack Dongarra


Optimizing block sizes for matmul: finding a needle in a haystack, so automate.
[Figure: matmul performance over the space of block sizes]
Courtesy of Jack Dongarra


Goal 3 – Automate Performance Tuning
• 1300 calls to ILAENV() to get block sizes, etc.
  – Never been systematically tuned
• Extend automatic tuning techniques of ATLAS, etc., to these other parameters
  – Automation important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
  – Important for ease of use too
Courtesy of Jack Dongarra


The Difficulty of Tuning SpMV: Sparse Matrix-Vector Multiply

// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)

Courtesy of Jack Dongarra


The Difficulty of Tuning SpMV

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k=row[i] to row[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t

• Exploit 8x8 dense blocks
Courtesy of Jack Dongarra
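The CSR loop above runs as-is once the arrays are spelled out; here is a self-contained Python version (an illustrative sketch; libraries such as scipy.sparse provide tuned equivalents):

import numpy as np

def spmv_csr(n_rows, data, indices, indptr, x):
    # y = A*x with A stored in CSR: data holds the nonzeros, indices
    # their column numbers (J above), indptr the row starts (row above)
    y = np.zeros(n_rows)
    for i in range(n_rows):
        t = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            t += data[k] * x[indices[k]]
        y[i] = t
    return y

# A = [[1, 0], [2, 3]], x = [1, 1]  ->  y = [1, 5]
y = spmv_csr(2, [1.0, 2.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 1.0])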


Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance over the register-blocking space; the reference implementation reaches 7.6% of peak (Mflop/s), while the best variant (4x2 blocking) reaches 31.1%]
Courtesy of Jack Dongarra


SpMV Performance: raefsky3
[Figure: measured SpMV performance for the raefsky3 matrix across block sizes]
Courtesy of Jack Dongarra


More Surprises Tuning SpMV: Extra Work Can Improve Efficiency
• More complex example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Pad with explicit zeros
  – "Fill ratio" = 1.5
• On a Pentium III: 1.5x speedup! (2/3 the running time)
Courtesy of Jack Dongarra
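The fill ratio is cheap to compute. A sketch (mine, not OSKI's code) that computes it exactly for r x c blocking; OSKI-style tuners pick the block size maximizing an estimated speed divided by this ratio:

import numpy as np

def fill_ratio(A, r, c):
    # stored entries after r x c blocking (zero-padded blocks) divided
    # by the true number of nonzeros; 1.0 means no wasted work
    m, n = A.shape
    blocks = sum(np.count_nonzero(A[i:i+r, j:j+c]) > 0
                 for i in range(0, m, r) for j in range(0, n, c))
    return blocks * r * c / np.count_nonzero(A)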


• Many parameters in the code need to be optimized.
• Software adaptivity is the key for applications to effectively use available resources whose complexity is increasing exponentially.
• Goal: automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex.
• Non-obvious interactions between hardware and software can affect the outcome.
Courtesy of Jack Dongarra


Multi-objective compiler auto-tuning using mobile phones [Grigori Fursin, "Collective Mind: making auto-tuning practical using crowdsourcing and predictive modeling"]

Program: image corner detection. Processor: ARM v6, 830 MHz. Compiler: Sourcery GCC for ARM v4.7.3. OS: Android OS v2.3.5. System: Samsung Galaxy Y. Data set: MiDataSet #1, image, 600x450x8b PGM, 263 KB.

500 combinations of random flags -O3 -f(no-)FLAG. [Figure: binary size (bytes) vs. execution time (sec.), with -O3 as the baseline.] Use a Pareto-frontier filter and pack the experimental data on the fly. Powered by Collective Mind Node (Android apps on Google Play).
Courtesy of Grigori Fursin
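A Pareto-frontier filter keeps only the non-dominated (execution time, binary size) points; a minimal sketch:

def dominates(q, p):
    # q dominates p if it is no worse in every objective (minimized)
    # and strictly better in at least one
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))

def pareto_front(points):
    # keep the non-dominated (time_sec, size_bytes) pairs
    return [p for p in points if not any(dominates(q, p) for q in points)]

front = pareto_front([(1.0, 900), (1.2, 700), (1.1, 950), (0.9, 1200)])
# (1.1, 950) is dropped: it is dominated by (1.0, 900)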


Universal complexity (dimension) reduction

Found solution:

-O3 -fno-align-functions -fno-align-jumps -fno-align-labels -fno-align-loops -fno-asynchronous-unwind-tables -fno-branch-count-reg -fno-branch-target-load-optimize2 -fno-btr-bb-exclusive -fno-caller-saves -fno-combine-stack-adjustments -fno-common -fno-compare-elim -fno-conserve-stack -fno-cprop-registers -fno-crossjumping -fno-cse-follow-jumps -fno-cx-limited-range -fdce -fno-defer-pop -fno-delete-null-pointer-checks -fno-devirtualize -fno-dse -fno-early-inlining -fno-expensive-optimizations -fno-forward-propagate -fgcse -fno-gcse-after-reload -fno-gcse-las -fno-gcse-lm -fno-gcse-sm -fno-graphite-identity -fguess-branch-probability -fno-if-conversion -fno-if-conversion2 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions -fno-ipa-cp -fno-ipa-cp-clone -fno-ipa-matrix-reorg -fno-ipa-profile -fno-ipa-pta -fno-ipa-pure-const -fno-ipa-reference -fno-ipa-sra -fno-ivopts -fno-jump-tables -fno-math-errno -fno-loop-block -fno-loop-flatten -fno-loop-interchange -fno-loop-parallelize-all -fno-loop-strip-mine -fno-merge-constants -fno-modulo-sched -fmove-loop-invariants -fomit-frame-pointer -fno-optimize-register-move -fno-optimize-sibling-calls -fno-peel-loops -fno-peephole -fno-peephole2 -fno-predictive-commoning -fno-prefetch-loop-arrays -fno-regmove -fno-rename-registers -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-reorder-functions -fno-rerun-cse-after-loop -fno-reschedule-modulo-scheduled-loops -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic -fno-sched-group-heuristic -fno-sched-interblock -fno-sched-last-insn-heuristic -fno-sched-pressure -fno-sched-rank-heuristic -fno-sched-spec -fno-sched-spec-insn-heuristic -fno-sched-spec-load -fno-sched-spec-load-dangerous -fno-sched-stalled-insns -fno-sched-stalled-insns-dep -fno-sched2-use-superblocks -fno-schedule-insns -fno-schedule-insns2 -fno-short-enums -fno-signed-zeros -fno-sel-sched-pipelining -fno-sel-sched-pipelining-outer-loops -fno-sel-sched-reschedule-pipelined -fno-selective-scheduling -fno-selective-scheduling2 -fno-signaling-nans -fno-single-precision-constant -fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-aliasing -fno-thread-jumps -fno-trapping-math -fno-tree-bit-ccp -fno-tree-builtin-call-dce -fno-tree-ccp -fno-tree-ch -fno-tree-copy-prop -fno-tree-copyrename -fno-tree-cselim -fno-tree-dce -fno-tree-dominator-opts -fno-tree-dse -ftree-forwprop -fno-tree-fre -fno-tree-loop-distribute-patterns -fno-tree-loop-distribution -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -fno-tree-loop-im -fno-tree-loop-ivcanon -fno-tree-loop-optimize -fno-tree-lrs -fno-tree-phiprop -fno-tree-pre -fno-tree-pta -fno-tree-reassoc -fno-tree-scev-cprop -fno-tree-sink -fno-tree-slp-vectorize -fno-tree-sra -fno-tree-switch-conversion -ftree-ter -fno-tree-vect-loop-version -fno-tree-vectorize -fno-tree-vrp -fno-unroll-all-loops -fno-unsafe-loop-optimizations -fno-unsafe-math-optimizations -funswitch-loops -fno-variable-expansion-in-unroller -fno-vect-cost-model -fno-web

Not very useful for analysis.
Courtesy of Grigori Fursin


Universal complexity (dimension) reduction: auto-tuning experimental pipeline

Found solution: the same full -O3 -fno-... flag string as above. Model the behavior as b = B(c), then chain a complexity-reduction filter into the auto-tuning experimental pipeline that removes dimensions (or sets them to their defaults) iteratively: ANOVA, PCA, etc.
Courtesy of Grigori Fursin


Active learning to systematize and focus exploration

[Figure: selection probability (0 to 1) for each of ~80 optimizations.] Start with a 50% probability of selecting each optimization (uniform distribution). Avoid collecting a huge amount of data by filtering (compacting) and learning the space on the fly.


• If the current random selection of optimizations increased execution time (bad): reduce the probabilities of the selected optimizations.
• If the current random selection of optimizations improved execution time (good): reward the probabilities of the selected optimizations.
Courtesy of Grigori Fursin
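This reduce/reward loop is a small probability update over the space of flags; schematically (my sketch of the idea, not Collective Mind's implementation):

import random

def sample_flags(probs):
    # include flag i in the next trial with probability probs[i]
    return [i for i, p in enumerate(probs) if random.random() < p]

def update(probs, selected, improved, step=0.05):
    # reward or penalize the flags used in the last trial
    for i in selected:
        probs[i] = min(1.0, probs[i] + step) if improved else max(0.0, probs[i] - step)

probs = [0.5] * 80                 # start: uniform 50% selection probability
selected = sample_flags(probs)
update(probs, selected, improved=True)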


Active learning to systematize and focus exploration

"Good optimizations" across all programs:
A – Break up large expression trees
B – Value propagation
C – Hoisting of loop invariants
D – Loop normalization
E – Loop unrolling
F – Mark constant variables
G – Dismantle array instructions
H – Eliminating copies

Faster than a traditional search (~50 iterations), but can get stuck in local minima. Speedups of 1.1-2x; sometimes it is better to reduce the Intel compiler optimization level!
Courtesy of Grigori Fursin


14 transformations, sequences of length 5: search space = 396,000.

• F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M.F.P. O'Boyle, J. Thomson, M. Toussaint and C.K.I. Williams. Using Machine Learning to Focus Iterative Optimization. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, March 2006.
Courtesy of Grigori Fursin


Universal complexity (dimension) reduction

Found solution: the same full -O3 -fno-... flag string as above.

Pruned solution:
-O3 -fno-align-functions (15% of the speedup) -fdce -fgcse -fguess-branch-probability (70% of the speedup) -fmove-loop-invariants -fomit-frame-pointer -ftree-ter -funswitch-loops -fno-ALL
Courtesy of Grigori Fursin


Overtuning Can Destroy Performance Portability
[Figure: each × denotes a DGEMM variant]
Courtesy of Stefan Wild


Automating Empirical Performance Tuning
Given a computation kernel and a transformation space:
[Pipeline: code & tuning specs -> code transformation -> code generation -> code compilation -> performance evaluation -> search, looping back until high-performing code is produced]
Courtesy of Stefan Wild


Search in Autotuning

Alternatives:
⋄ Complete enumeration
  – Prohibitively expensive (10^50 variants!)
  – Unnecessary?
⋄ Pruning
  – Careful balancing act (between aggressive and conservative strategies)

Helpful (necessary?) precursors: the expert still plays a role!
⋄ Identify the variable space (parameters to be tuned, ranges, constraints)
⋄ Quantify measurement limitations and noise
⋄ Incorporate known theoretical considerations (models)
⋄ Construct meaningful objectives
→ Reduce the search space and/or the number of variants that need to be examined

Our goal: design, implement, and analyze efficient optimization (= search) algorithms for tuning kernels within small computation budgets.
Courtesy of Stefan Wild


Is a Sophisticated Search Algorithm Needed?

[Seymour, You, & Dongarra, Cluster Computing '08]: random search performs better than the alternatives as the number of tuning parameters grows!

It depends on the distribution of high-performing variants:
[Figure: two histograms of mean run time (s) vs. percent of total, over 5000 semantically equivalent variants each]
Courtesy of Stefan Wild
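When high-performing variants are plentiful, plain random search under a fixed budget is hard to beat; a minimal sketch:

import random

def random_search(variants, measure, budget):
    # evaluate `budget` randomly chosen variants and keep the fastest
    best, best_t = None, float("inf")
    for v in random.sample(variants, min(budget, len(variants))):
        t = measure(v)
        if t < best_t:
            best, best_t = v, t
    return best, best_t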


Is a Sophisticated Search Algorithm Useful?

It depends on the structure of the (modeled) search space:
[Figure: two 2-dimensional performance landscapes over a 100 x 100 parameter grid; both problems have the same histogram]
One must learn/model/exploit this structure to quickly find high-performing variants.
Courtesy of Stefan Wild


Formulation and Modeling: Optimization is Optimization

Finding the best configuration is a mathematical optimization problem:

    min_x { f(x) : x = (x_I, x_B, x_C) ∈ D ⊂ R^n }

x: multidimensional parameterization of a code variant (compiler type, compiler flags, unroll/tiling factors, internal tolerances, ...)
f(x): empirical performance metric of x, such as FLOPS, power, or run time (requires a run)
D: search domain (constraints for feasible transformations, no errors, ...)
  – bound constraints: unroll ∈ [1, ..., 30]; RT = 2^i, i ∈ {0, 1, 2, 3}
  – known constraints: RT_I * RT_J ≤ 150 (cheap); power consumption ≤ 90 W (expensive)
  – hidden constraints: transformation errors (relatively cheap), compilation failures (expensive), and run-time failures (very expensive)

See [Balaprakash, Hovland, & W., iWAPT '11].
Courtesy of Stefan Wild
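As a concrete illustration of cheap versus expensive constraints (the parameter names UF, RT_I, RT_J are hypothetical, echoing the slide), a tuner screens candidates before paying for a compile and a run; hidden constraints only surface during those later stages:

def feasible(x):
    # cheap, explicit membership test for the search domain D
    if not 1 <= x["UF"] <= 30:                         # bound constraint
        return False
    if x["RT_I"] not in (1, 2, 4, 8) or x["RT_J"] not in (1, 2, 4, 8):
        return False                                   # RT = 2^i, i in {0,...,3}
    return x["RT_I"] * x["RT_J"] <= 150                # known, cheap constraint

feasible({"UF": 8, "RT_I": 4, "RT_J": 8})              # True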


Optimization Challenges in Autotuning

    min_x { f(x) : x = (x_I, x_B, x_C) ∈ D ⊂ R^n }

- f noisy, expensive, black box
- Discrete x, unrelaxable
- ∇_x f unavailable/nonexistent
- "Cliffs", many distinct/local solutions?

This calls for derivative-free optimization.
[Figure: search trajectories (tau vs. steps); integer space: MM (MatMult); mixed-integer: lattice QCD code]
Courtesy of Stefan Wild


SPAPT: Orio-ready Implementation
[Norris, Hartono, & Gropp, '07]

⋄ Extensible empirical tuning system
⋄ Allows inserting annotations as structured comments
⋄ Supports architecture-independent and architecture-specific optimizations

/* AXPY kernel */
for (i = 0; i <= n-1; i++)
  y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];

/* Tuning specifications */ UF = {1,...,30}; PAR = {True, False}

/*@ begin Loop (transform Unroll(ufactor=UF, parallelize=PAR)
  for (i = 0; i <= n-1; i++)
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
) @*/
Courtesy of Stefan Wild


Classical Algorithms for Performance Tuning

Global search:
⋄ exploration and exploitation
⋄ finds the globally best*
⋄ long search time
⋄ parameter sensitive

Local search:
⋄ limited exploration
⋄ finds the locally best
⋄ short search time
⋄ risk of a bad local solution

Hypothesis: customized local search algorithms are effective for short computational budgets.
Courtesy of Stefan Wild


Previous Algorithms for Performance Tuning

[Seymour, You, & Dongarra, Cluster Computing '08] and [Kisuki, Knijnenburg, & O'Boyle, PACT '00] compared several global and local algorithms:
⋄ Random search outperforms a genetic algorithm, simulated annealing, particle swarm, Nelder-Mead, and orthogonal search!
⋄ Large number of high-performing parameter configurations → easy to find one of them

[Norris, Hartono, & Gropp, Computational Science '07] used several global and local algorithms but gave no comparison:
⋄ Nelder-Mead simplex method, simulated annealing, a genetic algorithm

Other local search algorithms without comparison to global search:
⋄ Orthogonal search in ATLAS [Whaley & Dongarra, SC '98]
⋄ Pattern search in loop optimization [Qasem, Kennedy, & Mellor-Crummey, SC '06]
⋄ Modified Nelder-Mead simplex algorithm in Active Harmony [Tiwari, Chen, Chame, Hall, & Hollingsworth, IPDPS '09]
Courtesy of Stefan Wild


Local Algorithms: Direct Search Methods

See [Kolda, Lewis, & Torczon, SIREV '03].
[Figure: pattern search (easy to parallelize the f evaluations) and the Nelder-Mead simplex in the (θ1, θ2) plane (popularized by Numerical Recipes)]

⋄ Rely on indicator functions: [f(x_k + s) <? f(x_k)]
  • Ignore valuable information on the relative magnitudes of f(x_k)
Courtesy of Stefan Wild
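A bare-bones pattern search over an integer parameter space (a sketch of the idea, not a production implementation): poll the +/- step neighbors of each coordinate, accept any improvement via the indicator [f(x + s) < f(x)], and shrink the step when a full poll fails.

def pattern_search(f, x0, step=8, min_step=1):
    # coordinate pattern search on integers; halves the step on failure
    x, fx = list(x0), f(list(x0))
    while step >= min_step:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                fy = f(y)
                if fy < fx:             # indicator: any improvement is accepted
                    x, fx, improved = y, fy, True
        if not improved:
            step //= 2                  # shrink the pattern
    return x, fx

# toy objective with optimum at the unroll/tile pair (7, 13)
best, _ = pattern_search(lambda v: (v[0] - 7) ** 2 + (v[1] - 13) ** 2, [1, 1])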


Making the Most of Little Information on f

⋄ f is expensive ⇒ we can afford to make better use of the points
⋄ The overhead of the optimization routine is minimal (negligible?) relative to the cost of an empirical evaluation

[Figure: a surface of f sampled at a bank of points]
Bank of data, {x_i, f(x_i)}, i = 1..k: everything* known about f.
Idea: make use of the growing bank as the optimization progresses.
Courtesy of Stefan Wild


Surrogate-Based Trust-Region Algorithms

Substitute min { m(x) : x ∈ B_k } for min f(x)

f expensive, no ∇f; m cheap, analytic derivatives

Surrogate based on known f values

[Figure: surface plot of kernel time (CPU ms) over unroll factors i and j]

Trust m ≈ f in B_k

[Figure: contour plot of the surrogate model within the trust region]

Surrogates: predict improvement (see the sketch below)
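A minimal sketch of such a loop, with a separable quadratic as the surrogate m; this is a simplified stand-in for the full quadratic models and sample-geometry management that production model-based solvers use:

```python
import numpy as np

def surrogate_trust_region(f, x0, radius=1.0, budget=30, seed=0):
    """Minimal surrogate-based trust-region loop: fit a cheap quadratic m
    to the bank of known f values, minimize m inside B_k, and grow or
    shrink B_k according to whether the step really improved f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    fx = f(x)
    bank_x, bank_f = [x.copy()], [fx]
    for _ in range(budget):
        # Enrich the bank with a few points sampled inside B_k.
        for _ in range(2 * n):
            s = x + radius * rng.uniform(-1.0, 1.0, size=n)
            bank_x.append(s)
            bank_f.append(f(s))
        X, y = np.array(bank_x), np.array(bank_f)
        # Fit m(x) = c + g.x + sum_j h_j x_j^2 by least squares on the bank.
        A = np.hstack([np.ones((len(X), 1)), X, X**2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # m is cheap and analytic: minimize it over many candidates in B_k.
        cand = x + radius * rng.uniform(-1.0, 1.0, size=(256, n))
        m = np.hstack([np.ones((256, 1)), cand, cand**2]) @ coef
        xc = cand[np.argmin(m)]
        fc = f(xc)
        bank_x.append(xc)
        bank_f.append(fc)
        if fc < fx:                          # surrogate predicted well:
            x, fx, radius = xc, fc, radius * 2.0   # accept step, grow B_k
        else:
            radius *= 0.5                    # prediction failed: shrink B_k
    return x, fx

# Toy usage on a smooth 2-D "tuning surface" with optimum at (3, 2):
print(surrogate_trust_region(lambda z: (z[0] - 3)**2 + 10 * (z[1] - 2)**2,
                             [0.0, 0.0]))
```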


Simultaneously Optimizing Multiple Objectives

min_{x ∈ D} { f_1(x), f_2(x), ..., f_p(x) }

⋄ No a priori weights w_i (no scalarization Σ_i w_i f_i(x))

⋄ Dominated points x̃: ∃ x* ∈ D with f_i(x̃) ≥ f_i(x*) for all i and f_j(x̃) > f_j(x*) for some j

⋄ Seek the Pareto front of non-dominated points (a minimal filter is sketched below)
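A minimal non-dominance filter over measured objective vectors, O(n^2) and purely illustrative:

```python
import numpy as np

def pareto_front(F):
    """Return indices of the non-dominated rows of F (one row per point,
    one column per objective; smaller is better). A point is dominated
    if some other point is <= in every objective and < in at least one."""
    F = np.asarray(F, dtype=float)
    keep = []
    for i, fi in enumerate(F):
        dominated = any(
            np.all(fj <= fi) and np.any(fj < fi)
            for j, fj in enumerate(F) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# E.g. hypothetical (time, power) measurements for four configurations:
print(pareto_front([[4.0, 300], [5.0, 250], [6.0, 260], [8.0, 240]]))
# -> [0, 1, 3]; configuration 2 is dominated by configuration 1.
```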


Multiple Objectives: Time, Power, Energy

[Figure, adi benchmark: Time (s) vs. Total Power (W), sweeping frequency from 2.6 GHz down to 1.2 GHz for small, medium, and large input sizes]


[Figure, adi benchmark: Time (s) vs. Total Energy (W·s), same frequency sweep and input sizes]


[Figure, dsyr2k benchmark: Time (s) vs. Total Energy (W·s), same frequency sweep and input sizes]


[Figure, swim benchmark: Time (s) vs. Total Energy (W·s), same frequency sweep and input sizes]

⋄ Tradeoffs in power do not imply tradeoffs in energy (see the worked numbers below)

⋄ Objectives may not be conflicting: "race to idle"

⋄ Tradeoffs occur for different sizes

⋄ Tradeoffs occur at different frequencies
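The first bullet in worked numbers (illustrative values, not measurements from the plots): lowering the frequency cuts power but stretches run time, so energy = power × time can move either way:

```python
# Hypothetical measurements of one kernel at two frequency settings:
for label, power_w, time_s in [("2.6 GHz", 300.0, 5.0),
                               ("1.2 GHz", 230.0, 8.0)]:
    print(f"{label}: {power_w:.0f} W x {time_s:.0f} s = {power_w * time_s:.0f} J")
# 2.6 GHz: 300 W x 5 s = 1500 J   <- higher power, yet LESS energy:
# 1.2 GHz: 230 W x 8 s = 1840 J      a power tradeoff with no energy tradeoff,
#                                    i.e. the "race to idle" effect.
```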


Summary and Links

⋄ Performance tuning increasingly necessary, not yet "automatic"

⋄ Derivative-free optimization is a powerful, practical tool

When the available tuning time is limited:

⋄ Global exploration less useful

⋄ Problem formulation and starting point play important roles

Future work includes:

⋄ Incorporation of models, binary parameters, constraints (from models or otherwise), online restart strategies, role in full application codes, ...

→ always collecting new search/optimization problems... especially those with structure

Some preprints: http://mcs.anl.gov/~wild

http://trac.mcs.anl.gov/projects/performance/wiki/Orio

SPAPT http://trac…/performance/browser/orio/testsuite/SPAPT.v.01



Iterative Refinement: for speed

• What if double precision is much slower than single?
  – Cell processor in the PlayStation 3: 256 GFlop/s single, 25 GFlop/s double
  – Pentium SSE2: single twice as fast as double

• Given Ax = b in double precision:
  – Factor in single, do refinement in double (sketched below)
  – If κ(A) < 1/ε_single, it runs at the speed of single

• 1.9x speedup on an Intel-based laptop

• Applies to many algorithms, if the speed difference is large
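A minimal NumPy sketch of the scheme, with an explicit single-precision inverse standing in as a cheap substitute for the LU factors (LAPACK's dsgesv is the production routine):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Mixed-precision iterative refinement for Ax = b:
    do the O(n^3) factorization work in fast single precision,
    then refine the solution with O(n^2) steps in double."""
    A32 = A.astype(np.float32)
    inv32 = np.linalg.inv(A32)            # stand-in for a single-precision LU
    x = (inv32 @ b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                     # residual in double: O(n^2)
        d = inv32 @ r.astype(np.float32)  # correction via the single "factors"
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)  # well conditioned
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual near double-precision level
```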


Courtesy of Jack Dongarra



Reproducibility

▶ Reproducible numerical computation is already difficult for a simple reduce (see the demo below).

▶ The growing number of processing units, dynamic scheduling, and the use of hybrid mixed-precision hardware make it even harder.

▶ Changing algorithms may be particularly harmful.
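The reduce problem in a few lines: floating-point addition is not associative, so a runtime that reorders the reduction (as a dynamic scheduler may do from run to run) changes the bits of the result:

```python
import random

# Values whose magnitudes span many orders make the effect obvious:
random.seed(42)
xs = [random.uniform(-1, 1) * 10**random.randint(0, 16) for _ in range(10000)]

fwd = sum(xs)             # one reduction order
rev = sum(reversed(xs))   # another order over the same values
print(fwd == rev, fwd - rev)  # typically: False, with a visible difference
# Deterministic reduction trees (or compensated sums such as math.fsum)
# are needed to make the result order-independent.
```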


Fast Matrix Multiplication (1) (Cohn, Kleinberg, Szegedy, Umans)

• Can think of fast convolution of polynomials p, q as:
  – Map p (resp. q) into the group algebra Σ_i p_i z^i ∈ C[G] of the cyclic group G = { z^i }
  – Multiply elements of C[G] (use divide & conquer = FFT)
  – Extract coefficients

• For matrix multiply, need a non-abelian group satisfying the triple product property:
  – There are subsets X, Y, Z of G where xyz = 1 with x ∈ X, y ∈ Y, z ∈ Z ⇒ x = y = z = 1
  – Map matrix A into the group algebra via Σ_{x,y} A_{xy} x⁻¹y, and B into Σ_{y',z} B_{y'z} y'⁻¹z
  – Since x⁻¹y · y'⁻¹z = x⁻¹z iff y = y', we get Σ_y A_{xy} B_{yz} = (AB)_{xz}

• Search for fast algorithms reduced to a search for groups with certain properties:
  – Fastest algorithm so far is O(n^2.38), same as Coppersmith–Winograd (the simplest member of this family, Strassen's recursion, is sketched below)
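For concreteness, a minimal sketch of Strassen's O(n^2.81) recursion, the simplest sub-cubic scheme; the group-theoretic construction above generalizes the same idea of trading multiplications for additions:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion for square matrices with n a power of 2:
    seven recursive products instead of eight."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the classical product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))  # True, up to rounding: such
# algorithms have weaker error bounds than the classical triple loop.
```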


Courtesy of Jack Dongarra



MPI. Really?

▶ Hybrid parallelism (MPI+OpenMP) is tricky.

▶ MPI 3.0 introduces, among other things, neighborhood collective communications, non-blocking (asynchronous) collective operations, the ability to hint the middleware about possible optimizations, ... (see the sketch below)

▶ MPI 3.0 still considers MPI ranks as processes and not as "endpoints". :(

▶ MPI will have trouble going to exascale. Another approach is to resort to data-parallel languages to express data parallelism. HPF removed power from power users compared to MPI, which is one of the reasons for the success of MPI.
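A minimal sketch of an MPI-3 non-blocking collective using mpi4py (assumed available; run under mpiexec), which lets independent computation overlap the reduction:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.array([float(comm.Get_rank())])
total = np.empty(1)

req = comm.Iallreduce(local, total, op=MPI.SUM)  # returns immediately
# ... overlap independent computation here while the reduction progresses ...
req.Wait()                                       # complete the collective
if comm.Get_rank() == 0:
    print("sum of ranks:", total[0])
```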


Mega, Giga, Tera, Peta, Exa, Zetta ...

10^3  kilo     10^24 yotta    10^45 rinta
10^6  mega     10^27 xona     10^48 quexa
10^9  giga     10^30 weka     10^51 pepta
10^12 tera     10^33 vunda    10^54 ocha
10^15 peta     10^36 uda      10^57 nena
10^18 exa      10^39 treda    10^60 minga
10^21 zetta    10^42 sorta    10^63 luma


Courtesy of Jack Dongarra
