
Dense Linear Algebra for Massively Parallel Computers
Jack Dongarra (University of Tennessee, Oak Ridge National Laboratory, University of Manchester)
Horizon Maths


Page 1:

Dense Linear Algebra for Massively Parallel Computers

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester

Horizon Maths

Page 2:

Dense Linear Algebra

• Look at the largest machines today.
• Look at where things are headed.
• What's the current thinking on effective use?
• Applies to DLA, but the ideas also extend beyond that.

Page 3:

November 2012: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7 (16c) + Nvidia Kepler GPU (14c) + custom | USA | 560,640 | 17.6 | 66 | 8.3 | 2120
2 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 7.9 | 2063
3 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
4 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2066
5 | Forschungszentrum Juelich | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 393,216 | 4.14 | 82 | 1.97 | 2102
6 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.42 | 848
7 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 67 | 3.3 | 806
8 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT Intel (6c) + Nvidia Fermi GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
9 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.822 | 2105
10 | IBM | DARPA Trial System, Power7 (8c) + custom | USA | 63,360 | 1.51 | 78 | 3.58 | 422
500 | Slovak Academy Sci | IBM Power7 | Slovak Rep | … | … | … | … | …

(The Power figure for rank 10 reads ".358" in the extraction; 3.58 MW is consistent with the listed 422 MFlops/Watt. The figures in the rank-500 row are unreadable.)


Page 5:

Energy Cost Challenge

• At ~$1M per MW, energy costs are substantial
  - 10 Pflop/s in 2011 used ~10 MW
  - 1 Eflop/s in 2018 would use > 100 MW
  - DOE target: 1 Eflop/s around 2020-2022 at 20 MW (worked out in the sketch below)
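To make the arithmetic behind these targets concrete, here is a minimal sketch. It assumes the slide's ~$1M-per-MW rule of thumb (commonly quoted per year of operation) and the Gflops/W efficiencies quoted elsewhere in this deck:

```python
# Energy arithmetic behind the exascale power targets.

def energy_cost_usd(power_mw, usd_per_mw=1.0e6):
    """Rough energy bill for a machine drawing power_mw megawatts,
    at the slide's ~$1M-per-MW rule of thumb."""
    return power_mw * usd_per_mw

print(energy_cost_usd(10))           # a 10 Pflop/s machine at ~10 MW: ~$10M

# 1 Eflop/s at Titan-era efficiency (~2 Gflops/W) would need:
print(1e18 / 2e9 / 1e6, "MW")        # 500 MW: clearly infeasible

# So the 20 MW DOE target implies an efficiency of:
print(1e18 / 20e6 / 1e9, "Gflops/W")  # 50.0, the figure in the next table
```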

Page 6:

Potential System Architecture

Systems | 2012 (Titan) | 2022 | Difference (Today vs. 2022)
System peak | 27 Pflop/s | 1 Eflop/s | O(100)
Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) |
System memory | 710 TB (38*18688) | 32-64 PB | O(10)
Node performance | 1,452 GF/s (1311+141) | 1.2 or 15 TF/s | O(10) - O(100)
Node memory BW | 232 GB/s (52+180) | 2-4 TB/s | O(1000)
Node concurrency | 16 CPU cores + 2688 CUDA cores | O(1k) or 10k | O(100) - O(1000)
Total node interconnect BW | 8 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 18,688 | O(100,000) or O(1M) | O(100) - O(1000)
Total concurrency | 50 M | O(billion) | O(1,000)
MTTF | unknown | O(<1 day) | O(10)

Page 7:

Potential System Architecture, with a cap of $200M and 20 MW

Systems | 2012 (Titan) | 2020-2022 | Difference (Today vs. 2022)
System peak | 27 Pflop/s | 1 Eflop/s | O(100)
Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) | O(10)
System memory | 710 TB (38*18688) | 32-64 PB | O(100)
Node performance | 1,452 GF/s (1311+141) | 1.2 or 15 TF/s | O(10)
Node memory BW | 232 GB/s (52+180) | 2-4 TB/s | O(10)
Node concurrency | 16 CPU cores + 2688 CUDA cores | O(1k) or 10k | O(10) - O(100)
Total node interconnect BW | 8 GB/s | 200-400 GB/s | O(100)
System size (nodes) | 18,688 | O(100,000) or O(1M) | O(10) - O(100)
Total concurrency | 50 M | O(billion) | O(100)
MTTF | unknown | O(<1 day) | O(?)

Page 8:

Major Changes to Algorithms/Software

• Must rethink the design of our applications, algorithms and software
  - Manycore and hybrid architectures are disruptive technology
    • Similar to what happened with cluster computing and message passing
  - Data movement is expensive and flops are cheap (see the back-of-the-envelope sketch below)
    • Many methods are not optimized this way
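A back-of-the-envelope illustration of the data-movement point, using the Titan node figures from the architecture tables above (1,452 GF/s of compute against 232 GB/s of memory bandwidth):

```python
# Machine balance of a Titan node: why flops are cheap and data movement
# is expensive. Numbers taken from the architecture tables above.

node_flops = 1452e9    # node performance: 1,452 GF/s (1311 GPU + 141 CPU)
node_bw = 232e9        # node memory bandwidth: 232 GB/s

doubles_per_sec = node_bw / 8                   # 8 bytes per double
flops_per_double = node_flops / doubles_per_sec
print(round(flops_per_double))   # ~50: an algorithm must do ~50 flops per
                                 # double loaded from memory just to stay
                                 # compute-bound rather than bandwidth-bound
```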

Page 9:

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms
  - Break the fork-join model
• Communication-reducing algorithms
  - Communication carries a high energy cost
  - Use methods that attain the lower bounds on communication
• Mixed precision methods
  - 2x speed of operations and 2x speed for data movement (see the sketch after this list)
• Autotuning
  - Today's machines are too complicated; build "smarts" into the software so it adapts to the hardware
• Fault resilient algorithms
  - Implement algorithms that can recover from failures and bit flips
• Reproducibility of results
  - Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
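As a concrete illustration of the mixed precision bullet, here is a minimal NumPy/SciPy sketch of iterative refinement. It is not the PLASMA/MAGMA implementation, just the underlying idea: do the O(n^3) factorization in fast low precision, then recover full accuracy with cheap O(n^2) residual corrections.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b: factor in float32, refine to float64 accuracy."""
    lu, piv = lu_factor(A.astype(np.float32))   # fast low-precision factorization
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                            # residual in full precision
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d                                   # cheap correction step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # ~1e-15 for well-conditioned A
```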

Page 10:

A New Generation of DLA Software

Software/algorithms follow hardware evolution in time:

LINPACK (70's), vector operations: rely on Level-1 BLAS operations
LAPACK (80's), blocking, cache friendly: rely on Level-3 BLAS operations
ScaLAPACK (90's), distributed memory: rely on PBLAS, message passing
PLASMA, new algorithms (many-core friendly): rely on a DAG/scheduler, block data layout, some extra kernels
MAGMA, hybrid algorithms (heterogeneity friendly): rely on a hybrid scheduler and hybrid kernels
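The LINPACK-to-LAPACK row is the key historical pivot, and it is easy to make concrete. The sketch below (plain NumPy, illustrative only) computes the same rank-k update two ways: as a stream of Level-1 AXPY operations, which reuse each loaded word O(1) times, and as one Level-3 GEMM, which a blocked implementation can reuse O(block size) times:

```python
import numpy as np

def rank_k_level1(C, A, B):
    # LINPACK-style: a stream of vector operations, O(1) reuse per word
    m, k = A.shape
    for j in range(C.shape[1]):
        for l in range(k):
            C[:, j] += B[l, j] * A[:, l]        # one daxpy per (j, l)

def rank_k_level3(C, A, B):
    # LAPACK-style: one matrix-matrix multiply, O(block) reuse per word
    C += A @ B                                  # dgemm

rng = np.random.default_rng(1)
A, B = rng.standard_normal((256, 64)), rng.standard_normal((64, 256))
C1, C2 = np.zeros((256, 256)), np.zeros((256, 256))
rank_k_level1(C1, A, B)
rank_k_level3(C2, A, B)
print(np.allclose(C1, C2))   # True: same math, very different memory traffic
```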


Page 12:

Accelerating Dense Linear Algebra with GPUs

[Performance plot: GFlop/s (0 to 240) vs. matrix size (1,024 to 9,088), GPU (MAGMA) vs. CPU (LAPACK).]

LU factorization in double precision [for solving a dense linear system]:
• GPU system: NVIDIA Fermi GPU (CUDA cores) plus a quad-core Intel host (model and clock rates unreadable in this extraction)
• CPU system: AMD ISTANBUL, 8 sockets x 6 cores (48 cores)
• The peak, system cost (~$), and power (~W) figures for the two systems are unreadable here; the comparison favored the GPU system on price and power at a similar peak.

* Computation consumed power rate (total system rate minus idle rate), measured with KILL A WATT PS, Model P430.

Page 13:

Parallelization of QR Factorization

Parallelize the update:
• Easy, and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOPs count.
• Can be done "efficiently" with LAPACK + multithreaded BLAS.

[Figure: one step of blocked QR. Panel factorization (dgeqf2 + dlarft) produces V and R for the current panel; the update of the remaining submatrix (dlarfb, built on dgemm) applies the block reflector. Fork-join parallelism: bulk synchronous processing.]
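A minimal NumPy sketch of this structure, with scipy.linalg.qr standing in for the dgeqf2/dlarft panel factorization and the Q^T product playing the role of the dlarfb update of the remaining submatrix. The update is where the 2/3 n^3 flops live, and it is pure matrix multiply:

```python
import numpy as np
from scipy.linalg import qr

def blocked_qr_R(A, nb=64):
    """Right-looking blocked QR returning the R factor (assumes m >= n)."""
    A = A.copy()
    m, n = A.shape
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # panel factorization: the sequential bottleneck
        Q, R = qr(A[j:, j:j+jb], mode='economic')
        A[j:, j:j+jb] = 0.0
        A[j:j+jb, j:j+jb] = R
        # update of the remaining submatrix: one big GEMM-rich product
        A[j:, j+jb:] = Q.T @ A[j:, j+jb:]
    return np.triu(A[:n, :])

rng = np.random.default_rng(2)
A = rng.standard_normal((1024, 512))
R = blocked_qr_R(A, nb=128)
print(np.allclose(R.T @ R, A.T @ A))   # R'R = A'A, so R is a valid R factor
```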

Page 14:

Synchronization (in LAPACK LU)

• Fork-join, bulk synchronous processing

[Execution trace: cores go idle at the synchronization point after each panel factorization and each trailing-matrix update.]

The alternative: allowing for delayed update and out-of-order, asynchronous, dataflow execution.

Page 15:

PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures

• Objectives
  - High utilization of each core
  - Scaling to large numbers of cores
  - Synchronization-reducing algorithms
• Methodology
  - Dynamic DAG scheduling (QUARK)
  - Explicit parallelism
  - Implicit communication
  - Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling

[Figure: execution traces over time contrasting fork-join parallelism with DAG-scheduled parallelism, plus the task DAG of a 6 x 6 tile Cholesky factorization made up of POTRF, TRSM, SYRK, and GEMM nodes; see the sketch below.]
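To make the DAG concrete, here is a minimal sketch of the tile Cholesky algorithm it is drawn from. Each call below is one task node (POTRF, TRSM, SYRK, GEMM); the sequential loop order is just one valid schedule, and a dynamic scheduler such as QUARK is free to reorder any tasks whose tile read/write dependencies allow it:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tile_cholesky(A, nb):
    """In-place tile Cholesky; returns the lower factor L (A is n x n, nb | n)."""
    n = A.shape[0]
    t = n // nb
    T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]   # view of tile (i, j)
    for k in range(t):
        T(k, k)[:] = cholesky(T(k, k), lower=True)                 # POTRF
        for i in range(k+1, t):                                    # panel solves
            T(i, k)[:] = solve_triangular(T(k, k), T(i, k).T,      # TRSM
                                          lower=True).T
        for i in range(k+1, t):                                    # trailing update
            T(i, i)[:] -= T(i, k) @ T(i, k).T                      # SYRK
            for j in range(k+1, i):
                T(i, j)[:] -= T(i, k) @ T(j, k).T                  # GEMM
    return np.tril(A)

rng = np.random.default_rng(3)
M = rng.standard_normal((240, 240))
A = M @ M.T + 240 * np.eye(240)          # symmetric positive definite
L = tile_cholesky(A.copy(), nb=40)       # 6 x 6 grid of tiles, as in the figure
print(np.allclose(L @ L.T, A))           # True
```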

Page 16:

Pipelining: Cholesky Inversion

3 steps: Factor (POTRF), Invert L (TRTRI), Multiply L's (LAUUM); sketched below.

[Execution traces on a multicore machine, tiled matrix (core count and matrix/tile sizes unreadable in this extraction): POTRF, TRTRI and LAUUM run step by step, Cholesky factorization alone for reference, and the pipelined schedule interleaving all three steps.]
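The three steps map directly onto LAPACK routines, reachable here through SciPy's LAPACK wrappers. A minimal non-pipelined sketch (PLASMA's contribution is interleaving the tile-level tasks of all three steps into a single DAG; this just runs them back to back):

```python
import numpy as np
from scipy.linalg.lapack import dpotrf, dtrtri, dlauum

def cholesky_inverse(A):
    """Invert a symmetric positive definite A via the 3-step pipeline."""
    L, info = dpotrf(A, lower=1)    # step 1: factor    A = L L^T
    W, info = dtrtri(L, lower=1)    # step 2: invert    W = L^{-1}
    X, info = dlauum(W, lower=1)    # step 3: multiply  W^T W = A^{-1} (lower part)
    return np.tril(X) + np.tril(X, -1).T   # mirror the lower triangle

rng = np.random.default_rng(4)
M = rng.standard_normal((300, 300))
A = M @ M.T + 300 * np.eye(300)
print(np.allclose(cholesky_inverse(A) @ A, np.eye(300)))   # True
```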

Page 17:

Communication Avoiding Algorithms

• Goal: algorithms that communicate as little as possible.
• J. Demmel, L. Grigori, and colleagues have been working on algorithms that attain a provable minimum of communication.
• Direct methods (BLAS, LU, QR, SVD, other decompositions)
  - Communication lower bounds for all these problems
  - Algorithms that attain them (all dense linear algebra, some sparse)
• Iterative methods: Krylov subspace methods for Ax=b, Ax=λx
  - Communication lower bounds, and algorithms that attain them (depending on sparsity structure)
• For QR factorization they can show the bound sketched below:
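The result referenced in the last bullet is the classical communication lower bound (Ballard, Demmel, Holtz, Schwartz), stated for an algorithm doing the usual O(n^3) work on a machine with fast memory of size M:

```latex
% Lower bounds on data movement for O(n^3)-style dense linear algebra,
% with fast memory (cache / local memory) of size M:
\[
  \#\text{words moved} \;=\; \Omega\!\left(\frac{\#\text{flops}}{\sqrt{M}}\right),
  \qquad
  \#\text{messages} \;=\; \Omega\!\left(\frac{\#\text{flops}}{M^{3/2}}\right).
\]
```

The tree-based CAQR/TSQR algorithm on the following slides attains both bounds, while conventional blocked QR attains the bandwidth bound but not the latency (message-count) bound.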

Page 18:

Communication Avoiding QR Example

[Figure: TSQR-style reduction tree. The tall matrix is split row-wise into domains D0-D3; each domain is factored independently (Domain_Tile_QR) yielding R0-R3; the R factors are then merged pairwise (R0,R1 -> R0 and R2,R3 -> R2), and finally R0,R2 -> R, the R factor of the whole matrix.]

[A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA, Jan. 1988. ACM. (Volume and page numbers unreadable in this extraction.)]
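A minimal NumPy sketch of the reduction pictured above: factor each domain independently, then merge R factors pairwise up a binary tree. Only the small R factors ever move between domains, which is where the communication saving comes from. Q is left implicit; the sketch returns R and assumes a power-of-two number of domains:

```python
import numpy as np
from scipy.linalg import qr

def tsqr_R(A, domains=4):
    blocks = np.array_split(A, domains, axis=0)
    Rs = [qr(D, mode='economic')[1] for D in blocks]   # Domain_Tile_QR: D_i -> R_i
    while len(Rs) > 1:                                 # R0,R1 -> R0; R2,R3 -> R2; ...
        Rs = [qr(np.vstack(pair), mode='economic')[1]
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

rng = np.random.default_rng(5)
A = rng.standard_normal((4096, 32))      # tall and skinny: the TSQR sweet spot
R = tsqr_R(A)
print(np.allclose(R.T @ R, A.T @ A))     # same R factor as a flat QR (up to signs)
```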


Page 23:

Communication Avoiding QR Example

Quad-socket, quad-core machine, Intel Xeon EMT64 at … GHz. Theoretical peak is ~… Gflops with 16 cores. Matrix size: … by …. (The numeric values are unreadable in this extraction; see the peak arithmetic below.)
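For reference, this is how a theoretical peak for such a node is computed. The 2.40 GHz clock and 4 double-precision flops per cycle per core are assumptions typical of that processor generation, since the exact figures are unreadable above:

```python
# Theoretical peak of a quad-socket, quad-core node.
sockets, cores_per_socket = 4, 4
clock_ghz = 2.40        # assumed: exact clock unreadable in the extraction
flops_per_cycle = 4     # assumed: 128-bit SSE, 2 adds + 2 muls per cycle

peak_gflops = sockets * cores_per_socket * clock_ghz * flops_per_cycle
print(peak_gflops)      # 153.6 Gflop/s for 16 cores under these assumptions
```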