Satisfying Your Dependencies with SuperMatrix

Ernie Chan
Cluster 2007, September 17-20, 2007 (36 slides)


TRANSCRIPT

Page 1: Satisfying Your Dependencies with SuperMatrix


Satisfying Your Dependencies with SuperMatrix

Ernie Chan

Page 2: Satisfying Your Dependencies with SuperMatrix

Motivation

- Transparent parallelization of matrix operations for SMP and multi-core architectures
- Schedule submatrix operations out-of-order via dependency analysis
- Programmability: high-level abstractions hide the details of parallelization from the user

Page 3: Satisfying Your Dependencies with SuperMatrix

Outline

- SuperMatrix
- Implementation
- Performance Results
- Conclusion

Page 4: Satisfying Your Dependencies with SuperMatrix


SuperMatrix

Page 5: Satisfying Your Dependencies with SuperMatrix


SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------------*/
    FLA_LU_nopiv( A11 );

    FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );

    FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );

    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }
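The loop body is the standard blocked right-looking LU factorization. Writing the current partition's blocks as A11, A12, A21, A22, each iteration performs:

```latex
\begin{align*}
A_{11} &\rightarrow L_{11} U_{11}
  && \text{(FLA\_LU\_nopiv)} \\
A_{12} &\leftarrow L_{11}^{-1} A_{12}
  && \text{(FLA\_Trsm: left, lower triangular, unit diagonal)} \\
A_{21} &\leftarrow A_{21} U_{11}^{-1}
  && \text{(FLA\_Trsm: right, upper triangular)} \\
A_{22} &\leftarrow A_{22} - A_{21} A_{12}
  && \text{(FLA\_Gemm)}
\end{align*}
```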

Page 6: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

LU Factorization Without Pivoting: Iteration 1

[Figure: DAG for iteration 1, consisting of one LU task on A11, TRSM tasks on the blocks of A12 and A21, and GEMM tasks updating the blocks of A22]

Page 7: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

LU Factorization Without Pivoting: Iteration 2

[Figure: DAG for iteration 2, consisting of an LU task, TRSM tasks, and a GEMM task on the remaining blocks]

Page 8: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

LU Factorization Without Pivoting: Iteration 3

[Figure: the final iteration consists of a single LU task]

Page 9: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

FLASH: store the matrix hierarchically as a matrix of matrices, so each submatrix block is a unit of data

Page 10: Satisfying Your Dependencies with SuperMatrix


SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*------------------------------------------------------------------*/
    FLASH_LU_nopiv( A11 );

    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );

    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );

    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  FLASH_Queue_exec( );

Page 11: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks

Page 12: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

Dispatcher
- Use the DAG to execute tasks out-of-order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
- SuperScalar vs. SuperMatrix

Page 13: Satisfying Your Dependencies with SuperMatrix

SuperMatrix

Dispatcher example: 4 threads, 5 x 5 matrix of blocks, 55 tasks, 18 stages

[Figure: the DAG of LU, TRSM, and GEMM tasks for the blocked LU factorization, grouped into stages]

Page 14: Satisfying Your Dependencies with SuperMatrix

Outline

- SuperMatrix
- Implementation
- Performance Results
- Conclusion

Page 15: Satisfying Your Dependencies with SuperMatrix

Implementation

Analyzer

[Figure: the analyzer's task queue of LU, TRSM, and GEMM tasks alongside the resulting DAG of tasks]

Page 16: Satisfying Your Dependencies with SuperMatrix

Implementation

Analyzer
- FLASH routines enqueue tasks onto a global task queue
- Dependencies between tasks are calculated and stored in the task structure
- Each submatrix block stores the last task enqueued that writes to it
- Flow dependencies occur when a subsequent task reads that block
- The DAG is embedded in the task queue

Page 17: Satisfying Your Dependencies with SuperMatrix

Implementation

Dispatcher

[Figure: threads dequeue ready tasks from the waiting queue while the task queue holds the remaining LU, TRSM, and GEMM tasks]

Page 18: Satisfying Your Dependencies with SuperMatrix

Implementation

Dispatcher
- Place ready and available tasks on a global waiting queue
- The first task on the task queue is always ready and available
- Threads asynchronously dequeue tasks from the head of the waiting queue
- Once a task completes execution, notify dependent tasks and update the waiting queue
- Loop until all tasks complete execution

Page 19: Satisfying Your Dependencies with SuperMatrix

Outline

- SuperMatrix
- Implementation
- Performance Results
- Conclusion

Page 20: Satisfying Your Dependencies with SuperMatrix

Performance Results

Target Architectures

  Architecture   Processing Elements   Peak (GFLOPS)   BLAS Library
  Itanium2       16                    96.0            MKL 8.1
  Xeon           8                     41.6            MKL 9.0
  Opteron        8                     41.6            ACML 3.6
  POWER5         8                     60.8            ESSL 4.2

Page 21: Satisfying Your Dependencies with SuperMatrix

Performance Results

GotoBLAS 1.13 installed on all machines

Supported Operations
- LAPACK-level functions: Cholesky factorization, LU factorization without pivoting
- All level-3 BLAS: GEMM, TRMM, TRSM; SYMM, SYRK, SYR2K; HEMM, HERK, HER2K

Page 22: Satisfying Your Dependencies with SuperMatrix

Performance Results

Implementations
- SuperMatrix + serial BLAS
- FLAME + multithreaded BLAS
- LAPACK + multithreaded BLAS

Block size = 192
Processing elements = 8

Page 23: Satisfying Your Dependencies with SuperMatrix

Performance Results

SuperMatrix Implementation
- Fixed block size
  - Varying block sizes can lead to better performance; experiments show 192 is generally the best
- Simplest scheduling
  - No sorting to execute tasks on the critical path earlier
  - No attempt to improve data locality in these experiments

Pages 24-29: Satisfying Your Dependencies with SuperMatrix

Performance Results

[Performance graphs not reproduced in the transcript]

Page 30: Satisfying Your Dependencies with SuperMatrix

Outline

- SuperMatrix
- Implementation
- Performance Results
- Conclusion

Page 31: Satisfying Your Dependencies with SuperMatrix

Conclusion

- Apply out-of-order execution techniques to schedule tasks
- The whole is greater than the sum of the parts: exploit parallelism between operations
- Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties

Page 32: Satisfying Your Dependencies with SuperMatrix

Conclusion

Programmability
- Code at a high level without needing to deal with aspects of parallelization

Page 33: Satisfying Your Dependencies with SuperMatrix

Authors

Ernie Chan
Field G. Van Zee
Enrique S. Quintana-Ortí
Gregorio Quintana-Ortí
Robert van de Geijn

The University of Texas at Austin
Universidad Jaume I

Page 34: Satisfying Your Dependencies with SuperMatrix

Acknowledgements

We thank the Texas Advanced Computing Center (TACC) for access to their machines and their support.

Funding: NSF Grants
- CCF-0540926
- CCF-0702714

Page 35: Satisfying Your Dependencies with SuperMatrix


References

[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In SPAA ‘07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.

[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.

[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.

Page 36: Satisfying Your Dependencies with SuperMatrix


Conclusion

More Information

http://www.cs.utexas.edu/users/flame

Questions?

[email protected]