
Page 1: Dense Linear Algebra (Data Distributions)

Dense Linear Algebra (Data Distributions)

Sathish Vadhiyar

Page 2: Dense Linear Algebra (Data Distributions)

Gaussian Elimination - Review

Version 1

    for each column i, zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        for each row j below row i
        for j = i+1 to n
            add a multiple of row i to row j
            for k = i to n
                A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)

[Figure: snapshot of the matrix at step i – zeros below the diagonal in the columns already eliminated, the pivot A(i,i), and the entries of row i and row j (columns k) involved in the update]

Page 3: Dense Linear Algebra (Data Distributions)

Gaussian Elimination - Review

Version 2 – Remove A(j, i)/A(i, i) from the inner loop

    for each column i, zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        for each row j below row i
        for j = i+1 to n
            m = A(j, i) / A(i, i)
            for k = i to n
                A(j, k) = A(j, k) - m * A(i, k)


Page 4: Dense Linear Algebra (Data Distributions)

Gaussian Elimination - Review

Version 3 – Don't compute what we already know

    for each column i, zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        for each row j below row i
        for j = i+1 to n
            m = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - m * A(i, k)


Page 5: Dense Linear Algebra (Data Distributions)

Gaussian Elimination - Review

Version 4 – Store multipliers m below the diagonal

    for each column i, zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        for each row j below row i
        for j = i+1 to n
            A(j, i) = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - A(j, i) * A(i, k)

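As a quick sanity check (not from the original slides), a minimal NumPy sketch of Version 4, assuming the matrix needs no pivoting and all pivots are nonzero:

    import numpy as np

    def ge_version4(A):
        """In-place LU without pivoting: multipliers are stored below the
        diagonal (unit-diagonal L implied), the upper triangle holds U."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # multipliers A(j,i)/A(i,i)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # update of the trailing block
        return A

    # Usage: verify that the stored factors reproduce the original matrix
    n = 5
    rng = np.random.default_rng(0)
    M = rng.random((n, n)) + n * np.eye(n)      # diagonally dominant, so no pivoting needed
    F = ge_version4(M)
    L = np.tril(F, -1) + np.eye(n)              # unit lower triangular factor
    U = np.triu(F)
    assert np.allclose(L @ U, M)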

Page 6: Dense Linear Algebra (Data Distributions)

GE - Runtime

    Divisions:                       1 + 2 + 3 + … + (n-1) ≈ n²/2
    Multiplications / subtractions:  1² + 2² + 3² + … + (n-1)² ≈ n³/3 of each
    Total:                           ≈ 2n³/3
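To see where these counts come from (an added illustration, not on the slide): step i performs (n-i) divisions and (n-i)² multiplications plus (n-i)² subtractions, so the exact sums can be checked against the n²/2 and 2n³/3 estimates with a short Python snippet:

    # Count the operations of Versions 3/4 exactly and compare with the estimates.
    def ge_op_counts(n):
        divisions = sum(n - i for i in range(1, n))             # one per row below the diagonal
        mults = subs = sum((n - i) ** 2 for i in range(1, n))   # (n-i) rows x (n-i) columns per step
        return divisions, mults + subs

    for n in (10, 100, 1000):
        d, ms = ge_op_counts(n)
        print(n, d, n**2 / 2, ms, 2 * n**3 / 3)   # exact counts vs. n²/2 and 2n³/3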

Page 7: Dense Linear Algebra (Data Distributions)

Parallel GE

1st approach – 1-D block partitioning: the n columns are divided among the p processors in contiguous blocks of n/p columns each


Page 8: Dense Linear Algebra (Data Distributions)

1D block partitioning - Steps

    1. Divisions:                         n²/2
    2. Broadcast (of the multipliers):    x·log(p) + y·log(p-1) + z·log(p-2) + … + log 1  <  n²·log p
    3. Multiplications and subtractions:  (n-1)·n/p + (n-2)·n/p + … + 1·1  ≈  n³/p

Runtime:  <  n²/2 + n²·log p + n³/p
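A hypothetical sketch of these three steps, assuming mpi4py, a matrix size divisible by p, diagonal dominance (so no pivoting), and the same random seed on every rank so each processor can slice out its own column block:

    # Sketch of GE with 1-D block column partitioning (run with mpiexec -n <p>).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()

    n = 8                                              # assume n is divisible by p
    nb = n // p                                        # columns per processor
    rng = np.random.default_rng(0)                     # same seed everywhere -> same global matrix
    A_global = rng.random((n, n)) + n * np.eye(n)      # diagonally dominant, no pivoting needed
    A = A_global[:, rank*nb:(rank+1)*nb].copy()        # this processor's block of columns

    for i in range(n - 1):
        owner = i // nb                                # processor that owns column i
        if rank == owner:
            c0 = i - owner*nb                          # local index of column i
            A[i+1:, c0] /= A[i, c0]                    # Step 1: divisions (local to the owner)
            m = A[i+1:, c0].copy()
        else:
            m = None
        m = comm.bcast(m, root=owner)                  # Step 2: broadcast of the multipliers
        for c in range(nb):                            # Step 3: multiplications and subtractions
            if rank*nb + c > i:                        # only columns to the right of column i
                A[i+1:, c] -= m * A[i, c]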

Page 9: Dense Linear Algebra (Data Distributions)

2-D block

To speed up the divisions

[Figure: the same elimination snapshot, with the matrix laid out on a P x Q grid of processors]

Page 10: Dense Linear Algebra (Data Distributions)

2D block partitioning - Steps

    1. Broadcast of the pivot element (k, k):  log Q
    2. Divisions:                              n²/Q (approx.)
    3. Broadcast of multipliers:               x·log(P) + y·log(P-1) + z·log(P-2) + …  =  (n²/Q)·log P
    4. Multiplications and subtractions:       n³/(P·Q) (approx.)
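Taking the pivot-element broadcast once per elimination step and the other three costs as totals over the whole factorization (my reading of the slide, not stated there explicitly), the overall runtime is roughly n·log Q + n²/Q + (n²/Q)·log P + n³/(P·Q).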

Page 11: Dense Linear Algebra (Data Distributions)

Problem with block partitioning for GE

Once a block is finished, the corresponding processor remains idle for the rest of the execution

Solution? – onto cyclic distributions (next slide)

Page 12: Dense Linear Algebra (Data Distributions)

Onto cyclic

The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.

[Figure: columns dealt out to processors 0 1 2 3 0 1 2 3 … in a cyclic fashion]

    Cyclic:            load balance
    1-D block-cyclic:  load balance and block operations, but column factorization bottleneck
    2-D block-cyclic:  has everything
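To make the mappings concrete (an added sketch, not from the slide), the following Python functions map a global column index to its owning processor under block, cyclic, and 1-D block-cyclic distributions; a 2-D block-cyclic distribution applies the block-cyclic rule independently to block rows and block columns.

    # Column -> processor mappings for n columns, p processors, block size b.
    def block_owner(col, n, p):
        return col // (n // p)            # contiguous blocks of n/p columns

    def cyclic_owner(col, p):
        return col % p                    # columns dealt out one at a time

    def block_cyclic_owner(col, p, b):
        return (col // b) % p             # blocks of b columns dealt out cyclically

    n, p, b = 16, 4, 2
    print([block_owner(c, n, p) for c in range(n)])         # 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
    print([cyclic_owner(c, p) for c in range(n)])           # 0 1 2 3 0 1 2 3 ...
    print([block_cyclic_owner(c, p, b) for c in range(n)])  # 0 0 1 1 2 2 3 3 0 0 1 1 ...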

Page 13: Dense Linear Algebra (Data Distributions)

Block cyclic

Having blocks in a processor can lead to block-based operations (block matrix multiply, etc.)

Block-based operations lead to high performance

Page 14: Dense Linear Algebra (Data Distributions)

GE: Miscellaneous – GE with Partial Pivoting

1D block-column partitioning: which is better, column or row pivoting?

• Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1)

• The exchange information is passed to the other processes by piggybacking it with the multiplier information

• Row pivoting involves a distributed search and exchange: O(n/P) + O(log P)

2D block partitioning: the pivot search can be restricted to a limited number of columns
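For reference (added here, not on the slide), a serial NumPy sketch of LU with partial pivoting; the argmax over column i is the pivot search whose cost is discussed above, and the row swap is the exchange step:

    import numpy as np

    def lu_partial_pivot(A):
        """In-place LU with partial pivoting: returns the packed factors and the row permutation."""
        A = A.astype(float).copy()
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            r = i + np.argmax(np.abs(A[i:, i]))   # pivot search down column i: O(n-i) work
            if r != i:
                A[[i, r], :] = A[[r, i], :]       # row exchange (touches every column)
                piv[[i, r]] = piv[[r, i]]
            A[i+1:, i] /= A[i, i]                 # multipliers stored below the diagonal
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A, piv

    M = np.random.default_rng(1).random((5, 5))
    F, piv = lu_partial_pivot(M)
    L, U = np.tril(F, -1) + np.eye(5), np.triu(F)
    assert np.allclose(L @ U, M[piv])             # P·M = L·U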

Page 15: Dense Linear Algebra (Data Distributions)

Triangular Solve – Unit Upper Triangular Matrix

Sequential complexity: O(n²)

Complexity of the parallel algorithm with 2D block partitioning (√P x √P processor grid): O(n²/√P)

Thus (parallel GE / parallel TS) < (sequential GE / sequential TS)

Overall (GE + TS): O(n³/P)
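As an illustration of the O(n²) sequential cost (my addition, not on the slide), back substitution with a unit upper triangular matrix reads each entry above the diagonal exactly once:

    import numpy as np

    def unit_upper_solve(U, b):
        """Solve U x = b where U is unit upper triangular (1s assumed on the diagonal)."""
        n = len(b)
        x = b.astype(float).copy()
        for i in range(n - 1, -1, -1):
            x[i] -= U[i, i+1:] @ x[i+1:]    # inner product of length n-i-1 => ~n²/2 flops total
        return x

    n = 6
    U = np.triu(np.random.default_rng(2).random((n, n)), 1) + np.eye(n)
    b = np.arange(1.0, n + 1)
    assert np.allclose(U @ unit_upper_solve(U, b), b)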

Page 16: Dense Linear Algebra (Data Distributions)

Dense LU on GPUs

Page 17: Dense Linear Algebra (Data Distributions)

LU for Hybrid Multicore + GPU Systems (Tomov et al., Parallel Computing, 2010)

Assume the CPU host has 8 cores and an N x N matrix divided into blocks of size NB.

The matrix is split so that the first N - 7·NB columns are in GPU memory and the last 7·NB columns are on the host.

Page 18: Dense Linear Algebra (Data Distributions)

Load Splitting for Hybrid LU

Page 19: Dense Linear Algebra (Data Distributions)

Steps

• The current panel is downloaded to the CPU (the dark blue part in the figure)

• The panel is factored on the CPU and the result is sent to the GPU to update the trailing submatrix (the red part)

• The GPU updates the first NB columns of the trailing submatrix

• The updated panel is sent to the CPU and asynchronously factored on the CPU while the GPU updates the rest of the trailing submatrix

• The remaining 7 host cores update the last 7·NB host columns
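The blocked right-looking structure behind these steps can be sketched in plain NumPy as below (an illustration only: single process, no pivoting, no actual GPU code); the comments mark which part the hybrid scheme assigns to the CPU panel factorization and which to the GPU trailing update.

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, nb):
        """Blocked right-looking LU without pivoting (structure only)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # "Panel factored on CPU": unblocked LU of the tall panel A[k:, k:e]
            for j in range(k, e):
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
            if e < n:
                # Trailing-submatrix update (the part offloaded to the GPU in the hybrid scheme)
                L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
                A[k:e, e:] = solve_triangular(L11, A[k:e, e:], lower=True, unit_diagonal=True)
                A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    n, nb = 8, 3
    M = np.random.default_rng(3).random((n, n)) + n * np.eye(n)
    F = blocked_lu(M, nb)
    L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
    assert np.allclose(L @ U, M)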

Page 20: Dense Linear Algebra (Data Distributions)

Dense Cholesky on GPUs

Page 21: Dense Linear Algebra (Data Distributions)

Cholesky Factorization

Symmetric positive definite matrix A: A = L·Lᵀ

Similar to Gaussian elimination: a succession of block column factorizations followed by updates of the trailing submatrix

Can be parallelized using a fork-join model similar to parallel GE: matrix-matrix multiplications in parallel (fork), with synchronization needed before the next block column factorization (join)
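A compact NumPy sketch of this succession of block column factorizations and trailing updates (added for illustration; serial, so the fork-join parallelism is not shown):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_cholesky(A, nb):
        """Blocked Cholesky A = L·Lᵀ: factor a diagonal block, solve for the
        block column below it, then update the trailing submatrix."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])       # block column factorization
            L11 = A[k:e, k:e]
            if e < n:
                A[e:, k:e] = solve_triangular(L11, A[e:, k:e].T, lower=True).T  # L21 = A21·L11⁻ᵀ
                A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T          # trailing-submatrix update (the "fork")
        return np.tril(A)

    n, nb = 8, 3
    B = np.random.default_rng(4).random((n, n))
    S = B @ B.T + n * np.eye(n)                                 # symmetric positive definite
    L = blocked_cholesky(S, nb)
    assert np.allclose(L @ L.T, S)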

Page 22: Dense Linear Algebra (Data Distributions)

Heterogeneous Tile Algorithms (Song et al., ICS 2012)

Divide a matrix into a set of small and large tiles: small tiles for the CPUs, large tiles for the GPUs

At the top level, divide the matrix into large square tiles of size B x B

Then subdivide each top-level tile into a number of small rectangular tiles of size B x b and a remaining tile

Page 23: Dense Linear Algebra (Data Distributions)

Heterogeneous Tile Cholesky Factorization

Factor the first tile to solve L(1,1)

Apply L(1,1) to update the tile to its right, A(1,2)

Page 24: Dense Linear Algebra (Data Distributions)

Heterogeneous Tile Cholesky Factorization

Factor the two tiles below L(1,1)

Update all tiles to the right of the first tile column

Page 25: Dense Linear Algebra (Data Distributions)

Heterogeneous Tile Cholesky Factorization

At the second iteration, repeat the above steps on the trailing submatrix starting from the second tile column

Page 26: Dense Linear Algebra (Data Distributions)

Two Level Block-Cyclic Distribution

Divide a matrix A into p x (s·p) rectangular tiles as discussed earlier, i.e., the matrix is first divided into p x p large tiles, and each large tile is then partitioned into s tiles

On a hybrid CPU and GPU machine, allocate the tiles to the host and to P GPUs in a 1-D block-cyclic way

Those columns whose indices are multiples of s are mapped to the P GPUs in a cyclic way, and the remaining columns go to all CPU cores in the host
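A small sketch of this mapping (my reading of the scheme; the 1-based tile-column indexing and the device names are illustrative assumptions):

    # Map a tile-column index (1-based) to a device under the two-level scheme:
    # every s-th column goes to one of the P GPUs cyclically, the rest to the host CPUs.
    def tile_column_device(col, s, P):
        if col % s == 0:
            return "gpu%d" % ((col // s - 1) % P)
        return "host"

    s, P, p = 4, 2, 3                      # s small tiles per large tile, P GPUs, p x (s·p) tiles
    print([tile_column_device(c, s, P) for c in range(1, s * p + 1)])
    # ['host', 'host', 'host', 'gpu0', 'host', 'host', 'host', 'gpu1',
    #  'host', 'host', 'host', 'gpu0']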

Page 27: Dense Linear Algebra (Data Distributions)

Two Level Block-Cyclic Distribution - Example

Page 28: Dense Linear Algebra (Data Distributions)

Tuning Tile Size

Depends on the relative performance of the CPU and GPU cores

For example, to determine the tile width Bh for the CPU:

Page 29: Dense Linear Algebra (Data Distributions)

References

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, and W. M. Hu. Faster, Cheaper, Better: a Hybridization Methodology to Develop Linear Algebra Software for GPUs. GPU Computing Gems, 2, 2010.

F. Song, S. Tomov, and J. Dongarra. Enabling and Scaling Matrix Computations on Heterogeneous Multi-core and Multi-GPU Systems. In Proceedings of the ACM International Conference on Supercomputing (ICS), pages 365–376, 2012.