CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 2
Prof. Thomas Sterling, Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 18, 2011
Puzzle of the Day
• Some nice ways to get something different from what was intended:
2
if (a = 0) { … }      /* a always equals 0, but block will never be executed */
if (0 < a < 5) { … }  /* this "boolean" is always true! [think: ((0 < a) < 5)] */
if (a =! 0) { … }     /* a always equal to 1, as this is compiled as (a = !0), an assignment, rather than (a != 0) or (a == !0) */
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
3
5
Parallel Matrix Processing & Locality
• Maximize locality
  – Spatial locality
    • Variable likely to be used if neighbor data is used
    • Exploits unit or uniform stride access patterns
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
      – Depends on volume to surface ratio
  – Temporal locality
    • Variable likely to be reused if already recently used
    • Exploits cache loads and LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages
6
Array Decomposition
• Simple MPI Example
• Master-Worker Data Partitioning and Distribution
  – Array decomposition
  – Uniformly distributes parts of array among workers (and master)
  – A kind of static load balancing
    • Assumes equal work on equal data set sizes
• Demonstrates
  – Data partitioning
  – Data distribution
  – Coarse grain parallel execution
    • No communication between tasks
  – Reduction operator
  – Master-worker control model
7
Array Decomposition Layout
• Dimensions
  – 1 dimension: linear (dot product)
  – 2 dimensions: "2-D" (matrix operations)
  – 3 dimensions (higher order models)
  – Impacts surface to volume ratio for inter-process communication
• Distribution
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
• Memory layout
  – C vs. FORTRAN
8
Array Decomposition
[diagram: complete array partitioned into per-task chunks; the sum from each part is accumulated into the final result]
9
Array Decomposition
Demonstrate simple data decomposition:
– Master initializes the array and then distributes an equal portion of the array among the other tasks.
– The other tasks receive their portion of the array and perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.
10
Flowchart for Array Decomposition ("master" / "workers"):
  All tasks: Initialize MPI environment
  Master: Initialize array
  Master: Partition array into workloads
  Master: Send workloads to "workers"
  Workers: Receive work
  All tasks: Calculate sum for own array chunk
  Workers: Send sum to master
  Master: Receive results
  All tasks: Reduction operator to sum up results
  Master: Print results
  End
11
Array Decomposition (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define ARRAYSIZE 16000000
#define MASTER 0

float data[ARRAYSIZE];

int main (int argc, char **argv)
{
  int numtasks, taskid, dest, offset, i, j, tag1, tag2, source, chunksize;
  float mysum, sum;
  float update(int myoffset, int chunk, int myid);
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks % 4 != 0) {
    printf("Quitting. Number of MPI tasks must be divisible by 4.\n"); /* for equal distribution of workload */
    MPI_Abort(MPI_COMM_WORLD, 1); /* fixed: the original passed the uninitialized variable rc as the error code */
    exit(0);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  printf("MPI task %d has started...\n", taskid);

  chunksize = (ARRAYSIZE / numtasks);
  tag2 = 1;
  tag1 = 2;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Workload to be processed by each processor
12
Array Decomposition (source code)

  if (taskid == MASTER) {
    sum = 0;
    for (i = 0; i < ARRAYSIZE; i++) {
      data[i] = i * 1.0;
      sum = sum + data[i];
    }
    printf("Initialized array sum = %e\n", sum);

    offset = chunksize;
    for (dest = 1; dest < numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }

    offset = 0;
    mysum = update(offset, chunksize, taskid);

    for (i = 1; i < numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    }
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Initialize array
Array[0] -> Array[offset-1] is processed by master
Send workloads to respective processors
Master computes local sum
Master receives summation computed by workers
13
Array Decomposition (source code)

    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

    printf("Sample results: \n");
    offset = 0;
    for (i = 0; i < numtasks; i++) {
      for (j = 0; j < 5; j++)
        printf(" %e", data[offset+j]);
      printf("\n");
      offset = offset + chunksize;
    }
    printf("*** Final sum= %e ***\n", sum);
  } /* end of master section */

  if (taskid > MASTER) {
    /* Receive my portion of array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);

    mysum = update(offset, chunksize, taskid);

    /* Send my results back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  } /* end of non-master */
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Master computes the SUM of all workloads
Worker processes receive work chunks from master
Each worker computes local sum
Send local sum to master process
14
Array Decomposition (source code)
MPI_Finalize();
} /* end of main */
float update(int myoffset, int chunk, int myid) {
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for (i = myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n", myid, mysum);
  return(mysum);
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
15
Demo : Array Decomposition
[lsu00@master array_decomposition]$ mpiexec -np 4 ./array
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
 0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
 8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
 1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
 2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***
Output from arete for a 4 processor run.
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
16
Matrix Transpose
• The transpose of the (m × n) matrix A is the (n × m) matrix A^T formed by interchanging the rows and columns, such that row i of A becomes column i of the transposed matrix:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\qquad
A^T = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix}

• Examples:

A = \begin{pmatrix} 1 & 3 & 4 \\ 0 & 1 & 0 \end{pmatrix} \qquad A^T = \begin{pmatrix} 1 & 0 \\ 3 & 1 \\ 4 & 0 \end{pmatrix}

A = \begin{pmatrix} 1 & 3 \\ 2 & 5 \end{pmatrix} \qquad A^T = \begin{pmatrix} 1 & 2 \\ 3 & 5 \end{pmatrix}
17
Matrix Transpose - OpenMP
18
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  /* Initialize source matrix */
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }

  /* Initialize results matrix */
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }
Matrix Transpose - OpenMP
19
  /* Perform transpose in parallel using omp parallel for */
  #pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }

  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}
Matrix Transpose – OpenMP (DEMO)
20
[LSU760000@n01 matrix_transpose]$ ./omp_mtrans
The Input Matrix Is
0.000000  0.000000   0.000000   0.000000
1.000000  6.000000   11.000000  16.000000
2.000000  12.000000  22.000000  32.000000
3.000000  18.000000  33.000000  48.000000

The Transpose Matrix Is
0.000000  1.000000   2.000000   3.000000
0.000000  6.000000   12.000000  18.000000
0.000000  11.000000  22.000000  33.000000
0.000000  16.000000  32.000000  48.000000
Matrix Transpose - MPI
21
#include <stdio.h>
#include "mpi.h"
#define N 4

int A[N][N];

/* Initialize source matrix */
void fill_matrix()
{
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[i][j] = i * N + j;
}

void print_matrix()
{
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++)
      printf("%d ", A[i][j]);
    printf("\n");
  }
}
Matrix Transpose - MPI
22
main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if (r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    /* Create a custom MPI datatype covering the whole matrix */
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }
Matrix Transpose - MPI
23
  else if (r == 1) {
    /* Vector datatype: N blocks of 1 int each, strided by N ints (one column);
       wrapping it in MPI_Type_hvector allows an on-the-fly transpose on receive */
    MPI_Type_vector(N, 1, N, MPI_INT, &typ);
    MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
    printf("\n Transposed:\n");
    print_matrix();
  }

  MPI_Finalize();
}
Matrix Transpose – MPI (DEMO)
24
[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans

Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
25
Linear Systems
3333232131
2323222121
1313212111
bxaxaxa
bxaxaxa
bxaxaxa
3333232131
2323222121
1313212111
bxaxaxa
bxaxaxa
bxaxaxa
3
2
1
3
2
1
333231
232221
131211
b
b
b
x
x
x
aaa
aaa
aaa
3
2
1
3
2
1
333231
232221
131211
b
b
b
x
x
x
aaa
aaa
aaa
Solve Ax=b, where A is an nn matrix andb is an n1 column vector
www.cs.princeton.edu/courses/archive/fall07/cos323/
26
Gauss-Jordan Elimination
• Fundamental operations:
  1. Replace one equation with a linear combination of other equations
  2. Interchange two equations
  3. Re-label two variables
• Combine to reduce to a trivial system
• The simplest variant uses only operation 1, but better stability is obtained by adding
  – operation 2, or
  – operations 2 and 3
www.cs.princeton.edu/courses/archive/fall07/cos323/
27
Gauss-Jordan Elimination
• Solve:
  2x_1 + 3x_2 = 7
  4x_1 + 5x_2 = 13
• Can be represented as the augmented matrix:
  \left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right)
• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS:
  \left(\begin{array}{cc|c} 1 & 0 & ? \\ 0 & 1 & ? \end{array}\right)
www.cs.princeton.edu/courses/archive/fall07/cos323/
28
Gauss-Jordan Elimination
• Basic operation 1: replace any row by a linear combination with any other row:
  replace row1 with 1/2 * row1 + 0 * row2
  \left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right) \rightarrow \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 4 & 5 & 13 \end{array}\right)
• Replace row2 with row2 – 4 * row1:
  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & -1 & -1 \end{array}\right)
• Negate row2:
  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right)
www.cs.princeton.edu/courses/archive/fall07/cos323/
29
Row1 = (Row1)/2
Row2=Row2-(4*Row1)
Row2 = (-1)*Row2
Gauss-Jordan Elimination
• Replace row1 with row1 – 3/2 * row2:
  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right) \rightarrow \left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right)
• Solution:
  x_1 = 2, x_2 = 1
www.cs.princeton.edu/courses/archive/fall07/cos323/
30
Row1 = Row1 – (3/2)* Row2
Pivoting
• Consider this system:
  \left(\begin{array}{cc|c} 0 & 1 & 2 \\ 2 & 3 & 8 \end{array}\right)
• Immediately run into a problem: the algorithm wants us to divide by zero!
• More subtle version:
  \left(\begin{array}{cc|c} 0.0001 & 1 & 2 \\ 2 & 3 & 8 \end{array}\right)
• The pivot (pivot element) is the element of a matrix which is selected first by an algorithm to do computation
• The pivot entry is usually required to be at least distinct from zero, and often distant from it
• Select the largest element in the matrix and swap columns and rows to bring this element to the 'right' position: full (complete) pivoting
www.cs.princeton.edu/courses/archive/fall07/cos323/
31
Pivoting
• Consider this system:
  \left(\begin{array}{cc|c} 0 & 1 & 1 \\ 3 & 2 & 8 \end{array}\right)
• Pivoting:
  – Swap rows 1 and 2:
    \left(\begin{array}{cc|c} 3 & 2 & 8 \\ 0 & 1 & 1 \end{array}\right)
  – And continue to solve as shown before:
    \left(\begin{array}{cc|c} 1 & 2/3 & 8/3 \\ 0 & 1 & 1 \end{array}\right) \rightarrow \left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right)
  x_1 = 2, x_2 = 1
www.cs.princeton.edu/courses/archive/fall07/cos323/
32
Pivoting: Example
• Division by small numbers → round-off error in computer arithmetic
• Consider the following system:
  0.0001 x_1 + x_2 = 1.000
  x_1 + x_2 = 2.000
  \left(\begin{array}{cc|c} 0.0001 & 1 & 1 \\ 1 & 1 & 2 \end{array}\right)
• Exact solution: x_1 = 1.0001 and x_2 = 0.9999
• Say we round off after 3 digits after the decimal point
• Multiply the first equation by 10^4 and subtract it from the second equation:
  (1 - 1)x_1 + (1 - 10^4)x_2 = 2 - 10^4
• But, in finite precision with only 3 digits:
  – 1 - 10^4 = -0.9999E+4 ~ -0.999E+4
  – 2 - 10^4 = -0.9998E+4 ~ -0.999E+4
• Therefore, x_2 = 1 and x_1 = 0 (from the first equation)
• Very far from the real solution!
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
33
Partial Pivoting
• Partial pivoting doesn‘t look for largest element in matrix, but just for the largest element in the ‚current‘ column
• Swap rows to bring the corresponding row to ‚right‘ position
• Partial pivoting is generally sufficient to adequately reduce round-off error.
• Complete pivoting is usually not necessary to ensure numerical stability
• Due to the additional computations it introduces, it may not always be the most appropriate pivoting strategy
34
http://www.amath.washington.edu/~bloss/amath352_lectures/
Partial Pivoting
• One can just swap rows:
  x_1 + x_2 = 2.000
  0.0001 x_1 + x_2 = 1.000
• Multiplying the first equation by 0.0001 and subtracting it from the second equation gives:
  (1 - 0.0001)x_2 = 1 - 0.0002
  0.9999 x_2 = 0.9998 => x_2 = 1 (after rounding)
  and then x_1 = 1
• The final solution is much closer to the real solution.
• Partial Pivoting
  – For numerical stability, one doesn't go in order, but picks the next row in rows i to n that has the largest element in column i
  – This row is swapped with row i (along with the elements of the right-hand side) before the subtractions
    • the swap is not done in memory but rather one keeps an indirection array
• Total Pivoting
  – Look for the greatest element ANYWHERE in the matrix
  – Swap columns
  – Swap rows
• Numerical stability is really a difficult field
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
35
Partial Pivoting
36
http://www.amath.washington.edu/~bloss/amath352_lectures/
Special Cases

• Common special case: tri-diagonal systems
  – Only the main diagonal, plus one diagonal above and one below
  – Solve using: Gauss-Jordan

\begin{pmatrix} a_{11} & a_{12} & 0 & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ 0 & a_{32} & a_{33} & a_{34} \\ 0 & 0 & a_{43} & a_{44} \end{pmatrix} x = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}

• Lower triangular systems (L)
  – Solve using: forward substitution

\begin{pmatrix} a_{11} & 0 & 0 & 0 \\ a_{21} & a_{22} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} x = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}

x_1 = b_1 / a_{11}
x_2 = (b_2 - a_{21} x_1) / a_{22}
x_3 = (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33}
...

• Upper triangular systems (U)
  – Solve using: backward substitution

\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ 0 & a_{22} & a_{23} & a_{24} & a_{25} \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & a_{55} \end{pmatrix} x = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}

x_5 = b_5 / a_{55}
x_4 = (b_4 - a_{45} x_5) / a_{44}
...
www.cs.princeton.edu/courses/archive/fall07/cos323/
37
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
38
Solving Linear Systems of Eq.
• Method for solving linear systems
  – The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]
• Gaussian Elimination is perhaps the most well-known method
  – based on the fact that the solution of a linear system is invariant under scaling and under row additions
    • One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
    • One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
  – Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:
[diagram: A x = b with A reduced to upper triangular form; equation n-i has i unknowns]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
39
Gaussian Elimination

\begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 2 \\ 1 & 2 & -1 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix}

Subtract row 1 from rows 2 and 3:

\begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 1 & -2 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix}

Multiply row 3 by 3 and add row 2:

\begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 0 & -5 \end{pmatrix} x = \begin{pmatrix} 0 \\ 4 \\ 10 \end{pmatrix}

Solving equations in reverse order (backsolving):

-5x_3 = 10           => x_3 = -2
-3x_2 + x_3 = 4      => x_2 = -2
x_1 + x_2 + x_3 = 0  => x_1 = 4
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
40
Gaussian Elimination
• The algorithm goes through the matrix from the top-left corner to the bottom-right corner
• The ith step eliminates non-zero sub-diagonal elements in column i, subtracting the ith row scaled by a_ji/a_ii from row j, for j = i+1, ..., n.

[diagram: at step i, entries up and to the left are already computed, row i is the pivot row, the sub-diagonal part of column i is to be zeroed, and the lower-right block is yet to be updated]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
41
Sequential Gaussian Elimination
Simple sequential algorithm:

// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Several "tricks" that do not change the spirit of the algorithm but make implementation easier and/or more efficient
  – The right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
  – Compute the A(j,i)/A(i,i) term outside of the inner loop
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
42
Parallel Gaussian Elimination?
• Assume that we have one processor per matrix element:
  – Reduction: to find the max a_ji (the pivot)
  – Broadcast: max a_ji is needed to compute the scaling factor
  – Compute: independent computation of the scaling factor
  – Broadcasts: every update needs the scaling factor and the element from the pivot row
  – Compute: independent computations
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
43
LU Factorization
• Gaussian Elimination is simple, but:
  – What if we have to solve many Ax = b systems for different values of b?
    • This happens a LOT in real applications
• Another method is the "LU Factorization" (LU Decomposition)
• Ax = b
• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix   O(n^3)
• Then Ax = b is written L U x = b
  – Solve L y = b   O(n^2)
  – Solve U x = y   O(n^2)
• Triangular system solves are easy: in L y = b equation i has i unknowns, and in U x = y equation n-i has i unknowns
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
44
LU Factorization: Principle
• It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients.
• Magically, A = L x U!
• Should be done with pivoting as well

Gaussian elimination, saving the scaling factors (shown in brackets):

\begin{pmatrix} 1 & 2 & -1 \\ 4 & 3 & 1 \\ 2 & 2 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ [4] & -5 & 5 \\ 2 & 2 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ [4] & -5 & 5 \\ [2] & -2 & 5 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ [4] & -5 & 5 \\ [2] & [2/5] & 3 \end{pmatrix}

L = \begin{pmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 2 & 2/5 & 1 \end{pmatrix}
\qquad
U = \begin{pmatrix} 1 & 2 & -1 \\ 0 & -5 & 5 \\ 0 & 0 & 3 \end{pmatrix}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
45
LU Factorization
• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

[diagram: column k below the diagonal stores the scaling factors]

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a_ik ← -a_ik / a_kk
    for j = k+1 to n-1
      // Task T_kj: update of column j
      for i = k+1 to n-1
        a_ij ← a_ij + a_ik * a_kj
  }
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
LU Factorization
• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a_ik ← -a_ik / a_kk
    for j = k+1 to n-1
      // Task T_kj: update of column j
      for i = k+1 to n-1
        a_ij ← a_ij + a_ik * a_kj
  }
}

[diagram: the prepared column k updates each element a_ij with i, j > k]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
47
Parallel LU on a ring
• Since the algorithm operates by columns from left to right, we should distribute columns to processors
• Principle of the algorithm– At each step, the processor that owns column k does the “prepare” task
and then broadcasts the bottom part of column k to all others• Annoying if the matrix is stored in row-major fashion• Remember that one is free to store the matrix in anyway one wants, as long
as it’s coherent and that the right output is generated
– After the broadcast, the other processors can then update their data.
• Assume there is a function alloc(k) that returns the rank of the processor that owns column k– Basically so that we don’t clutter our program with too many global-to-
local index translations
• In fact, we will first write everything in terms of global indices, as to avoid all annoying index arithmetic
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
48
LU-broadcast algorithm
LU-broadcast(A,n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← a_ik ← -a_ik / a_kk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          a_ij ← a_ij + buffer[i-k-1] * a_kj
  }
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
49
Dealing with local indices
• Assume that p divides n
• Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1
• After step k, only columns with indices greater than k will be used
• Simple idea: use a local index, l, that everyone initializes to 0
• At step k, processor alloc(k) increases its local index so that next time it will point to its next local column
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
50
LU-broadcast algorithm
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
51
Bad load balancing
[diagram: block distribution of columns over P1..P4; the left processors' columns are already done, and only the processor owning the current column is working on it while the rest sit idle]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
52
Good Load Balancing?
[diagram: completed columns are spread across all processors, so every processor still owns columns being worked on]
Cyclic distribution
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
53
Load-balanced program
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
54
Performance Analysis
• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms:
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w' + O(1)
  – column updates: (n^3/3p) w + O(n^2)
• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice
• How can we improve this algorithm?
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
55
Pipelining on the Ring
• So far, in the algorithm we've used a simple broadcast
• Nothing was specific to being on a ring of processors, and it's portable
  – in fact you could just write raw MPI that looks like our pseudo-code and have a very limited LU factorization (inefficient for small n, working only for some numbers of processors)
• But it's not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.
• It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
56
Previous program
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
57
LU-pipeline algorithm
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
    send(buffer, n-k-1)
  } else {
    recv(buffer, n-k-1)
    if (q ≠ (k-1) mod p)
      send(buffer, n-k-1)
  }
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
58
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
59
Summary : Material for the Test
• Matrix Transpose: Slides 17-23
• Gauss-Jordan: Slides 26-30
• Pivoting: Slides 31-36
• Special Cases (forward & backward substitution): Slide 37
• LU Decomposition: Slides 44-58
60