
Parallel Linear Algebra

Our goals: Fast and efficient parallel algorithms for the matrix-vector product, the matrix-matrix product, solving systems of linear equations, applying finite difference methods, and computing the fast Fourier Transform.

The matrix-vector product is the basis of most of our algorithms.


Decomposing a matrix

How to distribute an m × n matrix A to p processes?

Rowwise decomposition: each process is responsible for m/p contiguous rows.
Columnwise decomposition: each process is responsible for n/p contiguous columns.
Checkerboard decomposition: assume that k divides m and that l divides n.
- Assume moreover that k · l = p.
- Imagine that the processes form a k × l mesh.
- Process (i, j) obtains the submatrix of A consisting of the i-th row interval of length m/k and the j-th column interval of length n/l.
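To make the bookkeeping concrete, here is a small hypothetical helper (not part of the slides) that computes the block owned by mesh process (i, j), assuming 0-based coordinates, k divides m, l divides n and k · l = p:

```c
/* Hypothetical helper: block of an m x n matrix owned by mesh process
   (i, j) under the checkerboard decomposition (0-based coordinates).
   Assumes k divides m, l divides n and k * l = p. */
typedef struct { int row0, rows, col0, cols; } Block;

Block checkerboard_block(int m, int n, int k, int l, int i, int j)
{
    Block b;
    b.rows = m / k;  b.row0 = i * b.rows;   /* i-th row interval, length m/k    */
    b.cols = n / l;  b.col0 = j * b.cols;   /* j-th column interval, length n/l */
    return b;
}
```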


The Matrix-Vector Product

Our goal: Compute y = A · x for an m × n matrix A and a vector x with n components.

Assumptions:
- We assume that the matrix A has been distributed to the various processes.
- Process 1 knows the vector x and has to determine the vector y.

The conventional sequential algorithm determines y by setting

y_i = ∑_{j=1}^{n} A[i, j] · x_j.

- To compute y_i we perform n multiplications and n − 1 additions.
- Overall, m · n multiplications and m · (n − 1) additions suffice.


The Rowwise Decomposition

Replicate x: broadcast x to all processes in time O(n · log2 p).
Each process determines its m/p vector-vector products in time O(m · n/p).
Process 1 performs a Gather operation in time O(m): p − 1 messages of length m/p are involved.
Performance analysis:
- Communication time is proportional to n · log2 p + m, and overall time Θ(m · n/p + n · log2 p + m) is sufficient.
- Efficiency is Θ(m · n / (m · n + p · (n · log2 p + m))).
- Constant efficiency follows if m · n = Ω(p · (n · log2 p + m)) = Ω(p · log2 p · n + m · p).
- Hence we get constant efficiency for m = Ω(p · log2 p) and n = Ω(p).
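A hedged MPI sketch of this scheme (not from the slides): it assumes p divides m, that each rank already holds its m/p rows of A row-major in A_local, and that x is an n-element buffer on every rank whose contents matter only on rank 0 before the broadcast.

```c
/* Sketch: rowwise parallel y = A*x.  y (length m) is needed on rank 0
   only and may be NULL on the other ranks. */
#include <mpi.h>
#include <stdlib.h>

void matvec_rowwise(int m, int n, const double *A_local,
                    double *x, double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = m / p;                        /* rows per process */

    /* Replicate x: broadcast to all processes, O(n log p). */
    MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

    /* m/p local inner products, O(m*n/p). */
    double *y_local = malloc(rows * sizeof *y_local);
    for (int i = 0; i < rows; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A_local[i * n + j] * x[j];
        y_local[i] = s;
    }

    /* Gather the p result blocks of length m/p at rank 0, O(m). */
    MPI_Gather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
    free(y_local);
}
```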


The Columnwise Decomposition

Apply MPI_Scatter to distribute the blocks of x to “their” processes. Since this involves p − 1 messages of length n/p, time O(n) is sufficient.
Each process i computes the matrix-vector product y^i = A^i · x^i for its block A^i of columns. Time O(m · n/p) is sufficient.
Process 1 applies a Reduce operation to sum up y^1, y^2, . . . , y^p in time O(m · log2 p).
Performance analysis:
- Run time is bounded by O(m · n/p + n + m · log2 p).
- Here we have constant efficiency if computing time dominates communication time: require m = Ω(p) and n = Ω(p · log2 p).
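Again a hedged MPI sketch (not from the slides), assuming p divides n and that each rank already holds its m × (n/p) column block of A row-major in A_cols; x and y only need to be valid on rank 0.

```c
/* Sketch: columnwise parallel y = A*x via Scatter + local product + Reduce. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colwise(int m, int n, const double *A_cols,
                    const double *x, double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int cols = n / p;                        /* columns per process */

    /* Scatter the blocks of x to "their" processes, O(n). */
    double *x_local = malloc(cols * sizeof *x_local);
    MPI_Scatter((void *)x, cols, MPI_DOUBLE,
                x_local, cols, MPI_DOUBLE, 0, comm);

    /* Partial product y^i = A^i * x^i, O(m*n/p). */
    double *y_part = calloc(m, sizeof *y_part);
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < m; i++)
            y_part[i] += A_cols[i * cols + j] * x_local[j];

    /* Sum the p partial vectors of length m at rank 0, O(m log p). */
    MPI_Reduce(y_part, y, m, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(x_local);
    free(y_part);
}
```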


Checkerboard Decomposition

Process 1 applies a Scatter operation addressed to the l processes of row 1 of the process mesh. Time O(l · n/l) = O(n).
Then each process of row 1 broadcasts its block of x to the k processes in its column: time O(n/l · log2 k) suffices.
All processes compute their matrix-vector products in time O(m · n/p).
The processes in column 1 of the process mesh apply a Reduce operation for their row to sum up the l vectors of length m/k: time O(m/k · log2 l) is sufficient.
Process 1 gathers the k − 1 vectors of length m/k in time O(m).
Performance analysis:
- The total running time is bounded by O(m · n/p + n + n/l · log2 k + m/k · log2 l + m).
- The total communication time is bounded by O(n + m), provided log2 k ≤ l and log2 l ≤ k.
- We obtain constant efficiency if m = Ω(p) and n = Ω(p).


Summary

The checkerboard decomposition has the best performance if m ≈ n. Why?
All three decompositions have the same computation time. Assuming m = n,
- the communication time of the rowwise decomposition is dominated by broadcasting the vector x: time O(n log2 p),
- whereas the final Reduce dominates for the columnwise decomposition: time O(m log2 p).
- The checkerboard decomposition cuts down on the message length!


Matrix-Matrix Product

Our goal is to compute the n × n product matrix C = A · B for n × n matrices A and B.

To compute C[i, j] = ∑_{k=1}^{n} A[i, k] · B[k, j] sequentially, n multiplications and n − 1 additions are required. Since C has n^2 entries, we obtain running time Θ(n^3).
We discuss four approaches:
- The first algorithm uses the rowwise decomposition.
- The algorithm of Fox and its improvement, the algorithm of Cannon, use the checkerboard decomposition.
- The DNS algorithm assumes a variant of the checkerboard decomposition.
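For reference, the Θ(n^3) sequential baseline that all four parallel algorithms start from, as a minimal sketch:

```c
/* Sequential baseline: C = A * B for n x n matrices.
   n multiplications and n - 1 additions per entry, Theta(n^3) overall. */
void matmul_seq(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}
```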


The Rowwise Decomposition

Process i receives the submatrices A_i of A and B_i of B, corresponding to the i-th row interval of length n/p.
Further subdivide A_i, B_i into the p square submatrices A_{i,j}, B_{i,j} of size n/p × n/p.
Define C_{i,j} analogously and observe that C_{i,j} = ∑_{k=1}^{p} A_{i,k} · B_{k,j} holds.
The computation:
- In phase 1 process i computes all products A_{i,i} · B_{i,j} for j = 1, . . . , p in time O(p · (n/p) · (n/p) · (n/p)) = O(n^3/p^2), then sends B_i to process i + 1 and receives B_{i−1} from process i − 1 in time O(n^2/p).
- In phase 2 process i computes all products A_{i,i−1} · B_{i−1,j}, sends B_{i−1} to process i + 1 and receives B_{i−2} from i − 1, and so on.
Performance analysis:
- All in all there are p phases. Hence the computing time is bounded by O(n^3/p) and the communication time is bounded by O(n^2).
- The compute/communicate ratio (n^3/p)/n^2 = n/p is small!


The Algorithm of Fox

We again determine the product matrix according to C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}, but now
- the processes are arranged in a √p × √p mesh of processes,
- process (i, j) knows the n/√p × n/√p submatrices A_{i,j} and B_{i,j}.
We have √p phases. In phase k we want process (i, j) to compute A_{i,i+k−1} · B_{i+k−1,j}:
- process (i, i + k − 1) broadcasts A_{i,i+k−1} to all processes in row i,
- process (i, j) computes A_{i,i+k−1} · B_{i+k−1,j},
- receives B_{i+k,j} from (i + 1, j) and sends B_{i+k−1,j} to (i − 1, j).
Performance analysis:
- Per phase: computing time O((n/√p)^3) and communication time O(n^2/p · log p).
- We have √p phases: computation time O(n^3/p), communication time O(n^2/√p · log p). The compute/communicate ratio n/(√p · log2 p) increases.


The Algorithm of Cannon

The setup is as for the algorithm of Fox. In particular, process (i, j) has to determine C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}.
At the very beginning, redistribute the matrices such that process (i, j) holds A_{i,i+j} and B_{i+j,j}.
We again have √p phases. In phase k we want process (i, j) to compute A_{i,i+j+k−1} · B_{i+j+k−1,j}:
- process (i, j) computes A_{i,i+j+k−1} · B_{i+j+k−1,j},
- sends A_{i,i+j+k−1} to (i, j − 1) and B_{i+j+k−1,j} to (i − 1, j), and
- receives A_{i,i+j+k} from (i, j + 1) and B_{i+j+k,j} from (i + 1, j).
Performance analysis:
- Per phase: computation time O((n/√p)^3), communication time O((n/√p)^2).
- Overall, computation time O(n^3/p), communication time O(n^2/√p), and the compute/communicate ratio n/√p increases again.


How did we save Communication?

- Rowwise decomposition: in each of the p phases row blocks are exchanged. All in all O(p · n^2/p) = O(n^2) communication.
- The algorithm of Fox: a broadcast in each of the √p phases with communication time O(n^2/p · log p). All in all communication time O(n^2/√p · log p): merging point-to-point messages into broadcasts is profitable!
- The algorithm of Cannon: after initially rearranging the submatrices, the broadcasts in the algorithm of Fox are replaced by point-to-point messages. All in all communication time O(√p · n^2/p) = O(n^2/√p).


The DNS Algorithm

p = n^3 processes are arranged in an n × n × n mesh of processes. Process (i, j, 1) stores A[i, j], B[i, j] and has to determine C[i, j].

We move A[i, k] to process (i, ∗, k): (i, k, 1) sends A[i, k] to (i, k, k), which broadcasts A[i, k] to all processes (i, ∗, k).
Next we move B[k, j] to process (∗, j, k): (k, j, 1) sends B[k, j] to (k, j, k), which broadcasts B[k, j] to all processes (∗, j, k).
Process (i, j, k) computes the product A[i, k] · B[k, j].
Process (i, j, 1) computes ∑_{k=1}^{n} A[i, k] · B[k, j] with MPI_Reduce.
Performance analysis:
- The replication step takes time O(log2 n), since the broadcast dominates. The multiplication step runs in constant time and the Reduce operation runs in logarithmic time.
- Time O(log2 n) suffices. Its efficiency Θ(1/log2 n) is too small.
- We scale down.


Scaling down the number of processors

We work with p processes. Let q = p^{1/3} and imagine that the p processes are arranged in a q × q × q mesh.
Input distribution: process (i, j, 1) receives the n/q × n/q submatrices A_{i,j} and B_{i,j}: the matrices A_{i,j} and B_{i,j} play the role of the entries A[i, j] and B[i, j].
Mimic the algorithm for n^3 processes.
Performance analysis:
- The total computing time is O(n^3/q^3) = O(n^3/p), since n/q × n/q matrices have to be multiplied.
- During replication and summing, n/q × n/q matrices are involved and hence the communication time is bounded by O(n^2/q^2 · log p).
- The compute/communicate ratio is n/(q · log2 p).
Best performance so far. p should be sufficiently large.


Summary

The checkerboard decomposition is again better than the rowwise decomposition.
Cannon’s algorithm replaces a broadcast by a point-to-point message and is therefore faster than the algorithm of Fox.
The DNS algorithm partitions the matrices A and B among q^2 of the q^3 processes.
- Thus each “input process” gets a relatively large chunk.
- However there are only two (instead of √p) communication steps: namely when replicating and when summing.
- Observe that DNS is better than Cannon only if p is sufficiently large.


Solving Linear Systems

We are given a matrix A and a right-hand side b and would like to solve the linear system A · x = b.

We begin with the easy case of lower triangular matrices A and describe back substitution.
Then we discuss efficient parallelizations of Gaussian elimination and continue with iterative methods: Jacobi relaxation, the Gauss-Seidel algorithm, the conjugate gradient approach and the Newton method.
Finally we consider the parallelization of the finite difference method.


Backsubstitution

We have to solve the system

A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i

for i = 1, . . . , n.

A sequential solution:
- First determine x_1 from the first equation A[1, 1] · x_1 = b_1.
- If we already know x_1, . . . , x_{i−1}, then determine x_i from the i-th equation.
- Since an evaluation of the i-th equation requires time O(i), the sequential solution runs in time O(n^2) (see the sketch after this list).
We consider two input distributions:
- The off-diagonal decomposition of matrix A: process 1 knows the main diagonal and process i (i ≥ 2) knows the (i − 1)-st off-diagonal A[i, 1], A[i + 1, 2], . . . , A[n, n − i + 1].
- And the rowwise decomposition.
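The O(n^2) sequential substitution loop, as a minimal sketch (assumes nonzero diagonal entries):

```c
/* Solve the lower triangular system A x = b by substitution.
   The i-th equation costs O(i) operations, O(n^2) in total. */
void substitute(int n, double A[n][n], double b[n], double x[n])
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= A[i][j] * x[j];    /* subtract the already known terms */
        x[i] = s / A[i][i];         /* solve the i-th equation for x_i  */
    }
}
```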


The Off-Diagonal Decomposition I

We use the linear array as the communication pattern.

Process 1 successively determines x_1, . . . , x_n. Once computed, x_i is forwarded through the linear array.

How do we solve the i-th equation A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i?
- Process i computes A[i, 1] · x_1 immediately after receiving x_1 from process i − 1. Then i sends A[i, 1] · x_1 to process i − 1 and x_1 to process i + 1.
- If process i − 1 receives x_2 from process i − 2, it computes the product A[i, 2] · x_2, sends the sum A[i, 1] · x_1 + A[i, 2] · x_2 to process i − 2 and forwards x_2 to process i.
- We communicate according to the principle of “just in time production”.


The Off-Diagonal Decomposition II

[Figure: pipelined backsubstitution on a linear array, with processors on one axis and time on the other. The unknowns x_1, x_2, x_3, x_4 travel along the array while the partial sums built from A[2,1], A[3,1], A[3,2], A[4,1], A[4,2], A[4,3] flow back towards process 1.]


The Off-Diagonal Decomposition III

Backsubstitution with p processes.

Assign the off-diagonals (A[j, 1], . . . , A[n, n − j + 1]) for j ∈ {(i − 1) · n/p + 1, . . . , i · n/p} to process i.
The computing time: we have p phases with compute time O((n/p)^2) per phase.
All in all the compute time is bounded by O(n^2/p).
Communication is O(n/p) per phase and hence all in all O(n) communication.
The running time is bounded by O(n^2/p + n).
We achieve constant efficiency whenever n = Ω(p).


The Rowwise Decomposition

This time process i determines x_i. Once x_i is determined, process i broadcasts x_i to processes i + 1, . . . , n.

For p processes:
Each process is responsible for n/p variables. Time O(n) per variable is sufficient.
⇒ The compute time is bounded by O(n^2/p).
There is one broadcast per unknown and the communication time is bounded by O(n · log2 p).
We achieve constant efficiency whenever n = Ω(p · log2 p).


Gaussian Elimination with Partial Pivoting

Include the right-hand side b as the last column of the matrix A.

If we have already eliminated the nonzeroes below the diagonal in columns 1, . . . , i − 1, then
- use the largest entry A[j, i] for j = i, . . . , n as pivot,
- swap rows i and j and set row_k = row_k − (A[k, i]/A[i, i]) · row_i for k > i.

Performance analysis for the sequential algorithm:
When dealing with row i:
- Determine the largest entry A[j, i] in column i in time O(n).
- The elimination step for each of the n − i remaining rows requires O(n − i + 1) arithmetic operations.
- All in all, O(n + (n − i + 1)^2) = O(n^2) operations suffice.
The total number of arithmetic operations is bounded by O(n^3).
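A minimal sequential sketch of this procedure on the augmented matrix (n rows, n + 1 columns, the right-hand side b stored as the last column):

```c
/* Gaussian elimination with partial pivoting on the augmented matrix,
   O(n^3) arithmetic operations overall. */
#include <math.h>

void gauss_eliminate(int n, double A[n][n + 1])
{
    for (int i = 0; i < n; i++) {
        /* Pivot search: largest |A[j][i]| in column i for j >= i, O(n). */
        int piv = i;
        for (int j = i + 1; j < n; j++)
            if (fabs(A[j][i]) > fabs(A[piv][i]))
                piv = j;

        /* Swap rows i and piv. */
        for (int c = i; c <= n; c++) {
            double tmp = A[i][c];
            A[i][c] = A[piv][c];
            A[piv][c] = tmp;
        }

        /* row_k = row_k - (A[k][i]/A[i][i]) * row_i for k > i. */
        for (int k = i + 1; k < n; k++) {
            double f = A[k][i] / A[i][i];
            for (int c = i; c <= n; c++)
                A[k][c] -= f * A[i][c];
        }
    }
}
```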


A parallelization of Gaussian Elimination I

We work with p processes and the rowwise decomposition: each process receives an “interval” of n/p rows.
We maintain the sequential structure of pivoting, but parallelize each pivoting step instead.
Assume that we have reached row i.
- To utilize the rowwise decomposition we look for the largest entry in row i (and not in column i).
- We have to eliminate all non-zeroes in row i: the process holding row i has to
  - determine the largest entry A[i, k] in row i,
  - compute the vector m_i of multiples for the elimination step,
  - and send m_i to the remaining processes.


A parallelization of Gaussian Elimination II

Avoid broadcasting (m_i, k). When dealing with row i − 1:
- After computing m_{i−1}, the process j holding row i interrupts its elimination work for row i − 1,
- immediately recomputes row i and determines (m_i, k) instead,
- sends (m_i, k) to process j + 1, and
- then resumes its elimination work for row i − 1.
We cover communication by computation:
- The expensive broadcast of (m_i, k) is replaced by sending (m_i, k) through the linear array of processes.
- Whenever a process receives (m_i, k), it immediately forwards (m_i, k) to its neighbor process.
Performance analysis:
- There is no delay when eliminating row i if the compute time Θ((n/p) · n) for one pivoting step dominates the maximal communication delay p · n.
The overall compute time is bounded by O(n · (n/p) · n) = O(n^3/p).
There is no delay due to communication, provided n = Ω(p^2).


Iterative Methods

In an iterative method an approximate solution of a linear system A · x = b is successively improved.
One starts with an initial “guess” x(0) and replaces x(t) by a presumably better solution x(t + 1).

Assume that the computation of x(t + 1) is based on the matrix-vector product.

We obtain a fast parallel algorithm and can exploit sparse linear systems.
We describe:
- the Jacobi relaxation and its variants,
- the Newton method to approximately compute the inverse A^{−1}.


Jacobi Relaxation

Assume that A · x* = b. If A has a nonzero diagonal, then

x*_i = (1/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x*_j ).

The Jacobi iteration: if x_i(t) is an approximate solution, set

x_i(t + 1) = (1/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

Each Jacobi iteration corresponds to a matrix-vector product. Hence one iteration runs in time O(n^2/p) and we obtain a fast approximation whenever few iterations suffice.
When does the Jacobi iteration converge to the unique solution?


Jacobi Relaxation: Convergence

Let D be the diagonal matrix with D[i, i] = A[i, i]. Set M = D^{−1} · (D − A).
- Another view of the Jacobi iteration: if A is invertible and if x* is the unique solution of A · x = b, then

  x(t + 1) − x* = M · (x(t) − x*).

- Consequently x(t) − x* = M^t · (x(0) − x*) follows for all t.
- If lim_{t→∞} M^t = 0, then x(t) converges to x*.
The Jacobi relaxation converges for row diagonally dominant matrices A, i.e., if

|A[i, i]| > ∑_{j ≠ i} |A[i, j]|

holds for all i.
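The stated error recurrence follows in one line from the definition of the iteration; a short derivation (assuming A is invertible and all diagonal entries are nonzero):

```latex
% The Jacobi step reads D x(t+1) = b - (A - D) x(t); insert b = A x^* = D x^* + (A - D) x^*.
\begin{align*}
D\,\bigl(x(t+1) - x^*\bigr) &= -(A - D)\,\bigl(x(t) - x^*\bigr),\\
x(t+1) - x^* &= D^{-1}(D - A)\,\bigl(x(t) - x^*\bigr) = M \cdot \bigl(x(t) - x^*\bigr).
\end{align*}
```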


Two Extensions

In many practical applications the Jacobi overrelaxation converges faster.
- For a suitable coefficient γ:

  x_i(t + 1) = (1 − γ) · x_i(t) + (γ/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

- The Jacobi relaxation is a special case: set γ = 1.
The Gauss-Seidel algorithm incorporates already recomputed values of x_j (i.e., it replaces x_j(t) by x_j(t + 1)).
- An example is

  x_i(t + 1) = (1/A[i, i]) · ( b_i − ∑_{j < i} A[i, j] · x_j(t + 1) − ∑_{j > i} A[i, j] · x_j(t) ).

- The Gauss-Seidel method does not look parallelizable!?


Inversion by Newton Iteration

Assume that A is invertible and that X_t is an approximate inverse of A. The Newton iteration is

X_{t+1} = 2 · X_t − X_t · A · X_t.

What is the intuition behind this approach?
- Consider the residual matrix R_t = I − A · X_t, which measures the “distance” between A^{−1} and X_t. Then

  R_{t+1} = I − A · X_{t+1} = I − A · (2 · X_t − X_t · A · X_t) = (I − A · X_t)^2 = R_t^2.

- R_t converges rapidly towards the 0-matrix whenever X_0 is a good approximation of A^{−1}.
Since the Newton iteration is based on matrix-matrix products, each iteration is easily parallelized.


The Finite Difference Method

Finite differences can be used to approximate derivatives of a function f, since

f′(x) = lim_{h→0} (f(x + h) − f(x))/h = lim_{h→0} (f(x) − f(x − h))/h,
f″(x) = lim_{h→0} (f′(x + h) − f′(x))/h = lim_{h→0} (f(x + h) − 2f(x) + f(x − h))/h^2.

The finite difference method is used to solve differential equations:
- Derivatives are approximated by finite differences,
- and differential equations are modeled by linear systems of equations.
The usually sparse systems are mostly solved with iterative methods.


An Example: The Poisson Equation

Find a function u : [0, 1]^2 → R which satisfies the Poisson equation u_xx + u_yy = H and which has prescribed values on the boundary of the unit square [0, 1]^2.

If u is sufficiently smooth and if h is sufficiently small, then

u_xx(x, y) ≈ (u(x + h, y) − 2u(x, y) + u(x − h, y))/h^2.

Approximate u_yy analogously and we get

H(x, y) ≈ (u(x + h, y) + u(x − h, y) + u(x, y + h) + u(x, y − h) − 4u(x, y))/h^2.

For N sufficiently large, set

h = 1/N,  u_{i,j} = u(i/N, j/N)  and  H_{i,j} = H(i/N, j/N).


The Linear System I

Choose (x, y) as one of the grid points (i/N, j/N) with 0 < i, j < N and we get the linear system

−4u_{i,j} + u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} = H_{i,j}/N^2.

The system is huge: (N − 1)^2 equations in (N − 1)^2 unknowns. (The values of u at the boundary are prescribed.)
The matrix of the system has (N − 1)^4 entries, but it is sparse, since any equation has at most five nonzero coefficients.
To utilize sparsity, we apply iterative methods.


The Linear System II

We process the system beginning with the lower boundary and working upwards:

u_{i+1,j} = 4u_{i,j} − (u_{i−1,j} + u_{i,j+1} + u_{i,j−1}) + H_{i,j}/N^2.

We apply the Jacobi relaxation and get

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j}/N^2.

How do we obtain an efficient parallelization of one iteration?
- We use the checkerboard decomposition of the (N − 1) × (N − 1) grid. The p processes are arranged in a √p × √p mesh.
- The process responsible for determining u_{i+1,j}(t + 1) has to know u_{i−1,j}(t), u_{i,j+1}(t), u_{i,j−1}(t) and u_{i,j}(t).
- Any missing value belongs to the “near-boundary” of a neighbor.


The Linear System: Performance Analysis

The processes communicate their O(N/√p) near-boundary values.
Afterwards they perform an iteration without further communication: the computation time is bounded by O(N^2/p), since the update time for any grid point is constant and each process has to update O(N^2/p) grid points.
We have constant efficiency whenever N = Ω(√p).
One can show that the matrix of the linear system is symmetric and positive definite: the Jacobi relaxation converges.


Gauss-Seidel Revisited

So far we have used the Jacobi relaxation

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j}/N^2.

Can we use the generally better Gauss-Seidel method instead?
- Use the new update

  u_{i,j}(t + 1) = (u_{i+1,j}(t) + u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t))/4 − H_{i,j}/(4N^2).

- Label grid point (i, j) white iff i + j is even and black otherwise: white grid points are updated with black grid points only.
- Execute one iteration of the Gauss-Seidel algorithm by
  - first updating the black grid points conventionally with the Jacobi relaxation,
  - then applying the Gauss-Seidel algorithm to update the white grid points by using the already updated black grid points.
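One such red-black sweep, as a minimal sequential sketch (grid u of size (N + 1) × (N + 1) with fixed boundary values; the update follows the formula above):

```c
/* One red-black Gauss-Seidel sweep for the discretized Poisson equation.
   Pass 0 updates the black points (i + j odd), pass 1 the white points. */
void red_black_sweep(int N, double u[N + 1][N + 1], double H[N + 1][N + 1])
{
    for (int pass = 0; pass < 2; pass++) {
        int parity = (pass == 0) ? 1 : 0;        /* black first, then white */
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++)
                if ((i + j) % 2 == parity)
                    u[i][j] = (u[i + 1][j] + u[i - 1][j]
                             + u[i][j + 1] + u[i][j - 1]) / 4.0
                             - H[i][j] / (4.0 * N * N);
    }
}
```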
