
Parallel Linear Algebra

Our goals: Fast and efficient parallel algorithms for the matrix-vector product, the matrix-matrix product, solving systems of linear equations, applying finite difference methods, and computing the fast Fourier Transform.

The matrix-vector product is the basis of most of our algorithms.


Decomposing a matrix

How to distribute an m × n matrix A to p processes?

Rowwise decomposition: each process is responsible for m/p contiguous rows.
Columnwise decomposition: each process is responsible for n/p contiguous columns.
Checkerboard decomposition: assume that k divides m and that l divides n.
- Assume moreover that k · l = p.
- Imagine that the processes form a k × l mesh.
- Process (i, j) obtains the submatrix of A consisting of the i-th row interval of length m/k and the j-th column interval of length n/l.
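To make the bookkeeping concrete, here is a small hypothetical helper (not part of the slides) that computes the block owned by mesh process (i, j), assuming 0-based coordinates, k divides m, l divides n and k · l = p:

```c
/* Hypothetical helper: block of an m x n matrix owned by mesh process
   (i, j) under the checkerboard decomposition (0-based coordinates).
   Assumes k divides m, l divides n and k * l = p. */
typedef struct { int row0, rows, col0, cols; } Block;

Block checkerboard_block(int m, int n, int k, int l, int i, int j)
{
    Block b;
    b.rows = m / k;  b.row0 = i * b.rows;   /* i-th row interval, length m/k    */
    b.cols = n / l;  b.col0 = j * b.cols;   /* j-th column interval, length n/l */
    return b;
}
```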


The Matrix-Vector Product

Our goal: Compute y = A · x for an m × n matrix A and a vector x with n components.

Assumptions:
- We assume that the matrix A has been distributed to the various processes.
- Process 1 knows the vector x and has to determine the vector y.

The conventional sequential algorithm determines y by setting

y_i = ∑_{j=1}^{n} A[i, j] · x_j.

- To compute y_i we perform n multiplications and n − 1 additions.
- Overall, m · n multiplications and m · (n − 1) additions suffice.


The Rowwise Decomposition

Replicate x: broadcast x to all processes in time O(n · log2 p).
Each process determines its m/p vector-vector products in time O(m · n/p).
Process 1 performs a Gather operation in time O(m): p − 1 messages of length m/p are involved.
Performance analysis:
- Communication time is proportional to n · log2 p + m, and overall time Θ(m · n/p + n · log2 p + m) is sufficient.
- Efficiency is Θ(m · n / (m · n + p · (n · log2 p + m))).
- Constant efficiency follows if m · n = Ω(p · (n · log2 p + m)) = Ω(p · log2 p · n + m · p).
- Hence we get constant efficiency for m = Ω(p · log2 p) and n = Ω(p).
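A hedged MPI sketch of this scheme (not from the slides): it assumes p divides m, that each rank already holds its m/p rows of A row-major in A_local, and that x is an n-element buffer on every rank whose contents matter only on rank 0 before the broadcast.

```c
/* Sketch: rowwise parallel y = A*x.  y (length m) is needed on rank 0
   only and may be NULL on the other ranks. */
#include <mpi.h>
#include <stdlib.h>

void matvec_rowwise(int m, int n, const double *A_local,
                    double *x, double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = m / p;                        /* rows per process */

    /* Replicate x: broadcast to all processes, O(n log p). */
    MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

    /* m/p local inner products, O(m*n/p). */
    double *y_local = malloc(rows * sizeof *y_local);
    for (int i = 0; i < rows; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A_local[i * n + j] * x[j];
        y_local[i] = s;
    }

    /* Gather the p result blocks of length m/p at rank 0, O(m). */
    MPI_Gather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
    free(y_local);
}
```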


The Columnwise Decomposition

Apply MPI_Scatter to distribute the blocks of x to “their” processes. Since this involves p − 1 messages of length n/p, time O(n) is sufficient.
Each process i computes the matrix-vector product y^i = A^i · x^i for its block A^i of columns. Time O(m · n/p) is sufficient.
Process 1 applies a Reduce operation to sum up y^1, y^2, . . . , y^p in time O(m · log2 p).
Performance analysis:
- Run time is bounded by O(m · n/p + n + m · log2 p).
- Here we have constant efficiency if computing time dominates communication time: require m = Ω(p) and n = Ω(p · log2 p).
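Again a hedged MPI sketch (not from the slides), assuming p divides n and that each rank already holds its m × (n/p) column block of A row-major in A_cols; x and y only need to be valid on rank 0.

```c
/* Sketch: columnwise parallel y = A*x via Scatter + local product + Reduce. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colwise(int m, int n, const double *A_cols,
                    const double *x, double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int cols = n / p;                        /* columns per process */

    /* Scatter the blocks of x to "their" processes, O(n). */
    double *x_local = malloc(cols * sizeof *x_local);
    MPI_Scatter((void *)x, cols, MPI_DOUBLE,
                x_local, cols, MPI_DOUBLE, 0, comm);

    /* Partial product y^i = A^i * x^i, O(m*n/p). */
    double *y_part = calloc(m, sizeof *y_part);
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < m; i++)
            y_part[i] += A_cols[i * cols + j] * x_local[j];

    /* Sum the p partial vectors of length m at rank 0, O(m log p). */
    MPI_Reduce(y_part, y, m, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(x_local);
    free(y_part);
}
```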


Checkerboard Decomposition

Process 1 applies a Scatter operation addressed to the l processes of row 1 of the process mesh. Time O(l · n/l) = O(n).
Then each process of row 1 broadcasts its block of x to the k processes in its column: time O(n/l · log2 k) suffices.
All processes compute their matrix-vector products in time O(m · n/p).
The processes in column 1 of the process mesh apply a Reduce operation for their row to sum up the l vectors of length m/k: time O(m/k · log2 l) is sufficient.
Process 1 gathers the k − 1 vectors of length m/k in time O(m).
Performance analysis:
- The total running time is bounded by O(m · n/p + n + n/l · log2 k + m/k · log2 l + m).
- The total communication time is bounded by O(n + m), provided log2 k ≤ l and log2 l ≤ k.
- We obtain constant efficiency if m = Ω(p) and n = Ω(p).


Summary

The checkerboard decomposition has the best performance if m ≈ n. Why?
All three decompositions have the same computation time. Assuming m = n,
- the communication time of the rowwise decomposition is dominated by broadcasting the vector x: time O(n log2 p),
- whereas the final Reduce dominates for the columnwise decomposition: time O(m log2 p).
- The checkerboard decomposition cuts down on the message length!


Matrix-Matrix Product

Our goal is to compute the n × n product matrix C = A · B for n × n matrices A and B.

To compute C[i, j] = ∑_{k=1}^{n} A[i, k] · B[k, j] sequentially, n multiplications and n − 1 additions are required. Since C has n^2 entries, we obtain running time Θ(n^3).
We discuss four approaches:
- The first algorithm uses the rowwise decomposition.
- The algorithm of Fox and its improvement, the algorithm of Cannon, use the checkerboard decomposition.
- The DNS algorithm assumes a variant of the checkerboard decomposition.
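For reference, the Θ(n^3) sequential baseline that all four parallel algorithms start from, as a minimal sketch:

```c
/* Sequential baseline: C = A * B for n x n matrices.
   n multiplications and n - 1 additions per entry, Theta(n^3) overall. */
void matmul_seq(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}
```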


The Rowwise Decomposition

Process i receives the submatrices A_i of A and B_i of B, corresponding to the i-th row interval of length n/p.
Further subdivide A_i, B_i into the p square submatrices A_{i,j}, B_{i,j} of size n/p × n/p.
Define C_{i,j} analogously and observe that C_{i,j} = ∑_{k=1}^{p} A_{i,k} · B_{k,j} holds.
The computation:
- In phase 1 process i computes all products A_{i,i} · B_{i,j} for j = 1, . . . , p in time O(p · (n/p) · (n/p) · (n/p)) = O(n^3/p^2), then sends B_i to process i + 1 and receives B_{i−1} from process i − 1 in time O(n^2/p).
- In phase 2 process i computes all products A_{i,i−1} · B_{i−1,j}, sends B_{i−1} to process i + 1 and receives B_{i−2} from i − 1, and so on.
Performance analysis:
- All in all there are p phases. Hence the computing time is bounded by O(n^3/p) and the communication time is bounded by O(n^2).
- The compute/communicate ratio (n^3/p)/n^2 = n/p is small!


The Algorithm of Fox

We again determine the product matrix according to C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}, but now
- the processes are arranged in a √p × √p mesh of processes,
- process (i, j) knows the n/√p × n/√p submatrices A_{i,j} and B_{i,j}.
We have √p phases. In phase k we want process (i, j) to compute A_{i,i+k−1} · B_{i+k−1,j}:
- process (i, i + k − 1) broadcasts A_{i,i+k−1} to all processes in row i,
- process (i, j) computes A_{i,i+k−1} · B_{i+k−1,j},
- receives B_{i+k,j} from (i + 1, j) and sends B_{i+k−1,j} to (i − 1, j).
Performance analysis:
- Per phase: computing time O((n/√p)^3) and communication time O(n^2/p · log p).
- We have √p phases: computation time O(n^3/p), communication time O(n^2/√p · log p). The compute/communicate ratio n/(√p · log2 p) increases.


The Algorithm of Cannon

The setup is as for the algorithm of Fox. In particular, process (i, j) has to determine C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}.
At the very beginning, redistribute the matrices such that process (i, j) holds A_{i,i+j} and B_{i+j,j}.
We again have √p phases. In phase k we want process (i, j) to compute A_{i,i+j+k−1} · B_{i+j+k−1,j}:
- process (i, j) computes A_{i,i+j+k−1} · B_{i+j+k−1,j},
- sends A_{i,i+j+k−1} to (i, j − 1) and B_{i+j+k−1,j} to (i − 1, j), and
- receives A_{i,i+j+k} from (i, j + 1) and B_{i+j+k,j} from (i + 1, j).
Performance analysis:
- Per phase: computation time O((n/√p)^3), communication time O((n/√p)^2).
- Overall, computation time O(n^3/p), communication time O(n^2/√p), and the compute/communicate ratio n/√p increases again.


How did we save Communication?

- Rowwise decomposition: in each of the p phases row blocks are exchanged. All in all O(p · n^2/p) = O(n^2) communication.
- The algorithm of Fox: a broadcast in each of the √p phases with communication time O(n^2/p · log p). All in all communication time O(n^2/√p · log p): merging point-to-point messages into broadcasts is profitable!
- The algorithm of Cannon: after initially rearranging the submatrices, the broadcasts in the algorithm of Fox are replaced by point-to-point messages. All in all communication time O(√p · n^2/p) = O(n^2/√p).


The DNS Algorithm

p = n^3 processes are arranged in an n × n × n mesh of processes. Process (i, j, 1) stores A[i, j], B[i, j] and has to determine C[i, j].

We move A[i, k] to process (i, ∗, k): (i, k, 1) sends A[i, k] to (i, k, k), which broadcasts A[i, k] to all processes (i, ∗, k).
Next we move B[k, j] to process (∗, j, k): (k, j, 1) sends B[k, j] to (k, j, k), which broadcasts B[k, j] to all processes (∗, j, k).
Process (i, j, k) computes the product A[i, k] · B[k, j].
Process (i, j, 1) computes ∑_{k=1}^{n} A[i, k] · B[k, j] with MPI_Reduce.
Performance analysis:
- The replication step takes time O(log2 n), since the broadcast dominates. The multiplication step runs in constant time and the Reduce operation runs in logarithmic time.
- Time O(log2 n) suffices. Its efficiency Θ(1/log2 n) is too small.
- We scale down.


Scaling down the number of processors

We work with p processes. Let q = p^{1/3} and imagine that the p processes are arranged in a q × q × q mesh.
Input distribution: process (i, j, 1) receives the n/q × n/q submatrices A_{i,j} and B_{i,j}: the matrices A_{i,j} and B_{i,j} play the role of the entries A[i, j] and B[i, j].
Mimic the algorithm for n^3 processes.
Performance analysis:
- The total computing time is O(n^3/q^3) = O(n^3/p), since n/q × n/q matrices have to be multiplied.
- During replication and summing, n/q × n/q matrices are involved and hence the communication time is bounded by O(n^2/q^2 · log p).
- The compute/communicate ratio is n/(q · log2 p).
Best performance so far. p should be sufficiently large.


Summary

The checkerboard decomposition is again better than the rowwise decomposition.
Cannon’s algorithm replaces a broadcast by a point-to-point message and is therefore faster than the algorithm of Fox.
The DNS algorithm partitions the matrices A and B among q^2 of the q^3 processes.
- Thus each “input process” gets a relatively large chunk.
- However there are only two (instead of √p) communication steps: namely when replicating and when summing.
- Observe that DNS is better than Cannon only if p is sufficiently large.


Solving Linear Systems

We are given a matrix A and a right-hand side b and would like to solve the linear system A · x = b.

We begin with the easy case of lower triangular matrices A and describe back substitution.
Then we discuss efficient parallelizations of Gaussian elimination and continue with iterative methods: Jacobi relaxation, the Gauss-Seidel algorithm, the conjugate gradient approach and the Newton method.
Finally we consider the parallelization of the finite difference method.


Backsubstitution

We have to solve the system

A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i

for i = 1, . . . , n.

A sequential solution:
- First determine x_1 from the first equation A[1, 1] · x_1 = b_1.
- If we already know x_1, . . . , x_{i−1}, then determine x_i from the i-th equation.
- Since an evaluation of the i-th equation requires time O(i), the sequential solution runs in time O(n^2) (see the sketch after this list).
We consider two input distributions:
- The off-diagonal decomposition of matrix A: process 1 knows the main diagonal and process i (i ≥ 2) knows the (i − 1)-st off-diagonal A[i, 1], A[i + 1, 2], . . . , A[n, n − i + 1].
- And the rowwise decomposition.
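The O(n^2) sequential substitution loop, as a minimal sketch (assumes nonzero diagonal entries):

```c
/* Solve the lower triangular system A x = b by substitution.
   The i-th equation costs O(i) operations, O(n^2) in total. */
void substitute(int n, double A[n][n], double b[n], double x[n])
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= A[i][j] * x[j];    /* subtract the already known terms */
        x[i] = s / A[i][i];         /* solve the i-th equation for x_i  */
    }
}
```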


The Off-Diagonal Decomposition I

We use the linear array as the communication pattern.

Process 1 successively determines x_1, . . . , x_n. Once computed, x_i is forwarded through the linear array.

How do we solve the i-th equation A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i?
- Process i computes A[i, 1] · x_1 immediately after receiving x_1 from process i − 1. Then i sends A[i, 1] · x_1 to process i − 1 and x_1 to process i + 1.
- If process i − 1 receives x_2 from process i − 2, it computes the product A[i, 2] · x_2, sends the sum A[i, 1] · x_1 + A[i, 2] · x_2 to process i − 2 and forwards x_2 to process i.
- We communicate according to the principle of “just in time production”.


The Off-Diagonal Decomposition II

[Figure: pipelined backsubstitution on a linear array, with processors on one axis and time on the other. The unknowns x_1, x_2, x_3, x_4 travel along the array while the partial sums built from A[2,1], A[3,1], A[3,2], A[4,1], A[4,2], A[4,3] flow back towards process 1.]


The Off-Diagonal Decomposition III

Backsubstitution with p processes.

Assign the off-diagonals (A[j, 1], . . . , A[n, n − j + 1]) for j ∈ {(i − 1) · n/p + 1, . . . , i · n/p} to process i.
The computing time: we have p phases with compute time O((n/p)^2) per phase.
All in all the compute time is bounded by O(n^2/p).
Communication is O(n/p) per phase and hence all in all O(n) communication.
The running time is bounded by O(n^2/p + n).
We achieve constant efficiency whenever n = Ω(p).


The Rowwise Decomposition

This time process i determines x_i. Once x_i is determined, process i broadcasts x_i to processes i + 1, . . . , n.

For p processes:
Each process is responsible for n/p variables. Time O(n) per variable is sufficient.
⇒ The compute time is bounded by O(n^2/p).
There is one broadcast per unknown and the communication time is bounded by O(n · log2 p).
We achieve constant efficiency whenever n = Ω(p · log2 p).


Gaussian Elimination with Partial Pivoting

Include the right-hand side b as the last column of the matrix A.

If we have already eliminated the nonzeroes below the diagonal in columns 1, . . . , i − 1, then
- use the largest entry A[j, i] for j = i, . . . , n as pivot,
- swap rows i and j and set row_k = row_k − (A[k, i]/A[i, i]) · row_i for k > i.

Performance analysis for the sequential algorithm:
When dealing with row i:
- Determine the largest entry A[j, i] in column i in time O(n).
- The elimination step for each of the n − i remaining rows requires O(n − i + 1) arithmetic operations.
- All in all, O(n + (n − i + 1)^2) = O(n^2) operations suffice.
The total number of arithmetic operations is bounded by O(n^3).
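A minimal sequential sketch of this procedure on the augmented matrix (n rows, n + 1 columns, the right-hand side b stored as the last column):

```c
/* Gaussian elimination with partial pivoting on the augmented matrix,
   O(n^3) arithmetic operations overall. */
#include <math.h>

void gauss_eliminate(int n, double A[n][n + 1])
{
    for (int i = 0; i < n; i++) {
        /* Pivot search: largest |A[j][i]| in column i for j >= i, O(n). */
        int piv = i;
        for (int j = i + 1; j < n; j++)
            if (fabs(A[j][i]) > fabs(A[piv][i]))
                piv = j;

        /* Swap rows i and piv. */
        for (int c = i; c <= n; c++) {
            double tmp = A[i][c];
            A[i][c] = A[piv][c];
            A[piv][c] = tmp;
        }

        /* row_k = row_k - (A[k][i]/A[i][i]) * row_i for k > i. */
        for (int k = i + 1; k < n; k++) {
            double f = A[k][i] / A[i][i];
            for (int c = i; c <= n; c++)
                A[k][c] -= f * A[i][c];
        }
    }
}
```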


A parallelization of Gaussian Elimination I

We work with p processes and the rowwise decomposition: each process receives an “interval” of n/p rows.
We maintain the sequential structure of pivoting, but parallelize each pivoting step instead.
Assume that we have reached row i.
- To utilize the rowwise decomposition we look for the largest entry in row i (and not in column i).
- We have to eliminate all non-zeroes in row i: the process holding row i has to
  - determine the largest entry A[i, k] in row i,
  - compute the vector m_i of multiples for the elimination step,
  - and send m_i to the remaining processes.


A parallelization of Gaussian Elimination II

Avoid broadcasting (m_i, k). When dealing with row i − 1:
- After computing m_{i−1}, the process j holding row i interrupts its elimination work for row i − 1,
- immediately recomputes row i and determines (m_i, k) instead,
- sends (m_i, k) to process j + 1, and
- then resumes its elimination work for row i − 1.
We cover communication by computation:
- The expensive broadcast of (m_i, k) is replaced by sending (m_i, k) through the linear array of processes.
- Whenever a process receives (m_i, k), it immediately forwards (m_i, k) to its neighbor process.
Performance analysis:
- There is no delay when eliminating row i if the compute time Θ((n/p) · n) for one pivoting step dominates the maximal communication delay p · n.
The overall compute time is bounded by O(n · (n/p) · n) = O(n^3/p).
There is no delay due to communication, provided n = Ω(p^2).


Iterative Methods

In an iterative method an approximate solution of a linear system A · x = b is successively improved.
One starts with an initial “guess” x(0) and replaces x(t) by a presumably better solution x(t + 1).

Assume that the computation of x(t + 1) is based on the matrix-vector product.

We obtain a fast parallel algorithm and can exploit sparse linear systems.
We describe:
- the Jacobi relaxation and its variants,
- the Newton method to approximately compute the inverse A^{−1}.


Jacobi Relaxation

Assume that A · x* = b. If A has a nonzero diagonal, then

x*_i = (1/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x*_j ).

The Jacobi iteration: if x_i(t) is an approximate solution, set

x_i(t + 1) = (1/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

Each Jacobi iteration corresponds to a matrix-vector product. Hence one iteration runs in time O(n^2/p) and we obtain a fast approximation whenever few iterations suffice.
When does the Jacobi iteration converge to the unique solution?


Jacobi Relaxation: Convergence

Let D be the diagonal matrix with D[i, i] = A[i, i]. Set M = D^{−1} · (D − A).
- Another view of the Jacobi iteration: if A is invertible and if x* is the unique solution of A · x = b, then

  x(t + 1) − x* = M · (x(t) − x*).

- Consequently x(t) − x* = M^t · (x(0) − x*) follows for all t.
- If lim_{t→∞} M^t = 0, then x(t) converges to x*.
The Jacobi relaxation converges for row diagonally dominant matrices A, i.e., if

|A[i, i]| > ∑_{j ≠ i} |A[i, j]|

holds for all i.
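The stated error recurrence follows in one line from the definition of the iteration; a short derivation (assuming A is invertible and all diagonal entries are nonzero):

```latex
% The Jacobi step reads D x(t+1) = b - (A - D) x(t); insert b = A x^* = D x^* + (A - D) x^*.
\begin{align*}
D\,\bigl(x(t+1) - x^*\bigr) &= -(A - D)\,\bigl(x(t) - x^*\bigr),\\
x(t+1) - x^* &= D^{-1}(D - A)\,\bigl(x(t) - x^*\bigr) = M \cdot \bigl(x(t) - x^*\bigr).
\end{align*}
```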


Two Extensions

In many practical applications the Jacobi overrelaxation converges faster.
- For a suitable coefficient γ:

  x_i(t + 1) = (1 − γ) · x_i(t) + (γ/A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

- The Jacobi relaxation is a special case: set γ = 1.
The Gauss-Seidel algorithm incorporates already recomputed values of x_j (i.e., it replaces x_j(t) by x_j(t + 1)).
- An example is

  x_i(t + 1) = (1/A[i, i]) · ( b_i − ∑_{j < i} A[i, j] · x_j(t + 1) − ∑_{j > i} A[i, j] · x_j(t) ).

- The Gauss-Seidel method does not look parallelizable!?


Inversion by Newton Iteration

Assume that A is invertible and that X_t is an approximate inverse of A. The Newton iteration is

X_{t+1} = 2 · X_t − X_t · A · X_t.

What is the intuition behind this approach?
- Consider the residual matrix R_t = I − A · X_t, which measures the “distance” between A^{−1} and X_t. Then

  R_{t+1} = I − A · X_{t+1} = I − A · (2 · X_t − X_t · A · X_t) = (I − A · X_t)^2 = R_t^2.

- R_t converges rapidly towards the 0-matrix whenever X_0 is a good approximation of A^{−1}.
Since the Newton iteration is based on matrix-matrix products, each iteration is easily parallelized.


The Finite Difference Method

Finite differences can be used to approximate derivatives of a function f, since

f′(x) = lim_{h→0} (f(x + h) − f(x))/h = lim_{h→0} (f(x) − f(x − h))/h,
f″(x) = lim_{h→0} (f′(x + h) − f′(x))/h = lim_{h→0} (f(x + h) − 2f(x) + f(x − h))/h^2.

The finite difference method is used to solve differential equations:
- Derivatives are approximated by finite differences,
- and differential equations are modeled by linear systems of equations.
The usually sparse systems are mostly solved with iterative methods.


An Example: The Poisson Equation

Find a function u : [0, 1]^2 → R which satisfies the Poisson equation u_xx + u_yy = H and which has prescribed values on the boundary of the unit square [0, 1]^2.

If u is sufficiently smooth and if h is sufficiently small, then

u_xx(x, y) ≈ (u(x + h, y) − 2u(x, y) + u(x − h, y))/h^2.

Approximate u_yy analogously and we get

H(x, y) ≈ (u(x + h, y) + u(x − h, y) + u(x, y + h) + u(x, y − h) − 4u(x, y))/h^2.

For N sufficiently large, set

h = 1/N,  u_{i,j} = u(i/N, j/N)  and  H_{i,j} = H(i/N, j/N).


The Linear System I

Choose (x, y) as one of the grid points (i/N, j/N) with 0 < i, j < N and we get the linear system

−4u_{i,j} + u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} = H_{i,j}/N^2.

The system is huge: (N − 1)^2 equations in (N − 1)^2 unknowns. (The values of u at the boundary are prescribed.)
The matrix of the system has (N − 1)^4 entries, but it is sparse, since any equation has at most five nonzero coefficients.
To utilize sparsity, we apply iterative methods.


The Linear System II

We process the system beginning with the lower boundary and working upwards:

u_{i+1,j} = 4u_{i,j} − (u_{i−1,j} + u_{i,j+1} + u_{i,j−1}) + H_{i,j}/N^2.

We apply the Jacobi relaxation and get

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j}/N^2.

How do we obtain an efficient parallelization of one iteration?
- We use the checkerboard decomposition of the (N − 1) × (N − 1) grid. The p processes are arranged in a √p × √p mesh.
- The process responsible for determining u_{i+1,j}(t + 1) has to know u_{i−1,j}(t), u_{i,j+1}(t), u_{i,j−1}(t) and u_{i,j}(t).
- Any missing value belongs to the “near-boundary” of a neighbor.


The Linear System: Performance Analysis

The processes communicate their O(N/√p) near-boundary values.
Afterwards they perform an iteration without further communication: the computation time is bounded by O(N^2/p), since the update time for any grid point is constant and each process has to update O(N^2/p) grid points.
We have constant efficiency whenever N = Ω(√p).
One can show that the matrix of the linear system is symmetric and positive definite: the Jacobi relaxation converges.


Gauss-Seidel Revisited

So far we have used the Jacobi relaxation

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j}/N^2.

Can we use the generally better Gauss-Seidel method instead?
- Use the new update

  u_{i,j}(t + 1) = (u_{i+1,j}(t) + u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t))/4 − H_{i,j}/(4N^2).

- Label grid point (i, j) white iff i + j is even and black otherwise: white grid points are updated with black grid points only.
- Execute one iteration of the Gauss-Seidel algorithm by
  - first updating the black grid points conventionally with the Jacobi relaxation,
  - then applying the Gauss-Seidel algorithm to update the white grid points by using the already updated black grid points.
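One such red-black sweep, as a minimal sequential sketch (grid u of size (N + 1) × (N + 1) with fixed boundary values; the update follows the formula above):

```c
/* One red-black Gauss-Seidel sweep for the discretized Poisson equation.
   Pass 0 updates the black points (i + j odd), pass 1 the white points. */
void red_black_sweep(int N, double u[N + 1][N + 1], double H[N + 1][N + 1])
{
    for (int pass = 0; pass < 2; pass++) {
        int parity = (pass == 0) ? 1 : 0;        /* black first, then white */
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++)
                if ((i + j) % 2 == parity)
                    u[i][j] = (u[i + 1][j] + u[i - 1][j]
                             + u[i][j + 1] + u[i][j - 1]) / 4.0
                             - H[i][j] / (4.0 * N * N);
    }
}
```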
