


Sparse Linear Algebra: Direct Methods

P. Amestoy and A. Buttari, INPT(ENSEEIHT)

A. Guermouche (Univ. Bordeaux-LaBRI), J.-Y. L'Excellent and Bora Uçar (INRIA-CNRS/LIP-ENS Lyon) and

F.-H. Rouet (Lawrence Berkeley National Laboratory)

2013-2014

1 / 165

Outline

Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
  Conclusion

2 / 165

A selection of references

- Books
  - Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
  - Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
  - Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
  - George, Liu, and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear.

- Articles
  - Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
  - Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
  - Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 1991.

3 / 165

Motivations

- Solution of linear systems of equations → key algorithmic kernel

    Continuous problem
           ↓
    Discretization
           ↓
    Solution of a linear system Ax = b

- Main parameters:
  - Numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  - Size and structure:
    - Large (order > 10^7 to 10^8), square/rectangular
    - Dense or sparse (structured/unstructured)
  - Target computer (sequential/parallel/multicore)

→ Algorithmic choices are critical

4 / 165


Motivations for designing efficient algorithms

- Time-critical applications
- Solve larger problems
- Decrease elapsed time (code optimization, parallelism)
- Minimize cost of computations (time, memory)

5 / 165

Difficulties

- Access to data:
  - Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  - Sparse matrix: large irregular dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)
- Efficiency (time and memory):
  - The number of operations and the memory depend very much on the algorithm used and on the numerical and structural properties of the problem.
  - The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), multicore).

→ Algorithmic choices are critical

6 / 165

Sparse matrices

Example:

    3x1 + 2x2       = 5
          2x2 − 5x3 = 1
    2x1       + 3x3 = 0

can be represented as Ax = b, where

    A = [ 3  2  0 ]       x = [ x1 ]       b = [ 5 ]
        [ 0  2 −5 ],          [ x2 ],          [ 1 ]
        [ 2  0  3 ]           [ x3 ]           [ 0 ]

Sparse matrix: only nonzeros are stored.

7 / 165

Sparse matrix ?

[Spy plot of the matrix: nz = 5104]

Matrix dwt_592.rua (N = 592, NZ = 5104); structural analysis of a submarine.

8 / 165


Sparse matrix ?

Matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv):

[Spy plot: order ≈ 7000, nz = 43105]

A "saddle-point" problem.

9 / 165

Preprocessing sparse matrices

Original (A = lhr01) and preprocessed matrix (A′(lhr01)):

[Two spy plots, nz = 18427 each]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Dc Q Pt Qn

10 / 165

Factorization process

Solution of Ax = b:

- A is unsymmetric:
  - A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix.
  - Forward-backward substitution: Ly = b, then Ux = y.
- A is symmetric:
  - positive definite: A = LL^T
  - general: A = LDL^T
- A is rectangular, m × n with m ≥ n, and we solve min_x ‖Ax − b‖2:
  - A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is triangular.
  - Solve: y = Q^T b, then Rx = y.

11 / 165

Difficulties

- Only nonzero values are stored
- Factors L and U have far more nonzeros than A
- Data structures are complex
- Computations are only a small portion of the code (the rest is data manipulation)
- Memory size is a limiting factor (→ out-of-core solvers)

12 / 165


Key numbers:

1. Small sizes: 500 MB matrix; factors = 5 GB; flops = 100 Gflops.
2. Example of 2D problem (Lab. Geosciences Azur, Valbonne):
   - Complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros
   - Storage (single precision): 2 GB (12 GB with the factors)
   - Flops: 10 TeraFlops
3. Example of 3D problem (EDF, Code_Aster, structural engineering):
   - Real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros
   - Storage: 3.5 × 10^9 entries (28 GB) for the factors, 35 GB total
   - Flops: 2.1 × 10^13

13 / 165

Example in structural mechanics:

Car body: 227,362 unknowns, 5,757,996 nonzeros (MSC.Software).

Size of factors: 51.1 million entries. Number of operations: 44.9 × 10^9.

14 / 165

Example in structural mechanics:

BMW crankshaft: 148,770 unknowns, 5,396,386 nonzeros (MSC.Software).

Size of factors: 97.2 million entries. Number of operations: 127.9 × 10^9.

15 / 165

Data structure for sparse matrices

- The storage scheme depends on the pattern of the matrix and on the type of access required:
  - band or variable-band matrices
  - "block bordered" or block tridiagonal matrices
  - general matrix
  - row, column or diagonal access

16 / 165


Data formats for a general sparse matrix A

What needs to be represented

- Assembled matrices: M×N matrix A with NNZ nonzeros.
- Elemental matrices (unassembled): M×N matrix A with NELT elements.
- Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes).
- Symmetric (or Hermitian) → store only part of the data.
- Distributed format?
- Duplicate entries and/or out-of-range values?

17 / 165

Classical Data Formats for Assembled Matrices

- Example of a 3×3 matrix with NNZ = 5 nonzeros:

      [ a11   .    .  ]
      [  .   a22  a23 ]
      [ a31   .   a33 ]

- Coordinate format:
      IRN[1:NNZ] = [1 3 2 2 3]
      JCN[1:NNZ] = [1 1 2 3 3]
      VAL[1:NNZ] = [a11 a31 a22 a23 a33]

- Compressed Sparse Column (CSC) format:
      IRN[1:NNZ]    = [1 3 2 2 3]
      VAL[1:NNZ]    = [a11 a31 a22 a23 a33]
      COLPTR[1:N+1] = [1 3 4 6]
  Column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)−1.

- Compressed Sparse Row (CSR) format: similar to CSC, but row by row.

18 / 165

Classical Data Formats for Assembled Matrices

- Same 3×3 example matrix with NNZ = 5 nonzeros as on the previous slide.

- Diagonal format (M = N):
      NDIAG = 3
      IDIAG = [−2 0 1]

      VAL = [ na   a11  0   ]
            [ na   a22  a23 ]
            [ a31  a33  na  ]
      (na: not accessed)

  VAL(i,j) corresponds to A(i, i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N).

19 / 165

Sparse Matrix-vector products Y ← AX

The algorithm depends on the sparse matrix format (see the sketch below):

- Coordinate format
- CSC format
- CSR format

20 / 165
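To make the format differences concrete, here is a minimal Python sketch of y ← Ax in the three formats (array names follow the IRN/JCN/VAL and COLPTR conventions above, with 1-based indices as in the slides; illustration only):

```python
import numpy as np

def spmv_coo(irn, jcn, val, x, m):
    """y <- A x with A in coordinate format."""
    y = np.zeros(m)
    for k in range(len(val)):
        y[irn[k] - 1] += val[k] * x[jcn[k] - 1]
    return y

def spmv_csc(irn, val, colptr, x, m):
    """y <- A x with A in CSC format: column j scatters val*x[j] into y."""
    y = np.zeros(m)
    for j in range(len(colptr) - 1):
        for k in range(colptr[j] - 1, colptr[j + 1] - 1):
            y[irn[k] - 1] += val[k] * x[j]
    return y

def spmv_csr(jcn, val, rowptr, x):
    """y <- A x with A in CSR format: row i is a sparse dot product."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(rowptr) - 1):
        for k in range(rowptr[i] - 1, rowptr[i + 1] - 1):
            y[i] += val[k] * x[jcn[k] - 1]
    return y

# The 3x3 example above with a11=1, a31=2, a22=3, a23=4, a33=5
irn, jcn, val = [1, 3, 2, 2, 3], [1, 1, 2, 3, 3], [1.0, 2.0, 3.0, 4.0, 5.0]
colptr = [1, 3, 4, 6]
x = np.ones(3)
print(spmv_coo(irn, jcn, val, x, 3))      # [1. 7. 7.]
print(spmv_csc(irn, val, colptr, x, 3))   # same result
```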


Example of elemental matrix format

Two elemental matrices: A1 involves variables {1, 2, 3} and A2 involves variables {3, 4, 5}:

    A1 = (rows/cols 1,2,3)      A2 = (rows/cols 3,4,5)
         [ −1  2  3 ]                [ 2 −1  3 ]
         [  2  1  1 ]                [ 1  2 −1 ]
         [  1  1  1 ]                [ 3  2  1 ]

    A = A1 + A2 = [ −1  2  3  0  0 ]
                  [  2  1  1  0  0 ]
                  [  1  1  3 −1  3 ]
                  [  0  0  1  2 −1 ]
                  [  0  0  3  2  1 ]

21 / 165

Example of elemental matrix format

(Same A1 and A2 as on the previous slide.)

- N = 5, NELT = 2, NVAR = 6, A = Σ_{i=1}^{NELT} Ai

      ELTPTR[1:NELT+1] = [1 4 7]
      ELTVAR[1:NVAR]   = [1 2 3 3 4 5]
      ELTVAL[1:NVAL]   = [-1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1]

- Remarks:
  - NVAR = ELTPTR(NELT+1) − 1
  - Order of element i: Si = ELTPTR(i+1) − ELTPTR(i)
  - NVAL = Σ Si² (unsymmetric) or Σ Si(Si+1)/2 (symmetric)
  - Storage of elements in ELTVAL: by columns

22 / 165
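A small Python sketch of assembling A = Σ Ai from these arrays (unsymmetric column-major storage as above; the dense result is for illustration only):

```python
import numpy as np

N, NELT = 5, 2
eltptr = [1, 4, 7]                       # 1-based pointers into eltvar
eltvar = [1, 2, 3, 3, 4, 5]              # variables of each element
eltval = [-1, 2, 1, 2, 1, 1, 3, 1, 1,    # A1, column by column
           2, 1, 3, -1, 2, 2, 3, -1, 1]  # A2, column by column

def assemble(n, nelt, eltptr, eltvar, eltval):
    """Assemble A as the sum of elemental matrices (dense, illustrative)."""
    A = np.zeros((n, n))
    pos = 0
    for e in range(nelt):
        vars_e = eltvar[eltptr[e] - 1 : eltptr[e + 1] - 1]
        for cj in vars_e:        # columns of the element ...
            for ri in vars_e:    # ... then rows: column-major ELTVAL
                A[ri - 1, cj - 1] += eltval[pos]
                pos += 1
    return A

print(assemble(N, NELT, eltptr, eltvar, eltval))  # the 5x5 matrix A above
```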

File storage: Rutherford-Boeing

- Standard ASCII format for files
- Header + data (CSC format). Key xyz:
  - x = [rcp] (real, complex, pattern)
  - y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric (A = −A^T), rectangular)
  - z = [ae] (assembled, elemental)
  - Examples: M_T1.RSA, SHIP003.RSE
- Supplementary files: right-hand sides, solution, permutations, ...
- A canonical format was introduced to guarantee a unique representation (order of entries in each column, no duplicates).

23 / 165

File storage: Rutherford-Boeing

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1

1733710 9758 492558 1231394 0

rsa 97578 97578 4925574 0

(10I8) (10I8) (3e26.16)

1 49 96 142 187 231 274 346 417 487

556 624 691 763 834 904 973 1041 1108 1180

1251 1321 1390 1458 1525 1573 1620 1666 1711 1755

1798 1870 1941 2011 2080 2148 2215 2287 2358 2428

2497 2565 2632 2704 2775 2845 2914 2982 3049 3115

...

1 2 3 4 5 6 7 8 9 10

11 12 49 50 51 52 53 54 55 56

57 58 59 60 67 68 69 70 71 72

223 224 225 226 227 228 229 230 231 232

233 234 433 434 435 436 437 438 2 3

4 5 6 7 8 9 10 11 12 49

50 51 52 53 54 55 56 57 58 59

...

-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11

0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08

0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10

0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07

0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09

0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07

0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08

0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08

24 / 165


File storage: Matrix-market

- Example

%%MatrixMarket matrix coordinate real general

% Comments

5 5 8

1 1 1.000e+00

2 2 1.050e+01

3 3 1.500e-02

1 4 6.000e+00

4 2 2.505e+02

4 4 -2.800e+02

4 5 3.332e+01

5 5 1.200e+01

25 / 165
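Files in this format can be read directly with SciPy; a small usage sketch (the file name is hypothetical):

```python
from scipy.io import mmread

# mmread returns a COO (coordinate) sparse matrix for a sparse .mtx file,
# assuming the example above was saved as "example.mtx"
A = mmread("example.mtx").tocsr()   # convert to CSR for row-wise access
print(A.shape, A.nnz)               # (5, 5) 8
```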

Examples of sparse matrix collections

- The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/
- Matrix Market: http://math.nist.gov/MatrixMarket/
- Rutherford-Boeing: http://www.cerfacs.fr/algor/Softs/RB/index.html
- TLSE: http://gridtlse.org/

26 / 165

Gaussian elimination

A = A(1), b = b(1), A(1)x = b(1):

    [ a11 a12 a13 ] [x1]   [b1]
    [ a21 a22 a23 ] [x2] = [b2]
    [ a31 a32 a33 ] [x3]   [b3]

Row operations: row2 ← row2 − row1 × a21/a11 and row3 ← row3 − row1 × a31/a11 give A(2)x = b(2):

    [ a11  a12      a13     ] [x1]   [ b1      ]
    [ 0    a22^(2)  a23^(2) ] [x2] = [ b2^(2)  ]
    [ 0    a32^(2)  a33^(2) ] [x3]   [ b3^(2)  ]

with b2^(2) = b2 − a21 b1/a11, ..., a32^(2) = a32 − a31 a12/a11, ...

Finally, A(3)x = b(3):

    [ a11  a12      a13     ] [x1]   [ b1      ]
    [ 0    a22^(2)  a23^(2) ] [x2] = [ b2^(2)  ]
    [ 0    0        a33^(3) ] [x3]   [ b3^(3)  ]

with a33^(3) = a33^(2) − a32^(2) a23^(2) / a22^(2), ...

Typical Gaussian elimination step k:   a_ij^(k+1) = a_ij^(k) − a_ik^(k) a_kj^(k) / a_kk^(k)

27 / 165

Relation with A = LU factorization

- One step of Gaussian elimination can be written A(k+1) = L(k) A(k), where L(k) is the identity except for column k, which has 1 on the diagonal and −l_{i,k} (i = k+1, ..., n) below it, with l_ik = a_ik^(k) / a_kk^(k).

- Then A(n) = U = L(n−1) ... L(1) A, which gives A = LU, with

      L = [L(1)]^{-1} ... [L(n−1)]^{-1} : unit lower triangular, with entries l_ij below the diagonal.

- In dense codes, the entries of L and U overwrite the entries of A.

- Furthermore, if A is symmetric, A = LDL^T with d_kk = a_kk^(k):
  A = LU = A^T = U^T L^T implies U (L^T)^{-1} = L^{-1} U^T = D diagonal and U = D L^T, thus A = L(DL^T) = LDL^T.


Dense LU factorization

- Step by step, columns of A are set to zero and A is updated: L(n−1) ... L(1) A = U, leading to A = LU where L = [L(1)]^{-1} ... [L(n−1)]^{-1}.
- Zeroed entries in each column of A can be replaced by the entries of L; the row entries of U can be stored in the corresponding locations of A.

Algorithm 1 Dense LU factorization
 1: for k = 1 to n do
 2:   L(k,k) = 1 ; L(k+1:n, k) = A(k+1:n, k) / A(k,k)
 3:   U(k, k:n) = A(k, k:n)
 4:   for j = k+1 to n do
 5:     for i = k+1 to n do
 6:       A(i,j) = A(i,j) − L(i,k) × U(k,j)
 7:     end for
 8:   end for
 9: end for

When |A(k,k)| is relatively too small, numerical pivoting is required.

29 / 165
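A direct transcription of Algorithm 1 into Python (a minimal sketch without pivoting, so it assumes nonzero pivots; illustration only, not the course's reference code):

```python
import numpy as np

def dense_lu(A):
    """Dense LU without pivoting, following Algorithm 1; assumes A(k,k) != 0."""
    n = A.shape[0]
    A = A.astype(float).copy()
    L, U = np.zeros((n, n)), np.zeros((n, n))
    for k in range(n):
        L[k, k] = 1.0
        L[k+1:, k] = A[k+1:, k] / A[k, k]        # multipliers below the pivot
        U[k, k:] = A[k, k:]                      # pivot row goes to U
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])  # rank-1 update
    return L, U

A = np.array([[3., 2., 0.], [0., 2., -5.], [2., 0., 3.]])  # earlier example
L, U = dense_lu(A)
print(np.allclose(L @ U, A))   # True
```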

Gaussian elimination and sparsity

Step k of LU factorization (a_kk pivot):

- For i > k, compute l_ik = a_ik / a_kk (= a′_ik),
- For i > k, j > k, update the remaining rows/columns of the matrix:

      a′_ij = a_ij − (a_ik × a_kj) / a_kk = a_ij − l_ik × a_kj

- If a_ik ≠ 0 and a_kj ≠ 0, then a′_ij ≠ 0.
- If a_ij was zero, its new nonzero value must be stored.

[Diagram: a_ik and a_kj nonzero; the zero at position (i, j) becomes a fill-in.]

30 / 165

Gaussian elimination and sparsity

- Same for Cholesky:
  - For i > k, compute l_ik = a_ik / √(a_kk) (= a′_ik),
  - For i > k, j > k, j ≤ i (lower triangular part):

        a′_ij = a_ij − (a_ik × a_jk) / a_kk = a_ij − l_ik × l_jk

31 / 165

Gaussian elimination and sparsity

- Interest of permuting a matrix (arrow matrix, exchange 1 ↔ 5):

      [ X X X X X ]            [ X 0 0 0 X ]
      [ X X 0 0 0 ]            [ 0 X 0 0 X ]
      [ X 0 X 0 0 ]   1 ↔ 5    [ 0 0 X 0 X ]
      [ X 0 0 X 0 ]            [ 0 0 0 X X ]
      [ X 0 0 0 X ]            [ X X X X X ]

- Ordering the variables has a strong impact on:
  - fill-in
  - number of operations
  - shape of the dependency graph (tree) and parallelism
- Fill reduction is NP-hard in general [Yannakakis '81]

32 / 165

Page 9: Outline Sparse Linear Algebra: Direct Methodsamestoy.perso.enseeiht.fr/COURS/ALC_2013_2014.pdf · Sparse matrices Gaussian elimination Conclusion 2/165 A selection of references I

Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua

Harwell-Boeing matrix dwt_592.rua, structural computing on a submarine. NZ(LU factors) = 58202.

[Spy plots: original matrix (nz = 5104) and factorized matrix (nz = 58202)]

33 / 165

Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua

NZ(LU factors) = 16924.

[Spy plots: permuted matrix (RCM, nz = 5104) and factorized permuted matrix (nz = 16924)]

34 / 165

Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al. '91).

    Procedure                      Total storage   Flops          Time (sec.) on CRAY J90
    Full system                    4084 Kwords     5503 × 10^6    34.5
    Sparse system                  71 Kwords       1073 × 10^6    3.4
    Sparse system and reordering   14 Kwords       42 × 10^3      0.9

35 / 165

Sparse LU factorization : a (too) simple algorithm

Only nonzeros are stored and operated on.

Algorithm 2 Simple sparse LU factorization
 1: Permute matrix A to reduce fill-in and flops (NP-complete problem)
 2: for k = 1 to n do
 3:   L(k,k) = 1 ; for nonzeros in column k: L(k+1:n, k) = A(k+1:n, k) / A(k,k)
 4:   U(k, k:n) = A(k, k:n)
 5:   for j = k+1 to n, limited to nonzeros in row U(k, :) do
 6:     for i = k+1 to n, limited to nonzeros in column L(:, k) do
 7:       A(i,j) = A(i,j) − L(i,k) × U(k,j)
 8:     end for
 9:   end for
10: end for

Questions:
- Dynamic data structure for A, to accommodate fill-in
- Data access efficiency; can we predict the position of fill-in?
- |A(k,k)| too small → numerical permutation needed!

36 / 165


Control of numerical stability: numerical pivoting

- In dense linear algebra, partial pivoting is commonly used (at each step, the largest entry in the column is selected).
- In sparse linear algebra, flexibility to preserve sparsity is offered:
  - Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots = {r : |a_rk^(k)| ≥ u × max_i |a_ik^(k)|}, where 0 < u ≤ 1.
  - Then, among the eligible pivots, select one that better preserves sparsity.
  - u is called the threshold parameter (u = 1 → partial pivoting).
  - It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj)/a_kk to 1 + 1/u, which is sufficient to preserve numerical stability.
  - u ≈ 0.1 is often chosen in practice.
- For symmetric indefinite problems, 2 × 2 pivots (with threshold) are also used, to preserve symmetry and sparsity.
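As an illustration, a toy sketch of the eligible-pivot test (the helper name is hypothetical; u is the threshold parameter above):

```python
import numpy as np

def eligible_pivots(col, u=0.1):
    """Rows r with |col[r]| >= u * max_i |col[i]|.

    `col` holds the candidate column a(k:n, k) of the active submatrix;
    among the returned rows, a sparse code would then pick the one that
    best preserves sparsity (e.g., smallest Markowitz cost)."""
    cmax = np.max(np.abs(col))
    return np.flatnonzero(np.abs(col) >= u * cmax)

col = np.array([0.05, -2.0, 0.3, 1.5])
print(eligible_pivots(col, u=0.1))   # [1 2 3]: entries with |a| >= 0.2
```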

Threshold pivoting and numerical accuracy

Table: Effect of variation in the threshold parameter u on a matrix of order 541 × 541 with 4285 nonzeros (Dongarra et al. '91).

    u        Nonzeros in LU factors   Error
    1.0      16767                    3 × 10^−9
    0.25     14249                    6 × 10^−10
    0.1      13660                    4 × 10^−9
    0.01     15045                    1 × 10^−5
    10^−4    16198                    1 × 10^2
    10^−10   16553                    3 × 10^23

Difficulty: numerical pivoting implies dynamic data structures that cannot be forecast symbolically.

38 / 165

Three-phase scheme to solve Ax = b

1. Analysis step
   - Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   - Build the dependency graph (elimination tree, eDAG, ...)
   - Apre = P Dr A Qc Dc P^T
2. Factorization (Apre = LU, LDL^T, LL^T, QR), with numerical pivoting
3. Solution based on the factored matrices
   - Triangular solves: Ly = bpre, then Uxpre = y
   - Improvement of the solution (iterative refinement), error analysis

39 / 165

Efficient implementation of sparse algorithms

- Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

      do i = 1, m
         A( ind(i) ) = A( ind(i) ) + alpha * w( i )
      enddo

- Even if manufacturers provide hardware support for indirect addressing, it penalizes performance.
- Identify dense blocks, or switch to dense calculations as soon as the matrix is not sparse enough.

40 / 165


Effect of switch to dense calculations

Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al. '91):

    Density for switch   Order of         Millions   Time
    to full code         full submatrix   of flops   (seconds)
    No switch            0                7          21.8
    1.00                 74               7          21.4
    0.80                 190              8          15.0
    0.60                 235              11         12.5
    0.40                 305              21         9.0
    0.20                 422              50         5.5
    0.10                 531              100        3.7
    0.005                1420             1908       6.1

Sparse structure should be exploited if density < 10%.

41 / 165

Complexity of sparse direct methods

Regular problems (nested dissection):

                                    2D (N × N grid)   3D (N × N × N grid)
    Nonzeros in original matrix     Θ(N²)             Θ(N³)
    Nonzeros in factors             Θ(N² log N)       Θ(N⁴)
    Floating-point ops              Θ(N³)             Θ(N⁶)

3D example in earth science: acoustic wave propagation, 27-point finite-difference grid.
Current goal [Operto and Virieux, http://seiscope.oca.eu/]: LU on the complete earth.

Extrapolation on a 1000 × 1000 × 1000 grid: 15 exaflops, 200 TBytes for the factors, 32 TBytes of working memory!

42 / 165

Summary – sparse matrices main issues

- Widely used in engineering and industry (critical for performance of simulations)
- Irregular dynamic data structures of very large size (Tera/PetaBytes)
- Strong relations between sparse matrices and graphs
- Which graph for which matrix (symmetric, unsymmetric, triangular)?
- Efficient algorithms needed for:
  - Which graph for which algorithm?
  - Reordering sparse matrices
  - Predicting the data structures of the factor matrices
  - Modeling factorization and accommodating numerical issues

43 / 165

Outline

Graph definitions and relations to sparse matrices
  Which graph for which matrix?
  Trees and spanning trees
  Ordering and tree traversal
  Peripheral and pseudo-peripheral vertices
  Graph matching and matrix permutation
  Reducibility and connected components
  Definition of hypergraphs

Postorderings and memory usage

44 / 165


Graph notations and definitions

A graph G = (V, E) consists of a finite set V, called the vertex set, and a finite binary relation E on V, called the edge set.

Three standard graph models:

- Undirected graph: the edges are unordered pairs of vertices, i.e., {u, v} ∈ E for some u, v ∈ V.
- Directed graph: the edges are ordered pairs of vertices; that is, (u, v) and (v, u) are two different edges.
- Bipartite graph: G = (U ∪ V, E) consists of two disjoint vertex sets U and V such that for each edge (u, v) ∈ E, u ∈ U and v ∈ V.

An ordering or labelling of G = (V, E) having n vertices, i.e., |V| = n, is a mapping of V onto {1, 2, ..., n}.

45 / 165

Matrices and graphs: Rectangular matrices

The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph.

Rectangular matrices: a 3 × 4 pattern matrix A and its bipartite graph.

[Figure: bipartite graph with row vertices r1, r2, r3 and column vertices c1, c2, c3, c4]

The set of rows corresponds to one vertex set R and the set of columns to the other vertex set C, such that for each aij ≠ 0, (ri, cj) is an edge.

46 / 165

Matrices and graphs: Square unsymmetric pattern

The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph.

Square unsymmetric-pattern matrices: a 3 × 3 pattern matrix A.

Graph models:
- Bipartite graph, as before.
- Directed graph [figure: vertices v1, v2, v3]: the set of rows/columns corresponds to the vertex set V such that for each aij ≠ 0, (vi, vj) is an edge. A transposed view is possible too, i.e., the edge (vi, vj) directed from column i to row j. Usually self-loops are omitted.

47 / 165

Matrices and graphs: Square unsymmetric pattern

A special subclass: directed acyclic graphs (DAGs), i.e., directed graphs with no cycles (except possibly self-loops).

In a DAG, we can sort the vertices such that if (u, v) is an edge, then u appears before v in the ordering.

Question: What kind of matrices have a DAG structure?

48 / 165


Matrices and graphs: Symmetric pattern

The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph.

Square symmetric-pattern matrices: a 3 × 3 pattern matrix A.

Graph models:
- Bipartite and directed graphs, as before.
- Undirected graph [figure: vertices v1, v2, v3]: the set of rows/columns corresponds to the vertex set V such that for each aij, aji ≠ 0, {vi, vj} is an edge. No self-loops; usually the main diagonal is assumed to be zero-free.

49 / 165

Definitions: Edges, degrees, and paths

Many definitions for directed and undirected graphs are the same. We will use (u, v) to refer to an edge of an undirected or directed graph, to avoid repeated definitions.

- An edge (u, v) is said to be incident on the vertices u and v.
- For any vertex u, the vertices in adj(u) = {v : (u, v) ∈ E} are called the neighbors of u. The vertices in adj(u) are said to be adjacent to u.
- The degree of a vertex is the number of edges incident on it.
- A path p of length k is a sequence of vertices ⟨v0, v1, ..., vk⟩ where (v_{i−1}, v_i) ∈ E for i = 1, ..., k. The two end points v0 and vk are said to be connected by the path p, and the vertex vk is said to be reachable from v0.

50 / 165

Definitions: Components

- An undirected graph is said to be connected if every pair of vertices is connected by a path.
- The connected components of an undirected graph are the equivalence classes of vertices under the "is reachable from" relation.
- A directed graph is said to be strongly connected if every pair of vertices is reachable from each other.
- The strongly connected components of a directed graph are the equivalence classes of vertices under the "are mutually reachable" relation.

51 / 165

Definitions: Trees and spanning trees

A tree is a connected, acyclic, undirected graph. If an undirected graph is acyclic but disconnected, then it is a forest.

Properties of trees:
- Any two vertices are connected by a unique path.
- |E| = |V| − 1

A rooted tree is a tree with a distinguished vertex r, called the root. There is a unique path from the root r to every other vertex v. Any vertex y on that path is called an ancestor of v; if y is an ancestor of v, then v is a descendant of y.

The subtree rooted at v is the tree induced by the descendants of v, rooted at v.

A spanning tree of a connected graph G = (V, E) is a tree T = (V, F) such that F ⊆ E.

52 / 165


Ordering of the vertices of a rooted tree

- A topological ordering of a rooted tree is an ordering that numbers children vertices before their parent.
- A postorder is a topological ordering which numbers the vertices in any subtree consecutively.

[Figures: a connected graph G on vertices u, v, w, x, y, z; a rooted spanning tree with a topological ordering; and a rooted spanning tree with a postordering]

53 / 165

How to explore a graph:

Two algorithms such that each edge is traversed exactly once in the forward and reverse directions and each vertex is visited (connected graphs are assumed in the description; see the sketch below):

- Depth-first search: starting from a given vertex (mark it), follow a path (marking each vertex on the path) until the current vertex is adjacent only to marked vertices; then return to the previous vertex and continue.
- Breadth-first search: select a vertex and put it into an initially empty queue of vertices to be visited. Repeatedly remove the vertex x at the head of the queue and place all neighbors of x that were never enqueued before into the queue.

54 / 165
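A minimal Python sketch of both traversals on an adjacency-list graph (the dictionary representation is an assumption; both return the visit order):

```python
from collections import deque

def bfs(adj, start):
    """Breadth-first search; adj maps a vertex to its list of neighbors."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        x = queue.popleft()
        order.append(x)
        for y in adj[x]:
            if y not in visited:          # enqueue each vertex at most once
                visited.add(y)
                queue.append(y)
    return order

def dfs(adj, start):
    """Depth-first search (iterative, with an explicit stack)."""
    visited, order, stack = set(), [], [start]
    while stack:
        x = stack.pop()
        if x not in visited:
            visited.add(x)
            order.append(x)
            stack.extend(reversed(adj[x]))  # keep natural neighbor order
    return order

adj = {'v': ['y', 'z'], 'y': ['v', 'x'], 'z': ['v'], 'x': ['y']}
print(bfs(adj, 'v'))   # ['v', 'y', 'z', 'x']
print(dfs(adj, 'v'))   # ['v', 'y', 'x', 'z']
```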

Illustration of DFS and BFS exploration

[Figures: a graph on vertices s, t, u, v, w, x, y, z; its breadth-first spanning tree rooted at v and its depth-first spanning tree rooted at v, with visit numbers 1-8]

55 / 165

Properties of DFS and BFS exploration

- If the graph G is not connected, the algorithm is restarted; in fact, it then detects the connected components of G.
- Each vertex is visited once.
- The edges visited by the DFS algorithm form a spanning forest of a general graph G, called the depth-first spanning forest of G.
- The edges visited by the BFS algorithm form a spanning forest of a general graph G, called the breadth-first spanning forest of G.

56 / 165


Graph shapes and peripheral vertices

- The distance d(v, w) between two vertices of a graph is the length of the shortest path joining those vertices.
- The eccentricity: l(v) = max{d(v, w) : w ∈ V}
- The diameter: δ(G) = max{l(v) : v ∈ V}
- v is a peripheral vertex if l(v) = δ(G)
- Example: a connected graph with δ(G) = 5 and peripheral vertices s, w, x.

[Figure: graph on vertices s, t, u, v, w, x, y, z]

57 / 165

Pseudo-peripheral vertices

- The rooted level structure L(v) of G is a partitioning of G, L(v) = {L0(v), ..., L_l(v)(v)}, such that L0(v) = {v} and Li(v) = Adj(L_{i−1}(v)) \ L_{i−2}(v).
- The eccentricity l(v) is also called the length of L(v). The width w(v) of L(v) is defined as w(v) = max{|Li(v)| : 0 ≤ i ≤ l(v)}.
- The aspect ratio of L(v) is defined as l(v)/w(v).
- A pseudo-peripheral vertex is a node with a large eccentricity.
- Algorithm to determine a pseudo-peripheral vertex (sketched in code below): let i = 0, let ri be an arbitrary node, and set nblevels = l(ri). Iteratively select r_{i+1} as a vertex of minimum degree in the last level L_{l(ri)}(ri) and update nblevels, stopping as soon as l(r_{i+1}) is no longer larger than nblevels.

58 / 165
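A minimal sketch of this iteration, building the level structures with BFS (names are illustrative, not from a specific library):

```python
def level_structure(adj, root):
    """Return the BFS levels L0, L1, ... rooted at `root`."""
    levels, seen = [[root]], {root}
    while True:
        nxt = [y for x in levels[-1] for y in adj[x] if y not in seen]
        if not nxt:
            return levels
        nxt = list(dict.fromkeys(nxt))    # deduplicate, keep order
        seen.update(nxt)
        levels.append(nxt)

def pseudo_peripheral(adj, start):
    """Iterate toward a vertex of large eccentricity (GPS-style heuristic)."""
    r, levels = start, level_structure(adj, start)
    while True:
        cand = min(levels[-1], key=lambda v: len(adj[v]))  # min degree in last level
        cand_levels = level_structure(adj, cand)
        if len(cand_levels) <= len(levels):   # eccentricity no longer grows
            return r
        r, levels = cand, cand_levels

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # a path graph
print(pseudo_peripheral(adj, 2))               # 4, an endpoint of the path
```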

Pseudo-peripheral vertices

[Figures: first step, select v, with level structure L(v) = L0(v), ..., L3; second step, select w in L3 (minimum degree), with level structure L0(w), ..., L5]

59 / 165

Permutation matrices

A permutation matrix is a square (0,1)-matrix where each row and column has a single 1.

If P is a permutation matrix, PP^T = I, i.e., it is an orthogonal matrix.

Let A be a 3 × 3 pattern matrix, and suppose we want to permute its columns as [2, 1, 3]. Define p21 = 1, p12 = 1, p33 = 1 (if column j is to be at position i, set pji = 1) and B = AP, with

    P = [ 0 1 0 ]
        [ 1 0 0 ]
        [ 0 0 1 ]

so that column 2 of A becomes column 1 of B, column 1 becomes column 2, and column 3 stays in place.

60 / 165
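A quick NumPy check of the rule "set pji = 1 to move column j to position i" (illustrative values):

```python
import numpy as np

perm = [2, 1, 3]                  # desired column order (1-based, as on the slide)
n = len(perm)
P = np.zeros((n, n), dtype=int)
for i, j in enumerate(perm):      # column j of A goes to (0-based) position i of B
    P[j - 1, i] = 1

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = A @ P
print(B)                                 # columns of A reordered as [2, 1, 3]
print(np.allclose(P @ P.T, np.eye(n)))   # True: P is orthogonal
```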


Symmetric matrices and graphs

- Remarks:
  - Number of nonzeros in column j = |adjG(j)|
  - Symmetric permutation ≡ renumbering the graph

[Figure: a 5 × 5 symmetric matrix and its corresponding graph on vertices 1-5]

61 / 165

Matching in bipartite graphs and permutations

A matching in a graph is a set of edges no two of which share a common vertex. We will mostly be dealing with matchings in bipartite graphs.

In matrix terms, a matching in the bipartite graph of a matrix corresponds to a set of nonzero entries no two of which are in the same row or column.

A vertex is said to be matched if there is an edge in the matching incident on the vertex, and to be unmatched otherwise. In a perfect matching, all vertices are matched.

The cardinality of a matching is the number of edges in it. A maximum cardinality matching (or maximum matching) is a matching of maximum cardinality; it is computable in polynomial time.

62 / 165

Matching in bipartite graphs and permutations

Given a square matrix whose bipartite graph has a perfect matching, such a matching can be used to permute the matrix so that the matching entries lie on the main diagonal.

[Figure: bipartite graph with row vertices r1, r2, r3, column vertices c1, c2, c3, and a perfect matching; permuting the columns as [2, 1, 3] with the permutation matrix P of the previous slide puts the matched entries on the diagonal]

63 / 165

Definitions: Reducibility

Reducible matrix: an n × n square matrix is reducible if there exists an n × n permutation matrix P such that

    P A P^T = [ A11  A12 ]
              [  O   A22 ]

where A11 is an r × r submatrix and A22 is an (n − r) × (n − r) submatrix, with 1 ≤ r < n.

Irreducible matrix: there is no such permutation matrix.

Theorem: an n × n square matrix is irreducible iff its directed graph is strongly connected.

Proof: follows from the definitions.

64 / 165


Definitions: Hypergraphs

Hypergraph: a hypergraph H = (V, N) consists of a finite set V, called the vertex set, and a set N of non-empty subsets of vertices, called the hyperedge set or the net set. A generalization of graphs.

For a matrix A, define a hypergraph whose vertices correspond to the rows and whose nets correspond to the columns, such that vertex vi is in net nj iff aij ≠ 0 (the column-net model).

[Figure: a sample 3 × 4 pattern matrix and its column-net hypergraph with vertices v1, v2, v3 and nets n1, n2, n3, n4]

65 / 165

Outline

Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form

66 / 165

Fill-in and reordering

Step k of LU factorization (a_kk pivot):
- For i > k, compute l_ik = a_ik / a_kk (= a′_ik),
- For i > k, j > k:

      a′_ij = a_ij − (a_ik × a_kj) / a_kk = a_ij − l_ik × a_kj

- If a_ik ≠ 0 and a_kj ≠ 0, then a′_ij ≠ 0.
- If a_ij was zero, the nonzero a′_ij must be stored: fill-in.

[Diagram: fill-in created at position (i, j)]

Interest of permuting a matrix (arrow matrix, as before):

      [ X X X X X ]         [ X 0 0 0 X ]
      [ X X 0 0 0 ]         [ 0 X 0 0 X ]
      [ X 0 X 0 0 ]    →    [ 0 0 X 0 X ]
      [ X 0 0 X 0 ]         [ 0 0 0 X X ]
      [ X 0 0 0 X ]         [ X X X X X ]

67 / 165

Symmetric matrices and graphs

- Assumptions: A symmetric, and pivots chosen from the diagonal
- The structure of A is represented by the graph G = (V, E):
  - Vertices are associated with columns: V = {1, ..., n}
  - Edges E are defined by: (i, j) ∈ E ↔ aij ≠ 0
- G is undirected (symmetry of A)

68 / 165


Symmetric matrices and graphs

- Remarks:
  - Number of nonzeros in column j = |adjG(j)|
  - Symmetric permutation ≡ renumbering the graph

[Figure: a 5 × 5 symmetric matrix and its corresponding graph on vertices 1-5]

69 / 165

Breadth-First Search (BFS) of a graph to permute a matrix

Exercise 1 (BFS traversal to order a matrix)

Given a symmetric matrix A and its associated graph G = (V, E):

1. Derive from a BFS traversal of G an algorithm to reorder the matrix A.
2. What is the structural property of the permuted matrix?
3. Explain why such an ordering can help reduce fill-in during LU factorization.
4. How can the proposed algorithm be used on an unsymmetric matrix?

70 / 165

The elimination graph model for symmetric matrices

- Let A be a symmetric positive definite matrix of order n.
- The LL^T factorization can be described by the equation:

      A = A0 = H̄0 = [ d1    v1^T ]
                    [ v1    H̄1   ]

                  = [ √d1       0       ] [ 1  0  ] [ √d1   v1^T/√d1 ]
                    [ v1/√d1    I_{n−1} ] [ 0  H1 ] [ 0     I_{n−1}  ]

                  = L1 A1 L1^T,   where H1 = H̄1 − v1 v1^T / d1

- The basic step is then applied to H1, H2, ..., to obtain:

      A = (L1 L2 ... L_{n−1}) I_n (L_{n−1}^T ... L2^T L1^T) = L L^T

71 / 165

The basic step: H1 = H̄1 − v1 v1^T / d1

What is v1 v1^T in terms of structure? v1 is a column of A, hence corresponds to the neighbors of the associated vertex. The term v1 v1^T results in a dense sub-block of H1. If any of the nonzeros of this dense submatrix are not already in A, then we have fill-in.

72 / 165


The elimination process in the graphs

GU(V, E) ← undirected graph of A
for k = 1 : n − 1 do
  V ← V − {k}   {remove vertex k}
  E ← E − {(k, ℓ) : ℓ ∈ adj(k)} ∪ {(x, y) : x ∈ adj(k) and y ∈ adj(k)}
  Gk ← (V, E)   {for definition}
end for

The Gk are the so-called elimination graphs (Parter, '61).

[Figure: a 6 × 6 matrix H0 and its elimination graph G0]

73 / 165

A sequence of elimination graphs

[Figures: elimination graphs G0, G1, G2, G3 and the corresponding reduced matrices H0, H1, H2, H3]

74 / 165

Fill-in and reordering

"Before permutation" (A″(lhr01)) and permuted matrix (A′(lhr01)):

[Spy plots: nz = 18427 each]

Factored matrix (LU(A′)):

[Spy plot: nz = 76105]

75 / 165

Fill-in characterization

Let A be a symmetric matrix, G(A) its associated graph, and L the matrix of factors, A = LL^T.

Fill-path theorem (Rose, Tarjan, Lueker, '76)

l_ij ≠ 0 iff there is a path in G(A) between i and j such that all nodes in the path have indices smaller than both i and j.

[Figure: a path i - p1 - p2 - ... - pk - j with all interior indices smaller than min(i, j)]

76 / 165
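The theorem translates directly into a reachability test; a minimal sketch (adj is the adjacency structure of G(A), vertices numbered from 1):

```python
from collections import deque

def fill_entry(adj, i, j):
    """True iff l_ij != 0 by the fill-path theorem: is there a path from i
    to j in G(A) whose interior vertices are all numbered < min(i, j)?"""
    bound = min(i, j)
    seen, queue = {i}, deque([i])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y == j:
                return True
            if y < bound and y not in seen:   # interior nodes must be small
                seen.add(y)
                queue.append(y)
    return False

# Path graph 3 - 1 - 2: eliminating vertex 1 connects 2 and 3
adj = {1: [2, 3], 2: [1], 3: [1]}
print(fill_entry(adj, 3, 2))   # True: path 3-1-2 with interior node 1 < 2
```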


Fill-in characterization (proof intuition)

Same setting: A symmetric, G(A) its associated graph, L the matrix of factors, A = LL^T, and the fill-path theorem as above.

Proof intuition: when a low-numbered interior vertex of the path is eliminated, its neighbors on the path become adjacent, so the path contracts; repeating this until the path between i and j has length one produces the fill entry (i, j).

[Animation frames over slides 77-79: the path i - p1 - ... - pk - j contracting step by step to the fill edge, annotated "Fill(i,j) !!!"]

79 / 165

Introducing the filled graph G+(A)

- Let F = L + L^T be the filled matrix, and G(F) the filled graph of A, denoted G⁺(A).
- Lemma (Parter, 1961): (vi, vj) ∈ G⁺ if and only if (vi, vj) ∈ G or ∃k < min(i, j) such that (vi, vk) ∈ G⁺ and (vk, vj) ∈ G⁺.

[Figure: a graph on vertices 1-6, its filled graph G⁺(A) = G(F), and F = L + L^T]

80 / 165


Fill-reducing heuristics

Three main classes of methods for minimizing fill-in during factorization:

- Global approach: the matrix is permuted into a matrix with a given pattern.
  - Fill-in is restricted to occur within that structure.
  - Cuthill-McKee (block tridiagonal matrix); remark: BFS traversal of the associated graph.
  - Nested dissection ("block bordered" matrix); remark: interpretation using the fill-path theorem.

[Figure: graph partitioning with separators S1, S2, S3 and the permuted "block bordered" matrix]

81 / 165

Fill-reducing heuristics

- Local heuristics: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.
  - The method is characterized by the way pivots are selected:
  - Markowitz criterion (for a general matrix).
  - Minimum degree or minimum fill-in (for symmetric matrices).
- Hybrid approaches: once the matrix is permuted to block structure, local heuristics are used within the blocks.

82 / 165

Local heuristics to reduce fill-in during factorization

Let G(A) be the graph associated with a matrix A that we want to order using local heuristics, and let Metric be such that Metric(vi) < Metric(vj) implies vi is a better pivot candidate than vj.

Generic algorithm: loop until all nodes are selected:
  Step 1: select the current node p (so-called pivot) with minimum metric value;
  Step 2: update the elimination graph;
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the metric value might have changed.

83 / 165

Reordering unsymmetric matrices: Markowitz criterion

- At step k of Gaussian elimination:

  [Figure: factored parts L and U, and the active submatrix Ak]

- r_i^k = number of nonzeros in row i of Ak
- c_j^k = number of nonzeros in column j of Ak
- a_ij must be large enough and should minimize (r_i^k − 1) × (c_j^k − 1), ∀ i, j > k
- Minimum degree: the Markowitz criterion for symmetric diagonally dominant matrices

84 / 165


Minimum degree algorithm

- Step 1: select the vertex that possesses the smallest number of neighbors in G0.

[Figure: (a) a sparse symmetric matrix of order 10; (b) its elimination graph]

The node/variable selected is 1, of degree 2.

85 / 165

- Notation for the elimination graph:
  - Let Gk = (Vk, Ek) be the graph built at step k.
  - Gk describes the structure of Ak after the elimination of k pivots.
  - Gk is non-oriented (Ak is symmetric).
  - Fill-in in Ak ≡ adding edges in the graph.

86 / 165

Illustration

Step 1: elimination of pivot 1

[Figures: (a) the elimination graph after removing vertex 1; (b) factors and active submatrix, distinguishing initial nonzeros, fill-in, and nonzeros in the factors]

Minimum degree algorithm applied to the graph:

- Step k: select the node with the smallest number of neighbors.
- Gk is built from G_{k−1} by suppressing the pivot and adding edges corresponding to fill-in.

88 / 165


Minimum degree algorithm based on elimination graphs

∀ i ∈ [1..n]: ti = |Adj_{G0}(i)|
for k = 1 to n do
  p = argmin_{i ∈ V^{k−1}} (ti)
  for each i ∈ Adj_{G^{k−1}}(p) do
    Adj_{G^k}(i) = (Adj_{G^{k−1}}(i) ∪ Adj_{G^{k−1}}(p)) \ {i, p}
    ti = |Adj_{G^k}(i)|
  end for
  V^k = V^{k−1} \ {p}
end for

89 / 165
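A direct, deliberately naive Python transcription of this elimination-graph algorithm (for illustration; production codes use quotient graphs, as discussed later):

```python
def minimum_degree(adj):
    """Return a minimum-degree ordering of an undirected graph.

    `adj` maps each vertex to a set of neighbors; the graph is updated
    as in the elimination-graph algorithm (the neighbors of the pivot
    become a clique), which can be quadratic in space in the worst case."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        p = min(adj, key=lambda v: (len(adj[v]), v))  # smallest degree, tie by label
        nbrs = adj.pop(p)
        for i in nbrs:
            adj[i] = (adj[i] | nbrs) - {i, p}         # clique update = fill-in
        order.append(p)
    return order

# Star graph: center 1 has degree 4, the leaves have degree 1
adj = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
print(minimum_degree(adj))   # leaves first: [2, 3, 4, 5, 1]
```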

Illustration (cont’d)

Graphs G1, G2, G3 and the corresponding reduced matrices:

[Figures: (a) elimination graphs; (b) factors and active submatrices, distinguishing original nonzeros, modified original nonzeros, fill-in, and nonzeros in the factors]

90 / 165

Minimum Degree does not always minimize fill-in !!!

[Figure: a 9 × 9 matrix and its corresponding elimination graph]

- Remark: using the initial ordering, there is no fill-in.
- Step 1 of minimum degree: select pivot 5 (minimum degree = 2).
- Updated graph: edge (4, 6) is added, i.e., fill-in.

91 / 165

Reduce memory complexity

Decrease the size of the working space: using the elimination graph, the working space is of order O(#nonzeros in the factors).

- Property: let pivot be the pivot at step k. If i ∈ Adj_{G^{k−1}}(pivot), then Adj_{G^{k−1}}(pivot) ⊂ Adj_{G^k}(i).
- We can then use an implicit representation of fill-in by defining the notions of element (variable already eliminated) and quotient graph. A variable of the quotient graph is adjacent to variables and elements.
- One can show that ∀ k ∈ [1..N], the size of the quotient graph is O(G0).

92 / 165


Influence on the structure of factors

Harwell-Boeing matrix dwt_592.rua, structural computing on a submarine. NZ(LU factors) = 58202.

[Spy plot: original matrix, nz = 5104]

93 / 165

Minimum fill-in heuristics

Recalling the generic algorithm: let G(A) be the graph associated with a matrix A that we want to order using local heuristics, and Metric such that Metric(vi) < Metric(vj) ≡ vi is a better pivot candidate than vj.

Generic algorithm: loop until all nodes are selected:
  Step 1: select the current node p (so-called pivot) with minimum metric value;
  Step 2: update the elimination (or quotient) graph;
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the metric value might have changed.

94 / 165

Minimum fill based algorithm

- Metric(vi) is the amount of fill-in that vi would introduce if it were selected as a pivot.
- Illustration: r has degree d = 4 and a fill-in metric of d × (d − 1)/2 = 6, whereas s has degree d = 5 but a fill-in metric of d × (d − 1)/2 − 9 = 1.

[Figure: vertices r and s with neighborhoods {i1, ..., i5} and {j1, ..., j4}, the neighbors of s being already partially interconnected]

95 / 165

Minimum fill-in properties

- The situation typically occurs when {i1, i2, i3} and {i2, i3, i4, i5} were adjacent to two already selected nodes (here e2 and e1).

[Figure: the same configuration, with the previously selected nodes e1 and e2 shown]

- The elimination of a node vk affects the degree of the nodes adjacent to vk. The fill-in metric of Adj(Adj(vk)) is also affected.
- Illustration: selecting r affects the fill-in metric of i1 (the fill edge (j3, j4) should be deducted).

I Illustration: selecting r affects the fill-in of i1 (fill edge (j3, j4)should be deduced).


Reduction to Block Triangular Form (BTF)

- Suppose that there exist permutation matrices P and Q such that

      PAQ = [ B11                    ]
            [ B21  B22               ]
            [ B31  B32  B33          ]
            [  :    :    :   ...     ]
            [ BN1  BN2  BN3  ... BNN ]

- If N > 1, A is said to be reducible (irreducible otherwise).
- Each Bii is supposed to be irreducible (otherwise a finer decomposition is possible).
- Advantage: to solve Ax = b, only the Bii need be factored:

      Bii yi = (Pb)i − Σ_{j=1}^{i−1} Bij yj,  i = 1, ..., N,  with y = Q^T x

97 / 165

A two stage approach to compute the reduction

- Stage (1): compute a (column) permutation matrix Q1 such that AQ1 has nonzeros on the diagonal (find a maximum transversal of A); then
- Stage (2): compute a (symmetric) permutation matrix P such that P AQ1 P^T is BTF.

The diagonal blocks of the BTF are uniquely defined. Techniques exist to compute the BTF form directly; they present no advantage over this two-stage approach.

98 / 165

Main components of the algorithm

- Objective: assume AQ1 has nonzeros on the diagonal, and compute P such that P AQ1 P^T is BTF.
- Use the digraph (directed graph) associated with the matrix; self-loops (diagonal entries) are not needed.
- Symmetric permutations on the digraph ≡ relabelling the nodes of the graph.
- If there is a closed path through all the nodes of the digraph, then the digraph cannot be subdivided in two parts.
- The strong components of a graph are the sets of nodes belonging to closed paths.
- The strong components of the graph are the diagonal blocks Bii of the BTF form.

99 / 165

Main components of the algorithm

[Figure: a 6 × 6 matrix A, its digraph, and its BTF form PAP^T, whose diagonal blocks are the strong components]

100 / 165


Preamble : Finding the triangular form of a matrix

- Observations: a triangular matrix has a BTF form with blocks of size 1 (BTF will be considered a generalization of the triangular form where each diagonal entry is now a strong component). Note that if the matrix has triangular form, the digraph has no cycle (i.e., it is acyclic).
- Sargent and Westerberg algorithm:
  1. Select a starting node and trace a path until finding a node from which no path leaves.
  2. Number the last node first, eliminate it (and all edges pointing to it).
  3. Continue from the previous node on the path (or choose a new starting node).

101 / 165
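A minimal Python sketch of the acyclic case (illustrative helper; it numbers each dead-end node next, yielding a reverse topological order of the digraph):

```python
def sargent_westerberg(adj, nodes):
    """Number the nodes of an acyclic digraph by path tracing: a node is
    numbered once no unnumbered successor remains, then the path backs up."""
    numbered, seen = [], set()
    for start in nodes:
        path = [start] if start not in seen else []
        while path:
            x = path[-1]
            seen.add(x)
            succ = [y for y in adj[x] if y not in seen]
            if succ:
                path.append(succ[0])          # extend the path
            else:
                numbered.append(path.pop())   # no way out: number it next
    return numbered

# Acyclic digraph: 1 -> 2 -> 3 and 1 -> 3
adj = {1: [2, 3], 2: [3], 3: []}
print(sargent_westerberg(adj, [1, 2, 3]))   # [3, 2, 1]
```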

Preamble : Finding the triangular form of a matrix

[Figure: a 6 × 6 matrix A, its digraph, the sequence of paths traced by the algorithm, and A permuted to triangular form (not unique)]

102 / 165

Generalizing Sargent and Westerberg Algorithm tocompute BTF

A composite node is a set of nodes belonging to a closed path. Start from any node and follow a path until:

(1) a closed path is found: collapse all the nodes of the closed path into a composite node; the path continues from the composite node;
(2) a node or composite node is reached from which no path leaves: that node or composite node is numbered next.

103 / 165

Generalizing Sargent and Westerberg Algorithm tocompute BTF

[Figure: the digraph of A and the sequence of paths/steps, with closed paths collapsed into composite nodes]

104 / 165


Generalizing Sargent and Westerberg Algorithm tocompute BTF

[Figure: the digraph of A, the paths/steps of the algorithm, and the resulting BTF of A, PAP^T]

105 / 165

Concluding remarks

Summary:

Fill-in has been characterized; ordering strategies and associated graphs have been introduced to reduce fill-in or to permute to a target structure (block tridiagonal, recursive block-bordered, block triangular form), with influence on performance (sequential and parallel).

Next step:

How to model sparse factorization and predict the data structures needed to run the numerical factorization?

106 / 165

Outline

Graphs to model factorization
  The elimination tree

107 / 165

Set-up

Reminder:

- A spanning tree of a connected graph G = (V, E) is a tree T = (V, F) such that F ⊆ E.
- A topological ordering of a rooted tree is an ordering that numbers children vertices before their parent.
- A postorder is a topological ordering which numbers the vertices in any subtree consecutively.

Let A be an n × n symmetric positive-definite and irreducible matrix, A = LL^T its Cholesky factorization, and G⁺(A) its filled graph (the graph of F = L + L^T).

108 / 165


A first definition

Since A is irreducible, each of the first n − 1 columns of L has at least one off-diagonal nonzero (prove it!).

For each column j < n of L, remove all the nonzeros in column j except the first one below the diagonal.

Let Lt denote the remaining structure and consider the matrix Ft = Lt + Lt^T. The graph G(Ft) is a tree, called the elimination tree.

[Figures (from Liu's survey): an example of matrix structures A and F, and the graphs G(A), G(F), and G(Ft) = T(A)]

109 / 165

A first definition

The elimination tree of A is a spanning tree of G⁺(A) satisfying the relation PARENT[j] = min{i > j : l_ij ≠ 0}.

[Figures: a 10 × 10 example on vertices a-j: the matrices A and F = L + L^T (distinguishing entries in T(A) and fill-in entries), the graphs G(A) and G⁺(A) = G(F), and the elimination tree T(A)]

110 / 165
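The relation PARENT[j] = min{i > j : l_ij ≠ 0} is immediate to code once the pattern of L is known (a sketch; real solvers build the tree directly from A with Liu's algorithm):

```python
def elimination_tree(L_pattern, n):
    """Parent array of the elimination tree from the pattern of L.

    `L_pattern` is a set of (i, j) pairs, i > j, with l_ij != 0 (0-based);
    parent[j] = min{i > j : l_ij != 0}, or -1 if column j has no
    off-diagonal nonzero (j is the root)."""
    parent = [-1] * n
    for (i, j) in L_pattern:
        if parent[j] == -1 or i < parent[j]:
            parent[j] = i
    return parent

# Pattern of L for the 5x5 arrow matrix permuted as on the earlier slide:
# the last row of L is full
print(elimination_tree({(4, 0), (4, 1), (4, 2), (4, 3)}, 5))
# [4, 4, 4, 4, -1]: vertices 1-4 all hang off the root 5
```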

A second definition: Represents column dependencies

- Dependency between the columns of L:
  - Column i > j depends on column j iff l_ij ≠ 0.
  - Use a directed graph to express this dependency (edge from j to i if column i depends on column j).
  - Simplify redundant dependencies (transitive reduction).

- The transitive reduction of the directed filled graph gives the elimination tree structure: remove a directed edge (j, i) if there is a path of length greater than one from j to i.

111 / 165

Directed filled graph and its transitive reduction

[Figures: the directed filled graph of the 10-vertex example and its transitive reduction, the elimination tree T(A)]

112 / 165


The elimination tree: a major tool to model factorization

The elimination tree and its generalization to unsymmetric matrices are compact structures of major importance during sparse factorization:

- It expresses the order in which variables can be eliminated: because the elimination of a variable only affects the elimination of its ancestors, any topological order of the elimination tree leads to a correct result and to the same fill-in.
- It expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel.
- It characterizes the structure of the factors more efficiently than elimination graphs do.

113 / 165

Outline

Efficiency of the solution phase
  Sparsity in the right-hand side and/or solution

114 / 165

Exploit sparsity of the right-hand-side/solution

Applications:

- Highly reducible matrices and/or sparse right-hand sides (linear programming, seismic processing)
- Null-space basis computation
- Partial computation of A^{-1}:
  - Computing variances of the unknowns of a data-fitting problem = computing the diagonal of a so-called variance-covariance matrix.
  - Computing short-circuit currents = computing blocks of a so-called impedance matrix.
  - Approximation of the condition number of an SPD matrix.

Core idea: an efficient algorithm has to take advantage of the sparsity of A and of both the right-hand sides and the solution.

Exploit sparsity in RHS: a quick insight of the main properties

Solve y ← L \ b.

[Animation over slides 116-120: the lower triangular solve on the elimination tree; only the parts of L on the paths from the nonzero entries of b to the root are touched]

- In all the application cases, only part of the factors needs to be loaded.
- Objectives with sparse RHS:
  - Efficient use of the RHS sparsity
  - Characterize the LU factors to be loaded from disk
  - Efficiently load only the needed factors from disk

(1) Predicting the structure of the solution vector: Gilbert-Liu '93.

116 / 165



Application : elements in A−1

A A^{-1} = I; for a specific entry, a^{-1}_ij = (A^{-1} ej)_i, where A^{-1} ej is column j of A^{-1}.

Theorem: structure of x (based on Gilbert and Liu '93)

For any matrix A such that A = LU, the structure of the solution x is given by the set of nodes reachable, by paths in the e-tree, from the nodes associated with the right-hand-side entries.

Example: compute some entries of A^{-1} on a 14-node elimination tree.

Which factors are needed to compute a^{-1}_82? We have a^{-1}_82 = (U^{-1}(L^{-1} e2))_8, so we have to load the L factors associated with nodes 2, 3, 7, 14 and the U factors associated with nodes 14, 13, 9, 8.

[Animation over slides 121-124: the paths 2 → 3 → 7 → 14 (forward solve) and 14 → 13 → 9 → 8 (backward solve) highlighted in the tree]

Note: only a part of the tree is concerned.

121 / 165


Page 32: Outline Sparse Linear Algebra: Direct Methodsamestoy.perso.enseeiht.fr/COURS/ALC_2013_2014.pdf · Sparse matrices Gaussian elimination Conclusion 2/165 A selection of references I

Entries of the inverse: a single one

Notation for later use

P(i): the nodes on the unique path from node i to the root node r (including i and r).

P(S): ⋃_{s∈S} P(s) for a set of nodes S.

Use the elimination tree

For each requested (diagonal) entry a^{-1}_{ii}:

(1) visit the nodes of the elimination tree from node i up to the root; at each node, access the necessary parts of L;

(2) visit the nodes from the root back down to node i; this time, access the necessary parts of U.

125 / 165
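A minimal sketch of the two traversals, assuming the tree is given as a parent map; the parent values below are hypothetical and encode only the two paths of the a^{-1}_{82} example from the previous slides.

    def path_to_root(parent, i):
        # P(i): nodes on the unique path from node i to the root (i and root included).
        path = []
        while i is not None:
            path.append(i)
            i = parent.get(i)
        return path

    # Hypothetical parent map reproducing the example: 2 -> 3 -> 7 -> 14 and 8 -> 9 -> 13 -> 14.
    parent = {2: 3, 3: 7, 7: 14, 8: 9, 9: 13, 13: 14, 14: None}

    print(path_to_root(parent, 2))  # L factors to load: [2, 3, 7, 14]
    print(path_to_root(parent, 8))  # U factors to load (visited top-down): [8, 9, 13, 14]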

Experiments: interest of exploiting sparsity

Implementation

These ideas have been implemented in the MUMPS solver during Tz. Slavova's PhD.

Experiments: computation of the diagonal of the inverse of matrices from data fitting in Astrophysics (CESR, Toulouse).

  Matrix size | Time (s), No ES | Time (s), ES
  ------------|-----------------|-------------
       46,799 |           6,944 |          472
       72,358 |          27,728 |          408
      148,286 |            >24h |        1,391

(ES = exploiting the sparsity of the right-hand sides.)

Interest

Exploiting the sparsity of the right-hand sides reduces the number of accesses to the factors (in-core: number of flops; out-of-core: accesses to hard disks).

126 / 165

Outline

Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Equivalent orderings and elimination trees
  Some parallel solvers
  Concluding remarks

127 / 165

Outline

1. Introduction

2. Elimination tree and multifrontal method

3. Postorderings, equivalent orderings and memory usage

4. Concluding remarks

128 / 165

Page 33: Outline Sparse Linear Algebra: Direct Methodsamestoy.perso.enseeiht.fr/COURS/ALC_2013_2014.pdf · Sparse matrices Gaussian elimination Conclusion 2/165 A selection of references I

Recalling the Gaussian elimination

Step k of the LU factorization (a_kk pivot):

I For i > k, compute l_ik = a_ik / a_kk (= a'_ik),

I For i > k, j > k such that a_ik and a_kj are nonzero:

  a'_ij = a_ij − (a_ik × a_kj) / a_kk

I If a_ik ≠ 0 and a_kj ≠ 0, then a'_ij ≠ 0

I If a_ij was zero, its new non-zero value must be stored: fill-in

I Orderings (minimum degree, Cuthill-McKee, nested dissection) limit the fill-in and the number of operations, and modify the task graph

129 / 165
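The fill-in rule above is easy to observe numerically; a small sketch (dense storage, no pivoting, illustrative values) that records every entry turning from zero to nonzero during elimination:

    import numpy as np

    def lu_fill_in(A):
        # Dense LU without pivoting; records entries that were zero in A
        # but become nonzero in the factors (the fill-in).
        n = A.shape[0]
        F = A.astype(float).copy()
        fill = []
        for k in range(n):
            for i in range(k + 1, n):
                if F[i, k] == 0.0:
                    continue
                F[i, k] /= F[k, k]                     # l_ik = a_ik / a_kk
                for j in range(k + 1, n):
                    if F[k, j] != 0.0:
                        if A[i, j] == 0.0 and F[i, j] == 0.0:
                            fill.append((i, j))
                        F[i, j] -= F[i, k] * F[k, j]   # a'_ij = a_ij - a_ik a_kj / a_kk
        return F, fill

    # Arrow matrix: eliminating variable 0 fills the whole trailing block.
    A = np.array([[4., 1., 1., 1.],
                  [1., 4., 0., 0.],
                  [1., 0., 4., 0.],
                  [1., 0., 0., 4.]])
    print(lu_fill_in(A)[1])   # [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]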

Three-phase scheme to solve Ax = b

1. Analysis step
  I Preprocessing of A (symmetric/unsymmetric orderings, scalings)
  I Build the dependency graph (elimination tree, eDAG, . . . )

2. Factorization (A = LU, LDL^T, LL^T, QR)
  Numerical pivoting

3. Solution based on the factored matrices
  I Triangular solves: Ly = b, then Ux = y
  I Improvement of the solution (iterative refinement), error analysis

130 / 165
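For instance, SciPy exposes this scheme through its SuperLU wrapper: the factorization is computed once (analysis with a fill-reducing column ordering, then numerical factorization), and the triangular solves can be replayed for many right-hand sides. A minimal sketch:

    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import splu

    A = csc_matrix(np.array([[4., 1., 0.],
                             [1., 4., 1.],
                             [0., 1., 4.]]))
    lu = splu(A)                             # analysis + numerical factorization
    x1 = lu.solve(np.array([1., 0., 0.]))    # triangular solves only
    x2 = lu.solve(np.array([0., 0., 1.]))    # the factors are reused for each new b
    assert np.allclose(A @ x1, [1., 0., 0.])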

Elimination tree and Multifrontal approach

We recall that:

The elimination tree is the dependency graph of the factorization.

Building the elimination tree (small matrices)

I Permute the matrix (to reduce fill-in): PAP^T.

I Build the filled matrix A_F = L + L^T, where PAP^T = LL^T.

I Take the transitive reduction of the associated filled graph.

Multifrontal approach

→ Each column corresponds to a node of the tree.

→ (multifrontal) Each node k of the tree corresponds to the partial factorization of a frontal matrix whose row and column structure is that of column k of A_F.

131 / 165

Illustration of multifrontal factorization

We assume pivots are chosen down the diagonal in order.

[Figure: 4x4 filled matrix with fill-in entries F at positions (3,4) and (4,3), and the corresponding elimination tree: leaves 1 (structure {1,3,4}) and 2 (structure {2,3,4}), their parent 3 (structure {3,4}), and the root 4.]

Treatment at each node:

I Assembly of the frontal matrix using the contributions from the sons.

I Gaussian elimination on the frontal matrix.

132 / 165


I Elimination of variable 1 (a_11 pivot)
I Assembly of the frontal matrix:

      1  3  4
  1   x  x  x
  3   x
  4   x

I Contributions: a^(1)_ij = −(a_i1 × a_1j) / a_11, for i > 1, j > 1, on a_33, a_44, a_34 and a_43:

  a^(1)_33 = −(a_31 × a_13)/a_11     a^(1)_34 = −(a_31 × a_14)/a_11
  a^(1)_43 = −(a_41 × a_13)/a_11     a^(1)_44 = −(a_41 × a_14)/a_11

The terms −(a_i1 × a_1j)/a_11 of the contribution matrix are stored for later updates.

133 / 165

I Elimination of variable 2 (a_22 pivot)
I Assembly of the frontal matrix: update of the elements of the pivot row and column using contributions from previous updates (none here):

      2  3  4
  2   x  x  x
  3   x
  4   x

I Contributions on a_33, a_34, a_43, and a_44:

  a^(2)_33 = −(a_32 × a_23)/a_22     a^(2)_34 = −(a_32 × a_24)/a_22
  a^(2)_43 = −(a_42 × a_23)/a_22     a^(2)_44 = −(a_42 × a_24)/a_22

134 / 165

I Elimination of variable 3
I Assembly of the frontal matrix using the previous contribution matrices

  ( a^(1)_33  a^(1)_34 )         ( a^(2)_33  a^(2)_34 )
  ( a^(1)_43  a^(1)_44 )   and   ( a^(2)_43  a^(2)_44 ) :

  a'_33 = a_33 + a^(1)_33 + a^(2)_33
  a'_34 = a_34 + a^(1)_34 + a^(2)_34,  (a_34 = 0: fill-in)
  a'_43 = a_43 + a^(1)_43 + a^(2)_43,  (a_43 = 0: fill-in)
  a'_44 = a^(1)_44 + a^(2)_44   (the original entry a_44 is not assembled here)

I Contribution on variable 4: a^(3)_44 = a'_44 − (a'_43 × a'_34)/a'_33

135 / 165

      3  4
  3   x  x
  4   x

I Contribution on a_44: a^(3)_44 = a'_44 − (a'_43 × a'_34)/a'_33.
  Note that a'_44 is only partially summed, since it does not yet include the entry from the initial matrix.

I Elimination of variable 4
I The frontal matrix involves only a_44: a_44 ← a_44 + a^(3)_44

      4
  4   x

136 / 165
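The whole example can be replayed numerically; a sketch with hypothetical values on the 4x4 pattern of these slides (a_34 = a_43 = 0), following exactly the assembly rules above and checking the last pivot against a plain dense factorization:

    import numpy as np
    import scipy.linalg

    # Hypothetical values on the pattern of the example: a_34 = a_43 = 0 in A.
    A = np.array([[4., 0., 1., 1.],
                  [0., 4., 1., 1.],
                  [1., 1., 4., 0.],
                  [1., 1., 0., 4.]])

    def eliminate(F):
        # Eliminate the first variable of a front; return its contribution block.
        return F[1:, 1:] - np.outer(F[1:, 0], F[0, 1:]) / F[0, 0]

    # Node 1: front on variables (1; 3, 4), holding only original entries of row/column 1.
    F1 = np.array([[A[0, 0], A[0, 2], A[0, 3]],
                   [A[2, 0], 0.,      0.     ],
                   [A[3, 0], 0.,      0.     ]])
    CB1 = eliminate(F1)                     # the a^(1) contributions on {3, 4}

    # Node 2: same construction with row/column 2.
    F2 = np.array([[A[1, 1], A[1, 2], A[1, 3]],
                   [A[2, 1], 0.,      0.     ],
                   [A[3, 1], 0.,      0.     ]])
    CB2 = eliminate(F2)

    # Node 3: assemble a_33 and both children (a_34 = a_43 = 0; a_44 arrives only at node 4).
    F3 = np.array([[A[2, 2], 0.],
                   [0.,      0.]]) + CB1 + CB2
    CB3 = eliminate(F3)                     # the 1x1 contribution a^(3)_44

    # Node 4: a_44 is finally assembled; the last pivot is u_44.
    u44 = A[3, 3] + CB3[0, 0]

    # Check against dense LU (A is diagonally dominant, so no pivoting occurs: P = I).
    P, L, U = scipy.linalg.lu(A)
    assert np.allclose(P, np.eye(4)) and np.isclose(U[3, 3], u44)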


Supernodes/supervariables

Definition 1

A supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure: columns i_1, i_2, . . . , i_p of the filled graph form a supernode iff |L_{i_{k+1}}| = |L_{i_k}| − 1.

I All algorithms (ordering, symbolic factorization, factorization, solve) generalize to blocked versions.

I Use of efficient matrix-matrix kernels (improved cache usage).

I Same concept as supervariables for the elimination tree/minimum degree ordering.

I Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.

137 / 165
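A sketch of supernode detection under this definition, assuming L is given with its full (filled) pattern as a dense array for readability:

    import numpy as np

    def supernodes(L):
        # Group consecutive columns of L: column j joins the current supernode when
        # struct(L[:, j]) = struct(L[:, j-1]) \ {j-1}, i.e. nested structures of sizes
        # |L_j| = |L_{j-1}| - 1.
        n = L.shape[0]
        col = [set(np.flatnonzero(L[:, j] != 0.0)) for j in range(n)]  # diagonal included
        groups = [[0]]
        for j in range(1, n):
            if col[j] == col[j - 1] - {j - 1}:
                groups[-1].append(j)
            else:
                groups.append([j])
        return groups

    # Columns 2 and 3 share the same trailing structure: one supernode {2, 3}.
    L = np.array([[1., 0., 0., 0.],
                  [1., 1., 0., 0.],
                  [0., 1., 1., 0.],
                  [1., 0., 1., 1.]])
    print(supernodes(L))   # [[0], [1], [2, 3]]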

Amalgamation

I Goal
  I Exploit a more regular structure in the original matrix
  I Decrease the amount of indirect addressing
  I Increase the size of frontal matrices

I How?
  I Relax the number of nonzeros of the matrix
  I Amalgamate nodes of the elimination tree

138 / 165

I Consequences?
  I Increase in the total amount of flops
  I But decrease of indirect addressing
  I And increase in performance

I Amalgamation of supernodes (same lower diagonal structure) is without fill-in (see the sketch below)

139 / 165
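The fill-in cost of merging two tree nodes can be checked mechanically; a sketch where a front's structure is modeled as the set of its variables, so the merged front covers the union, and any entry covered by neither original front becomes an explicitly stored zero (the node structures used below are those of the illustration that follows):

    def amalgamation_fill(child, parent):
        # Entries of the merged front that belong to neither original front:
        # these become explicitly stored zeros (logical fill).
        merged = child | parent
        covered = {(i, j) for i in child for j in child} | \
                  {(i, j) for i in parent for j in parent}
        return sorted((i, j) for i in merged for j in merged if (i, j) not in covered)

    print(amalgamation_fill({3, 5, 6}, {4, 5, 6}))  # [(3, 4), (4, 3)] -> fill-in
    print(amalgamation_fill({5, 6}, {6}))           # []  same trailing structure: no fill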

Illustration of amalgamation

[Figure: original 6x6 matrix with its fill-in entries (F), and the corresponding elimination tree; node structures, from the leaves to the root: 1:{1,4}, 2:{2,4,5}, 3:{3,5,6}, then 4:{4,5,6}, 5:{5,6}, and the root 6:{6}.]

Structure of node i = frontal matrix, noted i, i_1, i_2, . . . , i_f.

140 / 165


Illustration of amalgamation

Amalgamation

WITHOUT fill-in: the nodes {4,5,6}, {5,6} and {6} are merged into a single node [4,5,6]; the children {1,4}, {2,4,5} and {3,5,6} are unchanged.

WITH fill-in: the node {3,5,6} is further merged into the root, giving [3,4,5,6]; this creates the fill-in entries (3,4) and (4,3).

141 / 165

Amalgamation and Supervariables

Amalgamation of supervariables does not cause fill-in.

[Figure: initial graph on 13 nodes.]

Reordering: 1, 3, 4, 2, 6, 8, 10, 11, 5, 7, 9, 12, 13
Supervariables: {1, 3, 4} ; {2, 6, 8} ; {10, 11} ; {5, 7, 9, 12, 13}

142 / 165

Supervariables and multifrontal method

AT EACH NODE:

F_22 ← F_22 − F_12^T F_11^{-1} F_12

Pivots can ONLY be chosen from the F_11 block, since F_22 is NOT fully summed.

143 / 165
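A sketch of this partial factorization of a front, written in the general (unsymmetric) form with dense numpy algebra; real solvers perform the same update with LAPACK/BLAS-3 kernels (GETRF, TRSM, GEMM):

    import numpy as np

    def partial_factor(F, nelim):
        # Eliminate the first nelim (fully summed) variables of the front F and
        # return the Schur complement sent to the parent: F22 - F21 F11^{-1} F12.
        F11 = F[:nelim, :nelim]
        F12 = F[:nelim, nelim:]
        F21 = F[nelim:, :nelim]
        F22 = F[nelim:, nelim:]
        return F22 - F21 @ np.linalg.solve(F11, F12)

    # In the symmetric case F21 = F12^T, which gives the update of this slide.
    F = np.array([[4., 1., 2.],
                  [1., 4., 1.],
                  [2., 1., 5.]])
    print(partial_factor(F, 2))   # 1x1 contribution block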

Multifrontal method

From children to parent:

I ASSEMBLY: scatter-add operations (indirect addressing)

I ELIMINATION: dense partial Gaussian elimination, Level 3 BLAS (TRSM, GEMM)

I CONTRIBUTION to the parent

144 / 165


The multifrontal method [Duff & Reid '83]

[Figure: 5x5 matrix A and its filled factors L + U − I.]

Storage is divided into two parts:

I Factors

I Active memory = active frontal matrix + stack of contribution blocks

[Figure: elimination tree on nodes 1-5 driving the order in which frontal matrices are allocated and contribution blocks are stacked.]

150 / 165

The multifrontal method [Duff & Reid '83]

I Factors are incompressible and usually scale fairly well; they can optionally be written to disk.

I In sequential, the tree traversal that minimizes the active memory is known [Liu '86].

I In parallel, the active memory becomes dominant.

Example: share of active storage on the AUDI matrix using MUMPS 4.10.0:

  1 processor: 11%
  256 processors: 59%

151 / 165

Postorderings and memory usage

I Assumptions:
  I The tree is processed from the leaves to the root
  I A parent is processed as soon as all its children have completed (postorder of the tree)
  I Each node produces temporary data consumed by its father

I Exercise: in which sense is a postordering-based tree traversal more interesting than a random topological ordering?

I Furthermore, the memory usage also depends on the postordering chosen:

[Figure: the same tree (leaves a-h, root i) traversed with the best postordering (abcdefghi) and the worst one (hfdbacegi).]

152 / 165


Example 1: Processing a wide tree

[Figure: wide tree with leaves 1, 2, 3, 6, internal nodes 4 and 5, and root 7; snapshots of the memory (factor space, stack space, unused space, non-free space) as the nodes are processed one after the other.]

Active memory

Example 2: Processing a deep tree

[Figure: deep tree with leaves 1 and 2, parent 3 and root 4; memory snapshots showing, for node 3, the allocation of its front, the assembly step (the contributions of 1 and 2 are consumed), the factorization step, and the stacking of its own contribution block.]

Modeling the problem

I M_i : memory peak for the complete subtree rooted at i,

I temp_i : temporary memory produced by node i,

I m_parent : memory for storing the parent.

[Figure: parent node with children of peaks M_1, M_2, M_3 and contribution blocks temp_1, temp_2, temp_3.]

M_parent = max( max_{j=1..nbchildren} ( M_j + Σ_{k=1}^{j−1} temp_k ),  m_parent + Σ_{j=1}^{nbchildren} temp_j )    (1)

Objective: order the children so as to minimize M_parent.

156 / 165


Memory-minimizing schedules

Theorem 1

[Liu '86] The minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i.

Corollary 1

An optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i − temp_i.

Interpretation: at each level of the tree, a child with a relatively large memory peak in its subtree (M_i large with respect to temp_i) should be processed first.

⇒ Apply on the complete tree starting from the leaves (or from the root with a recursive approach).

Optimal tree reordering

Objective: minimize the peak of stack memory.

Tree_Reorder(T):
Begin
  for all i in the set of root nodes do
    Process_Node(i)
  end for
End

Process_Node(i):
  if i is a leaf then
    M_i = m_i
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    Reorder the children of i in decreasing order of (M_j − temp_j)
    Compute M_parent at node i using Formula (1)
  end if
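A runnable transcription of this procedure, assuming the tree is given by children lists and the m_i / temp_i values are known:

    def tree_reorder(children, m, temp, root):
        # Liu's optimal child ordering: sort children by decreasing M_j - temp_j,
        # then evaluate Formula (1); the children lists are reordered in place.
        M = {}
        def process(i):
            kids = children.get(i, [])
            for c in kids:
                process(c)
            if not kids:
                M[i] = m[i]
                return
            kids.sort(key=lambda c: M[c] - temp[c], reverse=True)
            peak, stacked = 0, 0
            for c in kids:
                peak = max(peak, stacked + M[c])   # child c active on top of the stack
                stacked += temp[c]
            M[i] = max(peak, stacked + m[i])       # Formula (1)
        process(root)
        return M

    # Toy tree: root 0 with two leaf children.
    children = {0: [1, 2]}
    m = {0: 1, 1: 5, 2: 2}
    temp = {1: 1, 2: 2}
    print(tree_reorder(children, m, temp, 0))  # child 1 first: M[0] = 5 (reverse order gives 7)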

Equivalent orderings of symmetric matrices

Let F be the filled matrix of a symmetric matrix A (that is, F = L + L^T, where A = LL^T); G^+(A) = G(F) is the associated filled graph.

Definition 2 (Equivalent orderings)

P and Q are said to be equivalent orderings iff G^+(PAP^T) = G^+(QAQ^T).
By extension, a permutation P is said to be an equivalent ordering of a matrix A iff G^+(PAP^T) = G^+(A).

It can be shown that an equivalent reordering also preserves the amount of arithmetic operations of the sparse Cholesky factorization.

159 / 165

Relation with elimination trees

I Let A be a reordered matrix, and G^+(A) be its filled graph.

I In the elimination tree, any topological tree traversal (one that processes children before parents) corresponds to an equivalent ordering P of A, and the elimination tree of PAP^T is identical to that of A.

[Figure: filled graph G^+(A) on the nodes u, v, w, x, y, z (fill edges marked F), the corresponding filled matrix, and the elimination tree of A labeled with a topological ordering 1-6.]

160 / 165
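For completeness, a sketch of how the elimination tree itself can be built from the pattern of a symmetric matrix (the classical algorithm with path compression, due to Liu); only the pattern of A is used:

    import numpy as np

    def etree(A):
        # Elimination tree of a symmetric matrix from its pattern only.
        # parent[r] = -1 marks a root; 'ancestor' provides path compression.
        n = A.shape[0]
        parent = [-1] * n
        ancestor = [-1] * n
        for i in range(n):
            for k in np.flatnonzero(A[i, :i] != 0.0):   # entries a_ik with k < i
                r = k
                while ancestor[r] != -1 and ancestor[r] != i:
                    nxt = ancestor[r]
                    ancestor[r] = i                     # compress the path toward i
                    r = nxt
                if ancestor[r] == -1:
                    ancestor[r] = i
                    parent[r] = i
        return parent

    # Arrow-shaped matrix: both variables hang below the last one.
    A = np.array([[4., 0., 1.],
                  [0., 4., 1.],
                  [1., 1., 4.]])
    print(etree(A))   # [2, 2, -1]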


Tree rotations (Liu, 1988)

Definition 3

An ordering that does not introduce any fill is referred to as a perfect elimination ordering (PEO for short).

The natural ordering is a PEO of the filled matrix F.

Theorem 2

For any node x of G^+(A) = G(F), there exists a PEO of G(F) such that x is numbered last.

I Essence of tree rotations:
  I The nodes in the clique of x in F are numbered last,
  I The relative ordering of the other nodes is preserved.

161 / 165

Example of equivalent orderings

On the right-hand side, a tree rotation is applied on w (the clique of w is {w, x}; for the other nodes, the relative ordering w.r.t. the tree on the left is preserved).

[Figure: left, the filled graph on u, v, w, x, y, z and its elimination tree with a postordering 1-6; right, the same graph after the rotation on w, with x and w now numbered last (5, 6).]

Remark: tree rotations can help reduce the temporary memory usage!

Some (shared memory) sparse direct codes

  Code          | Technique     | Scope   | Availability (www.)
  --------------|---------------|---------|-----------------------------------
  BCSLIB        | Multifrontal  | SYM/UNS | Boeing → Access Analytics
  HSL MA87      | Supernodal    | SPD     | cse.clrc.ac.uk/Activity/HSL
  MA41          | Multifrontal  | UNS     | cse.clrc.ac.uk/Activity/HSL
  MA49          | Multifr. QR   | RECT    | cse.clrc.ac.uk/Activity/HSL
  PanelLLT      | Left-looking  | SPD     | Ng
  PARDISO       | Left-right    | SYM/UNS | Schenk
  PSL†          | Left-looking  | SPD/UNS | SGI product
  SuperLU MT    | Left-looking  | UNS     | nersc.gov/∼xiaoye/SuperLU
  SuiteSparseQR | Multifr. QR   | RECT    | cise.ufl.edu/research/sparse/SPQR
  TAUCS         | Left/Multifr. | SYM/UNS | tau.ac.il/∼stoledo/taucs
  WSMP†         | Multifrontal  | SYM/UNS | IBM product

  † Only object code is available.

163 / 165

Distributed-memory sparse direct codes

  Code     | Technique       | Scope   | Availability (www.)
  ---------|-----------------|---------|--------------------------------------------
  DSCPACK  | Multifr./Fan-in | SPD     | cse.psu.edu/∼raghavan/Dscpack
  MUMPS    | Multifrontal    | SYM/UNS | graal.ens-lyon.fr/MUMPS, mumps.enseeiht.fr
  PaStiX   | Fan-in          | SPD     | labri.fr/perso/ramet/pastix
  PSPASES  | Multifrontal    | SPD     | cs.umn.edu/∼mjoshi/pspases
  SPOOLES  | Fan-in          | SYM/UNS | netlib.org/linalg/spooles
  SuperLU  | Fan-out         | UNS     | nersc.gov/∼xiaoye/SuperLU
  S+       | Fan-out         | UNS     | cs.ucsb.edu/research/S+
  WSMP†    | Multifrontal    | SYM     | IBM product

  † Only object code is available.

Case study: Comparison of MUMPS and SuperLU

164 / 165


I Key parameters in selecting a method:

  1. Functionalities of the solver
  2. Characteristics of the matrix
     I Numerical properties and pivoting
     I Symmetric or general
     I Pattern and density
  3. Preprocessing of the matrix
     I Scaling
     I Reordering to minimize fill-in
  4. Target computer (architecture)

I Substantial gains can be achieved with an adequate solver, in terms of numerical precision, computing time and storage.

I A good knowledge of both the matrix and the solvers is needed.

I Many challenging problems → an active research area.

165 / 165