Parallel Numerical Algorithms – solomon2.web.engr.illinois.edu/teaching/cs554/slides/slides_08.pdf
TRANSCRIPT
-
Sparse Matrices · Sparse Triangular Solve · Cholesky Factorization · Sparse Cholesky Factorization
Parallel Numerical Algorithms
Chapter 4 – Sparse Linear Systems
Section 4.1 – Direct Methods
Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 63
-
Outline
1 Sparse Matrices
2 Sparse Triangular Solve
3 Cholesky Factorization
4 Sparse Cholesky Factorization
-
Sparse Matrices
Matrix is sparse if most of its entries are zero
For efficiency, store and operate on only nonzero entries, e.g., a_jk · x_k need not be computed if a_jk = 0
But the more complicated data structures required incur extra overhead in storage and arithmetic operations
Matrix is "usefully" sparse if it contains enough zero entries to be worth taking advantage of them to reduce the storage and work required
In practice, grid discretizations often yield matrices with Θ(n) nonzero entries, i.e., a (small) constant number of nonzeros per row or column
-
Graph Model
Adjacency graph G(A) of a symmetric n × n matrix A is the undirected graph having n vertices, with an edge between vertices i and j if a_ij ≠ 0
Number of edges in G(A) is the number of nonzeros in A
For a grid-based discretization, G(A) is the grid
Adjacency graph provides a visual representation of algorithms and highlights connections between numerical and combinatorial algorithms
For nonsymmetric A, G(A) would be directed
Often convenient to think of aij as the weight of edge (i, j)
-
Sparse Matrix Representations
Coordinate (COO) (naive) format – store each nonzero along with its row and column index
Compressed-sparse-row (CSR) format
    Store value and column index for each nonzero
    Store index of first nonzero for each row
Storage for CSR is less than for COO, and CSR ordering is often convenient
CSC (compressed-sparse-column), blocked versions (e.g., CSB), and other storage formats are also used
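As a concrete illustration of the CSR arrays just described, the sketch below builds them from a small dense matrix (the helper name `dense_to_csr` is ours, not from the slides):

```python
def dense_to_csr(A):
    """Build CSR arrays (values, column indices, row pointers) from a dense matrix."""
    vals, cols, rowptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                vals.append(a)
                cols.append(j)
        rowptr.append(len(vals))   # index of the first nonzero of the next row
    return vals, cols, rowptr

A = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 0.0, 2.0]]
vals, cols, rowptr = dense_to_csr(A)
# vals   == [4.0, 1.0, 3.0, 1.0, 2.0]
# cols   == [0, 2, 1, 0, 2]
# rowptr == [0, 2, 3, 5]
```

Note that `rowptr` has n + 1 entries, so row i occupies positions `rowptr[i]` through `rowptr[i+1] - 1` of the value and column arrays.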
-
Sparse Matrix Distributions
Dense matrix mappings can be adapted to sparse matrices
1-D blocked mapping – store all nonzeros in n/p consecutive rows on each processor
1-D cyclic or randomized mapping – store all nonzeros in some subset of n/p rows on each processor
2-D block mapping – store all nonzeros in an (n/√p) × (n/√p) block of the matrix
1-D blocked mappings are best for exploiting locality in the graph, especially when there are Θ(n) nonzeros
Row ordering matters for all mappings: randomization and cyclicity yield load balance, blocking can yield locality
-
Sparse Matrix Vector Multiplication
Sparse matrix vector multiplication (SpMV) is
y = Ax
where A is sparse and x is dense
CSR-based matrix-vector product, for all i (in parallel) do

$$y_i = \sum_j a_{i,c(j)}\, x_{c(j)} = \sum_{j=1}^{n} a_{ij}\, x_j$$

where c(j) is the index of the jth nonzero in row i
For random 1-D or 2-D mapping, cost of vectorcommunication is same as in corresponding dense case
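A sequential sketch of the CSR product above (function and variable names are ours, for illustration):

```python
def spmv_csr(vals, cols, rowptr, x):
    """y_i = sum over the nonzeros of row i of a_{i,c(j)} * x_{c(j)}."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

# CSR arrays for the matrix [[4, 0, 1], [0, 3, 0], [1, 0, 2]]
y = spmv_csr([4.0, 1.0, 3.0, 1.0, 2.0], [0, 2, 1, 0, 2], [0, 2, 3, 5],
             [1.0, 2.0, 3.0])
# y == [7.0, 6.0, 7.0]
```

In a 1-D row-blocked parallel version, each processor would run the outer loop only over its own rows, after gathering the needed entries of x.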
-
SpMV with 1-D Mapping
For 1-D blocking (each processor owns n/p rows), the number of elements of x needed by a processor is the number of columns with a nonzero in the rows it owns
In general, want to order rows to minimize the maximum number of vector elements needed on any processor
Graphically, we want to partition the graph into p subsets of n/p nodes, to minimize the maximum number of nodes to which any subset is connected, i.e., for G(A) = (V, E), a partition

$$V = V_1 \cup \cdots \cup V_p, \qquad |V_i| = n/p$$

is selected to minimize

$$\max_i \; \bigl|\{\, v : v \in V \setminus V_i,\ \exists\, w \in V_i,\ (v, w) \in E \,\}\bigr|$$
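The objective above can be evaluated by brute force for small graphs; the helper below (our own, for illustration) counts, for each part, the external vertices adjacent to it:

```python
def max_external_vertices(edges, parts):
    """parts is a list of vertex sets V_1..V_p partitioning V; returns
    max_i |{v in V \\ V_i : there is w in V_i with (v, w) in E}|."""
    worst = 0
    for Vi in parts:
        ext = set()
        for v, w in edges:
            if v in Vi and w not in Vi:
                ext.add(w)
            elif w in Vi and v not in Vi:
                ext.add(v)
        worst = max(worst, len(ext))
    return worst

# path graph 0-1-2-3-4-5, i.e., G(A) of a tridiagonal matrix
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
blocked = max_external_vertices(edges, [{0, 1, 2}, {3, 4, 5}])      # 1
interleaved = max_external_vertices(edges, [{0, 2, 4}, {1, 3, 5}])  # 3
```

The blocked partition needs only one external vector element per processor, while the interleaved one needs three, which is why blocked 1-D mappings exploit locality well here.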
-
Surface Area to Volume Ratio in SpMV
The number of external vertices the maximum partition is adjacent to depends on the expansion of the graph
Expansion can be interpreted as a measure of the surface-area-to-volume ratio of the subgraphs
For example, for a k × k × k grid, a subvolume of (k/p^{1/3}) × (k/p^{1/3}) × (k/p^{1/3}) has surface area Θ(k²/p^{2/3})
Communication for this case becomes a neighbor halo exchange on a 3-D processor mesh
Thus, finding the best 1-D partitioning for SpMV often corresponds to domain partitioning and depends on the physical geometry of the problem
-
Other Sparse Matrix Products
SpMV is of critical importance to many numerical methods, but suffers from a low flop-to-byte ratio and a potentially high communication bandwidth cost
In graph algorithms SpMSpV (x and y are sparse) is prevalent, which is even harder to perform efficiently (e.g., to minimize work, need a layout other than CSR, like CSC)
SpMM (x becomes a dense matrix X) provides a higher flop-to-byte ratio and is much easier to do efficiently
SpGEMM (SpMSpM, matrix multiplication where all matrices are sparse) arises in, e.g., algebraic multigrid and graph algorithms; efficiency is highly dependent on sparsity
-
Solving Triangular Sparse Linear Systems
Given sparse lower-triangular matrix L and vector b, solve
Lx = b
all nonzeros of L must be in its lower-triangular part
Sequential algorithm: take x_i = b_i / l_ii, then update

$$b_j = b_j - l_{ji}\, x_i \quad \text{for all } j \in \{i+1, \ldots, n\}$$

If L has m > n nonzeros, this requires Q_1 ≈ 2m operations
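The sequential algorithm above can be sketched with L stored by columns; the CSC layout and the diagonal-first convention below are our own assumptions:

```python
def sparse_lower_solve(n, vals, rows, colptr, b):
    """Solve Lx = b with L in CSC form: column i holds values
    vals[colptr[i]:colptr[i+1]] in rows rows[colptr[i]:colptr[i+1]],
    with the diagonal entry l_ii stored first in each column."""
    x = list(b)                          # overwrite a copy of b with x
    for i in range(n):
        x[i] /= vals[colptr[i]]          # x_i = b_i / l_ii
        for k in range(colptr[i] + 1, colptr[i + 1]):
            x[rows[k]] -= vals[k] * x[i]  # b_j = b_j - l_ji * x_i
    return x

# L = [[2, 0], [1, 4]], b = [4, 10]  ->  x = [2, 2]
x = sparse_lower_solve(2, [2.0, 1.0, 4.0], [0, 1, 1], [0, 2, 3], [4.0, 10.0])
```

Only the nonzeros of each column are visited, giving the ≈ 2m operation count stated above.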
-
Parallelism in Sparse Triangular Solve
We can adapt any dense parallel triangular solve algorithmto the sparse case
Again have fan-in (left-looking) and fan-out (right-looking) variants
Communication cost stays the same, computational cost decreases
In fact there may be additional sources of parallelism, e.g., if l_21 = 0, we can solve for x_1 and x_2 concurrently
More generally, can concurrently prune leaves of the directed acyclic adjacency graph (DAG) G(A) = (V, E), where (i, j) ∈ E if l_ij ≠ 0
Depth of algorithm corresponds to diameter of this DAG
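One way to expose this DAG parallelism is level scheduling: assign each unknown the earliest round in which all of its dependencies are available. A minimal sketch (our own naming, 0-indexed):

```python
def solve_levels(struct_rows):
    """struct_rows[i] lists the indices j < i with l_ij != 0, i.e., the
    unknowns x_j that x_i depends on; returns the level of each unknown."""
    level = [0] * len(struct_rows)
    for i in range(len(struct_rows)):
        for j in struct_rows[i]:
            level[i] = max(level[i], level[j] + 1)
    return level

# l_21 = 0 case from above: x_0 and x_1 are both at level 0,
# so they can be solved concurrently; x_2 waits one round
levels = solve_levels([[], [], [0, 1]])
# levels == [0, 0, 1]
```

All unknowns sharing a level can be solved in parallel, and the number of distinct levels is the depth of the algorithm.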
-
Parallel Algorithm for Sparse Triangular Solve
Partition: associate fine-grain tasks with each (i, j) such that l_ij ≠ 0
Communicate: task (i, i) communicates with tasks (j, i) and (i, j) for all possible j
Agglomerate: form coarse-grain tasks for each column of L, i.e., do 1-D agglomeration, combining fine-grain tasks (⋆, i) into agglomerated task i
Map: assign coarse-grain tasks (columns of L) to processors with blocking (for locality) and/or cyclicity (for load balance and concurrency)
-
Analysis of 1-D Parallel Sparse Triangular Solve
Cost of the 1-D algorithm will clearly be less than that of the corresponding algorithm for the dense case
Load balance depends on the distribution of nonzeros; cyclicity can help distribute dense blocks
Naive algorithm with 1-D column blocking exploits concurrency only in fan-out updates
Communication bandwidth cost depends on the surface-to-volume ratio of each subset of vertices associated with a block of columns
Higher concurrency and better performance possible with dynamic/adaptive algorithms
-
Cholesky Factorization
Symmetric positive definite matrix A has Cholesky factorization

$$A = L L^T$$

where L is a lower triangular matrix with positive diagonal entries
Linear system Ax = b can then be solved by forward-substitution in the lower triangular system Ly = b, followed by back-substitution in the upper triangular system L^T x = y
-
Computing Cholesky Factorization
Algorithm for computing the Cholesky factorization can be derived by equating corresponding entries of A and LL^T and generating them in the correct order
For example, in the 2 × 2 case

$$\begin{bmatrix} a_{11} & a_{21} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{bmatrix} \begin{bmatrix} \ell_{11} & \ell_{21} \\ 0 & \ell_{22} \end{bmatrix}$$

so we have

$$\ell_{11} = \sqrt{a_{11}}, \qquad \ell_{21} = a_{21}/\ell_{11}, \qquad \ell_{22} = \sqrt{a_{22} - \ell_{21}^2}$$
-
Cholesky Factorization Algorithm
for k = 1 to n
    a_kk = sqrt(a_kk)
    for i = k + 1 to n
        a_ik = a_ik / a_kk
    end
    for j = k + 1 to n
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
end
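A direct Python transcription of this loop nest (in place, lower triangle only) might look like the following sketch:

```python
import math

def cholesky_inplace(a):
    """Overwrite the lower triangle of a with L; the strict upper
    triangle is never accessed, matching the algorithm above."""
    n = len(a)
    for k in range(n):
        a[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
    return a

a = cholesky_inplace([[4.0, 2.0], [2.0, 5.0]])
# lower triangle now holds L = [[2, 0], [1, 2]], matching the 2 x 2 example
```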
-
Cholesky Factorization Algorithm
All n square roots are of positive numbers, so the algorithm is well defined
Only the lower triangle of A is accessed, so the strict upper triangular portion need not be stored
Factor L is computed in place, overwriting the lower triangle of A
Pivoting is not required for numerical stability
About n³/6 multiplications and a similar number of additions are required (about half as many as for LU)
-
Parallel Algorithm
Partition
For i, j = 1, …, n, fine-grain task (i, j) stores a_ij and computes and stores

$$\begin{cases} \ell_{ij}, & \text{if } i \ge j \\ \ell_{ji}, & \text{if } i < j \end{cases}$$

yielding a 2-D array of n² fine-grain tasks
Zero entries in the upper triangle of L need not be computed or stored, so for convenience in using a 2-D mesh network, ℓ_ij can be redundantly computed by both task (i, j) and task (j, i) for i > j
-
Fine-Grain Tasks and Communication
[Figure: 6 × 6 array of fine-grain tasks, each storing a_ij and computing ℓ_ij, with communication among tasks along rows and columns]
-
Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) − 1
    recv broadcast of a_kj from task (k, j)
    recv broadcast of a_ik from task (i, k)
    a_ij = a_ij − a_ik · a_kj
end
if i = j then
    a_ii = sqrt(a_ii)
    broadcast a_ii to tasks (k, i) and (i, k), k = i + 1, …, n
else if i < j then
    recv broadcast of a_ii from task (i, i)
    a_ij = a_ij / a_ii
    broadcast a_ij to tasks (k, j), k = i + 1, …, n
else
    recv broadcast of a_jj from task (j, j)
    a_ij = a_ij / a_jj
    broadcast a_ij to tasks (i, k), k = j + 1, …, n
end
-
Agglomeration Schemes
Agglomerate
Agglomeration of fine-grain tasks produces
    2-D
    1-D column
    1-D row
parallel algorithms analogous to those for LU factorization, with similar performance and scalability
-
Loop Orderings for Cholesky
Each choice of i, j, or k index in the outer loop yields a different Cholesky algorithm, named for the portion of the matrix updated by the basic operation in the inner loops
Submatrix-Cholesky (fan-out): with k in the outer loop, inner loops perform a rank-1 update of the remaining unreduced submatrix using the current column
Column-Cholesky (fan-in): with j in the outer loop, inner loops compute the current column using a matrix-vector product that accumulates the effects of previous columns
Row-Cholesky (fan-in): with i in the outer loop, inner loops compute the current row by solving a triangular system involving previous rows
-
Memory Access Patterns
[Figure: matrix regions read (read only) and updated (read and write) by Submatrix-Cholesky, Column-Cholesky, and Row-Cholesky]
-
Column-Oriented Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    a_kk = sqrt(a_kk)
    for i = k + 1 to n
        a_ik = a_ik / a_kk
    end
    for j = k + 1 to n
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
end

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
    a_jj = sqrt(a_jj)
    for i = j + 1 to n
        a_ij = a_ij / a_jj
    end
end
-
Column Operations
Column-oriented algorithms can be stated more compactly by introducing column operations

cdiv( j ): column j is divided by the square root of its diagonal entry

    a_jj = sqrt(a_jj)
    for i = j + 1 to n
        a_ij = a_ij / a_jj
    end

cmod( j, k ): column j is modified by a multiple of column k, with k < j

    for i = j to n
        a_ij = a_ij − a_ik · a_jk
    end
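These column operations translate directly into code; the sketch below (our own framing, operating in place on a dense lower triangle) also shows the left-looking factorization built from them:

```python
import math

def cdiv(a, j):
    """Divide column j by the square root of its diagonal entry."""
    a[j][j] = math.sqrt(a[j][j])
    for i in range(j + 1, len(a)):
        a[i][j] /= a[j][j]

def cmod(a, j, k):
    """Modify column j by a multiple of column k, with k < j."""
    for i in range(j, len(a)):
        a[i][j] -= a[i][k] * a[j][k]

def column_cholesky(a):
    """Left-looking (column) Cholesky expressed with cdiv/cmod."""
    for j in range(len(a)):
        for k in range(j):
            cmod(a, j, k)
        cdiv(a, j)
    return a

a = column_cholesky([[4.0, 2.0], [2.0, 5.0]])
# lower triangle now holds L = [[2, 0], [1, 2]]
```

Swapping the loop order (cdiv first, then cmod of all later columns) gives the right-looking submatrix variant with identical results.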
-
Column-Oriented Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j = k + 1 to n
        cmod( j, k)
    end
end

right-looking / immediate-update / data-driven / fan-out

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        cmod( j, k)
    end
    cdiv( j )
end

left-looking / delayed-update / demand-driven / fan-in
-
Data Dependences
[Figure: cdiv(k) depends on cmod(k, 1), cmod(k, 2), …, cmod(k, k−1), and in turn enables cmod(k+1, k), cmod(k+2, k), …, cmod(n, k)]
-
Data Dependences
cmod(k, ⋆) operations along the bottom can be done in any order, but they all have the same target column, so updating must be coordinated to preserve data integrity
cmod(⋆, k) operations along the top can be done in any order, and they all have different target columns, so updating can be done simultaneously
Performing cmods concurrently is the most important source of parallelism in column-oriented factorization algorithms
For a dense matrix, each cdiv(k) depends on the immediately preceding column, so cdivs must be done sequentially
-
Sparsity Structure
For a sparse matrix M, let M_i⋆ denote its ith row and M_⋆j its jth column
Define Struct(M_i⋆) = {k < i | m_ik ≠ 0}, the nonzero structure of row i of the strict lower triangle of M
Define Struct(M_⋆j) = {k > j | m_kj ≠ 0}, the nonzero structure of column j of the strict lower triangle of M
-
Sparse Cholesky Algorithms
Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j ∈ Struct(L_⋆k)
        cmod( j, k)
    end
end

right-looking / immediate-update / data-driven / fan-out

Column-Cholesky

for j = 1 to n
    for k ∈ Struct(L_j⋆)
        cmod( j, k)
    end
    cdiv( j )
end

left-looking / delayed-update / demand-driven / fan-in
-
Graph Model
Recall that the adjacency graph G(A) of a symmetric n × n matrix A is the undirected graph with an edge between vertices i and j if a_ij ≠ 0
At each step of the Cholesky factorization algorithm, the corresponding vertex is eliminated from the graph
Neighbors of the eliminated vertex in the previous graph become a clique (fully connected subgraph) in the modified graph
Entries of A that were initially zero may become nonzero entries, called fill
-
Example: Graph Model of Elimination
[Figure: sparsity patterns of A and L (fill entries marked +) for a nine-vertex example, with the sequence of elimination graphs as vertices 1 through 9 are eliminated]
-
Elimination Tree
parent( j ) is the row index of the first offdiagonal nonzero in column j of L, if any, and j otherwise
Elimination tree T(A) is the graph having n vertices, with an edge between vertices i and j, for i > j, if i = parent( j )
If the matrix is irreducible, then the elimination tree is a single tree with root at vertex n; otherwise, it is more accurately termed an elimination forest
T (A) is spanning tree for filled graph, F (A), which is G(A)with all fill edges added
Each column of Cholesky factor L depends only on itsdescendants in elimination tree
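Given the column structures of L, the definition of parent( j ) translates directly into code (the helper name and 0-indexing are ours):

```python
def elimination_tree(struct_cols):
    """struct_cols[j] = sorted row indices i > j with l_ij != 0;
    returns parent[j] per the definition above (parent[j] == j for roots)."""
    n = len(struct_cols)
    parent = list(range(n))
    for j in range(n):
        if struct_cols[j]:
            parent[j] = struct_cols[j][0]  # first offdiagonal nonzero row
    return parent

# tridiagonal L: each column's first subdiagonal nonzero is the next row,
# so the elimination tree is a chain 0 -> 1 -> 2
parent = elimination_tree([[1], [2], []])
# parent == [1, 2, 2]
```

There are also well-known algorithms that compute the elimination tree directly from the structure of A, without forming L first; this sketch assumes L's structure is already known.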
-
Example: Elimination Tree
[Figure: sparsity patterns of A and L (fill marked +) for the nine-vertex example, with the graph G(A), filled graph F(A), and elimination tree T(A)]
-
Effect of Matrix Ordering
Amount of fill depends on the order in which variables are eliminated
Example: "arrow" matrix – if the first row and column are dense, then the factor fills in completely, but if the last row and column are dense, then they cause no fill
[Figure: sparsity patterns of the two arrow-matrix orderings: dense first row/column, whose factor fills in completely, vs. dense last row/column, which causes no fill]
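The arrow-matrix claim can be checked numerically; the sketch below (helper names ours) factors both orderings of a 5 × 5 SPD arrow matrix with the dense algorithm from earlier and counts the nonzeros of L:

```python
import math

def cholesky_nnz(a, tol=1e-12):
    """Factor a in place and count the nonzeros in the lower triangle of L."""
    n = len(a)
    for k in range(n):
        a[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
    return sum(1 for i in range(n) for j in range(i + 1) if abs(a[i][j]) > tol)

n = 5

def arrow(dense_index):
    """SPD arrow matrix: ones in row/column dense_index, 10 on the diagonal."""
    return [[10.0 if i == j else (1.0 if dense_index in (i, j) else 0.0)
             for j in range(n)] for i in range(n)]

full_fill = cholesky_nnz(arrow(0))      # 15 = n(n+1)/2: complete fill
no_fill = cholesky_nnz(arrow(n - 1))    # 9 = 2n - 1: no fill at all
```

Eliminating the dense first column couples every remaining pair of variables, while with the dense row/column last, each elimination touches only the diagonal and the final row.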
-
Ordering Heuristics
General problem of finding an ordering that minimizes fill is NP-complete, but there are relatively cheap heuristics that limit fill effectively
Bandwidth or profile reduction: reduce the distance of nonzero diagonals from the main diagonal (e.g., RCM)
Minimum degree: eliminate the node having fewest neighbors first
Nested dissection: recursively split the graph into pieces using a vertex separator, numbering separator vertices last
-
Symbolic Factorization
For symmetric positive definite (SPD) matrices, the ordering can be determined in advance of numeric factorization
Only the locations of nonzeros matter, not their numerical values, since pivoting is not required for numerical stability
Once an ordering is selected, the locations of all fill entries in L can be anticipated, and an efficient static data structure can be set up to accommodate them prior to numeric factorization
Structure of column j of L is given by the union of the structures of the lower triangular portion of column j of A and of the prior columns of L whose first nonzero below the diagonal is in row j
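The structure rule above can be sketched directly; `symbolic_factor` (our own name, 0-indexed) processes columns in order, merging each prior column's structure into the column holding its first subdiagonal nonzero:

```python
def symbolic_factor(struct_a):
    """struct_a[j] = row indices i > j with a_ij != 0 (lower triangle of A);
    returns the corresponding column structures of L, including fill."""
    n = len(struct_a)
    struct_l = [set(s) for s in struct_a]
    for j in range(n):
        for k in range(j):
            # column k contributes to column j iff its first nonzero
            # below the diagonal is in row j
            if struct_l[k] and min(struct_l[k]) == j:
                struct_l[j] |= struct_l[k] - {j}
    return [sorted(s) for s in struct_l]

# A has lower-triangle nonzeros (1,0), (3,0), (2,1), (3,2);
# eliminating column 0 joins rows 1 and 3, creating fill at (3,1)
struct_l = symbolic_factor([[1, 3], [2], [3], []])
# struct_l == [[1, 3], [2, 3], [3], []]
```

Production codes use the elimination tree to find each column's contributing children without the quadratic scan, but the set unions computed are the same.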
-
Solving Sparse SPD Systems
Basic steps in solving sparse SPD systems by Cholesky factorization:
1. Ordering: symmetrically reorder rows and columns of the matrix so the Cholesky factor suffers relatively little fill
2. Symbolic factorization: determine the locations of all fill entries and allocate data structures in advance to accommodate them
3. Numeric factorization: compute the numeric values of the entries of the Cholesky factor
4. Triangular solve: compute the solution by forward- and back-substitution
-
Parallel Sparse Cholesky
In sparse submatrix- or column-Cholesky, if a_jk = 0, then cmod( j, k) is omitted
Sparse factorization thus has an additional source of parallelism, since "missing" cmods may permit multiple cdivs to be done simultaneously
Elimination tree shows data dependences among columns of the Cholesky factor L, and hence identifies potential parallelism
At any point in the factorization process, all factor columns corresponding to leaves of the elimination tree can be computed simultaneously
-
Parallel Sparse Cholesky
Height of the elimination tree determines the longest serial path through the computation, and hence the parallel execution time
Width of the elimination tree determines the degree of parallelism available
Short, wide, well-balanced elimination tree is desirable for parallel factorization
Structure of the elimination tree depends on the ordering of the matrix
So the ordering should be chosen both to preserve sparsity and to enhance parallelism
-
Levels of Parallelism in Sparse Cholesky
Fine-grain
    Task is one multiply-add pair
    Available in either dense or sparse case
    Difficult to exploit effectively in practice
Medium-grain
    Task is one cmod or cdiv
    Available in either dense or sparse case
    Accounts for most of the speedup in the dense case
Large-grain
    Task computes an entire set of columns in a subtree of the elimination tree
    Available only in the sparse case
-
Example: Band Ordering, 1-D Grid
[Figure: band ordering of a 7-point 1-D grid: sparsity patterns of A and L, grid graph G(A), and elimination tree T(A), a chain]
-
Example: Minimum Degree, 1-D Grid
[Figure: minimum degree ordering of the 1-D grid: sparsity patterns of A and L, graph G(A), and elimination tree T(A)]
-
Example: Nested Dissection, 1-D Grid
[Figure: nested dissection ordering of the 1-D grid: sparsity patterns of A and L, graph G(A), and a short, balanced elimination tree T(A)]
-
Example: Band Ordering, 2-D Grid
[Figure: band ordering of a 3 × 3 2-D grid: sparsity patterns of A and L (fill marked +), graph G(A), and elimination tree T(A)]
-
Example: Minimum Degree, 2-D Grid
[Figure: minimum degree ordering of the 3 × 3 grid: sparsity patterns of A and L (fill marked +), graph G(A), and elimination tree T(A)]
-
Example: Nested Dissection, 2-D Grid
[Figure: nested dissection ordering of the 3 × 3 grid: sparsity patterns of A and L (fill marked +), graph G(A), and elimination tree T(A)]
-
Mapping
Cyclic mapping of columns to processors works well for dense problems, because it balances load and communication is global anyway
To exploit locality in communication for sparse factorization, a better approach is to map the columns in a subtree of the elimination tree onto a local subset of processors
Still use cyclic mapping within dense submatrices ("supernodes")
-
Example: Subtree Mapping
[Figure: subtree-to-subset mapping: the columns in each of four subtrees of the elimination tree are assigned to processors 0–3, and the columns near the root are mapped cyclically across all processors]
-
Fan-Out Sparse Cholesky

for j ∈ mycols
    if j is a leaf node in T(A) then
        cdiv( j )
        send L_⋆j to processes in map(Struct(L_⋆j))
        mycols = mycols − { j }
    end
end
while mycols ≠ ∅
    receive any column of L, say L_⋆k
    for j ∈ mycols ∩ Struct(L_⋆k)
        cmod( j, k)
        if column j requires no more cmods then
            cdiv( j )
            send L_⋆j to processes in map(Struct(L_⋆j))
            mycols = mycols − { j }
        end
    end
end
-
Fan-In Sparse Cholesky

for j = 1 to n
    if j ∈ mycols or mycols ∩ Struct(L_j⋆) ≠ ∅ then
        u = 0
        for k ∈ mycols ∩ Struct(L_j⋆)
            u = u + ℓ_jk · L_⋆k
        end
        if j ∈ mycols then
            incorporate u into factor column j
            while any aggregated update column for column j remains,
                receive one and incorporate it into factor column j
            end
            cdiv( j )
        else
            send u to process map( j )
        end
    end
end
-
Multifrontal Sparse Cholesky
Multifrontal algorithm operates recursively, starting from the root of the elimination tree for A
Dense frontal matrix F_j is initialized to have the nonzero entries from the corresponding row and column of A as its first row and column, and zeros elsewhere
F_j is then updated by extend_add operations with the update matrices from its children in the elimination tree
extend_add operation, denoted by ⊕, merges matrices by taking the union of their subscript sets and summing entries for any common subscripts
After the updating of F_j is complete, its partial Cholesky factorization is computed, producing the corresponding row and column of L as well as the update matrix U_j
-
Example: extend_add
$$\begin{bmatrix}
a_{11} & a_{13} & a_{15} & a_{18} \\
a_{31} & a_{33} & a_{35} & a_{38} \\
a_{51} & a_{53} & a_{55} & a_{58} \\
a_{81} & a_{83} & a_{85} & a_{88}
\end{bmatrix}
\oplus
\begin{bmatrix}
b_{11} & b_{12} & b_{15} & b_{17} \\
b_{21} & b_{22} & b_{25} & b_{27} \\
b_{51} & b_{52} & b_{55} & b_{57} \\
b_{71} & b_{72} & b_{75} & b_{77}
\end{bmatrix}
=
\begin{bmatrix}
a_{11}+b_{11} & b_{12} & a_{13} & a_{15}+b_{15} & b_{17} & a_{18} \\
b_{21} & b_{22} & 0 & b_{25} & b_{27} & 0 \\
a_{31} & 0 & a_{33} & a_{35} & 0 & a_{38} \\
a_{51}+b_{51} & b_{52} & a_{53} & a_{55}+b_{55} & b_{57} & a_{58} \\
b_{71} & b_{72} & 0 & b_{75} & b_{77} & 0 \\
a_{81} & 0 & a_{83} & a_{85} & 0 & a_{88}
\end{bmatrix}$$
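A sketch of ⊕ on explicit index sets (the helper name is ours): each frontal matrix carries its global subscripts, and the merge takes their union, summing entries that share subscripts:

```python
def extend_add(idx_a, Fa, idx_b, Fb):
    """Merge frontal matrices Fa (indexed by idx_a) and Fb (indexed by
    idx_b): union of subscript sets, summing entries at common subscripts."""
    idx = sorted(set(idx_a) | set(idx_b))
    pos = {g: k for k, g in enumerate(idx)}
    F = [[0.0] * len(idx) for _ in idx]
    for ix, M in ((idx_a, Fa), (idx_b, Fb)):
        for i, gi in enumerate(ix):
            for j, gj in enumerate(ix):
                F[pos[gi]][pos[gj]] += M[i][j]
    return idx, F

idx, F = extend_add([1, 3], [[1.0, 2.0], [3.0, 4.0]],
                    [1, 2], [[10.0, 20.0], [30.0, 40.0]])
# idx == [1, 2, 3]
# F == [[11.0, 20.0, 2.0], [30.0, 40.0, 0.0], [3.0, 0.0, 4.0]]
```

Only subscript 1 is shared here, so only the (1, 1) entry is actually summed; everything else is scattered into its slot of the merged front.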
-
Multifrontal Sparse Cholesky

Factor( j ):

Let {i_1, …, i_r} = Struct(L_⋆j)

Let

$$F_j = \begin{bmatrix}
a_{j,j} & a_{j,i_1} & \cdots & a_{j,i_r} \\
a_{i_1,j} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{i_r,j} & 0 & \cdots & 0
\end{bmatrix}$$

for each child i of j in the elimination tree
    Factor(i)
    F_j = F_j ⊕ U_i
end

Perform one step of dense Cholesky:

$$F_j = \begin{bmatrix}
\ell_{j,j} & 0 \\
\begin{matrix}\ell_{i_1,j}\\ \vdots\\ \ell_{i_r,j}\end{matrix} & I
\end{bmatrix}
\begin{bmatrix}
1 & 0 \\
0 & U_j
\end{bmatrix}
\begin{bmatrix}
\ell_{j,j} & \ell_{i_1,j} \cdots \ell_{i_r,j} \\
0 & I
\end{bmatrix}$$
Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 55 / 63
-
Advantages of Multifrontal Method
Most arithmetic operations performed on dense matrices, which reduces indexing overhead and indirect addressing

Can take advantage of loop unrolling, vectorization, and optimized BLAS to run at near peak speed on many types of processors

Data locality good for memory hierarchies, such as cache, virtual memory with paging, or explicit out-of-core solvers

Naturally adaptable to parallel implementation by processing multiple independent fronts simultaneously on different processors

Parallelism can also be exploited in dense matrix computations within each front
-
Summary for Parallel Sparse Cholesky
Principal ingredients in efficient parallel algorithm for sparse Cholesky factorization:

Reordering matrix to obtain relatively short and well-balanced elimination tree while also limiting fill

Multifrontal or supernodal approach to exploit dense subproblems effectively

Subtree mapping to localize communication

Cyclic mapping of dense subproblems to achieve good load balance

2-D algorithm for dense subproblems to enhance scalability
-
Scalability of Sparse Cholesky
Performance and scalability of sparse Cholesky depend on sparsity structure of particular matrix

Sparse factorization with nested dissection requires factorization of dense matrix of dimension Θ(√n) for 2-D grid problem with n grid points (√n is the size of the root vertex separator), for which unconditional weak scalability is possible

However, efficiency often deteriorates as a result of the rest of the sparse factorization taking more time
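To see why the dense root front is so significant for a 2-D grid under nested dissection (a standard operation count, added here for context):

```latex
% Root separator has Θ(√n) vertices, so its dense Cholesky costs
W_{\text{root}} = \Theta\big((\sqrt{n})^{3}\big) = \Theta(n^{3/2})
% which is the same order as the total factorization work
W_{\text{total}} = \Theta(n^{3/2})
```

Since the dense root solve accounts for a constant fraction of all arithmetic, the scalability of the dense 2-D algorithm applied to it largely governs the scalability of the whole sparse factorization.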
-
References – Dense Cholesky
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Communication-optimal parallel and sequential Cholesky decomposition, SIAM J. Sci. Comput. 32:3495-3523, 2010

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993

D. O'Leary and G. W. Stewart, Data-flow algorithms for parallel matrix computations, Comm. ACM 28:840-853, 1985

D. O'Leary and G. W. Stewart, Assignment and scheduling in parallel matrix factorization, Linear Algebra Appl. 77:275-299, 1986
-
References – Sparse Cholesky

M. T. Heath, Parallel direct methods for sparse linear systems, D. E. Keyes, A. Sameh, and V. Venkatakrishnan, eds., Parallel Numerical Algorithms, pp. 55-90, Kluwer, 1997

M. T. Heath, E. Ng, and B. W. Peyton, Parallel algorithms for sparse linear systems, SIAM Review 33:420-460, 1991

J. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing 3:327-342, 1986

J. Liu, Reordering sparse matrices for parallel elimination, Parallel Computing 11:73-91, 1989

J. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl. 11:134-172, 1990
-
References – Multifrontal Methods
I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing 3:193-204, 1986

A. Gupta, Parallel sparse direct methods: a short tutorial, IBM Research Report RC 25076, November 2010

J. Liu, The multifrontal method for sparse matrix solution: theory and practice, SIAM Review 34:82-109, 1992

J. A. Scott, Parallel frontal solvers for large sparse linear systems, ACM Trans. Math. Software 29:395-417, 2003
-
References – Scalability
A. George, J. Liu, and E. Ng, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Computing 10:287-298, 1989

A. Gupta, G. Karypis, and V. Kumar, Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Systems 8:502-520, 1997

T. Rauber, G. Rünger, and C. Scholtes, Scalability of sparse Cholesky factorization, Internat. J. High Speed Computing 10:19-52, 1999

R. Schreiber, Scalability of sparse direct solvers, A. George, J. R. Gilbert, and J. Liu, eds., Graph Theory and Sparse Matrix Computation, pp. 191-209, Springer-Verlag, 1993
-
References – Nonsymmetric Sparse Systems
I. S. Duff and J. A. Scott, A parallel direct solver for large sparse highly unsymmetric linear systems, ACM Trans. Math. Software 30:95-117, 2004

A. Gupta, A shared- and distributed-memory parallel general sparse direct solver, Appl. Algebra Engrg. Commun. Comput. 18(3):263-277, 2007

X. S. Li and J. W. Demmel, SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Trans. Math. Software 29:110-140, 2003

K. Shen, T. Yang, and X. Jiao, S+: Efficient 2D sparse LU factorization on parallel machines, SIAM J. Matrix Anal. Appl. 22:282-305, 2000