sparse linear algebra
TRANSCRIPT
Sparse Linear Algebra: Direct Methods, advanced features
P. Amestoy and A. Buttari (INPT(ENSEEIHT)-IRIT)
A. Guermouche (Univ. Bordeaux-LaBRI), J.-Y. L'Excellent and B. Ucar (INRIA-CNRS/LIP-ENS Lyon)
F.-H. Rouet (Lawrence Berkeley National Laboratory), C. Weisbecker (INPT(ENSEEIHT)-IRIT)
2013-2014
Sparse Matrix Factorizations
• A ∈ ℝ^(m×m), symmetric positive definite → LL^T = A
  Ax = b
• A ∈ ℝ^(m×m), symmetric → LDL^T = A
  Ax = b
• A ∈ ℝ^(m×m), unsymmetric → LU = A
  Ax = b
• A ∈ ℝ^(m×n), m ≠ n → QR = A
  min_x ‖Ax − b‖ if m > n
  min ‖x‖ such that Ax = b if n > m
Cholesky on a dense matrix
[ a11             ]
[ a21 a22         ]
[ a31 a32 a33     ]
[ a41 a42 a43 a44 ]

[Figure: in the left-looking variant, the current column is modified using the previously computed columns; in the right-looking variant, the current column is used for the modification of the trailing columns.]
left-looking Cholesky
for k = 1, ..., n do
  for i = k, ..., n do
    for j = 1, ..., k − 1 do
      a_ik ← a_ik − l_ij · l_kj
    end for
  end for
  l_kk ← √(a_kk)
  for i = k + 1, ..., n do
    l_ik ← a_ik / l_kk
  end for
end for
right-looking Cholesky
for k = 1, ..., n do
  l_kk ← √(a_kk)
  for i = k + 1, ..., n do
    l_ik ← a_ik / l_kk
    for j = k + 1, ..., i do
      a_ij ← a_ij − l_ik · l_jk
    end for
  end for
end for
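The two variants above can be written in a few lines of plain Python; this is a sketch of ours (function names are not from the lecture), operating on a dense symmetric positive definite matrix stored as a list of rows:

```python
import math

def cholesky_left(A):
    """Left-looking Cholesky: column k is first updated with all
    previously computed columns, then scaled."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    a = [row[:] for row in A]  # work on a copy of A
    for k in range(n):
        for i in range(k, n):
            for j in range(k):
                a[i][k] -= L[i][j] * L[k][j]
        L[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, n):
            L[i][k] = a[i][k] / L[k][k]
    return L

def cholesky_right(A):
    """Right-looking Cholesky: as soon as column k is computed,
    it updates the trailing submatrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    a = [row[:] for row in A]
    for k in range(n):
        L[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, n):
            L[i][k] = a[i][k] / L[k][k]
            for j in range(k + 1, i + 1):
                a[i][j] -= L[i][k] * L[j][k]
    return L
```

Both variants perform the same operations in a different order, so they produce the same factor L.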
Factorization of sparse matrices: problems
The factorization of a sparse matrix is problematic due to the presence of fill-in. The basic LU step:

a_ij^(k+1) = a_ij^(k) − ( a_ik^(k) · a_kj^(k) ) / a_kk^(k)

Even if a_ij^(k) is null, a_ij^(k+1) can be a nonzero.
[Fig. 2.1 and Fig. 2.2, reproduced from Joseph W. H. Liu: an example of matrix structures (A and its filled matrix F) and the corresponding graph structures G(A), G(F) and G(F_t) = T(A).]
• the factorization is more expensive than O(nz)
• a higher amount of memory is required than O(nz)
• more complicated algorithms are needed to achieve the factorization
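The effect can be checked symbolically. The sketch below (our own illustration, not part of the lecture) applies the LU update rule to a nonzero pattern only: for an arrowhead matrix with a dense first row and column, eliminating variable 1 first fills the whole matrix, while the reverse ordering produces no fill at all.

```python
def symbolic_fill(n, pattern):
    """Symbolic elimination: return the filled pattern obtained by
    applying a_ij := a_ij - a_ik * a_kj / a_kk to the nonzero
    structure only. `pattern` is a set of (i, j) positions."""
    filled = set(pattern)
    for k in range(n):
        rows = [i for i in range(k + 1, n) if (i, k) in filled]
        cols = [j for j in range(k + 1, n) if (k, j) in filled]
        for i in rows:
            for j in cols:
                filled.add((i, j))  # fill-in: a_ij becomes nonzero
    return filled

# arrowhead pattern: dense diagonal, first row and first column
n = 5
arrow = ({(i, i) for i in range(n)}
         | {(0, j) for j in range(n)}
         | {(i, 0) for i in range(n)})
```

With the dense row/column ordered last instead of first, the same routine reports zero fill, which is why fill-reducing orderings matter.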
Factorization of sparse matrices: problems
Because of the fill-in, the matrix factorization is commonly preceded by an analysis phase where:

• the fill-in is reduced through matrix permutations
• the fill-in is predicted through the use of (e.g.) elimination graphs
• computations are structured using elimination trees

The analysis must complete much faster than the actual factorization.
Analysis
Fill-reducing ordering methods
Three main classes of methods for minimizing fill-in during factorization:

Local approaches: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.
• The method is characterized by the way pivots are selected.
• Markowitz criterion (for a general matrix)
• Minimum degree (for symmetric matrices)

Global approaches: the matrix is permuted so as to confine the fill-in within certain parts of the permuted matrix.
• Cuthill-McKee, Reverse Cuthill-McKee
• Nested dissection

Hybrid approaches: first permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics.
The elimination process in the graphs
G0(V, E) ← undirected graph of A
for k = 1 : n − 1 do
  V ← V \ {k}   (remove vertex k)
  E ← E \ {(k, ℓ) : ℓ ∈ adj(k)} ∪ {(x, y) : x ∈ adj(k) and y ∈ adj(k)}
  Gk ← (V, E)
end for
Gk are the so-called elimination graphs (Parter,’61).
[Figure: the initial graph H0 = G0 of a 6-vertex example and the elimination graphs obtained as the vertices are eliminated in order.]
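The loop above can be simulated directly on adjacency sets; here is a sketch of ours based on Parter's rule (eliminating a vertex pairwise connects its remaining neighbours):

```python
def elimination_graphs(n, edges):
    """Build the sequence of elimination graphs G1, ..., Gn:
    eliminating vertex k removes k and turns its remaining
    neighbourhood into a clique."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    graphs = []
    for k in range(n):
        nbrs = adj.pop(k)
        for v in nbrs:
            adj[v].discard(k)
            adj[v] |= nbrs - {v}  # fill edges among the neighbours of k
        graphs.append({v: set(s) for v, s in adj.items()})
    return graphs
```

The fill edges added by the `|=` line are exactly the fill-in positions predicted for the factorization.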
The Elimination Tree

The elimination tree is a graph that expresses all the dependencies in the factorization in a compact way.

Elimination tree: basic idea

If i depends on j (i.e., l_ij ≠ 0, i > j) and k depends on i (i.e., l_ki ≠ 0, k > i), then k depends on j by transitivity, and therefore the direct dependency expressed by l_kj ≠ 0, if it exists, is redundant.

For each column j < n of L, remove all the nonzeros in the column j except the first one below the diagonal. Let L_t denote the remaining structure and consider the matrix F_t = L_t + L_t^T. The graph G(F_t) is a tree called the elimination tree.
Uses of the elimination tree
The elimination tree has several uses in the factorization of a sparse matrix:

• it expresses the order in which variables can be eliminated: the elimination of a variable only affects (directly or indirectly) its ancestors and only depends on its descendants. Therefore, any topological order of the elimination tree leads to a correct result and to the same fill-in.
• it expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel.
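From the filled structure, the tree itself is cheap to obtain: the parent of j is the row index of the first subdiagonal nonzero of column j of L. A minimal sketch of ours:

```python
def elimination_tree(n, lower):
    """Parent array of the elimination tree. `lower` is the set of
    subdiagonal nonzero positions (i, j), i > j, of the filled factor
    L; parent[j] = min{ i > j : l_ij != 0 }, or None for a root."""
    parent = [None] * n
    for j in range(n):
        below = [i for (i, c) in lower if c == j]
        if below:
            parent[j] = min(below)
    return parent
```

Any topological order of the resulting parent structure is a valid elimination order.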
Matrix factorization
Cholesky on a sparse matrix
The Cholesky factorization of a sparse matrix can be achieved with a left-looking, right-looking or multifrontal method.

Reference case: regular 3 × 3 grid ordered by nested dissection. Nodes in the separators are ordered last (see the section on orderings).
[Figure: the 3 × 3 grid with nested-dissection ordering
1 7 4
3 8 6
2 9 5
and the corresponding elimination tree on nodes 1–9.]
Notation:
• cdiv(j): A(j, j) ← √A(j, j), then A(j + 1 : n, j) ← A(j + 1 : n, j) / √A(j, j)
• cmod(j, k): A(j : n, j) ← A(j : n, j) − A(j, k) · A(j : n, k)
• struct(L(1:k, j)): the nonzero structure of the submatrix L(1:k, j)
Sparse left-looking Cholesky
left-looking
for j=1 to n do
for k in struct(L(j,1:j-1)) do
cmod(j,k)
end for
cdiv(j)
end for
In the left-looking method, before variable j is eliminated, column j is updated with all the columns that have a nonzero on row j. In the example above, struct(L(7,1:6)) = {1, 3, 4, 6}.

• this corresponds to receiving updates from nodes lower in the subtree rooted at j
• the filled graph is necessary to determine the structure of each row
Sparse right-looking Cholesky
right-looking
for k=1 to n do
cdiv(k)
for j in struct(L(k+1:n,k)) do
cmod(j,k)
end for
end for
In the right-looking method, after variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9,3)) = {7, 8, 9}.

• this corresponds to sending updates to nodes higher in the elimination tree
• the filled graph is necessary to determine the structure of each column
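With columns stored sparsely, the right-looking loop touches only the columns listed in struct(L(k+1:n,k)). A small sketch of ours, with the lower triangle held as a dict of entries (fill-in appears as new dictionary keys):

```python
import math

def sparse_cholesky_right(A):
    """Right-looking sparse Cholesky in cdiv/cmod form.
    A: {(i, j): value} for the nonzeros of the lower triangle."""
    n = 1 + max(i for (i, _) in A)
    a = dict(A)
    for k in range(n):
        # cdiv(k): scale column k
        d = math.sqrt(a[(k, k)])
        a[(k, k)] = d
        col = [i for i in range(k + 1, n) if (i, k) in a]
        for i in col:
            a[(i, k)] /= d
        # cmod(j, k) for each j in struct(L(k+1:n, k))
        for j in col:
            for i in col:
                if i >= j:
                    a[(i, j)] = a.get((i, j), 0.0) - a[(i, k)] * a[(j, k)]
    return a
```

The `a.get((i, j), 0.0)` call is where fill-in happens: an entry that was structurally zero becomes nonzero.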
The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and, at each node:

1. assembly of the front
2. partial factorization

[Figure: elimination tree on nodes 1–9 with fronts {6, 7, 8, 9}, {4, 6, 7} and {5, 6, 9}.]
The Multifrontal Method
At node 4, the front {4, 6, 7} is assembled and partially factorized:

[ a44 a46 a47 ]     [ l44         ]
[ a64  0   0  ]  →  [ l64 b66 b67 ]
[ a74  0   0  ]     [ l74 b76 b77 ]

Variable 4 is eliminated; the block b is the contribution block for the father node.
The Multifrontal Method
At node 5, the front {5, 6, 9} is assembled and partially factorized:

[ a55 a56 a59 ]     [ l55         ]
[ a65  0   0  ]  →  [ l65 c66 c69 ]
[ a95  0   0  ]     [ l95 c96 c99 ]
The Multifrontal Method
At node 6, the front {6, 7, 8, 9} is assembled from the entries of A and the contribution blocks b and c, then partially factorized:

[ a66 0 a68 0 ]   [ b66 b67 0 0 ]   [ c66 0 0 c69 ]     [ l66             ]
[  0  0  0  0 ] + [ b76 b77 0 0 ] + [  0  0 0  0  ]  →  [ l76 d77 d78 d79 ]
[ a86 0  0  0 ]   [  0   0  0 0 ]   [  0  0 0  0  ]     [ l86 d87 d88 d89 ]
[  0  0  0  0 ]   [  0   0  0 0 ]   [ c96 0 0 c99 ]     [ l96 d97 d98 d99 ]
The Multifrontal Method
A dense matrix, called a frontal matrix, is associated with each node of the elimination tree. The multifrontal method consists in a bottom-up traversal of the tree where, at each node, two operations are done:

• Assembly: nonzeros from the original matrix (the arrowhead of the pivot) are assembled together with the contribution blocks cb-1, ..., cb-n from the children nodes into the frontal matrix.
• Elimination: a partial factorization of the frontal matrix is done. The variable associated with the node (called fully assembled) can be eliminated. This step produces part of the final factors L and a Schur complement (contribution block cb) that will be assembled into the father node.
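To make the assembly/elimination cycle concrete, here is a toy multifrontal Cholesky of our own devising (one eliminated variable per front, symmetric case). The tree and the front variable lists are supplied by the caller; a real solver would derive them in the analysis phase.

```python
import math

def multifrontal_cholesky(A, parent, front_vars):
    """A: {(i, j): value}, lower triangle of an SPD matrix.
    parent[j]: father of node j (None for the root).
    front_vars[j]: variables of front j, the pivot j listed first.
    Nodes are assumed numbered in postorder."""
    n = len(parent)
    L = {}
    cb = {}  # contribution blocks waiting to be assembled, per father
    for j in range(n):  # bottom-up traversal
        vs = front_vars[j]
        m = len(vs)
        F = [[0.0] * m for _ in range(m)]
        # assembly: column j of the original matrix ...
        for r in range(m):
            if (vs[r], j) in A:
                F[r][0] += A[(vs[r], j)]
        # ... plus extend-add of the children's contribution blocks
        for cvs, C in cb.pop(j, []):
            pos = [vs.index(v) for v in cvs]
            for a in range(len(cvs)):
                for b in range(len(cvs)):
                    F[pos[a]][pos[b]] += C[a][b]
        # partial factorization: eliminate the fully assembled pivot j
        d = math.sqrt(F[0][0])
        for r in range(m):
            L[(vs[r], j)] = F[r][0] / d
        if m > 1 and parent[j] is not None:
            S = [[F[r][c] - L[(vs[r], j)] * L[(vs[c], j)]
                  for c in range(1, m)] for r in range(1, m)]
            cb.setdefault(parent[j], []).append((vs[1:], S))
    return L
```

On a 3 × 3 SPD example whose elimination tree is the chain 0 → 1 → 2, this reproduces the factor computed by the dense algorithms.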
The Multifrontal Method: example
[Figure: multifrontal example on an elimination tree with nodes 1–6.]
Solve

Once the matrix is factorized, the problem can be solved against one or more right-hand sides:

AX = LL^T X = B,   A, L ∈ ℝ^(n×n),   X, B ∈ ℝ^(n×k)

The solution of this problem can be achieved in two steps:

forward substitution: LZ = B
backward substitution: L^T X = Z
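The two sweeps are straightforward; a dense sketch of ours, with L stored row-wise:

```python
def forward_substitution(L, b):
    """Solve L z = b for a lower-triangular L."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i]
    return z

def backward_substitution(L, z):
    """Solve L^T x = z: a backward sweep using the columns of L."""
    n = len(z)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - sum(L[j][i] * x[j]
                           for j in range(i + 1, n))) / L[i][i]
    return x
```

In the sparse case, the same sweeps are driven by the elimination tree: the forward sweep proceeds bottom-up, the backward sweep top-down.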
Solve: left-looking
[Figures: forward substitution and backward substitution, left-looking variant, on the elimination tree of the 3 × 3 grid example.]
Solve: right-looking
[Figures: forward substitution and backward substitution, right-looking variant, on the same example.]
Direct solvers: summary

The solution of a sparse linear system can be achieved in three phases that include at least the following operations:

Analysis
• Fill-reducing permutation
• Symbolic factorization
• Elimination tree computation

Factorization
• The actual matrix factorization

Solve
• Forward substitution
• Backward substitution

These phases, especially the analysis, can include many other operations. Some of them are presented next.
Parallelism
Parallelization: two levels of parallelism
tree parallelism: arising from sparsity, it is formalized by the fact that nodes in separate subtrees of the elimination tree can be eliminated at the same time

node parallelism: within each node, a parallel dense LU factorization (BLAS)
[Figure: going up the tree, node parallelism increases while tree parallelism decreases.]
Exploiting the second level of parallelism is crucial
Multifrontal factorization

Computer        #procs   (1) MFlops (speed-up)   (2) MFlops (speed-up)
Alliant FX/80      8          15 (1.9)                 34 (4.3)
IBM 3090J/6VF      6         126 (2.1)                227 (3.8)
CRAY-2             4         316 (1.8)                404 (2.3)
CRAY Y-MP          6         529 (2.3)               1119 (4.8)

Performance summary of the multifrontal factorization on matrix BCSSTK15. In column (1), we exploit only parallelism from the tree. In column (2), we combine the two levels of parallelism.
Task mapping and scheduling
Task mapping and scheduling: objective
Organize work to achieve a goal like makespan (total execution time) minimization, memory minimization, or a mixture of the two:

• Mapping: where to execute a task?
• Scheduling: when and where to execute a task?

Several approaches are possible:

static: build the schedule before the execution and follow it at run-time
• Advantage: very efficient since it has a global view of the system
• Drawback: requires a very good performance model of the platform

dynamic: take scheduling decisions dynamically at run-time
• Advantage: reactive to the evolution of the platform and easy to use on several platforms
• Drawback: decisions are taken with local criteria (a decision which seems to be good at time t can have very bad consequences at time t + 1)

hybrid: try to combine the advantages of static and dynamic
Task mapping and scheduling
The mapping/scheduling is commonly guided by a number of parameters:

• First of all, a mapping/scheduling which is good for concurrency is commonly not good in terms of memory consumption.

[Figure: an 18-node tree illustrating the trade-off between concurrency and memory.]

• Especially in distributed-memory environments, data transfers have to be limited as much as possible.
Dynamic scheduling on shared memory computers
The data can be shared between processors without any communication; therefore, a dynamic scheduling is relatively simple to implement and efficient:

• the dynamic scheduling can be done through a pool of "ready" tasks (a task can be, for example, seen as the assembly and factorization of a front)
• as soon as a task becomes ready (i.e., when all the tasks associated with the child nodes are done), it is queued in the pool
• processes pick up ready tasks from the pool and execute them. As soon as one process finishes a task, it goes back to the pool to pick up a new one, and this continues until all the nodes of the tree are done
• the difficulty lies in choosing one task among all those in the pool, because the order can influence both performance and memory consumption
Decomposition of the tree into levels
• Determination of Level L0 based on subtree cost.
[Figure: the tree is cut at a layer L0; the subtree roots below L0 are mapped statically onto the processors.]
• Scheduling of top of the tree can be dynamic.
Simplifying the mapping/scheduling problem
The tree commonly contains many more nodes than the available processes.
Objective: find a layer L0 such that the subtrees of L0 can be mapped onto the processors with a good balance.

Construction and mapping of the initial level L0:

Let L0 ← roots of the assembly tree
repeat
  Find the node q in L0 whose subtree has the largest computational cost
  Set L0 ← (L0 \ {q}) ∪ {children of q}
  Greedy mapping of the nodes of L0 onto the processors
  Estimate the load unbalance
until load unbalance < threshold
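A sketch of this loop in Python (our own; `cost[v]` is the total computational cost of the subtree rooted at v, and the greedy mapping assigns each subtree to the currently least-loaded processor):

```python
import heapq

def build_level_L0(roots, children, cost, nprocs, threshold):
    """Grow L0 by repeatedly replacing the most expensive subtree
    with its children until a greedy mapping is balanced enough."""
    heap = [(-cost[r], r) for r in roots]
    heapq.heapify(heap)
    while True:
        L0 = [v for (_, v) in heap]
        # greedy mapping: largest subtrees first, least-loaded proc
        loads = [0.0] * nprocs
        for v in sorted(L0, key=lambda v: -cost[v]):
            loads[loads.index(min(loads))] += cost[v]
        balanced = max(loads) * nprocs < threshold * sum(loads)
        _, q = heap[0]  # subtree with the largest cost
        if balanced or not children[q]:
            return sorted(L0), loads
        heapq.heappop(heap)  # replace q by its children
        for c in children[q]:
            heapq.heappush(heap, (-cost[c], c))
```

The threshold is the tolerated ratio between the maximum and the average processor load.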
Static scheduling on distributed memory computers
Main objective: reduce the volume of communication between processors, because data is transferred through the (slow) network.

Proportional mapping

• initially assigns all the processes to the root node
• performs a top-down traversal of the tree where the processes assigned to a node are subdivided among its children proportionally to the relative weights of their subtrees
[Figure: mapping of the tasks of an 18-node tree onto 5 processors: the root gets {1,2,3,4,5}, which is split into {1,2,3} and {4,5} among its subtrees, and so on down the tree.]

Good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.
Assumptions and Notations
• Assumptions:
  - each column of L / each node of the tree is assigned to a single processor
  - each processor is in charge of computing cdiv(j) for the columns j that it owns
• Notation:
  - mycols(p): the set of columns owned by processor p
  - map(j): the processor owning column j (or task j)
  - procs(L(:,k)) = {map(j) : j ∈ struct(L(:,k))} (only the processors in procs(L(:,k)) require updates from column k; they correspond to ancestors of k in the tree)
  - father(j): the father of node j in the elimination tree
Computational strategies for parallel direct solvers
The parallel algorithm is characterized by:

• the computational graph dependency
• the communication graph

There are three classical approaches to distributed-memory parallelism:

1. Fan-in: the fan-in algorithm is very similar to the left-looking approach and is demand-driven: the data required are aggregated update columns computed by the sending processor.
2. Fan-out: the fan-out algorithm is very similar to the right-looking approach and is data-driven: data is sent as soon as it is produced.
3. Multifrontal: the communication pattern follows a bottom-up traversal of the tree. Messages are contribution blocks and are sent to the processor mapped on the father node.
Fan-in variant (similar to left looking)
fan-in
for j = 1 to n do
  u = 0
  for all k in (struct(L(j,1:j-1)) ∩ mycols(p)) do
    cmod(u, k)
  end for
  if map(j) != p then
    send u to processor map(j)
  else
    incorporate u in column j
    receive all the updates on column j and incorporate them
    cdiv(j)
  end if
end for
Fan-in variant
[Figure: four sibling subtrees mapped on P0, with the father node mapped on P4.]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, (only) one message is sent by P0 → this exploits the data locality of proportional mapping.
Fan-out variant (similar to right-looking)
fan-out
for all leaf nodes j in mycols(p) do
  cdiv(j)
  send column L(:,j) to procs(L(:,j))
  mycols(p) = mycols(p) − {j}
end for
while mycols(p) != ∅ do
  receive any column (say L(:,k))
  for j in struct(L(:,k)) ∩ mycols(p) do
    cmod(j, k)
    if column j is completely updated then
      cdiv(j)
      send column L(:,j) to procs(L(:,j))
      mycols(p) = mycols(p) − {j}
    end if
  end for
end while
Fan-out variant
[Figure: four sibling subtrees mapped on P0, with the father node mapped on P4.]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.
Fan-out variant
Properties of fan-out:

• Historically the first implemented.
• Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of total number of messages and total volume.
• Does not exploit the data locality of proportional mapping.
• Improved algorithm (local aggregation): send aggregated update columns instead of individual factor columns for columns mapped on a single processor. This improves the exploitation of the data locality of proportional mapping, but the memory increase to store the aggregates can be critical (as in fan-in).
Multifrontal
multifrontal
for all leaf nodes j in mycols(p) do
  assemble front j
  partially factorize front j
  send the Schur complement to procs(father(j))
  mycols(p) = mycols(p) − {j}
end for
while mycols(p) != ∅ do
  receive any contribution block (say for node j)
  assemble the contribution block into front j
  if front j is completely assembled then
    partially factorize front j
    send the Schur complement to procs(father(j))
    mycols(p) = mycols(p) − {j}
  end if
end while
Multifrontal variant
[Figure: side-by-side comparison of the communication patterns of the fan-in, fan-out and multifrontal variants on the same subtree (two nodes on P0, their father on P1, the root on P2).]
Importance of the shape of the tree
Suppose that each node in the tree corresponds to a task that:

• consumes temporary data from the children,
• produces temporary data that is passed to the parent node.

• Wide tree: good parallelism, but many temporary blocks to store and large memory usage.
• Deep tree: less parallelism, but smaller memory usage.
Impact of fill reduction on the shape of the tree
Reordering technique — shape of the tree — observations:

AMD
• Deep, well-balanced
• Large frontal matrices on top

AMF
• Very deep, unbalanced
• Small frontal matrices
Impact of fill reduction on the shape of the tree
PORD
• Deep, unbalanced
• Small frontal matrices

SCOTCH
• Very wide, well-balanced
• Large frontal matrices

METIS
• Wide, well-balanced
• Smaller frontal matrices (than SCOTCH)
Memory consumption in the multifrontal method
Equivalent reorderings
A tree can be regarded as a dependency graph where nodes produce data that is then processed by their parents. By this definition, a tree has to be traversed in topological order, but there are still many topological traversals of the same tree.

Definition

Two orderings P and Q are equivalent if the structures of the filled graphs of PAP^T and QAQ^T are the same (that is, they are isomorphic).

Equivalent orderings result in the same amount of fill-in and computation during factorization. To ease the notation, we discuss only one ordering with respect to A, i.e., P is an equivalent ordering of A if the filled graphs of A and PAP^T are isomorphic.
Equivalent reorderings
All topological orderings of T(A) are equivalent

Let P be the permutation matrix corresponding to a topological ordering of T(A). Then G+(PAP^T) and G+(A) are isomorphic.

Let P be the permutation matrix corresponding to a topological ordering of T(A). Then the elimination trees T(PAP^T) and T(A) are isomorphic.

Because the fill-in will not change, we have the freedom to choose any specific topological order that provides other properties.
Tree traversal orders
Which specific topological order should we choose? A postorder. Why? Because the data produced by nodes is consumed by their parents in a LIFO order. In the multifrontal method, we can thus use a stack memory where contribution blocks are pushed as soon as they are produced by the elimination on a frontal matrix, and popped at the moment when the father node is assembled. This provides better data locality and makes the memory management easier.
The multifrontal method [Duff & Reid ’83]
[Figure: a 5 × 5 matrix A and the filled pattern of L + U − I.]

Storage is divided into two parts:

• Factors
• Active memory: the active frontal matrix plus the stack of contribution blocks

[Figure: the elimination tree on nodes 1–5 and the evolution of the active memory during the bottom-up traversal.]
Example 1: Processing a wide tree
[Figure: memory snapshots while processing a wide tree (nodes 1–7). Legend: unused memory space, stack memory space, factor memory space, non-free memory space. Many contribution blocks are stacked at the same time.]
Example 2: Processing a deep tree
[Figure: memory snapshots while processing a deep tree (nodes 1–4), with the same legend: allocation of 3, assembly step for 3, factorization step for 3, stack step for 3. Fewer contribution blocks coexist than with a wide tree.]
Postorder traversals: memory
A postorder provides good data locality and better memory consumption than a general topological order, since a father node is assembled as soon as its children have been processed. But there are still many postorders of the same tree. Which one should we choose? The one that minimizes memory consumption.
[Figure: a tree with leaves at the bottom and the root at the top, whose nodes a–i can be processed in the best postorder (abcdefghi) or the worst one (hfdbacegi) for stack memory.]
Problem model
• M_i: memory peak for the complete subtree rooted at i,
• temp_i: temporary memory produced by node i,
• m_parent: memory for storing the parent.

M_parent = max( max_{j=1..nbchildren} ( M_j + Σ_{k=1}^{j−1} temp_k ),  m_parent + Σ_{j=1}^{nbchildren} temp_j )    (1)

Objective: order the children to minimize M_parent.
Memory-minimizing schedules
Theorem
[Liu '86] The minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i.

Corollary

An optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i − temp_i.

Interpretation: at each level of the tree, a child with a relatively large memory peak in its subtree (M_i large with respect to temp_i) should be processed first.

⇒ Apply on the complete tree starting from the leaves (or from the root with a recursive approach).
Optimal tree reordering
Objective: Minimize peak of stack memory
Tree_Reorder(T):
for all i in the set of root nodes do
  Process_Node(i)
end for

Process_Node(i):
if i is a leaf then
  M_i = m_i
else
  for j = 1 to nbchildren do
    Process_Node(j-th child)
  end for
  Reorder the children of i in decreasing order of (M_j − temp_j)
  Compute M_parent at node i using Formula (1)
end if
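The procedure translates directly into code. A sketch of ours, where m[i] is the memory needed to process node i itself and temp[i] the size of the contribution block it leaves on the stack:

```python
def tree_reorder(children, m, temp, root):
    """Liu's child ordering: visit children in decreasing order of
    M_j - temp_j, and compute the peak M_i of each subtree with
    Formula (1)."""
    M = {}
    order = {}

    def process(i):
        if not children[i]:
            M[i] = m[i]
            order[i] = []
            return
        for c in children[i]:
            process(c)
        seq = sorted(children[i],
                     key=lambda c: M[c] - temp[c], reverse=True)
        order[i] = seq
        stacked = 0.0
        peak = 0.0
        for c in seq:
            peak = max(peak, M[c] + stacked)  # first term of (1)
            stacked += temp[c]
        M[i] = max(peak, m[i] + stacked)      # second term of (1)

    process(root)
    return M, order
```

Processing the child with the largest M_c − temp_c first keeps the big subtree peak from sitting on top of the other children's stacked contribution blocks.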
Memory consumption in parallel
In parallel, multiple branches have to be traversed at the same time in order to feed all the processes. This means that a higher number of CBs have to be stored in memory.

Metric: memory efficiency

e(p) = S_seq / ( p × S_max(p) )

We would like e(p) ≃ 1, i.e., S_max(p) ≃ S_seq/p on each processor.

Common mappings/schedulings provide a poor memory efficiency:

• Proportional mapping: lim_{p→∞} e(p) = 0 on regular problems.
Proportional mapping
Proportional mapping: top-down traversal of the tree, where the processors assigned to a node are distributed among its children proportionally to the weights of their respective subtrees.

[Figure: 64 processors at the root are split into 32, 6 and 26 among the children, proportionally to their subtree weights (50%, 10% and 40%), and so on recursively down the tree.]

• Targets run time, but poor memory efficiency.
• Usually a relaxed version is used: more memory-friendly, but unreliable estimates.
Proportional mapping
Proportional mapping: assuming that the sequential peak is 5 GB,

S_max(p) ≥ max( 4 GB / 26, 1 GB / 6, 5 GB / 32 ) = 0.16 GB  ⇒  e(p) ≤ 5 / (64 × 0.16) ≤ 0.5

[Figure: the three subtrees, with peaks 4 GB, 1 GB and 5 GB, are mapped on 26, 6 and 32 of the 64 processors; the sequential peak is 5 GB.]
A more memory-friendly strategy. . .
All-to-all mapping: postorder traversal of the tree, where all the processors work at every node.

[Figure: every node of the tree is mapped on all 64 processors.]

Optimal memory scalability (S_max(p) = S_seq/p), but no tree parallelism and prohibitive amounts of communication.
“memory-aware” mappings
"Memory-aware" mapping (Agullo '08): aims at enforcing a given memory constraint (M0, the maximum memory per processor):

1. Try to apply proportional mapping.
2. Enough memory for each subtree? If not, serialize the subtrees and update M0: the processors stack equal shares of the CBs from the previous nodes.

[Figure: when the subtrees do not fit within M0 under proportional mapping, they are processed one after another on all 64 processors.]

• Ensures the given memory constraint and provides reliable estimates.
• Tends to assign many processors to the nodes at the top of the tree ⇒ performance issues on parallel nodes.
“memory-aware” mappings
subroutine ma_mapping(r, p, M, S)
! r : root of the subtree
! p(r) : number of processes mapped on r
! M(r) : memory constraint on r
! S(:) : peak memory consumption for all nodes
call prop_mapping(r, p, M)
forall (i children of r)
! check if the constraint is respected
if( S(i)/p(i) > M(i) ) then
! reject PM and revert to tree serialization
call tree_serialization(r, p, M)
exit ! from forall block
end if
end forall
forall (i children of r)
! apply MA mapping recursively to all siblings
call ma_mapping(i, p, M, S)
end forall
end subroutine ma_mapping
“memory-aware” mappings
subroutine prop_mapping(r, p, M)
forall (i children of r)
p(i) = share of the p(r) processes from PM
M(i) = M(r)
end forall
end subroutine prop_mapping
subroutine tree_serialization(r, p, M)
stack_siblings = 0
forall (i children of r) ! in appropriate order
p(i) = p(r)
! update the memory constraint for the subtree
M(i) = M(r) - stack_siblings
stack_siblings = stack_siblings + cb(i)/p(r)
end forall
end subroutine tree_serialization
“memory-aware” mappings
A finer "memory-aware" mapping? Serializing all the children at once is very constraining: more tree parallelism can be found.

[Figure: groups of subtrees mapped on subsets of the 64 processors.]

Find groups on which proportional mapping works, and serialize these groups.
Heuristic: follow a given order (e.g., the serial postorder) and form groups as large as possible.
Experiments
• Matrix: finite-difference model of acoustic wave propagation, 27-point
  stencil, 192× 192× 192 grid; Seiscope consortium.
  N = 7 M, nnz = 189 M, factors = 152 GB (METIS ordering).
  Sequential peak of active memory: 46 GB.
• Machine: 64 nodes with two quad-core Xeon X5560 each.
  We use 256 MPI processes.
• Perfect memory scalability: 46 GB/256 = 180 MB per process.
Experiments
                       Prop.     Memory-aware mapping
                       mapping   M0 = 225 MB    M0 = 380 MB
                                                w/o groups   w/ groups
Max stack peak (MB)    1932      227            330          381
Avg stack peak (MB)     626      122            177          228
Time (s)               1323      2690           2079         1779
• proportional mapping delivers the fastest factorization because it targets
  parallelism but, at the same time, it leads to a poor memory scaling with
  e(256) = 0.09
• the memory-aware mapping, with and without groups, respects the imposed
  memory constraint, with a speed that depends on how tight the constraint is
• grouping clearly provides a benefit in terms of speed because it delivers
  more concurrency while still respecting the constraint
Low-Rank approximations

Low-rank matrices

Take a dense matrix B of size n× n and compute its SVD B = XSY :

B = X1S1Y1 +X2S2Y2 with S1(k, k) = σk > ε, S2(1, 1) = σk+1 ≤ ε

If B̃ = X1S1Y1 then ‖B − B̃‖2 = ‖X2S2Y2‖2 = σk+1 ≤ ε

If the singular values of B decay very fast (e.g., exponentially), then
k ≪ n even for very small ε (e.g., 10−14) ⇒ memory and CPU consumption can
be reduced considerably, with a controlled loss of accuracy (≤ ε), if B̃ is
used instead of B.
Multifrontal BLR LU factorization

Frontal matrices are not low-rank but in many applications they exhibit
low-rank blocks.

[Figure: a frontal matrix partitioned into blocks Li1,1/Ui1,1, ...,
Lidi,di/Uidi,di, plus the CB part.]

Because each row and column of a frontal matrix is associated with a point
of the discretization mesh, blocking the frontal matrix amounts to grouping
the corresponding points of the domain. The objective is thus to group
(cluster) the points in such a way that the number of blocks with the
low-rank property is maximized and their respective ranks are minimized.
Clustering of points/variables

A block represents the interaction between two subdomains σ and τ . If they
have a small diameter and are far away from each other, the interaction is
weak ⇒ the rank is low.

[Figure: rank of the block B(σ, τ ) (blocks of size 128) as a function of
the distance between σ and τ : the rank decreases as the distance grows.]

Guidelines for computing the grouping of points:

• ∀σ, cmin ≤ |σ| ≤ cmax : we want many compressible blocks, but not too
  small ones, to maintain a good compression rate and a good efficiency of
  operations.
• ∀σ, minimize diam(σ): for a given cluster size, this gives well-shaped
  clusters and consequently maximizes mutual distances.
Clustering

For example, if the points associated with the rows/columns of a frontal
matrix lie on a square surface, it is much better to cluster in a
checkerboard fashion (right) than into slices (left).

In practice, and in an algebraic context (where the discretization mesh is
not known), this can be achieved by extracting the subgraph associated with
the front variables and computing a k-way partitioning (with a tool like
METIS or SCOTCH, for example).
Clustering

Here is an example of variable clustering achieved with the k-way
partitioning method in the SCOTCH package.

Although irregular, the partitioning provides relatively well-shaped
clusters.
Admissibility condition

Once the clustering of variables is defined, it is possible to compress all
the blocks F (σ, τ) that satisfy a (strong) admissibility condition:

Adms(σ, τ) : max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

For simplicity, we can use a relaxed version called the weak admissibility
condition:

Admw(σ, τ) : dist(σ, τ) > 0.

The far field of a cluster σ is defined as

F(σ) = {τ | Adm(σ, τ) is true}.

In practice, it is easier to consider all blocks (except the diagonal ones)
eligible, try to compress them, and possibly revert to the non-compressed
form if the rank is too high.
Dense BLR LU factorization

Off-diagonal blocks are compressed in low-rank form right before updating
the trailing submatrix. As a result, updating a block Aij of size b using
blocks Lik, Ukj of rank r only takes O(br2) operations instead of O(b3).
The factorization sweeps the block columns, applying four kernels:

• FACTOR: Akk = Lkk Ukk
• SOLVE: Ukj = Lkk−1 Akj , Lik = Aik Ukk−1
• COMPRESS: L∗k = X∗k Y∗kT , Uk∗ = Xk∗ Yk∗T
• UPDATE: Aij = Aij − Lik Ukj
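A runnable sketch of these four kernels on a dense matrix may clarify the loop. This is a simplification (unpivoted LU, SVD-based compression, a diagonally dominant test matrix so the factorization is stable), not the production algorithm:

```python
import numpy as np

def lu_nopiv(A):
    """Unpivoted dense LU of a (diagonally dominant) diagonal block."""
    A = A.copy()
    n = A.shape[0]
    for j in range(n - 1):
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def compress(B, eps):
    X, s, Yt = np.linalg.svd(B, full_matrices=False)
    k = max(1, int(np.sum(s > eps)))
    return X[:, :k] * s[:k], Yt[:k, :]

def blr_lu(A, b, eps):
    """Right-looking BLR LU with block size b. Factors are stored dense in
    A for checking; the UPDATE step uses the compressed forms."""
    A = A.copy()
    p = A.shape[0] // b
    S = lambda k: slice(k * b, (k + 1) * b)
    for k in range(p):
        Lkk, Ukk = lu_nopiv(A[S(k), S(k)])                    # FACTOR
        A[S(k), S(k)] = np.tril(Lkk, -1) + Ukk
        Lc, Uc = {}, {}
        for i in range(k + 1, p):
            Lik = np.linalg.solve(Ukk.T, A[S(i), S(k)].T).T   # SOLVE
            Uki = np.linalg.solve(Lkk, A[S(k), S(i)])
            A[S(i), S(k)], A[S(k), S(i)] = Lik, Uki
            Lc[i], Uc[i] = compress(Lik, eps), compress(Uki, eps)  # COMPRESS
        for i in range(k + 1, p):
            XL, YL = Lc[i]
            for j in range(k + 1, p):
                XU, YU = Uc[j]
                # UPDATE in low-rank form: O(b r^2) instead of O(b^3)
                A[S(i), S(j)] -= XL @ ((YL @ XU) @ YU)
    n = p * b
    return np.tril(A, -1) + np.eye(n), np.triu(A)

n, bs = 64, 16
idx = np.arange(n)
A0 = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :])) + n * np.eye(n)
L, U = blr_lu(A0, bs, eps=1e-10)
print(np.linalg.norm(L @ U - A0) / np.linalg.norm(A0))
```

The backward error of the resulting factorization is of the order of the compression threshold ε, which is the controlled-accuracy trade-off described above.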
Experimental MF complexity

Setting:

1. Poisson: N3 grid with a 7-point stencil, with u = 1 on the boundary ∂Ω:

   ∆u = f

2. Helmholtz: N3 grid with a 27-point stencil, where ω is the angular
   frequency, v(x) is the seismic velocity field, and u(x, ω) is the
   time-harmonic wavefield solution to the forcing term s(x, ω):

   (−∆− ω2/v(x)2) u(x, ω) = s(x, ω)
Experimental MF complexity: entries in factor

[Figure: entries in the factors vs. problem size N (64 to 224). Poisson:
the BLR factorizations at ε = 10−14 and 10−10 follow fits of the form
56n1.07 log(n) and 62n1.04 log(n), versus O(n1.3) in full rank. Helmholtz:
BLR at ε = 10−4 follows 18n1.20 log(n), versus O(n1.3) in full rank.]

• ε only plays a role in the constant factor
• for Poisson, a factor ∼ 3 gain with almost no loss of accuracy
Experimental MF complexity: operations

[Figure: flop counts vs. problem size N (64 to 224). Poisson: the BLR
factorizations at ε = 10−14 and 10−10 follow fits of the form 1065n1.50 and
1674n1.52, versus O(n2) in full rank. Helmholtz: BLR at ε = 10−4 follows
37n1.85, versus O(n2) in full rank.]

• ε only plays a role in the constant factor
• for Poisson, a factor ∼ 9 gain with almost no loss of accuracy
Experiments: Poisson
[Figure: Poisson equation, 7-point stencil on a 1283 domain. Flops and time
of the BLR factorization as a fraction of full rank (left axis), and
componentwise scaled residual with and without iterative refinement (right
axis), for BLR accuracies ε from 10−14 to 10−2.]

• Compression improves with higher ε values
• Time reduction is not as good as the flops reduction due to the lower
  efficiency of operations
• The Componentwise Scaled Residual (CSR) is in good accordance with ε
• Iterative Refinement can be used to recover the lost accuracy
• For very high values of ε, the factorization can be used as a
  preconditioner
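The iterative refinement idea mentioned above can be sketched in a few lines. As a stand-in for a low-accuracy BLR factorization, a single-precision solve is reused on the double-precision residual (this substitution is an illustration, not the actual MUMPS implementation):

```python
import numpy as np

# Iterative refinement: solve with an approximate factorization, then
# repeatedly correct x using the residual computed in full precision.

rng = np.random.default_rng(1)
n = 300
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned
x_true = rng.standard_normal(n)
b = A @ x_true

A32 = A.astype(np.float32)                         # "approximate" factorization
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))  # single-precision accuracy

for _ in range(3):
    r = b - A @ x                                  # residual in double precision
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))  # refined accuracy
```

Each refinement step contracts the error by roughly the accuracy of the approximate solve, so a few iterations recover (nearly) full double-precision accuracy as long as the problem is not too ill-conditioned.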
Experiments: Helmholtz

[Figure: Helmholtz equation, 27-point stencil on a 2563 domain; same
quantities as in the Poisson experiment.]

The same observations hold: compression (and thus the flops and time
reduction) improves with higher ε values, the CSR follows ε, and iterative
refinement or preconditioning can be used for loose tolerances.
Sparse QR Factorization
Overdetermined Problems

In the case Ax = b with A ∈ Rm×n, m > n, the problem is overdetermined and
the existence of a solution is not guaranteed (a solution only exists if
b ∈ R(A)). In this case, the desired solution may be the one that minimizes
the norm of the residual (least squares problem):

minx ‖Ax− b‖2

[Figure: b, its projection Ax onto span(A), and the residual r.]

The figure above shows, intuitively, that the residual has minimum norm if
it is orthogonal to R(A), i.e., Ax is the projection of b onto R(A).
Least Squares Problems: normal equations

minx ‖Ax− b‖2 ⇔ minx ‖Ax− b‖22

‖Ax− b‖22 = (Ax− b)T (Ax− b) = xTATAx− 2xTAT b+ bT b

Setting the derivative with respect to x to zero gives x = (ATA)−1AT b.
This is the solution of the n× n fully determined problem

ATAx = AT b

called the normal equations system:

• ATA is symmetric, and positive definite if rank(A) = n (nonnegative
  definite otherwise)
• it can be solved with a Cholesky factorization; not necessarily backward
  stable because κ(ATA) = κ(A)2
• AT (Ax− b) = AT r = 0 means that the residual is orthogonal to R(A),
  which delivers the desired solution
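The squared conditioning can be observed numerically. The sketch below solves the same least-squares problem through the normal equations and through a QR factorization, and compares both against numpy's SVD-based reference (the nearly dependent columns are constructed on purpose to make κ(A) large):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 10
A = rng.standard_normal((m, n))
A[:, 0] = A[:, 1] + 1e-6 * rng.standard_normal(m)  # nearly dependent columns
b = rng.standard_normal(m)

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]       # SVD-based reference

# normal equations: A^T A x = A^T b  (conditioning is kappa(A)^2)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR path: solve R x = Q^T b  (conditioning is kappa(A))
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

print(np.linalg.norm(x_ne - x_ref), np.linalg.norm(x_qr - x_ref))
```

With κ(A) around 10⁶, the normal-equations solution loses roughly twice as many digits as the QR solution, which is the practical consequence of κ(ATA) = κ(A)².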
Least Squares Problems: the QR factorization

The QR factorization of a matrix A ∈ Rm×n consists in finding an orthogonal
matrix Q ∈ Rm×m and an upper triangular matrix R ∈ Rn×n such that

A = Q [R; 0]

Assuming Q = [Q1 Q2], where Q1 is made up of the first n columns of Q and
Q2 of the remaining m− n, and writing

QT b = [Q1T ; Q2T ] b = [c; d]

we have

‖Ax− b‖22 = ‖QTAx−QT b‖22 = ‖Rx− c‖22 + ‖d‖22.

In the case of a full-rank matrix A, this quantity is minimized if Rx = c.
Least Squares Problems: the QR factorization

Assume Q = [Q1 Q2] where Q1 contains the first n and Q2 the last m− n
columns of Q:

• Q1 is an orthogonal basis for R(A) and Q2 is an orthogonal basis for
  N (AT )
• P1 = Q1Q1T and P2 = Q2Q2T are projectors onto R(A) and N (AT ).
  Thus, Ax = P1b and r = b−Ax = P2b

[Figure: b, its projection Ax onto span(A), and the residual r.]
Properties of the solution by QR factorization

• it is backward stable
• it requires 3n2(m− n/3) flops (much more expensive than Cholesky)
Underdetermined Problems

In the case Ax = b with A ∈ Rm×n, m < n, the problem is underdetermined and
admits infinitely many solutions; if xp is any solution of Ax = b, then the
complete set of solutions is

{x | Ax = b} = {xp + z | z ∈ N (A)}

In this case, the desired solution may be the one with minimum 2-norm
(minimum 2-norm solution problem):

minimize ‖x‖2 subject to Ax = b
Minimum 2-norm solution Problems

The minimum 2-norm solution is

xmn = AT (AAT )−1b

In fact, take any other solution x; then A(x− xmn) = 0 and

(x− xmn)Txmn = (x− xmn)TAT (AAT )−1b = (A(x− xmn))T (AAT )−1b = 0

which means (x− xmn) ⊥ xmn. Thus

‖x‖22 = ‖xmn + x− xmn‖22 = ‖xmn‖22 + ‖x− xmn‖22

i.e., xmn has the smallest 2-norm of any solution.
As for the normal equations system, this approach incurs instability due to
κ(AAT ).
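The orthogonality argument above can be checked directly: any other solution is obtained by adding a null-space component, and its norm can only grow.

```python
import numpy as np

# Minimum 2-norm solution x_mn = A^T (A A^T)^{-1} b of an underdetermined
# system, compared with another particular solution.

rng = np.random.default_rng(3)
m, n = 3, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_mn = A.T @ np.linalg.solve(A @ A.T, b)

z = rng.standard_normal(n)
z -= A.T @ np.linalg.solve(A @ A.T, A @ z)   # project z onto N(A)
x_other = x_mn + z                            # still solves A x = b

print(np.linalg.norm(x_mn), np.linalg.norm(x_other))
```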
Minimum 2-norm solution Problems: the QR factorization

Assume

QR = [Q1 Q2] [R1; 0] = AT

is the QR factorization of AT , where Q1 is made of the first m columns of
Q and Q2 of the last n−m ones. Then, writing z = QTx = [z1; z2],

Ax = RTQTx = [R1T 0] [z1; z2] = R1T z1 = b

The minimum 2-norm solution follows by setting z2 = 0.
Note that Q2 is an orthogonal basis for N (A); thus, the minimum 2-norm
solution is achieved by removing from any solution xp all its components
in N (A).
Householder QR

A matrix H of the form

H = I − 2vvT /(vT v)

is called a Householder Reflection and v a Householder Vector. The
transformation

Pv = vvT /(vT v)

is a projection onto the span of v.

Householder reflections can be used in such a way as to annihilate all the
coefficients of a vector x except one, which corresponds to reflecting x
onto one of the original axes. If v = x± ‖x‖2e1, then Hx has all
coefficients equal to zero except the first.
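A three-line numerical check of this reflection, with the usual sign choice v = x + sign(x1)‖x‖e1 to avoid cancellation:

```python
import numpy as np

# Householder reflection annihilating all entries of x but the first.
def householder(x):
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])  # v = x + sign(x1)||x|| e1
    return v

x = np.array([3.0, 1.0, 2.0])
v = householder(x)
H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
print(H @ x)   # [-||x||, 0, 0] up to rounding
```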
Householder QR

Note that in practice the H matrix is never explicitly formed. Instead, it
is applied through products with the vector v.
Also note that if x is sparse, the v vector (and thus the H matrix) follows
its structure. Imagine a sparse vector x and a vector x̄ built by removing
all the zeros in x:

x = [x1; 0; x2] , x̄ = [x1; x2]

If H̄ is the Householder reflector (with vector v̄) that annihilates all the
elements of x̄ but the first, then the equivalent transformation H for x is

v̄ = [v1; v2] , H̄ = [H11 H12; H21 H22]
⇒ v = [v1; 0; v2] , H = [H11 0 H12; 0 1 0; H21 0 H22]
Householder QR

The complete QR factorization of a matrix is achieved in n steps. Step k
applies a reflection Hk that zeroes the subdiagonal of column k:

H1 = I − 2v1v1T /(v1T v1) :

  [a11 a12 a13]     [r11  r12    r13  ]
  [a21 a22 a23]  →  [ 0  a(1)22 a(1)23]
  [a31 a32 a33]     [ 0  a(1)32 a(1)33]
  [a41 a42 a43]     [ 0  a(1)42 a(1)43]

H2 = diag(1, I − 2v2v2T /(v2T v2)) produces r22, r23 and zeroes the entries
below them; H3 = diag(1, 1, I − 2v3v3T /(v3T v3)) produces r33:

  [r11 r12 r13]
  [ 0  r22 r23]
  [ 0   0  r33]
  [ 0   0   0 ]

The Householder vectors can be stored in the bottom-left corner of the
original matrix, where the zeros were introduced; with Q = H1 ∗H2 ∗H3 we
have A = Q ∗R and the storage is

  [r11 r12 r13]
  [v21 r22 r23]
  [v31 v32 r33]
  [v41 v42 v43]
Householder QR

Depending on the position of the coefficients that have to be annihilated,
the Householder transformation only touches certain regions of the matrix.
Imagine a matrix

A = [B D; 0 E]

with QBRB = B the QR factorization of B; applying QBT only modifies the
first block row [B D].

Transformations that modify non-overlapping regions of a matrix can be
applied independently and thus in parallel.
Sparse Multifrontal QR

Assume a pseudo-sparse matrix A defined as below. Its QR factorization can
be achieved in three steps:

Step 1: factor QBRB = [B; C] and update the coupled columns:

A = [B F; C G; D H; E L; M]  →  A1 = [RB F; G; D H; E L; M]

Step 2: factor QDRD = [D; E] in A1; a row permutation P then gathers the
remaining rows at the bottom:

A1  →  A2 = P [RB F; RD H; M; G; L]

Step 3: factor QMRM = [M; G; L]:

A2  →  R = [RB F; RD H; RM]
Sparse Multifrontal QR
We just did a sparse, multifrontal QR factorization. The previous
operations can be arranged in an elimination tree, with one node per step.
Most of the theory related to the multifrontal QR factorization can be
formalized thanks to the structural properties of the normal equations
matrix ATA.
Sparse Multifrontal QR: fill-in

After applying a Householder reflection (or a set of reflections) that
affects k rows of a sparse matrix, all the columns presenting at least one
nonzero in those rows will be completely filled, i.e., they will have a
nonzero in each of those k rows:
The strong Hall property

Based on the relation ATA = (QR)TQR = RTR, the R factor of a matrix A is
equal to the Cholesky factor of the corresponding normal equations matrix
ATA. The structure of R can thus be predicted by a symbolic Cholesky
factorization of the ATA matrix, provided A is a strong Hall matrix.

Strong Hall

A matrix A ∈ Rm×n, m ≥ n, is said to be strong Hall if, for every subset
of 0 < k < m columns, the corresponding submatrix has nonzeros in at least
k + 1 rows.
In this case, the symbolic Cholesky factorization of ATA will correctly
predict the structure of R.

The symbolic Cholesky factorization cannot predict the structural
cancellations that happen in the case where A is not strong Hall.
The strong Hall property: two examples

NON-strong Hall (full first row, one nonzero per column elsewhere):

struct(A)      predicted struct(R)   actual struct(R)
× × × × ×      × × × × ×             × × × × ×
  ×              × × × ×               ×
    ×              × × ×                 ×
      ×              × ×                   ×
        ×              ×                     ×

Strong Hall (full first row plus a subdiagonal):

struct(A)      predicted struct(R)   actual struct(R)
× × × ×        × × × ×               × × × ×
×                × × ×                 × × ×
  ×                × ×                   × ×
    ×                ×                     ×
      ×
The strong Hall property

In the case of non-strong Hall matrices, two options are possible:

1. Reduce the matrix to Block Triangular Form (BTF), either coarse or fine.
   In this case all the diagonal blocks have the strong Hall property and
   can be QR-factorized independently.

2. Rediscover the structure of R during the factorization phase. The
   symbolic Cholesky factorization of ATA overestimates the fill-in in R.
   This analysis can thus be taken as a starting point and refined during
   the numerical computation according to the actual fill-in.

Without loss of generality, we can assume A is always strong Hall.
Sparse Multifrontal QR: the analysis phase

The analysis phase of the sparse, multifrontal QR factorization proceeds in
the same way as for the Cholesky or LU factorizations and includes the same
operations:

• fill-reducing ordering: this operation aims at computing a column
  permutation that minimizes the amount of fill-in introduced in R during
  the factorization. This is achieved by computing a fill-reducing order
  for the Cholesky factor of ATA
• symbolic factorization: this operation computes, symbolically, the
  structure of the Cholesky factor of ATA
• elimination tree computation: computation of the elimination tree for
  the Cholesky factorization of ATA
• tree amalgamation
• tree traversal order
• etc.
Sparse Multifrontal QR: the factorization phase
The numerical factorization phase also proceeds as in the multifrontal LU
or Cholesky factorization (a bottom-up traversal of the tree), but the
factorization and assembly of a frontal matrix are different:

assembly: in the QR factorization, the contribution blocks from children
nodes are not summed but simply appended to the coefficients from the
original matrix. This can be done in two different ways depending on the
factorization strategy (below).

factorization: the QR factorization of the frontal matrix can be either
partial or full.
Sparse Multifrontal QR: the factorization phase
Assume a frontal matrix of size m× n with npiv fully assembled variables.
In the first strategy, only npiv out of min(m,n) Householder reflections
are applied to the frontal matrix. This leads to a rectangular contribution
block.

[Figure: the front after factorization (R and V parts plus a rectangular
cb), and the assembly of the children's rectangular cb's into the parent.]
Sparse Multifrontal QR: the factorization phase
Assume a frontal matrix of size m× n with npiv fully assembled variables.
In the second strategy, min(m,n) Householder reflections are applied to the
frontal matrix, which leads to a triangular contribution block; this
results in a relatively sparse frontal matrix which can be row-permuted in
order to move most of the zeros to the bottom-left corner and spare the
related computations.

[Figure: the front after factorization (R and V parts plus a triangular
cb), and the staircase assembly of the children's triangular cb's.]
Sparse Multifrontal QR: the factorization phase
Strategy 1:
Sparse Multifrontal QR: the factorization phase
Strategy 2:
Sparse Multifrontal QR: the factorization phase
• Strategy 1 is much easier to implement
• Strategy 2:
   leads to smaller fronts
   requires less storage
   minimizes the overall number of operations
   reduces communications
matrix     Strat.   Max front   flops        size Q    facto time (s)
medium     1        6896        3197633920   4226923   1022.68
           2         689         281359552   1065149    254.34
Nfac70     1        14284       1312200576   4269607    428.38
           2          240         24316052    476636     60.34
WEST2021   1           26           409225     17796      7.78
           2           26           390033     22439      8.84
Sparse Multifrontal QR
Example: the flower_4_4 matrix
Sparse Direct LU/Cholesky: summary
1. Analysis: a preprocessing phase aiming at improving the structural
   properties of the problem and preparing the ground for the numerical
   factorization. Only symbolic operations are done in this phase:
    fill-reducing ordering/permutations;
    symbolic factorization;
    elimination tree computation;
    tree amalgamation;
    tree traversal order (for memory minimization);
    tree mapping (for parallel factorization).
2. Factorization: the actual numerical factorization is performed based on
   the result of the analysis. This can be done in parallel and, possibly,
   with pivoting to improve numerical stability (no pivoting in the
   symmetric, positive definite case).
3. Solve: this step solves the problem against one or more right-hand sides
   using the forward and backward substitutions. This can also be done in
   parallel, following the same computational pattern as the factorization
   phase.
Iterative vs Direct solvers

Iterative Solvers
+ lower memory consumption
+ fewer flops
+ better scalability (if the preconditioner is not too complex)
− less robust: may not converge, or converge too slowly

Direct Solvers
+ more robust
+ ideal for multiple solutions
+ work better with multiple right-hand sides
− require more memory and flops (due to fill-in), especially for 3D problems
− less scalable