Sparse factorizations: Towards optimal complexity and resilience at exascale
Xiaoye Sherry Li Lawrence Berkeley National Laboratory
Challenges in 21st Century Experimental Mathematical Computation Workshop, ICERM, Brown University, July 21-25, 2014.
Introduction
! DOE SciDAC programs (Scientific Discovery through Advanced Computing)
! FASTMath Institute (2011-2016, Frameworks, Algorithms, and Scalable Technologies for Mathematics)
• Software: SuperLU, PETSc, Trilinos, Chombo, mesh, …
• (3 other Institutes)
! Science Applications (many; mostly partial differential equations)
• CEMM (2009-2015, Center for Extended MHD Modeling, fusion energy)
• ComPASS (2008-2015, Community Petascale Project for Accelerator Science and Simulation)
! LBNL focuses:
! Direct solvers (SuperLU): scaling to 1000s of cores
! Hybrid solvers (direct & iterative): scaling to 10,000 cores
! Low-rank HSS preconditioner: nearly linear complexity for certain PDEs
Application 1: Burning plasma for fusion energy
! ITER – a new fusion reactor being constructed in Cadarache, France
• International collaboration: China, the European Union, India, Japan, Korea, Russia, and the United States
• Study how to harness fusion, creating clean energy using nearly inexhaustible hydrogen as the fuel. ITER promises to produce 10 times as much energy as it consumes, but that success hinges on accurately designing the device.
! One major simulation goal is to predict microscopic MHD instabilities of burning plasma in ITER. This involves solving extended, nonlinear magnetohydrodynamics equations.
Application 1: ITER modeling
! Center for Extended Magnetohydrodynamic Modeling (CEMM), PI: S. Jardin, PPPL.
! Develop simulation codes to predict microscopic MHD instabilities of burning magnetized plasma in a confinement device (e.g., the tokamak used in ITER experiments).
• Efficiency of the fusion configuration increases with the ratio of thermal to magnetic pressure, but MHD instabilities become more likely as this ratio rises.
! Code suite includes M3D-C1, NIMROD
[Figure: toroidal (R, ϕ, Z) geometry of the device (S. Jardin)]
• At each ϕ = constant plane, scalar 2D data is represented using 18-degree-of-freedom quintic triangular finite elements (Q18)
• Coupling along the toroidal direction
ITER modeling: 2-Fluid 3D MHD Equations
$$\frac{\partial n}{\partial t} + \nabla\cdot(n\mathbf{V}) = 0 \qquad \text{continuity}$$

$$\frac{\partial \mathbf{B}}{\partial t} = -\nabla\times\mathbf{E}, \qquad \nabla\cdot\mathbf{B} = 0, \qquad \mu_0 \mathbf{J} = \nabla\times\mathbf{B} \qquad \text{Maxwell}$$

$$nM\left(\frac{\partial \mathbf{V}}{\partial t} + \mathbf{V}\cdot\nabla\mathbf{V}\right) + \nabla p = \mathbf{J}\times\mathbf{B} - \nabla\cdot\Pi_{GV} - \nabla\cdot\Pi_{\mu} \qquad \text{momentum}$$

$$\mathbf{E} + \mathbf{V}\times\mathbf{B} = \eta\mathbf{J} + \frac{1}{ne}\left(\mathbf{J}\times\mathbf{B} - \nabla p_e - \nabla\cdot\Pi_e\right) \qquad \text{Ohm's law}$$

$$\frac{3}{2}\frac{\partial p_e}{\partial t} + \nabla\cdot\left(\frac{3}{2}p_e\mathbf{V}\right) = -p_e\,\nabla\cdot\mathbf{V} + \eta J^2 - \nabla\cdot\mathbf{q}_e + Q_\Delta \qquad \text{electron energy}$$

$$\frac{3}{2}\frac{\partial p_i}{\partial t} + \nabla\cdot\left(\frac{3}{2}p_i\mathbf{V}\right) = -p_i\,\nabla\cdot\mathbf{V} - \Pi_\mu\!:\!\nabla\mathbf{V} - \nabla\cdot\mathbf{q}_i - Q_\Delta \qquad \text{ion energy}$$
The objective of the M3D-C1 code is to solve these equations as accurately as possible in 3D toroidal geometry with realistic boundary conditions, optimized for a low-β torus with a strong toroidal field.
Application 2: particle accelerator cavity design
• Community Petascale Project for Accelerator Science and Simulation (ComPASS), PI: P. Spentzouris, Fermilab
• Development of a comprehensive computational infrastructure for accelerator modeling and optimization
• RF cavity: Maxwell equations for the electromagnetic field
• FEM in the frequency domain leads to a large sparse eigenvalue problem; requires solving shifted linear systems (a sketch of one such solve follows the equations below)
Linear eigenvalue problem: $(K - \sigma_0^2 M)\,x = b$

Nonlinear complex eigenvalue problem: $(K + i\sigma W - \sigma_0^2 M)\,x = b$

[Figure: closed cavity (boundary $\Gamma_E$), open cavity (boundary $\Gamma_M$) with waveguide BCs; RF unit in the ILC] (L.-Q. Lee)
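Since each shift gives a new sparse system with the same structure, one direct factorization per shift is natural. A minimal sketch, assuming SciPy's sequential SuperLU wrapper (`scipy.sparse.linalg.splu`) as a stand-in for the distributed solver, with placeholder K and M in place of the FEM assembly:

```python
# Hedged sketch: factor the shifted operator (K - sigma^2 M) once,
# then reuse it for many right-hand sides.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
# Placeholder "stiffness" and "mass" matrices standing in for FEM assembly.
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sp.identity(n, format="csc")
sigma = 0.5
b = np.ones(n)

lu = spla.splu((K - sigma**2 * M).tocsc())  # one factorization per shift
x = lu.solve(b)                             # cheap solve per right-hand side
print(np.linalg.norm((K - sigma**2 * M) @ x - b))  # residual check
```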
Sparse: lots of zeros in matrix
! fluid dynamics, structural mechanics, chemical process simulation, circuit simulation, electromagnetic fields, magneto-hydrodynamics, seismic-imaging, economic modeling, optimization, data analysis, statistics, . . .
! Example: A of dimension 10^6, with 10-100 nonzeros per row
! Matlab: >> spy(A) (a Python equivalent is sketched below)
[Spy plots: Mallya/lhr01 (chemical eng.), Boeing/msc00726 (structural eng.)]
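For readers without Matlab, an equivalent sketch with SciPy (a smaller dimension than the slide's 10^6 is used so the example runs quickly, and the matrix is random rather than one of the application matrices above):

```python
# Build a sparse matrix with ~10 nonzeros per row and look at its pattern,
# the SciPy analogue of Matlab's spy(A).
import scipy.sparse as sp
import matplotlib.pyplot as plt

n = 100_000                      # the slide's example uses n = 10^6
A = sp.random(n, n, density=10 / n, format="csr")
print(f"{A.nnz / n:.1f} nonzeros per row on average")

plt.spy(A[:2000, :2000], markersize=0.2)   # leading submatrix, for speed
plt.show()
```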
Strategies of sparse linear solvers
§ Solving a system of linear equations Ax = b
• Sparse: many zeros in A; worth special treatment
§ Iterative methods (CG, GMRES, …)
§ A is not changed (read-only)
§ Key kernel: sparse matrix-vector multiply
§ Easier to optimize and parallelize
§ Low algorithmic complexity, but may not converge
§ Direct methods
§ A is modified (factorized)
§ Harder to optimize and parallelize
§ Numerically robust, but higher algorithmic complexity
§ Often a direct method (factorization) is used to precondition an iterative method, solving the preconditioned system M⁻¹Ax = M⁻¹b (see the sketch below)
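A minimal sketch of this direct-preconditions-iterative pattern in SciPy (the choice of M here, the diagonal of A, is a toy; real preconditioners factor a nearby or coarser operator):

```python
# Factor an "easy" matrix M directly, wrap the solve as a LinearOperator,
# and hand it to GMRES as the preconditioner for M^{-1} A x = M^{-1} b.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 5000
A = sp.diags([-1.0, 2.2, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sp.diags(A.diagonal()).tocsc()            # toy "easy" system
b = np.ones(n)

lu = spla.splu(M)                             # direct factorization of M
prec = spla.LinearOperator((n, n), matvec=lu.solve)
x, info = spla.gmres(A, b, M=prec)
print("converged" if info == 0 else f"gmres info = {info}")
```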
Gaussian Elimination (GE)
! Solving a system of linear equations Ax = b
! First step of GE (shown below)
! Repeat GE on C; the result is an LU factorization A = LU
– L lower triangular with unit diagonal, U upper triangular
! Then x is obtained by solving two triangular systems with L and U
$$A = \begin{pmatrix} \alpha & w^T \\ v & B \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ v/\alpha & I \end{pmatrix} \cdot \begin{pmatrix} \alpha & w^T \\ 0 & C \end{pmatrix}, \qquad C = B - \frac{v\cdot w^T}{\alpha}$$
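A quick numerical check of this identity in NumPy (a sketch; the diagonal shift keeps the pivot α safely away from zero):

```python
# Verify the first elimination step: A = [[alpha, w^T], [v, B]]
# equals [[1, 0], [v/alpha, I]] @ [[alpha, w^T], [0, C]],
# where C = B - v w^T / alpha is the Schur complement.
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # safe pivot
alpha, w, v, B = A[0, 0], A[0, 1:], A[1:, 0], A[1:, 1:]

C = B - np.outer(v, w) / alpha                    # Schur complement
L1 = np.block([[np.eye(1),            np.zeros((1, n - 1))],
               [v[:, None] / alpha,   np.eye(n - 1)]])
U1 = np.block([[np.array([[alpha]]),  w[None, :]],
               [np.zeros((n - 1, 1)), C]])
print(np.allclose(L1 @ U1, A))                    # True; GE now recurses on C
```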
Sparse factorization
! Store A explicitly … many sparse compressed formats
! "Fill-in" … new nonzeros in L & U
! Graph algorithms: directed/undirected graphs, bipartite graphs, paths, elimination trees, depth-first search, heuristics for NP-hard problems, cliques, graph partitioning, …
! Unfriendly to high-performance, parallel computing
! Irregular memory access, indirect addressing, strong task/data dependency
[Figure: sparse matrix with original nonzeros and fill-in in the L and U factors, alongside the corresponding elimination graph]
Graph tool: reachable set, fill-path

Edge (x,y) exists in the filled graph G⁺ due to the path x → 7 → 3 → 9 → y
! Finding fill-ins ⟺ finding the transitive closure of G(A)
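The fill-path characterization can be turned into a tiny symbolic-elimination sketch: eliminating a vertex connects its remaining higher-numbered neighbors, which is exactly how fill edges (paths through lower-numbered vertices) arise. A minimal, unoptimized illustration (real symbolic factorization uses elimination trees, not this quadratic loop):

```python
# Find fill-in by symbolic elimination on the graph of A: eliminating
# vertex v adds a clique on v's higher-numbered neighbors, which is
# equivalent to closing all fill paths through lower-numbered vertices.
def symbolic_fill(n, edges):
    adj = {v: set() for v in range(n)}
    for x, y in edges:
        adj[x].add(y); adj[y].add(x)
    fill = set()
    for v in range(n):                      # eliminate vertices in order
        higher = sorted(u for u in adj[v] if u > v)
        for i, x in enumerate(higher):
            for y in higher[i + 1:]:
                if y not in adj[x]:          # new edge = fill entry in L+U
                    adj[x].add(y); adj[y].add(x)
                    fill.add((x, y))
    return fill

print(symbolic_fill(4, [(0, 1), (0, 2), (1, 3)]))   # {(1, 2), (2, 3)}
```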
Algorithmic phases in sparse GE
1. Minimize the number of fill-ins, maximize parallelism
! The sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex relabeling of the underlying graph)
! Ordering (combinatorial algorithms; finding the optimum is NP-complete [Yannakakis '81], so heuristics are used)
2. Predict the fill-in positions in L & U
! Symbolic factorization (combinatorial algorithms)
3. Design efficient data structures for storage and quick retrieval of the nonzeros
! Compressed storage schemes
4. Perform the factorization and triangular solutions
! Numerical algorithms (floating-point operations only on nonzeros)
! Usually dominates the total runtime
! For sparse Cholesky and QR the steps can be separate; for sparse LU with pivoting, steps 2 and 4 may be interleaved. (A sketch of how these phases surface behind a library interface follows.)
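In library interfaces these four phases hide behind a single call. A small sketch with SciPy's SuperLU wrapper, comparing the natural ordering against COLAMD via `permc_spec` and reporting the resulting fill ratio (the matrix is random, so the exact numbers are illustrative only):

```python
# Phase 1 (ordering) is exposed through permc_spec; phases 2-4 happen
# inside splu. The fill ratio (|L| + |U|) / |A| shows the ordering's effect.
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
A = (sp.random(n, n, density=0.002, format="csc")
     + sp.identity(n, format="csc"))
for ordering in ["NATURAL", "COLAMD"]:
    lu = spla.splu(A, permc_spec=ordering)
    fill = (lu.L.nnz + lu.U.nnz) / A.nnz
    print(f"{ordering:8s}  fill ratio = {fill:.1f}")
```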
Distributed-memory parallelization
! 2D block-cyclic matrix distribution
! Scalability challenges:
! high degree of data & task dependency (DAG)
! irregular, indirect memory access
! low arithmetic intensity

For j = 1, 2, …, number of supernodes:
1. Block LU factorization: L(j,j) U(j,j) ← LU(A(j,j))
2. L update: L(k,j) ← A(k,j) U(j,j)⁻¹, k > j
3. U update: U(j,k) ← L(j,j)⁻¹ A(j,k), k > j
4. Rank-k update: A(i,k) ← A(i,k) − L(i,j) U(j,k), i, k > j
(A dense blocked analogue of this loop is sketched below.)
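A dense, in-place analogue of this right-looking blocked loop (a sketch only: no pivoting, sequential, with `scipy.linalg.solve_triangular` doing the panel solves; SuperLU_DIST's supernodal kernels are far more involved):

```python
# Steps 1-4 of the blocked right-looking LU, on a dense matrix, in place:
# factor the diagonal block, solve for the L and U panels, then apply
# the rank-nb update to the trailing submatrix.
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, nb):
    A = A.copy()                      # packed L\U factors on return
    n = A.shape[0]
    for j in range(0, n, nb):
        e = min(j + nb, n)
        for k in range(j, e):         # 1. unblocked LU of A(j,j)
            A[k + 1:e, k] /= A[k, k]
            A[k + 1:e, k + 1:e] -= np.outer(A[k + 1:e, k], A[k, k + 1:e])
        if e < n:
            # 2. L panel: L(k,j) = A(k,j) U(j,j)^{-1}
            A[e:, j:e] = solve_triangular(A[j:e, j:e], A[e:, j:e].T,
                                          lower=False, trans="T").T
            # 3. U panel: U(j,k) = L(j,j)^{-1} A(j,k)  (unit-diagonal L)
            A[j:e, e:] = solve_triangular(A[j:e, j:e], A[j:e, e:],
                                          lower=True, unit_diagonal=True)
            # 4. rank-nb update of the trailing matrix
            A[e:, e:] -= A[e:, j:e] @ A[j:e, e:]
    return A

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8)) + 8 * np.eye(8)   # safe without pivoting
F = blocked_lu(M, nb=3)
L, U = np.tril(F, -1) + np.eye(8), np.triu(F)
print(np.allclose(L @ U, M))                       # True
```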
SuperLU_DIST 2.5 on Cray XE6 (hopper@nersc)
! Profiling using IPM
! Synchronization dominates on a large number of cores: up to 96% of the factorization time
[Plots: factorization vs. communication time over 8-2048 cores. Left: Accelerator (sym), n = 2.7M, fill-ratio = 12. Right: DNA, n = 445K, fill-ratio = 609]
SuperLU_DIST 3.0: better DAG scheduling
! Implemented new static scheduling and flexible look-ahead algorithms that shortened the critical path.
! Idle time was significantly reduced (speedup up to 2.6x)
! To further improve performance:
! more sophisticated scheduling schemes
! hybrid programming paradigms
[Figure: block-cyclic task DAG with the look-ahead window]
[Plots: factorization/communication time, version 2.5 vs. version 3.0, over 8-2048 cores. Left: Accelerator, n = 2.7M, fill-ratio = 12. Right: DNA, n = 445K, fill-ratio = 609]
Performance of larger matrices
v Sparsity ordering: METIS applied to the structure of A'+A

| Name | Application | Data type | N | \|A\|/N (sparsity) | \|L\U\| (10^6) | Fill-ratio |
|---|---|---|---|---|---|---|
| matrix211 | Fusion, MHD eqns (M3D-C1) | Real | 801,378 | 161 | 1276.0 | 9.9 |
| cc_linear2 | Fusion, MHD eqns (NIMROD) | Complex | 259,203 | 109 | 199.7 | 7.1 |
| matick | Circuit sim., MNA method (IBM) | Complex | 16,019 | 4005 | 64.3 | 1.0 |
| cage13 | DNA electrophoresis | Real | 445,315 | 17 | 4550.9 | 608.5 |
Strong scaling: MPI, Cray XE6 (hopper@nersc)
v Up to 1.4 Tflops factorization rate
§ 2 × 12-core AMD 'Magny-Cours' per node, 2.1 GHz
Variety of node architectures
Cray XE6: dual-socket x 2-die x 6-core, 24 cores Cray XC30: dual-socket x 8-core, 16 cores
Cray XK7: 16-core AMD + K20X GPU Intel MIC: 16-core host + 60+ cores co-processor
Multicore / GPU-Aware SuperLU
! New hybrid programming code: MPI+OpenMP+CUDA, able to use all the CPUs and GPUs on manycore computers.
! Algorithmic changes:
! Aggregate small BLAS operations into larger ones.
! Multithreaded CPU scatter/gather operations.
! Hide long-latency operations.
! Results: on a 100-node GPU cluster, up to 2.7x faster, 2x-5x memory saving.
! New SuperLU_DIST 4.0 release, August 2014.
CPU + GPU algorithm
① Aggregate small blocks → ② GEMM of large blocks → ③ Scatter

GPU acceleration: software pipelining overlaps GPU execution with the CPU Scatter and data transfer.
Software issues
! Use preprocessing to produce 4 versions {s, d, c, z}
! Creating the macro-enabled "basefile" the first time is clumsy; later maintenance is easier.
! "template" in C++ is better.
! Performance portability?
! Need to adjust the block size for each architecture
• Larger blocks are better for a uniprocessor
• Smaller blocks are better for parallelism and load balance
! Open problem: automatic tuning of the block size?
! Flexible interface?
! Example: block diagonal preconditioner M⁻¹Ax = M⁻¹b with M = diag(A11, A22, A33) → use SuperLU_DIST for each diagonal block (see the sketch after this list)
! No explicit funding for user support (other than SciDAC apps).
[Figure: block diagonal matrix with blocks A11, A22, A33]
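A hedged sketch of that block-diagonal preconditioner in SciPy (one sequential `splu` per diagonal block stands in for a SuperLU_DIST factorization per block; the tridiagonal A and the block cuts are toy choices):

```python
# Block-diagonal preconditioner M = diag(A11, A22, A33): factor each
# diagonal block directly, apply M^{-1} block-wise inside GMRES.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_diag_prec(A, cuts):
    blocks = list(zip(cuts[:-1], cuts[1:]))
    lus = [spla.splu(A[i:j, i:j].tocsc()) for i, j in blocks]
    def apply_Minv(b):
        return np.concatenate([lu.solve(b[i:j])
                               for lu, (i, j) in zip(lus, blocks)])
    return spla.LinearOperator(A.shape, matvec=apply_Minv)

n = 3000
A = sp.diags([-1.0, 2.1, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
Minv = block_diag_prec(A, cuts=[0, 1000, 2000, 3000])
x, info = spla.gmres(A, np.ones(n), M=Minv)
print("converged" if info == 0 else f"gmres info = {info}")
```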
Towards exascale
! Exascale machines will have hierarchical organization
! Hierarchical memory; NUMA nodes: multicore, manycore
! Exascale applications will encompass multiphysics (coupled PDEs) and multiscale (time and space)
! Hierarchical algorithms and parallelism match machine and application features

Studying two classes of algorithms for sparse linear systems:
1. Domain decomposition hybrid method
! general algebraic solver
2. Low-rank factorization employing hierarchical matrices and randomization
! target PDE applications
1. Domain decomposition, Schur complement (PDSLin: http://portal.nersc.gov/project/sparse/pdslin/)

1. Graph-partition into subdomains; A11 is block diagonal:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \qquad \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} D_1 & & & E_1 \\ & \ddots & & \vdots \\ & & D_k & E_k \\ F_1 & \cdots & F_k & A_{22} \end{pmatrix}$$

2. Schur complement (S couples the interface/separator variables; no need to form it explicitly):

$$S = A_{22} - A_{21}A_{11}^{-1}A_{12} = A_{22} - \left(U^{-T}A_{21}^{T}\right)^{T}\left(L^{-1}A_{12}\right) = A_{22} - W\cdot G, \qquad \text{where } A_{11} = LU$$

3. Hybrid solution method (a sketch follows below):

(1) x₂ = S⁻¹(b₂ − A₂₁A₁₁⁻¹b₁) ← iterative solver
(2) x₁ = A₁₁⁻¹(b₁ − A₁₂x₂) ← direct solver
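A minimal sketch of this hybrid scheme in SciPy, with S applied as an operator so it is never formed (toy matrices; `splu` stands in for the parallel direct solver on the subdomains):

```python
# Hybrid Schur-complement solve: direct factorization of A11, GMRES on
# the implicitly applied S = A22 - A21 A11^{-1} A12, then back-substitute.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def hybrid_solve(A11, A12, A21, A22, b1, b2):
    lu11 = spla.splu(A11.tocsc())                 # interior (direct)
    S = spla.LinearOperator(
        A22.shape,
        matvec=lambda y: A22 @ y - A21 @ lu11.solve(A12 @ y))
    x2, info = spla.gmres(S, b2 - A21 @ lu11.solve(b1))   # (1) iterative
    x1 = lu11.solve(b1 - A12 @ x2)                         # (2) direct
    return x1, x2

n1, n2 = 400, 50
A11 = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n1, n1), format="csc")
A12 = sp.random(n1, n2, density=0.02, format="csc", random_state=2)
A21 = A12.T.tocsc()
A22 = 4.0 * sp.identity(n2, format="csc")
x1, x2 = hybrid_solve(A11, A12, A21, A22, np.ones(n1), np.ones(n2))
```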
Hierarchical parallelism
! Multiple processors per subdomain: one subdomain with 2×3 procs (e.g., SuperLU_DIST)
! Advantages:
! constant number of subdomains, Schur size, and convergence rate, regardless of core count
! needs only a modest level of parallelism from the direct solver
[Figure: block system with subdomains D1-D4, borders E1-E4, F1-F4, and A22; each subdomain assigned a processor group P(0:5), P(6:11), P(12:17), P(18:23)]
PDSLin in Omega3P: Cryomodule

PIP2 cryomodule consisting of 8 cavities

Computation parameters
§ 2.3M elements
§ First-order finite elements (p = 1)
- 39M nonzeros, 2.5M DOFs
- Solution time on hopper using 50 nodes and 600 cores: 863 ms (total)
§ Second-order finite elements (p = 2)
- 590M nonzeros, 14M DOFs
- Solution time on edison using 400 nodes, 4800 cores: 5:40 min (wall)
- Using MUMPS with 400 nodes, 800 cores, solution time: 6:46 min (wall)
New mathematical algorithms
! K-way, multi-constraint graph partitioning
! small separator, similar subdomains, similar connectivity
! both intra- and inter-group load balance
! Sparse triangular solution with many sparse RHS (intra-subdomain)
! Sparse matrix-matrix multiplication (inter-subdomain)

$$S = A_{22} - \sum_{l}\left(U_l^{-T}F_l^{T}\right)^{T}\left(L_l^{-1}E_l\right) = A_{22} - \sum_{l} W_l \cdot G_l, \qquad \text{where } D_l = L_l U_l$$

with sparsified interface computations:

$$W \leftarrow \mathrm{sparsify}(W, \sigma_1); \quad G \leftarrow \mathrm{sparsify}(G, \sigma_1); \quad T^{(p)} \leftarrow W^{(p)} \cdot G^{(p)}; \quad S^{(p)} \leftarrow A_{22}^{(p)} - \sum_q T^{(q)(p)}; \quad S \leftarrow \mathrm{sparsify}(S, \sigma_2)$$

I. Yamazaki, F.-H. Rouet, X.S. Li, B. Uçar, "On partitioning and reordering problems in a hierarchically parallel hybrid linear solver", IPDPS / PDSEC Workshop, May 24, 2013.
2. HSS-embedded sparse factorization
! Dense, but data-sparse: hierarchically semi-separable (HSS) structure
! PDEs with smooth kernels: the off-diagonal blocks are rank deficient
! Recursion leads to a hierarchical partitioning
! Key to low complexity: nested bases
! Sparse case: apply HSS to the dense separators/supernodes (a one-level illustration follows)

$$A \approx \begin{pmatrix} \begin{pmatrix} D_1 & U_1B_1V_2^T \\ U_2B_2V_1^T & D_2 \end{pmatrix} & \begin{pmatrix} U_1R_1 \\ U_2R_2 \end{pmatrix} B_3 \begin{pmatrix} W_4^TV_4^T & W_5^TV_5^T \end{pmatrix} \\[4pt] \begin{pmatrix} U_4R_4 \\ U_5R_5 \end{pmatrix} B_6 \begin{pmatrix} W_1^TV_1^T & W_2^TV_2^T \end{pmatrix} & \begin{pmatrix} D_4 & U_4B_4V_5^T \\ U_5B_5V_4^T & D_5 \end{pmatrix} \end{pmatrix}$$

[Figure: HSS tree]

Nested tree-parallelism:
• Outer tree: separator tree
• Inner tree: HSS tree
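The heart of the HSS compression, at a single level, is just a low-rank factorization of an off-diagonal block. A flat (non-nested, non-recursive) illustration with a smooth-kernel stand-in matrix:

```python
# Off-diagonal blocks of a smooth-kernel matrix are numerically low-rank:
# A12 ~ U1 B1 V2^T with rank << n/2. Real HSS recurses and nests the bases.
import numpy as np

n = 400
x = np.linspace(0.0, 1.0, n)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))   # smooth kernel stand-in
A12 = A[: n // 2, n // 2:]

U, s, Vt = np.linalg.svd(A12, full_matrices=False)
k = int(np.sum(s > 1e-8 * s[0]))                    # numerical rank at 1e-8
A12_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(k, np.linalg.norm(A12 - A12_k) / np.linalg.norm(A12))
```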
3D Helmholtz
! Helmholtz equation with PML boundary:

$$\left(-\Delta - \frac{\omega^2}{v(x)^2}\right) u(x,\omega) = s(x,\omega)$$

! N = 300³ = 27M, procs = 1024
! Max rank = 1391 (tolerance = 1e-4)

| | Time (s) | Gflops (peak %) | Comm % | Mem (GB) |
|---|---|---|---|---|
| MF | 4206 | 2385 (27.7%) | 32.6% | 3144 |
| MF + HSS | 2171 | 2511 (29.2%) | 41.2% | 1104 |
| (of which HSS compression) | 1388 | | 15.3% | |
New compression kernel: randomized sampling
! Traditional methods: SVD, rank-revealing QR
! difficult to scale up
! extend-add of HSS structures of different shapes
! Randomized sampling (sketched below):
1. Pick a random matrix Ω of size n × (k+p), with small oversampling p (e.g., 10)
2. Form the sample matrix S = AΩ
3. Compute Q = ON-basis(S), an orthonormal basis of S
Accuracy: with high probability ≥ 1 − 6p⁻ᵖ,

$$\left\|A - QQ^{*}A\right\| \le \left(1 + 11\sqrt{(k+p)\cdot\min(m,n)}\right)\,\sigma_{k+1}$$

! Benefits:
! the kernel becomes a dense matrix-matrix multiply
! extend-add of tall-skinny dense matrices of conforming shapes
! scalable and resilient algorithms exist
! even faster if a fast matrix-vector multiply is available (e.g., FMM)
! matrix-free solver, if only the matrix-vector action is available
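A minimal sketch of the randomized range finder itself (the Halko-Martinsson-Tropp style construction the bound above comes from), checking the error against σ_{k+1} on a smooth-kernel stand-in:

```python
# Randomized sampling: one dense multiply A @ Omega captures the range of A
# up to rank k (with oversampling p), then QR orthonormalizes the sample.
import numpy as np

def randomized_range(A, k, p=10, seed=0):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + p))   # 1. random test matrix
    S = A @ Omega                                      # 2. sample matrix (GEMM)
    Q, _ = np.linalg.qr(S)                             # 3. orthonormal basis
    return Q

n, k = 500, 20
x = np.linspace(0.0, 1.0, n)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))      # smooth kernel stand-in
Q = randomized_range(A, k)
err = np.linalg.norm(A - Q @ (Q.T @ A), 2)
print(err, np.linalg.svd(A, compute_uv=False)[k])      # err ~ sigma_{k+1}
```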
Summary, forward looking …
! Direct solvers can scale to 1000s of cores
! Domain-decomposition hybrid solvers can scale to 10,000s of cores, while also maintaining robustness
! Expect to scale further with low-rank structured factorization methods
! Extend to a general solver framework; examine feasibility on a wider class of problems