A Scalable Algebraic Multigrid Implementation for Non-Linear Elasticity Problems
Aurel Neic
Institute of Biomechanics
Medical University Graz
September 17, 2014
Outline

1. AMG for elasticity problems
   - Introduction
   - Adaptation
2. Parallel Scaling
   - Redistribution of AMG Hierarchy
   - Hybrid MPI/OpenMP Parallelization
Aurel Neic (Medical University Graz) AMG for Non-Linear Elasticity Problems September 17, 2014 2 / 25
AMG for elasticity problems
Introduction
Introduction
- Solve $K u = f$ using a hierarchy of equation systems
  $$K^\ell u^\ell = f^\ell, \quad \ell = 0, \dots, L$$
- The hierarchy is constructed from the algebraic operator $K$:
  $$K^{\ell+1} = R^\ell K^\ell P^\ell$$
Exact Two-Grid Method
- Block system
  $$\begin{pmatrix} K_{CC} & K_{CF} \\ K_{FC} & K_{FF} \end{pmatrix} \begin{pmatrix} u_C \\ u_F \end{pmatrix} = \begin{pmatrix} f_C \\ f_F \end{pmatrix}$$
- Schur decomposition
  $$\begin{pmatrix} K_{CC} & K_{CF} \\ K_{FC} & K_{FF} \end{pmatrix} = \begin{pmatrix} I & K_{CF} K_{FF}^{-1} \\ & I \end{pmatrix} \begin{pmatrix} S_C & \\ & K_{FF} \end{pmatrix} \begin{pmatrix} I & \\ K_{FF}^{-1} K_{FC} & I \end{pmatrix}$$
  $$S_C = K_{CC} - K_{CF} K_{FF}^{-1} K_{FC}$$
- Exact two-grid method
  $$P := \begin{pmatrix} I \\ -K_{FF}^{-1} K_{FC} \end{pmatrix}, \quad R := \begin{pmatrix} I & -K_{CF} K_{FF}^{-1} \end{pmatrix}, \quad G := \begin{pmatrix} 0 & 0 \\ 0 & K_{FF}^{-1} \end{pmatrix}$$
  $$S_C = R K P, \quad M = (I - GK)(I - P S_C^{-1} R K) = 0$$
  $$u^{k+1} := (I - M) K^{-1} f + M u^k, \quad k \in \mathbb{N}$$
Introduction
- Strong connection criterion
  $$|K[i,j]| > \varepsilon \, |K[i,i]|$$
- Construction of the prolongation operator
  $$P := \begin{pmatrix} I_{CC} \\ P_{FC} \end{pmatrix}$$
  $$n_i := | \{ j \in C : |K[i,j]| > \varepsilon \, |K[i,i]| \} |$$
  $$P_{FC}[i,j] := \begin{cases} 1/n_i & \text{if } |K[i,j]| > \varepsilon \, |K[i,i]| \\ 0 & \text{otherwise} \end{cases}$$
AMG for elasticity problems
Adaptation
Adaptation of AMG for Elasticity Problems
- Extension of the available AMG algorithm.
- Auxiliary matrix for coarsening, based on the nodal block norms $\|K^{3\times 3}_{i,j}\|$
- Extended prolongation operator, based on the auxiliary matrix
- Future adaptations: Vanka smoothers, K-cycle AMG
Parallel Scaling
Redistribution of AMG Hierarchy
Introduction
- Multigrid methods are efficient as solvers and preconditioners: stable convergence w.r.t. grid refinement and spatial resolution
- Commonly used for elliptic PDEs.
- With the rise of multicore computing architectures, parallel scalability has become a concern (Baker, A. H., Schulz, M., and Yang, U. M., "On the performance of an algebraic multigrid solver on multicore clusters")
- The main bottleneck is communication on the intermediate grids
Scaling Problems
Compare AMG-PCG with Jacobi-PCG as a limit case.
[Figure: runtime [sec] (log scale) over the number of processes (16-256) for AMG-PCG and Jacobi-PCG]
Scaling Problems
- Fine grid
  - Big subdomains, interfaces, and message sizes
  - Sparse communication pattern
  - → point-to-point type
- Coarse grid
  - Small subdomains, interfaces, and message sizes
  - Few processes, accumulation of interface values
  - → scatter/gather type
- Intermediate grids
  - Small subdomains, interfaces, and message sizes
  - Still most of the processes, accumulation of interface values
  - → all-to-all type
Redistribution of MG hierarchy
- Common solution: reduce parallelism
  - Redistribute the parallel data onto fewer processes
- Open questions:
  - When to redistribute: once, multiple times, systematically?
  - Redistribution pattern: based on which criteria?
Proposed Method
- Systematic reduction of the number of processes; the reduction rate is configurable.
- Redistribution in a block pattern.

Reasoning:
- Systematic:
  - The number of processes is reduced only logarithmically; the reduction of unknowns should be much higher
  - Robust configuration that does not need fine-tuning
- Block pattern:
  - Should match the hardware layout
  - Intermediate levels are redistributed in local memory
Block Pattern
Figure: Redistribution pattern for a blocksize of 2. Red: Active Ranks, Blue: Idle Ranks
- Can be computed from the process rank alone if the assignment to computing units is contiguous.
- With 16 processes per compute node, the first 4 redistributions take place in shared memory!
AMG Setup
1. Coarsening
   1.1 Coarse-grid selection
   1.2 Generation of prolongation/restriction
   1.3 Triple-matrix product
2. Redistribution
   2.1 Active-idle selection
   2.2 Data transfer
   2.3 Matrix accumulation
Adapted V-Cycle
Require: K^ℓ_p, f^ℓ_p, u^ℓ_p, r^ℓ_p, P^ℓ_p, J^ℓ_p, G, a ∈ G, A_p, active_p; ℓ = 1, ..., L

if ℓ < L then
    if active_p[ℓ] = true then
        u^ℓ_p ← 0
        u^ℓ_p ← J^ℓ_p(u^ℓ_p, f^ℓ_p)                   {Pre-smooth}
        r^ℓ_p ← f^ℓ_p − K^ℓ_p u^ℓ_p                   {Compute residual}
        f^{ℓ+1}_p ← (P^ℓ_p)^T r^ℓ_p                   {Restrict residual}
        f^{ℓ+1}_a ← Σ_{i∈G} A_a A_i^T f^{ℓ+1}_i       {Gather}
    end if
    multigrid(ℓ + 1)                                  {Recursion}
    if active_p[ℓ] = true then
        u^{ℓ+1}_i ← A_i A_a^T u^{ℓ+1}_a, i ∈ G        {Scatter}
        c^ℓ_p ← P^ℓ_p u^{ℓ+1}_p                       {Prolongate correction}
        u^ℓ_p ← u^ℓ_p + c^ℓ_p                         {Add correction}
        u^ℓ_p ← J^ℓ_p(u^ℓ_p, f^ℓ_p)                   {Post-smooth}
    end if
else
    if active_p[L] = true then
        u^L_p ← (K^L_p)^{-1} f^L_p                    {Direct solve}
    end if
end if
Benchmarks
[Figure: runtime [sec] (log scale) over the number of processes (16-256) for Block-AMG-PCG, Default-AMG-PCG, and Jacobi-PCG]
Benchmarks
[Figure: runtime [sec] (log scale) over the number of processes (64-2048) for Block-AMG g=2, Block-AMG g=4, AMG, and Jacobi]
Parallel Scaling
Hybrid MPI/OpenMP Parallelization
Motivation
- Increase the number of OpenMP threads per MPI process
  - SuperMUC thin node: 16 cores, 32 GB RAM
  - SuperMUC fat node: 40 cores, 256 GB RAM
- Pin OpenMP threads to computing tasks
  - Desktop: OpenMPI rankfiles
  - Server: OpenMPI rankfiles, custom modules
- Advantage: more computing tasks with less communication
Motivation
Matrix-Vector benchmarks:
          runtime [sec]      speedup
# cores    OMP      MPI     OMP     MPI
      1    4.579    4.336   1.00    1.00
      2    2.306    2.161   1.99    2.01
      4    1.347    1.123   3.40    3.86
      6    1.133    1.009   4.04    4.30
     12    0.565    0.378   8.10   11.48

OpenMP kernels show reduced computational efficiency.
Implementation
- OpenMP parallelization only for the computational kernels
- Independent of the MPI parallelization
- Scheduler of choice: guided
  - A background load should be assumed → dynamic scheduling is usually better than static
  - The optimal chunk size is unknown or varying → guided scheduling decreases the chunk size down to a defined minimum
- The optimal number of threads per process is hardware- and problem-dependent
- Parallelization across multiple CPUs is disadvantageous
Benchmarks
A reduced number of MPI processes is advantageous for a large number of threads.
[Figure: runtime [sec] (log scale) over the number of cores (64-2048) for HYB8, HYB4, and MPI]
Thank you!
The research was funded by the Austrian Science Fund (FWF)