
Parallel High-Order Geometric Multigrid Methods on Adaptive Meshes for Highly Heterogeneous Nonlinear Stokes Flow Simulations of Earth's Mantle

Johann Rudi [1], Hari Sundar [2], Tobin Isaac [1], Georg Stadler [3], Michael Gurnis [4], and Omar Ghattas [1,5]

[1] Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin, USA
[2] School of Computing, The University of Utah, USA
[3] Courant Institute of Mathematical Sciences, New York University, USA
[4] Seismological Laboratory, California Institute of Technology, USA
[5] Jackson School of Geosciences and Department of Mechanical Engineering, The University of Texas at Austin, USA


Summary of main results

I. Efficient methods/algorithms

- High-order finite elements
- Adaptive meshes, resolving viscosity variations
- Geometric multigrid (GMG) preconditioners for elliptic operators
- Novel GMG-based BFBT/LSC pressure Schur complement preconditioner
- Inexact Newton-Krylov method
- H^{-1}-norm for velocity residual in Newton line search

II. Scalable parallel implementation

- Matrix-free stiffness/mass application and GMG smoothing
- Tensor product structure of finite element shape functions
- Octree algorithms for handling adaptive meshes in parallel
- Algebraic multigrid (AMG) as coarse solver for GMG avoids full AMG setup cost and large matrix assembly
- Parallel scalability results up to 16,384 CPU cores (MPI)

1. Mantle flow

Model equations for Earth mantle flow

Rock in the mantle moves like a viscous, incompressible fluid on time scales of millions of years. From conservation of mass and momentum, we obtain that the flow velocity can be modeled as a nonlinear Stokes system:

$$
-\nabla \cdot \left[ \mu(T, \mathbf{u}) \left( \nabla\mathbf{u} + \nabla\mathbf{u}^\top \right) \right] + \nabla p = \mathbf{f}(T), \qquad
\nabla \cdot \mathbf{u} = 0,
$$

where u is the velocity, p the pressure, T the temperature, and mu the viscosity.

The viscosity depends exponentially on the temperature (via an Arrhenius relationship), on a power of the second invariant of the strain rate tensor, and incorporates plastic yielding. The right-hand side forcing, f, is derived from the Boussinesq approximation and depends on the temperature.
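The poster does not state the exact rheology; as a hedged illustration only, an effective viscosity of the kind described (Arrhenius temperature dependence, power-law dependence on the second strain-rate invariant, plastic yielding) typically takes a form such as

$$
\mu_{\text{eff}}(T, \mathbf{u}) \;\approx\; \min\!\left( a \, e^{E/(R\,T)} \, \dot\varepsilon_{II}(\mathbf{u})^{\frac{1-n}{n}},\; \frac{\sigma_y}{2\,\dot\varepsilon_{II}(\mathbf{u})} \right),
$$

where $\dot\varepsilon_{II}$ is the second invariant of the strain rate tensor, $n$ a power-law exponent, and $\sigma_y$ a yield stress; the constants $a$, $E$, $R$ and the specific functional composition are illustrative assumptions, not the poster's model.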

Figure: Effective viscosity field and adaptive mesh resolving narrow plate boundaries (shown in red). (Visualization by L. Alisic)

Solver challenges

What causes the demand for scalable solvers for high-order discretizations on adaptive grids? The severe nonlinearity, heterogeneity, and anisotropy of the Earth's rheology:

- Up to 6 orders of magnitude viscosity contrast; sharp viscosity gradients due to decoupling at plate boundaries
- Wide range of spatial scales and highly localized features with respect to Earth radius (~6371 km): plate thickness ~50 km and shearing zones at plate boundaries ~5 km
- Desired resolution of ~1 km results in O(10^12) degrees of freedom on a uniform mesh of Earth's mantle; therefore adaptive mesh refinement is essential (a rough count follows this list)
- Demand for high accuracy leads to high-order discretizations
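As a rough check of the $O(10^{12})$ figure (assuming the mantle shell spans from the core-mantle boundary at radius $\approx 3480$ km to the surface at $\approx 6371$ km, a standard value not stated on the poster):

$$
V \;\approx\; \tfrac{4}{3}\pi \left( 6371^3 - 3480^3 \right) \,\mathrm{km}^3 \;\approx\; 9 \times 10^{11} \,\mathrm{km}^3,
$$

so a uniform mesh with ~1 km elements has on the order of $10^{12}$ cells, and with several velocity and pressure unknowns per element the degree-of-freedom count reaches $O(10^{12})$ and beyond.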

2. Scalable parallel Stokes solver

Parallel octree-based adaptive mesh refinement (p4est)

- Identify octree leaves with hexahedral elements
- Octree structure enables fast parallel adaptive octree/mesh refinement and coarsening
- Octrees and space-filling curves enable fast neighbor search, repartitioning, and 2:1 balancing in parallel (a minimal illustration of the space-filling-curve ordering follows this list)
- Algebraic constraints on non-conforming element faces with hanging nodes enforce global continuity of the velocity basis functions
- Demonstrated scalability to O(500K) cores (MPI)
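To make the space-filling-curve idea concrete, here is a minimal, hypothetical sketch in plain Python (not the p4est API): interleaving the bits of integer octant coordinates gives a Morton (Z-order) key, sorting leaves by that key orders them along a space-filling curve, and cutting the sorted list into equal chunks yields a load-balanced partition.

```python
# Illustrative only: a minimal Morton (Z-order) key for octree leaves.
# Sorting leaves by this key and splitting the sorted list into equal chunks
# is the basic idea behind space-filling-curve repartitioning.

def morton_key(x: int, y: int, z: int, level_bits: int = 10) -> int:
    """Interleave the bits of integer octant coordinates (x, y, z)."""
    key = 0
    for bit in range(level_bits):
        key |= ((x >> bit) & 1) << (3 * bit)
        key |= ((y >> bit) & 1) << (3 * bit + 1)
        key |= ((z >> bit) & 1) << (3 * bit + 2)
    return key

def partition(leaves, num_ranks):
    """Assign leaves (given as integer coordinate triples) to ranks in Morton order."""
    ordered = sorted(leaves, key=lambda c: morton_key(*c))
    chunk = -(-len(ordered) // num_ranks)  # ceiling division
    return [ordered[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]

if __name__ == "__main__":
    leaves = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
    parts = partition(leaves, num_ranks=4)
    print([len(p) for p in parts])  # -> [16, 16, 16, 16]
```

p4est performs the analogous operations distributed over MPI ranks, together with neighbor search and 2:1 balancing; this sketch only illustrates the ordering.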

High-order finite element discretization of the Stokes system

The continuous problem

$$
-\nabla \cdot \left[ \mu \left( \nabla\mathbf{u} + \nabla\mathbf{u}^\top \right) \right] + \nabla p = \mathbf{f}, \qquad
\nabla \cdot \mathbf{u} = 0
$$

is discretized with high-order finite elements into the block system

$$
\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix}
\begin{bmatrix} \mathbf{u} \\ p \end{bmatrix}
=
\begin{bmatrix} \mathbf{f} \\ 0 \end{bmatrix}.
$$

- High-order finite element shape functions
- Inf-sup stable velocity-pressure pairings: Q_k x P^{disc}_{k-1} with k >= 2
- Locally mass conservative due to discontinuous pressure space
- Fast, matrix-free application of stiffness and mass matrices
- Hexahedral elements allow exploiting the tensor product structure of basis functions for a high ratio of floating point to memory operations (see the sum-factorization sketch after this list)
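As an illustration of the tensor-product structure, the following minimal sketch (plain Python/numpy assumed; not the poster's implementation) applies a 3D element mass matrix built from a 1D matrix M1 as M1 x M1 x M1 (Kronecker product) through three small contractions, i.e. sum factorization, without ever forming the n^3-by-n^3 matrix.

```python
import numpy as np

def apply_tensor_mass(M1: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Apply (M1 kron M1 kron M1) to u without forming the 3D matrix.

    M1: (n, n) 1D matrix; u: (n, n, n) element DOF array (C-ordered).
    Three contractions cost O(n^4) per element instead of O(n^6).
    """
    v = np.einsum('ai,ijk->ajk', M1, u)   # contract the x-direction
    v = np.einsum('bj,ajk->abk', M1, v)   # contract the y-direction
    v = np.einsum('ck,abk->abc', M1, v)   # contract the z-direction
    return v

if __name__ == "__main__":
    n = 3                                  # e.g. 1D Q2 shape functions
    rng = np.random.default_rng(0)
    M1 = rng.random((n, n))
    M1 = 0.5 * (M1 + M1.T)
    u = rng.random((n, n, n))
    # reference: assemble the Kronecker product explicitly and compare
    M3 = np.kron(np.kron(M1, M1), M1)
    ref = (M3 @ u.ravel()).reshape(n, n, n)
    print(np.allclose(apply_tensor_mass(M1, u), ref))   # -> True
```

The same contraction pattern, combined with 1D differentiation matrices and per-quadrature-point geometric and viscosity factors, underlies matrix-free stiffness application on hexahedral elements.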

Linear solver: Preconditioned Krylov subspace method

Fully coupled iterative solver: GMRES with upper triangular block preconditioning,

$$
\underbrace{\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix}}_{\text{Stokes operator}}
\underbrace{\begin{bmatrix} \tilde A & B^\top \\ 0 & -\tilde S \end{bmatrix}^{-1}}_{\text{preconditioner}}
\begin{bmatrix} \mathbf{u}' \\ p' \end{bmatrix}
=
\begin{bmatrix} \mathbf{f} \\ 0 \end{bmatrix}.
$$

Approximating the inverse of the viscous stress block, $\tilde A^{-1} \approx A^{-1}$, is well suited for multigrid methods. Next, find an approximation for the inverse Schur complement, $\tilde S^{-1} \approx S^{-1} := (B A^{-1} B^\top)^{-1}$.
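For reference, applying the upper-triangular block preconditioner to a residual $(\mathbf{r}_u, r_p)$ is a standard block back-substitution: one approximate Schur complement solve followed by one approximate viscous block solve,

$$
\tilde p = -\tilde S^{-1} r_p, \qquad \tilde{\mathbf{u}} = \tilde A^{-1} \left( \mathbf{r}_u - B^\top \tilde p \right).
$$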

BFBT/LSC Schur complement approximation $\tilde S^{-1}$

Improved BFBT / Least Squares Commutator (LSC) method:

$$
\tilde S^{-1} = (B D^{-1} B^\top)^{-1} \, (B D^{-1} A D^{-1} B^\top) \, (B D^{-1} B^\top)^{-1}
$$

with diagonal scaling $D := \operatorname{diag}(A)$. Derived from the least squares minimizer of a commutation relationship of $A$ and $B^\top$ (sketched below). Here, approximating the inverse of the discrete pressure Laplacian, $(B D^{-1} B^\top)$, is well suited for multigrid methods.
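A brief sketch of the least-squares derivation referred to above (a reconstruction, not quoted from the poster): suppose a pressure-space matrix $P$ satisfies the commutation relationship $A D^{-1} B^\top \approx B^\top P$. If it held exactly, then $A^{-1} B^\top = D^{-1} B^\top P^{-1}$, so $S = B A^{-1} B^\top = (B D^{-1} B^\top) P^{-1}$ and $S^{-1} = P (B D^{-1} B^\top)^{-1}$. Choosing $P$ as the column-wise least-squares solution of $B^\top P \approx A D^{-1} B^\top$ in the $D^{-1}$-weighted norm gives the normal equations $B D^{-1} B^\top P = B D^{-1} A D^{-1} B^\top$, hence

$$
\tilde S^{-1} = (B D^{-1} B^\top)^{-1} (B D^{-1} A D^{-1} B^\top) (B D^{-1} B^\top)^{-1},
$$

which is exactly the formula above.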

3. Stokes solver robustness with scaled BFBT Schur complement approximation

The subducting plate model problem on a cross section of the spherical Earth domain serves as a benchmark for solver robustness.

Figure: Subduction model viscosity field.

Multigrid parameters: GMG for $\tilde A$: 1 V-cycle, 3+3 smoothing; AMG (PETSc's GAMG) for $(B D^{-1} B^\top)$: 3 V-cycles, 3+3 smoothing.

Robustness with respect to plate coupling strength

Figure: GMRES convergence histories (relative l2 residual norm vs. GMRES iteration; GMRES restarts marked) for plate coupling strengths 10^-4 (mild decoupling), 10^-5, and 10^-6 (strong decoupling). Convergence for solving Au = f (red), the Stokes system with BFBT (blue), and the Stokes system with viscosity-weighted mass matrix as Schur complement approximation (green) for comparison to conventional preconditioning.

Robustness with respect to plate boundary thickness

Figure: GMRES convergence histories (relative l2 residual norm vs. GMRES iteration; GMRES restarts marked) for plate boundary thicknesses of 10 km, 5 km, and 2 km. Convergence for solving Au = f (red), the Stokes system with BFBT (blue), and the Stokes system with viscosity-weighted mass matrix as Schur complement approximation (green) for comparison to conventional preconditioning.

4. Parallel adaptive high-order geometric multigrid

The multigrid hierarchy of nested meshes is generated from an adaptively refined octree-based mesh via geometric coarsening:

- Parallel repartitioning of coarser meshes for load-balancing; repartitioning of sufficiently coarse meshes on subsets of cores
- High-order L^2-projection of coefficients onto coarser levels; re-discretization of differential equations at each coarser geometric multigrid level
- As the coarse solver for geometric multigrid, AMG (PETSc's GAMG) is invoked on only small core counts
- Geometric multigrid for the pressure Laplacian is problematic due to the discontinuous modal pressure discretization P^{disc}_{k-1}; here, a novel approach is taken by re-discretizing with continuous nodal Q_k basis functions

Figure: Multigrid hierarchy of the viscous stress block $\tilde A$: high-order GMG on the finest levels, linear GMG on coarser levels, then AMG levels, with a direct solve at the coarsest level.

Figure: Multigrid hierarchy of the pressure Laplacian: smoothing with $(B D^{-1} B^\top)$ on the finest (discontinuous modal) level, then the same cascade of high-order GMG, linear GMG, AMG, and direct solve on the re-discretized continuous nodal hierarchy.

- GMG smoother: Chebyshev-accelerated Jacobi (PETSc) with matrix-free high-order stiffness apply; assembly of the high-order diagonal only
- GMG smoother for $(B D^{-1} B^\top)$, discontinuous modal: Chebyshev-accelerated Jacobi (PETSc) with matrix-free apply and assembled diagonal
- GMG restriction & interpolation: high-order L^2-projection; restriction and interpolation operators are adjoints of each other in the L^2 sense
- GMG restriction & interpolation for $(B D^{-1} B^\top)$: L^2-projection between discontinuous modal and continuous nodal spaces
- No collective communication needed in GMG cycles (a minimal V-cycle sketch follows this list)
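The following minimal V-cycle sketch (plain Python/numpy assumed; not the poster's matrix-free PETSc implementation) shows the structure described above: pre-smoothing, residual restriction, a coarse solve standing in for the AMG/direct coarse solver, prolongated correction, and post-smoothing. A damped Jacobi smoother and assembled 1D Poisson matrices with Galerkin coarsening stand in for the Chebyshev-accelerated Jacobi smoother and the re-discretized high-order operators.

```python
import numpy as np

def poisson_1d(n):
    """Assembled 1D Poisson (Dirichlet) matrix on n interior nodes."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def interpolation(nc):
    """Linear interpolation from nc coarse to 2*nc+1 fine interior nodes."""
    nf = 2 * nc + 1
    P = np.zeros((nf, nc))
    for j in range(nc):
        P[2 * j + 1, j] = 1.0
        P[2 * j, j] += 0.5
        P[2 * j + 2, j] += 0.5
    return P

def jacobi(A, b, x, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi smoothing (Chebyshev-accelerated Jacobi in the poster)."""
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def v_cycle(level, A, P, b, x, pre=3, post=3):
    """One V(3,3)-cycle on the hierarchy A[level], ..., A[0] (0 = coarsest)."""
    if level == 0:
        return np.linalg.solve(A[0], b)        # coarse solve (AMG/direct in practice)
    x = jacobi(A[level], b, x, pre)            # pre-smoothing
    r_c = P[level].T @ (b - A[level] @ x)      # restrict the residual
    e_c = v_cycle(level - 1, A, P, r_c, np.zeros_like(r_c), pre, post)
    x = x + P[level] @ e_c                     # prolongate and correct
    return jacobi(A[level], b, x, post)        # post-smoothing

if __name__ == "__main__":
    # Build a 4-level hierarchy by Galerkin coarsening A_c = P^T A P.
    levels = 4
    n = 2 ** (levels + 2) - 1
    A, P = [None] * levels, [None] * levels
    A[levels - 1] = poisson_1d(n)
    for l in range(levels - 1, 0, -1):
        nc = (A[l].shape[0] - 1) // 2
        P[l] = interpolation(nc)
        A[l - 1] = P[l].T @ A[l] @ P[l]
    b, x = np.ones(n), np.zeros(n)
    for _ in range(10):
        x = v_cycle(levels - 1, A, P, b, x)
    print("residual norm:", np.linalg.norm(b - A[levels - 1] @ x))
```

In the poster's setting the operators are applied matrix-free on the octree mesh hierarchy and the transfers are high-order L^2-projections; the recursion and smoothing structure are the same.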

5. Convergence dependence on mesh size and discretization order

h-dependence using geometric multigrid for $\tilde A$ and $(B D^{-1} B^\top)$

The mesh is increasingly refined while the discretization stays fixed at Q_2 x P^{disc}_1. Performed with the subducting plate model problem (see above). (Multigrid parameters: GMG for $\tilde A$: 1 V-cycle, 3+3 smoothing; GMG for $(B D^{-1} B^\top)$: 1 V-cycle, 3+3 smoothing.)

Figure: GMRES convergence histories (relative l2 residual norm vs. GMRES iteration; GMRES restarts marked) under mesh refinement. Panels: solve Au = f (velocity DOF 4.6M, 13.4M, 32.5M); solve (B D^{-1} B^T) p = g (pressure DOF 0.9M, 2.6M, 6.3M); solve the Stokes system (velocity + pressure DOF 5.5M, 16.0M, 38.8M).

p-dependence using geometric multigrid for $\tilde A$ and $(B D^{-1} B^\top)$

The discretization order of the finite element space increases while the mesh stays fixed. Performed with the subducting plate model problem (see above). (Multigrid parameters: GMG for $\tilde A$: 1 V-cycle, 3+3 smoothing; GMG for $(B D^{-1} B^\top)$: 1 V-cycle, 3+3 smoothing.)

Figure: GMRES convergence histories (relative l2 residual norm vs. GMRES iteration; GMRES restarts marked) under increasing discretization order. Panels: solve Au = f (Q2 through Q6); solve (B D^{-1} B^T) p = g (P1 through P5); solve the Stokes system (Q2-P1 through Q6-P5).

Remark: The deteriorating Stokes convergence with increasing order is due to a deteriorating approximation of the Schur complement by the BFBT method and not the multigrid components.

6. Parallel scalability of geometric multigrid

Global problem on adaptive mesh of the Earth

- Viscosity is generated from real Earth data
- Heterogeneous viscosity field exhibits 6 orders of magnitude variation
- Adaptively refined mesh (p4est library) down to ~0.5 km local resolution; Q_2 x P^{disc}_1 discretization
- Distributed memory parallelization with MPI

Stampede at the Texas Advanced Computing Center: 16 CPU cores per node (2 x 8-core Intel Xeon E5-2680), 32 GB main memory per node (8 x 4 GB DDR3-1600 MHz), 6,400 nodes, 102,400 cores total, InfiniBand FDR network.

Weak scalability using adaptively refined Earth mesh

Figure: Normalized time* relative to 2048 cores, based on the setup and solve times.

Solving for velocity u in Au = f:

  cores        2048   4096   8192   16384
  T/(N/P)      1      1.2    0.97   0.93
  T/(N/P)/G    1      1.24   1.15   1.34

Solving for pressure p in (B D^{-1} B^T) p = g:

  cores        2048   4096   8192   16384
  T/(N/P)      1      0.83   0.84   0.64
  T/(N/P)/G    1      1.09   1.08   1.18

*Normalization: T/(N/P) measures scalability of algorithms & implementation; T/(N/P)/G measures scalability of the implementation alone. T = setup + solve time, N = degrees of freedom (DOF), P = number of CPU cores, G = number of GMRES iterations. (A worked check using the velocity timings follows the table below.)

Detailed timings for solving Au = f

  #cores   velocity DOF   #levels (GMG, AMG)   setup time (s) GMG, AMG, total   solve time (s)   #iter
  2048     637M           7, 4                 10.2, 14.3, 24.6                 2298.0           402
  4096     1155M          7, 4                 12.8, 28.6, 41.4                 2482.5           389
  8192     2437M          8, 4                 15.2, 15.6, 30.8                 2129.5           339
  16384    5371M          8, 4                 29.2, 51.0, 80.2                 2198.4           279
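As a consistency check of the weak-scaling normalization (my arithmetic, using the 2048- and 4096-core rows above with $T$ = setup + solve):

$$
\frac{T_{4096}/(N_{4096}/P_{4096})}{T_{2048}/(N_{2048}/P_{2048})}
= \frac{(41.4 + 2482.5)\,/\,(1155\mathrm{M}/4096)}{(24.6 + 2298.0)\,/\,(637\mathrm{M}/2048)}
\approx \frac{8.95 \times 10^{-3}}{7.47 \times 10^{-3}}
\approx 1.2,
$$

matching the 4096-core $T/(N/P)$ value in the weak-scaling data above.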

Detailed timings for solving (B D^{-1} B^T) p = g

  #cores   pressure DOF   #levels (GMG, AMG)   setup time (s) GMG, AMG, total   solve time (s)   #iter
  2048     125M           7, 3                 11.3, 0.9, 12.2                  857.2            125
  4096     227M           7, 4                 12.3, 2.2, 14.5                  638.0            95
  8192     482M           8, 3                 18.1, 1.5, 19.7                  684.0            97
  16384    1042M          8, 4                 26.6, 9.1, 35.7                  546.2            68

Remark: The number of GMRES iterations until convergence decreases as the mesh is refined. This is due to increasingly better resolution of the variations in the viscosity.

Strong scalability using fixed adaptive Earth mesh

Figure: Parallel efficiency based on the setup and solve times (T = setup + solve).

Solving for velocity u in Au = f:

  Problem size 637M (efficiency rel. to 2K cores):   2K: 1,  4K: 0.85,  8K: 0.59,  16K: 0.5
  Problem size 1155M (efficiency rel. to 4K cores):  4K: 1,  8K: 0.91,  16K: 0.68

Solving for pressure p in (B D^{-1} B^T) p = g:

  Problem size 125M (efficiency rel. to 2K cores):   2K: 1,  4K: 0.88,  8K: 0.81,  16K: 0.69
  Problem size 227M (efficiency rel. to 4K cores):   4K: 1,  8K: 0.72,  16K: 0.57

Problem size: 637M, #iterations: 401 (±1)

  #cores   setup time (s) GMG, AMG, total   solve time (s)
  2K       10.2, 14.3, 24.6                 2298.0
  4K       8.6, 16.4, 25.0                  1327.5
  8K       12.5, 27.2, 39.7                 938.4
  16K      11.4, 24.3, 35.8                 544.9

Problem size: 1155M, #iterations: 388 (±1)

  #cores   setup time (s) GMG, AMG, total   solve time (s)
  2K       —                                —
  4K       12.8, 28.6, 41.4                 2482.5
  8K       10.2, 38.5, 48.7                 1325.9
  16K      17.0, 47.9, 64.9                 858.6

Problem size: 125M, #iterations: 125 (±2)

  #cores   setup time (s) GMG, AMG, total   solve time (s)
  2K       11.3, 0.9, 12.2                  857.2
  4K       7.8, 1.1, 8.9                    487.1
  8K       7.7, 1.5, 9.2                    256.3
  16K      7.8, 2.4, 10.2                   147.6

Problem size: 227M, #iterations: 96 (±1)

  #cores   setup time (s) GMG, AMG, total   solve time (s)
  2K       —                                —
  4K       12.3, 2.2, 14.5                  638.0
  8K       17.0, 9.9, 26.9                  430.8
  16K      13.5, 4.4, 17.9                  269.1
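The efficiencies above follow from these timings; for example (my arithmetic, $T$ = setup + solve), the 125M-DOF pressure problem at 16K cores relative to 2K cores gives

$$
E = \frac{P_{2\mathrm{K}}\, T_{2\mathrm{K}}}{P_{16\mathrm{K}}\, T_{16\mathrm{K}}}
= \frac{2048\,(12.2 + 857.2)}{16384\,(10.2 + 147.6)}
= \frac{869.4}{8 \times 157.8}
\approx 0.69.
$$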

7. Scalable nonlinear Stokes solver: Inexact Newton-Krylov method

- Newton update is computed inexactly via a Krylov subspace iterative method (a minimal sketch follows this list)
- Krylov tolerance decreases with subsequent Newton steps to guarantee superlinear convergence
- Number of Newton steps is independent of the mesh size
- Velocity residual is measured in the H^{-1}-norm for the backtracking line search; this avoids overly conservative update steps << 1 (evaluation of the residual norm requires 3 scalar constant-coefficient Laplace solves, which are performed by PCG with GMG preconditioning)
- Grid continuation at initial Newton steps: adaptive mesh refinement to resolve increasing viscosity variations arising from the nonlinear dependence of the viscosity on the velocity
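A minimal sketch of the inexact Newton loop described above (plain Python/numpy assumed): the linear tolerance (forcing term) shrinks as the nonlinear residual drops, and a backtracking line search is performed in a user-supplied norm standing in for the H^{-1} residual norm. The particular forcing-term formula, line-search constants, and the direct solve in place of GMRES are illustrative choices, not the poster's.

```python
import numpy as np

def inexact_newton(F, J, x, norm, linsolve, tol=1e-10, max_iter=30):
    """Inexact Newton with decreasing forcing term and backtracking line search."""
    r = F(x)
    r0 = norm(r)
    for _ in range(max_iter):
        if norm(r) <= tol * r0:
            break
        # forcing term: tighter Krylov tolerance as the nonlinear residual drops
        eta = min(0.5, np.sqrt(norm(r) / r0))
        dx = linsolve(J(x), -r, eta)       # Newton step, solved only to rel. tol. eta
        alpha = 1.0                        # backtracking line search on norm(F)
        while norm(F(x + alpha * dx)) >= norm(r) and alpha > 1e-8:
            alpha *= 0.5
        x = x + alpha * dx
        r = F(x)
    return x

if __name__ == "__main__":
    # small nonlinear test problem: F(x) = A x + x**3 - b (illustrative only)
    rng = np.random.default_rng(1)
    A = np.eye(5) + 0.1 * rng.random((5, 5))
    b = rng.random(5)
    F = lambda x: A @ x + x ** 3 - b
    J = lambda x: A + np.diag(3.0 * x ** 2)
    # the direct solve stands in for GMRES run to relative tolerance eta
    linsolve = lambda Jx, rhs, eta: np.linalg.solve(Jx, rhs)
    x = inexact_newton(F, J, np.zeros(5), norm=np.linalg.norm, linsolve=linsolve)
    print("final residual norm:", np.linalg.norm(F(x)))
```

In the poster's solver the norm callback would itself involve the three GMG-preconditioned Laplace solves of the H^{-1} evaluation, and linsolve would be the preconditioned GMRES iteration of Section 2.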

Convergence of inexact Newton-Krylov (4096 cores)

Figure: Relative l2 residual norm vs. accumulated GMRES iterations, showing the nonlinear residual and the Krylov residual over 27 Newton steps.

Figure: Plate velocities at the nonlinear solution.

Note: Adaptive mesh refinements after the first four Newton steps are indicated by black vertical lines. 642M velocity & pressure DOF at the solution, 473 min total runtime on 4096 cores.

SC14 Supercomputing Conference New Orleans, Louisiana, USA November 16–21, 2014