Solving PDEs on Supercomputers I: modern supercomputer architecture
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
Moore’s Law
The number of transistors per unit area on integrated circuits doubles every two years. (1965)
The consequence
Individual computers aren’t getting faster: we’re getting more of them.
A modern supercomputer
In this lecture we will give a brief overview of modern supercomputer architecture.
ARCHER is composed of 4920 nodes, each with 24 cores, for a total of 118,080 cores.
A node
Algorithmic consequence
Extreme pressure on memory and memory bandwidth.
A socket
Algorithmic consequence
Want to have multiple cores working on the same data.
A core
Algorithmic consequence
Vectorisation essential for maximum floating point performance.
Hardware properties
Some relative timings
On a 3.0 GHz Intel Core 2 Duo E8400:
- One clock cycle: ∼1/3 nanoseconds (∼10 light-cm!)
- Accessing L1 data cache (32 KB): 3 cycles
- Accessing L2 cache (6 MB): 14 cycles
- Accessing main memory: ∼250 cycles
- Accessing disk: ∼40 million cycles
Analogy
- Register: the data is on your working paper.
- L1 cache: the data is on your desk (3 seconds).
- L2 cache: the data is on your bookshelf (14 seconds).
- Main memory: the data is in the library (a 4 minute walk).
- Disk: go backpacking for 1.2 years.
The interconnect
Some more timings
On the Cray Aries interconnect, to send a message:
- Within a socket: 800 cycles
- Within a node: 1600 cycles
- Across the machine: 8000 cycles
Algorithmic consequence
Interleave communication and computation.
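The pattern looks roughly like the following mpi4py sketch (an illustration added here, not course code; the 1D neighbour layout and buffer sizes are arbitrary assumptions):

# Post nonblocking halo exchanges, compute on interior data while the
# messages are in flight, then finish the work that needs the halo.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send_halo = np.full(100, float(rank))
recv_halo = np.empty(100)

requests = []
if rank + 1 < size:
    requests.append(comm.Isend(send_halo, dest=rank + 1))
if rank - 1 >= 0:
    requests.append(comm.Irecv(recv_halo, source=rank - 1))

# ... compute on locally owned (interior) data here ...

MPI.Request.Waitall(requests)
# ... now compute the contributions that need recv_halo ...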
MPI and OpenMP
Domain decomposition
The coarsest level of parallelism used is domain decomposition over MPI.
from dolfin import *
mesh = UnitCubeMesh(32, 32, 32)
partitioning = CellFunction("size_t", mesh)
partitioning.set_all(MPI.rank(mpi_comm_world()))
File("output/partitioning.xdmf") << partitioning
$ mpiexec -n 4 python partition.py
MPI: basic model
MPI
Separate processes with separate memory spaces communicate via message passing.
MPI concepts:
- communicator
- collective
- rank
- blocking and nonblocking communication
- reductions
Each subdomain is assigned to one MPI rank.
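A minimal mpi4py sketch touching these concepts (an illustration added here, not course code):

from mpi4py import MPI

comm = MPI.COMM_WORLD      # the communicator
rank = comm.Get_rank()     # this process's rank within the communicator
size = comm.Get_size()

# Collective reduction: every rank contributes, every rank receives the sum.
total = comm.allreduce(float(rank), op=MPI.SUM)

# Nonblocking point-to-point communication between ranks 0 and 1.
if rank == 0 and size > 1:
    req = comm.isend("halo data", dest=1, tag=0)
    req.wait()
elif rank == 1:
    data = comm.irecv(source=0, tag=0).wait()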
Main communication patterns in finite elements
Assembly
Assembly requires exchanging halo data with your neighbours.
[Figure: dofs on processor 0 and processor 1, classified as core, owned, exec and non-exec, with halos exchanged between the two processes.]
Krylov solvers
- Neighbour communications for the sparse matrix-vector product.
- Global reductions (allreduce for dot products).
- Preconditioner application.
- Multigrid: extremely complicated.
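In petsc4py terms the first two patterns look like this (a hedged sketch on a toy diagonal matrix; sizes and entries are placeholders):

from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A[i, i] = 1.0               # each rank sets its locally owned rows
A.assemblyBegin(); A.assemblyEnd()

x, y = A.createVecs()
x.set(1.0)

A.mult(x, y)      # matrix-vector product: neighbour (halo) communication
alpha = x.dot(y)  # dot product: a global reduction (MPI_Allreduce)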
OpenMP: basic model
OpenMP
Separate threads operate on the same memory space.
Pros:
- Less overhead in parallel execution
- Multiple cores can act on the same data
- Less pressure on memory and memory bandwidth
- Easier load balancing

Cons:
- Extremely difficult to program correctly
- Subtle race conditions possible
- Colouring and locks required to synchronise
DOLFIN can also run in OpenMP mode for assembly:
from dolfin import *
parameters["num_threads"] = 4
# ...
solve(F == 0, u)  # must use a threaded solver (e.g. pastix)!
You can’t use MPI and OpenMP at the same time (yet).
Algorithmic consequences
General algorithmic consequences
- Need algorithms with high arithmetic intensity.
- Caches greatly dislike unstructured memory accesses.
- Flops are (approximately) free.
- Large stencils induce extra communication.
- Must overlap communication and computation.
- Solver algorithms must be O(n) or O(n log n).
General algorithmic trends
- Domain-decomposed high-order FE on semi-structured meshes.
- Multigrid/multilevel solvers with Krylov accelerators.
- Hybrid parallelism strategies (MPI/OpenMP/AVX).
Solving PDEs on Supercomputers II: practical matters of using supercomputers
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
Logging on
Supercomputers are accessed by sshing to the login nodes.
$ ssh [email protected]
You configure your environment with modules:
$ module list
No Modulefiles Currently Loaded.
$ module avail
...
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ module list
Modules are generally awful, but nothing better exists yet.
Running jobs interactively
The simplest way to run a job is interactively. This is mainly used for debugging.
$ qsub -I -l nodes=1:ppn=16 -l walltime=0:10:00 -q develq
qsub: waiting for job 312485.headnode1.arcus.osc.local to start
# wait until PBS allocates us the resources we asked for ...
qsub: job 312485.headnode1.arcus.osc.local ready
$ cd $PBS_O_WORKDIR
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ mpirun $MPI_HOSTS python poisson.py
Running jobs in batch mode
ARCUS-A and ARCHER are managed using PBS, the Portable Batch System. Users submit jobs to the batch system, which decides when and where they get executed.
The main PBS commands:
- qsub
- qdel
- qstat
The argument to qsub is a PBS script.
Running jobs in batch mode
#!/bin/bash
# set the number of nodes and processes per node
#PBS -l nodes=1:ppn=16
# set max wallclock time
#PBS -l walltime=1:00:00
# set name of job
#PBS -N poisson
# mail alert at start, end and abortion of execution
#PBS -m bea
# send mail to this address
#PBS -M [email protected]
# start job from the directory it was submitted
cd $PBS_O_WORKDIR
module use -a /data/math-farrellp/crichardson/modules
module load fenics/1.5.0
. enable_arcus_mpi.sh
mpirun $MPI_HOSTS python poisson.py | tee poisson.log
HPC 02 Challenge!
Investigate the weak scaling of the 2D Poisson solver with parallel LU that you developed last week:
- Have the code refine the mesh once each time the number of cores quadruples. Hint:

  size = MPI.size(mpi_comm_world())
  ...
  for i in range(nrefine):
      mesh = refine(mesh, redistribute=False)

- Run the code on 1, 4 and 16 cores. What happens to the runtime as the problem is scaled weakly?
- ...
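A sketch of the refinement logic for the first step (the base mesh resolution is an illustrative choice):

from math import log
from dolfin import *

size = MPI.size(mpi_comm_world())
nrefine = int(round(log(size, 4)))   # 1 core -> 0, 4 -> 1, 16 -> 2 refinements

mesh = UnitSquareMesh(64, 64)
for i in range(nrefine):
    mesh = refine(mesh, redistribute=False)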
HPC 02 Challenge!
Which components of the solver are taking the longest? Profile the code with:
- the DOLFIN timing system: list_timings()
- the PETSc timing system:
import petsc4py
petsc4py.init("-log_summary summary.log".split())
from dolfin import *
- Now switch to HYPRE algebraic multigrid and compare the timings again. Hint: to get more details about the AMG solve, call
PETScOptions.set("pc_hypre_boomeramg_print_statistics", 1)
Solving PDEs on Supercomputers III: an introduction to PETSc
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
PETSc
PETSc is a library of linear and nonlinear solvers for the sparse systems arising from PDEs.

It has won most of the awards going:
- SIAM/ACM Prize in Computational Science and Engineering, 2015
- R&D 100 Award, 2009
- Gordon Bell Prizes in 2009, 2004, 2003, 1999
- ...

PETSc makes it easy to express complex hierarchical composed solvers as compactly as possible.
Fundamental objects
[Vec, Mat, PC, KSP, SNES]
Vec
Vec represents a dense vector, decomposed in parallel.
Example
ierr = VecCreateMPI(PETSC_COMM_WORLD, local, global, &x);
ierr = VecDuplicate(x, &y);
ierr = VecDotBegin(x, y, &xTy);
/* other computations */
ierr = VecDotEnd(x, y, &xTy);
Mat
Mat represents a sparse matrix, decomposed in parallel.
Example
ierr = MatCreateAIJ(PETSC_COMM_WORLD, ..., &mat);
for (i = 0; i < local_rows; i++)
    ierr = MatSetValues(mat, ...);
ierr = MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
ierr = MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
ierr = MatMult(mat, x, y);
PC
PC represents a linear preconditioner (Jacobi, Gauss-Seidel, ILU, ICC, AMG, additive Schwarz, ...)
Example
ierr = PCCreate(PETSC_COMM_WORLD, &pc);
ierr = PCSetOperators(pc, A, P);
ierr = PCSetType(pc, PCILU);
ierr = PCSetUp(pc);
ierr = PCApply(pc, x, y);
KSP
KSP represents a linear solver (CG, GMRES, TFQMR, BICGSTAB, MINRES, GCR, Richardson, Chebyshev, ...)
Example
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);
ierr = KSPSetOperators(ksp, A, P);
ierr = KSPSetType(ksp, KSPCG);
ierr = KSPSetUp(ksp);
ierr = KSPSolve(ksp, b, x);
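Since this course drives PETSc from Python, the same Vec/Mat/KSP workflow in petsc4py looks roughly as follows (a hedged sketch; the 1D Laplacian is an illustrative choice, not from the slides):

from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):     # assemble a toy 1D Laplacian in parallel
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemblyBegin(); A.assemblyEnd()

x, b = A.createVecs()
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
ksp.setFromOptions()              # let -ksp_type etc. override at runtime
ksp.solve(b, x)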
SNES
SNES represents a nonlinear solver (Newton, reduced-space Newton, NGMRES, NCG, Anderson acceleration, FAS, ...)
Example
ierr = SNESCreate(PETSC_COMM_WORLD, &snes);
ierr = SNESSetFunction(snes, r, residual);
ierr = SNESSetJacobian(snes, J, P, jacobian);
ierr = SNESSetType(snes, SNESVINEWTONRSLS);
ierr = SNESSetVariableBounds(snes, xl, xu);
ierr = SNESSetUp(snes);
ierr = SNESSolve(snes, b, x);
Hierarchical composition
Principle
All objects are composable.
Principle
All objects are configurable.
(example from variational fracture mechanics)
Wiring PETSc and FEniCS
We’re going to need fine control to design our solvers.
A simple interface between FEniCS and PETSc:
$ git clone https://bitbucket.org/pefarrell/dolfin-snes-interface.git
Solving PDEs on Supercomputers IV: algebraic multigrid
Patrick Farrell
MMSC: Python in Scientific Computing
May 18, 2015
Multilevel solvers
At the core of most PDE solvers is the solution of a linear system
Linear system
Ax = b
The most powerful solvers for PDEs exploit the fact that there exists an infinite hierarchy of discretisations, all approximating the same problem:
Hierarchy of linear systems
...
A_h x_h = b_h
A_{2h} x_{2h} = b_{2h}
A_{4h} x_{4h} = b_{4h}
...
Geometric multigrid: review
Geometric multigrid algorithm
- Begin with an initial guess.
- Apply a relaxation method to smooth the error.
- Solve for the smooth error on a coarse grid.
Why did geometric multigrid work?
Geometric multigrid worked on the Laplacian because:
- simple relaxation methods yielded geometrically smooth errors;
- those errors could be well-represented on coarse grids.
What about problems where the error isn’t smooth after relaxation?
Anisotropic Laplacian
−a u_xx − b u_yy = f in Ω = [0, 1]²
u = g on ∂Ω
a = b if x < 1/2,
a ≫ b if x ≥ 1/2.
Two responses
GMG:
- design increasingly arcane relaxation methods that do smooth;
- semi-coarsening, multi-coarsening, etc.

AMG:
- fix a simple relaxation method;
- algebraically construct coarse grids and interpolation operators;
- demand that these can well represent the error after relaxation.

A nice side effect: AMG requires much less infrastructure:

Pros:
- No need to supply coarse grids
- No need to supply interpolation operators

Cons:
- Only applies to linear problems
- Requires global linearisation (memory)
- Requires near-nullspace of operator
Anisotropic Laplacian again
Fundamental principles of AMG I: relaxation and error
Recall Richardson iteration with a preconditioner P:

Richardson iteration
x_{k+1} = x_k + P⁻¹(b − A x_k).

A simple error analysis shows

Error analysis of Richardson iteration
e_{k+1} = (I − P⁻¹A) e_k

Now if e_{k+1} ≈ e_k then

Near-nullspace of A
P⁻¹A e_k ≈ 0 ⟹ A e_k ≈ 0.
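A tiny numpy experiment (an added illustration, not from the slides) makes this concrete: after many damped-Jacobi Richardson sweeps on a 1D Laplacian, the surviving error e is nearly in the near-nullspace of A.

import numpy as np

n = 50
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian stencil
Pinv = (2.0/3.0) * np.diag(1.0/np.diag(A))           # damped Jacobi P^{-1}

e = np.random.rand(n)             # error of some arbitrary initial guess
for k in range(50):
    e = e - Pinv @ (A @ e)        # e_{k+1} = (I - P^{-1} A) e_k

print(np.linalg.norm(A @ e) / np.linalg.norm(e))     # small: A e ≈ 0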
Fundamental principles of AMG I: relaxation and error
Error after relaxation
The error after relaxation is related to the near-nullspace of the operator.
Fundamental principles of AMG II: interpolation
Recall that in one multigrid cycle we approximate the fine error as
Approximation of fine error
e_h ≈ P_H^h e_H

Thus, we want the near-nullspace to be in the range of P_H^h.
Coarse grid generation: an example
Classical AMG: coarse-grid generation
1. Select C-point with maximal measure
2. Select neighbours as F-points
3. Update measures of neighbours
Coarse grid generation: an example
Smoothed-aggregation AMG: coarse-grid generation
Phase 1:
1. Pick a root point not adjacent to an aggregation
2. Aggregate root and neighbours
Phase 2: Move points into nearby aggregations
HPC 04 Challenge!
Consider the linear elasticity equation
−∇ · σ(u) = f in Ω
u = 0 on ∂Ω_D
σ · n = 0 on ∂Ω_N
on the pulley mesh, where
ε(u) = (1/2)(∇u + ∇uᵀ),
σ(u) = 2με(u) + λ tr(ε(u)) I,
f = (ρω²x, ρω²y, 0),
∂Ω_D = {(x, y, z) ∈ ∂Ω | x² + y² < (3.75 − 0.17z)²},
∂Ω_N = ∂Ω \ ∂Ω_D,
E = 10⁹, ν = 0.3, ρ = 10, ω = 300.
HPC 04 Challenge!
Solve this problem using only smoothed aggregation algebraic multigrid (no Krylov accelerator: -ksp_type richardson -ksp_monitor_true_residual -pc_type gamg).
How many iterations does it take to converge to an absolute tolerance of 10⁻¹²
(a) without the near-nullspace
(b) with the near-nullspace?
Here the near-nullspace is the rigid body translations and rotations.
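One hedged way to build and attach those modes from Python (a sketch of mine: it assumes V is the vector function space, A the assembled PETSc-backed operator, and DOLFIN/petsc4py APIs of roughly this vintage):

from dolfin import *
from petsc4py import PETSc

# The six rigid body modes of 3D elasticity: three translations and
# three (linearised) rotations, written as DOLFIN expressions.
modes = [Constant((1, 0, 0)), Constant((0, 1, 0)), Constant((0, 0, 1)),
         Expression(("0", "-x[2]", "x[1]")),    # rotation about x
         Expression(("x[2]", "0", "-x[0]")),    # rotation about y
         Expression(("-x[1]", "x[0]", "0"))]    # rotation about z

basis = [as_backend_type(interpolate(m, V).vector()).vec() for m in modes]
nullsp = PETSc.NullSpace().create(vectors=basis)
as_backend_type(A).mat().setNearNullSpace(nullsp)   # consumed by GAMG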
Now investigate the configuration of the smoothed aggregation AMG solver and the Krylov accelerator (hint: -help, -snes_view). By tuning the solver, can you achieve faster convergence?
Solving PDEs on Supercomputers V: algebraic multigrid on nonsymmetric problems
Patrick Farrell
MMSC: Python in Scientific Computing
May 19, 2015
HPC 05 Challenge! (1/3)
Implement a solver for the Yamabe equation
−8∇²u + (1/r³)u⁵ − (1/10)u = 0
on the doughnut mesh with boundary conditions u = 1.
Initialise Newton with the initial guess u = 1.
HPC 05 Challenge! (2/3)
Next, develop an efficient linear solver:
1. First use Newton + LU.
2. Next, try GMRES + GAMG. Does it work well?
3. Try increasing the maximum size of the coarse grid (pc_gamg_coarse_eq_limit).
4. Ah! Now we're getting somewhere. Does changing the smoother help (mg_levels_ksp_monitor_true_residual)?
5. Increase the quality of the smoothed aggregation basis (pc_gamg_agg_nsmooths).
HPC 05 Challenge! (3/3)
Profile the code. Where is it spending most of its time?
How can the preconditioner construction cost be reduced?
Once that is done, compare the memory usage of GMRES, FGMRES, GCR and CGS.
Solving PDEs on Supercomputers VI: fieldsplit preconditioners
Patrick Farrell
MMSC: Python in Scientific Computing
May 19, 2015
Block triangular factorisations
A block matrix with nonsingular A has a block triangular factorisation:

Block triangular factorisation

J = [ A  B ] = [ I     0 ] [ A  0 ] [ I  A⁻¹B ]
    [ C  D ]   [ CA⁻¹  I ] [ 0  S ] [ 0  I    ]

where S = D − CA⁻¹B is the (dense!) Schur complement.

This gives us an expression for its inverse:

Block triangular inverse

[ A  B ]⁻¹ = [ I  −A⁻¹B ] [ A⁻¹  0   ] [ I      0 ]
[ C  D ]     [ 0   I    ] [ 0    S⁻¹ ] [ −CA⁻¹  I ]
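A quick numpy sanity check of the factorisation (purely illustrative: random blocks, with A shifted to be safely nonsingular):

import numpy as np

n = 4
A = np.random.rand(n, n) + n*np.eye(n)
B, C, D = (np.random.rand(n, n) for _ in range(3))
I, Z = np.eye(n), np.zeros((n, n))

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                      # Schur complement

J = np.block([[A, B], [C, D]])
L = np.block([[I, Z], [C @ Ainv, I]])
M = np.block([[A, Z], [Z, S]])
U = np.block([[I, Ainv @ B], [Z, I]])

print(np.allclose(J, L @ M @ U))          # True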
Fieldsplit preconditioners
This gives rise to four related theorems.
Theorem (full)
The choice

P = [ I     0 ] [ A  0 ] [ I  A⁻¹B ]
    [ CA⁻¹  I ] [ 0  S ] [ 0  I    ]

will induce Krylov convergence in 1 iteration.
Theorem (lower)
The choice

P = [ I     0 ] [ A  0 ]
    [ CA⁻¹  I ] [ 0  S ]

will induce Krylov convergence in 2 iterations.
Theorem (upper)
The choice

P = [ A  0 ] [ I  A⁻¹B ]
    [ 0  S ] [ 0  I    ]

will induce Krylov convergence in 2 iterations.
Theorem (diag)
The choice

P = [ A  0  ]
    [ 0  −S ]

will induce Krylov convergence in 3 iterations, if D = 0.
How do you use this?
Cheaply approximate A⁻¹ and S⁻¹ (problem specific)!
Spectral equivalence
Definition (spectral equivalence)
A_h and B_h ∈ ℝⁿˣⁿ are spectrally equivalent, A_h ∼ B_h, iff there exist constants c, C independent of h such that

c ≤ λ(B_h⁻¹ A_h) ≤ C.
Solving block-structured systems
Find an approximation S̃ ∼ S or S̃⁻¹ ∼ S⁻¹.
Stokes equations
The Stokes equations are
−ν∇²u + ∇p = 0,
∇ · u = 0.
A stable discretisation yields
J = [ A  Bᵀ ]
    [ B  0  ]

with S = −BA⁻¹Bᵀ.
Spectral equivalence (e.g. Elman, Silvester and Wathen, 2005)
Let Q be the viscosity-weighted pressure mass matrix
Q_ij = ∫_Ω (1/ν) φ_i φ_j.

Then S ∼ Q.
Coding tools
Creating PETSc index sets to extract dofs:

from petsc4py import PETSc

u_dofs = SubSpace(Z, 0).dofmap().dofs()
u_is = PETSc.IS().createGeneral(u_dofs)
Configuring the dofs to split:
fields = [("0", u_is), ("1", p_is)]
snes.ksp.pc.setFieldSplitIS(*fields)
Setting the matrix for building a preconditioner for the Schur complement:
schur = (1.0/nu) * inner(p, q)*dx
schur_full = assemble(schur)
schur_fmat = as_backend_type(schur_full).mat()
schur_mat = schur_fmat.getSubMatrix(p_is, p_is)
snes.ksp.pc.setFieldSplitSchurPreType(PETSc.PC.SchurPreType.USER, schur_mat)
Configuring fieldsplit
--petsc.ksp_converged_reason
--petsc.ksp_type fgmres
--petsc.ksp_monitor_true_residual
--petsc.ksp_atol 1.0e-10
--petsc.ksp_rtol 0.0
--petsc.pc_type fieldsplit
--petsc.pc_fieldsplit_type schur
--petsc.pc_fieldsplit_schur_factorization_type full
--petsc.pc_fieldsplit_schur_precondition user
--petsc.fieldsplit_0_ksp_type richardson
--petsc.fieldsplit_0_ksp_max_it 1
--petsc.fieldsplit_0_pc_type lu
--petsc.fieldsplit_0_pc_factor_mat_solver_package mumps
--petsc.fieldsplit_1_ksp_type bcgs
--petsc.fieldsplit_1_ksp_rtol 1.0e-10
--petsc.fieldsplit_1_ksp_monitor_true_residual
--petsc.fieldsplit_1_pc_type lu
--petsc.fieldsplit_1_pc_factor_mat_solver_package mumps
HPC 06 Challenge!
Solve the Stokes equations with ν = 1/100 on the dolphin.xml mesh, with boundary conditions
u = (0, 0) on ∂Ω_0
u = (−sin(πy), 0) on ∂Ω_1
ν∇u · n = p n on ∂Ω_2,
with colours taken from dolphin_subdomains.xml.
0. Discretise the equation with a stable finite element pair. Integrate both terms in the momentum equation by parts.
1. Solve the problem with LU (UMFPACK/MUMPS).
2. Implement the fieldsplit preconditioner with ideal inner solvers (LU).
3. Now replace the inner solvers with Krylov solvers (CG/ML/5 for A, BCGS/HYPRE/5 for S).
4. What configuration is fastest? full with strong inner solvers? diag with weak inner solvers?
Solving PDEs on Supercomputers VII: PDE-constrained optimisation
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
The mother problem
Consider again the mother problem of PDE-constrained optimisation:
min_{y,u} (1/2) ∫_Ω (y − y_d)² dx + (β/2) ∫_Ω u² dx

subject to

−Δy = u in Ω
y = 0 on ∂Ω

We form the Lagrangian:

L(y, u, λ) = (1/2) ∫_Ω (y − y_d)² dx + (β/2) ∫_Ω u² dx + ∫_Ω (∇λ · ∇y − λu) dx
The optimality conditions
Taking the optimality conditions yields the system: find (y, u, λ) ∈ H^1_0 × L² × H^1_0 such that

∫_Ω ỹ(y − y_d) dx + ∫_Ω ∇λ · ∇ỹ dx = 0 for all ỹ,
β ∫_Ω ũu dx − ∫_Ω λũ dx = 0 for all ũ,
∫_Ω ∇λ̃ · ∇y dx − ∫_Ω λ̃u dx = 0 for all λ̃.

On discretisation, this yields the system

[ M  0   K  ] [ y ]   [ z ]
[ 0  βM  −M ] [ u ] = [ 0 ]
[ K  −M  0  ] [ λ ]   [ 0 ]
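Looking ahead to the HPC 07 challenge, the optimality system can be assembled as one mixed problem; a hedged DOLFIN sketch (the mesh, the β value and the Expression for y_d are illustrative choices, and a DOLFIN 1.x-era API is assumed):

from dolfin import *

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
Z = MixedFunctionSpace([V, V, V])          # (y, u, lambda)

(y, u, lmbd) = TrialFunctions(Z)
(y_t, u_t, l_t) = TestFunctions(Z)

beta = Constant(1.0e-2)
yd = Expression("x[0] <= 0.5 && x[1] <= 0.5 ? 1.0 : 0.0")

a = (inner(y, y_t)*dx + inner(grad(lmbd), grad(y_t))*dx      # first row
     + beta*inner(u, u_t)*dx - inner(lmbd, u_t)*dx           # second row
     + inner(grad(y), grad(l_t))*dx - inner(u, l_t)*dx)      # third row
L = inner(yd, y_t)*dx

bcs = [DirichletBC(Z.sub(0), 0.0, "on_boundary"),
       DirichletBC(Z.sub(2), 0.0, "on_boundary")]

z = Function(Z)
solve(a == L, z, bcs)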
Ingredients of a fieldsplit
Remember, to fieldsplit you need two things:
1. A diagonal block you can cheaply invert
2. A Schur complement you can cheaply approximate

If we take A = [[M, 0], [0, βM]], the first is satisfied.

How about the Schur complement? Calculating, we find

S = KM⁻¹K + (1/β)M.

Bad news
Approximating the inverse of sums is hard.
Two approaches
Approach one: ignore one of the terms (Rees, Dollar, Wathen 2010).

S = KM⁻¹K + (1/β)M ≈ KM⁻¹K

with inverse S⁻¹ ≈ K⁻¹MK⁻¹.

Approach two: approximate the sum with a product (Pearson and Wathen, 2012).

S = (K + (1/√β)M) M⁻¹ (K + (1/√β)M) − (2/√β)K
  ≈ (K + (1/√β)M) M⁻¹ (K + (1/√β)M)

with inverse S⁻¹ ≈ (K + (1/√β)M)⁻¹ M (K + (1/√β)M)⁻¹.
Coding tools
No need to pass index sets with scalar fields:
"""
--petsc.pc_fieldsplit_0_fields 0,1
--petsc.pc_fieldsplit_1_fields 2
"""
You do need index sets to extract submatrices:
trial = split(TrialFunction(Z))[0]
test = split(TestFunction(Z))[0]
bc = DirichletBC(Z.sub(0), 0.0, "on_boundary")
mass_full = assemble(inner(trial, test)*dx)
bc.apply(mass_full)
...
mass_mat = mass_fmat.getSubMatrix(is_0, is_0)
Coding tools
Creating a KSP to handle the solve:
ksp_kbm = PETSc.KSP()
ksp_kbm.create()
ksp_kbm.setType("richardson")
ksp_kbm.pc.setType("lu")
ksp_kbm.setOperators(kbm)
ksp_kbm.setOptionsPrefix("fieldsplit_1_kbm_")
ksp_kbm.setFromOptions()
ksp_kbm.setUp()
Coding tools
Using an approximate inverse action with PCMAT:
"""
--petsc.fieldsplit_1_pc_type mat
"""
Configuring a shell matrix:
class SchurInv(object):
def mult(self, mat, x, y):
ksp_kbm.solve(x, tmp1)
mass.mult(tmp1, tmp2)
ksp_kbm.solve(tmp2, y)
schur = PETSc.Mat()
schur.createPython(mass.getSizes(), SchurInv())
schur.setUp()
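With --petsc.fieldsplit_1_pc_type mat, PCMAT applies this shell matrix directly, so its mult must implement the approximate inverse action S⁻¹; the shell is then passed in as the user Schur preconditioning matrix, as before (a sketch mirroring the earlier call):

# The shell implements the action of S^{-1}, which PCMAT applies as-is.
snes.ksp.pc.setFieldSplitSchurPreType(PETSc.PC.SchurPreType.USER, schur)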
HPC 07 Challenge!
Solve the mother problem on Ω = [0, 1]² with

y_d(x, y) = 1 if (x, y) ∈ [0, 0.5]², 0 otherwise

and homogeneous Dirichlet boundary conditions.

0. Discretise the equation with [P1]³.
1. Solve the problem with LU.
2. Implement the two fieldsplit preconditioners with ideal inner solvers.
3. Which performs best as β → 0?
4. Now choose scalable inner solvers.
5. Which configuration is fastest on the machine?
Solving PDEs on Supercomputers VIII: advanced nonlinear solvers
Patrick Farrell
MMSC: Python in Scientific Computing
May 18, 2015
Globalisation of Newton’s method
Consider again the p-Laplace equation
−∇ · (γ(u)∇u) = f in Ω
u = g on ∂Ω
where
γ(u) = (ε² + (1/2)|∇u|²)^((p−2)/2).
The configuration we considered (p = 5) took 121 iterations to converge. Why?
Newton steps near singular Jacobians
Recall that at our initial guess u = 0, our Jacobian is nearly singular.
If J = UΣVᵀ, then J⁻¹ = VΣ⁻¹Uᵀ, and if σ_min → 0, then

‖δu‖ = ‖J⁻¹F‖ → ∞.
This explains
0 SNES Function norm 3.027343750000e-02
1 SNES Function norm 3.708799037955e+56
2 SNES Function norm 1.173487195603e+56
Responses
A few possible responses:
1. Start with a better initial guess (continuation)
2. Regularise further (undesirable)
3. Take a smaller step (damping with α ≠ 1)!
[Figures: Newton fractals for z³ − 1 = 0 with damping α = 1, 0.75, 0.5, 0.25 and 0.1.]
Linesearch schemes in PETSc
Backtracking linesearch (bt)
- Finds the minimum of a polynomial fit to the l² norm in [0, 1].
- Demands monotonic and sufficient decrease.
- If decrease is insufficient, the interval is reduced.

Good for: convex problems, occasional near-singular Jacobians.
Bad for: nonconvex problems where the residual must increase before convergence.
Linesearch schemes in PETSc
Critical point linesearch (cp)
- Many PDEs have an energy functional to be minimised.
- Suppose F(u) is the gradient of some (unknown) E(u).
- E(u + α du) can be minimised by looking for roots of

  duᵀ F(u + α du) = 0

  with a secant method.
Good for: problems with an energy functional.
Linesearch schemes in PETSc
Affine-covariant linesearch (nleqerr)
- Undamped Newton's method is affine covariant.
- This observation fundamentally changes convergence theorems for Newton (Deuflhard, 2011).
- Convergence criteria are expressed in terms of affine-covariant Lipschitz constants.
- This linesearch estimates these constants and uses them to decide step lengths.

Good for: problems where you can start within singular manifolds; the hardest nonlinear problems.
Nonlinear preconditioning
For a linear problem Ax = b we apply an approximate solver P⁻¹ on the left:

P⁻¹Ax = P⁻¹b.

Write one step of a nonlinear solver for F(x) = b as

x_{i+1} = N(F, x_i, b).
Nonlinear preconditioning
In nonlinear left preconditioning, we define a new residual

R(x) = x − N(F, x, b)

and apply an outer nonlinear solver to R.

In the linear case this is equivalent, since

R(x) = x − N(F, x, b)
     = x + P⁻¹(Ax − b) − x
     = P⁻¹(Ax − b).
Can accelerate an inner solver with an outer solver!
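PETSc exposes this composition through the npc_ (nonlinear preconditioner) options prefix; a hedged petsc4py sketch (the solver types are real SNES types, but the pairing is an illustrative choice):

from petsc4py import PETSc

opts = PETSc.Options()
opts["snes_type"] = "ngmres"           # outer: nonlinear GMRES
opts["npc_snes_type"] = "nrichardson"  # inner: nonlinear Richardson

snes = PETSc.SNES().create()
snes.setFromOptions()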
Examples of nonlinear preconditioning
Hyperelasticity (Brune et al, 2013)
Inner solver: Newton. Outer solver: nonlinear conjugate gradients.
High-Reynolds number Navier–Stokes (Cai and Keyes, 2002)
Inner solver: nonlinear additive Schwarz. Outer solver: Newton–Krylov.
High-Prandtl number Navier–Stokes (Brune et al, 2013)
Inner solver: nonlinear multigrid. Outer solver: nonlinear GMRES.
Nonlinear preconditioning: a remark
The design space for nonlinear solvers is vast.
At the moment we have very little theory to guide us.
There are very large potential gains, however.
Nonlinear multigrid
The main bottleneck for massive problems is the linear system.
What if we didn’t have to solve (large) linear systems?
FAS uses fine-grid residuals to correct coarse-grid equations.
Full Approximation Scheme (FAS)
Given:
- a problem (F^h, x^h, b^h)
- a smoother S and a coarse solver M
- restriction, prolongation and injection operators R, P and R̂.

while not converged:
    x^h_s = S(F^h, x^h_i, b^h)
    x^H = R̂ x^h_s
    b^H = R[b^h − F^h(x^h_s)] + F^H(x^H)
    x^H_c = M(F^H, x^H, b^H)
    x^h_c = x^h_s + P[x^H_c − x^H]
    x^h_{i+1} = S(F^h, x^h_c, b^h)
Nonlinear multigrid
You can use
- a high-flop smoother on the fine grids,
- and Newton-LU on the coarse grids!
(see firedrake Yamabe demo)
HPC 08 Challenge!
Consider again the p-Laplace equation (FEniCS lecture III).
1. Investigate the performance of different linesearch schemes on the p-Laplace problem.
2. Using only the basic linesearch for the inner solver, accelerate the convergence of Newton's method with left-preconditioning with ncg/cp.
3. Now use the optimal inner linesearch to beat the unaccelerated solver.
4. Choose sensible Krylov solvers and scale the code on ARCUS.
Solving PDEs on Supercomputers IX: a final challenge
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
HPC 09 Challenge! (1/2)
Consider the Cahn–Hilliard equation

∂c/∂t − ∇ · M(∇(df/dc − λ∇²c)) = 0 in Ω,
M(∇(df/dc − λ∇²c)) · n = 0 on ∂Ω,
Mλ∇c · n = 0 on ∂Ω,

where c is the unknown field, f(c) = 100c²(c − 1)², n is the unit normal, and M is a scalar parameter.

To solve this with standard C⁰ elements, write it as two coupled second-order problems.
HPC 09 Challenge! (2/2)
Discretise and solve the equation on Ω = [0, 1]² for M = 1, λ = 10⁻², and initial condition
import random
from dolfin import *

class InitialConditions(Expression):
    def __init__(self):
        random.seed(2 + MPI.rank(mpi_comm_world()))
    def eval(self, values, x):
        values[0] = 0.63 + 0.02*(0.5 - random.random())
        values[1] = 0.0
    def value_shape(self):
        return (2,)
Make sure your scheme is at least second-order. Sensible values are Δt = 5 × 10⁻⁶, θ = 0.5.
An excellent preconditioner is discussed in doi:10.1137/130921842.