Solving PDEs on Supercomputers I: modern supercomputer architecture
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
Moore’s Law
The number of transistors per unit area on integrated circuits doubles every two years. (1965)
The consequence
Individual computers aren’t getting faster: we’re getting more of them.
A modern supercomputer
In this lecture we will give a brief overview of modern supercomputer architecture.
ARCHER is composed of 4920 nodes, each with 24 cores, for a total of 118,080 cores.
A node
Algorithmic consequence
Extreme pressure on memory and memory bandwidth.
A socket
Algorithmic consequence
Want to have multiple cores working on the same data.
A core
Algorithmic consequence
Vectorisation essential for maximum floating point performance.
Hardware properties
Some relative timings
On a 3.0 GHz Intel Core 2 Duo E8400:
- One clock cycle: ∼1/3 nanoseconds (∼10 light-cm!)
- Accessing L1 data cache (32 KB): 3 cycles
- Accessing L2 cache (6 MB): 14 cycles
- Accessing main memory: ∼250 cycles
- Accessing disk: ∼40 million cycles
Analogy
- Register: the data is on your working paper.
- L1 cache: the data is on your desk (3 seconds).
- L2 cache: the data is on your bookshelf (14 seconds).
- Main memory: the data is in the library (a 4 minute walk).
- Disk: go backpacking for 1.2 years.
The interconnect
Some more timings
On the Cray Aries interconnect, to send a message:
- Within a socket: 800 cycles
- Within a node: 1600 cycles
- Across the machine: 8000 cycles
Algorithmic consequence
Interleave communication and computation.
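The pattern looks roughly like the following mpi4py sketch (an illustration added here, not course code; the 1D neighbour layout and buffer sizes are arbitrary assumptions):

# Post nonblocking halo exchanges, compute on interior data while the
# messages are in flight, then finish the work that needs the halo.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send_halo = np.full(100, float(rank))
recv_halo = np.empty(100)

requests = []
if rank + 1 < size:
    requests.append(comm.Isend(send_halo, dest=rank + 1))
if rank - 1 >= 0:
    requests.append(comm.Irecv(recv_halo, source=rank - 1))

# ... compute on locally owned (interior) data here ...

MPI.Request.Waitall(requests)
# ... now compute the contributions that need recv_halo ...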
MPI and OpenMP
Domain decomposition
The coarsest level of parallelism used is domain decomposition over MPI.
from dolfin import *
mesh = UnitCubeMesh(32, 32, 32)
partitioning = CellFunction("size_t", mesh)
partitioning.set_all(MPI.rank(mpi_comm_world()))
File("output/partitioning.xdmf") << partitioning
$ mpiexec -n 4 python partition.py
MPI: basic model
MPI
Separate processes with separate memory spaces communicate via message passing.
MPI concepts:
- communicator
- collective
- rank
- blocking and nonblocking communication
- reductions
Each subdomain is assigned to one MPI rank.
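A minimal mpi4py sketch touching these concepts (an illustration added here, not course code):

from mpi4py import MPI

comm = MPI.COMM_WORLD      # the communicator
rank = comm.Get_rank()     # this process's rank within the communicator
size = comm.Get_size()

# Collective reduction: every rank contributes, every rank receives the sum.
total = comm.allreduce(float(rank), op=MPI.SUM)

# Nonblocking point-to-point communication between ranks 0 and 1.
if rank == 0 and size > 1:
    req = comm.isend("halo data", dest=1, tag=0)
    req.wait()
elif rank == 1:
    data = comm.irecv(source=0, tag=0).wait()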
Main communication patterns in finite elements
Assembly
Assembly requires exchanging halo data with your neighbours.
[Figure: dofs on processor 0 and processor 1, classified as core, owned, exec and non-exec, with halos exchanged between the two processes.]
Krylov solvers
- Neighbour communications for the sparse matrix-vector product.
- Global reductions (allreduce for dot products).
- Preconditioner application.
- Multigrid: extremely complicated.
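In petsc4py terms the first two patterns look like this (a hedged sketch on a toy diagonal matrix; sizes and entries are placeholders):

from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A[i, i] = 1.0               # each rank sets its locally owned rows
A.assemblyBegin(); A.assemblyEnd()

x, y = A.createVecs()
x.set(1.0)

A.mult(x, y)      # matrix-vector product: neighbour (halo) communication
alpha = x.dot(y)  # dot product: a global reduction (MPI_Allreduce)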
OpenMP: basic model
OpenMP
Separate threads operate on the same memory space.
Pros:
- Less overhead in parallel execution
- Multiple cores can act on the same data
- Less pressure on memory and memory bandwidth
- Easier load balancing

Cons:
- Extremely difficult to program correctly
- Subtle race conditions possible
- Colouring and locks required to synchronise
DOLFIN can also run in OpenMP mode for assembly:
from dolfin import *
parameters["num_threads"] = 4
# ...
solve(F == 0, u)  # must use a threaded solver (e.g. pastix)!
You can’t use MPI and OpenMP at the same time (yet).
Algorithmic consequences
General algorithmic consequences
- Need algorithms with high arithmetic intensity.
- Caches greatly dislike unstructured memory accesses.
- Flops are (approximately) free.
- Large stencils induce extra communication.
- Must overlap communication and computation.
- Solver algorithms must be O(n) or O(n log n).
General algorithmic trends
- Domain-decomposed high-order FE on semi-structured meshes.
- Multigrid/multilevel solvers with Krylov accelerators.
- Hybrid parallelism strategies (MPI/OpenMP/AVX).
Solving PDEs on Supercomputers II: practical matters of using supercomputers
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
Logging on
Supercomputers are accessed by sshing to the login nodes.
$ ssh [email protected]
You configure your environment with modules:
$ module list
No Modulefiles Currently Loaded.
$ module avail
...
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ module list
Modules are generally awful, but nothing better exists yet.
Running jobs interactively
The simplest way to run a job is interactively. This is mainly used for debugging.
$ qsub -I -l nodes=1:ppn=16 -l walltime=0:10:00 -q develq
qsub: waiting for job 312485.headnode1.arcus.osc.local to start
# wait until PBS allocates us the resources we asked for ...
qsub: job 312485.headnode1.arcus.osc.local ready
$ cd $PBS_O_WORKDIR
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ mpirun $MPI_HOSTS python poisson.py
Running jobs in batch mode
ARCUS-A and ARCHER are managed using PBS, the Portable Batch System. Users submit jobs to the batch system, which decides when and where they get executed.
The main PBS commands:
- qsub
- qdel
- qstat
The argument to qsub is a PBS script.
Running jobs in batch mode
#!/bin/bash
# set the number of nodes and processes per node
#PBS -l nodes=1:ppn=16
# set max wallclock time
#PBS -l walltime=1:00:00
# set name of job
#PBS -N poisson
# mail alert at start, end and abortion of execution
#PBS -m bea
# send mail to this address
#PBS -M [email protected]
# start job from the directory it was submitted
cd $PBS_O_WORKDIR
module use -a /data/math-farrellp/crichardson/modules
module load fenics/1.5.0
. enable_arcus_mpi.sh
mpirun $MPI_HOSTS python poisson.py | tee poisson.log
HPC 02 Challenge!
Investigate the weak scaling of the 2D Poisson solver with parallel LU that you developed last week:
- Have the code refine the mesh once each time the number of cores quadruples. Hint:

  size = MPI.size(mpi_comm_world())
  ...
  for i in range(nrefine):
      mesh = refine(mesh, redistribute=False)

- Run the code on 1, 4 and 16 cores. What happens to the runtime as the problem is scaled weakly?
- ...
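A sketch of the refinement logic for the first step (the base mesh resolution is an illustrative choice):

from math import log
from dolfin import *

size = MPI.size(mpi_comm_world())
nrefine = int(round(log(size, 4)))   # 1 core -> 0, 4 -> 1, 16 -> 2 refinements

mesh = UnitSquareMesh(64, 64)
for i in range(nrefine):
    mesh = refine(mesh, redistribute=False)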
HPC 02 Challenge!
Which components of the solver are taking the longest? Profile the code with:
- the DOLFIN timing system: list_timings()
- the PETSc timing system:
import petsc4py
petsc4py.init("-log_summary summary.log".split())
from dolfin import *
- Now switch to HYPRE algebraic multigrid and compare the timings again. Hint: to get more details about the AMG solve, call
PETScOptions.set("pc_hypre_boomeramg_print_statistics", 1)
Solving PDEs on Supercomputers III: an introduction to PETSc
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
PETSc
PETSc is a library of linear and nonlinear solvers for the sparse systems arising from PDEs.

It has won most of the awards going:
- SIAM/ACM Prize in Computational Science and Engineering, 2015
- R&D 100 Award, 2009
- Gordon Bell Prizes in 2009, 2004, 2003, 1999
- ...

PETSc makes it easy to express complex hierarchical composed solvers as compactly as possible.
Fundamental objects
[Vec, Mat, PC, KSP, SNES]
Vec
Vec represents a dense vector, decomposed in parallel.
Example
ierr = VecCreateMPI(PETSC_COMM_WORLD, local, global, &x);
ierr = VecDuplicate(x, &y);
ierr = VecDotBegin(x, y, &xTy);
/* other computations */
ierr = VecDotEnd(x, y, &xTy);
Mat
Mat represents a sparse matrix, decomposed in parallel.
Example
ierr = MatCreateAIJ(PETSC_COMM_WORLD, ..., &mat);
for (i = 0; i < local_rows; i++)
    ierr = MatSetValues(mat, ...);
ierr = MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
ierr = MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
ierr = MatMult(mat, x, y);
PC
PC represents a linear preconditioner (Jacobi, Gauss-Seidel, ILU, ICC, AMG, additive Schwarz, ...)
Example
ierr = PCCreate(PETSC_COMM_WORLD, &pc);
ierr = PCSetOperators(pc, A, P);
ierr = PCSetType(pc, PCILU);
ierr = PCSetUp(pc);
ierr = PCApply(pc, x, y);
KSP
KSP represents a linear solver (CG, GMRES, TFQMR, BICGSTAB, MINRES, GCR, Richardson, Chebyshev, ...)
Example
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);
ierr = KSPSetOperators(ksp, A, P);
ierr = KSPSetType(ksp, KSPCG);
ierr = KSPSetUp(ksp);
ierr = KSPSolve(ksp, b, x);
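Since this course drives PETSc from Python, the same Vec/Mat/KSP workflow in petsc4py looks roughly as follows (a hedged sketch; the 1D Laplacian is an illustrative choice, not from the slides):

from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):     # assemble a toy 1D Laplacian in parallel
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemblyBegin(); A.assemblyEnd()

x, b = A.createVecs()
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
ksp.setFromOptions()              # let -ksp_type etc. override at runtime
ksp.solve(b, x)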
SNES
SNES represents a nonlinear solver (Newton, reduced-space Newton, NGMRES, NCG, Anderson acceleration, FAS, ...)
Example
ierr = SNESCreate(PETSC_COMM_WORLD, &snes);
ierr = SNESSetFunction(snes, r, residual);
ierr = SNESSetJacobian(snes, J, P, jacobian);
ierr = SNESSetType(snes, SNESVINEWTONRSLS);
ierr = SNESSetVariableBounds(snes, xl, xu);
ierr = SNESSetUp(snes);
ierr = SNESSolve(snes, b, x);
Hierarchical composition
Principle
All objects are composable.
Principle
All objects are configurable.
(example from variational fracture mechanics)
Wiring PETSc and FEniCS
We’re going to need fine control to design our solvers.
A simple interface between FEniCS and PETSc:
$ git clone https://bitbucket.org/pefarrell/dolfin-snes-interface.git
Solving PDEs on Supercomputers IV: algebraic multigrid
Patrick Farrell
MMSC: Python in Scientific Computing
May 18, 2015
Multilevel solvers
At the core of most PDE solvers is the solution of a linear system
Linear system
Ax = b
The most powerful solvers for PDEs exploit the fact that there exists an infinite hierarchy of discretisations, all approximating the same problem:
Hierarchy of linear systems
...
A_h x_h = b_h
A_{2h} x_{2h} = b_{2h}
A_{4h} x_{4h} = b_{4h}
...
Geometric multigrid: review
Geometric multigrid algorithm
- Begin with an initial guess.
- Apply a relaxation method to smooth the error.
- Solve for the smooth error on a coarse grid.
Why did geometric multigrid work?
Geometric multigrid worked on the Laplacian because:
- simple relaxation methods yielded geometrically smooth errors;
- those errors could be well-represented on coarse grids.
What about problems where the error isn’t smooth after relaxation?
Anisotropic Laplacian
−a u_xx − b u_yy = f in Ω = [0, 1]²
u = g on ∂Ω
a = b if x < 1/2,
a ≫ b if x ≥ 1/2.
Two responses
GMG:
- design increasingly arcane relaxation methods that do smooth;
- semi-coarsening, multi-coarsening, etc.

AMG:
- fix a simple relaxation method;
- algebraically construct coarse grids and interpolation operators;
- demand that these can well represent the error after relaxation.

A nice side effect: AMG requires much less infrastructure:

Pros:
- No need to supply coarse grids
- No need to supply interpolation operators

Cons:
- Only applies to linear problems
- Requires global linearisation (memory)
- Requires near-nullspace of operator
Anisotropic Laplacian again
Fundamental principles of AMG I: relaxation and error
Recall Richardson iteration with a preconditioner P:

Richardson iteration
x_{k+1} = x_k + P⁻¹(b − A x_k).

A simple error analysis shows

Error analysis of Richardson iteration
e_{k+1} = (I − P⁻¹A) e_k

Now if e_{k+1} ≈ e_k then

Near-nullspace of A
P⁻¹A e_k ≈ 0 ⟹ A e_k ≈ 0.
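A tiny numpy experiment (an added illustration, not from the slides) makes this concrete: after many damped-Jacobi Richardson sweeps on a 1D Laplacian, the surviving error e is nearly in the near-nullspace of A.

import numpy as np

n = 50
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian stencil
Pinv = (2.0/3.0) * np.diag(1.0/np.diag(A))           # damped Jacobi P^{-1}

e = np.random.rand(n)             # error of some arbitrary initial guess
for k in range(50):
    e = e - Pinv @ (A @ e)        # e_{k+1} = (I - P^{-1} A) e_k

print(np.linalg.norm(A @ e) / np.linalg.norm(e))     # small: A e ≈ 0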
Fundamental principles of AMG I: relaxation and error
Error after relaxation
The error after relaxation is related to the near-nullspace of the operator.
Fundamental principles of AMG II: interpolation
Recall that in one multigrid cycle we approximate the fine error as
Approximation of fine error
e_h ≈ P_H^h e_H

Thus, we want the near-nullspace to be in the range of P_H^h.
Coarse grid generation: an example
Classical AMG: coarse-grid generation
1. Select C-point with maximal measure
2. Select neighbours as F-points
3. Update measures of neighbours
Coarse grid generation: an example
Smoothed-aggregation AMG: coarse-grid generation
Phase 1:
1. Pick a root point not adjacent to an aggregation
2. Aggregate root and neighbours
Phase 2: Move points into nearby aggregations
HPC 04 Challenge!
Consider the linear elasticity equation
−∇ · σ(u) = f in Ω
u = 0 on ∂Ω_D
σ · n = 0 on ∂Ω_N
on the pulley mesh, where
ε(u) = (1/2)(∇u + ∇uᵀ),
σ(u) = 2με(u) + λ tr(ε(u)) I,
f = (ρω²x, ρω²y, 0),
∂Ω_D = {(x, y, z) ∈ ∂Ω | x² + y² < (3.75 − 0.17z)²},
∂Ω_N = ∂Ω \ ∂Ω_D,
E = 10⁹, ν = 0.3, ρ = 10, ω = 300.
HPC 04 Challenge!
Solve this problem using only smoothed aggregation algebraic multigrid (no Krylov accelerator: -ksp_type richardson -ksp_monitor_true_residual -pc_type gamg).
How many iterations does it take to converge to an absolute tolerance of 10⁻¹²
(a) without the near-nullspace
(b) with the near-nullspace?
Here the near-nullspace is the rigid body translations and rotations.
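One hedged way to build and attach those modes from Python (a sketch of mine: it assumes V is the vector function space, A the assembled PETSc-backed operator, and DOLFIN/petsc4py APIs of roughly this vintage):

from dolfin import *
from petsc4py import PETSc

# The six rigid body modes of 3D elasticity: three translations and
# three (linearised) rotations, written as DOLFIN expressions.
modes = [Constant((1, 0, 0)), Constant((0, 1, 0)), Constant((0, 0, 1)),
         Expression(("0", "-x[2]", "x[1]")),    # rotation about x
         Expression(("x[2]", "0", "-x[0]")),    # rotation about y
         Expression(("-x[1]", "x[0]", "0"))]    # rotation about z

basis = [as_backend_type(interpolate(m, V).vector()).vec() for m in modes]
nullsp = PETSc.NullSpace().create(vectors=basis)
as_backend_type(A).mat().setNearNullSpace(nullsp)   # consumed by GAMG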
Now investigate the configuration of the smoothed aggregation AMG solver and the Krylov accelerator (hint: -help, -snes_view). By tuning the solver, can you achieve faster convergence?
Solving PDEs on Supercomputers V: algebraic multigrid on nonsymmetric problems
Patrick Farrell
MMSC: Python in Scientific Computing
May 19, 2015
HPC 05 Challenge! (1/3)
Implement a solver for the Yamabe equation
−8∇²u + (1/r³)u⁵ − (1/10)u = 0
on the doughnut mesh with boundary conditions u = 1.
Initialise Newton with the initial guess u = 1.
HPC 05 Challenge! (2/3)
Next, develop an efficient linear solver:
1. First use Newton + LU.
2. Next, try GMRES + GAMG. Does it work well?
3. Try increasing the maximum size of the coarse grid (pc_gamg_coarse_eq_limit).
4. Ah! Now we're getting somewhere. Does changing the smoother help (mg_levels_ksp_monitor_true_residual)?
5. Increase the quality of the smoothed aggregation basis (pc_gamg_agg_nsmooths).
HPC 05 Challenge! (3/3)
Profile the code. Where is it spending most of its time?
How can the preconditioner construction cost be reduced?
Once that is done, compare the memory usage of GMRES, FGMRES, GCR and CGS.
Solving PDEs on Supercomputers VI: fieldsplit preconditioners
Patrick Farrell
MMSC: Python in Scientific Computing
May 19, 2015
Block triangular factorisations
A block matrix with nonsingular A has a block triangular factorisation:

Block triangular factorisation

J = [ A  B ] = [ I     0 ] [ A  0 ] [ I  A⁻¹B ]
    [ C  D ]   [ CA⁻¹  I ] [ 0  S ] [ 0  I    ]

where S = D − CA⁻¹B is the (dense!) Schur complement.

This gives us an expression for its inverse:

Block triangular inverse

[ A  B ]⁻¹ = [ I  −A⁻¹B ] [ A⁻¹  0   ] [ I      0 ]
[ C  D ]     [ 0   I    ] [ 0    S⁻¹ ] [ −CA⁻¹  I ]
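A quick numpy sanity check of the factorisation (purely illustrative: random blocks, with A shifted to be safely nonsingular):

import numpy as np

n = 4
A = np.random.rand(n, n) + n*np.eye(n)
B, C, D = (np.random.rand(n, n) for _ in range(3))
I, Z = np.eye(n), np.zeros((n, n))

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                      # Schur complement

J = np.block([[A, B], [C, D]])
L = np.block([[I, Z], [C @ Ainv, I]])
M = np.block([[A, Z], [Z, S]])
U = np.block([[I, Ainv @ B], [Z, I]])

print(np.allclose(J, L @ M @ U))          # True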
Fieldsplit preconditioners
This gives rise to four related theorems.
Theorem (full)
The choice

P = [ I     0 ] [ A  0 ] [ I  A⁻¹B ]
    [ CA⁻¹  I ] [ 0  S ] [ 0  I    ]

will induce Krylov convergence in 1 iteration.
Theorem (lower)
The choice

P = [ I     0 ] [ A  0 ]
    [ CA⁻¹  I ] [ 0  S ]

will induce Krylov convergence in 2 iterations.
Theorem (upper)
The choice

P = [ A  0 ] [ I  A⁻¹B ]
    [ 0  S ] [ 0  I    ]

will induce Krylov convergence in 2 iterations.
Theorem (diag)
The choice

P = [ A  0  ]
    [ 0  −S ]

will induce Krylov convergence in 3 iterations, if D = 0.
How do you use this?
Cheaply approximate A⁻¹ and S⁻¹ (problem specific)!
Spectral equivalence
Definition (spectral equivalence)
A_h and B_h ∈ ℝⁿˣⁿ are spectrally equivalent, A_h ∼ B_h, iff there exist constants c, C independent of h such that

c ≤ λ(B_h⁻¹ A_h) ≤ C.
Solving block-structured systems
Find an approximation S̃ ∼ S or S̃⁻¹ ∼ S⁻¹.
Stokes equations
The Stokes equations are
−ν∇²u + ∇p = 0,
∇ · u = 0.
A stable discretisation yields
J = [ A  Bᵀ ]
    [ B  0  ]

with S = −BA⁻¹Bᵀ.
Spectral equivalence (e.g. Elman, Silvester and Wathen, 2005)
Let Q be the viscosity-weighted pressure mass matrix
Q_ij = ∫_Ω (1/ν) φ_i φ_j.

Then S ∼ Q.
Coding tools
Creating PETSc index sets to extract dofs:

from petsc4py import PETSc

u_dofs = SubSpace(Z, 0).dofmap().dofs()
u_is = PETSc.IS().createGeneral(u_dofs)
Configuring the dofs to split:
fields = [("0", u_is), ("1", p_is)]
snes.ksp.pc.setFieldSplitIS(*fields)
Setting the matrix for building a preconditioner for the Schur complement:
schur = (1.0/nu) * inner(p, q)*dx
schur_full = assemble(schur)
schur_fmat = as_backend_type(schur_full).mat()
schur_mat = schur_fmat.getSubMatrix(p_is, p_is)
snes.ksp.pc.setFieldSplitSchurPreType(PETSc.PC.SchurPreType.USER, schur_mat)
Configuring fieldsplit
--petsc.ksp_converged_reason
--petsc.ksp_type fgmres
--petsc.ksp_monitor_true_residual
--petsc.ksp_atol 1.0e-10
--petsc.ksp_rtol 0.0
--petsc.pc_type fieldsplit
--petsc.pc_fieldsplit_type schur
--petsc.pc_fieldsplit_schur_factorization_type full
--petsc.pc_fieldsplit_schur_precondition user
--petsc.fieldsplit_0_ksp_type richardson
--petsc.fieldsplit_0_ksp_max_it 1
--petsc.fieldsplit_0_pc_type lu
--petsc.fieldsplit_0_pc_factor_mat_solver_package mumps
--petsc.fieldsplit_1_ksp_type bcgs
--petsc.fieldsplit_1_ksp_rtol 1.0e-10
--petsc.fieldsplit_1_ksp_monitor_true_residual
--petsc.fieldsplit_1_pc_type lu
--petsc.fieldsplit_1_pc_factor_mat_solver_package mumps
HPC 06 Challenge!
Solve the Stokes equations with ν = 1/100 on the dolphin.xml mesh, with boundary conditions
u = (0, 0) on ∂Ω_0
u = (−sin(πy), 0) on ∂Ω_1
ν∇u · n = p n on ∂Ω_2,
with colours taken from dolphin_subdomains.xml.
0. Discretise the equation with a stable finite element pair. Integrate both terms in the momentum equation by parts.
1. Solve the problem with LU (UMFPACK/MUMPS).
2. Implement the fieldsplit preconditioner with ideal inner solvers (LU).
3. Now replace the inner solvers with Krylov solvers (CG/ML/5 for A, BCGS/HYPRE/5 for S).
4. What configuration is fastest? full with strong inner solvers? diag with weak inner solvers?
Solving PDEs on Supercomputers VII: PDE-constrained optimisation
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
The mother problem
Consider again the mother problem of PDE-constrained optimisation:
min_{y,u} (1/2) ∫_Ω (y − y_d)² dx + (β/2) ∫_Ω u² dx

subject to

−Δy = u in Ω
y = 0 on ∂Ω

We form the Lagrangian:

L(y, u, λ) = (1/2) ∫_Ω (y − y_d)² dx + (β/2) ∫_Ω u² dx + ∫_Ω (∇λ · ∇y − λu) dx
The optimality conditions
Taking the optimality conditions yields the system: find (y, u, λ) ∈ H^1_0 × L² × H^1_0 such that

∫_Ω ỹ(y − y_d) dx + ∫_Ω ∇λ · ∇ỹ dx = 0 for all ỹ,
β ∫_Ω ũu dx − ∫_Ω λũ dx = 0 for all ũ,
∫_Ω ∇λ̃ · ∇y dx − ∫_Ω λ̃u dx = 0 for all λ̃.

On discretisation, this yields the system

[ M  0   K  ] [ y ]   [ z ]
[ 0  βM  −M ] [ u ] = [ 0 ]
[ K  −M  0  ] [ λ ]   [ 0 ]
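Looking ahead to the HPC 07 challenge, the optimality system can be assembled as one mixed problem; a hedged DOLFIN sketch (the mesh, the β value and the Expression for y_d are illustrative choices, and a DOLFIN 1.x-era API is assumed):

from dolfin import *

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
Z = MixedFunctionSpace([V, V, V])          # (y, u, lambda)

(y, u, lmbd) = TrialFunctions(Z)
(y_t, u_t, l_t) = TestFunctions(Z)

beta = Constant(1.0e-2)
yd = Expression("x[0] <= 0.5 && x[1] <= 0.5 ? 1.0 : 0.0")

a = (inner(y, y_t)*dx + inner(grad(lmbd), grad(y_t))*dx      # first row
     + beta*inner(u, u_t)*dx - inner(lmbd, u_t)*dx           # second row
     + inner(grad(y), grad(l_t))*dx - inner(u, l_t)*dx)      # third row
L = inner(yd, y_t)*dx

bcs = [DirichletBC(Z.sub(0), 0.0, "on_boundary"),
       DirichletBC(Z.sub(2), 0.0, "on_boundary")]

z = Function(Z)
solve(a == L, z, bcs)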
Ingredients of a fieldsplit
Remember, to fieldsplit you need two things:
1. A diagonal block you can cheaply invert
2. A Schur complement you can cheaply approximate

If we take A = [[M, 0], [0, βM]], the first is satisfied.

How about the Schur complement? Calculating, we find

S = KM⁻¹K + (1/β)M.

Bad news
Approximating the inverse of sums is hard.
Two approaches
Approach one: ignore one of the terms (Rees, Dollar, Wathen 2010).

S = KM⁻¹K + (1/β)M ≈ KM⁻¹K

with inverse S⁻¹ ≈ K⁻¹MK⁻¹.

Approach two: approximate the sum with a product (Pearson and Wathen, 2012).

S = (K + (1/√β)M) M⁻¹ (K + (1/√β)M) − (2/√β)K
  ≈ (K + (1/√β)M) M⁻¹ (K + (1/√β)M)

with inverse S⁻¹ ≈ (K + (1/√β)M)⁻¹ M (K + (1/√β)M)⁻¹.
Coding tools
No need to pass index sets with scalar fields:
"""
--petsc.pc_fieldsplit_0_fields 0,1
--petsc.pc_fieldsplit_1_fields 2
"""
You do need index sets to extract submatrices:
trial = split(TrialFunction(Z))[0]
test = split(TestFunction(Z))[0]
bc = DirichletBC(Z.sub(0), 0.0, "on_boundary")
mass_full = assemble(inner(trial, test)*dx)
bc.apply(mass_full)
...
mass_mat = mass_fmat.getSubMatrix(is_0, is_0)
Coding tools
Creating a KSP to handle the solve:
ksp_kbm = PETSc.KSP()
ksp_kbm.create()
ksp_kbm.setType("richardson")
ksp_kbm.pc.setType("lu")
ksp_kbm.setOperators(kbm)
ksp_kbm.setOptionsPrefix("fieldsplit_1_kbm_")
ksp_kbm.setFromOptions()
ksp_kbm.setUp()
Coding tools
Using an approximate inverse action with PCMAT:
"""
--petsc.fieldsplit_1_pc_type mat
"""
Configuring a shell matrix:
class SchurInv(object):
def mult(self, mat, x, y):
ksp_kbm.solve(x, tmp1)
mass.mult(tmp1, tmp2)
ksp_kbm.solve(tmp2, y)
schur = PETSc.Mat()
schur.createPython(mass.getSizes(), SchurInv())
schur.setUp()
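With --petsc.fieldsplit_1_pc_type mat, PCMAT applies this shell matrix directly, so its mult must implement the approximate inverse action S⁻¹; the shell is then passed in as the user Schur preconditioning matrix, as before (a sketch mirroring the earlier call):

# The shell implements the action of S^{-1}, which PCMAT applies as-is.
snes.ksp.pc.setFieldSplitSchurPreType(PETSc.PC.SchurPreType.USER, schur)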
HPC 07 Challenge!
Solve the mother problem on Ω = [0, 1]² with

y_d(x, y) = 1 if (x, y) ∈ [0, 0.5]², 0 otherwise

and homogeneous Dirichlet boundary conditions.

0. Discretise the equation with [P1]³.
1. Solve the problem with LU.
2. Implement the two fieldsplit preconditioners with ideal inner solvers.
3. Which performs best as β → 0?
4. Now choose scalable inner solvers.
5. Which configuration is fastest on the machine?
Solving PDEs on Supercomputers VIII: advanced nonlinear solvers
Patrick Farrell
MMSC: Python in Scientific Computing
May 18, 2015
Globalisation of Newton’s method
Consider again the p-Laplace equation
−∇ · (γ(u)∇u) = f in Ω
u = g on ∂Ω
where
γ(u) = (ε² + (1/2)|∇u|²)^((p−2)/2).
The configuration we considered (p = 5) took 121 iterations to converge. Why?
Newton steps near singular Jacobians
Recall that at our initial guess u = 0, our Jacobian is nearly singular.
If J = UΣVᵀ, then J⁻¹ = VΣ⁻¹Uᵀ, and if σ_min → 0, then

‖δu‖ = ‖J⁻¹F‖ → ∞.
This explains
0 SNES Function norm 3.027343750000e-02
1 SNES Function norm 3.708799037955e+56
2 SNES Function norm 1.173487195603e+56
Responses
A few possible responses:
1. Start with a better initial guess (continuation)
2. Regularise further (undesirable)
3. Take a smaller step (damping with α ≠ 1)!
[Figures: Newton fractals for z³ − 1 = 0 with damping α = 1, 0.75, 0.5, 0.25 and 0.1.]
Linesearch schemes in PETSc
Backtracking linesearch (bt)
- Finds the minimum of a polynomial fit to the l² norm in [0, 1].
- Demands monotonic and sufficient decrease.
- If decrease is insufficient, the interval is reduced.

Good for: convex problems, occasional near-singular Jacobians.
Bad for: nonconvex problems where the residual must increase before convergence.
Linesearch schemes in PETSc
Critical point linesearch (cp)
- Many PDEs have an energy functional to be minimised.
- Suppose F(u) is the gradient of some (unknown) E(u).
- E(u + α du) can be minimised by looking for roots of

  duᵀ F(u + α du) = 0

  with a secant method.
Good for: problems with an energy functional.
Linesearch schemes in PETSc
Affine-covariant linesearch (nleqerr)
- Undamped Newton's method is affine covariant.
- This observation fundamentally changes convergence theorems for Newton (Deuflhard, 2011).
- Convergence criteria are expressed in terms of affine-covariant Lipschitz constants.
- This linesearch estimates these constants and uses them to decide step lengths.

Good for: problems where you can start within singular manifolds; the hardest nonlinear problems.
Nonlinear preconditioning
For a linear problem Ax = b we apply an approximate solver P⁻¹ on the left:

P⁻¹Ax = P⁻¹b.

Write one step of a nonlinear solver for F(x) = b as

x_{i+1} = N(F, x_i, b).
Nonlinear preconditioning
In nonlinear left preconditioning, we define a new residual

R(x) = x − N(F, x, b)

and apply an outer nonlinear solver to R.

In the linear case this is equivalent, since

R(x) = x − N(F, x, b)
     = x + P⁻¹(Ax − b) − x
     = P⁻¹(Ax − b).
Can accelerate an inner solver with an outer solver!
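PETSc exposes this composition through the npc_ (nonlinear preconditioner) options prefix; a hedged petsc4py sketch (the solver types are real SNES types, but the pairing is an illustrative choice):

from petsc4py import PETSc

opts = PETSc.Options()
opts["snes_type"] = "ngmres"           # outer: nonlinear GMRES
opts["npc_snes_type"] = "nrichardson"  # inner: nonlinear Richardson

snes = PETSc.SNES().create()
snes.setFromOptions()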
Examples of nonlinear preconditioning
Hyperelasticity (Brune et al, 2013)
Inner solver: Newton. Outer solver: nonlinear conjugate gradients.
High-Reynolds number Navier–Stokes (Cai and Keyes, 2002)
Inner solver: nonlinear additive Schwarz. Outer solver: Newton–Krylov.
High-Prandtl number Navier–Stokes (Brune et al, 2013)
Inner solver: nonlinear multigrid. Outer solver: nonlinear GMRES.
Nonlinear preconditioning: a remark
The design space for nonlinear solvers is vast.
At the moment we have very little theory to guide us.
There are very large potential gains, however.
Nonlinear multigrid
The main bottleneck for massive problems is the linear system.
What if we didn’t have to solve (large) linear systems?
FAS uses fine-grid residuals to correct coarse-grid equations.
Full Approximation Scheme (FAS)
Given:
- a problem (F^h, x^h, b^h)
- a smoother S and a coarse solver M
- restriction, prolongation and injection operators R, P and R̂.

while not converged:
    x^h_s = S(F^h, x^h_i, b^h)
    x^H = R̂ x^h_s
    b^H = R[b^h − F^h(x^h_s)] + F^H(x^H)
    x^H_c = M(F^H, x^H, b^H)
    x^h_c = x^h_s + P[x^H_c − x^H]
    x^h_{i+1} = S(F^h, x^h_c, b^h)
Nonlinear multigrid
You can use
- a high-flop smoother on the fine grids,
- and Newton-LU on the coarse grids!
(see firedrake Yamabe demo)
HPC 08 Challenge!
Consider again the p-Laplace equation (FEniCS lecture III).
1. Investigate the performance of different linesearch schemes on the p-Laplace problem.
2. Using only the basic linesearch for the inner solver, accelerate the convergence of Newton's method with left-preconditioning with ncg/cp.
3. Now use the optimal inner linesearch to beat the unaccelerated solver.
4. Choose sensible Krylov solvers and scale the code on ARCUS.
Solving PDEs on Supercomputers IX: a final challenge
Patrick Farrell
MMSC: Python in Scientific Computing
May 17, 2015
HPC 09 Challenge! (1/2)
Consider the Cahn–Hilliard equation

∂c/∂t − ∇ · M(∇(df/dc − λ∇²c)) = 0 in Ω,
M(∇(df/dc − λ∇²c)) · n = 0 on ∂Ω,
Mλ∇c · n = 0 on ∂Ω,

where c is the unknown field, f(c) = 100c²(c − 1)², n is the unit normal, and M is a scalar parameter.

To solve this with standard C⁰ elements, write it as two coupled second-order problems.
HPC 09 Challenge! (2/2)
Discretise and solve the equation on Ω = [0, 1]² for M = 1, λ = 10⁻², and initial condition
import random
from dolfin import *

class InitialConditions(Expression):
    def __init__(self):
        random.seed(2 + MPI.rank(mpi_comm_world()))
    def eval(self, values, x):
        values[0] = 0.63 + 0.02*(0.5 - random.random())
        values[1] = 0.0
    def value_shape(self):
        return (2,)
Make sure your scheme is at least second-order. Sensible values are Δt = 5 × 10⁻⁶, θ = 0.5.
An excellent preconditioner is discussed in doi:10.1137/130921842.