TRANSCRIPT
Scalable Solvers and Software for PDE Applications
David E. Keyes
Department of Applied Physics & Applied Mathematics
Columbia University
Institute for Scientific Computing Research
Lawrence Livermore National Laboratory
www.tops-scidac.org
IMI Lecture, 26 Jan 2004
Motivation
Solver performance is a major concern for parallel simulations based on PDE formulations … including many of those of the U.S. DOE Scientific Discovery through Advanced Computing (SciDAC) program.
For target applications, implicit solvers may require 50% to 95% of execution time … at least, before “expert” overhaul for algorithmic optimality and implementation performance.
Even after a “best manual practice” overhaul, the solver may still require 20% to 50% of execution time.
The solver may hit both the processor scalability limit and the memory bandwidth limitation of a PDE-based application before any other part of the code … the first of these is not fundamental, but the second is.
Presentation plan
Overview of the SciDAC initiative
Brief review of scalable implicit methods (domain-decomposed multilevel iterative methods): algorithms; software components (PETSc, Hypre, etc.)
Overview of the Terascale Optimal PDE Simulations (TOPS) project
Three “war stories” from the SciDAC magnetically confined fusion energy portfolio
Some advanced research directions: physics-based preconditioning; nonlinear Schwarz
On the horizon
SciDAC apps and infrastructure
4 projects in high energy and nuclear physics
5 projects in fusion energy science
14 projects in biological and environmental research
10 projects in basic energy sciences
18 projects in scientific software and network infrastructure
“Enabling technologies” groups to develop reusable software and partner with application groups
From the 2001 start-up, 51 projects share $57M/year: approximately one-third for applications, a third for “integrated software infrastructure centers,” and a third for grid infrastructure and collaboratories
Plus, multi-Tflop/s IBM SP machines at NERSC and ORNL are available to SciDAC researchers
Unclassified resources for DOE science
“Cheetah” (Oak Ridge): IBM Power4 Regatta, 32 procs per node, 864 procs total, 4.5 Tflop/s
“Seaborg” (Berkeley): IBM Power3+ SMP, 16 procs per node, 6656 procs total, 10 Tflop/s
Designing a simulation code (from 2001 SciDAC report)
[Figure: design cycle with a V&V loop and a performance loop]
A “perfect storm” for simulation
[Figure: applications resting on hardware infrastructure and architectures, at the confluence of scientific models, numerical algorithms, computer architecture, and scientific software engineering; symbolic dates 1686, 1947, 1976]
“Computational science is undergoing a phase transition.” – D. Hitchcock, DOE
Imperative: multiple-scale applications
Multiple spatial scales: interfaces, fronts, and layers thin relative to domain size, $\delta \ll L$
Multiple temporal scales: fast waves with small transit times relative to convection or diffusion, $\tau \ll T$
Analyst must isolate the dynamics of interest and model the rest in a system that can be discretized over a more modest range of scales
May lead to an infinitely “stiff” subsystem requiring special treatment by the solution method
[Figure: Richtmyer-Meshkov instability, c/o A. Mirin, LLNL]
Examples: multiple-scale applications
Biopolymers, nanotechnology: 10^12 range in time, from 10^-15 s (quantum fluctuation) to 10^-3 s (molecular folding time); the typical computational model ignores the smallest scales and works on classical dynamics only, but scientists increasingly want both
Galaxy formation: 10^20 range in space, from binary star interactions to the diameter of the universe; heroic computational models handle all scales with localized adaptive meshing
Supernova simulation: massive ranges in time and space scales for radiation, turbulent convection, diffusion, chemical reaction, and nuclear reaction
[Figure: supernova simulation, c/o A. Mezzacappa, ORNL]
SciDAC portfolio characteristics
Multiple temporal scales
Multiple spatial scales
Linear ill-conditioning
Complex geometry and severe anisotropy
Coupled physics, with essential nonlinearities
Ambition for uncertainty quantification, parameter estimation, and design
Need a toolkit of portable, extensible, tunable implicit solvers, not “one size fits all”
TOPS starting point codes
PETSc (ANL), Hypre (LLNL), Sundials (LLNL), SuperLU (LBNL), PARPACK (LBNL*), TAO (ANL), Veltisto (CMU)
Many interoperability connections between these packages predated SciDAC
Many application collaborators predated SciDAC
TOPS participants
TOPS labs (3): ANL, LBNL, LLNL
TOPS universities (7): CMU, CU, CU-B, NYU, ODU, UC-B, UT-K
In the old days, see the “Templates” guides:
www.netlib.org/templates (124 pp.)
www.netlib.org/etemplates (410 pp.)
… these are good starts, but not adequate for SciDAC scales!
“Integrated software infrastructure centers”
34 applications groups
7 ISIC groups (4 CS, 3 Math)
10 grid, data collaboratory groups
adaptive gridding, discretization
solvers (TOPS): $f(\dot{x}, x, t, p) = 0$; $F(x, p) = 0$; $Ax = b$; $Ax = \lambda Bx$; $\min_u \phi(x, u)$ s.t. $F(x, u) = 0$
systems software, component architecture, performance engineering, data management
software integration
performance optimization
Keyword: “Optimal”
Convergence rate nearly independent of discretization parameters: multilevel schemes for rapid linear convergence of linear problems; Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters: continuation schemes; physics-based preconditioning
Optimal convergence plus a scalable loop body yields a scalable solver
[Figure: time to solution vs. problem size (increasing with number of processors, 1 to 1000): the unscalable curve grows with problem size, the scalable curve stays nearly flat]
But where to go past O(N)?
Since O(N) is already optimal, there is nowhere further “upward” to go in efficiency; one must extend optimality “outward,” to more general problems
Hence, for instance, algebraic multigrid (AMG), which seeks to obtain O(N) in indefinite, anisotropic, or inhomogeneous problems on irregular grids
AMG framework (in $\mathbb{R}^n$): error easily damped by pointwise relaxation vs. algebraically smooth error; choose coarse grids, transfer operators, and smoothers to eliminate these “bad” (algebraically smooth) components within a smaller-dimensional space, and recur
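To make the recursion concrete, here is the standard two-grid error-propagation identity (standard multigrid notation, not from the slide itself: smoother $I - M^{-1}A$, restriction $R$, prolongation $P$, Galerkin coarse operator $A_c = RAP$):

% Standard two-grid cycle: pre-smooth, coarse-grid correction, post-smooth
e_{\mathrm{new}} = (I - M^{-1}A)\,\bigl(I - P A_c^{-1} R A\bigr)\,(I - M^{-1}A)\,e_{\mathrm{old}},
\qquad A_c = R A P
% AMG chooses R, P, and the smoother so that the middle factor removes
% exactly the "algebraically smooth" error the smoother leaves behind.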
Toolchain for PDE solvers in the TOPS project
Design and implementation of “solvers”:
Time integrators (w/ sens. anal.): $f(\dot{x}, x, t, p) = 0$
Nonlinear solvers (w/ sens. anal.): $F(x, p) = 0$
Constrained optimizers: $\min_u \phi(x, u)$ s.t. $F(x, u) = 0$, $u \geq 0$
Linear solvers: $Ax = b$
Eigensolvers: $Ax = \lambda Bx$
Software integration
Performance optimization
[Diagram: dependences among Optimizer, Sens. Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver; arrows indicate dependence]
Dominant data structures are grid-based: finite differences, finite elements, finite volumes
All lead to problems with sparse Jacobian matrices; many tasks can leverage an efficient set of tools for manipulating distributed sparse data structures (a small illustration follows)
[Figure: sparse Jacobian J; row i corresponds to grid node i]
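As a sketch of this grid-induced sparsity (SciPy for illustration, not TOPS code; the grid size n is arbitrary), the 5-point finite-difference Laplacian on an n-by-n grid yields one Jacobian row per grid node, with at most five nonzeros per row:

import scipy.sparse as sp

n = 8                                                 # the grid is n x n, one unknown per node
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))   # 1D [-1, 2, -1] stencil
I = sp.identity(n)
# 5-point 2D Laplacian via a Kronecker sum: row i of J couples grid node i
# to its four neighbors, mirroring the grid connectivity
J = sp.kron(I, T) + sp.kron(T, I)
print(J.shape, J.nnz)                                 # (64, 64), at most 5 nonzeros per row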
Newton-Krylov-Schwarz: a PDE applications “workhorse”
Newton: nonlinear solver, asymptotically quadratic
$F(u) \approx F(u_c) + F'(u_c)\,\delta u = 0$, $\quad u = u_c + \delta u$
Krylov: accelerator, spectrally adaptive
$J\,\delta u = -F$, $\quad \delta u = \mathrm{argmin}_{x \in V}\,\|Jx + F\|$, $\quad V = \mathrm{span}\{F, JF, J^2F, \dots\}$
Schwarz: preconditioner, parallelizable
$M^{-1}J\,\delta u = -M^{-1}F$, $\quad M^{-1} = \sum_i R_i^T (R_i J R_i^T)^{-1} R_i$
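The Schwarz formula above is easy to prototype. A minimal dense-matrix sketch (the index sets and the toy tridiagonal Jacobian are invented for the example):

import numpy as np

def additive_schwarz_apply(J, subdomains, r):
    # Apply M^{-1} r = sum_i R_i^T (R_i J R_i^T)^{-1} R_i r
    z = np.zeros_like(r)
    for idx in subdomains:
        Ji = J[np.ix_(idx, idx)]                  # subdomain block R_i J R_i^T
        z[idx] += np.linalg.solve(Ji, r[idx])     # local solve, then prolong and add
    return z

# Toy usage: 1D Laplacian with two overlapping subdomains
n = 10
J = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
subdomains = [np.arange(0, 6), np.arange(4, 10)]  # overlap on indices 4, 5
r = np.ones(n)
print(additive_schwarz_apply(J, subdomains, r))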
SPMD parallelism w/ domain decomposition
Partitioning of the grid induces block structure on the Jacobian
[Figure: a grid partitioned among procs 1, 2, 3; Jacobian blocks A21, A22, A23 form the rows assigned to proc “2”]
Time-implicit Newton-Krylov-Schwarz
For accommodation of unsteady problems, and for nonlinear robustness in steady ones, the NKS iteration is wrapped in a pseudo-time-stepping loop:

for (l = 0; l < n_time; l++) {              // pseudo-time loop
  select time step
  for (k = 0; k < n_Newton; k++) {          // nonlinear (Newton) loop
    compute nonlinear residual and Jacobian
    for (j = 0; j < n_Krylov; j++) {        // linear (Krylov/NKS) loop
      forall (i = 0; i < n_Precon; i++) {
        solve subdomain problems concurrently
      } // end of loop over subdomains
      perform Jacobian-vector product
      enforce Krylov basis conditions
      update optimal coefficients
      check linear convergence
    } // end of linear solver
    perform DAXPY update
    check nonlinear convergence
  } // end of nonlinear loop
} // end of time-step loop
(N)KS kernel in parallel
[Figure: per-iteration timeline on processors P1 … Pn: local scatter, Jacobian-vector multiply, preconditioner sweep, DAXPY, inner product]
The bulk synchronous model leads to easy scalability analyses and projections: each phase can be considered separately. What happens if, for instance, in this (schematized) iteration, arithmetic speed is doubled, the scalar all-gather is quartered, and the local scatter is cut by one-third?
Estimating scalability of stencil computations
Given complexity estimates of the leading terms of: the concurrent computation (per iteration phase), the concurrent communication, and the synchronization frequency
And a bulk synchronous model of the architecture, including: internode communication (network topology and protocol, reflecting horizontal memory structure) and on-node computation (effective performance parameters, including vertical memory structure)
One can estimate optimal concurrency and optimal execution time, on a per-iteration basis or overall (by taking into account any granularity-dependent convergence rate), based on problem size N and concurrency P: simply differentiate the time estimate in terms of (N, P) with respect to P, equate to zero, and solve for P in terms of N (a worked instance follows)
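A worked instance of that recipe, under an illustrative bulk-synchronous cost model (the coefficients a, b, c and the 3D-stencil scalings are assumptions for the example, not measurements):

% Illustrative cost model: flops scale as N/P, nearest-neighbor halo exchange
% as the subdomain surface (N/P)^{2/3}, tree-based all-reduce as log P
T(N, P) = a\,\frac{N}{P} + b\left(\frac{N}{P}\right)^{2/3} + c\,\log_2 P
% Differentiating in P and keeping the dominant flop and reduction terms:
\frac{\partial T}{\partial P} \approx -a\,\frac{N}{P^2} + \frac{c}{P \ln 2} = 0
\quad\Longrightarrow\quad P_{\mathrm{opt}} \approx \frac{a \ln 2}{c}\,N
% Optimal concurrency grows linearly with N: the tree-based case on the next slide.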
Scalability results for DD stencil computations
With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales linearly with problem size
With 3D torus-based global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales as the three-fourths power of problem size (almost “scalable”)
With a common network bus (heavy contention): the optimal number of processors scales as the one-fourth power of problem size (not “scalable”); bad news for conventional Beowulf clusters, but see the 2000 Bell Prize “price-performance awards” for multiple NICs
NKS efficiently implemented in PETSc’s MPI-based distributed data structures
[Figure: user code supplies application initialization, function evaluation, Jacobian evaluation, and post-processing; the PETSc side layers Timestepping Solvers (TS) over Nonlinear Solvers (SNES) over Linear Solvers (SLES), built on KSP and PC]
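A minimal sketch of this user-code/PETSc division of labor, using the modern petsc4py bindings (which postdate this 2004 talk); the pointwise test problem is invented, and a working PETSc + petsc4py installation is assumed:

from petsc4py import PETSc

n = 100

def form_function(snes, x, f):
    # User-supplied nonlinear residual F(x) = x^3 + 2x - 1, componentwise
    xx = x.getArray(readonly=True)
    f.getArray()[:] = xx**3 + 2.0*xx - 1.0

x = PETSc.Vec().createSeq(n)            # solution vector (serial, for clarity)
r = x.duplicate()                       # residual work vector
snes = PETSc.SNES().create()
snes.setFunction(form_function, r)
snes.setUseMF(True)                     # Jacobian-vector products matrix-free, by FD
snes.getKSP().getPC().setType('none')   # no preconditioner for this toy problem
snes.setFromOptions()                   # honor -snes_monitor, -ksp_type, ...
snes.solve(None, x)
print(x.getArray()[:3])                 # root of x^3 + 2x - 1 = 0 is ~0.4534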
User code / PETSc library interactions
[Figure: the same call structure as above; the user-supplied function and Jacobian evaluation routines can be AD-generated code]
1999 Bell Prize for unstructured grid computational aerodynamics
Implemented in PETSc: www.mcs.anl.gov/petsc
[Figure: transonic “lambda” shock, Mach contours on surfaces; mesh c/o D. Mavriplis, ICASE]
Fixed-size parallel scaling results
Four orders of magnitude in 13 years
128 nodes, 43 min → 3072 nodes, 2.5 min, 226 Gflop/s; 11M unknowns, 70% efficient
c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith
Three “war stories” from magnetic fusion energy applications in SciDAC
Physical models based on fluid-like magnetohydrodynamics (MHD):
$\partial B/\partial t = -\nabla\times E$, $\quad \nabla\cdot B = 0$
$E + V\times B = \eta J$, $\quad \mu_0 J = \nabla\times B$
$\partial n/\partial t + \nabla\cdot(nV) = \nabla\cdot(D\nabla n)$
$\rho\,(\partial V/\partial t + V\cdot\nabla V) = J\times B - \nabla p + \nabla\cdot(\rho\nu\nabla V)$
$\partial(nT)/\partial t + \nabla\cdot(nTV) = \nabla\cdot\bigl(n[\chi_\parallel \hat{b}\hat{b} + \chi_\perp(I - \hat{b}\hat{b})]\cdot\nabla T\bigr) + Q$
Challenges in magnetic fusion
Conditions of interest possess two properties that pose great challenges to numerical approaches: anisotropy and stiffness.
Anisotropy produces subtle balances of large forces, and vastly different parallel and perpendicular transport properties.
Stiffness reflects the vast range of time scales in the system: the targeted physics is slow (~transport scale) compared to the waves.
Tokamak/stellarator simulations
Center for Extended MHD Modeling (based at Princeton Plasma Physics Lab): M3D code
Realistic toroidal geometry, unstructured mesh, hybrid FE/FD discretization
Fields expanded in scalar potentials and streamfunctions
Operator-split, linearized, w/ 11 potential solves in each poloidal cross-plane per step (90% of execution time)
Parallelized w/ PETSc (Tang et al., SIAM PP01; Chen et al., SIAM AN02; Jardin et al., SIAM CSE03)
Want from TOPS:
Now: scalable linear implicit solver for much higher resolution (and for AMR)
Later: fully nonlinearly implicit solvers and coupling to other codes
Provided new solvers across existing interfaces
Hypre in PETSc: codes with a PETSc interface (like CEMM’s M3D) can now invoke Hypre routines as solvers or preconditioners with a command-line switch (see the sketch below)
SuperLU_DIST in PETSc: as above, with SuperLU_DIST
Hypre in the AMR Chombo code: so far, Hypre is a level-solver only; its AMG will ultimately be useful as a bottom-solver, since it can be coarsened indefinitely without attention to loss of nested geometric structure; FAC is also being developed for AMR uses, like Chombo
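For instance, a hypothetical M3D-style run might switch between solver stacks purely from the command line (option names follow later PETSc releases; the executable name is a stand-in):

mpiexec -n 64 ./m3d -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg
mpiexec -n 64 ./m3d -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist

The same binary, two different solver stacks, no recompilation.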
Hypre: multilevel preconditioning
[Figure: a multigrid V-cycle: a smoother is applied on the finest grid; restriction transfers the residual from the fine grid to the first coarse grid (fewer cells, less work and storage); the idea is applied recursively until the problem is easy to solve; prolongation transfers corrections from coarse back to fine grids]
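A minimal 1D geometric V-cycle sketch of this idea (weighted-Jacobi smoothing, full-weighting restriction, linear prolongation; illustrative NumPy, not Hypre code):

import numpy as np

def apply_A(u, h):
    # Matrix-free 1D Laplacian (-u'') with zero Dirichlet boundaries
    up = np.pad(u, 1)
    return (2*up[1:-1] - up[:-2] - up[2:]) / h**2

def smooth(u, f, h, sweeps=2, omega=2.0/3.0):
    # Weighted-Jacobi relaxation (the "smoother")
    for _ in range(sweeps):
        u = u + omega * (h**2 / 2) * (f - apply_A(u, h))
    return u

def v_cycle(u, f, h):
    # One V-cycle; assumes len(u) = 2^k - 1 interior points
    if len(u) == 1:
        return f * h**2 / 2                       # coarsest grid: exact solve
    u = smooth(u, f, h)                           # pre-smooth
    r = f - apply_A(u, h)                         # residual
    rc = (r[0:-2:2] + 2*r[1:-1:2] + r[2::2]) / 4  # restriction (full weighting)
    ec = v_cycle(np.zeros(len(rc)), rc, 2*h)      # recur on the coarse grid
    ecp = np.pad(ec, 1)
    e = np.empty_like(u)
    e[1::2] = ec                                  # prolongation: inject at coarse
    e[0::2] = (ecp[:-1] + ecp[1:]) / 2            # points, interpolate between
    return smooth(u + e, f, h)                    # post-smooth

n = 2**7 - 1
h = 1.0 / (n + 1)
x = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * x)                  # exact solution: sin(pi x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, h)
print(np.abs(u - np.sin(np.pi * x)).max())        # down to discretization error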
Hypre’s AMG in M3D
The PETSc-based PPPL code M3D has been retrofit with Hypre’s algebraic MG solver of Ruge-Stüben type
Iteration count results below are averaged over 19 different PETSc SLESSolve calls in the initialization and one timestep loop of this operator-split unsteady code; the abscissa is the number of procs in a scaled problem, with problem size ranging from 12K to 303K unknowns (approx. 4K per processor)
[Figure: average iteration counts, 0 to 700, vs. procs (3, 12, 27, 48, 75) for ASM-GMRES and AMG-FGMRES]
Hypre’s AMG in M3D
Scaled speedup timing results below are summed over 19 different PETSc SLESSolve calls in the initialization and one timestep loop of this operator-split unsteady code
The majority of the AMG cost is coarse-grid formation (preprocessing), which does not scale as well as the inner-loop V-cycle phase; in production, these coarse hierarchies will be saved for reuse (the same linear systems are solved in each timestep loop), making AMG much less expensive and more scalable
[Figure: execution time, 0 to 60, vs. procs (3, 12, 27, 48, 75) for ASM-GMRES, AMG-FGMRES, and AMG inner (est.)]
Hypre’s “Conceptual Interfaces”
[Figure: linear system interfaces map data layouts (structured, composite, block-structured, unstructured, CSR) to linear solvers (GMG, FAC, Hybrid, AMGe, ILU, ...); slide c/o E. Chow, LLNL]
SuperLU in NIMROD
NIMROD is another MHD code in the CEMM collaboration; it employs high-order elements on unstructured grids
Very poor convergence with the default Krylov solver on 2D poloidal crossplane linear solves
TOPS wired in SuperLU, just to try a sparse direct solver
Speedup of more than 10 in serial, and about 8 on a modest parallel cluster (24 procs)
PI Dalton Schnack (General Atomics) thought he had entered a time machine
SuperLU is not a “final answer,” but a sanity check; parallel ILU under Krylov should be superior
2D Hall MHD sawtooth instability (PETSc examples /snes/ex29.c and /sles/ex31.c)
Model equations: (Porcelli et al., 1993, 1999)
[Figures, c/o A. Bhattacharjee, CMRS: equilibrium; vorticity at an early time; vorticity at a later time, with zoom]
PETSc’s DMMG in the Hall MR application
[Figure: implicit code (snes/ex29.c) versus explicit code (sles/ex31.c), both with second-order integration in time; and the implicit code with first- versus second-order integration in time]
Abstract Gantt chart for TOPS
[Figure: rows for Algorithmic Development (e.g., ASPIN), Research Implementations (e.g., TOPSLib), Hardened Codes (e.g., PETSc), Applications Integration, and Dissemination, plotted against time]
Each color module represents an algorithmic research idea on its way to becoming part of a supported community software tool. At any moment (vertical time slice), TOPS has work underway at multiple levels. While some codes are in applications already, they are being improved in functionality and performance as part of the TOPS research agenda.
Jacobian-free Newton-Krylov
In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivative
$Jv \approx \frac{1}{\epsilon}\,[F(u + \epsilon v) - F(u)]$
(where $\epsilon$ is chosen with a fine balance between approximation error and floating-point rounding error) or by automatic differentiation, so that the actual Jacobian elements are never explicitly needed
One builds the Krylov space on a true $F'(u)$ (to within numerical approximation)
Philosophy of Jacobian-free NK
To evaluate the linear residual, we use the true $F'(u)$, giving a true Newton step and asymptotic quadratic Newton convergence
To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
Jacobian of a lower-order discretization; Jacobian with “lagged” values for expensive terms; Jacobian stored in lower precision; Jacobian blocks decomposed for parallelism; Jacobian of a related discretization; operator-split Jacobians; physics-based preconditioning
Recall the idea of preconditioning
Krylov iteration is expensive in memory and in function evaluations, so the subspace dimension $k$ must be kept small in practice, through preconditioning the Jacobian with an approximate inverse, so that the product matrix has a low condition number, as in
$(B^{-1}A)\,x = B^{-1}b$
Given the ability to apply the action of $B^{-1}$ to a vector, preconditioning can be done on either the left, as above, or the right, as in, e.g., for the matrix-free case:
$JB^{-1}v \approx \frac{1}{\epsilon}\,[F(u + \epsilon B^{-1}v) - F(u)]$
Physics-based preconditioning
In Newton iteration, one seeks to obtain a correction (“delta”) to the solution, by inverting the Jacobian matrix on (the negative of) the nonlinear residual:
$\delta u^k = -[J(u^k)]^{-1} F(u^k)$
A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
This implicitly defined mapping $F(u^k) \mapsto \delta u^k$ is a natural preconditioner
Software must accommodate this!
Physics-based preconditioning
We consider a standard “dynamical core,” the shallow-water wave splitting algorithm, as a solver
It leaves a first-order-in-time splitting error
In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can instead be regarded as a preconditioner
The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations
Example: shallow water equations
Continuity (*): $\partial\phi/\partial t + \partial(u\phi)/\partial x = 0$
Momentum (**): $\partial(u\phi)/\partial t + \partial(u^2\phi)/\partial x + g\,\partial(\phi^2/2)/\partial x = 0$
These equations admit a fast gravity wave, as can be seen by cross-differentiating, e.g., (*) by t and (**) by x, and subtracting:
$\partial^2\phi/\partial t^2 - g\phi\,\partial^2\phi/\partial x^2 = \text{(other terms)}$
1D shallow water equations, cont.
Wave equation for geopotential: $\partial^2\phi/\partial t^2 - g\phi\,\partial^2\phi/\partial x^2 = \text{(other terms)}$
Gravity wave speed: $\sqrt{g\phi}$
Typically $u \ll \sqrt{g\phi}$, but stability restrictions would require timesteps based on the Courant-Friedrichs-Lewy (CFL) criterion for the fastest wave, for an explicit method
One can solve fully implicitly, or one can filter out the gravity wave by solving semi-implicitly
1D shallow water equations, cont.
Continuity (*): $(\phi^{n+1} - \phi^n)/\Delta t + \partial(u\phi)^{n+1}/\partial x = 0$
Momentum (**): $((u\phi)^{n+1} - (u\phi)^n)/\Delta t + \partial(u^2\phi)^n/\partial x + g\,\phi^n\,\partial\phi^{n+1}/\partial x = 0$
Solving (**) for $(u\phi)^{n+1}$ and substituting into (*),
$\phi^{n+1} - (\Delta t)^2\,\frac{\partial}{\partial x}\Bigl(g\phi^n\,\frac{\partial\phi^{n+1}}{\partial x}\Bigr) = \phi^n - \Delta t\,\frac{\partial S^n}{\partial x}$, where $S^n = (u\phi)^n - \Delta t\,\partial(u^2\phi)^n/\partial x$
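A NumPy sketch of one such semi-implicit step (periodic grid, centered differences; the initial fields and step size are illustrative):

import numpy as np

N, L, g = 64, 1.0, 9.81
dx = L / N
dt = 1e-2                 # ~2x the explicit gravity-wave CFL limit dx/sqrt(g*phi)
x = np.linspace(0.0, L, N, endpoint=False)

# Periodic centered-difference first-derivative matrix, D ~ d/dx
D = (np.roll(np.eye(N), 1, axis=1) - np.roll(np.eye(N), -1, axis=1)) / (2*dx)

phi = 1.0 + 0.1*np.sin(2*np.pi*x/L)   # geopotential (illustrative initial state)
uphi = np.zeros(N)                    # momentum u*phi, initially at rest

def semi_implicit_step(phi, uphi):
    u = uphi / phi
    S = uphi - dt * (D @ (u*uphi))            # S^n = (u phi)^n - dt d(u^2 phi)^n/dx
    A = np.eye(N) - dt**2 * (D @ (g*phi[:, None] * D))  # I - dt^2 d/dx(g phi^n d/dx)
    phi_new = np.linalg.solve(A, phi - dt*(D @ S))      # scalar parabolic solve
    uphi_new = S - dt * g * phi * (D @ phi_new)         # scalar explicit update
    return phi_new, uphi_new

for _ in range(50):
    phi, uphi = semi_implicit_step(phi, uphi)
print(phi.mean())         # mean geopotential (mass) is preserved by the scheme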
1D shallow water equations, cont.
After the parabolic equation is spatially discretized and solved for $\phi^{n+1}$, then $(u\phi)^{n+1}$ can be found from
$(u\phi)^{n+1} = S^n - \Delta t\,g\phi^n\,\partial\phi^{n+1}/\partial x$
One scalar parabolic solve and one scalar explicit update replace an implicit hyperbolic system
This semi-implicit operator splitting is foundational to multiple-scales problems in geophysical modeling
Similar tricks are employed in aerodynamics (sound waves), MHD (multiple Alfvén waves), reacting flows (fast kinetics), etc.
Temporal truncation error remains due to the lagging of the advection terms $\phi^n$ and $(u\phi)^n$ in (**): to be dealt with shortly
1D shallow water preconditioning
Define a continuity residual for each timestep:
$R_\phi \equiv (\phi^{n+1} - \phi^n)/\Delta t + \partial(u\phi)^{n+1}/\partial x$
Define a momentum residual for each timestep:
$R_{u\phi} \equiv ((u\phi)^{n+1} - (u\phi)^n)/\Delta t + \partial(u^2\phi)^n/\partial x + g\,\phi^n\,\partial\phi^{n+1}/\partial x$
Continuity delta form (*): $\delta\phi/\Delta t + \partial(\delta(u\phi))/\partial x = -R_\phi$
Momentum delta form (**): $\delta(u\phi)/\Delta t + g\,\phi^n\,\partial(\delta\phi)/\partial x = -R_{u\phi}$
1D shallow water preconditioning, cont.
Solving (**) for $\delta(u\phi)$ and substituting into (*),
$\delta\phi - (\Delta t)^2\,\frac{\partial}{\partial x}\Bigl(g\phi^n\,\frac{\partial(\delta\phi)}{\partial x}\Bigr) = -\Delta t\,R_\phi + (\Delta t)^2\,\frac{\partial R_{u\phi}}{\partial x}$
After this parabolic equation is solved for $\delta\phi$, we have
$\delta(u\phi) = -\Delta t\,\bigl[g\phi^n\,\partial(\delta\phi)/\partial x + R_{u\phi}\bigr]$
This completes the application of the preconditioner for one Newton-Krylov iteration at one timestep
Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
See the paper by Mousseau et al. (2002) for impressive results for long-time weather integration
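A NumPy sketch of this preconditioner application, companion to the time-step sketch above (same illustrative discretization):

import numpy as np

N, L, g, dt = 64, 1.0, 9.81, 1e-2
dx = L / N
D = (np.roll(np.eye(N), 1, axis=1) - np.roll(np.eye(N), -1, axis=1)) / (2*dx)

def apply_preconditioner(R_phi, R_uphi, phi_n):
    # Map residuals (R_phi, R_uphi) to corrections (d_phi, d_uphi):
    # solve  d_phi - dt^2 d/dx(g phi^n d(d_phi)/dx) = -dt R_phi + dt^2 dR_uphi/dx,
    # then   d_uphi = -dt (g phi^n d(d_phi)/dx + R_uphi)
    A = np.eye(N) - dt**2 * (D @ (g*phi_n[:, None] * D))
    rhs = -dt * R_phi + dt**2 * (D @ R_uphi)
    d_phi = np.linalg.solve(A, rhs)       # exact here; one multigrid sweep suffices
    d_uphi = -dt * (g * phi_n * (D @ d_phi) + R_uphi)
    return d_phi, d_uphi

# Inside a JFNK loop, this map serves as M^{-1} applied to the stacked residual.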
Physics-based preconditioning update
So far, physics-based preconditioning has been applied to several codes at Los Alamos, in an effort led by D. Knoll
Summarized in a new J. Comp. Phys. paper by Knoll & Keyes (Jan 2004)
PETSc’s “shell preconditioner” is designed for inserting physics-based preconditioners, and PETSc’s solvers underneath are building blocks
Nonlinear Schwarz preconditioning
Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces $F(u) = 0$ with a new nonlinear system possessing the same root, $\mathcal{F}(u) = 0$
Define a correction $\delta_i(u)$ to the $i$th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
$R_i\,F(u + \delta_i(u)) = 0$
where $\delta_i(u) \in \mathbb{R}^n$ is nonzero only in the components of the $i$th partition
Then sum the corrections: $\mathcal{F}(u) \equiv \sum_i \delta_i(u)$ to get an implicit function of $u$
Nonlinear Schwarz – picture
[Figure: a 1D grid of unknowns u and residuals F(u); the restriction R_i selects a contiguous block of components (marked 1, others 0), giving subdomain unknowns R_i u and residuals R_i F]
Nonlinear Schwarz – picture
[Figure: two restrictions R_i and R_j select neighboring blocks of u and F(u), giving R_i u, R_i F and R_j u, R_j F]
Nonlinear Schwarz – picture
[Figure: each local problem F_i'(u_i) is solved for its correction; the summed corrections δ_i u + δ_j u form the new global residual]
Nonlinear Schwarz, cont.
It is simple to prove that if the Jacobian of $F(u)$ is nonsingular in a neighborhood of the desired root, then $\mathcal{F}(u) = 0$ and $F(u) = 0$ have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any $u, v \in \mathbb{R}^n$:
the residual $\mathcal{F}(u) = \sum_i \delta_i(u)$
the Jacobian-vector product $\mathcal{F}'(u)\,v$
Remarkably (Cai-Keyes, 2000), it can be shown that
$\mathcal{F}'(u)\,v \approx \sum_i R_i^T J_i^{-1} R_i\,J\,v$
where $J = F'(u)$ and $J_i = R_i J R_i^T$
All required actions are available in terms of $F(u)$!
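A toy evaluation of the nonlinear Schwarz residual $\mathcal{F}(u) = \sum_i \delta_i(u)$, with SciPy’s root-finder standing in for the local Newton solves (the 1D test problem and the partitions are invented):

import numpy as np
from scipy.optimize import fsolve

def F(u):
    # Toy global nonlinear residual: a 1D nonlinear Poisson-like system
    up = np.pad(u, 1)                        # zero Dirichlet boundaries
    return 2*u - up[:-2] - up[2:] + 0.1*u**3 - 1.0

def schwarz_residual(u, subdomains):
    # Evaluate script-F(u) = sum_i delta_i(u), where each delta_i solves
    # the local system R_i F(u + delta_i) = 0, supported on subdomain i
    out = np.zeros_like(u)
    for idx in subdomains:
        def local(d_loc):
            d = np.zeros_like(u)
            d[idx] = d_loc
            return F(u + d)[idx]             # R_i F(u + delta_i)
        out[idx] += fsolve(local, np.zeros(len(idx)))
    return out

subdomains = [np.arange(0, 6), np.arange(4, 10)]   # two overlapping partitions
u = np.zeros(10)
print(schwarz_residual(u, subdomains))       # outer Newton-Krylov drives this to 0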
Experimental example of nonlinear Schwarz
Newton’s method: difficulty at critical Re, stagnation beyond critical Re
Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re
[Figure: convergence comparison across Reynolds numbers]
The 2003 SCaLeS initiative
Workshop on a Science-based Case for Large-scale Simulation
Arlington, VA
24-25 June 2003
Charge (April 2003, W. Polansky, DOE):
“Identify rich and fruitful directions for the computational sciences from the perspective of scientific and engineering applications”
Build a “strong science case for an ultra-scale computing capability for the Office of Science”
“Address major opportunities and challenges facing computational sciences in areas of strategic importance to the Office of Science”
“Report by July 30, 2003”
Volume 1:
Chapter 1. Introduction
Chapter 2. Scientific Discovery through Advanced Computing: a Successful Pilot Program
Chapter 3. Anatomy of a Large-scale Simulation
Chapter 4. Opportunities at the Scientific Horizon
Chapter 5. Enabling Mathematics and Computer Science Tools
Chapter 6. Recommendations and Discussion
Volume 2 (due out early 2004):
11 chapters on applications
8 chapters on mathematical methods
8 chapters on computer science and infrastructure
First fruits!
“There will be opened a gateway and a road to a large and excellent science, into which minds more piercing than mine shall penetrate to recesses still deeper.”
Galileo (1564-1642) (on ‘experimental mathematical analysis of nature’, appropriated here for ‘simulation science’)
Related URLs
TOPS project: http://www.tops-scidac.org
SciDAC initiative: http://www.science.doe.gov/scidac
SCaLeS report: http://www.pnl.gov/scales