
Numerical Solution of Partial Differential Equations

John A. Trangenstein¹

December 6, 2006

¹ Department of Mathematics, Duke University, Durham, NC 27708-0320 [email protected].

Contents

1 Introduction to Partial Differential Equations
    1.1 Types of Second-Order Partial Differential Equations
    1.2 Physical Examples of Diffusion
        1.2.1 Heat Flow
        1.2.2 Electrical Wave Propagation in the Heart
        1.2.3 Miscible Displacement Model
        1.2.4 Buckley-Leverett Model
        1.2.5 Thin Films
        1.2.6 Incompressible Fluid Flow
        1.2.7 Summary of Physical Examples

2 Parabolic Equations
    2.1 Theory of Linear Parabolic Equations
        2.1.1 Fourier Transform Methods
        2.1.2 Reflection and Superposition
        2.1.3 Maximum Principle
        2.1.4 Bounded Domains and Separable Solutions
    2.2 Finite Difference Methods in One Dimension
        2.2.1 Continuous-In-Time Methods
        2.2.2 Explicit Centered Differences
        2.2.3 Programs for Explicit Centered Differences
            First Explicit Centered Difference Program
            Second Explicit Centered Difference Program
            Third Explicit Centered Difference Program
            Fourth Explicit Centered Difference Program
            Fifth Explicit Centered Difference Program
        2.2.4 Implicit Centered Differences
        2.2.5 Higher-Order Temporal Discretization
    2.3 Consistency, Stability and Convergence
        2.3.1 Stability of Explicit and Implicit Centered Differences
        2.3.2 Error Analysis
        2.3.3 Stability of the Crank-Nicolson Scheme
        2.3.4 Lax Convergence Theorem
    2.4 Fourier Analysis of Finite Difference Schemes
        2.4.1 Constant Coefficient Equations and Waves
        2.4.2 Dimensionless Groups
        2.4.3 Linear Finite Differences and Diffusion
        2.4.4 Fourier Analysis of Finite Difference Schemes
        2.4.5 Lax Equivalence Theorem
    2.5 Measuring Accuracy and Efficiency
    2.6 Finite Difference Methods in Multiple Dimensions
        2.6.1 Unsplit Methods
        2.6.2 Operator Splitting Methods

3 Iterative Linear Algebra
    3.1 Relative Efficiency of Implicit Computations
    3.2 Neumann Series
    3.3 Perron-Frobenius Theorem
    3.4 M Matrices
    3.5 Iterative Improvement
        3.5.1 Richardson’s Iteration
        3.5.2 Jacobi Iteration
        3.5.3 Gauss-Seidel Iteration
        3.5.4 Successive Over-Relaxation
        3.5.5 Termination Criteria for Iterative Methods
    3.6 Incomplete Factorization
        3.6.1 Block Factorization of Block Tridiagonal Matrices
        3.6.2 Approximate Factorization
        3.6.3 Banded Part of an Inverse
        3.6.4 Approximate Factorization and Iterative Improvement
        3.6.5 Incomplete Factorization for Three-Dimensional Problems
        3.6.6 Incomplete Factorization Software
    3.7 Conjugate Gradients
        3.7.1 Self-Adjoint Systems
        3.7.2 Preconditioned Conjugate Gradients
    3.8 Minimum Residual Methods
        3.8.1 Orthomin
        3.8.2 GMRES
    3.9 Nonlinear Systems
        3.9.1 Newton Algorithms
        3.9.2 Nonlinear Krylov Algorithms
    3.10 Algebraic Multigrid
        3.10.1 Algebraic Multigrid Algorithm
        3.10.2 Multigrid Error
        3.10.3 Coarse Grid Correction
        3.10.4 Pre Smoothing
        3.10.5 Prolongation and Restriction
        3.10.6 Steady-State Diffusion in One Dimension
        3.10.7 Heat Equation in One Dimension
        3.10.8 2D Laplace Equation
        3.10.9 Work Estimates
        3.10.10 Multigrid Debugging Techniques

4 Finite Element Methods
    4.1 Galerkin Examples
        4.1.1 Weak Formulation
        4.1.2 Galerkin Methods
    4.2 Finite Element Examples
        4.2.1 Finite Elements in 1D
        4.2.2 Triangular Finite Elements
    4.3 Elliptic Equations
        4.3.1 Multi-Indices
        4.3.2 Elliptic Differential Operators
        4.3.3 Dirichlet Problems for Elliptic Operators
    4.4 Elliptic Regularity
        4.4.1 Function Norms
        4.4.2 Function Spaces
        4.4.3 Sobolev Spaces
        4.4.4 Sobolev Imbedding Theorems
        4.4.5 Hilbert Scales and Fractional-Order Sobolev Spaces
        4.4.6 Sobolev Trace Theorems
    4.5 Elliptic Differential Operators
        4.5.1 Garding’s Inequality
        4.5.2 Existence and Uniqueness for Weak Problems
        4.5.3 Higher-Order Dependence on the Data
    4.6 Piecewise Polynomial Approximation
        4.6.1 Bramble-Hilbert Lemma
        4.6.2 Polynomial Approximation on Star-Shaped Domains
    4.7 Galerkin Methods
        4.7.1 Well-Posedness of Galerkin Equations
        4.7.2 Approximation Assumption
        4.7.3 H^m Error Estimates
        4.7.4 Convergence for Rough Problems
        4.7.5 H^0 Error Estimates
        4.7.6 Non-Positive Weak Forms

5 Finite Element Implementations
    5.1 Computational Issues
    5.2 Finite Element Assumptions
    5.3 Finite Elements in 1D
        5.3.1 Essential and Natural Boundary Conditions
        5.3.2 Higher Order Finite Elements
        5.3.3 Hierarchical Elements
    5.4 Approximation Theory
        5.4.1 Piecewise Linear Approximation
        5.4.2 Continuous Piecewise Quadratic Approximation
        5.4.3 Higher Order Piecewise Polynomial Approximation
        5.4.4 Barycentric Coordinates
        5.4.5 Coordinate Transformations
    5.5 Triangular Finite Elements
        5.5.1 Quadrature Rules for Triangles
        5.5.2 Linear Lagrange Element
        5.5.3 Quadratic Lagrange Element
        5.5.4 Cubic Lagrange Element
        5.5.5 General Lagrange Elements
        5.5.6 Hermite Cubics
        5.5.7 Hierarchical Triangular Elements
    5.6 Quadrilateral Finite Elements
        5.6.1 Quadrature Rules for Quadrilaterals
        5.6.2 Tensor Product Linears: V0 1
        5.6.3 Biquadratics
        5.6.4 Bicubics
        5.6.5 Hermite Bicubics
        5.6.6 Serendipity Quadratics
        5.6.7 Hierarchical Quadrilateral Elements
    5.7 Three-Dimensional Finite Elements
        5.7.1 3D Rectangular Finite Elements
        5.7.2 3D Tetrahedral Finite Elements
        5.7.3 3D Wedge Elements
    5.8 Error Estimates for Linear Elements
    5.9 Condition Number Estimates for Linear Elements
    5.10 Interpolation Error
        5.10.1 Low-Order Polynomials on Triangles
        5.10.2 General Interpolation Errors
    5.11 Numerical Quadrature
    5.12 Interpolated Boundary Conditions
    5.13 Finite Elements and Multigrid
        5.13.1 Full Multigrid
        5.13.2 Weak Formulation of the Problem
        5.13.3 Linear Operators
        5.13.4 Smoothers
        5.13.5 Multigrid Error
        5.13.6 Matrix-Vector Forms
    5.14 Mixed and Hybrid Finite Elements
        5.14.1 Review of Quadratic Programming
        5.14.2 Mixed Formulation of Physical Problems
    5.15 Mixed and Hybrid Finite Elements
        5.15.1 Lowest Order Mixed Finite Element Method
        5.15.2 Hybrid Mixed Finite Element Method
        5.15.3 Mathematical Formulation of the Hybrid Mixed Finite Element Method
        5.15.4 Positive-Definiteness of the Linear System
        5.15.5 Numerical Implementation of the Hybrid Mixed Finite Element Method
        5.15.6 Comments on the Hybrid Mixed Finite Element Method
        5.15.7 Constrained Minimization and Lagrangians
        5.15.8 Well-Posedness of the Weak Mixed Problem
        5.15.9 Galerkin Approximations for the Mixed Problem
        5.15.10 Raviart-Thomas Spaces
        5.15.11 Lowest Order Mixed Finite Element Method
        5.15.12 Hybrid Mixed Finite Element Method
        5.15.13 Mathematical Formulation of the Hybrid Mixed Finite Element Method
        5.15.14 Positive-Definiteness of the Linear System
        5.15.15 Convergence Estimates for the Hybrid Mixed Finite Element Method
        5.15.16 Numerical Implementation of the Hybrid Mixed Finite Element Method
        5.15.17 Comments on the Hybrid Mixed Finite Element Method

6 Finite Element Methods for Parabolic Equations
    6.1 Well-Posedness of Parabolic Problems
        6.1.1 Existence and Uniqueness of Generalized Solutions
        6.1.2 Continuous Dependence on the Data
    6.2 Galerkin Methods for Parabolic Problems
        6.2.1 Continuous-Time Galerkin Method
        6.2.2 Existence of the Continuous-Time Galerkin Approximation
        6.2.3 Finite Element Approximations for Parabolic Problems
    6.3 Error Estimates for Parabolic Galerkin Methods

Index

Chapter 1

Introduction to Partial Differential Equations

1.1 Types of Second-Order Partial Differential Equations

Partial differential equations arise in a number of physical problems, such as fluid flow, heat transfer, solid mechanics and biological processes. These equations often fall into one of three types. Hyperbolic equations are most commonly associated with advection, and parabolic equations are most commonly associated with diffusion. Elliptic equations are most commonly associated with steady-states of either parabolic or hyperbolic problems.

It is reasonably straightforward to determine the type of a general second-order partial differential equation. Consider the equation

\[
\sum_{j=1}^d \sum_{i=1}^d A_{ij} \frac{\partial^2 u}{\partial x_i \partial x_j} + \sum_{i=1}^d b_i \frac{\partial u}{\partial x_i} + c u = 0 .
\]

Without loss of generality, we can assume that A is symmetric, by averaging the coefficients of the i, j and j, i derivative terms. By performing a linear coordinate transformation

\[
\xi = F x
\]

we hope to transform the equation into a simpler form. We will find a way to choose the transformation matrix F below.


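Note that

\[
\frac{\partial \xi_i}{\partial x_j} = F_{ij} ,
\]

\[
\frac{\partial u}{\partial x_i} = \sum_{j=1}^d \frac{\partial u}{\partial \xi_j} \frac{\partial \xi_j}{\partial x_i} = \sum_{j=1}^d \frac{\partial u}{\partial \xi_j} F_{ji} ,
\]

\[
\frac{\partial^2 u}{\partial x_i \partial x_j} = \sum_{\ell=1}^d \sum_{k=1}^d \frac{\partial \xi_k}{\partial x_i} \frac{\partial^2 u}{\partial \xi_k \partial \xi_\ell} \frac{\partial \xi_\ell}{\partial x_j} = \sum_{\ell=1}^d \sum_{k=1}^d F_{ki} \frac{\partial^2 u}{\partial \xi_k \partial \xi_\ell} F_{\ell j} .
\]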

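After the coordinate transformation, the differential equation takes the form

\[
\begin{aligned}
0 &= \sum_{j=1}^d \sum_{i=1}^d A_{ij} \left[ \sum_{\ell=1}^d \sum_{k=1}^d F_{ki} \frac{\partial^2 u}{\partial \xi_k \partial \xi_\ell} F_{\ell j} \right] + \sum_{i=1}^d b_i \sum_{j=1}^d \frac{\partial u}{\partial \xi_j} F_{ji} + c u \\
&= \sum_{\ell=1}^d \sum_{k=1}^d \sum_{j=1}^d \sum_{i=1}^d F_{ki} A_{ij} F_{\ell j} \frac{\partial^2 u}{\partial \xi_k \partial \xi_\ell} + \sum_{j=1}^d \left[ \sum_{i=1}^d F_{ji} b_i \right] \frac{\partial u}{\partial \xi_j} + c u .
\end{aligned}
\]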

We would like to choose the matrix F so that $D = F A F^\top$ is diagonal. Recall that we can diagonalize a symmetric matrix by means of an orthogonal change of variables. In other words, we can choose F to be an orthogonal matrix.
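The choice of F can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the function name `diagonalize_principal_part` and the sample coefficients are hypothetical and are not taken from the text. It symmetrizes the second-order coefficients by averaging, then uses the symmetric eigenvalue decomposition to build an orthogonal F for which $F A F^\top$ is diagonal.

```python
import numpy as np

def diagonalize_principal_part(coeffs):
    """Return (F, D) with F orthogonal and D = F A F^T diagonal.

    coeffs[i, j] multiplies d^2u/(dx_i dx_j); it need not be symmetric.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    # Average the (i, j) and (j, i) coefficients to obtain a symmetric A.
    A = 0.5 * (coeffs + coeffs.T)
    # eigh returns eigenvalues w and orthonormal eigenvectors as the
    # columns of Q, so that A = Q diag(w) Q^T.
    w, Q = np.linalg.eigh(A)
    F = Q.T                      # rows of F are the eigenvectors of A
    D = F @ A @ F.T              # diagonal up to rounding error
    return F, D

# Hypothetical example: u_xx + 4 u_xy + u_yy, with the mixed-derivative
# coefficient stored in a single off-diagonal entry.
coeffs = np.array([[1.0, 4.0],
                   [0.0, 1.0]])
F, D = diagonalize_principal_part(coeffs)
print(np.round(D, 12))
```

For this sample, the symmetrized matrix has eigenvalues −1 and 3, so the diagonal entries of D have mixed signs.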

If D has nonzero diagonal entries all of the same sign, the differential equation is elliptic. The canonical example of an elliptic equation is the Laplace equation $\nabla_x \cdot \nabla_x u = 0$. If D has nonzero diagonal entries with one entry of different sign from the others, then the differential equation is hyperbolic. The canonical example of a hyperbolic equation is the wave equation $\frac{\partial^2 u}{\partial t^2} - \nabla_x \cdot \nabla_x u = 0$. We will discuss simple hyperbolic equations in chapter ??, and general hyperbolic equations in chapter ??. If D has one zero diagonal entry, the equation may be parabolic. The canonical example of a parabolic equation is the heat equation $\frac{\partial u}{\partial t} - \nabla_x \cdot \nabla_x u = 0$.

Example 1.1-1: Consider the differential equation

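\[
\frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2} - \frac{\partial^2 u}{\partial x_3 \partial x_4} = 0
\]

which arises in the Khokhlov-Zabolotskaya-Kuznetsov (KZK) equation for biomedical imaging. In this case, the coefficient matrix is

\[
A = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1/2 \\
0 & 0 & -1/2 & 0
\end{bmatrix} .
\]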


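A coordinate transformation that diagonalizes A is given by

\[
F = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1/\sqrt{2} & 1/\sqrt{2} \\
0 & 0 & -1/\sqrt{2} & 1/\sqrt{2}
\end{bmatrix}
\]

and the new coefficient matrix is

\[
D = F A F^\top = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & -1/2 & 0 \\
0 & 0 & 0 & 1/2
\end{bmatrix} .
\]

Since D has nonzero diagonal entries with exactly one entry of different sign from the others, we see that the KZK equation is hyperbolic.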
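The sign-counting rule can also be checked mechanically. The sketch below assumes NumPy; the function `classify_second_order` and its tolerance are hypothetical helpers, not part of the text. It relies on the fact that, for orthogonal F, the diagonal entries of D are the eigenvalues of the symmetric matrix A.

```python
import numpy as np

def classify_second_order(A, tol=1e-12):
    """Classify by the signs of the eigenvalues of the symmetric matrix A."""
    w = np.linalg.eigvalsh(np.asarray(A, dtype=float))
    pos = int(np.sum(w > tol))
    neg = int(np.sum(w < -tol))
    zero = len(w) - pos - neg
    if zero == 0 and (pos == 0 or neg == 0):
        return "elliptic"            # all nonzero, all of one sign
    if zero == 0 and min(pos, neg) == 1:
        return "hyperbolic"          # all nonzero, one of different sign
    if zero == 1:
        return "possibly parabolic"  # one zero entry
    return "unclassified"

# Coefficient matrix of the KZK example.
A_kzk = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, -0.5],
                  [0.0, 0.0, -0.5, 0.0]])
print(classify_second_order(A_kzk))  # prints "hyperbolic"
```

The label "possibly parabolic" mirrors the hedged wording of the text: a single zero diagonal entry is only a necessary indication, and the lower-order terms must still be examined.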

1.2 Physical Examples of Diffusion

1.2.1 Heat Flow

The simplest model of heat flow is based on three principles: conservation of energy, Fourier’s law of cooling and a constitutive law. The simplest constitutive law assumes that the total energy per volume e satisfies

\[
e = \rho c (T - T_0) ,
\]

where ρ is the density, c is the heat capacity, T is the temperature and $T_0$ is a constant reference temperature. We assume that other forms of energy, such as kinetic energy, are negligible. Fourier’s Law of Cooling states that the flux of the energy is proportional to the temperature gradient,

\[
f = -k \nabla_x T ,
\]

with proportionality factor k called the thermal conductivity. Finally, conservation of energy gives us the conservation law

\[
\frac{\partial e}{\partial t} + \nabla_x \cdot f = 0 .
\]

When we substitute the expressions for the energy and the energy flux into the conservation law, we obtain the partial differential equation

\[
\frac{\partial \rho c T}{\partial t} + \nabla_x \cdot (-k \nabla_x T) = 0 .
\]


For water, we have c ≈ 4 × 10^3 Joules per kilogram per degree Celsius, k ≈ 0.6 Joules per meter per second per degree Celsius, and ρ ≈ 10^3 kilograms per cubic meter. For ice, we have c ≈ 2 × 10^3 Joules per kilogram per degree Celsius, k ≈ 2 Joules per meter per second per degree Celsius, and ρ ≈ 10^3 kilograms per cubic meter.
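When ρ, c and k are constant, dividing the partial differential equation above by ρc leaves the single coefficient k/(ρc), usually called the thermal diffusivity. As a rough illustration, the sketch below evaluates it from the approximate constants just quoted. The length scale `L` is a hypothetical value chosen only for the example.

```python
# Approximate constants quoted in the text (SI units).
materials = {
    "water": {"c": 4.0e3, "k": 0.6, "rho": 1.0e3},
    "ice":   {"c": 2.0e3, "k": 2.0, "rho": 1.0e3},
}

def diffusivity(c, k, rho):
    """Thermal diffusivity k / (rho c) in square meters per second."""
    return k / (rho * c)

L = 0.1  # hypothetical length scale in meters
for name, props in materials.items():
    kappa = diffusivity(**props)
    # L^2 / kappa is the time scale for heat to diffuse over a distance L.
    print(f"{name}: diffusivity {kappa:.2e} m^2/s, "
          f"diffusion time {L**2 / kappa:.2e} s")
```

With these rounded constants, heat diffuses several times faster in ice than in water.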

In order to specify a unique solution to the problem, we must provide additional information. Initially, let us consider a one-dimensional material. The initial temperature distribution is assumed to be given throughout the problem domain:

\[
T(x, 0) = T_0(x) , \quad \forall x \in \Omega .
\]

In one dimension, we could provide boundary conditions in a variety of ways. For example, we could specify the energy flux at the left-hand side

\[
-k \frac{\partial T}{\partial x}(0, t) = f_0(t) ,
\]

and the temperature at the right-hand side

\[
T(L, t) = T_L(t) .
\]

These boundary conditions are intended as examples; other boundary conditions are possible. For boundary conditions in multiple dimensions, we might consider a region Ω with specified heat flux:

\[
\forall x \in \partial\Omega \ \forall t > 0 , \quad -k\, n \cdot \nabla_x T(x, t) = f(x, t)
\]

where n is the unit outer normal to ∂Ω at x. Alternatively, we might consider a material contained in a heat bath at specified temperature $T_\partial$:

\[
\forall x \in \partial\Omega \ \forall t > 0 , \quad T(x, t) = T_\partial .
\]

When the density ρ, heat capacity c and the thermal conductivity k are constants, we can use these constants to non-dimensionalize the problem. If L is some characteristic length of the problem domain Ω, let the dimensionless distance and dimensionless time be (respectively)

\[
\xi = \frac{x}{L} , \quad \tau = \frac{k t}{\rho c L^2} .
\]

As a function of dimensionless variables, the temperature will be

\[
T(\xi, \tau) = T(\xi L, \tau \rho c L^2 / k) .
\]

Then it is easy to see that conservation of energy can be written

\[
\frac{\partial T}{\partial \tau} = \nabla_\xi \cdot \nabla_\xi T , \quad \forall\, \xi L \in \Omega , \ \forall\, 0 < \tau . \tag{1.2-1}
\]


This dimensionless equation is commonly called the heat equation. Note that the initial and boundary conditions are also easy to write in dimensionless variables.
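For readers who want the intermediate step, the change of variables can be spelled out with the chain rule. This is a short worked derivation under the stated assumption that ρ, c and k are constants; it uses only the definitions of ξ and τ given above.

```latex
% Chain rule for the dimensionless variables xi = x / L and tau = k t / (rho c L^2):
\[
\frac{\partial T}{\partial t} = \frac{k}{\rho c L^2} \frac{\partial T}{\partial \tau} ,
\qquad
\nabla_x = \frac{1}{L} \nabla_\xi .
\]
% Substituting into rho c T_t - div_x (k grad_x T) = 0 with constant coefficients:
\[
\rho c \frac{\partial T}{\partial t} - \nabla_x \cdot (k \nabla_x T)
= \frac{k}{L^2} \left[ \frac{\partial T}{\partial \tau} - \nabla_\xi \cdot \nabla_\xi T \right] = 0 .
\]
```

Dividing by the nonzero factor $k / L^2$ gives the dimensionless form.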

As presented here, the heat equation is a purely parabolic equation. Solutions of the heat equation are instantly smooth, in spite of initial discontinuities. The solutions also tend toward a steady-state, unless forced by time-dependent boundary conditions.
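Both properties can be observed in a small numerical experiment. The sketch below, assuming NumPy, applies explicit centered differences to the one-dimensional dimensionless heat equation with a discontinuous initial temperature and fixed end temperatures. The grid size, time step and function name `explicit_heat` are hypothetical choices for illustration and are not the programs developed later in the book.

```python
import numpy as np

def explicit_heat(u0, dx, dt, steps):
    """Explicit centered differences for u_t = u_xx with fixed end values."""
    u = np.array(u0, dtype=float)
    r = dt / dx**2          # the scheme is stable when r <= 1/2
    for _ in range(steps):
        # The right-hand side is evaluated from the old values before
        # the interior points are updated in place.
        u[1:-1] += r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return u

x = np.linspace(0.0, 1.0, 51)
dx = x[1] - x[0]
dt = 0.4 * dx**2
u0 = np.where(x < 0.5, 1.0, 0.0)      # step discontinuity at x = 1/2

u_one_step = explicit_heat(u0, dx, dt, steps=1)
u_late = explicit_heat(u0, dx, dt, steps=int(round(1.0 / dt)))

print("largest jump initially:     ", np.abs(np.diff(u0)).max())
print("largest jump after one step:", np.abs(np.diff(u_one_step)).max())
print("distance from steady state: ", np.abs(u_late - (1.0 - x)).max())
```

With end values 1 and 0, the steady state is the straight line 1 − ξ, and the late-time solution should be close to it.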

Exercises 1.2.1

1. Show that the heat equation is parabolic (see the definition of parabolic in the introduction).

2. Show how to write the initial and boundary conditions for heat conservation in terms of dimensionless variables.

3. For heat transfer in mixtures of water and ice, the latent heat of fusion must be considered. In this case, the temperature T is related to the energy per volume e by

\[
T = \begin{cases}
0.0005\, e/\rho , & e/\rho < 0 \text{ Joules / kilogram} \\
0.00025\, (e/\rho - 334) , & e/\rho > 334 \text{ Joules / kilogram} \\
0 , & 0 < e/\rho < 334 \text{ Joules / kilogram}
\end{cases}
\]

Write down the equations describing the temperature distribution in an insulated glass of water.

1.2.2 Electrical Wave Propagation in the Heart

The FitzHugh-Nagumo model [55] is designed to represent the qualitative features of electrical wave propagation in the heart:

\[
\frac{\partial}{\partial t} \begin{bmatrix} v \\ r \end{bmatrix} = \begin{bmatrix} \nabla_x \cdot (D \nabla_x v) + f(v, r) \\ g(v, r) \end{bmatrix} \quad \forall x \in \Omega \ \forall t > 0 ,
\]

\[
\begin{bmatrix} v \\ r \end{bmatrix}(x, 0) = \begin{bmatrix} v_0 \\ r_0 \end{bmatrix}(x) \quad \forall x \in \Omega ,
\]

\[
n \cdot D \nabla_x v(x, t) = 0 \quad \forall x \in \partial\Omega \ \forall t > 0 .
\]

Here v represents the electrical potential, and r is a recovery variable that represents the action of cell membranes in controlling the flow of ions. The functions f and g describe the local kinetics:

\[
f(v, r) = H v (v - V_0)(V_m - v) - r ,
\]

\[
g(v, r) = a v - b r .
\]

In this model H, $V_m$, $V_0$, a and b are constants.

This model is a simple reaction-diffusion system. The reactions are represented by the functions f and g. Typically f is a fast reaction, meaning that $\partial f / \partial v$ is large. This fast reaction tends to drive v rapidly to steady-state values determined by r, while the diffusion tends to spread discontinuities. Under appropriate circumstances, the combined effect produces traveling waves [44, 55, 64].
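The local kinetics can be explored without the diffusion term. The sketch below, assuming NumPy, integrates dv/dt = f(v, r), dr/dt = g(v, r) by forward Euler. All parameter values, the initial state and the step size are hypothetical and chosen only for illustration. The text does not supply numerical values.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
H, V0, Vm, a, b = 5.0, 0.1, 1.0, 0.2, 0.1

def f(v, r):
    """Fast reaction term for the potential v."""
    return H * v * (v - V0) * (Vm - v) - r

def g(v, r):
    """Slow reaction term for the recovery variable r."""
    return a * v - b * r

def integrate_kinetics(v, r, dt=0.01, steps=10000):
    """Forward Euler for dv/dt = f, dr/dt = g (the kinetics with D = 0)."""
    history = np.empty((steps + 1, 2))
    history[0] = v, r
    for n in range(steps):
        # Both right-hand sides use the old (v, r) pair.
        v, r = v + dt * f(v, r), r + dt * g(v, r)
        history[n + 1] = v, r
    return history

# Start with the potential above the threshold V0, so that f > 0 initially.
history = integrate_kinetics(v=0.3, r=0.0)
print("largest v:", history[:, 0].max(), " largest r:", history[:, 1].max())
```

Plotting the two columns of `history` against time is a simple way to see how v responds quickly while r evolves more slowly.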

Exercises 1.2.2


1. Find the stationary points of the FitzHugh-Nagumo reactions f and g. Which of these are stable?

2. Suppose that D = 0, $0 < V_0 < \frac{1}{2} V_m$ and $H < 4a / [b (V_m - V_0)^2]$. Show that the FitzHugh-Nagumo equations have only one steady-state, namely v = 0 and r = 0. Determine if this steady-state is stable.

3. Under the assumptions of the previous exercise, show that orbits of the FitzHugh-Nagumo model are bounded. (Hint: show that for large values of v or r the reaction terms push the solution toward smaller values.)

4. Put the FitzHugh-Nagumo equations into dimensionless form in the following way. Assume that the problem length is L, and define dimensionless distance to be x/L. If the speed of a representative traveling wave is s, define dimensionless time to be ts/L. Let the dimensionless electrical potential be $v / V_m$, and the dimensionless recovery variable be $r L / (s V_m)$. Find the dimensionless form of the FitzHugh-Nagumo equations.

1.2.3 Miscible Displacement Model

The miscible displacement model [31] describes the flow in a porous medium of a fluid consisting of a single incompressible phase but multiple chemical components. This problem occurs in modeling the flow of water-soluble contaminants in aquifers, and of solvent-enhanced recovery of oil. For simplicity, we will assume that the fluid is composed of two components: water and a tracer. It is assumed that the tracer is inert; in other words, there are no chemical reactions that would transform the water and tracer into other chemicals. Further, the tracer is transported entirely with the water, and does not adsorb onto the surface of the porous medium.

We will denote the concentration of the tracer by c; by definition, c is the mass of tracer divided by the total mass in the fluid occupying some region in space. It follows that the concentration of water is 1 − c. Because the tracer concentration can vary, the fluid density ρ can vary; we will write ρ(c). Similarly, the fluid viscosity is µ(c).

The fluid moves through tiny holes in the rock. The ratio of the volume of these holes to the total rock volume is called the porosity. Thus porosity is dimensionless. Since the rock is incompressible, porosity is independent of time, but may vary in space. Thus we denote the porosity by φ(x).

The holes must be connected for the fluid to move through the rock. This is measured by the permeability of the rock. Typically the permeability is independent of time, but varies in space. Thus we denote permeability by K(x). It turns out that permeability has units of area, and should be a symmetric, non-negative matrix. Neither the permeability nor the porosity need be continuous functions.

The velocity of the fluid is typically modeled by Darcy’s law. This takes the form

\[
v = K \left[ -\nabla_x p + g \rho \right] \frac{1}{\mu} , \tag{1.2-2}
\]

where p is the pressure in the fluid and g is the acceleration due to gravity.


The flux of the components is represented in two parts. One is due to the macro-scale flow from Darcy’s law; this part of the flux takes the form cv for the tracer, and (1 − c)v for water. The second part of the flux represents smaller scale convective mixing of the components as they flow through irregular pore channels, and molecular diffusion. Typically, this part of the flux is represented by Fick’s law. The resulting flux for the tracer is

\[
f_c = c v - \left[ v \frac{\alpha_\ell}{\|v\|} v^\top + (I v^\top v - v v^\top) \frac{\alpha_t}{\|v\|} + I \frac{\phi \delta_c}{\tau} \right] \nabla_x c .
\]

Here, $\alpha_\ell$ and $\alpha_t$ are the longitudinal and transverse mixing lengths, $\delta_c$ is the diffusivity of the tracer and τ is the tortuosity of the rock. The equation for the flux of water is similar: we replace c by 1 − c to get

\[
f_w = (1 - c) v + \left[ v \frac{\alpha_\ell}{\|v\|} v^\top + (I v^\top v - v v^\top) \frac{\alpha_t}{\|v\|} + I \frac{\phi \delta_c}{\tau} \right] \nabla_x c .
\]

We must have equations representing the conservation of mass for water and the tracer. It is easy to see that the mass of tracer per bulk (rock) volume is cφ, and the mass of water per bulk volume is (1 − c)φ. Thus conservation of the tracer and water can be written

\[
\frac{\partial c \phi}{\partial t} + \nabla_x \cdot (f_c) = 0 , \tag{1.2-3a}
\]

\[
\frac{\partial (1 - c) \phi}{\partial t} + \nabla_x \cdot (f_w) = 0 . \tag{1.2-3b}
\]

If we add the tracer and water conservation equations together, we obtain

\[
0 = \nabla_x \cdot v = \nabla_x \cdot K \left[ -\nabla_x p + g \rho(c) \right] \frac{1}{\mu(c)} . \tag{1.2-4}
\]

It is interesting to note that the coefficients ρ(c) and µ(c) in this problem are functions of the variable tracer concentration. Since the tracer concentration c can change in time, so can the pressure p. This completes the description of the miscible displacement model.

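Next, let us restrict our discussion to one dimension. Since

\[
\frac{\partial v}{\partial x} = 0 ,
\]

we see that the Darcy velocity v is independent of position x. The transverse mixing length $\alpha_t$ has no effect in one dimension, so the tracer flux simplifies to

\[
f_c = c v - \left( |v| \alpha_\ell + \frac{\phi \delta_c}{\tau} \right) \frac{\partial c}{\partial x} .
\]

Given the Darcy velocity, the two mass conservation equations become redundant. It will suffice to work with conservation of the tracer:

\[
\frac{\partial c \phi}{\partial t} + \frac{\partial v c}{\partial x} = \frac{\partial}{\partial x} \left[ \left( |v| \alpha_\ell + \frac{\phi \delta_c}{\tau} \right) \frac{\partial c}{\partial x} \right] . \tag{1.2-5}
\]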

In the typical case, we specify the total fluid velocity v(t) and the tracer concentration $c(x_L, t)$ at inflow. In this case, the equations for conservation of mass of the tracer (1.2-5) and for the fluid pressure (1.2-4) decouple; the solution of the pressure equation can be determined by specifying the fluid pressure at outflow. Suppose that we are given $p(x_R, t) = p_R(t)$ at the right-hand boundary. Since the one-dimensional pressure equation says that

\[
K \left[ -\frac{\partial p}{\partial x} + g \rho \right] \frac{1}{\mu} = v
\]

is equal to the inflow fluid velocity, we have $\partial p / \partial x = g \rho - v \mu / K$. Integrating from x to $x_R$ shows that the solution of the pressure equation is

\[
p(x, t) = p_R(t) - g \int_x^{x_R} \rho(c(x', t)) \, dx' + v(t) \int_x^{x_R} \frac{\mu(c(x', t))}{K(x')} \, dx' .
\]
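The pressure can be evaluated numerically by integrating the one-dimensional Darcy relation $K(-\partial p/\partial x + g\rho)/\mu = v$ from each grid point to the right-hand boundary. The sketch below, assuming NumPy, uses the trapezoidal rule; the function name `pressure_profile` and every numerical value except the 30 centimeters per day velocity are hypothetical.

```python
import numpy as np

def pressure_profile(x, rho, mu, K, v, g, p_right):
    """Pressure on the grid x from the one-dimensional Darcy law.

    Integrates dp/dx = g*rho - v*mu/K from each x[i] to x[-1] by the
    trapezoidal rule, using the outflow pressure p_right at x[-1].
    """
    def tail_integral(values):
        segments = 0.5 * (values[1:] + values[:-1]) * np.diff(x)
        # Entry i approximates the integral of values from x[i] to x[-1].
        return np.concatenate((np.cumsum(segments[::-1])[::-1], [0.0]))

    return p_right - g * tail_integral(rho) + v * tail_integral(mu / K)

# Hypothetical uniform medium and fluid (SI units).
x = np.linspace(0.0, 100.0, 201)      # meters
rho = np.full_like(x, 1.0e3)          # kilograms per cubic meter
mu = np.full_like(x, 1.0e-3)          # Pascal seconds
K = np.full_like(x, 1.0e-12)          # square meters
v = 0.30 / 86400.0                    # 30 centimeters per day, in m/s
g = 9.81                              # gravity component along x
p = pressure_profile(x, rho, mu, K, v, g, p_right=1.0e6)

# Consistency check: Darcy's law applied to p should return v.
v_check = K * (-np.gradient(p, x) + g * rho) / mu
print(np.allclose(v_check, v))
```

In a coupled simulation the arrays `rho` and `mu` would be recomputed from the current tracer concentration before each pressure evaluation.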

Some common test problems choose the fluid velocity to be around 30 centimeters per day, the porosity to be 0.25, and the longitudinal mixing length $\alpha_\ell$ to be 0.01 times the problem length. Normally, molecular diffusion is negligible unless we are working on very fine scales (close to the scale of the rock pores):

\[
\delta_c \ll |v| \alpha_\ell \frac{\tau}{\phi} .
\]
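A quick order-of-magnitude check makes the comparison concrete. In the sketch below, the velocity, porosity and mixing-length fraction are the test-problem values quoted above, while the problem length, tortuosity and molecular diffusivity are hypothetical values assumed only for this example (the diffusivity is a typical order of magnitude for a solute in water).

```python
# Test-problem values quoted in the text.
velocity = 0.30          # meters per day (30 centimeters per day)
porosity = 0.25
mixing_fraction = 0.01   # alpha_l divided by the problem length

# Hypothetical values, not taken from the text.
length = 100.0                    # problem length in meters
tortuosity = 2.0                  # dimensionless
molecular_diffusivity = 8.64e-5   # m^2/day, roughly 1e-9 m^2/s

alpha_l = mixing_fraction * length
dispersion = abs(velocity) * alpha_l                        # |v| alpha_l
diffusion = porosity * molecular_diffusivity / tortuosity   # phi delta_c / tau

# Ratio of convective transport to total diffusive transport over the
# problem length (a Peclet number).
peclet = abs(velocity) * length / (dispersion + diffusion)

print(f"|v| alpha_l       = {dispersion:.3e} m^2/day")
print(f"phi delta_c / tau = {diffusion:.3e} m^2/day")
print(f"Peclet number     = {peclet:.1f}")
```

Under these assumptions the molecular term is several orders of magnitude smaller than the convective mixing term, and the Peclet number is close to the reciprocal of the mixing-length fraction.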

Note that the mass conservation law (1.2-5) involves convection due to the Darcy velocity, and diffusion due to the molecular diffusion and convective mixing. The diffusive terms appear on the right-hand side of the equation. The diffusion due to convective mixing is related to the fluid velocity and is not isotropic. Typically, these diffusive fluxes are very small compared to the convective fluxes. Nevertheless, the mass conservation equation is a convection-diffusion equation, and its efficient numerical solution will use techniques for both hyperbolic and parabolic equations.

Also note that the incompressibility condition on the velocity field represents the limit, as the compressibility of the fluid becomes small, of a compressible equation. We will not present that more general equation here. Instead, we will remark that the incompressibility condition is equivalent to the steady-state of the compressible parabolic equation.

The solutions of the miscible displacement problem can show either stable transport of the tracer with slight spreading due to diffusive Fickian forces, or unstable viscous fingering if the tracer reduces the fluid viscosity significantly. Heterogeneities in the permeability can produce flow channels that look like viscous fingers in space but behave quite differently in time.
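The stable case can be illustrated with a very simple discretization of the one-dimensional tracer equation (1.2-5). The sketch below, assuming NumPy, combines upwind differences for convection with centered differences for diffusion in a single explicit step. This pairing is one common elementary choice and is not claimed to be the method the book develops. The function name, the dimensionless parameter values and the boundary treatment are hypothetical.

```python
import numpy as np

def advance_tracer(c, v, phi, diff, dx, dt, c_inflow):
    """One explicit step of phi c_t + (v c)_x = (diff c_x)_x with v > 0.

    Upwind differences for convection, centered differences for diffusion;
    constant v, phi and diff = |v| alpha_l + phi delta_c / tau are assumed.
    """
    # Ghost values: prescribed concentration at inflow, zero gradient at outflow.
    padded = np.concatenate(([c_inflow], c, [c[-1]]))
    # Flux at each cell face, using the upstream (left) concentration.
    flux = v * padded[:-1] - diff * (padded[1:] - padded[:-1]) / dx
    return c - dt / (phi * dx) * (flux[1:] - flux[:-1])

# Hypothetical dimensionless test problem on a domain of unit length.
n = 100
dx = 1.0 / n
v, phi = 1.0, 0.25
diff = 0.01 * abs(v)     # mixing length 0.01 times the problem length
dt = 5.0e-4              # keeps v dt/(phi dx) + 2 diff dt/(phi dx^2) below 1

c = np.zeros(n)
for _ in range(250):     # front should travel to about x = v t / phi = 0.5
    c = advance_tracer(c, v, phi, diff, dx, dt, c_inflow=1.0)

print("concentration near inflow and outflow:", c[0], c[-1])
```

With the time step restricted as in the comment, each new value is a weighted average of neighboring old values, so the computed concentration stays between the initial and inflow values.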

Exercises 1.2.3


1. Show that for fixed tracer concentration c, equation (1.2-4) is an elliptic equation in pressure p.

2. Show that for fixed velocity v, equation (1.2-5) is a parabolic equation in tracer concentration c. If the mixing lengths and diffusivity are zero, show that this equation is hyperbolic.

1.2.4 Buckley-Leverett Model

The Buckley-Leverett model for flow of two immiscible incompressible phases in a porous medium is important to models of oil reservoirs and contaminated aquifers. In this model, we assume that the fluid consists of two distinct phases, oil and water. It is assumed that the chemicals forming these two phases do not interact or move from one phase to the other. Since the fluid is incompressible and the chemical composition of the phases is fixed, the phase densities $\rho_o$ and $\rho_w$ are constants. Since the chemical composition of each phase remains constant, the viscosities $\mu_w$ and $\mu_o$ are constant.

As in the miscible displacement model (described in section 1.2.3), the porosity φ(x) is dimensionless, and the permeability K(x) is a symmetric nonnegative matrix with units of area. Neither the permeability nor the porosity need be continuous functions.

The saturations of the phases $s_o$ and $s_w$ are the ratios of the phase volumes to the fluid volume. By definition,

\[
s_w + s_o = 1 .
\]

Typically, water is the wetting phase, meaning that it prefers to move along the surfaceof the rock pores. Oil is the non-wetting phase, and prefers to sit as disconnected dropletsin the center of cell pores, or move as ganglia when the droplets can connect. Thus thepresence of both oil and water reduces the flow of the other. This effect is often modeled bya relative permeability, which is a dimensionless modification to the total permeabilityK. Typically, relative permeability of a phase is chosen to be an empirical function ofthat phase saturation. Thus the relative permeability of oil is κro(so), and the relativepermeability of water is κrw(sw). We must have

κro(0) = 0 = κrw(0) ,

because neither phase can flow if it occupies no volume in the fluid. We must also have

κro(1) ≤ 1 and κrw(1) ≤ 1 ,

because neither phase can flow more easily than the total permeability permits. Finally, wemust have

κ′ro(so) ≥ 0 ∀so ∈ [0, 1] and κ′rw(sw) ≥ 0 ∀sw ∈ [0, 1] ,

because an increase in relative volume of a phase makes it easier for that phase to flow.

The velocities of the phases are typically modeled by Darcy's law. In the miscible displacement model, described in section 1.2.3, Darcy's law for a single phase was written in the form (1.2-2). For two-phase flow, Darcy's law is usually modified to take the forms

\[ v_o = K\left[-\nabla_x p_o + g\rho_o\right]\frac{\kappa_{ro}}{\mu_o} , \qquad v_w = K\left[-\nabla_x p_w + g\rho_w\right]\frac{\kappa_{rw}}{\mu_w} . \]

Note that for two-phase flow the pressures in the phases are not necessarily the same. Typically, the oil-phase pressure $p_o$ is related to the water-phase pressure $p_w$ by

\[ p_o = p_w + P_c(s_w) , \]

where $P_c(s_w)$ is the capillary pressure between the phases. Capillary pressure arises from the interfacial tension between the phases and the narrow flow paths available to the fluids. Typically, capillary pressure is a strictly decreasing function of water saturation, and a very large pressure is required to drive the water saturation to zero. See Figure 1.2-1 for a typical capillary pressure curve.

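It will simplify notation if we define the phase mobilities

\[ \lambda_o(s_o) \equiv \frac{\kappa_{ro}(s_o)}{\mu_o} , \qquad \lambda_w(s_w) \equiv \frac{\kappa_{rw}(s_w)}{\mu_w} . \]

Then the two-phase modification of Darcy's law can be written

\[ v_o = K\left[-\nabla_x p_o + g\rho_o\right]\lambda_o , \qquad v_w = K\left[-\nabla_x p_w + g\rho_w\right]\lambda_w . \]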

Finally, we have equations representing the conservation of mass for oil and water. It is easy to see that the mass of oil per bulk (rock) volume is $\rho_o s_o \phi$, and the mass of water per bulk volume is $\rho_w s_w \phi$. The mass flux of oil is $\rho_o v_o$, and the mass flux of water is $\rho_w v_w$. Thus conservation of oil and water can be written

\[ \frac{\partial \rho_o s_o \phi}{\partial t} + \nabla_x \cdot (\rho_o v_o) = 0 , \qquad \frac{\partial \rho_w s_w \phi}{\partial t} + \nabla_x \cdot (\rho_w v_w) = 0 . \]

This completes the description of the Buckley-Leverett model.

Next, we need to manipulate these equations into a conservation law in one dimension. We can divide the mass conservation equations by the constant phase densities to get

\[ \frac{\partial s_o \phi}{\partial t} + \frac{\partial v_o}{\partial x} = 0 , \qquad \frac{\partial s_w \phi}{\partial t} + \frac{\partial v_w}{\partial x} = 0 . \]


Figure 1.2-1: Capillary Pressure Curve (capillary pressure $P_c$ plotted against water saturation $s_w$)


Next, we can add these two equations to get

\[ \frac{\partial (v_o + v_w)}{\partial x} = 0 . \]

This is an expression of the incompressibility of the flow: the total fluid velocity has zero divergence. Given the total fluid velocity, the two mass conservation equations become redundant. It will suffice to work with conservation of oil.

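We can use the two-phase flow modification of Darcy's law to represent the total fluid velocity as a function of the water phase pressure and the oil saturation:

\[ v_T \equiv v_o + v_w = K\left[-\frac{\partial (p_w + P_c)}{\partial x}\lambda_o - \frac{\partial p_w}{\partial x}\lambda_w + g(\rho_o\lambda_o + \rho_w\lambda_w)\right] . \]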

Since the total fluid velocity is divergence-free, in one dimension it is a constant in space. Thus this equation can be viewed as providing a relationship between the gradient of the water-phase pressure and the oil saturation:

\[ -\frac{\partial p_w}{\partial x} = \frac{v_T/K - g(\rho_o\lambda_o + \rho_w\lambda_w) + \lambda_o\,\partial P_c/\partial x}{\lambda_o + \lambda_w} . \]

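This equation allows us to eliminate the pressure gradient from the expression for the Darcy velocity for oil:

\[ v_o = \left[v_T + Kg\lambda_w(\rho_o - \rho_w) - K\frac{\partial P_c}{\partial x}\lambda_w\right]\frac{\lambda_o}{\lambda_o + \lambda_w} . \]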
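The elimination of the water-phase pressure gradient is easy to get wrong by a sign, so it can be worth checking numerically. The following is a minimal sketch in Python; the numerical values are arbitrary illustrations (not physical data from the text), and all variable names are our own. It computes the phase velocities directly from the two-phase Darcy law in one dimension, forms the total velocity, and then confirms that the two formulas above recover the same pressure gradient and oil velocity.

```python
# Arbitrary illustrative values (assumptions, not data from the text).
K, g = 2.0, -9.8                 # permeability and gravity component
rho_o, rho_w = 0.8, 1.0          # phase densities
lam_o, lam_w = 0.3, 1.7          # phase mobilities at some saturation
dpw_dx, dPc_dx = -4.0, 0.25      # gradients of p_w and of P_c

# Two-phase Darcy law in one dimension, using p_o = p_w + P_c.
v_o = K * (-(dpw_dx + dPc_dx) + g * rho_o) * lam_o
v_w = K * (-dpw_dx + g * rho_w) * lam_w
v_T = v_o + v_w

# Pressure gradient recovered from the total velocity.
dpw_recovered = -(
    (v_T / K - g * (rho_o * lam_o + rho_w * lam_w) + lam_o * dPc_dx)
    / (lam_o + lam_w)
)

# Oil velocity with the pressure gradient eliminated.
v_o_elim = (
    (v_T + K * g * lam_w * (rho_o - rho_w) - K * dPc_dx * lam_w)
    * lam_o / (lam_o + lam_w)
)

print(dpw_dx, dpw_recovered)
print(v_o, v_o_elim)
```

Both printed pairs should agree to rounding error for any choice of the inputs with $\lambda_o + \lambda_w > 0$.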

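This means that conservation of oil can be written

\[ \frac{\partial s_o \phi}{\partial t} + \frac{\partial}{\partial x}\left( \left[v_T + Kg\lambda_w(\rho_o - \rho_w)\right]\frac{\lambda_o}{\lambda_o + \lambda_w} \right) = \frac{\partial}{\partial x}\left( K\frac{\lambda_o\lambda_w}{\lambda_o + \lambda_w}\frac{\partial P_c}{\partial x} \right) . \tag{1.2-6} \]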

This has the form of a conservation law. The capillary pressure term introduces a physical diffusion; this diffusion term is nonlinear. In most oil recovery problems, the capillary pressure gradient is small compared to the fluid pressure gradient.

Oil is less dense than water and more viscous. A porosity φ = 0.25 is typical. Problem lengths might be 100 meters, with a total fluid velocity vT = 30cm meters / day. In test problems, the permeability K can be defined in terms of the gravity number

\[ \gamma \equiv \frac{Kg(\rho_o - \rho_w)/\mu_w}{v_T} . \]

Some typical values for the densities and viscosities are contained in Table 1.2-1.

The Buckley-Leverett flux function is neither convex nor concave. With zero total fluid velocity, the flux function formed solely by the action of gravity is shaped like a script "V". This model is especially interesting when both the total fluid velocity and gravity are nonzero. Figure 1.2-2 shows some examples of the Buckley-Leverett flux function.

Figure 1.2-2: Buckley-Leverett Flux Function: f(s) = (vT + g(1−s)^2) s^2/(s^2 + (1−s)^2) µo/µw with µo/µw = 1. Panels: (a) vT = 1, g = 0; (b) vT = 0, g = −10; (c) vT = 1, g = −10.

fluid                density (g/cc)   viscosity (gm/sec/cm)
water                0.998            0.0114
diesel fuel          0.729            0.0062
kerosene             0.839            0.0230
Prudhoe Bay crude    0.905            0.6840

Table 1.2-1: Density and Viscosity of Fluids
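The flux function in the caption of Figure 1.2-2 can be evaluated directly to see why it is neither convex nor concave. The sketch below is only illustrative: the function names are our own, and we assume that the viscosity ratio multiplies the water term in the denominator, as the mobilities $\lambda_o = s^2/\mu_o$ and $\lambda_w = (1-s)^2/\mu_w$ would suggest. With $\mu_o/\mu_w = 1$, as in the figure, this placement does not matter.

```python
def bl_flux(s, v_T, g, visc_ratio=1.0):
    """Buckley-Leverett flux for oil saturation s in [0, 1].

    visc_ratio plays the role of mu_o/mu_w; its placement in the
    denominator is an assumption (irrelevant when it equals 1).
    """
    return (v_T + g * (1.0 - s) ** 2) * s ** 2 / (
        s ** 2 + (1.0 - s) ** 2 * visc_ratio
    )


def second_difference(f, s, h=0.1):
    """Centered second difference; its sign indicates local convexity."""
    return f(s - h) - 2.0 * f(s) + f(s + h)


def panel_a(s):
    return bl_flux(s, v_T=1.0, g=0.0)     # panel (a) of the figure


def panel_b(s):
    return bl_flux(s, v_T=0.0, g=-10.0)   # panel (b): gravity only


curvature_low = second_difference(panel_a, 0.2)    # convex near s = 0
curvature_high = second_difference(panel_a, 0.8)   # concave near s = 1

print(curvature_low, curvature_high)
print([round(panel_b(0.1 * i), 4) for i in range(11)])
```

The change of sign in the second difference is the signature of an S-shaped flux; the gravity-only values in panel (b) vanish at both ends of the saturation range and are negative in between.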

Solutions of the Buckley-Leverett equations typically exhibit sharp travelling wave fronts in oil saturation, followed by smooth variation. The sharpness of the oil front depends on the capillary pressure, and can be quite impressive in actual oil recovery.

Exercises 1.2.4

1. Show that for fixed total fluid velocity vT, equation (1.2-6) is parabolic, and hyperbolic if Pc = 0.

Note that in this case, the capillary pressure contributes a nonlinear second-order diffusion to the equation for conservation of oil. Some common models for mobility and capillary pressure are due to Corey:

\[ \lambda_o = \lambda_{om} \max\{0, s_o - s_{r,o}\}^{\alpha_o} , \]
\[ \lambda_w = \lambda_{wm} \max\{0, 1 - s_o - s_{r,w}\}^{\alpha_w} , \]
\[ P_c = c_p \max\{0, s_o - s_{r,o}\}^{\beta} . \]

Here $s_{r,o}$ is a known constant, representing the residual saturation of oil that is trapped in the pores by the wetting phase (water). Also $s_{r,w}$ is a constant, representing the irreducible saturation of water that cannot be driven out of the pores by oil under realizable pressures. The constants $\lambda_{om}$ and $\lambda_{wm}$ are the maximum mobilities of oil and water, respectively. The exponents $\alpha_o$ and $\alpha_w$ are positive constants. The exponent $\beta$ is also constant.
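As a minimal sketch of how the Corey models might be coded, the following Python functions evaluate the mobilities and capillary pressure as functions of oil saturation. The function names and all parameter values are hypothetical choices for illustration; in particular we take β > 0 here, which the text does not require.

```python
def corey_oil_mobility(s_o, lam_om, s_ro, alpha_o):
    """lambda_o = lam_om * max(0, s_o - s_ro)**alpha_o."""
    return lam_om * max(0.0, s_o - s_ro) ** alpha_o


def corey_water_mobility(s_o, lam_wm, s_rw, alpha_w):
    """lambda_w = lam_wm * max(0, 1 - s_o - s_rw)**alpha_w."""
    return lam_wm * max(0.0, 1.0 - s_o - s_rw) ** alpha_w


def corey_capillary_pressure(s_o, c_p, s_ro, beta):
    """P_c = c_p * max(0, s_o - s_ro)**beta (beta > 0 assumed here)."""
    return c_p * max(0.0, s_o - s_ro) ** beta


# Hypothetical parameter values, chosen only for illustration.
S_RO, S_RW = 0.2, 0.1


def oil_fractional_flow(s_o):
    """lambda_o / (lambda_o + lambda_w), the factor multiplying v_T in (1.2-6)."""
    lam_o = corey_oil_mobility(s_o, 1.0, S_RO, 2.0)
    lam_w = corey_water_mobility(s_o, 1.0, S_RW, 2.0)
    return lam_o / (lam_o + lam_w)


print([round(oil_fractional_flow(0.1 * i), 4) for i in range(1, 10)])
```

Oil is immobile for $s_o \le s_{r,o}$ and water is immobile for $s_o \ge 1 - s_{r,w}$, so the fractional flow rises from 0 to 1 between those two saturations.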

1.2.5 Thin Films

A model for the motion of thin liquid films on a solid surface with fluid-solid interface driven by surface tension is

\[ \frac{\partial h}{\partial t} + \nabla_x \cdot \left[ f(h)\nabla_x(\nabla_x \cdot \nabla_x h) \right] = 0 \quad \forall x \in \Omega \ \forall t > 0 . \]

Here h represents the height of the thin film and f is a factor multiplying the surface tension. In all of the models,

\[ f(h) \approx c h^{\alpha} \quad\text{as } h \to 0 . \]

As a result, the thin film model is a degenerate fourth-order diffusion equation. Solutions of the thin film equation are smooth wherever h > 0, but can develop discontinuities in the derivatives of h as h → 0. The exponent α depends on the boundary conditions at the fluid-solid interface. No-slip conditions imply that α = 3, while slip conditions imply that α < 3. This problem is computationally difficult because the diffusion is both nonlinear and fourth-order.

1.2.6 Incompressible Fluid Flow

The Navier-Stokes equations for incompressible flow are [24]

\[ \frac{\partial v}{\partial t} + (v \cdot \nabla_x)v = -\frac{1}{\rho}\nabla_x p + \nu(\nabla_x \cdot \nabla_x)v \tag{1.2-7a} \]
\[ \nabla_x \cdot v = 0 . \tag{1.2-7b} \]

Here ρ is the density of the fluid, and is assumed to be constant. Also ν is the coefficient of kinematic viscosity, which is the fluid viscosity divided by the density.

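It is common to rewrite these equations in dimensionless form. Let L be a problem length and s a problem speed; for flow past a sphere, L could be the diameter of the sphere and s could be the magnitude of the velocity at infinity. Let the dimensionless position and time be

\[ \tilde{x} = x/L , \qquad \tilde{t} = ts/L . \]

We also define the dimensionless velocity and pressure to be

\[ \tilde{v} = v/s , \qquad \tilde{p} = p/(\rho s^2) . \]

Finally, we define the Reynolds number by

\[ R = Ls/\nu . \tag{1.2-8} \]

Then the Navier-Stokes equations can be written in the dimensionless form

\[ \frac{\partial \tilde{v}}{\partial \tilde{t}} + (\tilde{v} \cdot \nabla_{\tilde{x}})\tilde{v} = -\nabla_{\tilde{x}}\tilde{p} + \frac{1}{R}(\nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}})\tilde{v} , \]
\[ \nabla_{\tilde{x}} \cdot \tilde{v} = 0 . \]

Exercises 1.2.6

1. For small Reynolds number R, it is acceptable to ignore the inertial term $(\tilde{v} \cdot \nabla_{\tilde{x}})\tilde{v}$ relative to the viscous term $\frac{1}{R}(\nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}})\tilde{v}$ in the dimensionless Navier-Stokes equations. Show that the resulting Stokes equations

\[ \frac{\partial \tilde{v}}{\partial \tilde{t}} = -\nabla_{\tilde{x}}\tilde{p} + \frac{1}{R}(\nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}})\tilde{v} \tag{1.2-9a} \]
\[ \nabla_{\tilde{x}} \cdot \tilde{v} = 0 \tag{1.2-9b} \]

are parabolic.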


1.2.7 Summary of Physical Examples

Most physical problems involve some diffusion. However, the diffusion is not always linear. Usually, the diffusion drives the solution of a system to a steady state. In some cases, diffusion can lead to travelling waves, such as in reaction-diffusion systems or in problems involving sub-diffusion (where the diffusive coefficient tends to zero for some parameters).

Chapter 2

Parabolic Equations

In this chapter we will study finite difference methods for the numerical solution of parabolic partial differential equations. These methods are simple to describe for problems on rectangular domains, but are difficult to extend to general domains or to high order. We will see that explicit time integration of these problems will lead to severe restrictions on the timestep for numerical stability, but implicit treatment will require the solution of large sparse systems of equations. This discussion will motivate the development of iterative methods for solving linear systems of equations in the next chapter. In chapter 6, we will discuss more robust discretizations based on finite element methods.

2.1 Theory of Linear Parabolic Equations

In order to begin our discussion of diffusion problems, we will consider the simplest diffusion equation, namely the heat equation $\frac{\partial u}{\partial t} = \nabla_x \cdot \nabla_x u$.

2.1.1 Fourier Transform Methods

The analytical solution of the heat equation is well-known; for additional information, see [42, 66, 82, 90].


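Lemma 2.1-1: If $u_0 \in L^2(\mathbb{R}^n)$, then the solution of the initial-value problem for the heat equation,

\[ \frac{\partial u}{\partial t} = \nabla_x \cdot \nabla_x u \quad \forall x \in \mathbb{R}^n \ \forall t > 0 , \]
\[ u(x, 0) = u_0(x) \quad \forall x \in \mathbb{R}^n , \]

is

\[ u(x, t) = \int_{\mathbb{R}^n} G_n(x - s; t)\, u_0(s)\, ds , \]

where the heat kernel is

\[ G_n(x, t) \equiv \frac{1}{(4\pi t)^{n/2}} e^{-\|x\|^2/(4t)} . \]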

Proof: We will restrict our discussion to one dimension. For simplicity, we will assume that the initial data $u_0$ has compact support (meaning that it is zero outside some bounded interval). The Fourier transform of the initial data is

\[ \hat{u}_0(\xi) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\xi x} u_0(x)\, dx , \]

and the Fourier transform of the solution of the heat equation is

\[ \hat{u}(\xi, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\xi x} u(x, t)\, dx . \]

When we apply the Fourier transform to the heat equation, we obtain

\[ \frac{\partial \hat{u}}{\partial t} = -\xi^2 \hat{u} \quad \forall \xi \in (-\infty, \infty) \ \forall t > 0 , \]

and

\[ \hat{u}(\xi, 0) = \hat{u}_0(\xi) \quad \forall \xi \in (-\infty, \infty) . \]

This gives us a continuum of initial value problems for $\hat{u}$, each an ordinary differential equation in time for fixed frequency ξ. The solution is

\[ \hat{u}(\xi, t) = e^{-\xi^2 t}\, \hat{u}_0(\xi) . \]


This equation shows that all nonzero Fourier modes decay, and modes associated with large frequencies decay faster than modes associated with small frequencies.

In order to solve this equation for u, we apply the inverse Fourier transform:

\[ u(x, t) = \int_{-\infty}^{\infty} e^{i\xi x} e^{-\xi^2 t}\, \hat{u}_0(\xi)\, d\xi = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i\xi x - \xi^2 t} \int_{-\infty}^{\infty} e^{-i\xi s} u_0(s)\, ds\, d\xi . \]

Next, we can interchange the order of integration to obtain

\[ u(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} u_0(s) \int_{-\infty}^{\infty} e^{i\xi(x - s) - \xi^2 t}\, d\xi\, ds . \]

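Completing the square gives us

\[ u(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-(x-s)^2/(4t)}\, u_0(s) \int_{-\infty}^{\infty} e^{-\left(\xi\sqrt{t} - i(x-s)/(2\sqrt{t})\right)^2}\, d\xi\, ds . \]

It is also well-known from complex analysis that the integral of an analytic function over a closed curve is zero; this implies that

\[ \int_{-\infty}^{\infty} e^{-\left(\xi\sqrt{t} - i(x-s)/(2\sqrt{t})\right)^2}\, d\xi = \int_{-\infty}^{\infty} e^{-\xi^2 t}\, d\xi . \]

Finally, we can change variables to obtain

\[ \int_{-\infty}^{\infty} e^{-\xi^2 t}\, d\xi = \frac{1}{\sqrt{t}}\int_{-\infty}^{\infty} e^{-\nu^2}\, d\nu = \frac{1}{\sqrt{t}}\sqrt{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\xi_1^2 - \xi_2^2}\, d\xi_1\, d\xi_2} \]
\[ = \frac{1}{\sqrt{t}}\sqrt{\int_0^{2\pi}\int_0^{\infty} e^{-r^2} r\, dr\, d\theta} = \frac{1}{\sqrt{t}}\sqrt{\pi\int_0^{\infty} e^{-r^2}\, 2r\, dr} = \frac{\sqrt{\pi}}{\sqrt{t}} . \]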

Thus our formula for the solution of the heat equation on an unbounded interval is

\[ u(x, t) = \frac{1}{\sqrt{4\pi t}}\int_{-\infty}^{\infty} e^{-(x-s)^2/(4t)}\, u_0(s)\, ds \equiv \int_{-\infty}^{\infty} G_1(x - s; t)\, u_0(s)\, ds = (G_1(\cdot\,; t) * u_0)(x) . \]

□

Since the heat kernel

\[ G_1(x, t) \equiv \frac{1}{\sqrt{4\pi t}}\, e^{-x^2/(4t)} \]

is smooth, this formula shows that u(x, t) is smooth for t > 0, no matter how rough the initial data $u_0$ might be. Furthermore, as t → ∞ for fixed x, we must have that u(x, t) → 0. Note that if $u_0$ is an odd function, then so is u(x, t); similarly, if $u_0$ is even, so is u.
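The convolution formula can be checked numerically against a case with a closed form. The sketch below is an illustration under our own choices: discontinuous "box" initial data equal to 1 on [−1, 1], a simple midpoint rule for the integral, and function names of our own invention. For this data the convolution reduces to a difference of error functions, which Python provides as `math.erf`.

```python
import math


def heat_kernel(x, t):
    """One-dimensional heat kernel G_1(x, t)."""
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def solve_by_convolution(u0, x, t, a, b, n=2000):
    """Midpoint-rule approximation of the integral of G_1(x - s; t) u0(s)
    over [a, b], where [a, b] is assumed to contain the support of u0."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        s = a + (i + 0.5) * h
        total += heat_kernel(x - s, t) * u0(s)
    return total * h


def box(s):
    """Rough initial data: 1 on [-1, 1], 0 elsewhere."""
    return 1.0 if abs(s) <= 1.0 else 0.0


def exact_box_solution(x, t):
    """Closed form of the convolution of G_1 with the box data."""
    r = 2.0 * math.sqrt(t)
    return 0.5 * (math.erf((x + 1.0) / r) - math.erf((x - 1.0) / r))


x, t = 0.5, 0.1
numerical = solve_by_convolution(box, x, t, -1.0, 1.0)
exact = exact_box_solution(x, t)
print(numerical, exact)
```

Even though the initial data jumps at x = ±1, the computed solution is smooth in x for any t > 0, consistent with the remark above.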


2.1.2 Reflection and Superposition

The solution of the heat equation on some semi-bounded domains can be obtained by reflection principles [82]. We will illustrate some of these ideas.

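Lemma 2.1-2: The solution of the heat equation on the half-line with homogeneous Dirichlet data,

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} , \quad \forall x > 0 \ \forall t > 0 , \]
\[ u(0, t) = 0 \ \ \forall t > 0 , \qquad u(x, 0) = u_0(x) \ \ \forall x > 0 , \]

is the same as the solution of the heat equation on the line with odd extension of the initial data:

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} , \quad \forall x \in (-\infty, \infty) \ \forall t > 0 , \]
\[ u(x, 0) = \begin{cases} u_0(x) & \forall x > 0 \\ -u_0(-x) & \forall x < 0 \end{cases} . \]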

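Lemma 2.1-3: The solution of the heat equation on the half-line with homogeneous Neumann data,

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} , \quad \forall x > 0 \ \forall t > 0 , \]
\[ \frac{\partial u}{\partial x}(0, t) = 0 \ \ \forall t > 0 , \qquad u(x, 0) = u_0(x) \ \ \forall x > 0 , \]

is the solution of the heat equation on the line with even extension of the initial data:

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} , \quad \forall x \in (-\infty, \infty) \ \forall t > 0 , \]
\[ u(x, 0) = \begin{cases} u_0(x) & \forall x > 0 \\ u_0(-x) & \forall x < 0 \end{cases} . \]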
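Combining either reflection with the heat kernel formula gives the half-line solution as an integral over s > 0 only, with kernel $G_1(x - s; t) \mp G_1(x + s; t)$ (minus for the odd extension, plus for the even one). The following sketch evaluates both; the function names, the midpoint quadrature and the choice of initial data (equal to 1 on [1, 2]) are our own assumptions for illustration.

```python
import math


def heat_kernel(x, t):
    """One-dimensional heat kernel G_1(x, t)."""
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def half_line_solution(u0, x, t, sign, a, b, n=2000):
    """Half-line solution by reflection, using the midpoint rule on [a, b].

    sign = -1 corresponds to the odd extension (Dirichlet data at x = 0),
    sign = +1 to the even extension (Neumann data at x = 0).
    u0 is assumed to vanish outside [a, b], with 0 <= a < b.
    """
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        s = a + (i + 0.5) * h
        total += (heat_kernel(x - s, t) + sign * heat_kernel(x + s, t)) * u0(s)
    return total * h


def u0(s):
    return 1.0   # restricted to [1, 2] by the integration limits below


t = 0.2
dirichlet_at_wall = half_line_solution(u0, 0.0, t, -1, 1.0, 2.0)

# Centered difference of the even-extension solution across x = 0.
eps = 1e-3
neumann_slope = (
    half_line_solution(u0, eps, t, +1, 1.0, 2.0)
    - half_line_solution(u0, -eps, t, +1, 1.0, 2.0)
) / (2.0 * eps)

print(dirichlet_at_wall, neumann_slope)
```

The odd extension forces the value at the wall to vanish, and the even extension forces the slope at the wall to vanish, which is exactly the content of the two lemmas.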


Lemma 2.1-4: The solution of the inhomogeneous heat equation on the unbounded line,

\[ \frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = f(x, t) , \quad \forall x \in (-\infty, \infty) \ \forall t > 0 , \]
\[ u(x, 0) = u_0(x) , \]

is

\[ u(x, t) = (G_1(\cdot\,; t) * u_0)(x) + \int_0^t \int_{-\infty}^{\infty} G_1(x - y; t - s)\, f(y, s)\, dy\, ds . \]

Proof: This solution is a superposition of the solution of the homogeneous heat equation with inhomogeneous initial data $u_0$, and the solution of the inhomogeneous heat equation with homogeneous initial data. □

Lemma 2.1-5: If u is the solution of the heat equation on the half-line with inhomogeneous Dirichlet data,

\[ \frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = f(x, t) , \quad \forall x > 0 \ \forall t > 0 , \]
\[ u(0, t) = \nu_0(t) \ \ \forall t > 0 , \qquad u(x, 0) = u_0(x) \ \ \forall x > 0 , \]

then $u(x, t) - \nu_0(t)$ solves the inhomogeneous heat equation with source $f(x, t) - \nu_0'(t)$, initial data $u_0(x) - \nu_0(0)$ and homogeneous Dirichlet data at x = 0.

Lemma 2.1-6: If u is the solution of the heat equation on the half-line with inhomogeneous Neumann data,

\[ \frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = f(x, t) , \quad \forall x > 0 \ \forall t > 0 , \]
\[ \frac{\partial u}{\partial x}(0, t) = \nu_0(t) \ \ \forall t > 0 , \qquad u(x, 0) = u_0(x) \ \ \forall x > 0 , \]

then $u(x, t) - x\nu_0(t)$ is the solution of the heat equation on the half-line with source $f(x, t) - x\nu_0'(t)$, initial data $u_0(x) - x\nu_0(0)$ and homogeneous Neumann data.

2.1.3 Maximum Principle

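It is easy to show that the solution of the heat equation on an unbounded domain

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} , \quad \forall x \in (-\infty, \infty) \]

achieves its maximum at t = 0. At a local maximum, the gradient of u would satisfy

\[ \begin{bmatrix} \partial u/\partial t \\ \partial u/\partial x \end{bmatrix} = 0 . \]

Furthermore, the matrix of second derivatives would be negative definite; this would imply that $\frac{\partial^2 u}{\partial x^2} < 0$. Since $\frac{\partial u}{\partial t} = 0$ and $\frac{\partial^2 u}{\partial x^2} < 0$ at an interior maximum, the heat equation could not be satisfied there.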

A stronger form of the maximum principle is also known on bounded domains [41, 42]:

Theorem (Maximum Principle) 2.1-7: If u(x, t) satisfies the homogeneous heat equation for 0 ≤ x ≤ 1 and 0 ≤ t ≤ T, then the maximum of u occurs either at t = 0 or at x = 0 or at x = 1.

By replacing u with −u, we see that the minimum of the solution of the heat equation must also occur in the initial or boundary data.

The maximum principle can also be used to prove the uniqueness of the solution of the heat equation. If two solutions of the heat equation have the same initial and boundary data, then their difference solves the heat equation with zero initial and boundary data. The maximum principle shows that the difference of the two solutions must be zero everywhere.

2.1.4 Bounded Domains and Separable Solutions

It is useful to construct analytical solutions to the heat equation on intervals, in order to provide test problems for numerical computations. Consider the problem

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \quad \forall x \in (0, 1) \ \forall t > 0 , \]
\[ u(0, t) = 0 = u(1, t) \ \ \forall t > 0 , \qquad u(x, 0) = u_0(x) \ \ \forall x \in (0, 1) . \]

Suppose that this problem with homogeneous boundary data has separable solutions of the form

\[ u(x, t) = T(t)X(x) . \]

If we substitute this expression into the heat equation, we obtain

\[ \frac{T'}{T} = \frac{X''}{X} . \]

Since the left side of this equation is a function of t and the right side is a function of x, both sides of this equation are constant. The homogeneous boundary conditions imply that

\[ X(x) = \sin(k\pi x) \]

for some integer k. It follows that

\[ u(x, t) = e^{-k^2\pi^2 t}\sin(k\pi x) \]

is a solution of the homogeneous heat equation on the bounded interval.

General solutions to heat equations with inhomogeneous initial data are linear combinations of these separable solutions:

\[ u(x, t) = \sum_{k=1}^{\infty} \alpha_k e^{-k^2\pi^2 t}\sin(k\pi x) . \]

In order to determine the coefficients $\alpha_k$, we note that the initial data must satisfy

\[ u_0(x) = \sum_{k=1}^{\infty} \alpha_k \sin(k\pi x) . \]

Since the sine functions are orthogonal,

\[ \int_0^1 u_0(x)\sin(k\pi x)\, dx = \alpha_k \int_0^1 \sin^2(k\pi x)\, dx = \frac{1}{2}\alpha_k . \]

For the purpose of developing test problems, however, it is more common to define the initial data $u_0$ by selecting the values of the $\alpha_k$. Typically, all but one of the $\alpha_k$ are chosen to be zero in simple test problems.

Similarly, it is easy to find analytical solutions to the heat equation with zero Neumann boundary data, or with periodic boundary data.
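The separable solutions translate directly into a small test-problem generator. The sketch below is a minimal illustration with names of our own choosing: it recovers the coefficients $\alpha_k$ from given initial data by the orthogonality relation (using a midpoint rule, which is our choice of quadrature) and evaluates a truncated series.

```python
import math


def sine_coefficient(u0, k, n=2000):
    """alpha_k = 2 * integral over (0,1) of u0(x) sin(k pi x) dx (midpoint rule)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += u0(x) * math.sin(k * math.pi * x)
    return 2.0 * total * h


def series_solution(alphas, x, t):
    """Truncated series: sum over k of alpha_k exp(-k^2 pi^2 t) sin(k pi x),
    with alphas[0] holding alpha_1."""
    return sum(
        a * math.exp(-((k * math.pi) ** 2) * t) * math.sin(k * math.pi * x)
        for k, a in enumerate(alphas, start=1)
    )


def u0(x):
    """Test data built from two modes, so alpha_1 = 1 and alpha_3 = 0.5."""
    return math.sin(math.pi * x) + 0.5 * math.sin(3.0 * math.pi * x)


alphas = [sine_coefficient(u0, k) for k in range(1, 5)]
print([round(a, 12) for a in alphas])
print(series_solution(alphas, 0.3, 0.0), u0(0.3))
print(series_solution(alphas, 0.3, 0.05))
```

At t = 0 the series reproduces the initial data, and for t > 0 the k = 3 mode has decayed much faster than the k = 1 mode.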

Exercises 2.1

1. Construct the general solution to the heat equation on (0, 1) × (0, ∞) with zero heat flow at the boundaries:

\[ \frac{\partial u}{\partial x}(0, t) = 0 = \frac{\partial u}{\partial x}(1, t) \quad \forall t > 0 . \]

2. Construct the general solution to the heat equation on (0, 1) × (0, ∞) with zero Neumann data at x = 0 and zero Dirichlet data at x = 1:

\[ \frac{\partial u}{\partial x}(0, t) = 0 \ \ \forall t > 0 , \qquad u(1, t) = 0 \ \ \forall t > 0 . \]

3. Construct the general solution to the heat equation on (0, 1) × (0, ∞) with periodic boundary conditions

\[ u(0, t) = u(1, t) \ \ \forall t > 0 , \qquad \frac{\partial u}{\partial x}(0, t) = \frac{\partial u}{\partial x}(1, t) \ \ \forall t > 0 . \]


2.2 Finite Difference Methods in One Dimension

Numerical solutions of parabolic partial differential equations commonly involve discretization in space and time. However, it will be advantageous to consider these discretizations separately.

2.2.1 Continuous-In-Time Methods

Suppose that we want to solve

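\[ \frac{\partial u}{\partial t} - D(x)\frac{\partial^2 u}{\partial x^2} = f(x, t) \quad \forall x \in (0, 1) \ \forall t > 0 \tag{2.2-1a} \]
\[ u(0, t) = \nu_0(t) , \qquad D(1)\frac{\partial u}{\partial x}(1, t) = \nu_1(t) \quad \forall t > 0 , \tag{2.2-1b} \]
\[ u(x, 0) = w_0(x) \quad \forall x \in (0, 1) . \tag{2.2-1c} \]

We will select a finite difference mesh

\[ 0 = x_0 < x_1 < \ldots < x_J < 1 \]

as shown in figure 2.2-1. We will also define the cell widths by

\[ \Delta x_{j+1/2} = x_{j+1} - x_j \ \text{ for } 0 \le j < J , \qquad \Delta x_{J+1/2} = 2(1 - x_J) , \]

and the cell centers by

\[ x_{j+1/2} = x_j + \Delta x_{j+1/2}/2 . \]

Note that $x_{J+1/2} = 1$.

For each t, we will approximate the solution of the heat equation by

\[ u_j(t) \approx u(x_j, t) , \quad 0 \le j \le J . \tag{2.2-2} \]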
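The mesh conventions above are easy to mistake by half a cell, so a small sketch may help. The Python helpers below are our own illustration (the names are hypothetical, and the list index i stands for the half-integer index i + 1/2); they build the cell widths and centers from a list of nodes and confirm that the last cell center lands on the right boundary x = 1.

```python
def cell_widths(x):
    """Given nodes x[0..J] with x[0] = 0 and x[J] < 1, return the widths
    dx[i] = Delta x_{i+1/2}, including the last cell of width 2 (1 - x_J)."""
    J = len(x) - 1
    widths = [x[j + 1] - x[j] for j in range(J)]
    widths.append(2.0 * (1.0 - x[J]))
    return widths


def cell_centers(x, widths):
    """Cell centers x_{j+1/2} = x_j + Delta x_{j+1/2} / 2 for 0 <= j <= J."""
    return [x[j] + widths[j] / 2.0 for j in range(len(x))]


def uniform_nodes(J):
    """Uniform mesh consistent with the conventions above:
    Delta x = 1 / (J + 1/2) and x_j = j Delta x."""
    dx = 1.0 / (J + 0.5)
    return [j * dx for j in range(J + 1)]


x = [0.0, 0.1, 0.25, 0.5, 0.9]          # an arbitrary non-uniform mesh
widths = cell_widths(x)
centers = cell_centers(x, widths)

x_uniform = uniform_nodes(4)
widths_uniform = cell_widths(x_uniform)

print(widths, centers)
print(widths_uniform)
```

Note that on a uniform mesh the last node sits half a cell width inside the domain, so that the flux boundary condition is applied at a cell center.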

Also for each t, we will approximate the spatial derivative by second-order centered finite differences to get the system of ordinary differential equations

\[ \frac{\partial u_j}{\partial t} - \frac{D(x_{j+1/2})\dfrac{u_{j+1}(t) - u_j(t)}{\Delta x_{j+1/2}} - D(x_{j-1/2})\dfrac{u_j(t) - u_{j-1}(t)}{\Delta x_{j-1/2}}}{(\Delta x_{j+1/2} + \Delta x_{j-1/2})/2} = f(x_j, t) \quad \forall\, 0 < j < J \ \forall t > 0 , \]
\[ u_j(0) = w_0(x_j) \quad \forall\, 0 < j \le J . \]

In order to satisfy the boundary data at the left side, we will take

\[ u_0(t) = \nu_0(t) . \]


Figure 2.2-1: Spatial discretization

In order to satisfy the boundary data at the right side, we will require

\[ \frac{\partial u_J}{\partial t} - \frac{\nu_1(t) - D(x_{J-1/2})\dfrac{u_J(t) - u_{J-1}(t)}{\Delta x_{J-1/2}}}{(\Delta x_{J+1/2} + \Delta x_{J-1/2})/2} = f(x_J, t) . \]

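We can rewrite these ordinary differential equations in the form

\[ M\frac{du}{dt} = -Ku + f , \]

where the vector of mesh functions is

\[ u \equiv \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{J-1} \\ u_J \end{bmatrix} , \]

and the vector of inhomogeneities is

\[ f \equiv \begin{bmatrix} f(x_1, t) + \alpha_{1/2}\nu_0(t) \\ f(x_2, t) \\ \vdots \\ f(x_{J-1}, t) \\ f(x_J, t) + D(x_{J+1/2})\nu_1(t) \end{bmatrix} . \]

Here we are using the notation

\[ \alpha_{j+1/2} = \frac{2D(x_{j+1/2})}{\Delta x_{j+1/2}} , \quad 0 \le j < J . \]

The coefficient matrix multiplying the time derivatives is

\[ M \equiv \begin{bmatrix} \Delta x_{1/2} + \Delta x_{3/2} & & & & \\ & \Delta x_{3/2} + \Delta x_{5/2} & & & \\ & & \ddots & & \\ & & & \Delta x_{J-3/2} + \Delta x_{J-1/2} & \\ & & & & \Delta x_{J-1/2} + \Delta x_{J+1/2} \end{bmatrix} , \]

and the coefficient matrix performing the spatial differences is

\[ K \equiv \begin{bmatrix} \alpha_{1/2} + \alpha_{3/2} & -\alpha_{3/2} & & & \\ -\alpha_{3/2} & \alpha_{3/2} + \alpha_{5/2} & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & \alpha_{J-3/2} + \alpha_{J-1/2} & -\alpha_{J-1/2} \\ & & & -\alpha_{J-1/2} & \alpha_{J-1/2} \end{bmatrix} . \]

Note that M is diagonal and K is tridiagonal; both are symmetric and positive definite.

We can rewrite the system of ordinary differential equations for the numerical solution in the form

\[ \frac{dM^{1/2}u}{dt} = -\left[M^{-1/2}KM^{-1/2}\right]M^{1/2}u + M^{-1/2}f . \]

Note that $M^{-1/2}KM^{-1/2}$ is symmetric positive-definite. The analytical solution of this ordinary differential equation is

\[ M^{1/2}u(t) = e^{-M^{-1/2}KM^{-1/2}t}M^{1/2}u(0) + \int_0^t e^{-M^{-1/2}KM^{-1/2}(t-s)}M^{-1/2}f(s)\, ds . \]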

Our purpose in this expression is not to provide a useful computational form; rather, we would like to see how the numerical solution depends on the parameters in the problem.

The finite size of M and K gives the continuous-in-time equations a finite set of decay rates, corresponding to the eigenvalues of $M^{-1/2}KM^{-1/2}$. It is reasonable to expect the smallest of these eigenvalues to approximate the slowest decay rates of the heat equation, but the largest of these eigenvalues may not be accurate.

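Although non-uniform meshes can be useful computationally, they can complicate the discussion. On a uniform mesh with constant diffusion coefficient D, we can solve for the solution derivative to get

\[ \frac{d}{dt}\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{J-1} \\ u_J \end{bmatrix} = -\frac{D}{\Delta x^2}\begin{bmatrix} 2 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 1 \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{J-1} \\ u_J \end{bmatrix} + \begin{bmatrix} f(x_1, t) + D\nu_0(t)/\Delta x^2 \\ f(x_2, t) \\ f(x_3, t) \\ \vdots \\ f(x_{J-1}, t) \\ f(x_J, t) + \nu_1(t)/\Delta x \end{bmatrix} \tag{2.2-3} \]

This can be rewritten in the matrix-vector form

\[ \frac{du}{dt} = -\left(A\frac{D}{\Delta x^2}\right)u + f , \]

where A is symmetric, positive definite and independent of the cell width or diffusion constant. The analytical solution of this ordinary differential equation is

\[ u(t) = e^{-AtD/\Delta x^2}u(0) + \int_0^t e^{A(s-t)D/\Delta x^2}f(s)\, ds . \]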

Let us determine the eigenvalues of A.

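Lemma 2.2-1: The J × J matrix

\[ A = \begin{bmatrix} 2 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 1 \end{bmatrix} \]

has eigenvalues

\[ \lambda_k = 4\sin^2\left[\frac{k + 1/2}{J + 1/2}\,\frac{\pi}{2}\right] , \quad 0 \le k < J . \]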


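Proof: Note that the Gerschgorin circle theorem implies that the eigenvalues λ of A satisfy 0 ≤ λ ≤ 4. Also note that the eigenvector equation

\[ \begin{bmatrix} 2 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 1 \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{J-1} \\ u_J \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{J-1} \\ u_J \end{bmatrix}\lambda \]

implies that

\[ u_2 = u_1(2 - \lambda) , \]
\[ \begin{bmatrix} u_{j+1} \\ u_j \end{bmatrix} = \begin{bmatrix} 2 - \lambda & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} u_j \\ u_{j-1} \end{bmatrix} , \quad 2 \le j < J , \]
\[ u_{J-1} = u_J(1 - \lambda) . \]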

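In order to solve the linear recurrence, we note that the eigenvector equation

\[ \begin{bmatrix} 2 - \lambda & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ y \end{bmatrix} = \begin{bmatrix} 1 \\ y \end{bmatrix}\mu \]

implies that the eigenvalue µ solves a quadratic equation, with the product of the eigenvalues equal to the determinant of this matrix, i.e. one. By solving this quadratic equation, we find that the eigenvalues have the form

\[ \mu = \frac{2 - \lambda \pm \sqrt{(2 - \lambda)^2 - 4}}{2} = \frac{2 - \lambda \pm i\sqrt{\lambda(4 - \lambda)}}{2} \equiv e^{\pm i\theta} , \]

where

\[ 2 - \lambda = 2\,\mathrm{Re}(\mu) = 2\cos\theta . \]

An easy calculation for the eigenvector shows that

\[ \begin{bmatrix} 2 - \lambda & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{bmatrix} . \]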


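Thus the solution of the linear recurrence is

\begin{align*}
\begin{bmatrix} u_j \\ u_{j-1} \end{bmatrix}
&= \begin{bmatrix} 2 - \lambda & -1 \\ 1 & 0 \end{bmatrix}^{j-2}\begin{bmatrix} u_2 \\ u_1 \end{bmatrix}
 = \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{bmatrix}^{j-2}\begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}^{-1}\begin{bmatrix} u_2 \\ u_1 \end{bmatrix} \\
&= \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{(j-2)i\theta} & 0 \\ 0 & e^{-(j-2)i\theta} \end{bmatrix}\begin{bmatrix} e^{i\theta} & -1 \\ -e^{-i\theta} & 1 \end{bmatrix}\begin{bmatrix} u_2 \\ u_1 \end{bmatrix}\frac{1}{e^{i\theta} - e^{-i\theta}} \\
&= \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{(j-2)i\theta} & 0 \\ 0 & e^{-(j-2)i\theta} \end{bmatrix}\begin{bmatrix} e^{i\theta} & -1 \\ -e^{-i\theta} & 1 \end{bmatrix}\begin{bmatrix} e^{i\theta} + e^{-i\theta} \\ 1 \end{bmatrix}\frac{u_1}{2i\sin\theta} \\
&= \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{(j-2)i\theta} & 0 \\ 0 & e^{-(j-2)i\theta} \end{bmatrix}\begin{bmatrix} e^{2i\theta} \\ -e^{-2i\theta} \end{bmatrix}\frac{u_1}{2i\sin\theta} \\
&= \begin{bmatrix} 1 & 1 \\ e^{-i\theta} & e^{i\theta} \end{bmatrix}\begin{bmatrix} e^{ji\theta} \\ -e^{-ji\theta} \end{bmatrix}\frac{u_1}{2i\sin\theta}
 = \begin{bmatrix} 2i\sin j\theta \\ 2i\sin(j-1)\theta \end{bmatrix}\frac{u_1}{2i\sin\theta}
 = \begin{bmatrix} \sin j\theta \\ \sin(j-1)\theta \end{bmatrix}\frac{u_1}{\sin\theta} .
\end{align*}

In other words, the eigenvectors have the form

\[ u_j = \alpha\sin j\theta , \]

where $\alpha = u_1/\sin\theta$, and the permissible values of θ are to be determined.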

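The final equation $\lambda u_J = -u_{J-1} + u_J$ in the system of equations for an eigenvector of A implies that

\begin{align*}
2[1 - \cos\theta]\sin J\theta &= -\sin(J\theta - \theta) + \sin J\theta \\
&= -\sin J\theta\cos\theta + \cos J\theta\sin\theta + \sin J\theta \\
&= [1 - \cos\theta]\sin J\theta + \cos J\theta\sin\theta .
\end{align*}

This implies that

\[ 0 = \sin J\theta - \sin J\theta\cos\theta - \cos J\theta\sin\theta = \sin J\theta - \sin(J + 1)\theta = -2\sin\frac{\theta}{2}\cos\left[J + \frac{1}{2}\right]\theta . \]

We should not choose θ so that $\sin\frac{1}{2}\theta = 0$, because that will imply that $u_j = 0$ for all j. In order to obtain nontrivial eigenvectors and eigenvalues, we choose θ so that $\cos(J + 1/2)\theta = 0$:

\[ \theta_k = \frac{k + 1/2}{J + 1/2}\,\pi , \quad 0 \le k < J . \]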


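Similar values of θ outside this range for k lead to repetitions of the eigenvectors. This gives us J distinct eigenvalues, of the form

\[ \lambda_k = 2[1 - \cos\theta_k] = 4\sin^2\frac{\theta_k}{2} = 4\sin^2\left[\frac{k + 1/2}{J + 1/2}\,\frac{\pi}{2}\right] . \]

The smallest eigenvalue is

\[ \lambda_0 = 4\sin^2\frac{\pi}{2(2J + 1)} \approx \frac{\pi^2}{(2J + 1)^2} , \]

and the largest eigenvalue is

\[ \lambda_{J-1} = 4\sin^2\left[\frac{J - 1/2}{J + 1/2}\,\frac{\pi}{2}\right] \approx 4 . \]

□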
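The lemma can be confirmed numerically without any linear algebra library, by applying A to the claimed eigenvectors $u_j = \sin j\theta_k$ and comparing with $\lambda_k u$. The following is a minimal sketch with function names of our own choosing; the list index j in the code corresponds to the mesh index j + 1.

```python
import math


def apply_A(v):
    """Multiply the J x J matrix A of Lemma 2.2-1 by the vector v.
    A is tridiagonal with off-diagonals -1 and diagonal 2, except that
    the last diagonal entry is 1."""
    J = len(v)
    out = []
    for j in range(J):
        diagonal = 2.0 if j < J - 1 else 1.0
        left = v[j - 1] if j > 0 else 0.0
        right = v[j + 1] if j < J - 1 else 0.0
        out.append(diagonal * v[j] - left - right)
    return out


def eigenpair(k, J):
    """Eigenvalue lambda_k and eigenvector (sin(j theta_k)), j = 1..J."""
    theta = (k + 0.5) / (J + 0.5) * math.pi
    lam = 4.0 * math.sin(theta / 2.0) ** 2
    vec = [math.sin(j * theta) for j in range(1, J + 1)]
    return lam, vec


def residual(k, J):
    """Largest entry of |A v - lambda v| for the k-th claimed eigenpair."""
    lam, vec = eigenpair(k, J)
    Av = apply_A(vec)
    return max(abs(a - lam * v) for a, v in zip(Av, vec))


J = 8
residuals = [residual(k, J) for k in range(J)]
eigenvalues = [eigenpair(k, J)[0] for k in range(J)]
print(max(residuals))
print(eigenvalues)
```

The residuals should be at the level of rounding error, and the eigenvalues should be distinct, increasing in k, and contained in the interval (0, 4) predicted by the Gerschgorin circle theorem.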

Note that spatial discretization has produced a large system of coupled ordinary differential equations. The solution of this system involves exponential decay of the portion of the solution due to the initial data, since the matrix A is positive definite. For fine discretization (i.e., J large) this system is stiff, since the ratio of the largest to smallest eigenvalues of A is large, with value approximately equal to $[4/(\pi\Delta x)]^2$.
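To get a feeling for how stiff the system is, the eigenvalue formula of Lemma 2.2-1 can be compared with the estimate $[4/(\pi\Delta x)]^2$. In the sketch below, the choice J = 100 and the variable names are our own; we assume the uniform mesh has $\Delta x = 1/(J + 1/2)$, which follows from $x_0 = 0$ and $x_{J+1/2} = 1$.

```python
import math


def eigenvalue(k, J):
    """lambda_k = 4 sin^2( (k + 1/2)/(J + 1/2) * pi/2 ), 0 <= k < J."""
    return 4.0 * math.sin((k + 0.5) / (J + 0.5) * math.pi / 2.0) ** 2


J = 100
dx = 1.0 / (J + 0.5)                      # uniform cell width (assumed mesh)

ratio = eigenvalue(J - 1, J) / eigenvalue(0, J)
estimate = (4.0 / (math.pi * dx)) ** 2

print(ratio, estimate)
```

Halving Δx multiplies the ratio by roughly four, which is why fine meshes make explicit time integration expensive.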

However, numerical methods also involve evaluation of the approximate solution atdiscrete times. This will be performed by numerical methods for time integration.

Exercises 2.2.1

1. Find the eigenvalues and eigenvectors of

\[ \begin{bmatrix} 1 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 1 \end{bmatrix} \]

This matrix corresponds to the finite difference discretization of the heat equation with Neumann (flux) boundary conditions at both boundaries.

2. Find the eigenvalues and eigenvectors of

\[ \begin{bmatrix} 2 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 2 \end{bmatrix} \]

This matrix corresponds to the finite difference discretization of the heat equation with Dirichlet (specified value) boundary conditions at both boundaries.


2.2.2 Explicit Centered Differences

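The simplest numerical discretization of the one-dimensional heat equation (2.2-1) is the explicit centered difference scheme

\[ \frac{\Delta x_{j-1/2} + \Delta x_{j+1/2}}{2}\,\frac{u_j^{n+1} - u_j^n}{\Delta t^{n+1/2}} - \left[ D(x_{j+1/2})\frac{u_{j+1}^n - u_j^n}{\Delta x_{j+1/2}} - D(x_{j-1/2})\frac{u_j^n - u_{j-1}^n}{\Delta x_{j-1/2}} \right] = \frac{\Delta x_{j-1/2} + \Delta x_{j+1/2}}{2}\, f(x_j, t^n) \quad \forall\, 1 \le j < J \ \forall n \ge 0 . \]

This scheme uses a second-order centered difference in space, and the first-order forward Euler scheme in time. With boundary conditions given by (2.2-1b), we take the numerical solution at the left boundary to be

\[ u_0^n = \nu_0(t^n) \]

and write the scheme at the right boundary in the form

\[ \frac{\Delta x_{J-1/2} + \Delta x_{J+1/2}}{2}\,\frac{u_J^{n+1} - u_J^n}{\Delta t^{n+1/2}} - \left[ \nu_1(t^n) - D(x_{J-1/2})\frac{u_J^n - u_{J-1}^n}{\Delta x_{J-1/2}} \right] = \frac{\Delta x_{J-1/2} + \Delta x_{J+1/2}}{2}\, f(x_J, t^n) . \]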

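If we solve the difference equations for $u_j^{n+1}$, we get

\begin{align*}
u_j^{n+1} ={}& \frac{2D(x_{j-1/2})\Delta t^{n+1/2}}{(\Delta x_{j-1/2} + \Delta x_{j+1/2})\Delta x_{j-1/2}}\, u_{j-1}^n \\
&+ \left[ 1 - \frac{2D(x_{j-1/2})\Delta t^{n+1/2}}{(\Delta x_{j-1/2} + \Delta x_{j+1/2})\Delta x_{j-1/2}} - \frac{2D(x_{j+1/2})\Delta t^{n+1/2}}{(\Delta x_{j-1/2} + \Delta x_{j+1/2})\Delta x_{j+1/2}} \right] u_j^n \\
&+ \frac{2D(x_{j+1/2})\Delta t^{n+1/2}}{(\Delta x_{j-1/2} + \Delta x_{j+1/2})\Delta x_{j+1/2}}\, u_{j+1}^n + f(x_j, t^n)\Delta t^{n+1/2} .
\end{align*}

In the absence of internal heat sources or sinks (i.e. f = 0), the new numerical solution is a weighted average of the old solution values provided that

\[ \frac{2\Delta t^{n+1/2}}{\Delta x_{j-1/2} + \Delta x_{j+1/2}}\left[ \frac{D(x_{j+1/2})}{\Delta x_{j+1/2}} + \frac{D(x_{j-1/2})}{\Delta x_{j-1/2}} \right] \le 1 . \]

Solving this inequality for $\Delta t^{n+1/2}$ gives us

\[ \Delta t^{n+1/2} \le \frac{\Delta x_{j-1/2} + \Delta x_{j+1/2}}{2}\;\frac{1}{\dfrac{D(x_{j+1/2})}{\Delta x_{j+1/2}} + \dfrac{D(x_{j-1/2})}{\Delta x_{j-1/2}}} . \]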
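Since this bound must hold at every interior node, the usable timestep is the minimum of the right-hand side over j. A minimal sketch of that computation follows; the function name and the sample meshes are hypothetical, and the list index i stands for the half-integer index i + 1/2, so that node j uses entries j − 1 and j. Only the interior nodes covered by the inequality above are checked.

```python
def max_stable_timestep(widths, D_half):
    """Largest timestep allowed by the weighted-average condition.

    widths[i] = Delta x_{i+1/2} and D_half[i] = D(x_{i+1/2}).
    Node j (1 <= j < len(widths)) uses entries j-1 and j.
    """
    best = float("inf")
    for j in range(1, len(widths)):
        rate = D_half[j] / widths[j] + D_half[j - 1] / widths[j - 1]
        bound = 0.5 * (widths[j - 1] + widths[j]) / rate
        best = min(best, bound)
    return best


D = 2.0
dx = 0.1

# Uniform mesh with constant D: the bound should reduce to dx^2 / (2 D).
dt_uniform = max_stable_timestep([dx] * 5, [D] * 5)

# A graded mesh: the smallest cells dictate the timestep.
dt_graded = max_stable_timestep([0.2, 0.1, 0.05, 0.025], [D] * 4)

print(dt_uniform, dx * dx / (2.0 * D))
print(dt_graded)
```

On a graded mesh a single pair of small cells determines the timestep for the whole computation, which is one practical cost of local refinement with an explicit scheme.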

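If the timestep is chosen in this way, then the numerical scheme satisfies a maximum principle for problems with no internal heat sources or sinks. To see this, suppose that the maximum value of $u_j^{n+1}$ is $u_i^{n+1}$. Then

\begin{align*}
\max_j u_j^{n+1} = u_i^{n+1} ={}& \frac{2D(x_{i-1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i-1/2}}\, u_{i-1}^n \\
&+ \left[ 1 - \frac{2D(x_{i-1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i-1/2}} - \frac{2D(x_{i+1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i+1/2}} \right] u_i^n \\
&+ \frac{2D(x_{i+1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i+1/2}}\, u_{i+1}^n \\
\le{}& \frac{2D(x_{i-1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i-1/2}}\, \max_j u_j^n \\
&+ \left[ 1 - \frac{2D(x_{i-1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i-1/2}} - \frac{2D(x_{i+1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i+1/2}} \right] \max_j u_j^n \\
&+ \frac{2D(x_{i+1/2})\Delta t^{n+1/2}}{(\Delta x_{i-1/2} + \Delta x_{i+1/2})\Delta x_{i+1/2}}\, \max_j u_j^n \\
={}& \max_j u_j^n .
\end{align*}

Similarly, we can establish a minimum principle; these prevent the numerical method from establishing new extrema, and imply numerical stability.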

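Away from the boundaries on a uniform mesh with constant diffusion coefficient, the explicit centered difference scheme can be rewritten in the form

\[ u_j^{n+1} = \tau u_{j+1}^n + (1 - 2\tau)u_j^n + \tau u_{j-1}^n + f_j^n\Delta t , \quad \forall\, 0 < j < J \ \forall n \ge 0 \tag{2.2-4} \]

where the decay number is

\[ \tau \equiv \frac{D\Delta t}{\Delta x^2} . \]

For boundary value problem (2.2-1), we obtain a linear recurrence

\[ \begin{bmatrix} u_1^{n+1} \\ u_2^{n+1} \\ \vdots \\ u_J^{n+1} \end{bmatrix} = \begin{bmatrix} 1 - 2\tau & \tau & & \\ \tau & 1 - 2\tau & \ddots & \\ & \ddots & \ddots & \tau \\ & & \tau & 1 - \tau \end{bmatrix}\begin{bmatrix} u_1^n \\ u_2^n \\ \vdots \\ u_J^n \end{bmatrix} + \begin{bmatrix} f_1^n\Delta t + \nu_0(n\Delta t)\tau \\ f_2^n\Delta t \\ \vdots \\ f_J^n\Delta t + \nu_1(n\Delta t)\Delta t/\Delta x \end{bmatrix} \]

which can be written in matrix-vector form

\[ u^{n+1} = (I - A\tau)u^n + f^n . \]

It is not hard to see that the explicit centered difference scheme differs from the continuous-in-time method by approximating the matrix exponential by

\[ e^{-A\tau} \approx I - A\tau . \]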
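The approximation of the matrix exponential can be seen one eigenvector at a time. In the sketch below (our own illustration; `explicit_step` is a hypothetical name, not one of the routines distributed with the text), one step of the homogeneous recurrence is applied to an eigenvector of A from Lemma 2.2-1. The step multiplies the mode by $1 - \tau\lambda_k$, whereas the continuous-in-time solution would multiply it by $e^{-\tau\lambda_k}$.

```python
import math


def explicit_step(u, tau, left_value=0.0):
    """One step u^{n+1} = (I - A tau) u^n with f = 0 and nu_1 = 0.

    u holds u_1..u_J; left_value plays the role of the Dirichlet value nu_0
    at x_0, and the last row uses the coefficient 1 - tau (zero flux)."""
    J = len(u)
    new = []
    for j in range(J):
        left = u[j - 1] if j > 0 else left_value
        if j < J - 1:
            new.append(tau * left + (1.0 - 2.0 * tau) * u[j] + tau * u[j + 1])
        else:
            new.append(tau * left + (1.0 - tau) * u[j])
    return new


J, tau, k = 10, 0.3, 3
theta = (k + 0.5) / (J + 0.5) * math.pi
lam = 4.0 * math.sin(theta / 2.0) ** 2
mode = [math.sin(j * theta) for j in range(1, J + 1)]

stepped = explicit_step(mode, tau)
factor_error = max(
    abs(s - (1.0 - tau * lam) * m) for s, m in zip(stepped, mode)
)
exact_factor = math.exp(-tau * lam)

print(factor_error)
print(1.0 - tau * lam, exact_factor)
```

The two factors agree to second order in $\tau\lambda_k$, so the slowly decaying modes are tracked well, while the modes with $\tau\lambda_k$ of order one are not.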

Note that if f = 0 and 0 < τ < 1/2, then equation (2.2-4) shows that the solution at the new time is a weighted average of the solution at the old time. As we saw in the discussion above for non-uniform grids, this implies a discrete maximum principle, and therefore numerical stability. On a uniform grid, the timestep restriction also implies monotonicity. If $u_{j+1}^n - u_j^n > 0$ for all j, then

\[ u_{j+1}^{n+1} - u_j^{n+1} = \tau(u_{j+2}^n - u_{j+1}^n) + (1 - 2\tau)(u_{j+1}^n - u_j^n) + \tau(u_j^n - u_{j-1}^n) > 0 . \]

Monotonicity corresponds to choosing Δt small enough that $I - A\Delta t$ is positive definite, so that the approximation to the matrix exponential is positive definite.

However, 0 < τ < 1/2 implies that we have chosen the timestep to be small:

\[ \Delta t \le \frac{\Delta x^2}{2D} . \]

This restriction typically requires an unacceptably large number of timesteps, unless the diffusion constant D is very small. On the other hand, any attempt to balance the spatial and temporal truncation errors would choose $\Delta t = O(\Delta x^2)$, since we have used a second-order finite difference approximation to the spatial derivative and a first-order finite difference approximation to the temporal derivative.
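The effect of the timestep restriction is easy to observe. The following sketch is our own Python illustration, not one of the Fortran programs described in the next section: it runs the uniform-mesh scheme with zero value at the left boundary, zero flux at the right boundary, and piecewise-constant initial data, once with τ = 0.4 and once with τ = 0.6. The function names, mesh size and step count are arbitrary choices.

```python
def explicit_step(u, tau, left_value=0.0):
    """One step of the uniform-mesh explicit scheme with f = 0, Dirichlet
    value left_value at x_0 and zero flux at the right boundary."""
    J = len(u)
    new = []
    for j in range(J):
        left = u[j - 1] if j > 0 else left_value
        if j < J - 1:
            new.append(tau * left + (1.0 - 2.0 * tau) * u[j] + tau * u[j + 1])
        else:
            new.append(tau * left + (1.0 - tau) * u[j])
    return new


def run(tau, steps, J=20):
    """Advance piecewise-constant initial data and return max |u_j|."""
    u = [1.0 if j < J // 2 else 0.0 for j in range(J)]
    for _ in range(steps):
        u = explicit_step(u, tau)
    return max(abs(v) for v in u)


stable_max = run(0.4, 200)      # tau < 1/2: weighted averages, no growth
unstable_max = run(0.6, 200)    # tau > 1/2: high-frequency modes amplify

print(stable_max, unstable_max)
```

With τ = 0.4 the solution never exceeds the initial maximum, as the discrete maximum principle requires. With τ = 0.6 the highest-frequency modes are multiplied by a factor of magnitude greater than one at each step, and the computed values grow without bound.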

2.2.3 Programs for Explicit Centered Differences

Explicit centered differences are easy to program. Nevertheless, in order to assist the student with code organization, visualization and debugging, we have provided five example programs. These programs will increase the complexity in the main program and makefile, but typically share subroutines that compute the solution of the heat equation. When we begin to experiment with different integration schemes and differential equations from within this text, we will use the last of these programs.

To prepare to obtain copies of these codes, perform the following steps:

1. Type “cd” to return to your home directory.

2. Type “mkdir parabolic” to make a directory to contain the program code in this chapter.

3. Type “cd parabolic” to enter the new directory.

4. Download Program 2.2-2: tarfile from the web page.

5. Type “tar -xvf tarfile” to unbundle the codes in your new parabolic directory.

6. Type “rm tarfile” to remove the code bundle in your parabolic directory. The unbundled code will remain.

First Explicit Centered Difference Program

The first program is designed to be as simple as possible. It consists of the single main program Program 2.2-3: main.f, written in Fortran. This program is specifically designed to solve the heat equation with a given diffusion constant. It also uses a uniform mesh, specifies a fixed value for the solution at the left boundary and zero derivative for the solution at the right boundary, and uses piecewise-constant initial data. The time evolution of the solution is terminated by exceeding either a specified number of timesteps or a specified simulation time. The program prints the final results for use by a separate plotting program. Instructions for using the program can be found in the accompanying Program 2.2-4: README.

Program 2.2-2: http://www.math.duke.edu/~johnt/math226/parabolic/tarfile

Program 2.2-3: http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM0/main.f

Second Explicit Centered Difference Program

The second program is designed to be more modular than the first program. This program consists of several pieces:

• Program 2.2-5: heatmain.f Fortran main program for solving the heat equation over some specified time or number of timesteps;

• Program 2.2-6: heat.f Fortran routines for initializing the solution and mesh, and integrating the heat equation;

• Program 2.2-7: const.i Fortran common block for machine dependent parameters, and parameter statements for some common constants;

• Program 2.2-8: GNUmakefile Makefile to compile and load the Fortran files.

It is strongly suggested that the student maintain this basic style of organization for thecode. The separation of initial and boundary conditions from the time integration will makethe experimentation with alternative numerical schemes easier. It will also make it easierfor us to apply the methods to a variety of differential equations, or to different initial valuesor boundary conditions.

File heat.f contains six routines: initialize, explicit_centered, implicit_centered, forward_euler_fem, implicit_centered_periodic and crank_nicolson. Subroutine initialize initializes the temperature and mesh for the heat equation. The other routines implement different schemes for integrating the heat equation in time.

To run a copy of this code, perform the following steps:


2. Type “cd parabolic/PROGRAM1” to enter the directory for this program.

3. Type “make” to compile the program files and make the executable flinearad.

4. Type “heat > output”; heat runs the program and > output redirects the results to the file output.

5. Type “xmgrace output” to plot the computational results.

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM0/README

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM1/heatmain.f

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM1/heat.f

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM1/const.i

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM1/GNUmakefile


The final step will show a graph of the numerical solution plotted as a function of space, at the final time in the simulation.

There are several difficulties with this simple Fortran code. One is that whenever we want to change the input parameters, such as the number of grid cells or timesteps, we have to recompile the main program. Another problem is that the arrays have to be given a fixed size, because Fortran 77 does not perform dynamic memory allocation. We will fix these problems with the next program.

Third Explicit Centered Difference Program

Our third program is more sophisticated, employing a mixture of C++ and Fortran. This program consists of several pieces that we have already seen, namely heat.f and const.i. However, we have a different main program, a different make file and a new input file:

• Program 2.2-9: HeatMain.C C++ main program;

• Program 2.2-10: GNUmakefile Makefile to compile and load the mixed-language program;

• Program 2.2-11: input the input file for executing the program.

The C++ file, HeatMain.C, has several important features. Since C++ is strongly typed, this file contains function prototypes for the Fortran routines; these can be found in the extern "C" block. Next, this C++ file also defines C++ structures for the Fortran common block, namely machine_common; this allows us to refer to the data in the Fortran common blocks from within the main program.

Inside the main program itself, we provide values for the machine-dependent constants in the Fortran common block machine. Afterward, we read the parameters from the input file.

After this preliminary work, the main program is prepared for computation. It allocates memory for the computational arrays, and defines array bounds for the Fortran subroutine calls. Next, the main program initializes the array entries to IEEE infinity; if the program uses an entry before it is given a proper value, then the resulting values will be obviously wrong. This initialization is useful in debugging; think of it as defensive programming.
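The effect of initializing to IEEE infinity can be illustrated with a small Python sketch. The helper names `allocate` and `initialize` below are hypothetical stand-ins, not the routines in HeatMain.C or heat.f; the deliberate bug (one cell never initialized) is introduced only to show how the infinity propagates.

```python
import math


def allocate(ncells):
    """Return a solution array whose entries are all IEEE infinity."""
    return [math.inf] * ncells


def initialize(u, x):
    """Hypothetical initializer: square-wave data at the given cell centers."""
    for j, xj in enumerate(x):
        u[j] = 1.0 if 1.0 / 3.0 < xj < 2.0 / 3.0 else 0.0


ncells = 6
x = [(j + 0.5) / ncells for j in range(ncells)]
u = allocate(ncells)

# Deliberate bug: the last cell center is omitted, so u[5] is never set.
initialize(u, x[:-1])

# Any quantity that touches the forgotten entry is obviously wrong.
average = sum(u) / ncells
uninitialized = [j for j, v in enumerate(u) if math.isinf(v)]
print("average =", average, "uninitialized cells:", uninitialized)
```

Had the array been filled with zeros instead, the average would have looked plausible and the missing initialization could easily go unnoticed.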

Now that the problem parameters are known and the data arrays have been allocated, the main program calls initialize to set the initial values. At the end of the computation, the main program writes out the final results.




http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM2/HeatMain.C


http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM2/input


3. Type “make” to compile the program files and make the executable linearad.

4. Type “heat input > output” to run the program and redirect the results to the file output.

5. Type “xmgrace output” to plot the computational results.

There are still difficulties with this program. Since we cannot see the numerical results during execution, it is difficult to see the time evolution of the computation. Further, we are not fully able to perform other important aspects of defensive programming that we will introduce in the next program.

Fourth Explicit Centered Difference Program

Our fourth program is even more sophisticated, employing a mixture of C++ and Fortran, together with some references to external libraries. This program consists of several pieces that we have already seen, namely heat.f and const.i. However, the main program, input file and make file are different:

• Program 2.2-12: HeatMain.C C++ main program;

• Program 2.2-13: GNUmakefile Makefile to compile the mixed-language program and link with libraries;


The first change in HeatMain.C is that we construct a MemoryDebugger to watch for out-of-bounds writes and unfreed pointers. Later, we define InputParameters for everything we would like to read from our input file. Each InputParameter knows the location of the variable to be assigned, a character string identifier and lower/upper bounds on permissible values. After defining the InputParameters, we read the parameters from the input file.
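The idea of a bounds-checked input parameter can be sketched as follows. This Python class only borrows the name InputParameter from the text; its constructor arguments, the `assign` method, the `read_parameters` helper and the "name value" line format of the input file are all assumptions, not the interface used in HeatMain.C.

```python
class InputParameter:
    """Hypothetical parameter record: identifier, value and permissible bounds."""

    def __init__(self, name, default, lower, upper, cast=float):
        self.name = name
        self.value = default
        self.lower = lower
        self.upper = upper
        self.cast = cast

    def assign(self, text):
        """Convert the text and store it, rejecting out-of-bounds values."""
        value = self.cast(text)
        if not (self.lower <= value <= self.upper):
            raise ValueError(
                f"{self.name} = {value} is outside [{self.lower}, {self.upper}]")
        self.value = value


def read_parameters(lines, parameters):
    """Assign every 'name value' line whose name matches a known parameter."""
    table = {p.name: p for p in parameters}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in table:
            table[fields[0]].assign(fields[1])
    return {p.name: p.value for p in parameters}


if __name__ == "__main__":
    params = [
        InputParameter("ncells", 10, 2, 100000, cast=int),
        InputParameter("decay_number", 0.5, 0.0, 1.0e6),
    ]
    print(read_parameters(["ncells 100", "decay_number 0.4"], params))
```

Rejecting a bad value at read time gives an error message that names the offending parameter, rather than a mysterious failure later in the time loop.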

The biggest change to the main program is our use of interactive graphics to plot the solution. To do this, we compute the upper and lower bounds on the mesh and the solution. Then we construct an XYGraphTool that will plot our results. The arguments to the XYGraphTool constructor are the title to appear on the graphics window, the user coordinates for the window, a pointer to the colormap, and the desired size of the window as a fraction of the screen size. Next, we set the colors for the background and foreground, and draw the axes. Afterward, we draw plus signs at the cell centers for the numerical solution. During the loop over timesteps, we also plot the new results.

The makefile is necessarily complicated, because we are linking with other libraries for memory debugging, graphics, and graphical user interfaces. At the beginning of makefile, we include macros.gnu, which contains machine-dependent macros to describe the compiler names and options. Next, we set some internal macros for compiling and linking. Afterward, we make a list of routines needed by our program.

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM3/HeatMain.C

http://www.math.duke.edu/~johnt/math226/parabolic/PROGRAM3/input

The trickiest part of the makefile is how we provide different targets to construct code for debugging or optimized performance. We can choose whether we will make debug or optimized code by setting the OPT_OPTIONS flag in GNUmakefile. The choice d will generate code for debugging with no optimization, while the choice o will generate optimized code. During code development, you will want to work with debug code. Once your code has been tested, you can create optimized code for greater execution speed.




3. Type “make” to compile the program files and make the executable 1d/linearad.

4. Type “1d/linearad input” to run the program.

When the program is run, the user will see a movie of the simulation, showing the conserved quantity plotted as a function of space at each time in the movie.

The directory 1d refers to code written in one dimension, for debugging. Optimized code is compiled and loaded in directory 1o. Figure 2.2-2 contains some example results with this program, at the final time in each simulation.

In order to capture the graphics into a file for printing, you can create a shell script, such as Program 2.2-15: eps4paper. This command first copies the contents of a window to a .gif file, then converts that file to .pdf form.

Fifth Explicit Centered Difference Program

Our fifth and final version of our explicit centered difference program is designed to be run from within this book. For this purpose, it is necessary that the user be able to change all input parameters interactively, from within a graphical user interface. This program consists of several pieces that we have already seen, namely heat.f and const.i. However, the main program, input file and make file are different:

• Program 2.2-16: HeatMainGUI.C C++ main program and C++ auxiliary procedures;

• Program 2.2-17: GNUmakefile Makefile to compile the mixed-language program and link with libraries;


http://www.math.duke.edu/~johnt/math226/parabolic/eps4paper

http://www.math.duke.edu/~johnt/math226/parabolic/HeatMainGUI.C

http://www.math.duke.edu/~johnt/math226/parabolic/GNUmakefile

http://www.math.duke.edu/~johnt/math226/parabolic/input


In order to work with the graphical user interface, the main program basically performs some preliminary work before entering an event loop. The event loop calls various routines in response to user interaction with the graphical user interface. One of these callback routines is runMain in HeatMainGUI.C; this routine contains most of the statements that appeared in the main program of the previous example. The event loop allows the user to perform one simulation, adjust the input parameters, and then perform another simulation, all in the same run of the program. However, because of the separate threads used for the events, such a program is more difficult to debug than the previous examples.



2. Type “cd parabolic” to enter the directory for this program.

3. Type “make” to compile the program files and make the executable 1d/guiheat.

4. Type “1d/guiheat input” to run the program.

The directory 1d refers to code written in one dimension, for debugging. Optimized code is compiled and loaded in directory 1o. You can also run the executable by clicking on the following: Executable 2.2-19: guiheat. The latter will use a graphical user interface for parameter input. Pull down on “View” and release the mouse on “Main”. Click on any of the arrows to see current values of either the “Problem Parameters”, “Numerical Method Parameters” or “Graphics Parameters”. After selecting your values, click on “Start Run Now” in the original graphical user interface. As with the executable “1d/guiheat”, you will get a window displaying a movie of the temperature plotted as a function of space during simulation time.

In figure 2.2-2 we show computational results with the explicit centered difference scheme for the boundary value problem (2.2-1) with homogeneous boundary data and piecewise-constant initial data

$$u(x, 0) = \begin{cases} 1, & 1/3 < x < 2/3 \\ 0, & \text{otherwise} \end{cases} \tag{2.2-5}$$

The computation uses a uniform grid of 100 cells, a diffusion constant of D = 1, and the timestep chosen so that the decay number is τ = 0.51. Since this exceeds the stability limit τ ≤ 1/2 for explicit centered differences, the computation is numerically unstable. The numerical oscillations grow until they eventually dominate the form of the solution and exceed the ability of the computer to store floating-point numbers.
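The onset of instability can be reproduced with a short Python experiment. This is a sketch under stated assumptions, not the book's program: the function name `max_after`, the nodal mesh, and the mirror treatment of the zero-derivative right boundary are illustrative choices.

```python
def max_after(decay_number, nsteps, ncells=100):
    """Largest |u| after nsteps of explicit centered differences on [0, 1]."""
    dx = 1.0 / ncells
    # Square-wave initial data (2.2-5) sampled at the mesh points.
    u = [1.0 if 1.0 / 3.0 < j * dx < 2.0 / 3.0 else 0.0
         for j in range(ncells + 1)]
    for _ in range(nsteps):
        new = u[:]
        for j in range(1, ncells):
            new[j] = u[j] + decay_number * (u[j + 1] - 2.0 * u[j] + u[j - 1])
        new[0] = 0.0  # homogeneous value at the left boundary
        # Homogeneous derivative at the right boundary (assumed mirror condition).
        new[ncells] = u[ncells] + 2.0 * decay_number * (u[ncells - 1] - u[ncells])
        u = new
    return max(abs(v) for v in u)


if __name__ == "__main__":
    for tau in (0.5, 0.51):
        print("decay number", tau, "max |u| after 1000 steps:",
              max_after(tau, 1000))
```

With τ = 0.5 every new value is a convex combination of old values, so the solution stays between 0 and 1; with τ = 0.51 the highest-frequency modes are amplified at every step and eventually swamp the solution.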

2.2.4 Implicit Centered Differences

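Our next numerical discretization of the heat equation is the implicit centered difference scheme

$$\frac{\Delta x_{j-1/2} + \Delta x_{j+1/2}}{2} \, \frac{u_j^{n+1} - u_j^n}{\Delta t^{n+1/2}} = D(x_{j+1/2}) \frac{u_{j+1}^{n+1} - u_j^{n+1}}{\Delta x_{j+1/2}} - D(x_{j-1/2}) \frac{u_j^{n+1} - u_{j-1}^{n+1}}{\Delta x_{j-1/2}} + f(x_j, t^{n+1}) \quad \forall\, 1 \le j < J \quad \forall\, n \ge 0 .$$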

http://www5.math.duke.edu/cgi-bin/startvnc?run=parabolic_guiheat


Figure 2.2-2: Explicit centered differences with τ = 0.51: onset of instability. Panels: (a) step 10, (b) step 20, (c) step 30, (d) step 40.


With boundary conditions given by (2.2-1b), we take the numerical solution at the left boundary to be

$$u_0^{n+1} = \nu_0(t^{n+1})$$

and write the scheme at the right boundary in the form

$$\frac{\Delta x_{J-1/2} + \Delta x_{J+1/2}}{2} \, \frac{u_J^{n+1} - u_J^n}{\Delta t^{n+1/2}} = \left[ \nu_1(t^{n+1}) - D(x_{J-1/2}) \frac{u_J^{n+1} - u_{J-1}^{n+1}}{\Delta x_{J-1/2}} \right] + \frac{\Delta x_{J-1/2} + \Delta x_{J+1/2}}{2} f(x_J, t^{n+1}) .$$

This scheme uses a second-order centered difference in space, and the first-order backward Euler scheme in time.

The implicit centered scheme satisfies a maximum principle without any restriction on the timestep. Suppose that $u_i^{n+1} = \max_j u_j^{n+1}$ occurs in the interior of the grid, and that $f = 0$. Then the numerical scheme implies that

$$\max_j u_j^{n+1} = u_i^{n+1} = u_i^n - \frac{2 \Delta t^{n+1/2}}{\Delta x_{i-1/2} + \Delta x_{i+1/2}} \left[ \frac{D(x_{i+1/2})}{\Delta x_{i+1/2}} \left( u_i^{n+1} - u_{i+1}^{n+1} \right) + \frac{D(x_{i-1/2})}{\Delta x_{i-1/2}} \left( u_i^{n+1} - u_{i-1}^{n+1} \right) \right] \le u_i^n \le \max_j u_j^n .$$

The inequality holds because both differences inside the brackets are nonnegative at a maximum. We can establish a minimum principle in a similar fashion. The maximum principle implies unconditional numerical stability. The maximum principle also implies that in each timestep the linear system for the implicit centered difference scheme has a unique solution. This is seen by taking $u_j^n = 0$ for all $j$ and using the maximum principle to show that $u_j^{n+1} = 0$ for all $j$, so the only solution to the homogeneous linear system in the implicit centered scheme is identically zero.

On a uniform grid with constant diffusion coefficient D, the implicit centered scheme can be rewritten in the form

$$-\tau u_{j+1}^{n+1} + (1 + 2\tau) u_j^{n+1} - \tau u_{j-1}^{n+1} = u_j^n + f_j^{n+1} \Delta t \quad \forall\, 0 < j \le J \quad \forall\, n \ge 0 ,$$

where $\tau \equiv D \Delta t / \Delta x^2$ is the decay number. In matrix-vector form, this can be written

$$(\mathbf{I} + \mathbf{A}\tau)\, \mathbf{u}^{n+1} = \mathbf{u}^n + \mathbf{f}^{n+1} .$$

Implicit centered differences correspond to approximating the matrix exponential by

$$e^{-\mathbf{A}\tau} \approx (\mathbf{I} + \mathbf{A}\tau)^{-1} .$$

Note that since A is positive definite, this approximation to the matrix exponential is positive definite for any positive decay number τ, and thus for any Δt > 0.


One advantage of the implicit centered difference scheme is that its numerical solution tends to the steady state solution of the heat equation as Δt → ∞. This is known as L-stability in the numerical solution of ordinary differential equations.

In figure 2.2-3 we show computational results with the implicit centered difference scheme for the boundary value problem (2.2-1) with homogeneous boundary data, and initial data given by (2.2-5). The computation uses a uniform grid of 100 cells, a diffusion constant of D = 1, and several choices of the timestep.
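Each implicit step requires the solution of a tridiagonal linear system. The following Python sketch shows one way to do this on a uniform grid with the Thomas algorithm. It assumes zero Dirichlet values at both ends of the grid (a simplification of the boundary conditions in (2.2-1)), and the names `solve_tridiagonal` and `implicit_step` are illustrative, not the routines in heat.f.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm (no pivoting); lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def implicit_step(u, tau):
    """One backward Euler step; u holds interior values, boundary values are 0."""
    n = len(u)
    return solve_tridiagonal([-tau] * n, [1.0 + 2.0 * tau] * n, [-tau] * n, u)


if __name__ == "__main__":
    ncells = 100
    u0 = [1.0 if 1.0 / 3.0 < j / ncells < 2.0 / 3.0 else 0.0
          for j in range(1, ncells)]
    for tau in (0.1, 1.0, 10.0, 100.0):
        u1 = implicit_step(u0, tau)
        print("tau =", tau, "min =", min(u1), "max =", max(u1))
```

The matrix is strictly diagonally dominant for every τ > 0, so the elimination needs no pivoting, and the computed values stay between the extremes of the initial data for each decay number tried.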

2.2.5 Higher-Order Temporal Discretization

In order to construct higher-order temporal discretizations for parabolic equations, we can use our knowledge of numerical methods to solve initial-value problems for ordinary differential equations. We will restrict our examples to methods that produce second-order temporal discretization, since our centered difference is second-order in space.

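One way to achieve second-order accuracy in both space and time is to average the explicit and implicit centered difference schemes:

$$\frac{\Delta x_{j+1/2} + \Delta x_{j-1/2}}{2} \, \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D(x_{j+1/2})}{2} \left[ \frac{u_{j+1}^{n+1} - u_j^{n+1}}{\Delta x_{j+1/2}} + \frac{u_{j+1}^n - u_j^n}{\Delta x_{j+1/2}} \right] - \frac{D(x_{j-1/2})}{2} \left[ \frac{u_j^{n+1} - u_{j-1}^{n+1}}{\Delta x_{j-1/2}} + \frac{u_j^n - u_{j-1}^n}{\Delta x_{j-1/2}} \right] + \frac{f(x_j, t^{n+1}) + f(x_j, t^n)}{2} .$$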

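This corresponds to using centered differences in space, and the trapezoidal rule in time. On a uniform mesh for a problem with constant diffusion coefficient D, this can be written in matrix-vector form

$$\left( \mathbf{I} + \mathbf{A} \frac{\tau}{2} \right) \mathbf{u}^{n+1} = \left( \mathbf{I} - \mathbf{A} \frac{\tau}{2} \right) \mathbf{u}^n + \left( \mathbf{f}^n + \mathbf{f}^{n+1} \right) \frac{1}{2} ,$$

where τ is the decay number. The Crank-Nicolson scheme approximates the matrix exponential by

$$e^{-\mathbf{A}\tau} \approx \left( \mathbf{I} + \mathbf{A} \frac{\tau}{2} \right)^{-1} \left( \mathbf{I} - \mathbf{A} \frac{\tau}{2} \right) .$$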

Note that as τ → ∞, this approximation tends to −I rather than zero. If $\lambda \Delta x^2$ is an eigenvalue of A and $\tau \lambda \Delta x^2 = D \lambda \Delta t$ is large, then the corresponding decay mode of the discretized equation does not decay monotonically. This indicates that the timestep in the Crank-Nicolson scheme should be chosen with some care. For example, in order that the exponential approximation be positive-definite, we must choose Δt small so that $\mathbf{I} - \mathbf{A}\tau/2$ is positive-definite. This leads to a small timestep, although twice as large as with explicit centered differences. We say that the timestep is small, because notions of balancing the temporal and spatial truncation errors would suggest that Δt = O(Δx).

In section 2.3.3, we will use energy estimates to show that the Crank-Nicolson scheme is unconditionally stable. Later, in section 2.4.4, we will use a Fourier analysis to see that the Crank-Nicolson scheme is unconditionally diffusive. However, the lack of monotonicity can produce annoying numerical oscillations when the initial data has significant high-frequency components and when the decay number satisfies $\lambda_{\max} \tau > 2$, where $\lambda_{\max}$ is the largest eigenvalue of A.

Figure 2.2-3: Unconditional stability with implicit centered differences, t = 0.01. Panels: (a) τ = 0.1, 1000 steps; (b) τ = 1, 100 steps; (c) τ = 10, 10 steps; (d) τ = 100, 1 step.

In figure 2.2-4 we show computational results with the Crank-Nicolson scheme for the boundary value problem (2.2-1) with homogeneous boundary data, and initial data given by (2.2-5). The computation uses a uniform grid of 100 cells, a diffusion constant of D = 1, and the timestep chosen so that the decay number is τ = 10.0. Thus, this computation is numerically stable. However, since the Crank-Nicolson scheme is not monotone, the high-frequency components of the initial data oscillate as they decay. The results are shown at time t = 0.001.
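The difference between the two implicit schemes can be seen from their per-step amplification factors on each eigenmode. The Python sketch below assumes that A is the tridiagonal matrix with stencil (−1, 2, −1) on J − 1 interior points with zero Dirichlet values, whose eigenvalues are 4 sin²(kπ/(2J)); the function name `amplification_factors` is illustrative.

```python
import math


def amplification_factors(tau, ncells=100):
    """Return (eigenvalue, implicit factor, Crank-Nicolson factor) per mode."""
    factors = []
    for k in range(1, ncells):
        # Eigenvalue of the (-1, 2, -1) matrix with Dirichlet boundaries.
        mu = 4.0 * math.sin(0.5 * k * math.pi / ncells) ** 2
        implicit = 1.0 / (1.0 + tau * mu)
        crank_nicolson = (1.0 - 0.5 * tau * mu) / (1.0 + 0.5 * tau * mu)
        factors.append((mu, implicit, crank_nicolson))
    return factors


if __name__ == "__main__":
    factors = amplification_factors(10.0)
    mu, implicit, crank_nicolson = factors[-1]
    print("highest mode: eigenvalue", mu)
    print("  implicit centered factor:", implicit)
    print("  Crank-Nicolson factor:   ", crank_nicolson)
```

For τ = 10 the implicit factor of the highest mode is small and positive, while the Crank-Nicolson factor is negative and close to −1 in magnitude: that mode changes sign every step and decays slowly, which is the oscillation described above.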

In practice, the Crank-Nicolson scheme is often combined with a smoothing step, similar to the smoothing used in the modified midpoint method for ordinary differential equations [28]. Let $u_j^n$ represent the numerical values computed by the Crank-Nicolson scheme with smoothing. First, we compute the provisional new value $u_j^{n+1}$ by the Crank-Nicolson scheme:

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D}{2} \, \frac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{\Delta x^2} + \frac{D}{2} \, \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{\Delta x^2} .$$

Afterward, we smooth the previous result by computing

$$u_j^n = \frac{1}{4} \left[ u_j^{n+1} + 2u_j^n + u_j^{n-1} \right] .$$

Another way to avoid non-monotone behavior is to take small timesteps until the high frequency components of the initial data have decayed, then increase the timestep until the temporal errors match the spatial errors.

Using well-known techniques for solving ordinary differential equations, we can construct other fully discrete methods for the heat equation from the continuous-in-time discretization of the heat equation using second-order spatial discretization. Here are a few examples on uniform grids with constant diffusion coefficient.

Example 2.2-20: The second-order Adams-Bashforth method gives us the scheme

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{3D}{2\Delta x^2} \left[ u_{j+1}^n - 2u_j^n + u_{j-1}^n \right] - \frac{D}{2\Delta x^2} \left[ u_{j+1}^{n-1} - 2u_j^{n-1} + u_{j-1}^{n-1} \right] .$$

As we will see in section 2.4.4.4, stability requires that with this scheme we choose the timestep even smaller than with explicit centered differences.

Example 2.2-21: The second-order Adams-Moulton method gives us the scheme

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D}{2\Delta x^2} \left[ u_{j+1}^n - 2u_j^n + u_{j-1}^n \right] + \frac{D}{2\Delta x^2} \left[ u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1} \right] .$$


Figure 2.2-4: Crank-Nicolson with τ = 10: lack of monotonicity. Panels: (a) t = 0.002, (b) t = 0.004, (c) t = 0.006, (d) t = 0.008.


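This is the Crank-Nicolson scheme.

Example 2.2-22: The second-order backward differentiation formula gives us the scheme

$$\frac{\frac{3}{2} u_j^{n+1} - 2u_j^n + \frac{1}{2} u_j^{n-1}}{\Delta t} = \frac{D}{\Delta x^2} \left[ u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1} \right] .$$

We will see in section 2.4.4.5 that this scheme is unconditionally stable.

Example 2.2-23: The second-order DIRK (Diagonally Implicit Runge-Kutta) scheme for solving $dy/dt = f$ takes the form

$$\begin{aligned}
k_1 &= f(t^n + \gamma \Delta t ,\; y^n + \gamma \Delta t\, k_1) , \\
k_2 &= f(t^n + (1 - \gamma) \Delta t ,\; y^n + (1 - 2\gamma) \Delta t\, k_1 + \gamma \Delta t\, k_2) , \\
y^{n+1} &= y^n + \frac{\Delta t}{2} (k_1 + k_2) .
\end{aligned}$$

Here $\gamma = 1 - \sqrt{1/2}$. If we apply this DIRK to the heat equation, we get the scheme

$$\begin{aligned}
k_{1,j} &= D \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{\Delta x^2} + \gamma D \Delta t \frac{k_{1,j+1} - 2k_{1,j} + k_{1,j-1}}{\Delta x^2} \\
k_{2,j} &= D \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{\Delta x^2} + D \Delta t (1 - 2\gamma) \frac{k_{1,j+1} - 2k_{1,j} + k_{1,j-1}}{\Delta x^2} + \gamma D \Delta t \frac{k_{2,j+1} - 2k_{2,j} + k_{2,j-1}}{\Delta x^2} \\
u_j^{n+1} &= u_j^n + \frac{\Delta t}{2} \left[ k_{1,j} + k_{2,j} \right] .
\end{aligned}$$

Thus $k_1$ and $k_2$ are determined by solving systems of linear equations, each involving the same coefficient matrix. By design, this Runge-Kutta scheme is second-order accurate, L-stable and A-stable; see example 2.4.4.6 below.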
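The order and damping properties of this DIRK scheme can be checked on the scalar test problem dy/dt = −λy, for which each implicit stage can be solved in closed form. The Python sketch below is only such a scalar check; the names `dirk_step` and `error_at_one` and the chosen values of λ and Δt are assumptions for illustration.

```python
import math

GAMMA = 1.0 - math.sqrt(0.5)


def dirk_step(y, lam, dt):
    """One step of the two-stage DIRK scheme for dy/dt = -lam * y."""
    # Stage 1: k1 = -lam * (y + GAMMA * dt * k1), solved for k1.
    k1 = -lam * y / (1.0 + GAMMA * lam * dt)
    # Stage 2: k2 = -lam * (y + (1 - 2*GAMMA)*dt*k1 + GAMMA*dt*k2), solved for k2.
    k2 = -lam * (y + (1.0 - 2.0 * GAMMA) * dt * k1) / (1.0 + GAMMA * lam * dt)
    return y + 0.5 * dt * (k1 + k2)


def error_at_one(nsteps, lam=1.0):
    """Error at time 1 for y(0) = 1, compared with exp(-lam)."""
    y = 1.0
    dt = 1.0 / nsteps
    for _ in range(nsteps):
        y = dirk_step(y, lam, dt)
    return abs(y - math.exp(-lam))


if __name__ == "__main__":
    coarse, fine = error_at_one(20), error_at_one(40)
    print("error ratio when the timestep is halved:", coarse / fine)
    print("one step with lam * dt = 1e6:", dirk_step(1.0, 1.0e6, 1.0))
```

Halving the timestep should reduce the error by a factor near four, consistent with second-order accuracy, and a single very large step should return a value near zero, consistent with L-stability. Both stages divide by the same factor 1 + γλΔt, which is the scalar counterpart of the shared coefficient matrix.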

Exercises 2.2.5

1. Consider the continuous-in-time scheme for the one-dimensional heat equation with Neumann data at both boundaries. Write the spatially-discretized system in the form of a system of ordinary differential equations

$$\frac{d\mathbf{u}}{dt} = -\mathbf{A}\mathbf{u} + \mathbf{f}$$

and carefully describe the entries of u, A and f. Is A symmetric? Is A positive-definite?

2. Program explicit centered differences, implicit centered differences and the Crank-Nicolson scheme for the heat equation

$$\frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = 0$$

in one dimension with periodic boundary values. Consider initial data given by the lowest Fourier mode of the heat equation, or by the square wave

$$u_0(x) = \begin{cases} 1, & 1/3 < x < 2/3 \\ 0, & \text{otherwise} \end{cases}$$

Compare numerical results for these schemes with 10, 100 and 1000 cells at time t = 0.03, which is roughly the time at which the spatial maximum of the analytical solution has decayed to 50% of its initial value. Describe carefully how you chose Δt for each scheme.

3. Program the second-order BDF and the DIRK scheme for the problem in exercise 2. Compare the accuracy and efficiency of these schemes. Describe carefully how you chose Δt.

2.3 Consistency, Stability and Convergence

Numerical schemes are stable if they produce bounded perturbations in the numerical solution as a result of perturbations in the data, such as initial conditions, boundary conditions, or forcing function in the interior. Numerical schemes are consistent if their discretization produces a small error in approximating the differential equation in each timestep. Numerical schemes are convergent if the numerical solution converges to the true solution of the differential equation as the mesh and timestep are refined, in the absence of computer rounding errors.

First, let us describe what we mean by schemes. We assume that the numerical method can be written

$$\mathbf{u}^{n+1} = Q^{n+1/2} \mathbf{u}^n$$

where $Q^{n+1/2}$ is some operator on the solution vector. Note that the solution vector may be defined at an infinite number of points, for the purposes of this analysis. It will typically be convenient to use the shift operators

$$(S_+ u)_j^n = u_{j+1}^n \quad \text{and} \quad (S_- u)_j^n = u_{j-1}^n ,$$

to define $Q^{n+1/2}$ in specific schemes.

Example 2.3-1: The explicit centered scheme can be written

$$u_j^{n+1} = \tau_{j-1/2}^{n+1/2} u_{j-1}^n + \left( 1 - \tau_{j-1/2}^{n+1/2} - \tau_{j+1/2}^{n+1/2} \right) u_j^n + \tau_{j+1/2}^{n+1/2} u_{j+1}^n \tag{2.3-1}$$

where $\tau_{j+1/2}^{n+1/2} = D \Delta t^{n+1/2} / \Delta x_{j+1/2}^2$ is the decay number. In this case, we have

$$Q^{n+1/2} = S_- \tau_{\cdot-1/2}^{n+1/2} + I \left( 1 - \tau_{\cdot-1/2}^{n+1/2} - \tau_{\cdot+1/2}^{n+1/2} \right) + S_+ \tau_{\cdot+1/2}^{n+1/2} .$$

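Example 2.3-2: The implicit centered scheme can be written

$$- \tau_{j-1/2}^{n+1/2} u_{j-1}^{n+1} + \left( 1 + \tau_{j-1/2}^{n+1/2} + \tau_{j+1/2}^{n+1/2} \right) u_j^{n+1} - \tau_{j+1/2}^{n+1/2} u_{j+1}^{n+1} = u_j^n . \tag{2.3-2}$$

In this case, we have

$$Q^{n+1/2} = \left[ -S_- \tau_{\cdot-1/2}^{n+1/2} + I \left( 1 + \tau_{\cdot-1/2}^{n+1/2} + \tau_{\cdot+1/2}^{n+1/2} \right) - S_+ \tau_{\cdot+1/2}^{n+1/2} \right]^{-1} .$$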


2.3.1 Stability of Explicit and Implicit Centered Differences

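Next, let ‖ · ‖ represent some norm on the solution vector. For example, we could use

$$\| \mathbf{u}^n \| \equiv \sup_j | u_j^n |$$

or

$$\| \mathbf{u}^n \| \equiv \sum_j | u_j^n | \Delta x_j .$$

The induced norm on $Q^{n+1/2}$ is defined by

$$\| Q^{n+1/2} \| \equiv \sup_{\mathbf{u}^n} \frac{\| Q^{n+1/2} \mathbf{u}^n \|}{\| \mathbf{u}^n \|} .$$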

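Lemma 2.3-3: Let ‖ · ‖ be any norm on the solution vector such that

$$\| S_- \mathbf{u} \| \le \| \mathbf{u} \| \quad \forall \mathbf{u} , \qquad \| S_+ \mathbf{u} \| \le \| \mathbf{u} \| \quad \forall \mathbf{u} .$$

Suppose that we have a uniform spatial grid, and that the decay numbers satisfy

$$\tau^{n+1/2} = \frac{D \Delta t^{n+1/2}}{\Delta x^2} \le \frac{1}{2} \quad \forall\, n \ge 0 .$$

Then the solution operator $Q^{n+1/2}$ for the explicit centered scheme (2.3-1) satisfies

$$\| Q^{n+1/2} \| \le 1 .$$

Proof: Using the triangle inequality for norms, we compute

$$\begin{aligned}
\| Q^{n+1/2} \mathbf{u} \| &= \| (S_- \mathbf{u})^n \tau^{n+1/2} + (1 - 2\tau^{n+1/2}) \mathbf{u}^n + (S_+ \mathbf{u})^n \tau^{n+1/2} \| \\
&\le \| (S_- \mathbf{u})^n \| \tau^{n+1/2} + \| \mathbf{u}^n \| (1 - 2\tau^{n+1/2}) + \| (S_+ \mathbf{u})^n \| \tau^{n+1/2} \\
&\le \| \mathbf{u}^n \| \tau^{n+1/2} + \| \mathbf{u}^n \| (1 - 2\tau^{n+1/2}) + \| \mathbf{u}^n \| \tau^{n+1/2} = \| \mathbf{u}^n \| .
\end{aligned}$$

□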

Thus the explicit centered difference scheme is stable on a uniform grid, in any norm for which the shift operators do not increase the norm, provided that the timestep is chosen judiciously. In particular, we can use the max norm to prove that the explicit centered difference scheme satisfies a maximum principle whenever the decay numbers are at most one half. We can use the same approach to prove max-norm stability on a non-uniform grid.
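In the max norm on a uniform grid, the induced norm of the explicit operator is the sum of the absolute values of its three stencil coefficients. The Python sketch below evaluates that sum and applies one step to a sawtooth vector on a periodic grid; the periodic wrap-around and the function names are assumptions made to keep the example short.

```python
def max_norm_of_explicit_operator(tau):
    """Sum of absolute stencil coefficients: the induced max norm of Q."""
    return abs(tau) + abs(1.0 - 2.0 * tau) + abs(tau)


def apply_explicit_periodic(u, tau):
    """One explicit centered step on a periodic grid (u[-1] wraps around)."""
    n = len(u)
    return [tau * u[j - 1] + (1.0 - 2.0 * tau) * u[j] + tau * u[(j + 1) % n]
            for j in range(n)]


if __name__ == "__main__":
    sawtooth = [(-1.0) ** j for j in range(8)]
    for tau in (0.25, 0.5, 0.51):
        grown = max(abs(v) for v in apply_explicit_periodic(sawtooth, tau))
        print("tau =", tau, "norm of Q =", max_norm_of_explicit_operator(tau),
              "sawtooth after one step:", grown)
```

For τ ≤ 1/2 all three coefficients are nonnegative and sum to one, so the norm is exactly 1. For τ > 1/2 the norm is 4τ − 1 > 1, and the sawtooth vector attains it.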

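Lemma 2.3-4: Let ‖ · ‖ be any norm on the solution vector such that

$$\| S_- \mathbf{u} \| \le \| \mathbf{u} \| \quad \forall \mathbf{u} , \qquad \| S_+ \mathbf{u} \| \le \| \mathbf{u} \| \quad \forall \mathbf{u} .$$

Suppose that we have a uniform spatial grid. Then the solution operator $Q^{n+1/2}$ for the implicit centered scheme (2.3-2) satisfies

$$\| Q^{n+1/2} \| \le 1 .$$

Proof: Using the triangle inequality, we compute

$$\begin{aligned}
\| \mathbf{u}^n \| &= \| -(S_- \mathbf{u})^{n+1} \tau^{n+1/2} + (1 + 2\tau^{n+1/2}) \mathbf{u}^{n+1} - (S_+ \mathbf{u})^{n+1} \tau^{n+1/2} \| \\
&\ge -\| (S_- \mathbf{u})^{n+1} \| \tau^{n+1/2} + \| \mathbf{u}^{n+1} \| (1 + 2\tau^{n+1/2}) - \| (S_+ \mathbf{u})^{n+1} \| \tau^{n+1/2} \\
&\ge -\| \mathbf{u}^{n+1} \| \tau^{n+1/2} + \| \mathbf{u}^{n+1} \| (1 + 2\tau^{n+1/2}) - \| \mathbf{u}^{n+1} \| \tau^{n+1/2} \\
&= \| \mathbf{u}^{n+1} \| = \| Q^{n+1/2} \mathbf{u}^n \| .
\end{aligned}$$

□

Thus the implicit centered difference scheme is stable on a uniform grid, in any norm and for any choice of the decay numbers.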

2.3.2 Error Analysis

Error analysis of finite difference methods typically depends on two features: local truncation error and a maximum principle. Both of these depend on the problem and the method, so we will illustrate the approach with examples.

2.3.2.1 Explicit Centered Differences

Suppose that we want to approximate the solution of

$$\begin{aligned}
\frac{\partial u}{\partial t} - \frac{\partial}{\partial x} \left( D \frac{\partial u}{\partial x} \right) &= f \\
u(0, t) &= \nu_0(t) \quad \forall\, t > 0 , \\
D \frac{\partial u}{\partial x}(1, t) &= \nu_1(t) \quad \forall\, t > 0 , \\
u(x, 0) &= u_0(x) \quad \forall\, 0 < x < 1 .
\end{aligned}$$

Note that the analytical solution of this problem is smooth. We assume that the numerical solution is given by the explicit centered difference scheme

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} - \frac{1}{\Delta x} \left[ D \frac{u_{j+1}^n - u_j^n}{\Delta x} - D \frac{u_j^n - u_{j-1}^n}{\Delta x} \right] = f(j\Delta x, n\Delta t) .$$

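When this scheme is applied to the analytical solution, we obtain the local truncation error:

$$\begin{aligned}
\delta_j^n &= \frac{u(j\Delta x, n\Delta t + \Delta t) - u(j\Delta x, n\Delta t)}{\Delta t} - f(j\Delta x, n\Delta t) \\
&\quad - \frac{1}{\Delta x} \left[ D \frac{u(j\Delta x + \Delta x, n\Delta t) - u(j\Delta x, n\Delta t)}{\Delta x} - D \frac{u(j\Delta x, n\Delta t) - u(j\Delta x - \Delta x, n\Delta t)}{\Delta x} \right] \\
&\approx \left[ \frac{\partial u}{\partial t} + \frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2} \right] - f \\
&\quad - \frac{D}{\Delta x} \left[ \frac{\partial u}{\partial x} + \frac{\Delta x}{2} \frac{\partial^2 u}{\partial x^2} + \frac{\Delta x^2}{6} \frac{\partial^3 u}{\partial x^3} + \frac{\Delta x^3}{24} \frac{\partial^4 u}{\partial x^4} \right] \\
&\quad + \frac{D}{\Delta x} \left[ \frac{\partial u}{\partial x} - \frac{\Delta x}{2} \frac{\partial^2 u}{\partial x^2} + \frac{\Delta x^2}{6} \frac{\partial^3 u}{\partial x^3} - \frac{\Delta x^3}{24} \frac{\partial^4 u}{\partial x^4} \right] \\
&= \frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2} - \frac{D \Delta x^2}{12} \frac{\partial^4 u}{\partial x^4} ,
\end{aligned}$$

where $f$ and all derivatives of $u$ are evaluated at $(j\Delta x, n\Delta t)$. The final equality uses the differential equation to cancel $\partial u / \partial t - D\, \partial^2 u / \partial x^2 - f$.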

The error in the numerical solution,

$$e_j^n \equiv u(j\Delta x, n\Delta t) - u_j^n$$

satisfies

$$\frac{e_j^{n+1} - e_j^n}{\Delta t} - \frac{1}{\Delta x} \left[ D \frac{e_{j+1}^n - e_j^n}{\Delta x} - D \frac{e_j^n - e_{j-1}^n}{\Delta x} \right] = \delta_j^n .$$

We can rewrite this equation in the form

$$e_j^{n+1} = \tau e_{j+1}^n + (1 - 2\tau) e_j^n + \tau e_{j-1}^n + \Delta t\, \delta_j^n .$$

If $\tau \le \frac{1}{2}$ then the explicit centered difference scheme satisfies the maximum principle:

$$\max_j | e_j^{n+1} | \le \max_j | e_j^n | + \Delta t \max_j | \delta_j^n | .$$

In this case, the timestep restriction for the maximum principle is equivalent to a stability requirement. We can solve this recurrence to get the inequality

$$\max_j | e_j^n | \le \max_j | e_j^0 | + \Delta t \sum_{k=0}^{n-1} \max_j | \delta_j^k | .$$


Taking $T = n\Delta t$ and approximating the local truncation error by the leading order terms, we get the approximate inequality

$$\max_j | e_j^n | \le \max_j | e_j^0 | + T \max_{0 \le t \le T} \max_{0 \le x \le 1} \left\{ \frac{\Delta t}{2} \left| \frac{\partial^2 u}{\partial t^2} \right| + \frac{D \Delta x^2}{12} \left| \frac{\partial^4 u}{\partial x^4} \right| \right\} .$$

This shows that the explicit centered difference scheme is first-order in time, and second-order in space, provided that the timestep is chosen so that $\Delta t \le \Delta x^2 / (2D)$.

In order to choose the timestep Δt with the explicit centered difference scheme, we have two considerations. First, we need the scheme to be stable, so we require

$$\tau \le \frac{1}{2} \iff \Delta t \le \frac{\Delta x^2}{2D} .$$

Secondly, we might like to balance the temporal and spatial truncation errors. This would require

$$\frac{\Delta t}{2} \left| \frac{\partial^2 u}{\partial t^2} \right| \approx \frac{D \Delta x^2}{12} \left| \frac{\partial^4 u}{\partial x^4} \right| .$$

This also suggests that we choose $\Delta t = O(\Delta x^2)$. In order to choose a timestep according to this condition, we could use spatial and temporal differences from computed results to approximate the higher-order derivatives.
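The predicted convergence rate can be checked numerically. The Python sketch below assumes a test problem that is not in the text: D = 1, f = 0, zero Dirichlet values at both ends and initial data sin(πx), for which the exact solution is exp(−π²t) sin(πx). The function name `explicit_error` and the parameter values are illustrative.

```python
import math


def explicit_error(ncells, decay_number=0.4, t_final=0.1):
    """Max-norm error of explicit centered differences at time t_final."""
    dx = 1.0 / ncells
    dt = decay_number * dx * dx  # D = 1, so dt = tau * dx**2
    nsteps = round(t_final / dt)
    u = [math.sin(math.pi * j * dx) for j in range(ncells + 1)]
    for _ in range(nsteps):
        new = u[:]
        for j in range(1, ncells):
            new[j] = u[j] + decay_number * (u[j + 1] - 2.0 * u[j] + u[j - 1])
        new[0] = 0.0
        new[ncells] = 0.0
        u = new
    # Compare with the exact solution at the time actually reached.
    decay = math.exp(-math.pi ** 2 * nsteps * dt)
    return max(abs(u[j] - decay * math.sin(math.pi * j * dx))
               for j in range(ncells + 1))


if __name__ == "__main__":
    coarse, fine = explicit_error(20), explicit_error(40)
    print("error with 20 cells:", coarse)
    print("error with 40 cells:", fine)
    print("ratio:", coarse / fine)
```

Because the decay number is held fixed, halving Δx divides Δt by four, so both error terms shrink by a factor of four and the error ratio should be close to 4.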

2.3.2.2 Implicit Centered Differences

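Next, let us examine the convergence of implicit centered differences. When the implicit centered difference scheme is applied to the solution of the heat equation, we obtain the local truncation error:

$$\begin{aligned}
\delta_j^{n+1} &= \frac{u(j\Delta x, n\Delta t + \Delta t) - u(j\Delta x, n\Delta t)}{\Delta t} - f(j\Delta x, n\Delta t + \Delta t) \\
&\quad - \frac{D}{\Delta x} \, \frac{u(j\Delta x + \Delta x, n\Delta t + \Delta t) - u(j\Delta x, n\Delta t + \Delta t)}{\Delta x} \\
&\quad + \frac{D}{\Delta x} \, \frac{u(j\Delta x, n\Delta t + \Delta t) - u(j\Delta x - \Delta x, n\Delta t + \Delta t)}{\Delta x} \\
&\approx \left[ \frac{\partial u}{\partial t} - \frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2} \right] - f \\
&\quad - \frac{D}{\Delta x} \left[ \frac{\partial u}{\partial x} + \frac{\Delta x}{2} \frac{\partial^2 u}{\partial x^2} + \frac{\Delta x^2}{6} \frac{\partial^3 u}{\partial x^3} + \frac{\Delta x^3}{24} \frac{\partial^4 u}{\partial x^4} \right] \\
&\quad + \frac{D}{\Delta x} \left[ \frac{\partial u}{\partial x} - \frac{\Delta x}{2} \frac{\partial^2 u}{\partial x^2} + \frac{\Delta x^2}{6} \frac{\partial^3 u}{\partial x^3} - \frac{\Delta x^3}{24} \frac{\partial^4 u}{\partial x^4} \right] \\
&= -\frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2} - \frac{D \Delta x^2}{12} \frac{\partial^4 u}{\partial x^4} ,
\end{aligned}$$

where $f$ and all derivatives of $u$ are evaluated at $(j\Delta x, n\Delta t + \Delta t)$.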

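Arguing as in section 2.2.4, we can establish the maximum principle

$$e_j^{n+1} \le \max \left\{ 0 ,\; \max_k e_k^n + \Delta t \max_k \delta_k^{n+1} \right\}$$

and the minimum principle

$$e_j^{n+1} \ge \min \left\{ 0 ,\; \min_k e_k^n + \Delta t \min_k \delta_k^{n+1} \right\} .$$

It follows that

$$| e_j^{n+1} | \le \max_k | e_k^n | + \Delta t \max_k | \delta_k^{n+1} | .$$

We can solve this recurrence to get the inequality

$$\max_j | e_j^n | \le \max_j | e_j^0 | + \Delta t \sum_{k=1}^{n} \max_j | \delta_j^k | .$$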

Taking $T = n\Delta t$, we get the approximate inequality

$$\max_j | e_j^n | \le \max_j | e_j^0 | + T \max_{0 \le t \le T} \max_{0 \le x \le 1} \left\{ \frac{\Delta t}{2} \left| \frac{\partial^2 u}{\partial t^2} \right| + \frac{D \Delta x^2}{12} \left| \frac{\partial^4 u}{\partial x^4} \right| \right\} .$$

This shows that the implicit centered difference scheme is also first-order in time, and second-order in space.

If we choose the timestep to balance the spatial and temporal truncation errors, then we must take $\Delta t = O(\Delta x^2)$. An alternative is to let the temporal truncation error dominate the spatial truncation error, and choose the timestep so that the temporal truncation error is acceptable. For example, the second-order time derivative could be estimated by a second-order difference of values of the numerical solution at three successive timesteps.

2.3.3 Stability of the Crank-Nicolson Scheme

Analysis of higher-order finite difference methods is more difficult, but still possible whenever the scheme has either a maximum principle or an energy principle. Analysis of problems with rough initial data is also more difficult. We will postpone that analysis until our discussion of finite element methods.

The Crank-Nicolson scheme is an example of a higher-order method with both a maximum principle and an energy principle. While discussing the Crank-Nicolson scheme we will generalize the differential equation somewhat, while we also restrict our choice of norms. Consider

$$\frac{\partial u}{\partial t} + \mathcal{A} u = f$$

where $\mathcal{A}$ is nonnegative, meaning that

$$\forall\, v \in H \quad (v, \mathcal{A} v) \ge 0 .$$

Here $\mathcal{A}$ is some partial differential operator and $H$ is an appropriate space of functions. For example, we could have

$$\mathcal{A} u = -\nabla_x \cdot \nabla_x u$$

and let $H$ be the set of all functions whose first derivatives are square-integrable.

Suppose that we discretize in time by means of the midpoint rule, and discretize in space to get the Crank-Nicolson scheme

$$\frac{\mathbf{u}^{n+1} - \mathbf{u}^n}{\Delta t} + \mathbf{A} \frac{\mathbf{u}^{n+1} + \mathbf{u}^n}{2} = \mathbf{f}^{n+1/2} .$$

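Here $\mathbf{u}^n$ is an array of discrete data values on a grid, and $\mathbf{A}$ is the matrix representation of the discretization of $\mathcal{A}$. We can rewrite this equation in the form

$$\left[ \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right] \mathbf{u}^{n+1} = \left[ \mathbf{I} - \mathbf{A} \frac{\Delta t}{2} \right] \mathbf{u}^n + \mathbf{f}^{n+1/2} \Delta t ,$$

or equivalently

$$\mathbf{u}^{n+1} = \left[ \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right]^{-1} \left[ \mathbf{I} - \mathbf{A} \frac{\Delta t}{2} \right] \mathbf{u}^n + \left[ \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right]^{-1} \mathbf{f}^{n+1/2} \Delta t \equiv \mathbf{T} \mathbf{u}^n + \left[ \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right]^{-1} \mathbf{f}^{n+1/2} \Delta t .$$

It is easy to see that this scheme is stable under reasonable assumptions on the matrix A.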

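Lemma 2.3-5: If $\mathbf{A}$ is symmetric with $\mathbf{x}^\top \mathbf{A} \mathbf{x} \ge 0$ for all $\mathbf{x}$,

$$\mathbf{T} \equiv \left[ \mathbf{I} + \mathbf{A} \Delta t / 2 \right]^{-1} \left[ \mathbf{I} - \mathbf{A} \Delta t / 2 \right]$$

and ‖ ‖ represents the 2-norm on either vectors or matrices, then

1. $\left\| \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right)^{-1} \right\| \le 1$,

2. $\| \mathbf{T} \| \le 1$,

3. $\| \mathbf{u}^{n+1} \| \le \| \mathbf{u}^n \| + \Delta t \| \mathbf{f}^{n+1/2} \|$, and

4. $\| \mathbf{u}^n \| \le \| \mathbf{u}^0 \| + \Delta t \sum_{k=1}^{n} \| \mathbf{f}^{k-1/2} \|$.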

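Proof: Let us prove the first claim. For all $\mathbf{x}$,

$$\left\| \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right) \mathbf{x} \right\| = \max_{\mathbf{y} \ne 0} \frac{\mathbf{y}^\top \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right) \mathbf{x}}{\| \mathbf{y} \|} \ge \frac{\mathbf{x}^\top \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right) \mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}^\top \mathbf{x} + \mathbf{x}^\top \mathbf{A} \mathbf{x} \frac{\Delta t}{2}}{\| \mathbf{x} \|} \ge \frac{\mathbf{x}^\top \mathbf{x}}{\| \mathbf{x} \|} = \| \mathbf{x} \| .$$

Given any vector $\mathbf{y}$, we can choose $\mathbf{x} = \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right)^{-1} \mathbf{y}$ to get

$$\| \mathbf{y} \| = \left\| \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right) \mathbf{x} \right\| \ge \| \mathbf{x} \| = \left\| \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right)^{-1} \mathbf{y} \right\| .$$

The first claim follows from this result.

To show that $\| \mathbf{T} \| \le 1$, we can assume that $\mathbf{f} = 0$. Then

$$\begin{aligned}
0 &= \left( \frac{\mathbf{u}^{n+1} + \mathbf{u}^n}{2} \right)^\top \left( \frac{\mathbf{u}^{n+1} - \mathbf{u}^n}{\Delta t} + \mathbf{A} \frac{\mathbf{u}^{n+1} + \mathbf{u}^n}{2} \right) \\
&= \frac{\| \mathbf{u}^{n+1} \|^2 - \| \mathbf{u}^n \|^2}{2 \Delta t} + \left( \frac{\mathbf{u}^{n+1} + \mathbf{u}^n}{2} \right)^\top \mathbf{A} \, \frac{\mathbf{u}^{n+1} + \mathbf{u}^n}{2} \ge \frac{\| \mathbf{u}^{n+1} \|^2 - \| \mathbf{u}^n \|^2}{2 \Delta t} .
\end{aligned}$$

It follows that

$$\| \mathbf{T} \mathbf{u}^n \| = \| \mathbf{u}^{n+1} \| \le \| \mathbf{u}^n \| ,$$

which proves the second claim.

To prove the third claim, note that

$$\mathbf{u}^{n+1} = \mathbf{T} \mathbf{u}^n + \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right)^{-1} \mathbf{f}^{n+1/2} \Delta t ,$$

from which it follows that

$$\| \mathbf{u}^{n+1} \| \le \| \mathbf{T} \| \| \mathbf{u}^n \| + \left\| \left( \mathbf{I} + \mathbf{A} \frac{\Delta t}{2} \right)^{-1} \right\| \left\| \mathbf{f}^{n+1/2} \right\| \Delta t \le \| \mathbf{u}^n \| + \| \mathbf{f}^{n+1/2} \| \Delta t .$$

The fourth claim follows from the third by induction. □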
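The energy bound in the second claim can be observed numerically. The Python sketch below takes f = 0 and assumes that A Δt is τ times the (−1, 2, −1) matrix with zero Dirichlet values, on a uniform grid; the function names and the choice τ = 10 are illustrative and not taken from heat.f.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm (no pivoting); lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_step(u, tau):
    """Solve (I + A tau/2) u_new = (I - A tau/2) u for interior values u."""
    n = len(u)
    rhs = []
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        rhs.append(u[j] + 0.5 * tau * (left - 2.0 * u[j] + right))
    return solve_tridiagonal([-0.5 * tau] * n, [1.0 + tau] * n,
                             [-0.5 * tau] * n, rhs)


def two_norm(u):
    return math.sqrt(sum(v * v for v in u))


def norm_history(tau, nsteps, ncells=100):
    """2-norms of the Crank-Nicolson solution for square-wave initial data."""
    u = [1.0 if 1.0 / 3.0 < j / ncells < 2.0 / 3.0 else 0.0
         for j in range(1, ncells)]
    norms = [two_norm(u)]
    for _ in range(nsteps):
        u = crank_nicolson_step(u, tau)
        norms.append(two_norm(u))
    return norms


if __name__ == "__main__":
    print(norm_history(10.0, 5))
```

Even with a decay number far above the explicit stability limit, the 2-norm should not increase from one step to the next, although individual values of the solution may oscillate.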

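Exercises 2.3

1. Show that the local truncation error for the Crank-Nicolson scheme is

$$\delta_j^n \approx \frac{\Delta t^2}{24} \frac{\partial^3 u}{\partial t^3} - D \frac{\Delta x^2}{12} \frac{\partial^4 u}{\partial x^4} - D \frac{\Delta t^2}{8} \frac{\partial^4 u}{\partial x^2 \partial t^2} .$$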

2.3.4 Lax Convergence Theorem

The following theorem is fairly general. It applies to consistent stable linear schemes; these schemes do not need to be discretizations of parabolic equations.


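Lax Convergence Theorem 2.3-6: Assume that

1. $\mathbf{u}^{n+1} = Q^{n+1/2} \mathbf{u}^n$ is a scheme to approximate the solution of some linear partial differential equation;

2. given $T$ and $n > 0$, we perform $n$ steps with this scheme, using timesteps $\Delta t^{m+1/2}$ satisfying

$$\forall\, T > 0 \ \ \forall\, n > 0 , \quad \sum_{m=0}^{n-1} \Delta t^{m+1/2} \le T$$

and

$$\forall\, T > 0 \ \ \exists\, \alpha > 0 \ \text{such that} \ \max_{0 \le m < n} \Delta t^{m+1/2} \le T \alpha / n ;$$

3. the scheme is stable, meaning that

$$\exists\, C > 0 \ \ \forall\, n , \quad \| Q^{n+1/2} \| \le 1 + C \Delta t^{n+1/2} ;$$

4. the scheme has order $p$ in time and order $q$ in space, meaning that if $u$ is the exact solution of the partial differential equation, then the local truncation error $\varepsilon^n$ satisfies

$$\exists\, C_t > 0 \ \ \exists\, p > 0 \ \ \exists\, C_x > 0 \ \ \exists\, q > 0 \ \ \forall\, \Delta t^{n+1/2} \ \ \forall\, \Delta x_j$$

$$\varepsilon^{n+1} \equiv \frac{1}{\Delta t^{n+1/2}} \| u(x_j, t^{n+1}) - Q^{n+1/2} u(\cdot, t^n) \| \le C_t (\Delta t^{n+1/2})^p + C_x (\Delta x_j)^q .$$

Then the error in the approximate solution satisfies

$$\| u(\cdot, t^n) - \mathbf{u}^n \| \le e^{CT} \| u(\cdot, 0) - \mathbf{u}^0 \| + \alpha T e^{CT} \left[ C_t \max_n (\Delta t^{n+1/2})^p + C_x \max_j (\Delta x_j)^q \right] .$$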

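Proof: For all timesteps satisfying the assumptions,

\[
\|u(\cdot, t^n) - u^n\| \le \|u(\cdot, t^n) - Q^{n-1/2} u(\cdot, t^{n-1})\| + \|Q^{n-1/2} u(\cdot, t^{n-1}) - Q^{n-1/2} u^{n-1}\|
\le \Delta t^{n-1/2} \varepsilon^{n} + \|Q^{n-1/2}\| \, \|u(\cdot, t^{n-1}) - u^{n-1}\| .
\]

Here the first term is the one-step error, which is the timestep times the local truncation error. We can solve this recurrent inequality to get

\[
\begin{aligned}
\|u(\cdot, t^n) - u^n\|
&\le \left[ \prod_{k=0}^{n-1} \|Q^{k+1/2}\| \right] \|u(\cdot, t^0) - u^0\| + \sum_{\ell=0}^{n-1} \left[ \prod_{k=\ell+1}^{n-1} \|Q^{k+1/2}\| \right] \Delta t^{\ell+1/2} \varepsilon^{\ell+1} \\
&\le \|u(\cdot, t^0) - u^0\| \prod_{k=0}^{n-1} \left( 1 + C \Delta t^{k+1/2} \right) + n \max_\ell \left( \Delta t^{\ell+1/2} \varepsilon^{\ell+1} \right) \prod_{k=0}^{n-1} \left( 1 + C \Delta t^{k+1/2} \right) \\
&\le \|u(\cdot, t^0) - u^0\| \, e^{C \sum_{k=0}^{n-1} \Delta t^{k+1/2}} + e^{C \sum_{k=0}^{n-1} \Delta t^{k+1/2}} \max_{k,j} \, n \Delta t^{k+1/2} \left[ C_t (\Delta t^{k+1/2})^p + C_x (\Delta x_j)^q \right] \\
&\le e^{CT} \|u(\cdot, t^0) - u^0\| + \alpha T e^{CT} \left[ C_t \max_n (\Delta t^{n+1/2})^p + C_x \max_j (\Delta x_j)^q \right] .
\end{aligned}
\]

□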
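The error recurrence in this proof can be explored numerically. The following Python sketch is only an illustration under stated assumptions: uniform timesteps $\Delta t = T/n$ (so that $\alpha = 1$), a constant bound `eps` on the local truncation error, and the worst case in which each inequality holds with equality. The function names are hypothetical and do not come from the text.

```python
import math

def worst_case_error(e0, C, T, n, eps):
    """Iterate e_{m+1} = (1 + C*dt) * e_m + dt * eps for n uniform steps.

    e0  : initial error norm
    C   : stability constant in ||Q|| <= 1 + C*dt
    eps : assumed uniform bound on the local truncation error
    """
    dt = T / n
    e = e0
    for _ in range(n):
        e = (1.0 + C * dt) * e + dt * eps
    return e

def theorem_bound(e0, C, T, eps, alpha=1.0):
    """Right-hand side e^{CT} e0 + alpha T e^{CT} eps of the theorem."""
    return math.exp(C * T) * e0 + alpha * T * math.exp(C * T) * eps

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, worst_case_error(0.01, 2.0, 1.0, n, 1e-3),
              theorem_bound(0.01, 2.0, 1.0, 1e-3))
```

In this sketch the iterated worst-case error stays below the closed-form bound for every `n`, which mirrors the step from the product of factors $1 + C \Delta t$ to the exponential $e^{CT}$.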

Example 2.3-7: The same Taylor series expansions that were used in example 2.3.2.1 for the explicit centered difference scheme can be used to show that the local truncation error in this scheme is

\[
\varepsilon_j^n \equiv \frac{1}{\Delta t^{n+1/2}} \left\{ u(x_j, t^{n+1}) - u(x_j, t^n) - f(x_j, t^n) - \tau^{n+1/2} \left[ u(x_{j+1}, t^n) - 2 u(x_j, t^n) + u(x_{j-1}, t^n) \right] \right\}
\approx \frac{\Delta t^{n+1/2}}{2} \frac{\partial^2 u}{\partial t^2}(x_j, t^n) - \frac{D \Delta x^2}{12} \frac{\partial^4 u}{\partial x^4}(x_j, t^n) .
\]

Recall that in lemma 2.3-3 we showed that the explicit centered difference scheme is stable in the max norm provided that the timesteps are chosen so that the decay number is always at most one half. If the partial derivatives of $u$ that appear in this truncation error are uniformly bounded for all states within the range of the problem of interest, then theorem 2.3-6 proves that the explicit centered difference scheme is first-order accurate in time, and second-order accurate in space.

2.4 Fourier Analysis of Finite Difference Schemes

In this section, we will develop a tool for analyzing linear finite difference schemes. We will use Fourier transforms to study the dissipation and dispersion introduced by linear schemes in solving linear problems on unbounded domains. The Fourier analysis will give us useful information about the inter-relationship between dissipation and dispersion in controlling numerical oscillations. For a second source of some of the information in this section, the reader can consult [78].

2.4.1 Constant Coefficient Equations and Waves

Let us consider the linear partial differential equation

\[
\frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = r u + D \frac{\partial^2 u}{\partial x^2} + f \frac{\partial^3 u}{\partial x^3} \quad \forall x \in \mathbb{R} \ \forall t > 0 \tag{2.4-1a}
\]
\[
u(x, 0) = u_0(x) . \tag{2.4-1b}
\]

In order to understand the behavior of this problem, we will define the Fourier transform of an integrable function to be

\[
\hat{u}(\xi, t^n) = \int_{-\infty}^{\infty} u(x, t^n) e^{-\imath \xi x} \, dx .
\]

It is well-known [73] that if both $u$ and $\hat{u}$ are integrable in $x$ and $\xi$, respectively, then the appropriate inversion formula for the Fourier transform is

\[
u(x, t^n) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{u}(\xi, t^n) e^{\imath \xi x} \, d\xi \tag{2.4-2}
\]

almost everywhere. If we take the Fourier transform in space of (2.4-1a), we obtain an ordinary differential equation that is parameterized by the Fourier variable $\xi$:

\[
\frac{\partial \hat{u}}{\partial t} = \left( [r - D \xi^2] - \imath [c \xi + f \xi^3] \right) \hat{u}(\xi, t) .
\]

The solution of this ordinary differential equation is

\[
\hat{u}(\xi, t) = e^{[r - D \xi^2] t - \imath [c \xi + f \xi^3] t} \, \hat{u}_0(\xi) .
\]

If the initial data has Fourier transform $\hat{u}(\xi, 0) = \alpha \delta(\xi - \beta)$, then the Fourier inversion formula (2.4-2) shows that $u(x, 0) = \frac{\alpha}{2\pi} e^{\imath \beta x}$. Thus the initial data consists of a single wave number $\beta$; a wave number is equal to $2\pi$ over the wave length. It is also easy to use the inverse Fourier transform to find that the solution of (2.4-1) is

\[
u(x, t) = e^{[r - D \beta^2] t - \imath [c \beta + f \beta^3] t} \, u(x, 0) .
\]

It is common to define the frequency

\[
\omega = -(\beta c + \beta^3 f) + \imath (-r + \beta^2 D) ,
\]

so that $u(x, t) = e^{\imath \omega t} u(x, 0)$. Note that the frequency $\omega$ has units of one over time.

We can provide several important examples of frequencies, by considering the equation (2.4-1a) with only one nonzero coefficient and initial data consisting of a single wave number.


advection: If $r = 0$, $D = 0$ and $f = 0$ then the frequency is $\omega = -c \beta$ and the solution of (2.4-1) is $u(x, t) = \alpha e^{\imath \beta (x - ct)}$. In this case, all wave numbers $\beta$ travel with the same speed $c$.

reaction: If $c = 0$, $D = 0$ and $f = 0$ then the frequency of the wave is $\omega = -\imath r$, and the solution of (2.4-1) is $u(x, t) = \alpha e^{rt} e^{\imath \beta x}$. All wave numbers $\beta$ remain stationary, and the amplitude of the wave either grows ($r > 0$) or decays ($r < 0$) in time.

diffusion: If $c = 0$, $r = 0$ and $f = 0$ then the frequency is $\omega = \imath \beta^2 D$ and the solution of (2.4-1) is $u(x, t) = \alpha e^{-\beta^2 D t} e^{\imath \beta x}$. In this case, all wave numbers remain stationary. If $D > 0$ all nonzero wave numbers decay, and large wave numbers decay faster than small wave numbers. If $D < 0$ then all nonzero wave numbers grow, and large wave numbers grow faster than small wave numbers.

dispersion: If $c = 0$, $r = 0$ and $D = 0$ then the frequency of the wave is $\omega = -\beta^3 f$ and the solution of (2.4-1) is $u(x, t) = \alpha e^{\imath \beta (x - \beta^2 f t)}$. This says that different wave numbers travel with different speeds, and large wave numbers travel faster than small wave numbers.
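The four cases above can be checked with a few lines of Python. This is a minimal sketch, assuming the frequency formula $\omega = -(\beta c + \beta^3 f) + \imath(-r + \beta^2 D)$ from this section; the function names `frequency` and `growth_factor` are illustrative and are not part of the text.

```python
import cmath

def frequency(beta, c=0.0, r=0.0, D=0.0, f=0.0):
    """Complex frequency omega for a single wave number beta in (2.4-1a)."""
    return complex(-(beta * c + beta**3 * f), -r + beta**2 * D)

def growth_factor(beta, t, **coeffs):
    """Factor exp(i*omega*t) that multiplies the initial wave at time t."""
    return cmath.exp(1j * frequency(beta, **coeffs) * t)

if __name__ == "__main__":
    cases = [("advection", {"c": 1.0}), ("reaction", {"r": -1.0}),
             ("diffusion", {"D": 1.0}), ("dispersion", {"f": 1.0})]
    for name, coeffs in cases:
        g = growth_factor(2.0, 0.5, **coeffs)
        # modulus shows growth or decay, phase shows propagation
        print(name, abs(g), cmath.phase(g))
```

Advection and dispersion give a factor of modulus one (pure propagation), while reaction and diffusion give a real factor (pure growth or decay).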

2.4.2 Dimensionless Groups

Another collection of interesting partial differential equations involves the time derivative and two other terms in (2.4-1a). There are five interesting cases among the six possibilities:

convection-diffusion: Suppose that $r = 0$ and $f = 0$. Given some useful length $L$ (such as the problem length or the grid cell width), we can define a dimensionless time coordinate $\tau = ct/L$ and a dimensionless spatial coordinate $\eta = (x - ct)/L$. We then change variables by defining $\tilde{u}(\eta, \tau) = u(x, t)$. These lead to the transformed diffusion equation

\[
\frac{\partial \tilde{u}}{\partial \tau} = \frac{D}{cL} \frac{\partial^2 \tilde{u}}{\partial \eta^2} .
\]

Here the ratio $cL/D$ of convection to diffusion is called the Peclet number.

convection-dispersion: If $r = 0$ and $D = 0$ we can define $\tau = ct/L$, $\eta = (x - ct)/L$ and $\tilde{u}(\eta, \tau) = u(x, t)$ to obtain the dispersion equation

\[
\frac{\partial \tilde{u}}{\partial \tau} = \frac{f}{cL^2} \frac{\partial^3 \tilde{u}}{\partial \eta^3} .
\]

The dimensionless ratio $cL^2/f$ of convection to dispersion does not have a commonly used label.

convection-reaction: If $D = 0$ and $f = 0$ we can define $\tau = ct/L$, $\eta = (x - ct)/L$ and $\tilde{u}(\eta, \tau) = u(x, t)$ to obtain the system of ordinary differential equations (parameterized by $\eta$)

\[
\frac{\partial \tilde{u}}{\partial \tau} = \frac{rL}{c} \tilde{u} .
\]

reaction-diffusion: If $c = 0$ and $f = 0$ we can define $\tau = rt$, $\xi = x/L$ and $e^{\tau} \tilde{u}(\xi, \tau) = u(x, t)$ to obtain the diffusion equation

\[
\frac{\partial \tilde{u}}{\partial \tau} = \frac{D}{rL^2} \frac{\partial^2 \tilde{u}}{\partial \xi^2} .
\]

reaction-dispersion: If $c = 0$ and $D = 0$ we can define $\tau = rt$, $\xi = x/L$ and $u(x, t) = e^{\tau} \tilde{u}(\xi, \tau)$ to obtain the dispersion equation

\[
\frac{\partial \tilde{u}}{\partial \tau} = \frac{f}{rL^3} \frac{\partial^3 \tilde{u}}{\partial \xi^3} .
\]
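As a small worked aid, the dimensionless coefficients listed above can be tabulated for given physical parameters. The sketch below is an assumption-laden convenience, not part of the text: the function name and the dictionary keys are hypothetical, and groups whose denominators vanish are simply omitted.

```python
def transformed_coefficients(c=0.0, r=0.0, D=0.0, f=0.0, L=1.0):
    """Coefficients of the transformed equations in this section.

    Keys name the ratio as written in the text; a group is returned
    only when its denominator is nonzero.
    """
    out = {}
    if c != 0.0:
        out["D/(cL)"] = D / (c * L)         # convection-diffusion
        out["f/(cL^2)"] = f / (c * L**2)    # convection-dispersion
        out["rL/c"] = r * L / c             # convection-reaction
        if D != 0.0:
            out["peclet"] = c * L / D       # Peclet number cL/D
    if r != 0.0:
        out["D/(rL^2)"] = D / (r * L**2)    # reaction-diffusion
        out["f/(rL^3)"] = f / (r * L**3)    # reaction-dispersion
    return out

if __name__ == "__main__":
    print(transformed_coefficients(c=2.0, D=0.5, L=1.0))
```

Choosing $L$ as the grid cell width gives the cell Peclet number, which is one of the two choices of length mentioned above.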

2.4.3 Linear Finite Differences and Diffusion

Although Fourier analysis is applicable to general linear partial differential equations, in this section we are interested only in diffusion. Recall that the Fourier transform of the solution of the linear diffusion equation

\[
\frac{\partial u}{\partial t} - D \frac{\partial^2 u}{\partial x^2} = 0
\]

satisfies

\[
\frac{\partial \hat{u}}{\partial t} + D \xi^2 \hat{u} = 0 .
\]

Thus

\[
\hat{u}(\xi, t + \Delta t) = e^{-D \xi^2 \Delta t} \, \hat{u}(\xi, t) ,
\]

so the exact solution merely involves multiplying the Fourier transform by a ratio that depends on the wave number. Let us define $\tau$ to be the (dimensionless) decay number

\[
\tau = \frac{D \Delta t}{\Delta x^2} ,
\]

and $\theta$ the (dimensionless) mesh wave number

\[
\theta = \xi \Delta x .
\]

Then the solution ratio is $e^{-\xi^2 D \Delta t} = e^{-\theta^2 \tau}$.


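Next, let us consider a linear finite difference scheme

\[
\sum_k a_k u_{j+k}^{n+1} = \sum_k b_k u_{j+k}^n \tag{2.4-3}
\]

on a uniform mesh $t^n = n \Delta t$, $x_j = j \Delta x$. We will assume that

\[
\forall 0 \le n \le N = T/\Delta t , \ \forall |j| > J = a/\Delta x , \quad u_j^n = 0 ,
\]

so that the initial data for the numerical scheme has finite support. We will define the finite Fourier transform of this discrete data to be

\[
\hat{u}^n(\xi) = \sum_{j=-J}^{J} u_j^n e^{-\imath j \xi \Delta x} \Delta x .
\]

This finite Fourier transform of the discrete data is a midpoint rule approximation to the Fourier transform of $u(x, t^n)$. Note that the corresponding inversion formula for the finite Fourier transform is

\[
u_j^n = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} \hat{u}^n(\xi) e^{\imath j \xi \Delta x} \, d\xi . \tag{2.4-4}
\]

Also note that if we define the shift operators $S_+$ and $S_-$ by

\[
(S_+ u)_j^n = u_{j+1}^n \quad \text{and} \quad (S_- u)_j^n = u_{j-1}^n ,
\]

then it is easy to see that

\[
\widehat{(S_+ u)}^n = e^{\imath \xi \Delta x} \hat{u}^n \quad \text{and} \quad \widehat{(S_- u)}^n = e^{-\imath \xi \Delta x} \hat{u}^n .
\]

Thus, if we take the finite Fourier transform of the linear finite difference scheme (2.4-3), we obtain

\[
\left[ \sum_k a_k e^{\imath k \xi \Delta x} \right] \hat{u}^{n+1}(\xi) = \left[ \sum_k b_k e^{\imath k \xi \Delta x} \right] \hat{u}^n(\xi) .
\]

Consequently, the finite Fourier transform at the new time satisfies

\[
\hat{u}^{n+1}(\xi) = \frac{\sum_k b_k e^{\imath k \xi \Delta x}}{\sum_k a_k e^{\imath k \xi \Delta x}} \hat{u}^n(\xi) = \frac{\sum_k b_k e^{\imath k \theta}}{\sum_k a_k e^{\imath k \theta}} \hat{u}^n(\xi) \equiv z(\theta) \hat{u}^n(\xi) .
\]

In other words, the scheme amounts to the following approximation to the exponential:

\[
e^{-\tau \theta^2} \approx z(\theta) .
\]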
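The definition of the solution ratio translates directly into code. The following Python sketch assumes the stencil coefficients $a_k$ and $b_k$ are supplied as dictionaries mapping the offset $k$ to the coefficient; this representation and the names `solution_ratio` and `explicit_centered` are illustrative choices, not notation from the text.

```python
import cmath
import math

def solution_ratio(a, b, theta):
    """z(theta) = sum_k b_k e^{i k theta} / sum_k a_k e^{i k theta}.

    a, b : dicts mapping stencil offset k to coefficient a_k or b_k
    """
    num = sum(bk * cmath.exp(1j * k * theta) for k, bk in b.items())
    den = sum(ak * cmath.exp(1j * k * theta) for k, ak in a.items())
    return num / den

def explicit_centered(tau):
    """Stencils of u_j^{n+1} = tau u_{j-1}^n + (1 - 2 tau) u_j^n + tau u_{j+1}^n."""
    return {0: 1.0}, {-1: tau, 0: 1.0 - 2.0 * tau, 1: tau}

if __name__ == "__main__":
    tau = 0.25
    a, b = explicit_centered(tau)
    for theta in (0.1, 0.5, 1.0, math.pi):
        z = solution_ratio(a, b, theta)
        # compare the numerical ratio with the exact decay factor
        print(theta, z.real, math.exp(-tau * theta**2))
```

For small $\theta$ the two columns agree closely; the discrepancy at larger $\theta$ is the error that the rest of this section quantifies.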

We will say that the scheme (2.4-3) is dissipative if and only if $|z| < 1 \ \forall \theta \ne 0$ and dispersive if and only if $\arg(z)/\theta$ depends on $\theta$. (Recall that if the complex number $z$ has the polar form $z = |z| e^{\imath \psi}$, then $\arg(z) \equiv \psi$.) If a scheme is dissipative, then all Fourier modes decay.

In addition, we will say that the scheme is positive if $z$ is real and positive for all $\theta$. If a scheme is positive, then Fourier modes do not reverse sign from one timestep to another.

In order to assess the cumulative effect of numerical dissipation and dispersion over several timesteps, we will compare the numerical solution to the analytical solution at the time required for the unit mesh wave number to decay by a factor of $e$. The number of timesteps required for this decay is $1/\tau$. For general mesh wave numbers, the analytical solution at this time is

\[
\hat{u}(\xi, [n + 1/\tau] \Delta t) = \hat{u}(\xi, n \Delta t) e^{-\xi^2 D \Delta t / \tau} = \hat{u}(\xi, n \Delta t) e^{-\theta^2} ,
\]

and the numerical solution is

\[
\hat{u}^{n + 1/\tau}(\xi) = \hat{u}^n(\xi) z(\theta)^{1/\tau} .
\]

These results give us quantitative measures of the errors introduced by numerical methods. The total numerical dissipation error in the time required for the unit mesh wave number to decay by a factor of $e$ is $|e^{-\theta^2}| - |z(\theta)|^{1/\tau}$. The total numerical dispersion is measured by the phase error $\arg\left( z(\theta)^{1/\tau} \right)$.
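These two error measures are easy to evaluate once $z(\theta)$ is known. The sketch below is a hedged illustration: the function names are hypothetical, and the phase error is computed from the principal value of the complex power, which is an assumption about which branch of $\arg$ is intended.

```python
import cmath
import math

def dissipation_error(z, theta, tau):
    """|e^{-theta^2}| - |z|^{1/tau} for a given solution ratio z = z(theta)."""
    return math.exp(-theta**2) - abs(z) ** (1.0 / tau)

def phase_error(z, tau):
    """arg(z^{1/tau}), using the principal branch of the complex power."""
    return cmath.phase(complex(z) ** (1.0 / tau))

if __name__ == "__main__":
    tau, theta = 0.25, 1.0
    # explicit centered differences: z = 1 - 4 tau sin^2(theta/2)
    z = 1.0 - 4.0 * tau * math.sin(theta / 2.0) ** 2
    print(dissipation_error(z, theta, tau), phase_error(z, tau))
```

A positive dissipation error means that the numerical mode has decayed more than the exact mode over $1/\tau$ steps.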

2.4.4 Fourier Analysis of Finite Difference Schemes

Let us apply our Fourier analysis techniques to some of the schemes we have developed so far.

2.4.4.1 Explicit Centered Differences

The explicit centered difference scheme takes the form

\[
\frac{u_j^{n+1} - u_j^n}{\Delta t} = D \frac{u_{j+1}^n - 2 u_j^n + u_{j-1}^n}{\Delta x^2} .
\]

Taking the finite Fourier transform with respect to $j$, we obtain

\[
\hat{u}^{n+1} - \hat{u}^n = \tau \left[ e^{\imath \theta} - 2 + e^{-\imath \theta} \right] \hat{u}^n .
\]

This implies that the solution ratio is

\[
z(\theta) = 1 + \tau \left[ e^{\imath \theta} - 2 + e^{-\imath \theta} \right] = 1 - 2 \tau [1 - \cos \theta] = 1 - 4 \tau \sin^2(\theta/2) .
\]

Thus the explicit centered difference scheme is dissipative if

\[
-1 < 1 - 4 \tau < 1 \iff 0 < \tau < \frac{1}{2} .
\]
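The dissipativity condition can be confirmed by sampling $z(\theta)$ over mesh wave numbers. This Python sketch is illustrative only: it samples $\theta$ on a uniform grid in $(0, \pi]$, which is a numerical check rather than a proof, and the helper names are hypothetical.

```python
import math

def z_explicit(tau, theta):
    """Solution ratio of explicit centered differences."""
    return 1.0 - 4.0 * tau * math.sin(theta / 2.0) ** 2

def is_dissipative(tau, samples=200):
    """True if |z| < 1 at every sampled theta in (0, pi]."""
    thetas = (math.pi * k / samples for k in range(1, samples + 1))
    return all(abs(z_explicit(tau, th)) < 1.0 for th in thetas)

def is_positive(tau, samples=200):
    """True if z > 0 at every sampled theta in (0, pi]."""
    thetas = (math.pi * k / samples for k in range(1, samples + 1))
    return all(z_explicit(tau, th) > 0.0 for th in thetas)

if __name__ == "__main__":
    for tau in (0.1, 0.2, 0.3, 0.49, 0.5, 1.0):
        print(tau, is_dissipative(tau), is_positive(tau))
```

The worst case is always $\theta = \pi$, where $z = 1 - 4\tau$, so the sampled test reproduces the thresholds $\tau < 1/2$ for dissipation and $\tau < 1/4$ for positivity.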


Figure 2.4-1: Explicit Centered Differences Dissipation Errors; red: τ = 0.1, green: τ = 0.5, blue: τ = 1.0, yellow: τ = 2.0, cyan: τ = 10, magenta: τ = 100, black: stability limit $e^{-\tau \theta^2} - 1$.

Since $z$ is real, the scheme has zero phase error. We also see that the scheme is positive if $0 \le \tau < \frac{1}{4}$. The dissipation error for the explicit centered difference scheme is shown in figure 2.4-1.

2.4.4.2 Implicit Centered Differences

The implicit centered difference scheme takes the form

\[
\frac{u_j^{n+1} - u_j^n}{\Delta t} = D \frac{u_{j+1}^{n+1} - 2 u_j^{n+1} + u_{j-1}^{n+1}}{\Delta x^2} .
\]

This implies that

\[
z = \frac{1}{1 - \tau \left[ e^{\imath \theta} - 2 + e^{-\imath \theta} \right]} = \frac{1}{1 + 4 \tau \sin^2(\theta/2)} .
\]

Thus the implicit centered difference scheme is dissipative for all $\tau > 0$ and positive for all $\tau \ge 0$. Since $z$ is real, the scheme has zero phase error. The dissipation error for the implicit centered difference scheme is shown in figure 2.4-2.

2.4.4.3 Crank-Nicolson

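The Crank-Nicolson scheme takes the form

\[
\frac{u_j^{n+1} - u_j^n}{\Delta t} = D \frac{u_{j+1}^{n+1} - 2 u_j^{n+1} + u_{j-1}^{n+1}}{2 \Delta x^2} + D \frac{u_{j+1}^n - 2 u_j^n + u_{j-1}^n}{2 \Delta x^2} .
\]

This implies that the numerical solution ratio is

\[
z = \frac{1 + \frac{1}{2} \tau \left[ e^{\imath \theta} - 2 + e^{-\imath \theta} \right]}{1 - \frac{1}{2} \tau \left[ e^{\imath \theta} - 2 + e^{-\imath \theta} \right]} = \frac{1 - 2 \tau \sin^2(\theta/2)}{1 + 2 \tau \sin^2(\theta/2)} .
\]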
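The Crank-Nicolson ratio just derived can be compared with the implicit centered difference ratio from the previous subsection. The following Python sketch is an illustration with hypothetical function names; it evaluates both ratios at the highest mesh wave number $\theta = \pi$, where their behavior differs most.

```python
import math

def z_implicit(tau, theta):
    """Solution ratio of implicit centered differences."""
    return 1.0 / (1.0 + 4.0 * tau * math.sin(theta / 2.0) ** 2)

def z_crank_nicolson(tau, theta):
    """Solution ratio of the Crank-Nicolson scheme."""
    mu = 2.0 * tau * math.sin(theta / 2.0) ** 2
    return (1.0 - mu) / (1.0 + mu)

if __name__ == "__main__":
    for tau in (0.1, 0.5, 1.0, 10.0, 100.0):
        print(tau, z_implicit(tau, math.pi), z_crank_nicolson(tau, math.pi))
```

As $\tau$ grows, the implicit ratio at $\theta = \pi$ tends to zero while the Crank-Nicolson ratio tends to $-1$, so both stay inside the unit interval but the Crank-Nicolson mode changes sign on every step and decays slowly.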

This has the form $\frac{1 - \mu}{1 + \mu}$ where $\mu > 0$. Since $-1 < \frac{1 - \mu}{1 + \mu} < 1$ for all $\mu > 0$, it follows that the Crank-Nicolson scheme is dissipative for all $\tau > 0$. Again since $z$ is real, the scheme has zero phase error. However, this scheme is positive only for $\tau < \frac{1}{2}$.

2.4.4.4 Adams-Bashforth in Time, Centered Differences in Space

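A Fourier analysis of second-order Adams-Bashforth time integration coupled with centered differences (see example 2.2-20) leads to

\[
\hat{u}^{n+1}(\xi) = \hat{u}^n(\xi) \left[ 1 - 6 \tau \sin^2\left( \frac{\theta}{2} \right) \right] + \hat{u}^{n-1}(\xi) \, 2 \tau \sin^2\left( \frac{\theta}{2} \right) .
\]

This can be written as a linear recurrence

\[
v^{n+1} \equiv \begin{bmatrix} \hat{u}^{n+1}(\xi) \\ \hat{u}^n(\xi) \end{bmatrix} = \begin{bmatrix} 1 - 6 \eta & 2 \eta \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \hat{u}^n(\xi) \\ \hat{u}^{n-1}(\xi) \end{bmatrix} \equiv A v^n .
\]

Here the matrix $A$ has eigenvalues $\lambda$ satisfying

\[
\lambda^2 - \lambda (1 - 6 \eta) - 2 \eta = 0 ,
\]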

Figure 2.4-2: Implicit Centered Difference Dissipation Errors; red: τ = 0.1, green: τ = 0.5, blue: τ = 1.0, yellow: τ = 2.0, cyan: τ = 10, magenta: τ = 100, black: stability limit $e^{-\tau \theta^2} - 1$.

Figure 2.4-3: Crank-Nicolson Dissipation Errors; red: τ = 0.1, green: τ = 0.5, blue: τ = 1.0, yellow: τ = 2.0, cyan: τ = 10, magenta: τ = 100, black: stability limit $e^{-\tau \theta^2} - 1$.

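where

\[
\eta = \tau \sin^2\left( \frac{\theta}{2} \right) .
\]

Note that $\eta \ge 0$. The solution of this quadratic is

\[
\lambda = \frac{1}{2} \left[ 1 - 6 \eta \pm \sqrt{(1 - 6 \eta)^2 + 8 \eta} \right] .
\]

In order for the scheme to be dissipative, we require that $|\lambda| < 1$; this leads to the inequalities

\[
-2 < 1 - 6 \eta \pm \sqrt{1 - 4 \eta + 36 \eta^2} < 2 ,
\]

or equivalently

\[
-3 + 6 \eta < \pm \sqrt{1 - 4 \eta + 36 \eta^2} < 1 + 6 \eta .
\]

These lead to the inequality

\[
1 - 4 \eta + 36 \eta^2 < \min\left\{ 9 - 36 \eta + 36 \eta^2 , \ 1 + 12 \eta + 36 \eta^2 \right\} ,
\]

which is equivalent to $0 < \eta < 1/4$. Thus the scheme is dissipative for all wave numbers $\theta$ if

\[
\tau < \frac{1}{4} .
\]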
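The eigenvalue bound can be checked numerically. The sketch below evaluates the two roots of $\lambda^2 - \lambda(1 - 6\eta) - 2\eta = 0$ from the formula above; the function names are illustrative, and the sample values of $\eta$ are arbitrary choices on either side of the limit $1/4$.

```python
import math

def ab2_roots(eta):
    """Both roots of lambda^2 - (1 - 6 eta) lambda - 2 eta = 0 (real for eta >= 0)."""
    s = math.sqrt((1.0 - 6.0 * eta) ** 2 + 8.0 * eta)
    return 0.5 * (1.0 - 6.0 * eta + s), 0.5 * (1.0 - 6.0 * eta - s)

def ab2_max_modulus(eta):
    """Largest |lambda| for the given eta."""
    return max(abs(lam) for lam in ab2_roots(eta))

if __name__ == "__main__":
    for eta in (0.05, 0.2, 0.25, 0.3):
        print(eta, ab2_roots(eta), ab2_max_modulus(eta))
```

At $\eta = 1/4$ the roots are $1/2$ and $-1$, so the modulus bound is attained exactly at the limit; for any $\eta > 0$ the product of the roots is $-2\eta < 0$, so one root is negative.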

It is not hard to see that both values for $\lambda$ are real for all $\eta$. It is also easy to see that for all $\eta > 0$ exactly one of the values for $\lambda$ will be negative. As a result, this scheme can never be positive.

2.4.4.5 Backward Differentiation Formula in Time, Centered Differences in Space

Let us consider the Fourier analysis of the second-order backward differentiation formula with centered differences (see example 2.2-22). We obtain

\[
\frac{3}{2} \hat{u}^{n+1}(\xi) - 2 \hat{u}^n(\xi) + \frac{1}{2} \hat{u}^{n-1}(\xi) = -\hat{u}^{n+1}(\xi) \, 4 \tau \sin^2\left( \frac{\theta}{2} \right) .
\]

Writing $\eta = \tau \sin^2(\theta/2)$ and forming a linear recurrence as before, we obtain the quadratic

\[
\left( \frac{3}{2} + 4 \eta \right) \lambda^2 - 2 \lambda + \frac{1}{2} = 0
\]

for the eigenvalues of the matrix in the linear recurrence. The solution of this quadratic is

\[
\lambda = \frac{2 \pm \sqrt{4 - 2 \left( \frac{3}{2} + 4 \eta \right)}}{3 + 8 \eta} = \frac{2 \pm \sqrt{1 - 8 \eta}}{3 + 8 \eta} .
\]

In order for the scheme to be dissipative, we must have $|\lambda| < 1$. We have two cases. If $\eta \le \frac{1}{8}$, this implies the inequalities

\[
-3 - 8 \eta < 2 \pm \sqrt{1 - 8 \eta} < 3 + 8 \eta ,
\]

which can be rewritten

\[
-5 - 8 \eta < \pm \sqrt{1 - 8 \eta} < 1 + 8 \eta .
\]

These inequalities are equivalent to

\[
1 - 8 \eta < \min\left\{ (-5 - 8 \eta)^2 , (1 + 8 \eta)^2 \right\} = \min\left\{ 25 + 80 \eta + 64 \eta^2 , \ 1 + 16 \eta + 64 \eta^2 \right\} .
\]

This is equivalent to

\[
0 < \min\left\{ 24 + 88 \eta + 64 \eta^2 , \ 24 \eta + 64 \eta^2 \right\} ,
\]

and places no further restriction on $\eta$. In the other case, when $\eta > \frac{1}{8}$, $\lambda$ is complex; we require

\[
1 > |\lambda| = \sqrt{ \left( \frac{2}{3 + 8 \eta} \right)^2 + \frac{8 \eta - 1}{(3 + 8 \eta)^2} } = \sqrt{ \frac{1}{3 + 8 \eta} } .
\]

It is obvious that $|\lambda| < 1$ in this case as well. Thus the backward differentiation formula leads to an unconditionally dissipative scheme for the heat equation.

However, this scheme will have a nonzero phase error for $\tau > \frac{1}{8}$. Since $\lambda$ is not real in this case, we say that this scheme is positive only for $\tau < \frac{1}{8}$.
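Both cases of this analysis can be reproduced with complex arithmetic. The following Python sketch uses the root formula $\lambda = (2 \pm \sqrt{1 - 8\eta})/(3 + 8\eta)$ from above; the function name `bdf2_roots` is an illustrative assumption.

```python
import cmath

def bdf2_roots(eta):
    """Roots of (3/2 + 4 eta) lambda^2 - 2 lambda + 1/2 = 0 as complex numbers."""
    s = cmath.sqrt(1.0 - 8.0 * eta)   # imaginary when eta > 1/8
    den = 3.0 + 8.0 * eta
    return (2.0 + s) / den, (2.0 - s) / den

if __name__ == "__main__":
    for eta in (0.05, 0.125, 1.0, 10.0):
        lam_plus, lam_minus = bdf2_roots(eta)
        # modulus < 1 in every case; nonzero phase only when eta > 1/8
        print(eta, abs(lam_plus), abs(lam_minus), cmath.phase(lam_plus))
```

For $\eta > 1/8$ the two roots are complex conjugates with common modulus $\sqrt{1/(3 + 8\eta)}$, which matches the closed form derived above.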

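2.4.4.6 DIRK in Time, Centered Differences in Space

Let us consider the Fourier analysis of the DIRK scheme, described in example 2.2-23. We have

\[
k_1 \Delta t = -\frac{4 \tau \sin^2(\frac{\theta}{2})}{1 + 4 \gamma \tau \sin^2(\frac{\theta}{2})} \hat{u}^n
\]

and

\[
k_2 \Delta t = -\frac{4 \tau \sin^2(\frac{\theta}{2}) \left( \hat{u}^n + (1 - 2 \gamma) k_1 \Delta t \right)}{1 + 4 \gamma \tau \sin^2(\frac{\theta}{2})}
= -\frac{4 \tau \sin^2(\frac{\theta}{2})}{1 + 4 \gamma \tau \sin^2(\frac{\theta}{2})} \cdot \frac{1 - 4 (1 - 3 \gamma) \tau \sin^2(\frac{\theta}{2})}{1 + 4 \gamma \tau \sin^2(\frac{\theta}{2})} \hat{u}^n
\]

where $\gamma = 1 - \sqrt{1/2}$. These lead to

\[
z = 1 + \frac{1}{2} (k_1 + k_2) \Delta t / \hat{u}^n(\xi) = \frac{1 - 4 \tau (1 - 2 \gamma) \sin^2(\frac{\theta}{2})}{\left[ 1 + 4 \gamma \tau \sin^2(\frac{\theta}{2}) \right]^2} .
\]

Note that this scheme is positive provided that the numerator of this expression is nonnegative for all $0 \le \sin^2(\theta/2) \le 1$; this is true if

\[
\tau < \frac{1}{4 (1 - 2 \gamma)} \approx 0.60355 .
\]

It is easy to see that $-1 \le z \le 1$ for all values of $\tau > 0$ and all $0 \le \sin^2(\theta/2) \le 1$. Thus the scheme is unconditionally dissipative.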
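The DIRK solution ratio and its positivity limit can be evaluated directly. This Python sketch assumes the closed form of $z$ given above with $\gamma = 1 - \sqrt{1/2}$; the names `GAMMA`, `z_dirk` and `positivity_limit` are hypothetical.

```python
import math

GAMMA = 1.0 - math.sqrt(0.5)

def z_dirk(tau, theta):
    """Solution ratio of the DIRK scheme with centered differences."""
    s = math.sin(theta / 2.0) ** 2
    return (1.0 - 4.0 * tau * (1.0 - 2.0 * GAMMA) * s) / (1.0 + 4.0 * GAMMA * tau * s) ** 2

def positivity_limit():
    """Largest tau for which the numerator stays nonnegative at theta = pi."""
    return 1.0 / (4.0 * (1.0 - 2.0 * GAMMA))

if __name__ == "__main__":
    print("positivity limit:", positivity_limit())
    for tau in (0.5, 1.0, 10.0, 100.0):
        print(tau, z_dirk(tau, math.pi))
```

For large $\tau$ the ratio at $\theta = \pi$ is negative but small in magnitude, in contrast to Crank-Nicolson, whose ratio approaches $-1$.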

Exercises 2.4.4

1. Make a table of values of the decay number $\tau$ such that the schemes in this section for $\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}$ are dissipative, and the values of $\tau$ for which each is positive. Note whether the scheme approaches the correct steady-state limit as $\Delta t \to \infty$. Finally, note what value of the timestep $\Delta t$ will make the temporal and spatial truncation errors equal.

2. Consider the DuFort and Frankel scheme

\[
\frac{u_j^{n+1} - u_j^{n-1}}{2 \Delta t} = \frac{D}{\Delta x^2} \left[ u_{j+1}^n - u_j^{n+1} - u_j^{n-1} + u_{j-1}^n \right] .
\]

This scheme is second order in both space and time, and is easy to solve for the new solution $u_j^{n+1}$.

(a) Under what circumstances is this scheme dissipative?

(b) Under what circumstances is this scheme positive?

3. Study the Fourier analysis of the Crank-Nicolson scheme with smoothing.

4. Suppose we develop a scheme for the heat equation by developing a rational approximation to the exponential of the form

\[
e^{-x} \approx \frac{1 + a x}{1 + b x + c x^2} .
\]

Find $a$, $b$ and $c$ so that the order of this approximation is as high as possible. Describe the corresponding discrete scheme for the solution of the heat equation. Perform the Fourier analysis of this scheme.

5. Suppose we develop a scheme for the heat equation by developing a rational approximation to the exponential of the form

\[
e^{-x} \approx \frac{1 + a x}{(1 + b x)^2} .
\]

Show that $a = 1 + \sqrt{2}$ and $b = 1 + \sqrt{1/2}$ make the order of this approximation as high as possible. Describe the corresponding discrete scheme for the solution of the heat equation. Perform the Fourier analysis of this scheme to show that the scheme is positive and dissipative for all $\tau > 0$.

6. Consider general linear finite difference schemes for the heat equation, with their corresponding rational polynomial approximations to the exponential. Under what conditions on the rational polynomial is the scheme positive? When is the scheme dissipative? When is the scheme non-dispersive? How can partial fractions decompositions be used to affect the numerical implementation of the scheme for the heat equation?


2.4.5 Lax Equivalence Theorem

The material in this section was adapted from Strikwerda [83].

Previously, we considered the Fourier analysis of linear schemes for diffusion. In this section, we will consider more general linear partial differential equations of the form

\[
P u \equiv \frac{\partial u}{\partial t} - Q\left( \frac{\partial}{\partial x} \right) u = 0 .
\]

Here $Q$ is some linear operator. If we take the Fourier transform of this equation in space, we obtain

\[
p\left( \xi, \frac{\partial}{\partial t} \right) \hat{u}(\xi, t) \equiv \frac{\partial \hat{u}}{\partial t} - q(\xi) \hat{u} = 0
\]

where $q(\xi)$ is whatever comes out of the Fourier transform of the partial differential equation. The function $p(\xi, s)$ is called the symbol of the partial differential equation. Note that

\[
p(\xi, q(\xi)) = 0 .
\]

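Example 2.4-1: The symbol for the linear diffusion equation

\[
\frac{\partial u}{\partial t} - D \frac{\partial^2 u}{\partial x^2} = 0
\]

is $p(\xi, s) = s + D \xi^2$. In this case, $q(\xi) = -D \xi^2$.

Next, suppose that we have a two-level numerical scheme of the form

\[
P_{\Delta x, \Delta t} u^n \equiv \sum_k a_k u_{j+k}^{n+1} - \sum_k b_k u_{j+k}^n = 0
\]

which is assumed to approximate the partial differential equation above. If we take the finite Fourier transform of this scheme, we obtain

\[
\left[ \sum_k a_k e^{\imath k \xi \Delta x} \right] \hat{u}^{n+1}(\xi) - \left[ \sum_k b_k e^{\imath k \xi \Delta x} \right] \hat{u}^n(\xi) = 0 .
\]

In the special case where $u_j^n = e^{s n \Delta t} e^{\imath j \xi \Delta x}$ we have

\[
P_{\Delta x, \Delta t} u^n = p_{\Delta x, \Delta t}(\xi, s) \, u_j^n \equiv \left\{ \left[ \sum_k a_k e^{\imath k \xi \Delta x} \right] e^{s \Delta t} - \left[ \sum_k b_k e^{\imath k \xi \Delta x} \right] \right\} u_j^n .
\]

Here $p_{\Delta x, \Delta t}(\xi, s)$ is called the symbol of the numerical scheme. The solution ratio is the ratio

\[
z(\xi \Delta x) \equiv \frac{\sum_k b_k e^{\imath k \xi \Delta x}}{\sum_k a_k e^{\imath k \xi \Delta x}} = \frac{\hat{u}^{n+1}(\xi)}{\hat{u}^n(\xi)} .
\]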


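Note that $s = \ln z(\xi \Delta x) / \Delta t$ is a zero of $p_{\Delta x, \Delta t}$:

\[
p_{\Delta x, \Delta t}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) = 0 .
\]

Example 2.4-2: The symbol for the explicit centered difference scheme applied to linear diffusion is

\[
p_{\Delta x, \Delta t}(\xi, s) = e^{s \Delta t} - \tau e^{-\imath \xi \Delta x} - (1 - 2 \tau) - \tau e^{\imath \xi \Delta x} = e^{s \Delta t} - 1 + 4 \tau \sin^2\left( \frac{\xi \Delta x}{2} \right)
\]

and the solution ratio is

\[
z(\xi \Delta x) = 1 - 4 \tau \sin^2\left( \frac{\xi \Delta x}{2} \right) .
\]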
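The relation between the symbol and the solution ratio in this example can be verified numerically. In the Python sketch below, the function names and the sample values of $\xi$, $\Delta x$, $\Delta t$ and $D$ are illustrative assumptions; the complex logarithm is used so that the check also works when $z$ is negative.

```python
import cmath
import math

def solution_ratio(xi, dx, tau):
    """z(xi dx) for explicit centered differences."""
    return 1.0 - 4.0 * tau * math.sin(xi * dx / 2.0) ** 2

def scheme_symbol(xi, s, dx, dt, tau):
    """p_{dx,dt}(xi, s) = e^{s dt} - 1 + 4 tau sin^2(xi dx / 2)."""
    return cmath.exp(s * dt) - 1.0 + 4.0 * tau * math.sin(xi * dx / 2.0) ** 2

def scheme_zero(xi, dx, dt, tau):
    """s = ln z / dt, the zero of the scheme symbol."""
    return cmath.log(solution_ratio(xi, dx, tau)) / dt

if __name__ == "__main__":
    D, dx, dt, xi = 1.0, 0.1, 0.002, 3.0
    tau = D * dt / dx**2
    s = scheme_zero(xi, dx, dt, tau)
    print("zero of scheme symbol:", s)
    print("zero of PDE symbol   :", -D * xi**2)
    print("symbol at its zero   :", abs(scheme_symbol(xi, s, dx, dt, tau)))
```

On a reasonably fine mesh the zero of the scheme symbol is close to $q(\xi) = -D\xi^2$, which is the sense of consistency made precise in lemma 2.4-6.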

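For computations that occur below, it will be useful to compute

\[
\frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, s) = \Delta t \, e^{s \Delta t} \sum_k a_k e^{\imath k \xi \Delta x} .
\]

In particular, it will be useful to note that

\[
\frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) = \Delta t \, z(\xi \Delta x) \sum_k a_k e^{\imath k \xi \Delta x} = \Delta t \sum_k b_k e^{\imath k \xi \Delta x} .
\]

Thus the zeros of $\frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, \frac{1}{\Delta t} \ln z(\xi \Delta x))$ are the zeros of the trigonometric polynomial $\sum_k b_k e^{\imath k \xi \Delta x}$, or equivalently, of the solution ratio $z(\xi \Delta x)$ considered as a function of $\xi$.

Example 2.4-3: The solution ratio for the explicit centered scheme applied to linear diffusion with decay constant $D$ is $z(\xi \Delta x) = 1 - 4 \tau \sin^2(\xi \Delta x / 2)$, where $\tau = D \Delta t / \Delta x^2$. In order for $z$ to be zero, we must have $\tau \ge 1/4$. This can happen when the scheme is dissipative, but not positive.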

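Definition 2.4-4: We will say that the scheme $P_{\Delta x, \Delta t} u^n = 0$ is consistent with the partial differential equation $P u = 0$ if and only if

\[
\forall \phi \in C^\infty \ \forall j \in \mathbb{Z} \ \forall n \in \mathbb{Z}_+ \ \forall \varepsilon > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
|P_{\Delta x, \Delta t} \phi(j \Delta x, n \Delta t) - (P \phi)(j \Delta x, n \Delta t)| < \varepsilon .
\]

Definition 2.4-5: We will say that the scheme $P_{\Delta x, \Delta t} u^n = 0$ has order $p$ in time and order $q$ in space if and only if the local truncation error satisfies

\[
\exists p > 0 \ \exists q > 0 \ \forall \phi \in C^\infty \ \forall j \in \mathbb{Z} \ \forall n \in \mathbb{Z}_+ \ \exists C_p > 0 \ \exists C_q > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
|P_{\Delta x, \Delta t} \phi(j \Delta x, n \Delta t) - (P \phi)(j \Delta x, n \Delta t)| \le C_p \Delta t^p + C_q \Delta x^q .
\]

We expect that if the scheme is consistent with the partial differential equation, then the zero $s = \frac{1}{\Delta t} \ln z(\xi \Delta x)$ of $p_{\Delta x, \Delta t}(\xi, s)$ should be close to the zero $s = q(\xi)$ of $p(\xi, s)$. The following lemma discusses one sense in which this is true.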

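Lemma 2.4-6: Suppose that the scheme $P_{\Delta x, \Delta t} u^n = 0$ is consistent with the partial differential equation $P u = 0$. Further, suppose that the symbol $p_{\Delta x, \Delta t}(\xi, s)$ of the scheme is continuously differentiable in $s$, and

\[
\frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0 .
\]

Then $e^{q(\xi) \Delta t} - z(\xi \Delta x) = o(\Delta t)$; in other words,

\[
\forall \xi \ \text{such that} \ \frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0
\]
\[
\forall \delta > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0 \quad |e^{q(\xi) \Delta t} - z(\xi \Delta x)| \le \delta \Delta t .
\]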

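Proof: Since $s = q(\xi)$ is a zero of the symbol $p$ of the differential equation, the definition of consistency with $\phi(x, t) = e^{\imath x \xi} e^{q(\xi) t}$ implies that

\[
\forall j \in \mathbb{Z} \ \forall n \in \mathbb{Z}_+ \ \forall \delta > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
\delta > \left| p_{\Delta x, \Delta t}(\xi, q(\xi)) e^{\imath j \xi \Delta x} e^{q(\xi) n \Delta t} - p(\xi, q(\xi)) e^{\imath j \xi \Delta x} e^{q(\xi) n \Delta t} \right| = |p_{\Delta x, \Delta t}(\xi, q(\xi))| \, e^{q(\xi) n \Delta t} .
\]

Since $s = \frac{1}{\Delta t} \ln z(\xi \Delta x)$ is a zero of the symbol $p_{\Delta x, \Delta t}$ of the scheme,

\[
\begin{aligned}
e^{-q(\xi) n \Delta t} \delta &> \left| p_{\Delta x, \Delta t}(\xi, q(\xi)) - p_{\Delta x, \Delta t}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \right| \\
&= \left| \int_{\ln z(\xi \Delta x) / \Delta t}^{q(\xi)} \frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, s) \, ds \right| \\
&\ge \left| q(\xi) - \frac{1}{\Delta t} \ln z(\xi \Delta x) \right| \min_{s \in \operatorname{int}(\ln z(\xi \Delta x) / \Delta t, \, q(\xi))} \left| \frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, s) \right| .
\end{aligned}
\]

It follows that

\[
\begin{aligned}
\left| \frac{e^{q(\xi) \Delta t} - z(\xi \Delta x)}{\Delta t} \right| &= \frac{1}{\Delta t} \left| \int_{\ln z(\xi \Delta x)}^{q(\xi) \Delta t} e^s \, ds \right| \\
&\le \left| q(\xi) - \frac{1}{\Delta t} \ln z(\xi \Delta x) \right| e^{\max\{ q(\xi) \Delta t , \, \ln z(\xi \Delta x) \}} \\
&< \frac{\delta \, e^{-q(\xi) n \Delta t} e^{\max\{ q(\xi) \Delta t , \, \ln z(\xi \Delta x) \}}}{\min_{s \in \operatorname{int}(\ln z(\xi \Delta x) / \Delta t, \, q(\xi))} \left| \frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, s) \right|} .
\end{aligned}
\]

Now choose $\xi$ so that $\frac{\partial p_{\Delta x, \Delta t}}{\partial s}$ is nonzero, and choose $\varepsilon$. The continuity of $\frac{\partial p_{\Delta x, \Delta t}}{\partial s}$ implies that

\[
\exists \gamma > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0 ; \quad \min_{s \in \operatorname{int}(\ln z(\xi \Delta x) / \Delta t, \, q(\xi))} \left| \frac{\partial p_{\Delta x, \Delta t}}{\partial s}(\xi, s) \right| > \gamma .
\]

Further, the continuity of $q$ and $z$ implies that

\[
\exists \beta > 0 \ \exists n > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 \quad e^{-q(\xi) n \Delta t} e^{\max\{ q(\xi) \Delta t , \, \ln z(\xi \Delta x) \}} < \beta .
\]

Since $\delta$ is arbitrary, we can choose $\delta < \varepsilon \gamma / \beta$ so that the conclusion of the lemma is satisfied. □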

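Corollary 2.4-7: Suppose that the scheme $P_{\Delta x, \Delta t} u^n = 0$ is consistent with the partial differential equation $P u = 0$ of order $p$ in time and $q$ in space. Further, suppose that the symbol $p_{\Delta x, \Delta t}(\xi, s)$ of the scheme is continuously differentiable in $s$, and

\[
\frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0 .
\]

Then $[e^{q(\xi) \Delta t} - z(\xi \Delta x)] / \Delta t = O(\Delta t^p) + O(\Delta x^q)$; in other words,

\[
\forall \xi \ \text{such that} \ \frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0
\]
\[
\exists C_p > 0 \ \exists C_q > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
\left| \frac{e^{q(\xi) \Delta t} - z(\xi \Delta x)}{\Delta t} \right| \le C_p \Delta t^p + C_q \Delta x^q .
\]

Proof: Replace $\varepsilon$ in the previous proof with $C_p \Delta t^p + C_q \Delta x^q$. □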
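The corollary can be illustrated for explicit centered differences, where $q(\xi) = -D\xi^2$ and $z = 1 - 4\tau\sin^2(\xi\Delta x/2)$. The Python sketch below is a numerical illustration under assumptions: the decay number $\tau$ is held fixed at $1/4$ so that $\Delta t$ is proportional to $\Delta x^2$, and the function name is hypothetical.

```python
import math

def scaled_ratio_error(xi, dx, tau, D=1.0):
    """|e^{q dt} - z| / dt for explicit centered differences with dt = tau dx^2 / D."""
    dt = tau * dx * dx / D
    z = 1.0 - 4.0 * tau * math.sin(xi * dx / 2.0) ** 2
    return abs(math.exp(-D * xi * xi * dt) - z) / dt

if __name__ == "__main__":
    xi, tau = 1.0, 0.25
    dx = 0.1
    previous = None
    for _ in range(4):
        err = scaled_ratio_error(xi, dx, tau)
        # with tau fixed, halving dx should divide the error by about 4
        print(dx, err, None if previous is None else previous / err)
        previous = err
        dx /= 2.0
```

With $\tau$ fixed, both $\Delta t$ and $\Delta x^2$ shrink by a factor of four when $\Delta x$ is halved, so the printed ratios near 4 are consistent with first order in time and second order in space.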


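Before going farther, let us recall Parseval's identity for the finite Fourier transform

\[
\frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{u}^n(\xi)|^2 \, d\xi = \sum_{j=-\infty}^{\infty} |u_j^n|^2 \Delta x \equiv \|u^n\|_{\Delta x}^2 . \tag{2.4-5}
\]

The corresponding Parseval identity for the Fourier transform is

\[
\frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat{u}(\xi)|^2 \, d\xi = \int_{-\infty}^{\infty} |u(x)|^2 \, dx . \tag{2.4-6}
\]

Now that we have discussed consistency, let us turn to stability.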

Now that we have discussed consistency, let us turn to stability.

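Definition 2.4-8: We will say that the scheme $P_{\Delta x, \Delta t} u^n = 0$ is a stable finite difference approximation to the partial differential equation $P u = 0$ if and only if

\[
\exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall T > 0 \ \exists C_T > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 \le n \Delta t \le T \ \forall 0 < \Delta x \le \Delta x_0
\]
\[
\|u^n\|_{\Delta x}^2 \equiv \Delta x \sum_{j=-\infty}^{\infty} |u_j^n|^2 \le C_T \Delta x \sum_{j=-\infty}^{\infty} |u_j^0|^2 \equiv C_T \|u^0\|_{\Delta x}^2 .
\]

We expect that if the scheme is stable, then the solution ratio is bounded close to one.

Lemma 2.4-9: Suppose that the scheme $P_{\Delta x, \Delta t} u^n = 0$ is a finite difference approximation to the partial differential equation $P u = 0$, and that the solution ratio $z(\xi \Delta x)$ for the scheme is continuous. Then the scheme is a stable finite difference approximation to the partial differential equation if and only if $z(\xi \Delta x)$ is bounded close to one in the following sense:

\[
\exists K > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall \theta \quad |z(\theta)| \le 1 + K \Delta t . \tag{2.4-7}
\]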

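Proof: First, we will prove that the bounded solution ratio condition implies stability. By Parseval's identity, and the fact that $\hat{u}^{n+1}(\xi) = z(\xi \Delta x) \hat{u}^n(\xi)$, inequality (2.4-7) implies

\[
\|u^n\|_{\Delta x}^2 = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |z(\xi \Delta x)|^{2n} |\hat{u}^0(\xi)|^2 \, d\xi
\le (1 + K \Delta t)^{2n} \|u^0\|_{\Delta x}^2 \le \left[ (1 + K \Delta t)^{T/\Delta t} \right]^2 \|u^0\|_{\Delta x}^2
\le e^{2KT} \|u^0\|_{\Delta x}^2 .
\]

This shows that the scheme is stable.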


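Next, we will show that if inequality (2.4-7) cannot be satisfied, then the scheme is not stable. The negation of (2.4-7) is

\[
\forall K > 0 \ \forall \Delta x_0 > 0 \ \forall \Delta t_0 > 0 \ \exists 0 < \Delta t \le \Delta t_0 \ \exists 0 < \Delta x \le \Delta x_0 \ \exists \theta \quad |z(\theta)| > 1 + K \Delta t .
\]

Since $z$ is continuous,

\[
\forall K > 0 \ \forall \Delta x_0 > 0 \ \forall \Delta t_0 > 0 \ \exists 0 < \Delta t \le \Delta t_0 \ \exists 0 < \Delta x \le \Delta x_0 \ \exists \theta_1 < \theta_2 \ \forall \theta_1 < \theta < \theta_2 \quad |z(\theta)| > 1 + K \Delta t .
\]

Given $K$ and the corresponding $\Delta x$, $\theta_1$ and $\theta_2$, define the finite Fourier transform of our initial data by

\[
\hat{u}^0(\xi) = \begin{cases} \sqrt{2 \pi \Delta x / (\theta_2 - \theta_1)} , & \theta_1 < \xi \Delta x < \theta_2 \\ 0 , & \text{otherwise} \end{cases}
\]

where the factor $2\pi$ is included so that the normalization agrees with (2.4-5). Note that Parseval's identity (2.4-5) implies that

\[
\|u^0\|_{\Delta x}^2 = \frac{1}{2\pi} \int_{\theta_1/\Delta x}^{\theta_2/\Delta x} \frac{2 \pi \Delta x}{\theta_2 - \theta_1} \, d\xi = 1 .
\]

For any $T > 0$ and for $n \Delta t$ near $T$ we have

\[
\|u^n\|_{\Delta x}^2 = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |z(\xi \Delta x)|^{2n} |\hat{u}^0(\xi)|^2 \, d\xi = \int_{\theta_1/\Delta x}^{\theta_2/\Delta x} |z(\xi \Delta x)|^{2n} \frac{\Delta x}{\theta_2 - \theta_1} \, d\xi
\ge (1 + K \Delta t)^{2n} \ge \frac{1}{2} e^{2KT} = \frac{1}{2} e^{2KT} \|u^0\|_{\Delta x}^2 .
\]

Since $K$ is arbitrary, this shows that the negation of the stability definition 2.4-8 holds, namely

\[
\forall \Delta x_0 > 0 \ \forall \Delta t_0 > 0 \ \exists T > 0 \ \forall C_T > 0 \ \exists 0 < \Delta t \le \Delta t_0 \ \exists 0 \le n \Delta t \le T \ \exists 0 < \Delta x \le \Delta x_0
\]
\[
\|u^n\|_{\Delta x}^2 > C_T \|u^0\|_{\Delta x}^2 ,
\]

is satisfied with $C_T < \frac{1}{2} e^{2KT}$. □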

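Our next goal will be to study the connections between consistency, stability and convergence. In order to do so, we will make use of two new devices. The interpolation operator $I_{\Delta x} : L^2(\mathbb{Z} \Delta x) \to L^2(\mathbb{R})$ is defined for any grid function $v_j$ in terms of its finite Fourier transform $\hat{v}$ by

\[
(I_{\Delta x} v)(x) = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} e^{\imath x \xi} \, \hat{v}(\xi) \, d\xi .
\]

This definition and the Fourier inversion formula (2.4-4) imply that

\[
\widehat{I_{\Delta x} v}(\xi) = \begin{cases} \hat{v}(\xi) , & |\xi| \le \pi/\Delta x \\ 0 , & |\xi| > \pi/\Delta x \end{cases}
\]

Also, the truncation operator $T_{\Delta x} : L^2(\mathbb{R}) \to L^2(\mathbb{Z} \Delta x)$ is defined for any $L^2$ function $u(x)$ in terms of its Fourier transform $\hat{u}$ by

\[
(T_{\Delta x} u)_j = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} e^{\imath j \Delta x \xi} \hat{u}(\xi) \, d\xi .
\]

Note that the finite Fourier transform of $T_{\Delta x} u$ satisfies

\[
\forall |\xi| \le \frac{\pi}{\Delta x} , \quad \widehat{T_{\Delta x} u}(\xi) = \hat{u}(\xi) .
\]

Both the interpolation operator and the truncation operator are linear.

These definitions lead to the following simple lemmas.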

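Lemma 2.4-10: If $u_j^n$ is a grid function, then

\[
\|I_{\Delta x} u^n\| = \|u^n\|_{\Delta x} .
\]

Proof: Parseval's identities (2.4-5) and (2.4-6) imply that

\[
\|I_{\Delta x} u^n\|^2 = \int_{-\infty}^{\infty} |(I_{\Delta x} u^n)(x)|^2 \, dx = \frac{1}{2\pi} \int_{-\infty}^{\infty} |\widehat{I_{\Delta x} u^n}(\xi)|^2 \, d\xi
= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{u}^n(\xi)|^2 \, d\xi = \sum_{j=-\infty}^{\infty} |u_j^n|^2 \Delta x = \|u^n\|_{\Delta x}^2 .
\]

□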

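Lemma 2.4-11: Suppose that $u \in L^2(\mathbb{R})$ and the grid function $v_j$ are given. Then

\[
\forall \Delta x > 0 , \quad \|T_{\Delta x} u - v\|_{\Delta x} \le \|u - I_{\Delta x} v\| .
\]

Proof: Using Parseval's identities (2.4-5) and (2.4-6), we compute

\[
\begin{aligned}
\|T_{\Delta x} u - v\|_{\Delta x}^2 &= \Delta x \sum_{j=-\infty}^{\infty} |(T_{\Delta x} u)_j - v_j|^2 = \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\widehat{T_{\Delta x} u}(\xi) - \hat{v}(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{u}(\xi) - \hat{v}(\xi)|^2 \, d\xi \\
&\le \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{u}(\xi) - \hat{v}(\xi)|^2 \, d\xi + \frac{1}{2\pi} \int_{|\xi| > \pi/\Delta x} |\hat{u}(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat{u}(\xi) - \widehat{I_{\Delta x} v}(\xi)|^2 \, d\xi = \|u - I_{\Delta x} v\|^2 .
\end{aligned}
\]

□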

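Lemma 2.4-12: Suppose that $u \in L^2(\mathbb{R})$. Then

\[
\forall \varepsilon > 0 \ \exists \Delta x_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 , \quad \|u - I_{\Delta x}(T_{\Delta x} u)\| < \varepsilon .
\]

Proof: We compute

\[
\begin{aligned}
\|u - I_{\Delta x}(T_{\Delta x} u)\|^2 &= \frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat{u}(\xi) - \widehat{I_{\Delta x}(T_{\Delta x} u)}(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{u}(\xi) - \widehat{T_{\Delta x} u}(\xi)|^2 \, d\xi + \frac{1}{2\pi} \int_{|\xi| > \pi/\Delta x} |\hat{u}(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{|\xi| > \pi/\Delta x} |\hat{u}(\xi)|^2 \, d\xi .
\end{aligned}
\]

Since $u \in L^2(\mathbb{R})$, the right-hand side of this equation tends to zero as $\Delta x \to 0$. □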

These results lead us to the following important theorem.


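Lax Equivalence Theorem 2.4-13: Suppose that the scheme

\[
P_{\Delta x, \Delta t} u^n \equiv \sum_k a_k u_{j+k}^{n+1} - \sum_k b_k u_{j+k}^n = 0
\]

is consistent with the partial differential equation $P u = 0$. We also assume that the solution ratio

\[
z(\xi \Delta x) = \frac{\sum_k b_k e^{\imath k \xi \Delta x}}{\sum_k a_k e^{\imath k \xi \Delta x}}
\]

for the scheme is continuous. Further, suppose that the symbol $p_{\Delta x, \Delta t}(\xi, s)$ of the scheme is continuously differentiable in $s$, and

\[
\frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0 .
\]

We assume that the symbol of the partial differential equation $P$ has the form $p(\xi, s) = s - q(\xi)$ where $q(\xi)$ is continuous, and that the partial differential equation $P u = 0$ is stable, in the sense that

\[
\forall T > 0 \ \exists C_T > 0 \ \forall 0 \le t \le T \ \forall \xi , \quad |e^{q(\xi) t}| \le C_T . \tag{2.4-8}
\]

Finally, we assume that the initial data for the scheme is convergent to the true initial data, in the sense that

\[
\forall u_0 \in L^2(\mathbb{R}) \ \forall \varepsilon > 0 \ \exists \Delta x_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \quad \|u_0 - I_{\Delta x} u^0\| < \varepsilon . \tag{2.4-9}
\]

Then the scheme is convergent, meaning that

\[
\forall u_0 \in L^2(\mathbb{R}) \ \forall \varepsilon > 0 \ \forall T > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 \le n \Delta t \le T \ \forall 0 < \Delta x < \Delta x_0
\]
\[
\|u(\cdot, n \Delta t) - I_{\Delta x} u^n\| < \varepsilon ,
\]

if and only if it is stable.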

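Proof: First, we will prove that stability implies convergence. Given any $u_0 \in L^2(\mathbb{R})$, suppose that the grid function $w_j^n$ satisfies the scheme $P_{\Delta x, \Delta t} w^n = 0$ with initial data $w^0 = T_{\Delta x} u_0$. Using the Parseval identity (2.4-6) we compute

\[
\begin{aligned}
\|u(\cdot, n \Delta t) - I_{\Delta x} w^n\|^2 &= \frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat{u}(\xi, n \Delta t) - \widehat{I_{\Delta x} w^n}(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |e^{q(\xi) n \Delta t} - z(\xi \Delta x)^n|^2 |\hat{u}_0(\xi)|^2 \, d\xi + \frac{1}{2\pi} \int_{|\xi| > \pi/\Delta x} |e^{q(\xi) n \Delta t} \hat{u}_0(\xi)|^2 \, d\xi \\
&\equiv \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{\Delta x}(\xi) \, d\xi
\end{aligned}
\]

where we have defined

\[
\phi_{\Delta x}(\xi) \equiv \begin{cases} |e^{q(\xi) n \Delta t} - z(\xi \Delta x)^n|^2 |\hat{u}_0(\xi)|^2 , & |\xi| < \pi/\Delta x \\ |e^{q(\xi) n \Delta t}|^2 |\hat{u}_0(\xi)|^2 , & |\xi| \ge \pi/\Delta x \end{cases}
\]

Since the scheme is stable, lemma 2.4-9 implies that the solution ratio is bounded in the following sense:

\[
\exists K > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall \theta \quad |z(\theta)| \le 1 + K \Delta t .
\]

This may place our first restrictions on $\Delta x$ and $\Delta t$. Since the scheme is consistent, lemma 2.4-6 implies that

\[
\forall \xi \ \text{such that} \ \frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0 \quad \forall \delta > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
|e^{q(\xi) \Delta t} - z(\xi \Delta x)| \le \delta \Delta t .
\]

This places further restrictions on $\Delta x_0$ and $\Delta t_0$, and possibly a restriction on $\xi$. These last two inequalities imply that

\[
\exists K > 0 \ \forall \delta > 0 \ \forall \xi \ \text{such that} \ \frac{\partial p_{\Delta x, \Delta t}}{\partial s}\left( \xi, \frac{1}{\Delta t} \ln z(\xi \Delta x) \right) \ne 0 \quad \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
|e^{q(\xi) \Delta t}| \le |e^{q(\xi) \Delta t} - z(\xi \Delta x)| + |z(\xi \Delta x)| \le 1 + (K + \delta) \Delta t .
\]

The bounds on $|z(\theta)|$ and $|e^{q(\xi) \Delta t}|$ imply, for the same $K$, $\xi$, $\Delta x$ and $\Delta t$ and for all $n \ge 0$,

\[
\begin{aligned}
|e^{q(\xi) n \Delta t} - z(\xi \Delta x)^n| &= \left| \left( e^{q(\xi) \Delta t} - z(\xi \Delta x) \right) \sum_{k=0}^{n-1} e^{q(\xi)(n-1-k) \Delta t} z(\xi \Delta x)^k \right| \\
&\le |e^{q(\xi) \Delta t} - z(\xi \Delta x)| \, n \, [1 + (K + \delta) \Delta t]^n [1 + K \Delta t]^n \\
&\le \delta \, n \Delta t \, e^{(2K + \delta) n \Delta t} .
\end{aligned}
\]

Thus $\phi_{\Delta x}(\xi) \to 0$ almost everywhere as $\Delta x, \Delta t \to 0$, since under the same quantifiers

\[
\phi_{\Delta x}(\xi) \le \begin{cases} (\delta n \Delta t)^2 e^{2(2K + \delta) n \Delta t} |\hat{u}_0(\xi)|^2 , & |\xi| < \pi/\Delta x \\ e^{2(K + \delta) n \Delta t} |\hat{u}_0(\xi)|^2 , & |\xi| \ge \pi/\Delta x \end{cases}
\]

Lebesgue's dominated convergence theorem implies that

\[
\forall \delta > 0 \ \forall n > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta x \le \Delta x_0 \ \forall 0 < \Delta t \le \Delta t_0
\]
\[
\|u(\cdot, n \Delta t) - I_{\Delta x} w^n\|^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{\Delta x}(\xi) \, d\xi < \delta . \tag{2.4-10}
\]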

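For the general scheme with initial data $u_j^0$ we use the triangle inequality

\[
\|u(\cdot, n \Delta t) - I_{\Delta x} u^n\| \le \|u(\cdot, n \Delta t) - I_{\Delta x} w^n\| + \|I_{\Delta x} w^n - I_{\Delta x} u^n\| . \tag{2.4-11}
\]

Lemma 2.4-10 implies that the second term on the right is $\|I_{\Delta x} w^n - I_{\Delta x} u^n\| = \|w^n - u^n\|_{\Delta x}$. Since both $w_j^n$ and $u_j^n$ are grid functions generated by a stable linear scheme,

\[
\exists K > 0 \ \forall n > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 < \Delta x \le \Delta x_0
\]
\[
\begin{aligned}
\|w^n - u^n\|_{\Delta x}^2 &= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{w}^n(\xi) - \hat{u}^n(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |z(\xi \Delta x)|^{2n} |\hat{w}^0(\xi) - \hat{u}^0(\xi)|^2 \, d\xi \\
&\le \frac{(1 + K \Delta t)^{2n}}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{w}^0(\xi) - \hat{u}^0(\xi)|^2 \, d\xi \\
&\le \frac{e^{2Kn \Delta t}}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} |\hat{w}^0(\xi) - \hat{u}^0(\xi)|^2 \, d\xi \\
&= e^{2Kn \Delta t} \|w^0 - u^0\|_{\Delta x}^2 .
\end{aligned}
\]

Since the grid function $w_j^n$ uses initial data $w^0 = T_{\Delta x} u_0$, lemma 2.4-11 together with inequalities (2.4-11), (2.4-10) and (2.4-9) implies that

\[
\exists K > 0 \ \forall n > 0 \ \exists \Delta x_0 > 0 \ \exists \Delta t_0 > 0 \ \forall 0 < \Delta t \le \Delta t_0 \ \forall 0 < \Delta x \le \Delta x_0
\]
\[
\begin{aligned}
\|u(\cdot, n \Delta t) - I_{\Delta x} u^n\| &\le \|u(\cdot, n \Delta t) - I_{\Delta x} w^n\| + e^{Kn \Delta t} \|T_{\Delta x} u_0 - u^0\|_{\Delta x} \\
&\le \|u(\cdot, n \Delta t) - I_{\Delta x} w^n\| + e^{Kn \Delta t} \|u_0 - I_{\Delta x} u^0\| .
\end{aligned}
\]

We showed in inequality (2.4-10) that for any initial data and any $\varepsilon > 0$ and any $n > 0$ we can choose $\Delta x$ and $\Delta t$ so that the first of the two terms on the right hand side is less than $\varepsilon/2$. Since we assumed in inequality (2.4-9) that the error in the initial data can be chosen to be small, for any initial data and any $\varepsilon > 0$ we can further restrict $\Delta x$ so that the second of these two terms is less than $\varepsilon/2$. This proves the conclusion of our theorem, that stability implies convergence.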

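Next, we will prove that if the scheme is unstable, then it is not convergent. We will do this by constructing initial data $u_0(x)$ so that the numerical solution $w_j^n$, satisfying $P_{\Delta x, \Delta t} w^n = 0$ and $w^0 = T_{\Delta x} u_0$, does not converge to $u(x, t)$.

Note that the negation of the stability condition in lemma 2.4-9 says that the solution ratio satisfies

\[
\forall K > 0 \ \forall \Delta x_0 > 0 \ \forall \Delta t_0 > 0 \ \exists 0 < \Delta t \le \Delta t_0 \ \exists 0 < \Delta x \le \Delta x_0 \ \exists \theta \quad |z(\theta)| > 1 + K \Delta t .
\]

Since the solution ratio $z$ is assumed to be continuous, this negation of stability implies that

\[
\forall K > 0 \ \forall \Delta x_0 > 0 \ \forall \Delta t_0 > 0 \ \exists 0 < \Delta t \le \Delta t_0 \ \exists 0 < \Delta x \le \Delta x_0 \ \exists \xi_K \ \exists \eta_K > 0 \ \forall |\xi - \xi_K| \le \eta_K
\]
\[
|z(\xi \Delta x)| > 1 + \frac{1}{2} K \Delta t .
\]

In particular, we may further restrict the choices as follows:

\[
\forall K \in \mathbb{Z}_+ \ \exists 0 < \Delta t_K < \Delta t_{K-1} \ \exists 0 < \Delta x_K < \Delta x_{K-1} \ \exists \xi_K \ \exists 0 < \eta_K \le 1/K^2 \ \forall |\xi - \xi_K| \le \eta_K
\]
\[
|z(\xi \Delta x)| > 1 + \frac{1}{2} K \Delta t .
\]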

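We now claim that for $K > 1$, the interval $I_K = [\xi_K - \eta_K, \xi_K + \eta_K]$ can be chosen to be disjoint from the previous intervals $I_1, \ldots, I_{K-1}$. Note that this claim is obviously satisfied for $K = 1$. We will prove the claim is true by induction and contradiction. Suppose that $K > 1$ is the first so that $I_K$ cannot be disjoint from the previous intervals. In other words,

\[
\exists K \in \mathbb{Z}_+ \ \exists 0 < \Delta t_K < \Delta t_{K-1} \ \exists 0 < \Delta x_K < \Delta x_{K-1}
\]
\[
\mbox{if } \exists \xi_K \ \exists \eta_K > 0 \ \forall \xi \in [\xi_K - \eta_K, \xi_K + \eta_K] \quad |z(\xi \Delta x)| > 1 + \frac{1}{2} K \Delta t
\]
\[
\mbox{then } \exists J < K \quad [\xi_K - \eta_K, \xi_K + \eta_K] \subset [\xi_J - \eta_J, \xi_J + \eta_J] .
\]

Then we must have the following bound on $z$ outside the union of the previous intervals:

\[
\exists K \in \mathbb{Z}_+ \ \exists 0 < \Delta t_K < \Delta t_{K-1} \ \exists 0 < \Delta x_K < \Delta x_{K-1} \ \forall \xi \notin \cup_{N<K} I_N \quad |z(\xi \Delta x)| \le 1 + K \Delta t_K .
\]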

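Since the scheme is consistent, lemma 2.4-6 implies that

\[
\forall \xi \mbox{ with } \frac{\partial p_{\Delta x,\Delta t}}{\partial s}\left(\xi, \frac{1}{\Delta t} \ln z(\xi \Delta x)\right) \ne 0 \ \ \forall \varepsilon > 0 \ \exists \Delta x_* > 0 \ \exists \Delta t_* > 0 \ \forall 0 < \Delta x \le \Delta x_* \ \forall 0 < \Delta t \le \Delta t_*
\quad \left| \frac{e^{q(\xi)\Delta t} - z(\xi \Delta x)}{\Delta t} \right| \le \varepsilon .
\]

As we noted above, the exclusions on $\xi$ at the zeros of $\partial p_{\Delta x,\Delta t} / \partial s$ are identical with excluding $\xi$ at zeros of $z$; these can be ignored for $\xi \in \cup_{N<K} I_N$, because $z$ is large there. Since $\cup_{N<K} I_N$ is a union of closed bounded intervals, and since $q$ and $z$ are continuous,

\[
\exists \Delta x_* < \Delta x_K \ \exists \Delta t_* < \Delta t_K \ \exists C_* > 0 \ \forall \xi \in \cup_{N<K} I_N \ \forall 0 < \Delta x \le \Delta x_* \ \forall 0 < \Delta t \le \Delta t_*
\quad \left| \frac{e^{q(\xi)\Delta t} - z(\xi \Delta x)}{\Delta t} \right| \le C_* .
\]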

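Since the partial differential equation is assumed to be stable, inequality (2.4-8) implies that

\[
\forall T > 0 \ \exists C_T \ge 1 \ \forall \Delta t > 0 \ \forall 0 \le n \Delta t \le T \ \forall M \ge (C_T^{1/n} - 1)/\Delta t \ \forall \xi
\quad |e^{q(\xi)\Delta t}| = |e^{q(\xi) n \Delta t}|^{1/n} \le C_T^{1/n} \le 1 + M \Delta t .
\]

Thus for sufficiently fine mesh and $\xi \notin \cup_{N<K} I_N$, $|z(\xi \Delta x)|$ is bounded by $1 + K \Delta t$, while for $\xi \in \cup_{N<K} I_N$ we can bound

\[
|z(\xi \Delta x)| \le |e^{q(\xi)\Delta t}| + \Delta t \left| \frac{e^{q(\xi)\Delta t} - z(\xi \Delta x)}{\Delta t} \right| \le (1 + M \Delta t) + C_* \Delta t .
\]

Thus

\[
\exists \Delta x_* < \Delta x_K \ \exists \Delta t_* < \Delta t_K \ \exists C_* > 0 \ \forall \xi \ \forall 0 < \Delta x \le \Delta x_* \ \forall 0 < \Delta t \le \Delta t_*
\quad |z(\xi \Delta x)| \le 1 + \max\{K, C_* + M\} \Delta t .
\]

This contradicts our assumption that the scheme is unstable. Thus the intervals can be chosen to be disjoint.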

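Next, let us use these disjoint intervals to define initial data for the partial differential equation:

\[
u_0(x) = \sum_{K=1}^{\infty} w_K(x)
\]

where the Fourier transform of $w_K$ is given by

\[
\hat w_K(\xi) = \left\{ \begin{array}{ll} \frac{1}{K \sqrt{\eta_K}} , & |\xi - \xi_K| \le \eta_K \\ 0 , & \mbox{otherwise} \end{array} \right.
\]

Note that, since the supports of the $\hat w_K$ are disjoint and $\sum_{K=1}^{\infty} 1/K^2 = \pi^2/6$,

\[
\int_{-\infty}^{\infty} |u_0(x)|^2 \, dx = \sum_{K=1}^{\infty} \frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat w_K(\xi)|^2 \, d\xi = \frac{1}{\pi} \sum_{K=1}^{\infty} \frac{1}{K^2 \eta_K} \eta_K = \frac{1}{\pi} \frac{\pi^2}{6} = \frac{\pi}{6}
\]

so $u_0 \in L^2(\mathbb{R})$.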

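We now claim that the scheme does not converge for this initial data. First, we note that

\[
\forall T > 0 \ \exists n > 0 \ \exists K \ge 1 \quad \frac{T}{2} \le n \Delta t_K \le T \mbox{ and } \frac{C_T - 1}{K} \le \frac{T}{8} .
\]

Next, note that for all $\xi \in [\xi_K - \eta_K, \xi_K + \eta_K]$,

\[
|e^{q(\xi) n \Delta t} - z(\xi \Delta x_K)^n| \ge |z(\xi \Delta x_K)|^n - C_T \ge \left(1 + \frac{1}{2} K \Delta t_K\right)^n - C_T .
\]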

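Lemma 2.4-11 and the inequality $(1 + x)^n \ge 1 + nx$ (which holds for all $x > 0$ and $n \ge 1$) imply that

\begin{align*}
\|I_{\Delta x} u^n - u(\cdot, t^n)\|^2
&\ge \|u^n - T_{\Delta x} u(\cdot, t^n)\|_{\Delta x}^2
 = \frac{1}{2\pi} \int_{-\infty}^{\infty} |z(\xi \Delta x)^n - e^{q(\xi) n \Delta t}|^2 |\hat u_0(\xi)|^2 \, d\xi \\
&= \frac{1}{2\pi} \sum_{N=1}^{\infty} \int_{-\infty}^{\infty} |z(\xi \Delta x)^n - e^{q(\xi) n \Delta t}|^2 |\hat w_N(\xi)|^2 \, d\xi \\
&\ge \frac{1}{2\pi} \int_{\xi_K - \eta_K}^{\xi_K + \eta_K} |z(\xi \Delta x)^n - e^{q(\xi) n \Delta t}|^2 |\hat w_K(\xi)|^2 \, d\xi \\
&\ge \frac{1}{2\pi} \left[ \left(1 + \frac{1}{2} K \Delta t_K\right)^n - C_T \right]^2 \frac{1}{K^2 \eta_K} 2 \eta_K
 = \frac{1}{\pi} \left[ \frac{(1 + \frac{1}{2} K \Delta t_K)^n - C_T}{K} \right]^2 \\
&\ge \frac{1}{\pi} \left[ \frac{1 + \frac{1}{2} K n \Delta t_K - C_T}{K} \right]^2
 \ge \frac{1}{\pi} (T/8)^2 .
\end{align*}

The last inequality holds because $\frac{1}{2} n \Delta t_K \ge T/4$ and $(C_T - 1)/K \le T/8$.

We have shown that there exists initial data $u_0$ so that for any time $T > 0$ there is an error tolerance $\varepsilon = T^2/(64\pi)$ so that for all sufficiently fine mesh the error in the numerical solution at time at most $T$ is greater than $\varepsilon$. This proves that instability implies non-convergence. □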


2.5 Measuring Accuracy and Efficiency

Different numerical schemes have different convergence properties, even when they have the same order of convergence. It is important to compare the performance of numerical schemes, in order to construct efficient numerical methods. For our purposes, we will measure efficiency by comparing the computational time required to achieve a specified numerical accuracy. This means that we will have to determine how to measure the accuracy of numerical methods.

The first difficulty we face in measuring the accuracy of finite difference methods is that our numerical results have point values on a grid, while the solution of the differential equation is defined on an interval in space. We could overcome this problem by restricting the solution of the differential equation to points on the grid, or by extending the numerical solution to all of the problem domain, and then applying standard norms. The truncation operator in (??) and interpolation operator in (??) served these purposes in section ??, in combination with L2 norms in space or on a grid. Instead, we will typically use L2 norms in our comparisons.

In section 2.2 we developed finite difference methods by constructing various approximations to the time and spatial derivatives in the heat equation. The numerical solution values were taken to be approximations to the point values of the solution in equation (2.2-2). Thus it seems reasonable to define the L2 norm

\[
\|u^n - u(\cdot, t^n)\|_2^2 \equiv \sum_{j=0}^{J-1} |u^n_j - u(x_j, t^n)|^2 \frac{1}{J} \qquad \mbox{(2.5-1)}
\]

Alternatively, we could define dimensionless relative errors by dividing the norm above by the corresponding norm of the solution. The norm (2.5-1) will be used to determine the accuracy of our methods.
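The norm (2.5-1) and the relative error mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names `discrete_l2_error` and `relative_l2_error` are hypothetical and are not part of the executables distributed with this text.

```python
import math


def discrete_l2_error(numerical, exact):
    """Discrete L2 error in the sense of (2.5-1).

    numerical[j] plays the role of u^n_j and exact[j] the role of u(x_j, t^n);
    the sum of squared differences is weighted by 1/J.
    """
    if len(numerical) != len(exact):
        raise ValueError("grid functions must have the same length")
    J = len(numerical)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(numerical, exact)) / J)


def relative_l2_error(numerical, exact):
    """Dimensionless error: discrete L2 error divided by the norm of the solution."""
    J = len(exact)
    norm_exact = math.sqrt(sum(b ** 2 for b in exact) / J)
    if norm_exact == 0.0:
        raise ValueError("relative error is undefined for a zero solution")
    return discrete_l2_error(numerical, exact) / norm_exact
```

The weight 1/J makes the norm of a constant grid function independent of the number of grid cells, so errors on different meshes can be compared directly.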

Let us examine the use of this norm for the explicit centered difference scheme. In figure 2.5-1 we show the numerical solution of the heat equation with initial data

\[
u_0(x) = \sin\left(\frac{\pi x}{2}\right) - 10 \sin\left(\frac{3 \pi x}{2}\right) . \qquad \mbox{(2.5-2)}
\]

The numerical solution was computed with a decay number of 0.45 and 20 grid cells until time 0.05. The numerical results are superimposed with the exact solution in the left-hand image. In the right-hand image of the same figure, we also show the L2 norm of the error versus time. The results in figure 2.5-2 were computed with a decay number of 0.05. These results were obtained by running Executable 2.5-1: guiheaterror with initial data equal to two modes, time explicit centered equal to true and decay number equal to 0.45 and tmax equal to 0.05. Note that both decay numbers produce very similar results to the naked eye. We will have to look more carefully to see the differences.

http://www5.math.duke.edu/cgi-bin/startvnc?run=parabolic_guiheaterror


(a) Solution vs. Position (b) Discrete L2 Error in Solution vs. Time

Figure 2.5-1: Explicit Centered Differences for Heat Equation, decay number = 0.45, D = 1, 20 grid cells, initial data given by (2.5-2)

It is also useful to examine how the computational errors behave as the mesh is refined. In figure 2.5-3 we show the results of a mesh refinement study for the explicit upwind scheme. These results used Riemann problem initial data, and a CFL number of 0.9. One interesting observation is that the error is almost exactly proportional to $\Delta x^2$. This is because the stability condition on the explicit centered scheme required us to take $\Delta t \le \Delta x^2/(2D)$, and the local truncation error is $O(\Delta t) + O(\Delta x^2)$. At any rate, this is generally a good indication that the scheme is performing properly. The plot of error versus computational time shows somewhat erratic behavior for coarse mesh, due to the inherent inaccuracy in the available system timing routines. For more refined computations, however, this figure seems to indicate that the error is roughly proportional to the computational time raised to the power 0.25. These results were obtained by running executable 2.5-1 with initial data equal to two modes, explicit centered equal to true, decay number equal to 0.1, diffusion equal to 0.9 and ncells equal to 0. This seemingly nonsensical value for the number of grid cells is a signal for the executable to perform a mesh refinement study.
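A mesh refinement study of this kind can be sketched in a few lines of Python. The sketch below is not the guiheaterror executable. It assumes that the decay number means $D \Delta t / \Delta x^2$, uses the single mode $\sin(\pi x)$ on $(0,1)$ with zero boundary values (so that the exact solution $e^{-D \pi^2 t} \sin(\pi x)$ is known), and the names `heat_error` and `observed_orders` are hypothetical.

```python
import math


def heat_error(ncells, decay=0.1, diffusion=1.0, tmax=0.05):
    """Discrete L2 error of explicit centered differences for u_t = D u_xx.

    Assumptions (for illustration only): domain (0,1), u = 0 at both ends,
    initial data sin(pi x), and decay = D * dt / dx**2.
    """
    dx = 1.0 / ncells
    dt = decay * dx * dx / diffusion
    nsteps = round(tmax / dt)
    x = [j * dx for j in range(ncells + 1)]
    u = [math.sin(math.pi * xj) for xj in x]
    u[0] = u[-1] = 0.0  # enforce the boundary values exactly
    for _ in range(nsteps):
        new = u[:]
        for j in range(1, ncells):
            new[j] = u[j] + decay * (u[j + 1] - 2.0 * u[j] + u[j - 1])
        u = new
    t = nsteps * dt
    amplitude = math.exp(-diffusion * math.pi ** 2 * t)
    exact = [amplitude * math.sin(math.pi * xj) for xj in x]
    # norm (2.5-1): sum over j = 0, ..., J-1 with weight 1/J
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(u[:-1], exact[:-1])) / ncells
    )


def observed_orders(cell_counts):
    """Observed convergence order between successive meshes."""
    errors = [heat_error(n) for n in cell_counts]
    return [
        math.log(errors[k] / errors[k + 1])
        / math.log(cell_counts[k + 1] / cell_counts[k])
        for k in range(len(errors) - 1)
    ]
```

Calling `observed_orders([10, 20, 40])` should return values close to 2, because holding the decay number fixed ties $\Delta t$ to $\Delta x^2$, so the $O(\Delta t) + O(\Delta x^2)$ truncation error is $O(\Delta x^2)$.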

Figure ?? shows the error in the explicit centered scheme, plotted against computational time for several values of the decay number. The explicit centered scheme becomes most efficient for the decay number around 0.1. In other words, this scheme requires less computational time to reach a given level of accuracy at that decay number. The efficiency of the implicit centered scheme for the heat equation seems to be greatest for a decay number around 0.05. Low decay numbers increase the number of timesteps while decreasing the temporal error. On the other hand, high decay numbers reduce the cost of the scheme while increasing the temporal error.

(a) Solution vs. Position (b) Discrete L2 Error in Solution vs. Time

Figure 2.5-2: Explicit Centered Differences Scheme for Heat Equation, decay number = 0.05, D = 1, 20 grid cells, initial data given by (2.5-2)

Our previous examples have involved the heat equation with continuous initial data. Since the analytical solution is smooth, the numerical methods should reach their expected order of accuracy. So, it is not surprising that when we plot the errors for these schemes in figure 2.5-5 we see that all three schemes are second-order accurate. What is interesting is how much more efficient the Crank-Nicolson scheme is, in comparison to the other two schemes. This is because we allowed the Crank-Nicolson scheme to take $\Delta t = O(\Delta x)$, except for the early timesteps when we first require $\Delta t = \Delta x^2/(2D)$ then allow $\Delta t$ to grow by a factor of 1.1.

It is tricky to compare numerical schemes for efficiency. The parameters that make an individual scheme operate efficiently cannot be assumed to be the best parameters for another scheme. Computational times can be affected by programming care and the choice of computing machinery.

Some general observations may apply. It is reasonable to expect that implicit numerical schemes are more efficient than explicit numerical schemes only if the former can take timesteps much larger than the latter for a given level of accuracy. This is because the implicit numerical schemes involve greater numerical cost in solving the linear systems for the implicit treatment.

(a) Log10 Error vs. -Log10 Cell Width (b) Log10 Error vs. Log10 Time

Figure 2.5-3: Refinement Study with Explicit Centered Differences for Heat Equation with decay number 0.1

[Plot "Explicit Centered Efficiency": log10(error) vs. log10(computer time)]

Figure 2.5-4: Refinement Study with Explicit Upwind Scheme for Linear Advection, two modes initial data, black : decay = 0.9; red : decay = 0.5; green : CFL = 0.1; blue : decay = ; yellow : decay = ; purple : decay =

Log(error) vs. Log(number cells) Log(error) vs. Log(computer time)

Figure 2.5-5: Refinement Studies Comparing Schemes for Heat Equation, two fundamental modes in initial data (lower curves: Crank-Nicolson, middle curves: explicit centered differences, upper curves: implicit centered differences)

It is also reasonable to expect that high-order numerical schemes should be more efficient than low-order numerical schemes when high accuracy is required. This is because the low-order scheme produces small errors only by using small mesh widths. Of course, this observation is problem-dependent. These results were obtained by running executable 2.5-1 with scheme equal to lax wendroff.


2.6 Finite Difference Methods in Multiple Dimensions

In general, finite difference methods are tricky to apply to problems in multiple dimensions, unless the problem domains are rectangular. We will present some of the basic ideas in this section, but we will postpone the treatment of more complicated equations until the chapter on finite element methods for parabolic equations.

2.6.1 Unsplit Methods

We will present the basic ideas in two dimensions. The extension to three dimensions should be obvious.

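Suppose that we want to approximate the solution of

\[
\frac{\partial u}{\partial t} = \nabla_x \cdot (\nabla_x u) .
\]

If we use centered differences in space and forward Euler integration in time, we obtain

\[
\frac{u^{n+1}_{ij} - u^n_{ij}}{\Delta t} = \frac{1}{\Delta x_1^2} [u^n_{i+1,j} - 2 u^n_{ij} + u^n_{i-1,j}] + \frac{1}{\Delta x_2^2} [u^n_{i,j+1} - 2 u^n_{ij} + u^n_{i,j-1}] .
\]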

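Since this scheme is explicit, it is easy to evaluate the solution at the new time. Note that the explicit centered difference scheme can be rewritten

\[
u^{n+1}_{ij} = u^n_{ij} \left[ 1 - \frac{2 \Delta t}{\Delta x_1^2} - \frac{2 \Delta t}{\Delta x_2^2} \right] + \frac{\Delta t}{\Delta x_1^2} [u^n_{i+1,j} + u^n_{i-1,j}] + \frac{\Delta t}{\Delta x_2^2} [u^n_{i,j+1} + u^n_{i,j-1}] .
\]

Thus this scheme has a discrete maximum principle if

\[
\Delta t \left( \frac{1}{\Delta x_1^2} + \frac{1}{\Delta x_2^2} \right) \le \frac{1}{2} .
\]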

If $\Delta x_1 = \Delta x_2$, this timestep is half the size of the stable timestep in one dimension.

Alternatively, we could use backward Euler integration in time. This would lead to the scheme

\[
\frac{u^{n+1}_{ij} - u^n_{ij}}{\Delta t} = \frac{1}{\Delta x_1^2} [u^{n+1}_{i+1,j} - 2 u^{n+1}_{ij} + u^{n+1}_{i-1,j}] + \frac{1}{\Delta x_2^2} [u^{n+1}_{i,j+1} - 2 u^{n+1}_{ij} + u^{n+1}_{i,j-1}] .
\]

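This leads to a symmetric linear system for the solution. Ignoring boundary conditions, we see that the linear system has block tridiagonal form

\[
\left[ \begin{array}{cccc}
\ddots & \ddots & & \\
\ddots & A_{j,j} & B_{j,j+1} & \\
 & B_{j,j+1}^\top & A_{j+1,j+1} & \ddots \\
 & & \ddots & \ddots
\end{array} \right]
\left[ \begin{array}{c} \vdots \\ u^{n+1}_j \\ u^{n+1}_{j+1} \\ \vdots \end{array} \right]
=
\left[ \begin{array}{c} \vdots \\ u^n_j + b^{n+1}_j \\ u^n_{j+1} + b^{n+1}_{j+1} \\ \vdots \end{array} \right] .
\]

Here

\[
u^{n+1}_j \equiv \left[ \begin{array}{c} u^{n+1}_{1j} \\ \vdots \\ u^{n+1}_{Ij} \end{array} \right]
\]

is the vector of unknowns in the j'th row of the grid at the new time. The vectors $b^{n+1}_j$ correspond to the specified boundary values of $u^{n+1}_j$. The diagonal blocks in this linear system have the form

\[
A_{j,j} \equiv \left[ \begin{array}{cccc}
\alpha_1 & \alpha_2 & & \\
\alpha_2 & \alpha_1 & \alpha_2 & \\
 & \alpha_2 & \alpha_1 & \ddots \\
 & & \ddots & \ddots
\end{array} \right]
\mbox{ where } \alpha_1 = 1 + \frac{2 \Delta t}{\Delta x_1^2} + \frac{2 \Delta t}{\Delta x_2^2} \mbox{ and } \alpha_2 = - \frac{\Delta t}{\Delta x_1^2} .
\]

The off-diagonal blocks in this linear system have the form

\[
B_{j,j+1} \equiv \left[ \begin{array}{ccc}
\beta & & \\
 & \beta & \\
 & & \ddots
\end{array} \right]
\mbox{ where } \beta = - \frac{\Delta t}{\Delta x_2^2} .
\]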

The implicit centered difference scheme is unconditionally stable and dissipative. However, it also involves the solution of a large and sparse linear system. If the system is large enough, methods such as Gaussian factorization are inefficient, because they require both large computer memory (to store the factors) and computer time (because they perform a large number of operations). Typically, it is more efficient to apply an iterative method to solve this linear system. We will study these methods in the next chapter.
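The explicit scheme of this subsection, together with its maximum principle timestep restriction, can be sketched as follows. This is a minimal illustration in plain Python. The names `max_stable_dt` and `explicit_step` are hypothetical, the grid is stored as a list of rows, and the boundary values are simply held fixed (an assumption, since boundary conditions were ignored above).

```python
def max_stable_dt(dx1, dx2):
    """Largest dt with dt * (1/dx1^2 + 1/dx2^2) <= 1/2."""
    return 0.5 / (1.0 / dx1 ** 2 + 1.0 / dx2 ** 2)


def explicit_step(u, dt, dx1, dx2):
    """One forward Euler, centered difference step for u_t = div grad u.

    u[i][j] holds the value at grid point (i, j); entries on the edge of
    the array are treated as fixed boundary values.
    """
    r1 = dt / dx1 ** 2
    r2 = dt / dx2 ** 2
    new = [row[:] for row in u]
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new[i][j] = (
                u[i][j] * (1.0 - 2.0 * r1 - 2.0 * r2)
                + r1 * (u[i + 1][j] + u[i - 1][j])
                + r2 * (u[i][j + 1] + u[i][j - 1])
            )
    return new
```

When `dt` does not exceed `max_stable_dt(dx1, dx2)`, every coefficient in the update is nonnegative and the coefficients sum to one, so each new value is a weighted average of old values and cannot leave the range of the old data.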

2.6.2 Operator Splitting Methods

There are some alternatives to solving large linear systems for multi-dimensional diffusion equations, provided that the diffusion is isotropic and the domain is rectangular. For more information about these methods, see Marchuk [63].

2.6.2.1 Stabilization Scheme

Suppose that we want to solve

\[
\frac{\partial u}{\partial t} + A u = f
\]

and that

\[
A = A_1 + A_2
\]

where $A_1 \ge 0$ and $A_2 \ge 0$. For example, if

\[
A(u) = - \nabla_x \cdot \nabla_x u
\]

in two dimensions, we could take

\[
A_1(u) = - \frac{\partial^2 u}{\partial x_1^2} , \quad A_2(u) = - \frac{\partial^2 u}{\partial x_2^2} .
\]

On a rectangular grid, the matrices $A_1$ and $A_2$ would correspond to second-order differences in the separate coordinate directions.

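Recall that the Crank-Nicolson scheme takes the form

\[
\frac{u^{n+1} - u^n}{\Delta t} + \frac{1}{2} A [u^{n+1} + u^n] = f^{n+1/2} .
\]

This can be rewritten in the form

\[
\left( I + A \frac{\Delta t}{2} \right) \frac{u^{n+1} - u^n}{\Delta t} + A u^n = f^{n+1/2} .
\]

The stabilization method is the operator splitting method that takes the form

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) \frac{u^{n+1} - u^n}{\Delta t} + A u^n = f^{n+1/2} .
\]

Since

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) = I + A \frac{\Delta t}{2} + A_1 A_2 \frac{\Delta t^2}{4} ,
\]

the stabilization method agrees with the Crank-Nicolson scheme to second-order accuracy. It is common to implement the stabilization scheme in the form

\begin{align*}
& v^n = f^{n+1/2} - A u^n \\
& \mbox{solve } \left( I + A_1 \frac{\Delta t}{2} \right) w^{n+1/2} = v^n \\
& \mbox{solve } \left( I + A_2 \frac{\Delta t}{2} \right) w^{n+1} = w^{n+1/2} \\
& u^{n+1} = u^n + w^{n+1} \Delta t .
\end{align*}
(2.6-1)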
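The four steps in (2.6-1) can be sketched for the two-dimensional heat equation as follows. This is an illustration under stated assumptions, not a reference implementation: $A_1$ and $A_2$ are taken to be the negative centered second differences in each direction with zero boundary values, only the interior unknowns are stored, and all function names are hypothetical. Each solve involves only constant-coefficient tridiagonal systems, handled here by the Thomas algorithm.

```python
import math


def solve_tridiagonal(off, diag, rhs):
    """Thomas algorithm for a tridiagonal system with constant diagonal
    `diag` and constant sub/super-diagonal `off`."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for k in range(1, n):
        denom = diag - off * c[k - 1]
        c[k] = off / denom
        d[k] = (rhs[k] - off * d[k - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x


def apply_A(u, dx1, dx2):
    """A u = (A1 + A2) u, with zero values outside the stored unknowns."""
    n1, n2 = len(u), len(u[0])

    def val(i, j):
        return u[i][j] if 0 <= i < n1 and 0 <= j < n2 else 0.0

    return [
        [
            -(val(i + 1, j) - 2.0 * u[i][j] + val(i - 1, j)) / dx1 ** 2
            - (val(i, j + 1) - 2.0 * u[i][j] + val(i, j - 1)) / dx2 ** 2
            for j in range(n2)
        ]
        for i in range(n1)
    ]


def stabilization_step(u, f, dt, dx1, dx2):
    """One step of (2.6-1): v = f - A u, two tridiagonal sweeps, update."""
    n1, n2 = len(u), len(u[0])
    Au = apply_A(u, dx1, dx2)
    v = [[f[i][j] - Au[i][j] for j in range(n2)] for i in range(n1)]
    c1 = 0.5 * dt / dx1 ** 2  # (I + A1 dt/2) has diagonal 1+2*c1, off-diagonal -c1
    c2 = 0.5 * dt / dx2 ** 2
    w_half = [[0.0] * n2 for _ in range(n1)]
    for j in range(n2):  # sweep in the first coordinate direction
        col = solve_tridiagonal(-c1, 1.0 + 2.0 * c1, [v[i][j] for i in range(n1)])
        for i in range(n1):
            w_half[i][j] = col[i]
    # sweep in the second coordinate direction
    w = [solve_tridiagonal(-c2, 1.0 + 2.0 * c2, w_half[i]) for i in range(n1)]
    return [[u[i][j] + dt * w[i][j] for j in range(n2)] for i in range(n1)]
```

A convenient check is to apply one step to a discrete sine mode with $f = 0$: the mode is an eigenvector of both $A_1$ and $A_2$, so the step should multiply it by $1 - \Delta t (\lambda_1 + \lambda_2) / [(1 + \lambda_1 \Delta t / 2)(1 + \lambda_2 \Delta t / 2)]$, where $\lambda_1$ and $\lambda_2$ are the corresponding eigenvalues.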

First, let us examine the stability of this scheme when f = 0.


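Lemma 2.6-1: Suppose that $A_1$ and $A_2$ are symmetric and nonnegative. Given $u^0$, define the vectors $u^n$ for $n > 0$ by the stabilization scheme

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) \frac{u^{n+1} - u^n}{\Delta t} + A u^n = 0 .
\]

Then the scheme is stable, in the following sense:

\[
\forall n > 0 , \quad \left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^n \right\| \le \left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^0 \right\|
\]

where $\| \cdot \|$ represents the 2-norm.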

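Proof: We can rewrite the scheme in the form

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) u^{n+1} = \left( I - A_1 \frac{\Delta t}{2} \right) \left( I - A_2 \frac{\Delta t}{2} \right) u^n .
\]

If we define

\[
z^n = \left( I + A_2 \frac{\Delta t}{2} \right) u^n ,
\]

then we obtain

\[
z^{n+1} = \left[ \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} \left( I - A_1 \frac{\Delta t}{2} \right) \right] \left[ \left( I - A_2 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right)^{-1} \right] z^n \equiv T_1 T_2 z^n .
\]

Here we have used the notation

\[
T_1 = \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} \left( I - A_1 \frac{\Delta t}{2} \right) , \quad
T_2 = \left( I + A_2 \frac{\Delta t}{2} \right)^{-1} \left( I - A_2 \frac{\Delta t}{2} \right)
\]

as well as the fact that $I - A_2 \frac{\Delta t}{2}$ and $\left( I + A_2 \frac{\Delta t}{2} \right)^{-1}$ commute. Since $A_1 \ge 0$ and $A_2 \ge 0$, lemma 2.3-5 shows that $\|T_1\| \le 1$ and $\|T_2\| \le 1$. It follows that

\[
\left\| (I + A_2 \Delta t/2) u^{n+1} \right\| = \|z^{n+1}\| \le \|T_1\| \|T_2\| \|z^n\| \le \|z^n\| = \left\| (I + A_2 \Delta t/2) u^n \right\| .
\]
□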
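The first step of the proof above rests on a short algebraic identity that is easy to verify by hand. Multiplying the scheme by $\Delta t$ and moving the terms in $u^n$ to the right hand side leaves the matrix below, which factors because $A = A_1 + A_2$. The expansion is written out here as a worked check, under the same assumptions as the lemma:

```latex
\begin{align*}
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) - A \Delta t
&= I + (A_1 + A_2) \frac{\Delta t}{2} + A_1 A_2 \frac{\Delta t^2}{4} - A \Delta t \\
&= I - (A_1 + A_2) \frac{\Delta t}{2} + A_1 A_2 \frac{\Delta t^2}{4} \\
&= \left( I - A_1 \frac{\Delta t}{2} \right) \left( I - A_2 \frac{\Delta t}{2} \right) .
\end{align*}
```

Note that the order of the factors is preserved throughout, so the identity does not require $A_1$ and $A_2$ to commute.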

It is straightforward to prove the stability of the inhomogeneous scheme as well.


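Lemma 2.6-2: Suppose that $A_1$ and $A_2$ are symmetric and nonnegative. Given $u^0$, define the vectors $u^n$ for $n > 0$ by the stabilization scheme

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) \frac{u^{n+1} - u^n}{\Delta t} + A u^n = f^{n+1/2} .
\]

Then the scheme is stable, in the following sense:

\[
\forall n > 0 , \quad \left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^n \right\| \le \left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^0 \right\| + \sum_{k=1}^{n} \left\| \left( I + A_1 \frac{\Delta t}{2} \right) f^{k-1/2} \right\| \Delta t
\]

where $\| \cdot \|$ represents the 2-norm.

Proof: First, we write the inhomogeneous scheme in the form

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) \frac{u^{n+1} - u^n}{\Delta t} + A u^n = f^{n+1/2} ,
\]

which leads to the form

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) u^{n+1} = \left( I - A_1 \frac{\Delta t}{2} \right) \left( I - A_2 \frac{\Delta t}{2} \right) u^n + f^{n+1/2} \Delta t .
\]

If we define

\[
z^n = \left( I + A_2 \frac{\Delta t}{2} \right) u^n ,
\]

then this is equivalent to

\[
z^{n+1} = T_1 T_2 z^n + \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} f^{n+1/2} \Delta t .
\]

It follows that

\begin{align*}
\|z^{n+1}\|
&\le \|T_1\| \|T_2\| \|z^n\| + \left\| \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} \right\| \|f^{n+1/2}\| \Delta t
 \le \|z^n\| + \|f^{n+1/2}\| \Delta t \\
&\le \|z^n\| + \left\| \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} \right\| \left\| \left( I + A_1 \frac{\Delta t}{2} \right) f^{n+1/2} \right\| \Delta t
 \le \|z^n\| + \left\| \left( I + A_1 \frac{\Delta t}{2} \right) f^{n+1/2} \right\| \Delta t .
\end{align*}

In other words,

\[
\left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^{n+1} \right\| \le \left\| \left( I + A_2 \frac{\Delta t}{2} \right) u^n \right\| + \left\| \left( I + A_1 \frac{\Delta t}{2} \right) f^{n+1/2} \right\| \Delta t .
\]

Using induction, we obtain the claimed result. □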

Example 2.6-3: The Peaceman-Rachford scheme for

\[
\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2}
\]

can be written via two discretizations:

\[
\frac{u^{n+1}_{ij} - u^n_{ij}}{\Delta t} = \frac{1}{\Delta x_1^2} [u^{n+1}_{i+1,j} - 2 u^{n+1}_{i,j} + u^{n+1}_{i-1,j}] + \frac{1}{\Delta x_2^2} [u^n_{i,j+1} - 2 u^n_{i,j} + u^n_{i,j-1}]
\]

and

\[
\frac{u^{n+1}_{ij} - u^n_{ij}}{\Delta t} = \frac{1}{\Delta x_1^2} [u^n_{i+1,j} - 2 u^n_{i,j} + u^n_{i-1,j}] + \frac{1}{\Delta x_2^2} [u^{n+1}_{i,j+1} - 2 u^{n+1}_{i,j} + u^{n+1}_{i,j-1}] .
\]

The former is implicit in spatial differences in the first coordinate direction only, while the latter is implicit in the second direction. Thus we need only solve tridiagonal linear systems to apply either of the schemes. Neither discretization by itself is unconditionally stable. However, if we apply the first scheme for $\frac{1}{2} \Delta t$ and then apply the second scheme for $\frac{1}{2} \Delta t$, the resulting composite scheme is second-order accurate, and unconditionally dissipative and stable. The Peaceman-Rachford scheme is the stabilization scheme that takes $A_1$ to correspond to the second-difference in the first coordinate direction, and $A_2$ to correspond to the second-difference in the second coordinate direction.

2.6.2.2 Predictor-Corrector Scheme

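In this subsection we will consider a different scheme that implements operator splitting. The predictor-corrector splitting scheme takes the form

\begin{align*}
& \mbox{solve } \left( I + A_1 \frac{\Delta t}{2} \right) u^{n+1/4} = u^n + f^{n+1/2} \frac{\Delta t}{2} \\
& \mbox{solve } \left( I + A_2 \frac{\Delta t}{2} \right) u^{n+1/2} = u^{n+1/4} \\
& u^{n+1} = u^n + \left( f^{n+1/2} - A u^{n+1/2} \right) \Delta t
\end{align*}
(2.6-2)

This can be rewritten

\begin{align*}
\frac{u^{n+1/4} - u^n}{\Delta t/2} + A_1 u^{n+1/4} &= f^{n+1/2} \\
\frac{u^{n+1/2} - u^{n+1/4}}{\Delta t/2} + A_2 u^{n+1/2} &= 0 \\
\frac{u^{n+1} - u^n}{\Delta t} + A u^{n+1/2} &= f^{n+1/2}
\end{align*}

or

\[
\frac{u^{n+1} - u^n}{\Delta t} + A \left( I + A_2 \frac{\Delta t}{2} \right)^{-1} \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} \left( u^n + f^{n+1/2} \frac{\Delta t}{2} \right) = f^{n+1/2} .
\]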


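It is not hard to see that this scheme agrees with the Crank-Nicolson scheme to second-order. If we define

\[
y^n = \left( I + A_2 \frac{\Delta t}{2} \right)^{-1} \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} u^n
\]

and

\[
g^{n+1/2} = f^{n+1/2} - \frac{\Delta t}{2} A \left( I + A_2 \frac{\Delta t}{2} \right)^{-1} \left( I + A_1 \frac{\Delta t}{2} \right)^{-1} f^{n+1/2} ,
\]

then we can rewrite

\[
\left( I + A_1 \frac{\Delta t}{2} \right) \left( I + A_2 \frac{\Delta t}{2} \right) \frac{y^{n+1} - y^n}{\Delta t} + A y^n = g^{n+1/2} .
\]

This has the same form as the stabilization method, so the predictor-corrector scheme is stable.

2.6.2.3 Extensions to Higher Splittings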

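Suppose that

\[
A = \sum_{j=1}^{m} A_j
\]

where $A_j \ge 0$ for all $j$. Then the generalization of the stabilization method is

\begin{align*}
& v^{n+0/m} = f^{n+1/2} - A u^n \\
& \mbox{for } j = 1, \ldots, m \mbox{ solve } \left( I + A_j \frac{\Delta t}{2} \right) v^{n+j/m} = v^{n+(j-1)/m} \\
& u^{n+1} = u^n + v^{n+m/m} \Delta t
\end{align*}
(2.6-3)

The generalization of the predictor-corrector splitting scheme is

\begin{align*}
& v^{n+0/m} = u^n + f^{n+1/2} \frac{\Delta t}{2} \\
& \mbox{for } j = 1, \ldots, m \mbox{ solve } \left( I + A_j \frac{\Delta t}{2} \right) v^{n+j/m} = v^{n+(j-1)/m} \\
& u^{n+1} = u^n + (f^{n+1/2} - A v^{n+m/m}) \Delta t
\end{align*}
(2.6-4)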

Exercises 2.6

1. Suppose that we want to solve

\[
\frac{\partial u}{\partial t} + \nabla_x \cdot \nabla_x u = 0 , \quad 0 < x_1, x_2 < 1 , \quad 0 < t .
\]

Discretize in space by centered second-order differences.

(a) Write the spatial difference matrix $A$ for this scheme as a sum of matrices $A_i$ such that $x^\top A_i x \ge 0$ for all $x$.

(b) Write the stabilization scheme for this problem. How would you solve the linear systems?

(c) Write the predictor-corrector splitting scheme for this problem.

(d) Perform a Fourier stability analysis for the stabilization scheme.

2. Suppose that we want to solve

\[
\frac{\partial u}{\partial t} + \nabla_x \cdot D \nabla_x u = 0 , \quad 0 < x_1, x_2 < 1 , \quad 0 < t ,
\]

where $D = Q \Lambda Q^\top$ is positive definite, and $Q$ is constant in space. Since $D = q_1 \lambda_1 q_1^\top + q_2 \lambda_2 q_2^\top$, discuss how to split the spatial operator into a sum of two nonnegative operators. How would you solve the linear systems in the splitting schemes of this section?

3. Suppose that we want to solve the problem above, but $Q$ varies in space. Can you formulate a splitting scheme that involves inexpensive linear system solves?

4. Suppose that we want to solve

\[
\frac{\partial u}{\partial t} + \nabla_x \cdot \nabla_x u = f(u)
\]

on a non-rectangular domain. Can you find a stable second-order splitting scheme that will allow us to split the problem into a pure diffusion and a pure reaction?

Chapter 3

Iterative Linear Algebra

3.1 Relative Efficiency of Implicit Computations

As we saw in chapter 2, implicit schemes for parabolic equations allow us to take larger stable timesteps than explicit schemes. However, implicit schemes require the solution of (possibly large) systems of equations. Thus implicit schemes will be more efficient than explicit schemes only if the cost of solving the linear system is less than the cost of taking the extra explicit timesteps.

Lemma 3.1-1: Suppose that we have a choice of two schemes for solving a linear time-dependent partial differential equation: an implicit scheme choosing $\Delta t = O(\Delta x^i)$ and an explicit scheme choosing $\Delta t = O(\Delta x^e)$, where $e \ge i$. Also suppose that the number of unknowns in the linear system for the implicit scheme is $M = O(\Delta x^{-D})$, where $D$ is the number of spatial dimensions in the problem. Finally, suppose that the implicit method solves the linear system by means of an iteration that costs $O(M)$ operations per iteration and on the order of $(1/\Delta x)^p$ iterations to reach the same order of accuracy as the explicit scheme. Then in order for the implicit scheme to use a lower order of work than the explicit scheme as $\Delta x \to 0$, we must have

p < e− i .

Proof: The explicit scheme will take on the order of a factor of $(1/\Delta x)^{e-i}$ more timesteps to catch up with a single implicit timestep. The explicit scheme performs $O(\Delta x^{-D})$ operations per timestep. Thus the total work for the explicit scheme to take enough timesteps to catch up with one timestep of an implicit scheme is $O((1/\Delta x)^{D+e-i})$. On the other hand, the total work in one timestep of the implicit method is $O((1/\Delta x)^{D+p})$. In order for the implicit method to require a lower order of work than the explicit method, we require it to have a smaller exponent: $D + p < D + e - i$. □

Note, for example, that explicit centered differences takes $e = 2$, implicit centered differences takes $i = 2$ and Crank-Nicolson takes $i = 1$.
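The bookkeeping in the lemma is simple enough to tabulate by machine. The following sketch merely restates the exponents from the proof. The function names are hypothetical, and the value of `p` for any particular linear solver has to be supplied by the user.

```python
def work_exponents(D, p, e, i):
    """Exponents of 1/dx in the work needed to advance by one implicit timestep.

    explicit: (1/dx)**(e - i) steps, each costing (1/dx)**D operations
    implicit: (1/dx)**p iterations, each costing (1/dx)**D operations
    """
    return {"explicit": D + e - i, "implicit": D + p}


def implicit_has_lower_order_work(p, e, i):
    """Criterion of Lemma 3.1-1: the implicit scheme wins only if p < e - i."""
    return p < e - i
```

For instance, with forward Euler (`e = 2`) against Crank-Nicolson (`i = 1`), a solver whose iteration count is independent of the mesh (`p = 0`) satisfies the criterion, while the same solver paired with backward Euler (`i = 2`) only ties the explicit scheme in order of work.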

This lemma says that in order for the implicit scheme with an iterative solver to be competitive with the explicit time integration method, the number of iterations should be at worst $O((1/\Delta x)^p) \le O((1/\Delta x)^{e-i})$. In particular, note that if the implicit scheme takes timesteps of the same order in $\Delta x$ as the explicit scheme, then the linear solver for the implicit scheme must use a number of iterations per timestep that is independent of the mesh width $\Delta x$, in order to be competitive with explicit time integration. Of course, the work in solving the linear system of equations in an implicit scheme depends on the linear solver.

Example 3.1-2: If we were to use a standard Gaussian factorization algorithm for the solver, we would perform $O((1/\Delta x)^{3D})$ operations to solve the linear system; in other words, the effective $p$ would satisfy $D + p = 3D$, which implies that $p = 2D$. If we were to perform this factorization in each timestep, then it would be foolish to use the implicit scheme, because we would not expect $2D = p < e - i$ for any of the simple finite difference schemes we have studied.

Example 3.1-3: After Gaussian factorization, the solution of a linear system costs $O((1/\Delta x)^{2D})$ operations; in other words, the effective $p$ would satisfy $D + p = 2D$, which implies that $p = D$. If we have constant coefficients and timesteps, then we could perform the factorization once, and only pay for the cost of the system solve in each timestep. In some 1D cases with $e - i > 1$ this implicit scheme with Gaussian factorization could involve a lower order of work than explicit time integration.

Example 3.1-4: If we use Gaussian factorization for banded matrices for the linear solver, then the band width for the heat equation using centered differences would be $b = O((1/\Delta x)^{D-1})$, and the total work in the factorization would be $O(b^2 M) = O((1/\Delta x)^{3D-2})$. This corresponds to $p = 2(D - 1)$; the implicit scheme would be more efficient if $D < 1 + (e - i)/2$. For backward Euler versus forward Euler ($e = i = 2$), both schemes involve the same order of work. If we compare the Crank-Nicolson scheme to forward Euler ($e = 2$, $i = 1$), the implicit scheme would be more efficient in one dimension, but not in 2D or 3D.

Example 3.1-5: After Gaussian factorization of a banded matrix, the solution of a linear system costs $O(bM) = O((1/\Delta x)^{2D-1})$ operations. This corresponds to $p = D - 1$. If we have constant coefficients and timesteps, then we could perform the factorization once, and only pay for the costs of the system solve in each timestep. The implicit scheme would be more efficient if $D < 1 + e - i$. For backward Euler versus forward Euler ($e = i = 2$), both schemes involve the same order of work. If we compare the Crank-Nicolson scheme to forward Euler ($e = 2$, $i = 1$), the implicit scheme would be more efficient in one dimension. In 2D the implicit and explicit schemes would involve the same order of work, but in 3D the explicit scheme would be more efficient.

These examples show that faster linear solvers would clearly be useful for parabolic equations.

  Scheme                                                      e
  Forward Euler in time, centered differences in space        2
  Adams-Bashforth 2 in time, centered differences in space    1

Table 3.1-1: Explicit Timestepping Mesh Exponents Matching Temporal Order: $\Delta t = O(\Delta x^e)$

  Scheme                                                                    i
  Backward Euler in time, centered differences in space                     2
  Crank-Nicolson in time, centered differences in space                     1
  Adams-Moulton 2 in time, centered differences in space                    1
  Backward differentiation formula 2 in time, centered differences in space 1
  DIRK 2 in time, centered differences in space                             1

Table 3.1-2: Implicit Timestepping Mesh Exponents Matching Temporal Order: $\Delta t = O(\Delta x^i)$

Examples of values of $e$, $i$ and $p$ for various methods are given in tables 3.1-1, 3.1-2 and 3.1-3.

Elliptic equations often arise as the steady-state of parabolic equations. The next lemma discusses the relative efficiency of using explicit time integration to compute the steady state of a parabolic equation, versus direct solution of the linear system for the steady state.

  Solver                                            p
  Gaussian factorization                            $2D$
  Band solver, band width $= O((1/\Delta x)^{D-1})$  $D - 1$
  Richardson                                        $[\log(1 + 4D\tau) + \log(\log(1/\Delta x))] / \log(1/\Delta x)$
  Jacobi                                            $[\log(2D\tau) + \log(\log(1/\Delta x))] / \log(1/\Delta x)$
  SOR with optimal relaxation                       $[\frac{1}{2} \log((1 + 2D\tau)/8) + \log(\log(1/\Delta x))] / \log(1/\Delta x)$
  Conjugate gradients                               $1 + \log 2 / \log(1/\Delta x)$
  Multigrid                                         $\log(\log(1/\Delta x)) / \log(1/\Delta x)$

Table 3.1-3: Solver Work Exponents for Heat Equation: convergence requires $O(\Delta x^{-p-D})$ work in $D$ dimensions.


Lemma 3.1-6: Suppose that we have a choice of two schemes for solving a linear steady-state partial differential equation: an implicit scheme and an explicit time integration scheme. Suppose that steady state effectively occurs at time $T = O(L^2/d)$ (where $L$ is some characteristic length in the problem and $d$ is the scale of the diffusion), and the stability condition for the explicit scheme requires that $\Delta t = O(\Delta x^e/d)$. Also suppose that the number of unknowns in the linear system for the implicit scheme is $M = O(\Delta x^{-D})$, where $D$ is the number of spatial dimensions in the problem. Finally, suppose that the implicit method solves the linear system by means of an iteration that costs $O(M)$ operations per iteration and on the order of $(1/\Delta x)^p$ iterations to reach the same order of accuracy as the explicit scheme. Then in order for the implicit scheme for the elliptic equation to use a lower order of work as $\Delta x \to 0$, when compared to the explicit scheme for finding the steady state of an associated parabolic equation, we must have

p < e .

Proof: Explicit time integration to steady state will require $O(T/\Delta t) = O\left( \frac{L^2/d}{\Delta x^e/d} \right) = O(\Delta x^{-e})$ timesteps. Thus the total work required to use explicit time integration to find a steady state is $O((1/\Delta x)^{D+e})$. If we can solve for steady state directly in $O((1/\Delta x)^p)$ iterations, each costing $O((1/\Delta x)^D)$ operations, then explicit time integration will cost on the order of a factor of

\[
\frac{(1/\Delta x)^{D+e}}{(1/\Delta x)^{D+p}} = \left( \frac{1}{\Delta x} \right)^{e-p}
\]

more work than performing the linear algebra to solve the steady-state equations. □

In other words, iteration to compute a steady state is more efficient than explicit time integration to steady state, provided that the iterative method requires fewer than $O(1/\Delta x^e)$ iterations. For a band solver we have $p = 2(D - 1)$, so the implicit scheme is asymptotically more efficient than forward Euler time integration ($e = 2$) only for $D < 2$, i.e., fewer than 2 dimensions. For a constant-coefficient problem with constant timestep, the band solver would factor once and solve multiple times, corresponding to $p = D - 1$. Compared to forward Euler time integration, the implicit scheme would be more efficient only if $D < 3$.

We will examine several different approaches to solving the linear equations. All of our techniques will involve iteration, rather than direct solution techniques. Given some approximate solution, we will determine a step direction and step size to improve the solution. The simplest methods will not make particularly good use of the special structure of the linear systems that arise in solving partial differential equations; they will also tend not to be the most efficient alternatives.


It is useful to note that for nonlinear steady state problems, selection of the correct steady state may depend on evolution of the motion toward steady state. For example, the nature of steady shocks may depend on entropy conditions that are expressed through the unsteady equations. In such cases, time-accurate integration may be required.

Exercises 3.1

1. The number of iterations required by the linear solver for the implicit time integration will depend on the initial guess. Suppose that we use the solution of the heat equation at the previous time as our initial guess.

(a) What is the order of the error in the initial guess?

(b) What is the order of the reduction in the error that must be achieved so that the error in the iterative solution is on the order of the local truncation error in the discretization of the heat equation?

2. Consider techniques for obtaining an initial guess for the solution of the linear equations for the implicit treatment of the heat equation. Can you suggest a better initial guess than the previous solution of the heat equation? Is your initial guess numerically stable?

3. Suppose that we use third-order implicit time integration with second-order spatial differences. Further suppose that we choose the timesteps in implicit time integration so that we balance the spatial and temporal truncation errors: $\Delta t = O(\Delta x^{2/3})$. If we use an iterative scheme costing $O((1/\Delta x)^D)$ work per iteration, how many iterations can we take with the implicit scheme in order to be competitive with the explicit scheme?

3.2 Neumann Series

In section 3.5 we will consider basic iterative methods for solving linear systems ofequations. There the basic approach for solving Ax = b will be to choose a convenientmatrix C and an initial guess x0, and compute iterates

rk = Axk − b , xk+1 = xk −Crk .

Note that the errors in the solution satisfy

xk+1 − x = xk − x−CA(xk − x) = (I−CA)(xk − x) ,

soxk − x = (I−CA)k(x0 − x) .

Thus is is important to understand how matrix powers behave for large exponents. Next,note that the residuals satisfy

rk+1 = Axk+1 − b = A(xk −Crk)− b = rk −ACrk = (I−AC)rk

sork+` = (I−AC)`rk .


It follows that the changes in the solution satisfy

\[ x_{k+\ell} - x_{k+\ell+1} = C r_{k+\ell} = C (I - AC)^\ell r_k . \]

If the iterative improvement algorithm converges, we have

\[ x_k - x = \sum_{\ell=0}^{\infty} (x_{k+\ell} - x_{k+\ell+1}) = C \left[ \sum_{\ell=0}^{\infty} (I - AC)^\ell \right] r_k = \left[ \sum_{\ell=0}^{\infty} (I - CA)^\ell \right] C r_k . \]

The last equality uses the identity $C (I - AC) = (I - CA) C$. Thus the convergence of these methods will be related to the convergence of the Neumann series $\sum_{k=0}^{\infty} (I - AC)^k$ or $\sum_{k=0}^{\infty} (I - CA)^k$. In this section, we will develop conditions under which the Neumann series converges. These conditions will involve the spectral radius of the matrix in the series.
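As a concrete illustration of the iteration $r_k = A x_k - b$, $x_{k+1} = x_k - C r_k$, the following minimal sketch applies it to a small system. The 2 by 2 matrix, the right-hand side, and the choice of $C$ as the inverse of the diagonal of $A$ are assumptions made only for this example; they do not come from the text.

```python
# Sketch of iterative improvement: r_k = A x_k - b, x_{k+1} = x_k - C r_k.
# The system and the matrix C below are illustrative assumptions.

def matvec(M, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def iterative_improvement(A, b, C, x0, steps):
    x = list(x0)
    for _ in range(steps):
        r = [ax - bi for ax, bi in zip(matvec(A, x), b)]   # r_k = A x_k - b
        Cr = matvec(C, r)
        x = [xi - ci for xi, ci in zip(x, Cr)]             # x_{k+1} = x_k - C r_k
    return x

A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]                           # exact solution is x = (1, 2)
C = [[0.25, 0.0], [0.0, 0.2]]             # inverse of the diagonal of A
x = iterative_improvement(A, b, C, [0.0, 0.0], 60)
```

For this choice, $I - CA$ has eigenvalues $\pm\sqrt{0.1}$, so the error shrinks by roughly a factor of 0.32 per step.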

Since the matrices we will consider are not necessarily diagonalizable, we will use the Schur decomposition to bound norms of matrix powers. Numerically, this approach is more robust than the use of Jordan canonical forms, which are numerically unstable in floating-point computations.

Schur Decomposition Theorem 3.2-1: [43, Theorem 7.1.3, p. 313] If $A \in \mathbb{C}^{n \times n}$, then there is a unitary matrix $Q \in \mathbb{C}^{n \times n}$ and a right-triangular matrix $R \in \mathbb{C}^{n \times n}$ so that

\[ Q^H A Q = R . \]

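Proof: We will prove this theorem by induction. If $n = 1$, the theorem is satisfied with $Q = 1$ and $R = A$. Suppose that the theorem is true for all square matrices with fewer than $n$ rows. Since the eigenvalues of $A$ are the zeros of its characteristic polynomial and the fundamental theorem of algebra shows that every polynomial of degree $n$ has $n$ zeros (counting multiplicity), $A$ has at least one eigenvalue $\lambda$. Since $A - I\lambda$ is singular, let $q$ be a unit vector so that

\[ A q = q \lambda . \]

We can extend $q$ to an orthonormal basis for $\mathbb{C}^n$; these vectors give the columns of a unitary matrix

\[ U = \begin{bmatrix} q & W \end{bmatrix} . \]

Since

\[ I = U^H U = \begin{bmatrix} q^H \\ W^H \end{bmatrix} \begin{bmatrix} q & W \end{bmatrix} = \begin{bmatrix} q^H q & q^H W \\ W^H q & W^H W \end{bmatrix} \]

it follows that

\[ W^H q = 0 . \]

As a result,

\[ U^H A U = \begin{bmatrix} q^H \\ W^H \end{bmatrix} A \begin{bmatrix} q & W \end{bmatrix} = \begin{bmatrix} \lambda & q^H A W \\ 0 & W^H A W \end{bmatrix} . \]

Since $W^H A W \in \mathbb{C}^{(n-1) \times (n-1)}$, by the inductive hypothesis we can find a unitary matrix $\tilde{Q}$ and a right-triangular matrix $\tilde{R}$ so that

\[ \tilde{Q}^H (W^H A W) \tilde{Q} = \tilde{R} . \]

Let

\[ Q = \begin{bmatrix} q & W \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q} \end{bmatrix} = U \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q} \end{bmatrix} . \]

Then

\[ Q^H A Q = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q}^H \end{bmatrix} \begin{bmatrix} q^H \\ W^H \end{bmatrix} A \begin{bmatrix} q & W \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q} \end{bmatrix} = \begin{bmatrix} \lambda & q^H A W \tilde{Q} \\ 0 & \tilde{R} \end{bmatrix} \]

is upper triangular, and

\[ Q^H Q = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q}^H \end{bmatrix} U^H U \begin{bmatrix} 1 & 0 \\ 0 & \tilde{Q} \end{bmatrix} = I , \]

so $Q$ is unitary. □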

Of course, the eigenvalues of $A$ are the diagonal entries of $R$. We will make use of the following definition.

Definition 3.2-2: The spectral radius of a matrix $A \in \mathbb{C}^{n \times n}$ is

\[ \rho(A) \equiv \max \{ |\lambda| : \lambda \mbox{ is an eigenvalue of } A \} . \]

We will also define some matrix norms.

Definition 3.2-3: If $A \in \mathbb{C}^{m \times n}$, the Frobenius norm is

\[ \|A\|_F \equiv \sqrt{ \sum_{j=1}^{n} \sum_{i=1}^{m} |A_{i,j}|^2 } \]

and the matrix 2-norm is

\[ \|A\|_2 \equiv \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} . \]

These definitions allow us to state and prove the following result.


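Lemma 3.2-4: [43, Lemma 7.3.2, p. 336] Suppose that $A \in \mathbb{C}^{n \times n}$ has Schur decomposition $Q^H A Q = R$ where $Q \in \mathbb{C}^{n \times n}$ is unitary and $R \in \mathbb{C}^{n \times n}$ is right-triangular. Let $R = D + U$ where $D$ is diagonal and $U$ is strictly upper triangular. Then for all $\theta \ge 0$ and for all integers $k \ge 0$,

\[ \|A^k\|_2 \le (1 + \theta)^{n-1} \left[ \rho(A) + \frac{\|U\|_F}{1 + \theta} \right]^k . \]

Proof: For any $\theta \ge 0$, define the diagonal matrix $T \in \mathbb{R}^{n \times n}$ by

\[ T_{i,i} = (1 + \theta)^{i-1} , \quad 1 \le i \le n . \]

Then it is easy to see that $\|T\|_2 = (1 + \theta)^{n-1}$ and that $\|T^{-1}\|_2 = 1$. Using the fact that $U$ is strictly upper triangular, we can compute

\[ \|T U T^{-1}\|_F^2 = \sum_{j=1}^{n} \sum_{i=1}^{j-1} |T_{i,i} U_{i,j} / T_{j,j}|^2 = \sum_{j=1}^{n} \sum_{i=1}^{j-1} |U_{i,j} (1 + \theta)^{i-j}|^2 \le \sum_{j=1}^{n} \sum_{i=1}^{j-1} |U_{i,j}|^2 (1 + \theta)^{-2} = \|U\|_F^2 / (1 + \theta)^2 . \]

Since $R = D + U = T^{-1} (D + T U T^{-1}) T$,

\[ \|A^k\|_2 = \|R^k\|_2 = \|T^{-1} (D + T U T^{-1})^k T\|_2 \le \|T\|_2 \|T^{-1}\|_2 \left( \|D\|_2 + \|T U T^{-1}\|_2 \right)^k \le (1 + \theta)^{n-1} \left[ \rho(A) + \frac{\|U\|_F}{1 + \theta} \right]^k . \]

Here the last inequality uses $\|D\|_2 = \rho(A)$ and $\|T U T^{-1}\|_2 \le \|T U T^{-1}\|_F$. □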

The next corollary will be used in corollary 3.2-6 to show that the Neumann series converges, and in lemma 3.5-1 to show that iterative improvement converges.

Corollary 3.2-5: If $A \in \mathbb{C}^{n \times n}$ and $\rho(A) < 1$, then $\|A^k\|_2 \to 0$ as $k \to \infty$.

Proof: Choose $\theta \ge 0$ so that

\[ 1 + \theta > \frac{2 \|U\|_F}{1 - \rho(A)} . \]

Then

\[ \rho(A) + \frac{\|U\|_F}{1 + \theta} < \frac{1 + \rho(A)}{2} < 1 . \]

It follows from lemma 3.2-4 that

\[ \|A^k\|_2 \le (1 + \theta)^{n-1} \left( \frac{1 + \rho(A)}{2} \right)^k . \]

Since the right-hand side in this inequality tends to zero as $k$ tends to infinity, the corollary is proved. □
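The bound in lemma 3.2-4 carries the factor $(1+\theta)^{n-1}$, so corollary 3.2-5 does not claim that the norms of the powers decrease monotonically. The following sketch illustrates this for a non-normal matrix with spectral radius 0.5. The particular matrix is an assumption chosen for illustration, and the Frobenius norm is used because it is easy to compute and bounds the 2-norm from above.

```python
# Norms of powers of a non-normal matrix with rho(A) < 1.
# The matrix is an illustrative assumption; ||.||_F >= ||.||_2.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius(X):
    return sum(v * v for row in X for v in row) ** 0.5

def power_norms(A, kmax):
    """Return [||A^1||_F, ..., ||A^kmax||_F]."""
    norms, P = [], A
    for _ in range(kmax):
        norms.append(frobenius(P))
        P = matmul(P, A)
    return norms

A = [[0.5, 10.0], [0.0, 0.5]]     # both eigenvalues equal 0.5
norms = power_norms(A, 80)
```

The first few norms are larger than one because of the large off-diagonal entry, but the geometric factor eventually dominates and the norms tend to zero.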

The next corollary will be used in corollary 3.2-7 and in lemma 3.5-39 to understand stopping criteria for iterative improvement.

Corollary 3.2-6: Suppose that $A \in \mathbb{C}^{n \times n}$.

1. If $\rho(A) < 1$, then $I - A$ is nonsingular, and $\sum_{k=0}^{\infty} A^k = (I - A)^{-1}$.

2. If $\sum_{k=0}^{\infty} A^k$ converges, then $\rho(A) < 1$.

Proof: Let us prove the first claim. If we choose $\theta$ as in the proof of corollary 3.2-5, then

\[ \left\| \sum_{k=0}^{N} A^k \right\|_2 \le \sum_{k=0}^{N} \|A^k\|_2 \le (1 + \theta)^{n-1} \sum_{k=0}^{N} \left[ \frac{1 + \rho(A)}{2} \right]^k \le \frac{(1 + \theta)^{n-1}}{1 - [1 + \rho(A)]/2} . \]

Thus the Neumann series $\sum_{k=0}^{\infty} A^k$ converges absolutely. Since corollary 3.2-5 shows that

\[ I - (I - A) \sum_{k=0}^{N} A^k = A^{N+1} \to 0 \mbox{ as } N \to \infty \]

we see that

\[ (I - A) \sum_{k=0}^{\infty} A^k = I . \]

Thus $I - A$ is nonsingular, and the result is proved.

Next, we will prove the second claim. Suppose that $Ax = x\lambda$ with $x \ne 0$. Then

\[ \left( \sum_{k=0}^{\infty} A^k \right) x = \left( \sum_{k=0}^{\infty} \lambda^k \right) x . \]

Since the geometric series converges if and only if $|\lambda| < 1$, the claim is proved. □
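The first claim of corollary 3.2-6 can be checked numerically on a small example. In the sketch below, the 2 by 2 matrix is an assumption chosen so that its eigenvalues (0.5 and 0.1) and the inverse of $I - A$ are easy to compute by hand.

```python
# Partial sums of the Neumann series compared with (I - A)^{-1}.
# The matrix A is an illustrative assumption with rho(A) = 0.5.

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_partial_sum(A, N):
    """Return sum_{k=0}^{N} A^k."""
    n = len(A)
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in S]
    for _ in range(N):
        P = matmul(P, A)                                   # P = A^k
        S = [[S[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    return S

A = [[0.2, 0.3], [0.1, 0.4]]
S = neumann_partial_sum(A, 80)
# I - A = [[0.8, -0.3], [-0.1, 0.6]] has determinant 0.45.
inv = [[0.6 / 0.45, 0.3 / 0.45], [0.1 / 0.45, 0.8 / 0.45]]
err = max(abs(S[i][j] - inv[i][j]) for i in range(2) for j in range(2))
```

Every term of the series is nonnegative here, so the partial sums increase entrywise toward the inverse, which is the situation described in corollary 3.2-7.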


The next corollary will be used in lemma 3.3-14 to show that a nonnegative matrix has an eigenvalue equal to its spectral radius.

Corollary 3.2-7: [7, Theorem 6.16] If $A \in \mathbb{R}^{n \times n}$, $\rho(A) < 1$ and $A \ge 0$, then $I - A$ is nonsingular and $(I - A)^{-1} \ge 0$.

Proof: Since $\rho(A) < 1$, corollary 3.2-6 shows that the Neumann series $\sum_{k=0}^{\infty} A^k$ converges to $(I - A)^{-1}$. Since $A \ge 0$ implies that all of the terms in the series have nonnegative entries, the matrix $(I - A)^{-1}$ is nonnegative. □

Exercises 3.2

1. Suppose that $\|x\|$ is any norm on a vector $x$, and define

\[ \|A\| \equiv \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} . \]

Show that this defines a norm on $\mathbb{R}^{n \times n}$, and that the spectral radius of $A$ is bounded above by $\|A\|$.

2. It is common in mathematics classes to use Jordan canonical forms to solve linear systems of ordinary differential equations with non-diagonalizable coefficient matrices. Numerically, the problem with this approach is that the Jordan canonical form is not numerically stable: a small perturbation in the matrix can lead to a totally different configuration of Jordan blocks. On the other hand, the Schur decomposition of the coefficient matrix is very numerically stable, since it employs a unitary change of basis. Suppose that we want to solve $du/dt = Au$ where $Q^H A Q = R$ is the Schur decomposition of $A$. Show how we can use back-substitution to solve the system of ordinary differential equations for the entries of $Q^H u(t)$.

3.3 Perron-Frobenius Theorem

In the previous section, we studied the convergence of the Neumann series for matrices $A$ with $\rho(A) < 1$, and concluded with a result about the implications of the additional assumption that $A$ has all nonnegative entries. In this section, we will develop the Perron-Frobenius theory for irreducible nonnegative matrices as another way to understand convergence of the Neumann series. In particular, this theory will enable us to state precisely the condition under which the Neumann series converges for real nonnegative matrices.

The solution of linear systems $Ax = b$ can be simplified if some unknowns can be determined in a smaller linear system before the other unknowns. Suppose that there is a permutation matrix $P$ so that

\[ P^\top A P = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} \]

where $A_{11}$ is square. Then we can partition

\[ P^\top x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} , \quad P^\top b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \]

and solve

\[ (P^\top A P)(P^\top x) = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = P^\top b \]

by first solving $A_{22} x_2 = b_2$ for $x_2$, and then solving $A_{11} x_1 = b_1 - A_{12} x_2$ for $x_1$. This motivates the following definition.

Definition 3.3-1: Suppose that $A \in \mathbb{C}^{n \times n}$ with $n \ge 2$. Then $A$ is reducible if and only if there is a permutation matrix $P \in \mathbb{R}^{n \times n}$ such that we can partition

\[ P^\top A P = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} \]

where $A_{11} \in \mathbb{C}^{r \times r}$. If no such permutation exists, then $A$ is irreducible.

Lemma 3.3-2: If $A \in \mathbb{C}^{n \times n}$ is reducible, then there exists a permutation matrix $P$ such that

\[ P^\top A P = \begin{bmatrix} R_{11} & R_{12} & \dots & R_{1m} \\ & R_{22} & \dots & R_{2m} \\ & & \ddots & \vdots \\ & & & R_{mm} \end{bmatrix} \]

where each diagonal block $R_{jj}$ is either irreducible or a $1 \times 1$ null matrix.

Proof: This result follows by repeated application of the definition of reducibility. □

This result in turn motivates the following definition.

Definition 3.3-3: A block right-triangular matrix

\[ \begin{bmatrix} R_{11} & R_{12} & \dots & R_{1m} \\ & R_{22} & \dots & R_{2m} \\ & & \ddots & \vdots \\ & & & R_{mm} \end{bmatrix} \]

is a normal form if and only if each diagonal block $R_{jj}$ is either irreducible or a $1 \times 1$ null matrix.

Let us work to understand the properties of irreducible matrices.


Lemma 3.3-4: [89, Theorem 2.2] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Then

1. for any nonzero $x \ge 0$ we have $(I + A)^{n-1} x > 0$, and

2. $(I + A)^{n-1} > 0$.

Proof: Let us prove the first claim. Given a nonzero $x_0 \ge 0$, define the sequence $x_k$ by

\[ x_{k+1} = (I + A) x_k , \quad \forall \, 0 \le k \le n - 2 . \]

This equation and the inequality $A \ge 0$ imply that

\[ x_{k+1} = x_k + A x_k \ge x_k . \quad (3.3\mbox{-}1) \]

Thus $x_{k+1}$ has no more zero components than $x_k$.

If $x_k$ has at least one zero component, we will show that $x_{k+1}$ has fewer zero components. We will prove this claim by contradiction. If $x_{k+1}$ and $x_k$ have the same positive number of zero components, then equation (3.3-1) shows that there exists a permutation matrix $P$ so that

\[ P^\top x_{k+1} = \begin{bmatrix} a \\ 0 \end{bmatrix} \mbox{ and } P^\top x_k = \begin{bmatrix} b \\ 0 \end{bmatrix} , \]

where $a, b \in \mathbb{R}^m$, $m < n$ and $a, b > 0$. We can use the permutation matrix $P$ to partition the equation $P^\top x_{k+1} = P^\top x_k + (P^\top A P) P^\top x_k$ in the form

\[ \begin{bmatrix} a \\ 0 \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} b \\ 0 \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} A_{11} b \\ A_{21} b \end{bmatrix} . \]

Since $b > 0$ and $A_{21} \ge 0$, this equation implies that $A_{21} = 0$. This in turn implies that $A$ is reducible, which gives us a contradiction. We conclude that $x_{k+1}$ has fewer zero components than $x_k$.

Since $x_0$ has at most $n - 1$ zero components, $x_k$ has at most $n - k - 1$ zero components for $1 \le k \le n - 1$. In particular, $x_{n-1}$ has no zero components; in other words, $x_{n-1} > 0$. This proves the first claim.

The second claim follows from the first by choosing $x_0$ to be any axis vector. □
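Lemma 3.3-4 suggests a simple computational test. For a nonnegative matrix, positivity of $(I + A)^{n-1}$ is also sufficient for irreducibility, because a reducible $A$ leaves the same zero block in every power of $I + A$. The following sketch uses this observation; the function name and the two test matrices are assumptions for illustration, and only the zero pattern of the entries is tracked.

```python
# Irreducibility test for a nonnegative matrix via the pattern of (I + A)^(n-1).
# Function name and test matrices are illustrative assumptions.

def is_irreducible_nonnegative(A):
    n = len(A)
    # Only the zero pattern matters, so work with 0/1 entries.
    M = [[1 if (i == j or A[i][j] > 0) else 0 for j in range(n)]
         for i in range(n)]
    P = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(n - 1):
        P = [[1 if any(P[i][k] and M[k][j] for k in range(n)) else 0
              for j in range(n)] for i in range(n)]
    return all(v == 1 for row in P for v in row)

tridiagonal = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # irreducible
triangular = [[1, 1], [0, 1]]                      # reducible
```

This costs $O(n^4)$ operations as written; it is meant only to make the statement of the lemma concrete.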


Lemma 3.3-5: [89] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. If $x \in \mathbb{R}^n$ is nonzero and $x \ge 0$, define

\[ \rho_A(x) \equiv \min_{i : x_i > 0} \frac{\sum_{j=1}^{n} A_{ij} x_j}{x_i} . \quad (3.3\mbox{-}2) \]

Then

1. $\rho_A(x) \ge 0$, and

2. $Ax \ge x\rho$, $x \ge 0$ and $x \ne 0$ imply that $\rho \le \rho_A(x)$.

Proof: To prove the first claim, note that since $A \ge 0$ and $x \ge 0$, the definition (3.3-2) of $\rho_A(x)$ implies that $\rho_A(x) \ge 0$.

To prove the second claim, suppose that $Ax \ge x\rho$, $x \ge 0$ and $x_i > 0$. Then

\[ \sum_{j=1}^{n} A_{ij} x_j / x_i \ge \rho . \]

It follows from the definition of $\rho_A(x)$ that $\rho \le \rho_A(x)$. □

This lemma motivates the following definition.

Definition 3.3-6: Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Let

\[ P \equiv \{ x \in \mathbb{R}^n : x \ge 0 \mbox{ and } \|x\|_2 = 1 \} . \quad (3.3\mbox{-}3) \]

Then the extremal value of $A$ is

\[ \rho_A \equiv \sup_{x \in P} \rho_A(x) . \quad (3.3\mbox{-}4) \]

Lemma 3.3-7: Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Let $P$ be defined by (3.3-3) and define

\[ Q \equiv \{ (I + A)^{n-1} x : x \in P \} . \quad (3.3\mbox{-}5) \]

Then the extremal value of $A$ satisfies

\[ \rho_A = \sup_{y \in Q} \rho_A(y) . \]

Proof: Lemma 3.3-4 shows that if $y \in Q$, then $y > 0$. The definition of the extremal value shows that if $x \in P$ then

\[ Ax \ge x \rho_A(x) . \]

We can multiply this inequality by $(I + A)^{n-1}$ to get

\[ Ay \ge y \rho_A(x) , \]

where $y \equiv (I + A)^{n-1} x \in Q$. This implies that $\rho_A(y) \ge \rho_A(x)$, from which we easily conclude the desired result. □

Lemma 3.3-8: Suppose that A ∈ Rn×n is irreducible, and A ≥ 0. Let P be defined by(3.3-3) and let the extremal value ρA be defined by (3.3-4). Then

1. there exists z ∈ Rn such that z > 0 and Az ≥ zρA, and

2. there does not exist w ∈ Rn such that w ≥ 0 and Aw > wρA.

Proof: Let Q be given by (3.3-5). Since P is compact, so is Q. Since ρA(y)is continuous on Q, supy∈Q ρA(y) is attained by some z ∈ Q. 2

Definition 3.3-9: Suppose that A ∈ Rn×n is irreducible, and A ≥ 0. Define the extremalvalue ρA of A by (3.3-4). Avector z ∈ Rn that is nonzero and satisfies z ≥ 0 is anextremal vector of A if and only if Az ≥ zρA.

Lemma 3.3-10: [89, Theorem 2.3] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. If $\rho_A$ is the extremal value of $A$, then

1. $\rho_A > 0$, and

2. if $z$ is an extremal vector of $A$, then $Az = z\rho_A$ and $z > 0$.

Proof: First, let us prove that $\rho_A > 0$. Let $e$ be the vector of ones. Then

\[ \rho_A \ge \rho_A(e / \|e\|_2) = \min_i \sum_{j=1}^{n} A_{ij} > 0 . \]

Now we turn to the second claim. Let $z$ be an extremal vector of $A$, and let $w = (I + A)^{n-1} z$. By definition 3.3-9 of an extremal vector, $y \equiv Az - z\rho_A \ge 0$.

We will prove that $y = 0$ by contradiction. If $y \ne 0$, then some component of $y$ is positive. This fact and lemma 3.3-4 in turn imply that $Aw - w\rho_A > 0$. This contradicts the definition of the extremal value. We conclude that $y = 0$, which in turn implies that $Az = z\rho_A$.

The proof of lemma 3.3-4 also shows that $z = Az / \rho_A$ cannot have any zero components. □

Lemma 3.3-11: [89, Theorem 2.4] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Let $\rho_A$ be the extremal value of $A$. Finally, suppose that $B \in \mathbb{C}^{n \times n}$ satisfies

\[ |B_{ij}| \le A_{ij} \quad \forall \, 1 \le i, j \le n . \]

Then

1. If $\beta$ is an eigenvalue of $B$, then $|\beta| \le \rho_A$.

2. There is an eigenvalue $\beta$ of $B$ such that $|\beta| = \rho_A$ if and only if there exists a scalar $\phi \in \mathbb{R}$ and a diagonal matrix $D \in \mathbb{C}^{n \times n}$ with diagonal entries all of modulus one, such that

\[ B = D A D^{-1} e^{\imath\phi} . \]

Proof: We assume that $y \in \mathbb{C}^n$ is nonzero, and $By = y\beta$ for some scalar $\beta$. Define $z \in \mathbb{R}^n$ componentwise by $z_i = |y_i|$.

Let us prove the first claim. Since

\[ \forall \, 1 \le i \le n , \quad y_i \beta = \sum_{j=1}^{n} B_{ij} y_j , \]

we can take the modulus of both sides of this equation and use the hypotheses of the lemma to obtain

\[ \forall \, 1 \le i \le n , \quad z_i |\beta| \le \sum_{j=1}^{n} |B_{ij}| z_j \le \sum_{j=1}^{n} A_{ij} z_j . \]

The definition (3.3-2) of $\rho_A(z)$ and the definition (3.3-4) of the extremal value imply that

\[ |\beta| \le \rho_A(z) \le \rho_A . \]

Now we turn to the forward direction of the second claim. If the eigenvalue $\beta$ of $B$ satisfies $|\beta| = \rho_A$, then the vector $z$ determined at the beginning of the proof from the eigenvector $y$ corresponding to $\beta$ satisfies

\[ z\rho_A = |y\beta| = |By| \le |B| z \le Az \]

so $z$ is an extremal vector of $A$. Lemma 3.3-10 now shows that $z$ is an eigenvector of $A$ with eigenvalue $\rho_A$, and $z > 0$. Thus

\[ Az = z\rho_A = |B| z . \quad (3.3\mbox{-}6) \]

By the hypotheses of the lemma, $|B| \le A$; since $(A - |B|) z = 0$ and $z > 0$, we conclude that

\[ |B| = A . \quad (3.3\mbox{-}7) \]

Define the diagonal matrix $D \in \mathbb{C}^{n \times n}$ componentwise by

\[ D_{ii} = y_i / |y_i| . \]

Then

\[ y = Dz . \]

Let the polar form of the complex number $\beta$ be

\[ \beta = \rho_A e^{\imath\phi} , \]

and define the matrix

\[ C = D^{-1} B D e^{-\imath\phi} . \]

Then the eigenvalue equation $By = y\beta$ can be rewritten

\[ Cz = z\rho_A . \quad (3.3\mbox{-}8) \]

Equations (3.3-6) and (3.3-8) now imply that

\[ Cz = z\rho_A = Az = |B| z \quad (3.3\mbox{-}9) \]

and equation (3.3-7) implies that

\[ |C| = |B| = A . \quad (3.3\mbox{-}10) \]

Equation (3.3-9) now implies that $Cz = |C| z$; since $z > 0$, we conclude that $C = |C| = A$. We can invert the definition of $C$ to get the claimed form for $B$.

Finally, we will prove the reverse direction of the second claim. Suppose that $B = D A D^{-1} e^{\imath\phi}$. Let $z$ be an extremal vector of $A$, and let $y = Dz$. Then

\[ z\rho_A = Az \iff D z \rho_A e^{\imath\phi} = D A D^{-1} D z e^{\imath\phi} \iff y \rho_A e^{\imath\phi} = By . \]

This shows that $\rho_A e^{\imath\phi}$ is an eigenvalue of $B$. □


Corollary 3.3-12: [89, Corollary 2.5] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Then the extremal value $\rho_A$ is equal to the spectral radius $\rho(A)$.

Proof: Let $B = A$ in lemma 3.3-11. □

Perron-Frobenius Theorem 3.3-13: [89, p. 35] Suppose that $A \in \mathbb{R}^{n \times n}$ is irreducible, and $A \ge 0$. Then $A$ has an eigenvector $x$ with all positive entries, and corresponding eigenvalue equal to the spectral radius.

Proof: This result follows immediately from lemma 3.3-10 and corollary 3.3-12. □
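The positive eigenvector promised by the Perron-Frobenius theorem can be approximated numerically. The sketch below applies power iteration to $I + A$ rather than to $A$; this shift is an assumption made here to avoid the oscillation that plain power iteration can show for periodic matrices, and it does not change the eigenvectors. The eigenvalue estimate is the quantity $\rho_A(x)$ of equation (3.3-2). The function name and the 3 by 3 matrix (which has eigenvalues 2, 0 and -2) are illustrative assumptions.

```python
# Power iteration on I + A to approximate the Perron vector of an
# irreducible nonnegative matrix. Names and the matrix are assumptions.

def perron_pair(A, iterations=200):
    n = len(A)
    x = [1.0 + i for i in range(n)]              # any positive starting vector
    for _ in range(iterations):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n))
             for i in range(n)]                  # y = (I + A) x
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    rho = min(Ax[i] / x[i] for i in range(n) if x[i] > 0)   # rho_A(x), (3.3-2)
    return rho, x

A = [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
rho, x = perron_pair(A)
```

For this matrix the Perron vector is proportional to the vector of ones, and the computed `rho` approximates the spectral radius 2.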

The following lemma will be used to prove lemma 3.3-16.

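Lemma 3.3-14: [89, Theorem 2.20] Suppose that $A \in \mathbb{R}^{n \times n}$ and $A \ge 0$. Then there exists a nonzero $x \in \mathbb{R}^n$ with $x \ge 0$ so that $Ax = x\rho(A)$.

Proof: If $A$ is irreducible, the claim follows from the Perron-Frobenius Theorem 3.3-13.

Otherwise, lemma 3.3-2 shows that there is a permutation matrix $P \in \mathbb{R}^{n \times n}$ so that

\[ P A P^\top = \begin{bmatrix} R_{11} & R_{12} & \dots & R_{1m} \\ & R_{22} & \dots & R_{2m} \\ & & \ddots & \vdots \\ & & & R_{mm} \end{bmatrix} \]

where for all $j$, $R_{jj} \in \mathbb{R}^{r_j \times r_j}$ is either irreducible or a $1 \times 1$ null matrix. Obviously, the eigenvalues of $A$ are the eigenvalues of the diagonal blocks $R_{jj}$.

If there is an index $j$ so that $R_{jj}$ is irreducible, then the Perron-Frobenius Theorem shows that $R_{jj}$ has a positive eigenvalue equal to its spectral radius. For at least one such diagonal block, $\rho(R_{jj}) = \rho(A)$. Let $R_{ii}$ be the first irreducible diagonal block such that $\rho(R_{ii}) = \rho(A)$. Then there exists $x_i \ge 0$ so that

\[ R_{ii} x_i = x_i \rho(A) \]

and $R_{kk} - I\rho(A)$ is nonsingular for all $k < i$. Then we can (if necessary) back-solve

\[ \begin{bmatrix} R_{11} - I\rho(A) & \dots & R_{1,i-1} \\ & \ddots & \vdots \\ & & R_{i-1,i-1} - I\rho(A) \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_{i-1} \end{bmatrix} = \begin{bmatrix} -R_{1i} x_i \\ \vdots \\ -R_{i-1,i} x_i \end{bmatrix} \]

for $x_{i-1}, \dots, x_1$. Then we have that

\[ \begin{bmatrix} R_{11} & \dots & R_{1i} & R_{1,i+1} & \dots & R_{1m} \\ & \ddots & \vdots & \vdots & & \vdots \\ & & R_{ii} & R_{i,i+1} & \dots & R_{im} \\ & & & R_{i+1,i+1} & \dots & R_{i+1,m} \\ & & & & \ddots & \vdots \\ & & & & & R_{mm} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_i \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_i \\ 0 \\ \vdots \\ 0 \end{bmatrix} \rho(A) . \]

All that remains is for us to show that $x_1, \dots, x_i$ are nonnegative. We already know that $x_i > 0$, as a result of the Perron-Frobenius Theorem. Suppose inductively that $x_i, \dots, x_{k+1} \ge 0$ for $k \ge 1$. Then

\[ (R_{kk} - I\rho(A)) x_k = - \sum_{\ell=k+1}^{i} R_{k,\ell} x_\ell \]

so

\[ (I - R_{kk}/\rho(A)) x_k = \sum_{\ell=k+1}^{i} R_{k,\ell} x_\ell / \rho(A) . \]

Now $R_{kk}/\rho(A) \ge 0$ and $\rho(R_{kk}/\rho(A)) < 1$, so corollary 3.2-7 shows that $[I - R_{kk}/\rho(A)]^{-1} \ge 0$. This proves that $x_k \ge 0$.

If there is no index $j$ so that $R_{jj}$ is irreducible, then every diagonal block $R_{jj}$ is a $1 \times 1$ null matrix. This implies that $P A P^\top$ is strictly upper triangular, $\rho(A) = 0$, and $P A P^\top e_1 = 0$. □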

The following lemma will be used in the next section to prove lemma 3.4-11, which shows that replacing an entry of an M-matrix by zero yields an M-matrix.

Lemma 3.3-15: [89, Theorem 2.21] Suppose that $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{C}^{n \times n}$ and

\[ 0 \le |B| \le A . \]

Then

\[ \rho(B) \le \rho(A) . \]

Proof: If $A$ is irreducible, then the result follows from lemma 3.3-11.

Suppose that $A$ is reducible, and the permutation matrix $P$ reduces $A$ to normal form. It is easy to see that

\[ 0 \le |P B P^\top| \le P A P^\top , \]

so we can apply lemma 3.3-11 to the diagonal blocks in the normal form to prove the lemma. □

The following lemma will be used to prove lemma 3.4-7.

Lemma 3.3-16: [89, Theorem 3.16] Suppose that $A \in \mathbb{R}^{n \times n}$ and $A \ge 0$. Then $\rho(A) < \alpha$ if and only if $I\alpha - A$ is nonsingular and $(I\alpha - A)^{-1} \ge 0$.

Proof: Let us prove that the first statement implies the second. Define $M = I\alpha - A$. Then corollary 3.2-7 implies that $M/\alpha = I - A/\alpha$ is nonsingular. As a result, $M$ is nonsingular and

\[ M^{-1} = (I - A/\alpha)^{-1} / \alpha = \sum_{k=0}^{\infty} (A/\alpha)^k / \alpha . \]

Since $A \ge 0$, this shows that $M^{-1} \ge 0$.

Now, let us prove that the second claim implies the first. Lemma 3.3-14 shows that there exists a nonzero $x \ge 0$ so that

\[ Ax = x\rho(A) . \]

As a result,

\[ (I\alpha - A) x = x (\alpha - \rho(A)) . \]

Since $I\alpha - A$ is nonsingular, $\alpha \ne \rho(A)$. We also see that

\[ 0 \le (I\alpha - A)^{-1} x = x \frac{1}{\alpha - \rho(A)} . \]

Since $x \ge 0$ and $x \ne 0$, we conclude that $\alpha > \rho(A)$. □

3.4 M Matrices

Our implicit discretizations of the heat equation in sections 2.2.4, 2.2.5 and 2.2.5 all produced linear systems with positive diagonal entries and negative off-diagonal entries. Stronger statements about these matrices are often possible. Sometimes, we can examine the matrices to see that the linear systems possess maximum principles. It is useful to determine more general conditions under which a discrete maximum principle is valid.

Definition 3.4-1: A matrix $A \in \mathbb{R}^{n \times n}$ is an M-matrix if and only if the off-diagonal entries of $A$ are all non-positive, and for all $x \in \mathbb{R}^n$, $Ax \ge 0$ implies that $x \ge 0$.


For a more detailed discussion of M-matrices, see [7].

Example 3.4-2: Consider the matrix

\[ A = \begin{bmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & \ddots & \ddots & \\ & & \ddots & 2 & -1 \\ & & & -1 & 1 \end{bmatrix} \]

which arises in the centered difference discretization of the heat equation with Neumann boundary conditions in one dimension. Note that $Ae = 0$, where $e$ is the vector of ones. Thus $A(-e) = 0 \ge 0$ but $-e < 0$. As a result, $A$ is not an M-matrix. On the other hand,

\[ A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & \ddots & \ddots & \\ & & \ddots & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \]

is an M-matrix, as we will see below.

Our first result illustrates why M-matrices are important in the numerical solution of parabolic equations.

Lemma 3.4-3: (Maximum principle) Suppose that $A$ is an M-matrix, $A x_1 = b_1$, $A x_2 = b_2$ and $b_1 \le b_2$. Then $x_1 \le x_2$.

Proof: Since $A(x_2 - x_1) = b_2 - b_1 \ge 0$, the definition of an M-matrix implies that $x_2 - x_1 \ge 0$. □

The following result is related to our discussion of M-matrices.

Lemma 3.4-4: [7, Lemma 6.1] $A \in \mathbb{R}^{n \times n}$ is nonsingular and the entries of $A^{-1}$ are all nonnegative if and only if $Ax \ge 0$ implies $x \ge 0$.

Proof: It is easy to prove the forward implication. If the entries of $A^{-1}$ are all nonnegative and $Ax \ge 0$, then $x = A^{-1}(Ax) \ge 0$.

To prove the reverse implication, suppose that $Ax \ge 0$ implies that $x \ge 0$. First, we will show that $A$ is nonsingular. Suppose that $Ax = 0$. Then $Ax \ge 0$, so we must have $x \ge 0$. We also have that $A(-x) \ge 0$, so we must have $-x \ge 0$. Thus $Ax = 0$ implies that $x = 0$, so $A$ is nonsingular.

Next, note that for any axis vector $e_i$, $A(A^{-1} e_i) = e_i \ge 0$, so our assumption implies that $A^{-1} e_i \ge 0$. Thus each column of the inverse matrix has nonnegative entries, and the lemma is proved. □

The next lemma characterizes M-matrices. The only difficulty with the lemma is that it is generally difficult to verify that the entries of the inverse of a matrix are all nonnegative.

Lemma 3.4-5: $A$ is an M-matrix if and only if $A$ is nonsingular, the off-diagonal entries of $A$ are all non-positive, and the entries of $A^{-1}$ are all nonnegative.

Proof: Let us prove the forward implication. Suppose that $A$ is an M-matrix. Then by definition, the off-diagonal entries of $A$ are all non-positive. By lemma 3.4-4, $A$ is nonsingular and the entries of $A^{-1}$ are all nonnegative.

Next, let us prove the other direction of the lemma. Suppose that $A$ is nonsingular with non-positive off-diagonal entries and such that $A^{-1}$ has nonnegative entries. Then lemma 3.4-4 shows that $Ax \ge 0$ implies $x \ge 0$. This proves that $A$ is an M-matrix. □
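For small matrices, the characterization in lemma 3.4-5 can be tested directly by forming the inverse. The following sketch does this in exact rational arithmetic with Gauss-Jordan elimination, and applies it to 3 by 3 versions of the two matrices of example 3.4-2. The function names and the choice $n = 3$ are assumptions for illustration; for large matrices this brute-force test is not practical.

```python
# Direct test of lemma 3.4-5: non-positive off-diagonal entries,
# nonsingular, and entrywise nonnegative inverse. Names are assumptions.
from fractions import Fraction

def inverse(A):
    """Gauss-Jordan inverse in exact arithmetic; returns None if singular."""
    n = len(A)
    M = [[Fraction(A[i][j]) for j in range(n)]
         + [Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return None
        M[c], M[pivot] = M[pivot], M[c]
        p = M[c][c]
        M[c] = [v / p for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def is_m_matrix(A):
    n = len(A)
    if any(A[i][j] > 0 for i in range(n) for j in range(n) if i != j):
        return False
    inv = inverse(A)
    return inv is not None and all(v >= 0 for row in inv for v in row)

dirichlet = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
neumann = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Here `is_m_matrix(dirichlet)` holds while `is_m_matrix(neumann)` fails because the Neumann matrix is singular, in agreement with example 3.4-2.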

Next, we will find some cases in which we do not need to examine the entries of the inverse in order to guarantee that a matrix is an M-matrix.

Lemma 3.4-6: [7, Lemma 6.2] If $A$ is a strictly diagonally dominant square matrix with positive diagonal entries and non-positive off-diagonal entries, then $A$ is an M-matrix.

Proof: Let $D$ be the diagonal of $A$. Then $B \equiv D^{-1}(D - A) = I - D^{-1}A$ has non-negative entries. Since $A$ is strictly diagonally dominant, $\sum_{j \ne i} |A_{ij}| < A_{ii}$ for all $i$; it follows that

\[ \sum_{j} |B_{ij}| = |B_{ii}| + \sum_{j \ne i} |B_{ij}| = 0 + \sum_{j \ne i} |A_{ij} / A_{ii}| < 1 \quad \forall i . \]

Recall that the Gerschgorin circle theorem says that the eigenvalues of $B$ are contained in the union of the circles with centers $B_{ii} = 0$ and radii $\sum_{j} |B_{ij}| < 1$; thus all of the eigenvalues of $B$ are contained in the interior of the unit circle. It follows from corollary 3.2-7 that

\[ 0 \le \sum_{k=0}^{\infty} B^k = (I - B)^{-1} . \]

Since $A = D(I - B)$ and $D^{-1} \ge 0$, we have

\[ A^{-1} = (I - B)^{-1} D^{-1} \ge 0 . \]

Lemma 3.4-5 now shows that $A$ is an M-matrix. □


The following related result will be used to study sub-matrices of M-matrices.

Lemma 3.4-7: [89, Theorem 3.18] Suppose that $A \in \mathbb{R}^{n \times n}$ has non-positive off-diagonal entries, and let $D$ be the diagonal part of $A$. Then $A$ is nonsingular and $A^{-1} \ge 0$ if and only if $A_{ii} > 0$ for all $1 \le i \le n$ and $\rho(I - D^{-1}A) < 1$.

Proof: Let us prove that the first statement implies the second. We will write $R = A^{-1} \ge 0$. Since $A$ has non-positive off-diagonal entries, the diagonal entries of the equation $I = RA$ can be rewritten

\[ 1 = \sum_{j=1}^{n} R_{ij} A_{ji} = R_{ii} A_{ii} - \sum_{j \ne i} R_{ij} |A_{ji}| . \]

Since $R = A^{-1}$ has nonnegative entries, it follows that

\[ R_{ii} A_{ii} = 1 + \sum_{j \ne i} R_{ij} |A_{ji}| \ge 1 . \]

This implies that $A_{ii} > 0$ for all $i$, which in turn implies that $D$ is nonsingular. Let $B = I - D^{-1}A$. Then

\[ B_{ii} = 1 - A_{ii}/A_{ii} = 0 \quad \forall \, 1 \le i \le n \]

and

\[ B_{ij} = -A_{ij}/A_{ii} \ge 0 \quad \forall j \ne i . \]

Thus, $B \ge 0$. Since $I - B = D^{-1}A$ is nonsingular and $(I - B)^{-1} = A^{-1}D \ge 0$, lemma 3.3-16 proves that $\rho(B) < 1$.

Next, we prove that the second statement implies the first. Since $A_{ii} > 0$ and the off-diagonal entries of $A$ are non-positive, $I - D^{-1}A \ge 0$. Since $\rho(I - D^{-1}A) < 1$, lemma 3.3-16 implies that $I - (I - D^{-1}A) = D^{-1}A$ is nonsingular and $(D^{-1}A)^{-1} \ge 0$. Since $D > 0$, it follows that $A$ is nonsingular and $A^{-1} = (D^{-1}A)^{-1} D^{-1} \ge 0$. □

Example 3.4-8: Consider the matrix $I + A\tau$ where

\[ A = \begin{bmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & \ddots & \ddots & \\ & & \ddots & 2 & -1 \\ & & & -1 & 1 \end{bmatrix} . \]

This matrix arises in the centered difference discretization of the heat equation with Neumann boundary conditions in one dimension. For any $\tau > 0$, $I + A\tau$ is strictly diagonally dominant with positive diagonal entries and non-positive off-diagonal entries. Thus for any $\tau > 0$, $I + A\tau$ is an M-matrix.

Lemma 3.4-9: [7, Lemma 6.4] If $A$ is a square matrix with positive diagonal entries and non-positive off-diagonal entries, then $A$ is an M-matrix if and only if there exists $x > 0$ so that $Ax > 0$.

Proof: To prove the first direction of the equivalence, suppose that $A$ is an M-matrix. Then by lemma 3.4-5 $A$ is nonsingular and $A^{-1}$ has nonnegative entries. Since $e \ge 0$, the definition of M-matrices implies that $x \equiv A^{-1}e \ge 0$. If $x$ does not have all positive entries, we will prove a contradiction. Suppose that $x_i = 0$. Then $0 = e_i^\top A^{-1} e = \sum_{j=1}^{n} (A^{-1})_{ij}$; since the entries of the inverse are all nonnegative, this implies that $e_i^\top A^{-1} = 0$. Since $A$ is nonsingular, this gives us a contradiction. As a result, we conclude that $x > 0$, and by construction $Ax = e > 0$.

Next, let us prove the other direction of the lemma. Suppose that $A$ is a square matrix with positive diagonal entries and non-positive off-diagonal entries, and $x > 0$ is such that $Ax > 0$. Let $D$ be the diagonal matrix whose diagonal entries are the entries of $x$, and let $B = AD$. Then $Be = ADe = Ax > 0$. Since $B$ has positive diagonal entries and non-positive off-diagonal entries, this implies that $B$ is strictly diagonally dominant. Lemma 3.4-6 now shows that $B$ is an M-matrix. It follows that $A = BD^{-1}$ is nonsingular and $A^{-1} = DB^{-1} \ge 0$. Lemma 3.4-5 implies that $A$ is an M-matrix. □

Example 3.4-10: Consider the matrix $A \in \mathbb{R}^{n \times n}$ where

\[ A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & \ddots & \ddots & \\ & & \ddots & 2 & -1 \\ & & & -1 & 1 \end{bmatrix} . \]

This matrix arises in the centered difference discretization of the heat equation with Dirichlet boundary condition on the left and Neumann boundary condition on the right in one dimension. As we saw in lemma 2.2-1, the vector $x$ with entries

\[ x_j = \sin \frac{j\pi}{2n + 1} , \quad \forall \, 1 \le j \le n \]

is an eigenvector of $A$ with eigenvalue

\[ \lambda = 4 \sin^2 \frac{\pi}{4n + 2} > 0 . \]

Since $x$ has all positive entries and $Ax = x\lambda$ has all positive entries, lemma 3.4-9 shows that $A$ is an M-matrix.
Lemma 3.4-11: [89, Theorem 3.25] Suppose that $A \in \mathbb{R}^{n \times n}$ is an M-matrix, and $C$ is any matrix obtained by replacing certain off-diagonal entries of $A$ by zero. Then $C$ is an M-matrix.

Proof: The diagonal parts of $A$ and $C$ are identical, and will be denoted by $D$. By the definition of $C$,

\[ 0 \le I - D^{-1}C \le I - D^{-1}A . \]

Since $A$ is an M-matrix, lemma 3.4-5 shows that $A$ is nonsingular and $A^{-1} \ge 0$. Lemma 3.4-7 now shows that $D > 0$ and $\rho(I - D^{-1}A) < 1$. Inequality (3.4) and lemma 3.3-15 imply that $\rho(I - D^{-1}C) < 1$. Lemma 3.4-7 now implies that $C$ is nonsingular and $C^{-1} \ge 0$. Finally, lemma 3.4-5 implies that $C$ is an M-matrix. □

Corollary 3.4-12: If $A$ is an M-matrix, then the diagonal entries of $A$ are positive.

Proof: Lemma 3.4-11 shows that the diagonal part $D$ of $A$ is an M-matrix. By the definition of an M-matrix, $Dx \ge 0$ implies that $x \ge 0$. Thus $D \ge 0$. Lemma 3.4-5 shows that $D$ is nonsingular, so we must have that its diagonal entries are positive. □

In developing the Cholesky factorization of a positive-definite matrix, it is common to show that the diagonal blocks of the matrix are also positive-definite, and the Schur complement is positive-definite. Here is a similar result for M-matrices.

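Lemma 3.4-13: [7, Lemma 6.9] Suppose that $A$ is an M-matrix, and that we have partitioned

\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \]

where $A_{11}$ is square. Then $A_{11}$ and $A_{22}$ are M-matrices. Further, if we factor

\[ A = \begin{bmatrix} I & 0 \\ A_{21} A_{11}^{-1} & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ 0 & S \end{bmatrix} \]

then the Schur complement $S = A_{22} - A_{21} A_{11}^{-1} A_{12}$ is an M-matrix.

Proof: Lemma 3.4-11 shows that the matrix

\[ C = \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix} \]

is an M-matrix. Lemma 3.4-5 now shows that $C$ is nonsingular, the off-diagonal entries of $C$ are all non-positive, and $C^{-1} \ge 0$. It follows that $A_{11}$ and $A_{22}$ must be nonsingular with non-positive off-diagonal entries. Since

\[ C^{-1} = \begin{bmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix} \]

we see that $A_{11}^{-1} \ge 0$ and $A_{22}^{-1} \ge 0$. Lemma 3.4-5 now implies that $A_{11}$ and $A_{22}$ are M-matrices.

Next, we will show that the Schur complement $S$ has non-positive off-diagonal entries. Note that

\[ e_i^\top S e_j = e_i^\top A_{22} e_j - e_i^\top A_{21} A_{11}^{-1} A_{12} e_j . \]

If $i \ne j$, then $e_i^\top A_{22} e_j \le 0$ since $A_{22}$ is an M-matrix. Recall that lemma 3.4-5 shows that $A_{11}$ is nonsingular, and that $A_{11}^{-1}$ has nonnegative entries. Since $A_{21}$ and $A_{12}$ are off-diagonal blocks of the M-matrix $A$, all of the entries of $e_i^\top A_{21}$ and $A_{12} e_j$ are non-positive. It follows that $e_i^\top S e_j$ is a non-positive number minus a non-negative number, so it must be non-positive.

Finally, we will assume that $Sy \ge 0$, and prove that $y \ge 0$. Define

\[ x = -A_{11}^{-1} A_{12} y . \]

Then

\[ A \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\ A_{21} A_{11}^{-1} & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ 0 & S \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\ A_{21} A_{11}^{-1} & I \end{bmatrix} \begin{bmatrix} A_{11} x + A_{12} y \\ Sy \end{bmatrix} = \begin{bmatrix} I & 0 \\ A_{21} A_{11}^{-1} & I \end{bmatrix} \begin{bmatrix} 0 \\ Sy \end{bmatrix} = \begin{bmatrix} 0 \\ Sy \end{bmatrix} \ge 0 . \]

Since $A$ is an M-matrix, by definition both $x \ge 0$ and $y \ge 0$. The latter implies that the Schur complement $S$ is an M-matrix. □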
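To make the Schur complement in lemma 3.4-13 concrete, the following sketch treats the simplest partition, in which $A_{11}$ is the $1 \times 1$ leading entry, so that $A_{11}^{-1}$ is a scalar reciprocal. The function name and the 3 by 3 strictly diagonally dominant matrix are assumptions for illustration; exact rational arithmetic is used so that the result can be compared with a hand computation.

```python
# Schur complement S = A22 - A21 * A11^{-1} * A12 with a 1 x 1 block A11.
# Function name and matrix are illustrative assumptions.
from fractions import Fraction

def schur_complement_first_unknown(A):
    n = len(A)
    a11 = Fraction(A[0][0])
    return [[Fraction(A[i][j]) - Fraction(A[i][0]) * Fraction(A[0][j]) / a11
             for j in range(1, n)] for i in range(1, n)]

# Strictly diagonally dominant with non-positive off-diagonal entries,
# hence an M-matrix by lemma 3.4-6.
A = [[4, -1, -1], [-1, 4, -1], [-1, -1, 4]]
S = schur_complement_first_unknown(A)
```

The result has diagonal entries 15/4 and off-diagonal entries -5/4, so it is again strictly diagonally dominant with non-positive off-diagonal entries, consistent with the lemma.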

Example 3.4-14: Suppose that we have a discretization of a steady-state equation on some domain $\Omega$, with corresponding linear system $Ax = b$. Suppose that we subdivide $\Omega$ into two domains $\Omega_1$ and $\Omega_2$ (possibly for distributed computing), with interface $I = \Omega_1 \cap \Omega_2$. We can subdivide the unknowns in the linear system into the set $x_1$ occurring in $\Omega_1$, $x_2$ occurring in $\Omega_2$, and $x_I$ lying in the interface $I$. Typically, we can reorder the equations and unknowns to get

\[ \begin{bmatrix} A_{11} & & A_{1I} \\ & A_{22} & A_{2I} \\ A_{I1} & A_{I2} & A_{II} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_I \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_I \end{bmatrix} . \]

We can eliminate $x_1$ and $x_2$ to get an equation for $x_I$ involving a Schur complement:

\[ (A_{II} - A_{I1} A_{11}^{-1} A_{1I} - A_{I2} A_{22}^{-1} A_{2I}) x_I = b_I - A_{I1} A_{11}^{-1} b_1 - A_{I2} A_{22}^{-1} b_2 . \]

If $A$ is an M-matrix, then lemma guarantees that the matrix in the linear system for the interface unknowns is also an M-matrix.

Lemma 3.4-15: [7, Lemma 6.11] Suppose that $A$ is symmetric positive-definite and has non-positive off-diagonal entries. Then $A$ is an M-matrix.

Proof: We need only show that $Ax \ge 0$ implies that $x \ge 0$. The result is obvious for $x = 0$, so we assume that $x \ne 0$. We can write $x = p - n$ where the vectors $p$ and $n$ are both nonnegative and complementary:

\[ \forall i , \quad p_i n_i = 0 . \]

Since $A$ is positive-definite,

\[ 0 \ge -n^\top A n = n^\top A (p - n) - n^\top A p . \]

Since $p$ and $n$ are complementary, $n, p \ge 0$ and the off-diagonal entries of $A$ are non-positive,

\[ n^\top A p = \sum_{i} \sum_{j} A_{ij} n_i p_j = \sum_{i} \sum_{j \ne i} A_{ij} n_i p_j \le 0 . \]

Since $Ax \ge 0$ and $n \ge 0$,

\[ n^\top A (p - n) = n^\top A x \ge 0 . \]

It follows that

\[ 0 \ge -n^\top A n = n^\top A (p - n) - n^\top A p \ge 0 . \]

Thus $n^\top A n = 0$. Since $A$ is positive-definite, we conclude that $n = 0$. Thus $x = p \ge 0$. □

Lemma 3.4-16: [7, Lemma 6.12] If $A \in \mathbb{R}^{n \times n}$ has non-positive off-diagonal entries, then $A$ is an M-matrix if and only if all of the eigenvalues of $A$ have positive real part.

Proof: Suppose that $A$ is an M-matrix. Then lemma 3.4-9 shows that there is a vector $x > 0$ so that $Ax > 0$. Let $D$ be the diagonal matrix whose diagonal entries are the entries of $x$. Note that $0 < D^{-1}Ax = D^{-1}ADe$, where the off-diagonal entries of $D^{-1}AD$ are non-positive. Thus $D^{-1}AD$ is strictly diagonally dominant with positive diagonal entries. The Gerschgorin circle theorem now implies that the real parts of the eigenvalues of $D^{-1}AD$ are all positive. Since this matrix is similar to $A$, the lemma is proved. □

3.5 Iterative Improvement

Suppose that we want to solve $Ax = b$. The iterative improvement algorithm for solving this equation approximates $x$ by $x_k$ where $x_k$ is updated by the iteration

\[ r_k = A x_k - b , \quad x_{k+1} = x_k - C r_k . \quad (3.5\mbox{-}1) \]

Here the preconditioner $C$ is some matrix that is convenient for computation.

Lemma 3.5-1: [43, Theorem 10.1-1] Suppose that $b \in \mathbb{R}^n$, and that $A, C \in \mathbb{R}^{n \times n}$ are such that the spectral radius of $I - CA$ satisfies $\rho(I - CA) < 1$. Then the iterative improvement algorithm (3.5-1) converges for any initial guess.

Proof: Let us denote the error in the approximation by $e_k = x_k - x$. Then $r_k = A x_k - A x = A e_k$, and the error after one step of the iteration is

\[ e_{k+1} = [x_k - C r_k] - x = e_k - CA e_k = [I - CA] e_k . \]

We can solve this linear recurrence to get $e_k = (I - CA)^k e_0$. Since we have assumed that $\rho(I - CA) < 1$, corollary 3.2-5 shows that $e_k \to 0$. □

Lemma 3.5-2: [7, Corollary 6.17] Suppose that $I - CA \ge 0$, $C$ is nonsingular and $C \ge 0$. Then the iterative improvement algorithm is convergent for any initial guess if and only if $A$ is nonsingular and $A^{-1} \ge 0$.

Proof: Suppose that $I - CA \ge 0$, $C$ is nonsingular, $C \ge 0$ and that iterative improvement is convergent for any initial guess. Because we have assumed that iterative improvement is unconditionally convergent, lemma 3.5-1 shows that $\rho(I - CA) < 1$. Corollary 3.2-7 implies that $CA = I - (I - CA)$ is nonsingular; since $I - CA \ge 0$, corollary 3.2-6 implies that

\[ (CA)^{-1} = \sum_{k=0}^{\infty} (I - CA)^k \ge 0 . \]

Since $C$ and $CA$ are nonsingular, $A$ is nonsingular. Finally, since $C \ge 0$ and $(CA)^{-1} \ge 0$,

\[ A^{-1} = (CA)^{-1} C \ge 0 . \]

This proves the first direction of the lemma.

To prove the converse, suppose that $I - CA \ge 0$, $A$ and $C$ are nonsingular, and both $A^{-1}$ and $C \ge 0$. For any matrix $B$, the geometric series satisfies

\[ \left[ \sum_{j=0}^{k} B^j \right] (I - B) = I - B^{k+1} . \]

Then if $B = I - CA$, it follows that $C = (I - B) A^{-1}$ and

\[ \left[ \sum_{j=0}^{k} (I - CA)^j \right] C = A^{-1} - (I - CA)^{k+1} A^{-1} . \]

Since both $I - CA \ge 0$ and $A^{-1} \ge 0$, it follows that for any $x > 0$,

\[ \left[ \sum_{j=0}^{k} (I - CA)^j \right] C x = A^{-1} x - (I - CA)^{k+1} A^{-1} x \le A^{-1} x . \]

Since $Cx > 0$, the geometric series is nonnegative and bounded above; therefore it converges. This implies that $(I - CA)^k \to 0$ as $k \to \infty$. □

Recall the following definition.

Definition 3.5-3: If ‖·‖ is any norm on R^n and A ∈ R^{n×n}, then the subordinate matrix norm is

\[ \|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \sup_{\|x\|=1} \|Ax\| . \]

Lemma 3.5-4: Suppose that ‖b‖ is a norm on b ∈ R^n, that ‖M‖ is a subordinate matrix norm for M ∈ R^{n×n}, and that A, C ∈ R^{n×n} are such that ‖I − CA‖ < 1. Then the iterative improvement algorithm converges. Further, given any 0 < ε < 1, the number of iterations of iterative improvement required to reduce the error by a factor of ε is at most log(ε)/log(‖I − CA‖).

Proof: If e is the error in the initial guess, the error after k iterations of iterative improvement satisfies

\[ \|(I - CA)^k e\| \le \|I - CA\|^k \|e\| . \]

It follows that if ‖I − CA‖ < 1, then the iteration converges. If

\[ \|I - CA\|^k \le \varepsilon \]

then the error after k iterations will be reduced by a factor of ε or less. Taking logarithms of this inequality, we get

\[ k \log(\|I - CA\|) \le \log \varepsilon , \]

which is easily solved for k to prove the lemma: since log(‖I − CA‖) < 0, dividing by it reverses the inequality and gives k ≥ log(ε)/log(‖I − CA‖). □
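As a small worked illustration of lemma 3.5-4, the following sketch computes the smallest integer k with ‖I − CA‖^k ≤ ε. The function name `iterations_needed` and the sample values 0.5 and 10^{-3} are assumptions made for the example.

```python
import math

def iterations_needed(contraction_norm, eps):
    """Smallest integer k with contraction_norm**k <= eps (lemma 3.5-4)."""
    if not (0.0 < contraction_norm < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("need 0 < contraction_norm < 1 and 0 < eps < 1")
    # both logarithms are negative, so the quotient is positive
    return math.ceil(math.log(eps) / math.log(contraction_norm))

# Hypothetical values: ||I - CA|| = 0.5 and a reduction factor of 1e-3.
k = iterations_needed(0.5, 1.0e-3)
```

With these values log(ε)/log(‖I − CA‖) is a little under ten, so ten iterations suffice.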


3.5.1 Richardson’s Iteration

Richardson’s iteration is iterative improvement with preconditioner C = I/µ, where µ is an appropriately chosen scalar. This algorithm can be written in the form

    do i = 1, n
      r(i) = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
    enddo
    do i = 1, n
      x(i) = x(i) − r(i)/µ
    enddo                                              (3.5-2)

Example 3.5-5: Suppose that we discretize a heat equation in space, and we use the forward Euler scheme to integrate in time. We obtain a discretization of the form

\[ \frac{u^{n+1} - u^n}{\Delta t} = -A u^n + b \]

where A represents the spatial discretization of the heat equation, and b represents the inhomogeneities in the boundary conditions, or heat sources within the body. This discretization can be rewritten in the form of a Richardson iteration

\[ u^{n+1} = u^n + (b - A u^n) \Delta t \]

in which µ = 1/Δt.

It is easy to compute the residual in the heat equation, by writing the difference equation for the scheme. For example, in two dimensions the i, j entry of the residual for the backward Euler method in the interior of the domain would be

\[ r_{i,j} = u^{n+1}_{i,j} - u^{n}_{i,j} + D \Delta t \, \frac{-u^{n+1}_{i+1,j} + 2 u^{n+1}_{i,j} - u^{n+1}_{i-1,j}}{\Delta x^2} + D \Delta t \, \frac{-u^{n+1}_{i,j+1} + 2 u^{n+1}_{i,j} - u^{n+1}_{i,j-1}}{\Delta y^2} . \]

Let us discuss conditions that will guarantee the convergence of Richardson’s iteration.

Lemma 3.5-6: Suppose that the eigenvalues λ of A all satisfy 0 < λmin < λ < λmax, and let µ > λmax/2. Then iterative improvement with preconditioner C = I/µ (i.e. Richardson’s iteration) converges from any initial guess.

Proof: The assumptions imply that λ − µ < µ and that µ − λ < µ for all eigenvalues λ of A; it follows that |λ − µ| < µ for all eigenvalues λ of A. This inequality can also be written

\[ |1 - \lambda/\mu| < 1 . \]

If x is an eigenvector of A with eigenvalue λ, then x is an eigenvector of I − A/µ with eigenvalue 1 − λ/µ. Thus the spectral radius of I − CA is strictly less than one, and lemma 3.5-1 guarantees that the iteration will converge from any initial guess. □

It is not hard to see that the smallest value for ρ(I − CA) is obtained by taking µ = ½(λmax + λmin). With this choice, if λ is any eigenvalue of A then

\[ \left| 1 - \frac{\lambda}{\mu} \right| \le \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa(A) - 1}{\kappa(A) + 1} . \]

For this choice of µ, components of the error corresponding to the smallest and largest eigenvalues will be reduced the least, and the components of the error due to the average eigenvalue will be reduced the most. If on the other hand we choose µ = λmax, then

\[ \left| 1 - \frac{\lambda}{\mu} \right| \le 1 - \frac{\lambda_{\min}}{\lambda_{\max}} = 1 - \frac{1}{\kappa(A)} . \]

With this choice of µ, we find that Richardson’s iteration will make the greatest reduction in the component of the error associated with the largest eigenvalue, and the smallest reduction in the component of the error associated with the smallest eigenvalue.

Figures 3.5-1 and 3.5-3 show results for Richardson’s iteration in solving

\[ \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

for µ = 4 ≈ λmax and µ = 2 ≈ ½(λmax + λmin), with the initial guess chosen to be uniformly distributed random in [0, 1]. Figures 3.5-2 and 3.5-4 show the reduction of the errors in these iterations. Note that the larger value of µ produces a smoother but less accurate solution than the smaller value of µ.
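The comparison between µ = 4 and µ = 2 can be reproduced qualitatively with the short sketch below. It is not the Fortran subroutine referenced in the text: the function names are hypothetical, the matrix is applied without being stored, n = 8 is an arbitrary small size, and a zero initial guess replaces the random one so that the run is reproducible.

```python
def richardson(n, mu, num_iters):
    """Richardson's iteration for tridiag(-1, 2, -1) x = e_1, starting from zero."""
    x = [0.0] * n
    for _ in range(num_iters):
        r = []
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            rhs = 1.0 if i == 0 else 0.0
            r.append(2.0 * x[i] - left - right - rhs)   # r(i) = (A x)(i) - b(i)
        x = [x[i] - r[i] / mu for i in range(n)]        # x(i) = x(i) - r(i)/mu
    return x

def max_error(x):
    """Maximum error against the exact solution x_i = (n + 1 - i)/(n + 1), i = 1..n."""
    n = len(x)
    return max(abs(x[i] - (n - i) / (n + 1.0)) for i in range(n))

err_mu4 = max_error(richardson(8, 4.0, 200))   # mu close to lambda_max
err_mu2 = max_error(richardson(8, 2.0, 200))   # mu close to (lambda_max + lambda_min)/2
```

After the same number of iterations the error for µ = 2 is smaller than the error for µ = 4, in line with the bounds (κ(A) − 1)/(κ(A) + 1) and 1 − 1/κ(A).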

A program to perform Richardson’s iteration (and other forms of iterative improvement) for Laplace’s equation in one dimension can be found in Program 3.5-7: IterativeImprovementMain.C. This program calls Fortran subroutine richardson in Program 3.5-8: iterative_improvement.f. Students can execute this program by clicking on Executable 3.5-9: iterative improvement. The executable selects random values between zero and one for the initial solution values, and plots the numerical solution for each iteration.

The Laplacian in two dimensions is solved in Program 3.5-10: IterativeError2D.C. This program calls Fortran subroutine richardson in Program 3.5-11: iterative_improvement_2d.m4.

http://www.math.duke.edu/~johnt/math226/iterative_methods/IterativeImprovementMain.C


http://www.math.duke.edu/~johnt/math226/iterative_methods/iterative_improvement.f


http://www5.math.duke.edu/cgi-bin/startvnc?run=iterative_methods_iterative_improvement

http://www.math.duke.edu/~johnt/math226/iterative_methods/IterativeError2D.C

http://www.math.duke.edu/~johnt/math226/iterative_methods/iterative_improvement_2d.m4


(a) 1 iteration (b) 10 iterations

(a) 100 iterations (b) 1000 iterations

Figure 3.5-1: Computed solution with Richardson’s iteration, µ = λmax.


Figure 3.5-2: Error in Richardson’s iteration, µ = λmax: log error versus iteration number




Figure 3.5-3: Computed solution with Richardson’s iteration, µ = ½(λmax + λmin).

Figure 3.5-4: Error in Richardson’s iteration, µ = ½(λmax + λmin): log error versus iteration number


This file is processed by the m4 macro processor to produce a Fortran .f file. Students can execute this program by clicking on Executable 3.5-12: iterative error. The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.

3.5.2 Jacobi Iteration

Let us write A = D − L − U, where D is diagonal, L is strictly lower triangular and U is strictly upper triangular. If D has nonzero diagonal entries, the Jacobi iteration chooses the preconditioner to be C = D^{-1}. With this choice, the iterative improvement algorithm takes the form

\[ r_k = A x_k - b , \qquad x_{k+1} = x_k - D^{-1} r_k . \]

This algorithm can be written in the form

    do i = 1, n
      r(i) = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
    enddo
    do i = 1, n
      x(i) = x(i) − r(i)/A(i, i)
    enddo                                              (3.5-3)

Note that all components of the residual are computed before any components of x are updated.
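A direct transcription of (3.5-3) into Python might look as follows. The function name `jacobi` and the strictly diagonally dominant 3×3 test system are assumptions chosen for illustration; the exact solution of the test system is (1, 1, 1).

```python
def jacobi(A, b, x0, num_iters):
    """Jacobi iteration: the whole residual is formed before x is updated."""
    n = len(b)
    x = list(x0)
    for _ in range(num_iters):
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        x = [x[i] - r[i] / A[i][i] for i in range(n)]
    return x

# Hypothetical test system, strictly diagonally dominant.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x_jacobi = jacobi(A, b, [0.0, 0.0, 0.0], 60)
```

Building the new list `x` only after `r` is complete is what distinguishes this from the Gauss-Seidel iteration, which overwrites entries of x as it goes.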

There are some practical circumstances under which the Jacobi iteration will converge.

Lemma 3.5-13: [43] If A is strictly diagonally dominant, then the Jacobi iteration converges for any initial guess.

Proof: Note that the spectral radius of I − D^{-1}A satisfies

\[ \rho(I - D^{-1}A) \le \|D^{-1}(D - A)\|_\infty = \max_i \sum_{j \ne i} |A_{ij}/A_{ii}| < 1 . \]

Then lemma 3.5-1 proves that the iteration converges. □

Lemma 3.5-14: [7, Corollary 6.17] Suppose that A is an M-matrix. Then the Jacobi iteration converges for any initial guess.

http://www5.math.duke.edu/cgi-bin/startvnc?run=iterative_methods_iterative_error_2d


Proof: In this case, C = D^{-1} is the inverse of the diagonal of A; since A has positive diagonal entries, C is nonsingular and nonnegative. Furthermore, since the off-diagonal entries of A are nonpositive, D − A = L + U ≥ 0. It follows that I − D^{-1}A = D^{-1}(D − A) ≥ 0. Then lemma 3.5-2 with C = D^{-1} shows that iterative improvement converges. □

Example 3.5-15: For the centered differences with Crank-Nicolson integration of the heat equation, the entries of I − D^{-1}A are either zero or τ/(1 + 2Dτ) (except near boundaries), where τ = dΔt/Δx² is the decay number and D is the number of spatial dimensions. Thus the Gerschgorin circle theorem indicates that the spectral radius satisfies

\[ \rho(I - D^{-1}A) \le \frac{2D\tau}{1 + 2D\tau} = \frac{1}{1 + \frac{1}{2D\tau}} . \]

Note that 1/(1 + x) ≤ 1 − x/2 for 0 < x < 1 and that log is strictly increasing. If we choose Δt = O(Δx), then for Δx sufficiently small,

\[ -\log(\rho(I - D^{-1}A)) \ge -\log\left( \frac{1}{1 + \frac{1}{2D\tau}} \right) \approx \frac{1}{2D\tau} = O(\Delta x) . \]

Although −log ρ(I − D^{-1}A) is about twice as large as the value for Richardson’s iteration, the number of iterations required for convergence is of the same order. We do not expect Jacobi iteration to make implicit schemes for the heat equation competitive with explicit schemes.

In figure 3.5-5 we show the computed solution for the Jacobi iteration for the same problem as in figures 3.5-1 and 3.5-3. Note that the error behaves the same as in the Richardson iteration with µ = ½(λmax + λmin). This is because in this example, the diagonal of A is constant, with value very nearly equal to ½(λmax + λmin).

A program to perform Jacobi’s iteration can be found in Program 3.5-16: IterativeImprovementMain.C. This program calls a Fortran subroutine jacobi in Program 3.5-17: iterative_improvement.f. A related algorithm, typically used for smoothing in multigrid, is given by jacobi omega, also in iterative_improvement.f. Students can execute this program by clicking on Executable 3.5-18: iterative improvement.

The Laplacian in two dimensions is solved in Program 3.5-19: IterativeError2D.C. This program calls Fortran subroutine jacobi omega in Program 3.5-20: iterative_improvement_2d.m4. Students can execute this program by clicking on Executable 3.5-21: iterative error. The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.












Figure 3.5-5: Computed solution with Jacobi iteration


Figure 3.5-6: Error in Jacobi iteration: log error versus iteration number


3.5.3 Gauss-Seidel Iteration

It is natural to modify the Jacobi iteration to use the new information as soon as it becomes available. The componentwise equations in the resulting Gauss-Seidel iteration are

\[ x_i \leftarrow x_i - \Bigl( \sum_{j=1}^{n} A_{ij} x_j - b_i \Bigr) / A_{ii} . \]

This algorithm takes the form

    do i = 1, n
      r = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
      x(i) = x(i) − r/A(i, i)
    enddo                                              (3.5-4)

Note that the entries of the right-hand side are computed using the current entries of x, and that r does not have to be stored as a vector.
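The loop (3.5-4) can be sketched in Python as below. The function name `gauss_seidel` and the 3×3 test system are illustrative assumptions; the system is the same hypothetical diagonally dominant example with exact solution (1, 1, 1).

```python
def gauss_seidel(A, b, x0, num_sweeps):
    """Gauss-Seidel: each x[i] is overwritten as soon as its residual is known."""
    n = len(b)
    x = list(x0)
    for _ in range(num_sweeps):
        for i in range(n):
            # scalar residual for row i, using the newest values of x
            r = sum(A[i][j] * x[j] for j in range(n)) - b[i]
            x[i] -= r / A[i][i]
    return x

# Hypothetical test system.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x_gs = gauss_seidel(A, b, [0.0, 0.0, 0.0], 30)
```

Only the scalar `r` is needed inside the sweep, which is the storage saving noted in the text.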

Another way to write the Gauss-Seidel iteration is to split the matrix A into its diagonal, strictly lower and strictly upper triangular parts:

\[ A = D - L - U . \]

Then Gauss-Seidel chooses C = (D − L)^{-1}. This means that the iteration will converge if and only if ρ(I − CA) = ρ((D − L)^{-1}U) < 1.

Lemma 3.5-22: [7, p. 231] If A is an M-matrix, then Gauss-Seidel iteration converges for any initial guess.

Proof: First, we will show that D − L is an M-matrix. To see this, note that corollary 3.4-12 shows that the diagonal entries of A are positive. Since D − L is lower-triangular with positive diagonal entries and non-positive entries below the diagonal, the forward substitution algorithm shows that C = (D − L)^{-1} ≥ 0. Alternatively, note that

\[ C = (D - L)^{-1} = \left[ I - D^{-1}L \right]^{-1} D^{-1} = \Bigl[ \sum_{k=0}^{\infty} (D^{-1}L)^k \Bigr] D^{-1} \ge 0 . \]

Also note that I − CA = (D − L)^{-1}U ≥ 0. This shows that the hypotheses of lemma 3.5-2 have been satisfied, and the convergence of the Gauss-Seidel iteration is proved. □

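Lemma 3.5-23: [43, Theorem 10.1-2] If A is symmetric and positive definite, then the Gauss-Seidel iteration converges for any initial guess.

Proof: Since A is symmetric, U = L^T, and lemma 3.5-1 shows that we need only prove that ρ((D − L)^{-1}L^T) < 1. Since A is positive definite, D is positive definite. It suffices to study the spectral radius of the similar matrix

\[ D^{1/2} (D - L)^{-1} L^\top D^{-1/2} = \left( I - D^{-1/2} L D^{-1/2} \right)^{-1} \left( D^{-1/2} L D^{-1/2} \right)^\top = -(I + M)^{-1} M^\top , \qquad M \equiv -D^{-1/2} L D^{-1/2} . \]

Suppose that −(I + M)^{-1}M^T x = xλ with ‖x‖_2 = 1; then −M^T x = (I + M)xλ, and so

\[ -x^H M^\top x = \left( 1 + x^H M x \right) \lambda , \]

where x^H denotes the conjugate transpose of the (possibly complex) eigenvector x. Let x^H M x = a + bi ∈ C; since M is real, x^H M^T x = a − bi. Since D^{-1/2} A D^{-1/2} = I + M + M^T is positive definite,

\[ 0 < x^H D^{-1/2} A D^{-1/2} x = x^H (I + M + M^\top) x = 1 + 2a . \]

We can solve for the eigenvalue λ to get

\[ |\lambda|^2 = \left| \frac{-a + bi}{1 + a + bi} \right|^2 = \frac{a^2 + b^2}{1 + 2a + a^2 + b^2} < 1 . \]

□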

It is somewhat harder to estimate the spectral radius of I − CA for Gauss-Seidel iteration. Axelsson [7] claims that for matrices such as those that arise from discretization of the heat equation, −log ρ(I − CA) is about twice as large for Gauss-Seidel iteration as it is for Jacobi iteration. Although this means that Gauss-Seidel iteration would take about half as many iterations as Jacobi to converge, the number of iterations is still too large to make implicit integration of the heat equation competitive with explicit time integration.

In figure 3.5-7 we show the computed solution for the Gauss-Seidel iteration for the same problem as in figures 3.5-1, 3.5-3 and 3.5-5. Note that the error smooths quickly, as in the Richardson iteration with µ = λmax. However, the error also reduces more quickly than in either Richardson’s iteration or the Jacobi iteration.

There are several variants of the Gauss-Seidel iteration. In some cases, the order of the traversal of the unknowns is reversed:

    do i = 1, n
      r = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
      x(i) = x(i) − r/A(i, i)
    enddo
    do i = n, 1, −1
      r = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
      x(i) = x(i) − r/A(i, i)
    enddo                                              (3.5-5)




Figure 3.5-7: Computed solution with Gauss-Seidel iteration


Figure 3.5-8: Error in Gauss-Seidel iteration: log error versus iteration number


This helps to remove the bias toward one end of the problem domain. Another variant is to cycle through the unknowns in “red-black” ordering, for regular grids that can be related to a checkerboard layout. The red unknowns could be processed first, followed by the black unknowns.
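In one dimension the red-black ordering reduces to visiting the even-numbered unknowns and then the odd-numbered ones. The sketch below is a hypothetical illustration for the tridiagonal model problem tridiag(−1, 2, −1) x = e_1; it is not the subroutine referenced in the text, and the names and the size n = 8 are assumptions.

```python
def red_black_sweep(x, rhs):
    """One red-black Gauss-Seidel sweep for tridiag(-1, 2, -1) x = rhs, in place."""
    n = len(x)
    for start in (0, 1):                 # "red" (even) points first, then "black" (odd)
        for i in range(start, n, 2):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            r = 2.0 * x[i] - left - right - rhs[i]
            x[i] -= r / 2.0              # diagonal entry of the matrix is 2
    return x

n = 8
rhs = [1.0] + [0.0] * (n - 1)
x_rb = [0.0] * n
for _ in range(400):
    red_black_sweep(x_rb, rhs)
rb_error = max(abs(x_rb[i] - (n - i) / (n + 1.0)) for i in range(n))
```

Within one color the updates do not depend on each other, since each red unknown is coupled only to black neighbors and vice versa.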

A program to perform Gauss-Seidel iteration can be found in Program 3.5-24: IterativeImprovementMain.C. This program calls a Fortran subroutine gauss seidel to fro in Program 3.5-25: iterative_improvement.f. A related algorithm is given by gauss seidel red black. Students can execute this program by clicking on Executable 3.5-26: iterative improvement.

The Laplacian in two dimensions is solved in Program 3.5-27: IterativeError2D.C. This program calls Fortran subroutine gauss seidel to fro in Program 3.5-28: iterative_improvement_2d.m4. Students can execute this program by clicking on Executable 3.5-29: iterative error. The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.

3.5.4 Successive Over-Relaxation

It is common to modify the Gauss-Seidel iteration by including a relaxation parameter. The residual in the midst of the Gauss-Seidel iteration is

\[ r = -L x_{k+1} + (D - U) x_k - b \]

and relaxation of the Gauss-Seidel iteration would take

\[ x_{k+1} = x_k - D^{-1} r \omega \]

for some scalar ω. The resulting algorithm takes the form

    do i = 1, n
      r = Σ_{j=1}^{n} A(i, j) x(j) − b(i)
      x(i) = x(i) − ω r/A(i, i)
    enddo

Again, we use the new values of x_i as soon as they are computed. The term over-relaxation comes from the fact that the optimal value of the relaxation parameter ω will turn out to be greater than one.
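The effect of the relaxation parameter can be seen in a small experiment on the model problem tridiag(−1, 2, −1) x = e_1. The sketch below is illustrative only: the function names, the size n = 20, the sweep count and the value ω = 1.7 are assumptions, and ω = 1.7 is simply a convenient over-relaxed value rather than a computed optimum.

```python
def sor_sweep(x, rhs, omega):
    """One SOR sweep for tridiag(-1, 2, -1) x = rhs, in place."""
    n = len(x)
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r = 2.0 * x[i] - left - right - rhs[i]
        x[i] -= omega * r / 2.0          # x(i) = x(i) - omega r / A(i,i)
    return x

def sor_error(n, omega, num_sweeps):
    """Maximum error after num_sweeps sweeps, starting from the zero vector."""
    rhs = [1.0] + [0.0] * (n - 1)
    x = [0.0] * n
    for _ in range(num_sweeps):
        sor_sweep(x, rhs, omega)
    return max(abs(x[i] - (n - i) / (n + 1.0)) for i in range(n))

err_gs = sor_error(20, 1.0, 200)    # omega = 1 is Gauss-Seidel
err_sor = sor_error(20, 1.7, 200)   # illustrative over-relaxation
```

After the same number of sweeps, the over-relaxed iteration has a much smaller error than Gauss-Seidel on this problem.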

In matrix-vector form, the SOR iteration can be written

\[ D x_{k+1} = D x_k - (D x_k - L x_{k+1} - U x_k - b) \omega , \]

from which it follows that

\[ (D - L\omega) x_{k+1} = (D - L\omega) x_k - (D x_k - L x_k - U x_k - b) \omega , \]

which is equivalent to

\[ x_{k+1} = x_k - (D - L\omega)^{-1} (A x_k - b) \omega . \]

Similarly, the error e_k = x_k − x satisfies

\[ D e_{k+1} = D e_k - (D e_k - L e_{k+1} - U e_k) \omega , \]

or

\[ e_{k+1} = (D - L\omega)^{-1} [D(1 - \omega) + U\omega] e_k . \]

This suggests that we should study the eigenvalues of

\[ G_\omega \equiv (D - L\omega)^{-1} [D(1 - \omega) + U\omega] . \tag{3.5-6} \]

Our first result concerning the SOR iteration is too restrictive to be useful, because we already know from lemma 3.5-22 that Gauss-Seidel (i.e. SOR with ω = 1) converges under these circumstances.

Lemma 3.5-30: [7, p. 232] Suppose that A is an M-matrix and 0 ≤ ω ≤ 1. Then the iteration

\[ x_i \leftarrow x_i + \omega \Bigl( b_i - \sum_{j=1}^{n} A_{ij} x_j \Bigr) / A_{ii} \]

converges for any initial guess.

Proof: Note that since A is monotone, A = D − L − U where D is diagonal with positive diagonal entries, and L and U are nonnegative and strictly lower or upper triangular. It follows that (D − Lω)^{-1} ≥ 0, and D(1 − ω) + Uω ≥ 0 for 0 ≤ ω ≤ 1. Lemma 3.5-2 now shows that the iteration converges for any initial guess. □

The next lemma provides a restriction on the useful relaxation parameters.

Lemma 3.5-31: [7, Theorem 6.32] Suppose that G_ω is defined by (3.5-6), where D is diagonal and nonsingular, L is strictly lower triangular and U is strictly upper triangular. Then

\[ \rho(G_\omega) \ge |\omega - 1| , \]

so the SOR iteration diverges if ω < 0 or if ω > 2.

Proof: Since the determinant of G_ω is the product of its eigenvalues λ_i,

\[ \prod_{i=1}^{n} \lambda_i = \det\left\{ (I - D^{-1}L\omega)^{-1} [I(1 - \omega) + D^{-1}U\omega] \right\} = \det (I - D^{-1}L\omega)^{-1} \, \det [I(1 - \omega) + D^{-1}U\omega] = \det [I(1 - \omega) + D^{-1}U\omega] = (1 - \omega)^n . \]

It follows that at least one eigenvalue must satisfy |λ_i| ≥ |1 − ω|. □

Wachspress [91] discusses ways to find the optimal relaxation parameter ω. These approaches are seldom used nowadays, due to the development of other iterative linear solvers.

Note that for a constant-coefficient heat equation, we need only compute the optimal SOR relaxation factor once, and use it for all timesteps. Let J = D^{-1}(L + L^T) be the Jacobi iteration matrix; then the errors in the Jacobi iteration satisfy e_{k+1} = Je_k. If ρ(J) ≈ 1 − ε, then the optimal SOR relaxation factor satisfies

\[ \rho(G_\omega) \approx 1 - \sqrt{8\varepsilon} . \]

For the heat equation, we used the Gerschgorin circle theorem to estimate

\[ \rho(J) \le \frac{1}{1 + \frac{1}{2D\tau}} . \]

With the choice Δt = O(Δx), we found that

\[ -\log(\rho(J)) \ge 1 - \frac{1}{1 + \frac{1}{2D\tau}} \approx \frac{1}{2D\tau} = O(\Delta x) . \]

Thus

\[ -\log(\rho(G_\omega)) \approx \sqrt{8 \, \frac{1}{2D\tau}} = O(\sqrt{\Delta x}) . \]

This indicates that SOR iteration with the optimal relaxation factor can lead implicit iteration to a lower order of work than explicit time integration. However, we will find ways to solve the linear systems even faster.

In figure 3.5-9 we show the spectral radius for the SOR iteration as a function of the over-relaxation factor, for the same problem as in figures 3.5-2 and 3.5-4. Note that the spectral radius is very sensitive to the choice of ω near the optimal value. In figure 3.5-10 we show the computed solution for ω = 1.94, which is close to the optimal value. Note that the solution converges rapidly, but is not smoothed rapidly. The error for this iteration is shown in figure 3.5-11.

A program to perform SOR iteration can be found in Program 3.5-32: SORMain.C. This program calls a Fortran subroutine sor in Program 3.5-33: iterative_improvement.f.

http://www.math.duke.edu/~johnt/math226/iterative_methods/SORMain.C



Figure 3.5-9: Spectral Radius in SOR iteration: log radius versus over-relaxation factor ω




Figure 3.5-10: Computed solution with Gauss-Seidel iteration, ω = 1.94.


Figure 3.5-11: Error in Gauss-Seidel iteration, ω = 1.94: log error versus iteration number


The SORMain.C main program can plot the spectral radius of the iteration versus the over-relaxation factor. Students can execute this program by clicking on Executable 3.5-34: iterative improvement.

The Laplacian in two dimensions is solved in Program 3.5-35: IterativeError2D.C. This program calls Fortran subroutine sor in Program 3.5-36: iterative_improvement_2d.m4. Students can execute this program by clicking on Executable 3.5-37: iterative error. The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.

3.5.5 Termination Criteria for Iterative Methods

Typically, when we solve Ax = b numerically, we iterate until we believe that the numerical solution is close to the true solution. The following estimate from linear algebra can help us with this decision.

Lemma 3.5-38: Suppose that A ∈ R^{n×n} is nonsingular, b ∈ R^n is nonzero and x ∈ R^n solves Ax = b. Given any approximate solution x̃ ∈ R^n, its relative error satisfies

\[ \frac{\|\tilde{x} - x\|}{\|x\|} \le \kappa(A) \frac{\|A\tilde{x} - b\|}{\|b\|} \]

where κ(A) = ‖A‖‖A^{-1}‖ is the condition number of A.

Proof: Note that x̃ − x = A^{-1}(Ax̃ − b), so ‖x̃ − x‖ ≤ ‖A^{-1}‖‖Ax̃ − b‖. Since ‖b‖ = ‖Ax‖ ≤ ‖A‖‖x‖, we see that ‖x‖ ≥ ‖b‖/‖A‖, and conclude that

\[ \frac{\|\tilde{x} - x\|}{\|x\|} \le \frac{\|A^{-1}\| \|A\tilde{x} - b\|}{\|b\| / \|A\|} = \kappa(A) \frac{\|A\tilde{x} - b\|}{\|b\|} . \]

□

This lemma suggests that for well-conditioned linear systems, we can terminate an iterative method when the residual becomes small relative to the right-hand side of the linear system. If the linear system is not well-conditioned, relatively small residuals may not imply small relative errors in the numerical solution. For example, if we want to solve a linear system with a condition number of 10^3 within a relative error of ε, we could stop when

\[ \|A\tilde{x} - b\| / \|b\| \le 10^{-3} \varepsilon \]

and be assured of the desired relative error in the numerical solution. In this case, ε should be at least as small as the local truncation error in the scheme that generated the linear system, and 10^{-3}ε should be no smaller than the machine precision.
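The bound in lemma 3.5-38 can be checked on a tiny example. The sketch below uses the infinity norm; the 2×2 matrix, the approximate solution and the helper names are hypothetical, and the condition number 3 is worked out by hand from ‖A‖_∞ = 3 and ‖A^{-1}‖_∞ = 1.

```python
def inf_norm(v):
    """Infinity norm of a vector."""
    return max(abs(vi) for vi in v)

def relative_residual(A, x, b):
    """||A x - b|| / ||b|| in the infinity norm."""
    n = len(b)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    return inf_norm(r) / inf_norm(b)

# Hypothetical data: A^{-1} = (1/3) [[2, 1], [1, 2]], so kappa_inf(A) = 3 * 1 = 3.
A = [[2.0, -1.0], [-1.0, 2.0]]
b = [1.0, 0.0]
x_exact = [2.0 / 3.0, 1.0 / 3.0]
x_approx = [0.7, 0.3]
kappa = 3.0

rel_res = relative_residual(A, x_approx, b)
rel_err = inf_norm([x_approx[i] - x_exact[i] for i in range(2)]) / inf_norm(x_exact)
bound = kappa * rel_res
```

Here the relative residual is 0.1 and the relative error is 0.05, which is below the bound κ(A) · 0.1 = 0.3; a stopping test would compare `rel_res` against ε/κ(A).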







Stopping based on the change in the solution is trickier.

Lemma 3.5-39: Suppose that for A, C ∈ R^{n×n} and b, x_0 ∈ R^n we have an iteration of the form x_{k+1} = x_k − C(Ax_k − b). If ‖I − CA‖ ≤ ρ < 1, then

\[ \|x_{k+1} - x\| \le \frac{\rho}{1 - \rho} \|x_{k+1} - x_k\| . \]

Proof: As we have seen previously in lemma 3.5-1,

\[ x_{k+1} - x = (I - CA)(x_k - x) . \]

It follows that

\[ \begin{aligned} (I - CA)(x_k - x_{k+1}) &= (I - CA)(x_k - x) + (I - CA)(x - x_{k+1}) \\ &= [x_{k+1} - x] + [x - CAx - x_{k+1} + CAx_{k+1}] \\ &= CA(x_{k+1} - x) . \end{aligned} \]

Next, we note that corollary 3.2-6 implies that

\[ (CA)^{-1} = [I - (I - CA)]^{-1} = \sum_{j=0}^{\infty} (I - CA)^j . \]

As a result,

\[ \|(CA)^{-1}\| \le \sum_{j=0}^{\infty} \|(I - CA)^j\| \le \sum_{j=0}^{\infty} \rho^j = \frac{1}{1 - \rho} . \]

Putting our results together, we obtain

\[ \|x_{k+1} - x\| = \|(CA)^{-1}[CA(x_{k+1} - x)]\| = \|(CA)^{-1}(I - CA)(x_k - x_{k+1})\| \le \|(CA)^{-1}\| \|I - CA\| \|x_k - x_{k+1}\| \le \frac{\rho}{1 - \rho} \|x_k - x_{k+1}\| . \]

□

This lemma indicates that if we could estimate ‖I − CA‖, then we could use this estimate to safely terminate an iterative improvement algorithm based on changes in the solution. For example, if ‖I − CA‖ = 1 − 10^{-2} and we want an absolute error in the solution of at most ε, then we could stop when the change in the solution is at most 10^{-2}ε, since in this case ρ/(1 − ρ) = 99 < 10^2.
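A stopping test based on lemma 3.5-39 might be sketched as follows for the Jacobi iteration. The function name, the 3×3 test system and the tolerance are assumptions; for this hypothetical matrix ‖I − D^{-1}A‖_∞ = 0.5, which is used as ρ.

```python
def jacobi_until_small_change(A, b, x0, rho, eps, max_iters=10000):
    """Jacobi iteration, stopped when rho/(1 - rho) * ||x_{k+1} - x_k||_inf <= eps."""
    n = len(b)
    x = list(x0)
    for k in range(1, max_iters + 1):
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        x_new = [x[i] - r[i] / A[i][i] for i in range(n)]
        change = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if rho / (1.0 - rho) * change <= eps:
            return x, k
    raise RuntimeError("iteration did not meet the stopping test")

# Hypothetical test system with exact solution (1, 1, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
eps = 1.0e-8
x_stop, num_iters = jacobi_until_small_change(A, b, [0.0, 0.0, 0.0], 0.5, eps)
final_error = max(abs(xi - 1.0) for xi in x_stop)
```

The lemma guarantees that the error at termination is no larger than `eps` in the infinity norm, which is the norm used for both ρ and the change.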

Exercises 3.5

1. Formulate explicit time integration for the heat equation as an iterative improvement iteration for the steady-state heat equation.

2. Consider the one-dimensional heat equation

\[ \begin{aligned} \frac{\partial u}{\partial t} &= \frac{\partial^2 u}{\partial x^2} , \quad x \in (0, 1) , \; t > 0 \\ u(0, t) &= 1 , \quad u(1, t) = 0 , \quad t > 0 \\ u(x, 0) &= \begin{cases} 1 , & x < \tfrac{1}{2} \\ 0 , & x > \tfrac{1}{2} \end{cases} \end{aligned} \]

(a) Write a program to compute the analytical solution to this problem.

(b) Program centered differences using forward Euler for this problem. Choose the timestep Δt so that the scheme is positive.

(c) Program centered differences and backward Euler for this problem. Use Gauss-Seidel iteration to solve the linear equations, and the solution at the previous timestep as the initial guess.

(d) Discuss strategies for choosing Δt for the implicit scheme.

(e) Discuss strategies for terminating the Gauss-Seidel iteration in the implicit scheme.

(f) For Δx = 10^{-1}, 10^{-2}, . . . , 10^{-6}, plot the logarithm of the maximum error in these two numerical methods at t = 1 versus the logarithm of the computer time.

3. Consider the two-dimensional Laplace equation

\[ \begin{aligned} \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} &= 0 , \quad 0 < x, y < 1 \\ u(0, y) &= 0 = u(1, y) , \quad 0 < y < 1 \\ u(x, 0) &= 0 , \quad u(x, 1) = \sin(\pi x) \sinh \pi , \quad 0 < x < 1 \end{aligned} \]

(a) Use separation of variables to compute the analytical solution to this problem.

(b) Program centered differences for this problem, in order to compute the residual in Gauss-Seidel.

(c) Use Gauss-Seidel iteration to solve the linear equations, using zero for the initial guess.

(d) For Δx = Δy = 2^{-1}, 2^{-2}, . . . , 2^{-7}, plot minus the logarithm of the maximum error in the numerical solution on the vertical axis, versus logarithm of the computer time required to perform 1/(ΔxΔy) iterations on the horizontal axis.

(e) Assume that the smallest eigenvalue of the discrete Laplacian accurately approximates the smallest eigenvalue of the true Laplacian. Use your results from separation of variables to estimate the smallest eigenvalue of the true Laplacian, and use the Gerschgorin circle theorem to estimate the largest eigenvalue of the discrete Laplacian. Combine these to get an estimate of the condition number of the discrete Laplacian.

(f) Normally, as we refine the mesh, we expect the discretization of the Laplacian to become more accurate. In order to achieve a more accurate solution in solving the linear system, we would need to use smaller tolerances in our error estimates as we refine the mesh. Suppose that the error in the exact solution of the linear system for the discrete Laplacian is ε(Δx) = 4Δx². Use your estimate of the condition number of the discrete Laplacian as a function of Δx to determine what relative error tolerance we should place on the residual in order that the error in the Gauss-Seidel iteration is no larger than ε(Δx).

4. Read about how to choose the optimal relaxation parameter for SOR in Wachspress [91], and describe how you would apply this approach to implicit centered differences for the heat equation.

5. Read about Chebyshev acceleration of iterative methods in [43] and [91]. Describe the basic algorithm, and the modifications that must be made to SOR so that we can apply Chebyshev acceleration.


3.6 Incomplete Factorization

3.6.1 Block Factorization of Block Tridiagonal Matrices

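Recall from section 2.6.1 that the 2D implicit centered difference (or Crank-Nicolson) discretization of the heat equation on an n_1 × n_2 grid leads to a linear system with block tridiagonal form

\[ A = \begin{bmatrix} A_1 & B_1 & & 0 \\ B_1^\top & A_2 & \ddots & \\ & \ddots & \ddots & B_{n_2-1} \\ 0 & & B_{n_2-1}^\top & A_{n_2} \end{bmatrix} . \]

Each of the matrices A_j and B_j is an n_1 × n_1 banded matrix with small band width; in this case, the nonzero entries in each row occur either on the diagonal or on the first sub- or super-diagonal.

We could factor

\[ \begin{bmatrix} A_1 & B_1 & & 0 \\ B_1^\top & A_2 & \ddots & \\ & \ddots & \ddots & B_{n_2-1} \\ 0 & & B_{n_2-1}^\top & A_{n_2} \end{bmatrix} = \begin{bmatrix} D_1 & & & \\ B_1^\top & D_2 & & \\ & \ddots & \ddots & \\ 0 & & B_{n_2-1}^\top & D_{n_2} \end{bmatrix} \begin{bmatrix} D_1 & & & \\ & D_2 & & \\ & & \ddots & \\ & & & D_{n_2} \end{bmatrix}^{-1} \begin{bmatrix} D_1 & B_1 & & 0 \\ & D_2 & \ddots & \\ & & \ddots & B_{n_2-1} \\ & & & D_{n_2} \end{bmatrix} \]

by computing the diagonal blocks D_j recursively as in the following algorithm:

    D_1 = A_1
    for 1 ≤ j < n_2
      D_{j+1} = A_{j+1} − B_j^T D_j^{-1} B_j

Note that this factorization is similar to the LDL^T factorization. Afterward, we could solve Ax = b by forward- and back-solving. First, we solve

\[ \begin{bmatrix} D_1 & & & \\ B_1^\top & D_2 & & \\ & \ddots & \ddots & \\ 0 & & B_{n_2-1}^\top & D_{n_2} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n_2} \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n_2} \end{bmatrix} \]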


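by performing

    solve D_1 z_1 = y_1 ≡ b_1
    for 1 ≤ j < n_2
      compute u_{j+1} = b_{j+1} − B_j^T z_j
      solve D_{j+1} z_{j+1} = u_{j+1}

Next, we easily solve

\[ \begin{bmatrix} D_1 & & & \\ & D_2 & & \\ & & \ddots & \\ & & & D_{n_2} \end{bmatrix}^{-1} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_2} \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n_2} \end{bmatrix} \]

to get y_j = D_j z_j for 1 ≤ j ≤ n_2. Finally, we could solve

\[ \begin{bmatrix} D_1 & B_1 & & 0 \\ & D_2 & \ddots & \\ & & \ddots & B_{n_2-1} \\ & & & D_{n_2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n_2} \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_2} \end{bmatrix} \]

by performing

    solve D_{n_2} x_{n_2} = y_{n_2}
    for n_2 > j ≥ 1
      compute w_j = y_j − B_j x_{j+1}
      solve D_j x_j = w_j

Here the block that couples x_{j+1} to row j of the upper triangular factor is B_j. We can save some work by combining some of the operations:

    solve D_1 z_1 = b_1                (store z_j in the same work array for all j)
    for 1 ≤ j < n_2
      u = b_{j+1} − B_j^T z_j          (store u_{j+1} in same work array for all j)
      solve D_{j+1} z_{j+1} = u
    x_{n_2} = z_{n_2}                  (x_j replaces y_j)
    for n_2 > j ≥ 1
      u = B_j x_{j+1}
      solve D_j w = u                  (w_j stored in u)
      x_j = z_j − w

The difficulty with this approach is that the matrices D_j cannot be represented by sparse storage. Thus the computation of the diagonal blocks involves work that is proportional to n_1^3, and the computation of z_j and x_j involves work that is proportional to n_1^2. In addition, the algorithm would require significantly more storage than that needed for the matrix A.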
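The structure of the combined algorithm is easiest to see in the special case of 1 × 1 blocks, where each D_j, A_j and B_j is a scalar and the method reduces to a symmetric tridiagonal solve. The sketch below covers only that scalar case; the function name and the test system are assumptions, and the general block version would replace each division by a solve with D_j.

```python
def solve_symmetric_tridiagonal(a, c, b):
    """Solve with diagonal a[0..n-1], off-diagonal c[0..n-2] and right-hand side b."""
    n = len(a)
    d = [a[0]]                                        # D_1 = A_1
    for j in range(n - 1):
        d.append(a[j + 1] - c[j] * c[j] / d[j])       # D_{j+1} = A_{j+1} - B_j^T D_j^{-1} B_j
    z = [b[0] / d[0]]                                 # solve D_1 z_1 = b_1
    for j in range(n - 1):
        z.append((b[j + 1] - c[j] * z[j]) / d[j + 1]) # D_{j+1} z_{j+1} = b_{j+1} - B_j^T z_j
    x = [0.0] * n
    x[n - 1] = z[n - 1]                               # x_{n_2} = z_{n_2}
    for j in range(n - 2, -1, -1):
        x[j] = z[j] - c[j] * x[j + 1] / d[j]          # x_j = z_j - D_j^{-1} B_j x_{j+1}
    return x

# Hypothetical test: tridiag(-1, 2, -1) x = e_1 with n = 5, exact x_i = (6 - i)/6 for i = 1..5.
x_tri = solve_symmetric_tridiagonal([2.0] * 5, [-1.0] * 4, [1.0, 0.0, 0.0, 0.0, 0.0])
```

In the scalar case every D_j is a single number, so none of the fill-in discussed in the text occurs; the cost concern arises only when the D_j are dense n_1 × n_1 blocks.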


3.6.2 Approximate Factorization

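Following [6, 7, 35], we will approximate

\[ \begin{bmatrix} A_1 & B_1 & & 0 \\ B_1^\top & A_2 & \ddots & \\ & \ddots & \ddots & B_{n_2-1} \\ 0 & & B_{n_2-1}^\top & A_{n_2} \end{bmatrix} \approx \begin{bmatrix} D_1 & & & \\ B_1^\top & D_2 & & \\ & \ddots & \ddots & \\ 0 & & B_{n_2-1}^\top & D_{n_2} \end{bmatrix} \begin{bmatrix} D_1 & & & \\ & D_2 & & \\ & & \ddots & \\ & & & D_{n_2} \end{bmatrix}^{-1} \begin{bmatrix} D_1 & B_1 & & 0 \\ & D_2 & \ddots & \\ & & \ddots & B_{n_2-1} \\ & & & D_{n_2} \end{bmatrix} , \]

where the diagonal blocks in the approximate factorization are defined by

    D_1 = A_1
    for 1 ≤ j < n_2
      factor D_j
      compute [D_j^{-1}]^{(p)}
      D_{j+1} = A_{j+1} − B_j^T [D_j^{-1}]^{(p)} B_j
    factor D_{n_2}

Here [D_j^{-1} M]^{(p)} is the matrix formed from the diagonal and p sub- or super-diagonals of D_j^{-1} M.

It is not obvious that this algorithm can be performed. We must guarantee that D_j is invertible at each step of the algorithm.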

Lemma 3.6-1: [7, Theorem 7.6] Suppose that A is symmetric and positive definite with non-positive off-diagonal entries. Partition

\[ A = \begin{bmatrix} A_1 & B_1 \\ B_1^\top & A_2 \end{bmatrix} \]

where A_1 is square. Let Y_1 be any symmetric positive definite matrix such that 0 ≤ Y_1 ≤ A_1^{-1}, and let

\[ X_2 = A_2 - B_1^\top Y_1 B_1 . \]

Then X_2 is symmetric positive definite with non-positive off-diagonal entries.

Proof: Since A is symmetric and positive definite, we know that A_1 and A_2 are both symmetric and positive definite. Thus A_1^{-1} exists and is positive definite. Also, lemma 3.4-15 implies that A is an M-matrix, so lemma 3.4-13 implies that A_1 and A_2 are M-matrices, and that the Schur complement A_2 − B_1^T A_1^{-1} B_1 is an M-matrix.

It is easy to see that the Schur complement is symmetric and positive definite. If b_2 ≠ 0 then

\[ 0 < \begin{bmatrix} -b_2^\top B_1^\top A_1^{-1} & b_2^\top \end{bmatrix} \begin{bmatrix} A_1 & B_1 \\ B_1^\top & A_2 \end{bmatrix} \begin{bmatrix} -A_1^{-1} B_1 b_2 \\ b_2 \end{bmatrix} = b_2^\top (A_2 - B_1^\top A_1^{-1} B_1) b_2 . \]

Since 0 ≤ Y_1 ≤ A_1^{-1}, it follows that for all b_2 ≠ 0

\[ b_2^\top X_2 b_2 = b_2^\top (A_2 - B_1^\top Y_1 B_1) b_2 \ge b_2^\top (A_2 - B_1^\top A_1^{-1} B_1) b_2 > 0 . \]

Thus X_2 is positive-definite.

Since the off-diagonal entries of A_2 are non-positive, all of the entries of B_1 are non-positive and the entries of Y_1 are non-negative, it follows that the off-diagonal entries of X_2 are all non-positive. □

This lemma implies that the approximate factorization

\[ A = \begin{bmatrix} A_1 & B_1 \\ B_1^\top & A_2 \end{bmatrix} \approx \begin{bmatrix} X_1 & \\ B_1^\top & X_2 \end{bmatrix} \begin{bmatrix} X_1 & \\ & X_2 \end{bmatrix}^{-1} \begin{bmatrix} X_1 & B_1 \\ & X_2 \end{bmatrix} \]

can be performed whenever 0 ≤ Y_1 ≤ X_1^{-1} = A_1^{-1}. The proof that the more general incomplete factorization succeeds involves only repeated application of this lemma.

This lemma points out a crucial aspect of the design of the incomplete factorization algorithm. Recall that we compute

\[ D_{j+1} = A_{j+1} - B_j^\top [D_j^{-1}]^{(p)} B_j . \]

If the diagonal blocks D_j are all M-matrices, then the entries of D_j^{-1} are all nonnegative. It follows that the banded part satisfies [D_j^{-1}]^{(p)} ≤ D_j^{-1}, since we are throwing away nonnegative entries. If we had instead taken the inverse of the banded part of D_j, the inequalities would have gone the other way and the factorization might not succeed.

Note that since A_{j+1}, B_j and [D_j^{-1}]^{(p)} are banded, the diagonal block D_{j+1} = A_{j+1} − B_j^T [D_j^{-1}]^{(p)} B_j can be computed in O(n_1) work. In section 3.6.3 we will see how to compute the banded part of an inverse of an n_1 × n_1 matrix with band width q by using at most n_1(q + pq + p²) multiplies and n_1(2 + q + pq + p²) adds. The total work in the incomplete factorization involves n_2 factorizations of diagonal blocks D_j, n_2 − 1 computations of the banded parts of the inverses [D_j^{-1}]^{(p)}, and n_2 − 1 computations of new diagonal blocks D_{j+1}. Thus the total work in the factorization is O(n_1 n_2), which is proportional to the total number of unknowns in the problem. However, the total work is also proportional to pq + p², so it is important to keep the bandwidth p of the approximate inverse small.


3.6.3 Banded Part of an Inverse

The banded part of D−1 can be computed recursively. Suppose that we factor

D = (I−U)>E−1(I−U) , (3.6-1)

where E is diagonal and U is strictly upper triangular. Then (I−U)D−1 = E(I−U>)−1,so

D−1 = E(I−U>)−1 + UD−1 .

In particular, (D−1

)ij

= Eii

([I−U>

]−1)ij

+n1∑

`=i+1

Ui`

(D−1

)`j.

Since U is strictly upper triangular, we have(I−U>)

ij= 0 for i < j, so

for i < j ,(D−1

)ij

=n1∑

`=i+1

Ui`

(D−1

)`j.

For i = j we have(I−U>)

ij= 1, so

(D−1

)ii

= Eii +n1∑

`=i+1

Ui`

(D−1

)`i.

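Next, suppose that $U$ is both strictly upper triangular and banded with band width $q$:

$$U_{i\ell} = 0 \text{ for } \ell \le i \text{ or } i + q < \ell .$$

Then

$$(D^{-1})_{ij} = \begin{cases} \sum_{\ell=i+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{\ell j} , & i < j \\ E_{ii} + \sum_{\ell=i+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{\ell i} , & i = j \end{cases}$$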

We will use the symmetry of $D^{-1}$ to evaluate the off-diagonal terms recursively. If $i + q < j \le i + p$ then

$$(D^{-1})_{ij} = \sum_{\ell=i+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{\ell j}$$

expresses $(D^{-1})_{ij}$ in terms of known entries $U_{i\ell}$ and entries of $D^{-1}$ in the same column but rows below the desired entry. If $i < j \le i + q$ then

$$(D^{-1})_{ij} = \sum_{\ell=i+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{\ell j} = \sum_{\ell=i+1}^{j} U_{i\ell} (D^{-1})_{\ell j} + \sum_{\ell=j+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{j\ell}$$

expresses $(D^{-1})_{ij}$ in terms of known entries $U_{i\ell}$ and entries of $D^{-1}$ either in the same row but nearer the diagonal, or in the same column but rows below the desired entry. The computation of the diagonal entry $(D^{-1})_{ii}$ depends on off-diagonal entries of $D^{-1}$ in the same row.

In summary, if we want $p \ge q$ bands of $D^{-1}$ where $q$ is the band width of $D$, we compute

for $n_1 \ge i \ge 1$
    for $i + 1 \le j \le \min\{i + q, n_1\}$
        $(D^{-1})_{ij} = \sum_{\ell=i+1}^{j} U_{i\ell} (D^{-1})_{\ell j} + \sum_{\ell=j+1}^{\min\{n_1, i+q\}} U_{i\ell} (D^{-1})_{j\ell}$
    for $i + q < j \le \min\{i + p, n_1\}$
        $(D^{-1})_{ij} = \sum_{\ell=i+1}^{i+q} U_{i\ell} (D^{-1})_{\ell j}$
    $(D^{-1})_{ii} = E_{ii} + \sum_{\ell=i+1}^{\min\{i+q, n_1\}} U_{i\ell} (D^{-1})_{\ell i}$ .

This algorithm involves at most $n_1(q + pq + p^2)$ multiplies and $n_1(2 + q + pq + p^2)$ adds. Such estimates give us some idea how the bandwidth in an approximate factorization might be chosen.
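The recursion can be tried out on a small matrix. The following Python sketch is illustrative only and is not one of the programs distributed with these notes: the function names `ldl_factor` and `banded_inverse` are invented here, indices are 0-based, and the two off-diagonal cases of the algorithm are merged by reading every needed entry of $D^{-1}$ through its symmetric upper-triangular copy.

```python
def ldl_factor(D):
    """Factor symmetric positive-definite D = (I-U)^T E^{-1} (I-U); return (U, E)."""
    n = len(D)
    L = [[0.0] * n for _ in range(n)]   # unit lower triangular, L = (I-U)^T
    d = [0.0] * n                       # pivots, d[i] = 1 / E[i]
    for j in range(n):
        d[j] = D[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        L[j][j] = 1.0
        for i in range(j + 1, n):
            L[i][j] = (D[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    # I - U = L^T, so U[i][j] = -L[j][i] above the diagonal and zero elsewhere
    U = [[-L[j][i] if j > i else 0.0 for j in range(n)] for i in range(n)]
    E = [1.0 / dj for dj in d]
    return U, E


def banded_inverse(U, E, q, p):
    """Return {(i, j): (D^{-1})_{ij}} for i <= j <= i + p, assuming p >= q."""
    n = len(E)
    Z = {}

    def z(i, j):
        # D^{-1} is symmetric, so only the upper triangle is stored
        return Z[(i, j)] if i <= j else Z[(j, i)]

    for i in range(n - 1, -1, -1):          # rows from last to first
        last = min(i + q, n - 1)            # last column where U[i][.] is nonzero
        for j in range(i + 1, min(i + p, n - 1) + 1):
            Z[(i, j)] = sum(U[i][l] * z(l, j) for l in range(i + 1, last + 1))
        Z[(i, i)] = E[i] + sum(U[i][l] * z(l, i) for l in range(i + 1, last + 1))
    return Z


# Example: D = tridiag(-1, 4, -1), band width q = 1, keep p = 2 bands of the inverse
n, q, p = 6, 1, 2
D = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
U, E = ldl_factor(D)
Z = banded_inverse(U, E, q, p)
```

Because every entry used on the right-hand side already lies inside the band, the computed entries are (up to rounding) the exact entries of $D^{-1}$, not approximations to them.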

3.6.4 Approximate Factorization and Iterative Improvement

We could use the incomplete factorization in an iterative improvement algorithm:

$$x := x - (L D L^\top)^{-1} (A x - b) .$$

The errors in the iterates would then satisfy

$$e_{k+1} = [I - (L D L^\top)^{-1} A] e_k .$$

The convergence of the algorithm thus depends on the spectral radius of $I - (L D L^\top)^{-1} A$. This in turn depends on the errors involved in replacing $D_j^{-1}$ by its banded part $[D_j^{-1}]^{(p)}$.

Example 3.6-2: The inverse of

$$A = \begin{bmatrix} 1 & -1 & 0 & \dots & 0 \\ -1 & 2 & -1 & \dots & 0 \\ 0 & -1 & 2 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & -1 \\ 0 & 0 & \dots & -1 & 2 \end{bmatrix}$$

is

$$A^{-1} = \begin{bmatrix} n & n-1 & \dots & 2 & 1 \\ n-1 & n-1 & \dots & 2 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 2 & 2 & \dots & 2 & 1 \\ 1 & 1 & \dots & 1 & 1 \end{bmatrix}$$

Note that $A$ corresponds to the discretization of the steady-state heat equation in one dimension, with Neumann boundary condition at the left and Dirichlet boundary condition at the right. Further, $A$ is symmetric and positive-definite, irreducibly diagonally-dominant, and an M-matrix. However, the entries of $A^{-1}$ do not decay rapidly as we move away from the diagonal.
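The claimed inverse is easy to check by direct multiplication. The short Python sketch below is illustrative (the helper names `heat_matrix` and `claimed_inverse` are invented here); in 0-based indexing the entry in row $i$ and column $j$ of the inverse is $n - \max(i, j)$.

```python
def heat_matrix(n):
    """Tridiagonal matrix of Example 3.6-2: first diagonal entry 1, the rest 2."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0 if i == 0 else 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A


def claimed_inverse(n):
    """Entries n - max(i, j) for 0-based i, j (n + 1 - max(i, j) when 1-based)."""
    return [[float(n - max(i, j)) for j in range(n)] for i in range(n)]


n = 7
A = heat_matrix(n)
Ainv = claimed_inverse(n)
product = [[sum(A[i][k] * Ainv[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
# Entries in the first row fall off only linearly, from n on the diagonal to 1
first_row_ratio = Ainv[0][n - 1] / Ainv[0][0]
```

The ratio `first_row_ratio` equals $1/n$, which illustrates how slowly the entries shrink away from the diagonal for this matrix.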

Suppose that $A$ is symmetric and positive-definite with eigenvalues $\lambda$ satisfying $0 < a \le \lambda \le b$. Define

$$\sigma = \frac{1 - \sqrt{a/b}}{1 + \sqrt{a/b}} .$$

Axelsson [7, p. 360] shows that the entries of $A^{-1}$ a distance $r$ or more from the main diagonal decay with increasing $r$ at the rate $O(\sigma^r)$ for $r \le \sqrt{b/a}$, and $O(\sigma^r \sqrt{r})$ for $r > \sqrt{b/a}$.

3.6.5 Incomplete Factorization for Three-Dimensional Problems

In three-dimensional problems, the diagonal blocks $D_j$ are themselves block tridiagonal. The incomplete factorization therefore involves an application of the two-dimensional incomplete factorization in order to determine block-banded factorizations of these diagonal blocks. This is complicated to program, and difficult to debug.

3.6.6 Incomplete Factorization Software

An alternative approach to incomplete factorization is due to Jones and Plassmann [54]. It uses sparse matrix storage to determine which entries are most significant to use in the approximate factorization. This algorithm involves substantial indirect addressing, but is fairly easy to apply to general problems. The algorithm is available from Program 3.6-3: netlib or can be found locally in Program 3.6-4: dicf.f. Another useful software package for solving linear systems that arise in elliptic partial differential equations is PETSc. This package is very effective for distributed computing. PETSc is available online at Program 3.6-5: Argonne Labs.

A program to perform iterative improvement with the Jones and Plassmann incomplete factorization algorithm can be found in Program 3.6-6: IterativeError2D.C. This program calls Fortran subroutine Program 3.6-7: dicf.f. Students can execute this program by clicking on Executable 3.6-8: iterative error. The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.

Exercises 3.6

1. Show that the diagonal block for the two-dimensional steady-state heat equation is a scalar multiple of the $n \times n$ matrix

$$D = \begin{bmatrix} 4 & -1 & \dots & 0 \\ -1 & 4 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -1 \\ 0 & \dots & -1 & 4 \end{bmatrix}$$

Compute $D^{-1}$ for $n = 8, 16, 32, 64$, and plot

$$\log \max_{1 \le i \le n-r} \frac{(D^{-1})_{i,i+r}}{(D^{-1})_{i,i}}$$

as a function of $r$. Compare your results to those in section 3.6.4.

2. Get a copy of the Jones-Plassmann incomplete factorization and their incomplete factorization algorithm. Describe how to apply the Jones-Plassmann incomplete factorization to the 3D implicit centered difference scheme for the heat equation.

3.7 Conjugate Gradients

All of our previous iterative methods for solving linear equations have been based on iterative improvement. In this section, we will discuss a completely different approach. Suppose that $A$ is symmetric and positive definite. Instead of viewing the problem as a linear equation $Ax = b$, we will view it as an optimization problem: minimize $\phi(x)$ where

$$\phi(x) \equiv \tfrac{1}{2} x^\top A x - b^\top x .$$

http://www.netlib.org/toms/740

http://www.math.duke.edu/~johnt/math226/iterative_methods/dicf.f

http://www-unix.mcs.anl.gov/petsc


Using the residual $r = Ax - b$, we can write

$$\phi(x) = \tfrac{1}{2} (x - A^{-1} b)^\top A (x - A^{-1} b) - \tfrac{1}{2} b^\top A^{-1} b = \tfrac{1}{2} r^\top A^{-1} r - \tfrac{1}{2} b^\top A^{-1} b .$$

Note that the first-order conditions for the minimum require that

$$0 = \nabla_x \phi(x) = Ax - b = r .$$

Thus for symmetric positive-definite matrices, the minimization problem is equivalent to the original linear system.

We will develop an algorithm for minimizing $\phi(x)$ over subspaces in $\mathbb{R}^n$ of strictly increasing dimension. This will imply that the algorithm will terminate in at most $n$ steps. For banded matrices $A$ with $O(1)$ nonzero entries in each row, each step will involve $O(n)$ operations; in this case the total work will be at most $O(n^2)$ operations for banded symmetric positive definite systems. In many cases, the algorithm will produce good results in far fewer iterations.

3.7.1 Self-Adjoint Systems

For additional information on conjugate gradient methods, see [7, 43, 61, 91]. Our initial discussion will follow that in Luenberger [61].

The basic idea behind conjugate gradients is to minimize $\phi(x)$ by searching along an appropriate set of linearly independent directions. The following definition will tell us what “appropriate” means in this case.

Definition 3.7-1: Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric and that $p_0, \dots, p_m \in \mathbb{R}^n$ are nonzero. Then $p_0, \dots, p_m$ are A-conjugate if and only if $i \ne j$ implies that $p_i^\top A p_j = 0$.

This definition allows us to state and prove the following lemma.

Lemma 3.7-2: Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, and that $p_0, \dots, p_m \in \mathbb{R}^n$ are A-conjugate. Then $p_0, \dots, p_m$ are linearly independent, and $m < n$.

Proof: Suppose that $\sum_{i=0}^{m} p_i \alpha_i = 0$ is a linear combination of the A-conjugate vectors. Since the vectors $p_i$ are A-conjugate,

$$\forall\, 0 \le j \le m , \quad 0 = \sum_{i=0}^{m} p_j^\top A p_i \alpha_i = p_j^\top A p_j \alpha_j .$$

Note that since $p_j \ne 0$ and $A$ is positive definite, $p_j^\top A p_j > 0$. We conclude that $\alpha_j = 0$ for all $j$. Since any set of more than $n$ vectors in $\mathbb{R}^n$ is linearly dependent, we must have $m < n$. □


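The next theorem shows us how to use A-conjugate vectors to minimize the quadratic form.

Conjugate Direction Theorem [61, p. 170] 3.7-3: Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric and positive-definite, and that $p_0, \dots, p_{n-1} \in \mathbb{R}^n$ are A-conjugate vectors. Given any $x_0 \in \mathbb{R}^n$, compute the scalars $\alpha_k$ and vectors $x_k$ by the algorithm

$$\text{for } 0 \le k < n , \quad \alpha_k = - \frac{p_k^\top (A x_k - b)}{p_k^\top A p_k} \quad \text{and} \quad x_{k+1} = x_k + p_k \alpha_k . \tag{3.7-1}$$

Then $A x_n = b$; that is, $x_n = x$, where $x$ denotes the solution of $Ax = b$.

Proof: Since the vectors $p_0, \dots, p_{n-1}$ form a basis for $\mathbb{R}^n$, we can write

$$x - x_0 = \sum_{i=0}^{n-1} p_i \beta_i$$

for some scalars $\beta_i$. By multiplying this equation by $p_k^\top A$, we can solve for $\beta_k$ to get

$$\beta_k = \frac{p_k^\top A (x - x_0)}{p_k^\top A p_k} = - \frac{p_k^\top (A x_0 - b)}{p_k^\top A p_k} .$$

This gives us a relation between the solution $x$, the initial guess $x_0$ and the A-conjugate vectors $p_0, \dots, p_{n-1}$. Next, we will show inductively that $x_k$, defined by (3.7-1), satisfies

$$x_k = x_0 + \sum_{i=0}^{k-1} p_i \beta_i .$$

This statement is obviously true for $k = 0$. Assuming that the statement is true for $k$, we will prove that it is true for $k + 1$. Note that the inductive hypothesis and the A-conjugacy of the $p_j$ imply that

$$p_k^\top A (x_k - x_0) = \sum_{i=0}^{k-1} p_k^\top A p_i \beta_i = 0 .$$

Thus

$$\alpha_k = - \frac{p_k^\top (A x_k - b)}{p_k^\top A p_k} = - \frac{p_k^\top A (x_k - x_0) + p_k^\top (A x_0 - b)}{p_k^\top A p_k} = - \frac{p_k^\top (A x_0 - b)}{p_k^\top A p_k} = \beta_k .$$

It follows that $x_{k+1} = x_k + p_k \beta_k$, which completes the induction; taking $k = n$ gives $x_n = x$. □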


Expanding Subspace Theorem 3.7-4: [61, p. 171] Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, and that $p_0, \dots, p_{n-1} \in \mathbb{R}^n$ are A-conjugate. Also suppose that given $x_0 \in \mathbb{R}^n$ we perform the algorithm in equation (3.7-1) to compute the scalars $\alpha_k$ for $0 \le k < n$ and vectors $x_k$ for $1 \le k \le n$. Then for $1 \le k \le n$, $x_k$ minimizes

$$\phi(x) \equiv \tfrac{1}{2} x^\top A x - b^\top x$$

over the sets

$$L_k \equiv \{ x_{k-1} + p_{k-1} \beta : \beta \in \mathbb{R} \}$$

and

$$M_k \equiv \left\{ x_0 + \sum_{i=0}^{k-1} p_i \beta_i : \beta_i \in \mathbb{R} \right\} .$$

Furthermore,

$$\forall\, 0 \le i < k , \quad p_i^\top (A x_k - b) = 0 .$$

Proof: Since $L_k \subset M_k$, it suffices to show that $x_k$ minimizes $\phi(x)$ over $M_k$. Recall that $\phi(x) = \tfrac{1}{2} x^\top A x - b^\top x$, so $\nabla_x \phi(x) = Ax - b$. At a minimum of $\phi\left( x_0 + \sum_{i=0}^{k-1} p_i \beta_i \right)$ with respect to all possible choices of the coefficients $\beta_i$, we want the first derivatives with respect to the $\beta_i$ to be zero:

$$\forall\, 0 \le i < k , \quad 0 = \frac{\partial \phi(x_k)}{\partial \beta_i} = p_i^\top A x_k - b^\top p_i = p_i^\top (A x_k - b) = p_i^\top \nabla_x \phi(x_k) .$$

We also want the matrix of second derivatives to form a symmetric nonnegative matrix; this matrix has entries

$$\forall\, 0 \le i, j < k , \quad \frac{\partial^2 \phi(x_k)}{\partial \beta_i \partial \beta_j} = p_i^\top A p_j .$$

If $P = [p_0, \dots, p_{n-1}]$ is the matrix of A-conjugate vectors, then the matrix of second derivatives of $\phi$ with respect to the $\beta$ values is $P^\top A P$. Since $A$ is positive definite and $P$ is nonsingular, this matrix of second derivatives is positive definite. Thus, we need only show that the first-order conditions for a minimum are satisfied:

$$\forall\, 0 \le i < k , \quad 0 = p_i^\top \nabla_x \phi(x_k) = p_i^\top (A x_k - b) . \tag{3.7-2}$$

We will prove this statement by induction on $k$.

For $k = 0$, $M_k$ consists of a single vector and the set of first-order conditions for a minimum is empty. Assume that the first-order conditions (3.7-2) are satisfied for all iterates from 0 to $k$. We will prove that equations (3.7-2) are true for $k + 1$. Note that

$$\nabla_x \phi(x_{k+1}) = A x_{k+1} - b = (A x_k - b) + A p_k \alpha_k .$$

Thus the definition of $\alpha_k$ shows that

$$p_k^\top \nabla_x \phi(x_{k+1}) = p_k^\top (A x_k - b) + p_k^\top A p_k \alpha_k = 0 .$$

Furthermore, the inductive hypothesis and the A-conjugacy of the $p_i$ show that

$$\forall\, 0 \le i < k , \quad p_i^\top \nabla_x \phi(x_{k+1}) = p_i^\top (A x_k - b) + p_i^\top A p_k \alpha_k = 0 .$$

This proves the first-order condition (3.7-2) for the minimum over $M_{k+1}$. □

The proof of the previous theorem suggests that we seek A-conjugate directions $p_k$ that are related to the gradients $-\nabla_x \phi(x_k)$, since the latter are orthogonal to all previous search directions and point in the directions of steepest descent.

Unfortunately, the previous theorems do not show how to compute the A-conjugate vectors $p_0, \dots, p_{n-1}$. The next theorem does.


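Conjugate Gradient Theorem 3.7-5: [61, p. 174] Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite and that $x_0 \in \mathbb{R}^n$. Suppose that we perform the following algorithm:

$p_0 = -r_0 = b - A x_0$ ,
while $0 \le k < n$ and $r_k \ne 0$
    $\alpha_k = - \dfrac{p_k^\top r_k}{p_k^\top A p_k}$
    $x_{k+1} = x_k + p_k \alpha_k$
    $r_{k+1} = A x_{k+1} - b$
    $\beta_k = \dfrac{p_k^\top A r_{k+1}}{p_k^\top A p_k}$
    $p_{k+1} = -r_{k+1} + p_k \beta_k$
(3.7-3)

Then if $r_k \ne 0$,

1. the span of the gradients

$$\langle r_0, r_1, \dots, r_k \rangle \equiv \left\{ \sum_{i=0}^{k} r_i \gamma_i : \gamma_i \in \mathbb{R} \right\}$$

is the Krylov subspace

$$K_k \equiv \langle r_0, A r_0, \dots, A^k r_0 \rangle ;$$

2. $\langle p_0, p_1, \dots, p_k \rangle = K_k$;

3. the vectors $p_0, \dots, p_k$ are A-conjugate:

$$\forall\, 0 \le i < k , \quad p_i^\top A p_k = 0 .$$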
Proof: We will prove all three results simultaneously by induction. All three of these results are obvious for $k = 0$. Inductively, we will assume that they are true for all iterations up to and including $k$, and show that they are true for $k + 1$.

Note that the definition of the residual $r_k$ and the iterative expression for $x_{k+1}$ imply that

$$r_{k+1} \equiv A x_{k+1} - b = A (x_k + p_k \alpha_k) - b = (A x_k - b) + A p_k \alpha_k = r_k + A p_k \alpha_k .$$

Inductively, we have assumed that $r_k = \sum_{i=0}^{k} A^i r_0 \rho_i$ and that $p_k = \sum_{i=0}^{k} A^i r_0 \pi_i$ for some scalars $\rho_i$ and $\pi_i$. Thus $r_{k+1} \in K_{k+1}$. Since the inductive hypotheses guarantee that the hypotheses of the expanding subspace theorem 3.7-4 are valid up to $x_{k+1}$, the expanding subspace theorem shows that $r_{k+1}$ is orthogonal to $p_0, \dots, p_k$. It follows that $r_{k+1}$ is orthogonal to the Krylov subspace $K_k$. Since the induction assumes that $r_{k+1} \ne 0$, it follows that $\langle r_0, r_1, \dots, r_{k+1} \rangle = K_{k+1}$. This proves the first claim.

Next, we note that the algorithm (3.7-3) chooses $p_{k+1} = -r_{k+1} + p_k \beta_k$. Since the inductive hypothesis implies that $p_k \in K_k$, and since we have just proved that $r_{k+1} \in K_{k+1}$, it follows that $p_{k+1} \in K_{k+1}$. This proves the second claim.

By the evaluation of $p_{k+1}$ in the algorithm (3.7-3) we find that

$$\forall\, 0 \le i \le k , \quad p_i^\top A p_{k+1} = p_i^\top A (-r_{k+1} + p_k \beta_k) .$$

When $i = k$, the definition of $\beta_k$ shows that $p_k^\top A p_{k+1} = 0$. When $i < k$, the inductive hypothesis shows that $p_i^\top A p_k = 0$, and the expanding subspace theorem 3.7-4 shows that $p_i^\top A r_{k+1} = 0$. This proves the final claim. □

The following corollary gives us some alternative forms for computing the terms in the conjugate gradient algorithm.

Corollary 3.7-6: Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite and that $x_0 \in \mathbb{R}^n$. Suppose that we perform the algorithm (3.7-3). Then if $r_k \ne 0$,

$$r_{k+1} = r_k + A p_k \alpha_k , \quad \alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k} \quad \text{and} \quad \beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k} .$$

Proof: The first claim was contained in the proof of the conjugate gradient theorem 3.7-5. The second claim is obvious for $k = 0$. To prove the second claim for $k > 0$, note that the definition of $p_k$ in algorithm (3.7-3) implies that

$$r_k^\top p_k = r_k^\top (-r_k + p_{k-1} \beta_{k-1}) .$$

Since the expanding subspace theorem 3.7-4 shows that $r_k^\top p_{k-1} = 0$, we see that

$$\alpha_k \equiv - \frac{p_k^\top r_k}{p_k^\top A p_k} = \frac{r_k^\top r_k}{p_k^\top A p_k} .$$

This is the second claim.

To prove the third claim, note that the conjugate gradient theorem 3.7-5 implies that $r_k \in K_k$, and that the span of the search directions $\langle p_0, \dots, p_k \rangle$ is $K_k$. Since the expanding subspace theorem 3.7-4 shows that $r_{k+1}$ is orthogonal to $\langle p_0, \dots, p_k \rangle$, it follows that $r_{k+1}^\top r_k = 0$. Since the first claim implies that $A p_k = (r_{k+1} - r_k)/\alpha_k$, it follows that

$$\beta_k \equiv \frac{r_{k+1}^\top A p_k}{p_k^\top A p_k} = \frac{r_{k+1}^\top (r_{k+1} - r_k)}{p_k^\top A p_k} \frac{1}{\alpha_k} = \frac{r_{k+1}^\top r_{k+1}}{p_k^\top A p_k} \frac{p_k^\top A p_k}{r_k^\top r_k} = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k} .$$

□

Several variations of the conjugate gradient algorithm have appeared in the literature. The most efficient and accurate form appears to be the following:

$s = b - A x$
$p = s$
$\gamma = p^\top s$
until convergence do
    $z = A p$
    $\alpha = \gamma / p^\top z$
    $x = x + p \alpha$
    $s = s - z \alpha$
    $\delta = s^\top s$
    $p = s + p \delta / \gamma$
    $\gamma = \delta$
(3.7-4)

This form requires only one matrix-vector multiply per iteration.

The conjugate gradient algorithm should be terminated whenever one of the following four conditions is satisfied. First, we should stop if $\alpha \le 0$; this indicates that either the residual is zero or $A$ is not positive definite. Second, we should stop if $\|p\|_\infty \alpha$ is small compared to $\|x\|_\infty$; this indicates that the conjugate gradient algorithm will make little change in its approximation to the solution. Third, we should stop if $\|r\|_\infty$ (that is, $\|s\|_\infty$ in the notation of (3.7-4)) is small compared to $\|b\|_\infty$ (provided that $A$ is reasonably well-scaled); the standard error estimate

$$\frac{\|\tilde x - x\|}{\|x\|} \le \kappa(A) \frac{\|A (\tilde x - x)\|}{\|b\|}$$

for linear systems, in which $\tilde x$ is the approximate solution, indicates that the relative error in the solution is as small as the conditioning of the system will allow. Finally, we should stop if more than $n$ iterations have been performed, since the conjugate gradient algorithm should converge to the exact solution in at most $n$ iterations with exact arithmetic.
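As an illustration of form (3.7-4) and of some of the stopping tests just described, here is a minimal Python sketch using dense lists. It is not the course program IterativeError2D.C; the function name `conjugate_gradient`, the tolerance `tol` and the dense storage are assumptions made for this example. It implements the residual test, the positive-definiteness test and the iteration cap, and omits the test on the size of the step.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=None):
    """Form (3.7-4) of conjugate gradients; returns (x, iterations used)."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = list(x0)
    s = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # s = b - A x
    p = list(s)
    gamma = dot(s, s)
    b_norm = max(abs(bi) for bi in b) or 1.0
    for k in range(max_iter):
        # stop when the residual is small compared to b (max norms)
        if max(abs(si) for si in s) <= tol * b_norm:
            return x, k
        z = matvec(A, p)                                # the only matrix-vector multiply
        pz = dot(p, z)
        if pz <= 0.0:
            raise ValueError("p^T A p <= 0: A is not positive definite")
        alpha = gamma / pz
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        s = [si - alpha * zi for si, zi in zip(s, z)]
        delta = dot(s, s)
        p = [si + (delta / gamma) * pi for si, pi in zip(s, p)]
        gamma = delta
    return x, max_iter


# Example: A = tridiag(-1, 2, -1), right-hand side chosen so that x = (1, ..., 1)
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = matvec(A, [1.0] * n)
x, iters = conjugate_gradient(A, b, [0.0] * n)
```

In a production code the matrix would not be stored densely; only a routine that applies $A$ to a vector is needed.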

Next, let us examine some estimates for the convergence of the conjugate gradientiteration.


Theorem 3.7-7: Suppose that $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, and that given $x_0 \in \mathbb{R}^n$ the vectors $x_k$ for $1 \le k \le n$ are computed by the conjugate gradient iteration (3.7-3). Then

$$\|x_{k+1} - x\|_A^2 \equiv (x_{k+1} - x)^\top A (x_{k+1} - x) = \min_{q \in P_k} \| [I + A q(A)] (x_0 - x) \|_A^2 .$$

Proof: The conjugate gradient algorithm (3.7-3) shows that $x_{k+1} - x_0$ is in the span of the vectors $p_0, \dots, p_k$, and the conjugate gradient theorem 3.7-5 shows that the span of the vectors $p_0, \dots, p_k$ is $K_k$. Thus $x_{k+1} = x_0 + q_k(A) r_0$ for some polynomial $q_k \in P_k$. This in turn implies that

$$x_{k+1} - x = x_0 - x + q_k(A) A (x_0 - x) = [I + q_k(A) A] (x_0 - x) .$$

This result implies that

$$\|x_{k+1} - x\|_A^2 = \| [I + q_k(A) A] (x_0 - x) \|_A^2 .$$

Next, note that the expanding subspace theorem 3.7-4 implies that $x_{k+1}$ minimizes $\|\tilde x - x\|_A^2$ over all $\tilde x - x_0 \in \langle p_0, \dots, p_k \rangle$. The conjugate gradient theorem implies that $x_{k+1}$ minimizes $\|\tilde x - x\|_A^2$ over all $\tilde x - x_0 \in K_k$, where $K_k$ is the Krylov subspace $K_k = \langle r_0, A r_0, \dots, A^k r_0 \rangle$. Equivalently, this says that $x_{k+1}$ minimizes $\|\tilde x - x\|_A^2$ over all $\tilde x - x = [I + q(A) A] (x_0 - x)$ for some polynomial $q \in P_k$. □

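Corollary 3.7-8: The conjugate gradient iterates satisfy

$$\|x_k - x\|_A \le \|x_0 - x\|_A \min_{q \in P_k ,\, q(0) = 1} \ \max_{\lambda \text{ an eigenvalue of } A} |q(\lambda)| .$$

Proof: Since $A$ is symmetric and positive definite, the spectral theorem implies that there is an orthogonal matrix $X$ and a positive definite diagonal matrix $\Lambda$ such that $A X = X \Lambda$. Let us define $y_0 = X^\top (x_0 - x)$, with components $\eta_i$. Then

$$\|x_0 - x\|_A^2 = (x_0 - x)^\top A (x_0 - x) = (x_0 - x)^\top X \Lambda X^\top (x_0 - x) = y_0^\top \Lambda y_0 ,$$

and

$$\|x_k - x\|_A^2 = (x_0 - x)^\top A [I + q_{k-1}(A) A]^2 (x_0 - x) = y_0^\top \Lambda [I + q_{k-1}(\Lambda) \Lambda]^2 y_0$$

$$= \sum_{i=1}^{n} q(\lambda_i)^2 \lambda_i \eta_i^2 \le \max_i q(\lambda_i)^2 \sum_{i=1}^{n} \lambda_i \eta_i^2 = \left[ \max_i |q(\lambda_i)| \right]^2 \|x_0 - x\|_A^2 ,$$

where $q(\lambda) \equiv 1 + q_{k-1}(\lambda) \lambda$. Note that $q \in P_k$, and $q(0) = 1$. The previous theorem 3.7-7 verifies the use of the minimum in the conclusion of this corollary. □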

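This result implies that if $A$ has only a small number $m$ of distinct eigenvalues, then the iteration will converge in at most $m$ iterations (in exact arithmetic). That is because we could let $q$ be the polynomial with those eigenvalues as its zeros, scaled so that $q(0) = 1$, and use the previous corollary to see that the error in conjugate gradients is zero after that number of iterations. Similarly, if the eigenvalues of $A$ occur in $m$ tight clusters, then choosing the zeros of $q$ at the cluster values shows that the error is small after $m$ iterations.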
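A small experiment makes this concrete. The Python sketch below is an illustration under assumed data, not part of the course software: it runs the conjugate gradient recurrence on a $9 \times 9$ diagonal matrix with only three distinct eigenvalues and records the residual norm after each step; the helper name `cg_residual_norms` is invented here.

```python
def cg_residual_norms(diag, b, steps):
    """Run CG on the diagonal matrix diag(diag) from x = 0; return ||r||_2 after each step."""
    x = [0.0] * len(b)
    s = list(b)                      # s = b - A x with x = 0
    p = list(s)
    gamma = sum(si * si for si in s)
    norms = []
    for _ in range(steps):
        z = [d * pi for d, pi in zip(diag, p)]          # z = A p for diagonal A
        alpha = gamma / sum(pi * zi for pi, zi in zip(p, z))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        s = [si - alpha * zi for si, zi in zip(s, z)]
        delta = sum(si * si for si in s)
        norms.append(delta ** 0.5)
        p = [si + (delta / gamma) * pi for si, pi in zip(s, p)]
        gamma = delta
    return norms


# Nine unknowns, but only three distinct eigenvalues: 1, 4 and 9
eigenvalues = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 9.0, 9.0, 9.0]
rhs = [float(i + 1) for i in range(9)]
norms = cg_residual_norms(eigenvalues, rhs, 3)
```

The residual is still of order one after two steps and drops to rounding-error level after the third, even though the system has nine unknowns.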

Axelsson [7, chapter 13] proves several estimates for the convergence rate of conjugate gradients. One way to estimate the bound in corollary 3.7-8 is to assume that the eigenvalues $\lambda$ of $A$ satisfy $0 < a \le \lambda \le b$. Then

$$\min_{q \in P_k ,\, q(0) = 1} \ \max_{\lambda \text{ an eigenvalue of } A} |q(\lambda)| \le \min_{q \in P_k ,\, q(0) = 1} \ \max_{a \le \lambda \le b} |q(\lambda)| = \frac{1}{T_k\left( \frac{b + a}{b - a} \right)} = \frac{2 \sigma^k}{1 + \sigma^{2k}} ,$$

where $T_k$ is the Chebyshev polynomial of degree $k$,

$$\sigma \equiv \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}$$

and $\kappa(A) = b/a$ is the condition number of $A$. This estimate is pessimistic in practice.

Axelsson [7, p. 591] summarizes his discussion of the convergence of conjugate gradients by claiming that there are three phases in the convergence: First, an initial sublinearly convergent phase, in which

$$\|x_k - x\|_A = \sqrt{r_k^\top A^{-1} r_k} = O\left( \left[ \frac{1}{k + 1} \right]^2 \right) .$$

Next, an intermediate linearly convergent phase, in which

$$\|x_k - x\|_A = \sqrt{r_k^\top A^{-1} r_k} = O(\sigma^k) \quad \text{where} \quad \sigma = \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} .$$

Finally, a typically superlinearly convergent phase, which may only be seen if the convergence tolerances are very small and the condition number of the matrix is large.
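The Chebyshev bound $2\sigma^k/(1 + \sigma^{2k})$ is simple to evaluate, which gives a rough (and, as noted, pessimistic) a priori estimate of the iteration count. The Python sketch below is illustrative; the function names are invented here, and $T_k$ is evaluated through the identity $T_k(x) = \cosh(k \, \mathrm{arccosh}\, x)$, valid for $x \ge 1$.

```python
import math


def cg_bound(kappa, k):
    """Upper bound 2 sigma^k / (1 + sigma^(2k)) on ||x_k - x||_A / ||x_0 - x||_A."""
    sigma = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return 2.0 * sigma ** k / (1.0 + sigma ** (2 * k))


def chebyshev_form(a, b, k):
    """The same bound written as 1 / T_k((b + a) / (b - a))."""
    return 1.0 / math.cosh(k * math.acosh((b + a) / (b - a)))


def iterations_needed(kappa, eps):
    """Smallest k for which the bound drops to eps or below."""
    k = 0
    while cg_bound(kappa, k) > eps:
        k += 1
    return k


# Example: condition number 100, reduce the A-norm of the error by a factor 1e-6
k_needed = iterations_needed(100.0, 1.0e-6)
```

Since $\sigma \approx 1 - 2/\sqrt{\kappa(A)}$ for large condition numbers, the estimated iteration count grows like $\sqrt{\kappa(A)}$.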


When conjugate gradients is applied to discretizations of partial differential equations in which the finite difference stencil goes at most one unknown in any direction, then for arbitrary initial guesses conjugate gradients cannot be expected to converge in fewer than $O(1/\Delta x)$ iterations. Consider the following scenario. We want to solve a homogeneous problem with inhomogeneous Neumann data on the left boundary and zero Dirichlet boundary data elsewhere. Our initial guess for the conjugate gradient method will be $x = 0$. The evaluation of $r$ will put nonzero values at the left boundary. The first iteration will put nonzero values in $x$ at the left boundary. Each successive iteration will move the support of $x$ over one cell to the right. The iteration cannot converge before the influence of the zero Dirichlet boundary condition on the right is felt. In practice, the number of iterations required for convergence of conjugate gradients on discretizations of steady-state heat equations is roughly twice the largest number of grid cells in any one of the coordinate directions.

3.7.2 Preconditioned Conjugate Gradients

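Theorem 3.7-7 indicates that the convergence of conjugate gradients should be improved if the eigenvalues of the matrix $A$ are clustered in some way. In many cases, we can transform the system of equations $Ax = b$ to an equivalent system

$$\tilde A \tilde x \equiv \left( L^{-1} A L^{-\top} \right) \left( L^\top x \right) = L^{-1} b \equiv \tilde b ,$$

where $L$ is some nonsingular matrix. In general, it is too expensive to find matrices $L$ so that $L^{-1} A L^{-\top} \approx I$; instead, we will look for matrices $L$ so that systems involving $L$ are easy to solve, and the eigenvalues of $\tilde A = L^{-1} A L^{-\top}$ are more tightly clustered than the eigenvalues of $A$. Also, it would help if the condition numbers satisfied $\kappa(\tilde A) \ll \kappa(A)$.

The matrix $Q = L L^\top$ will be called a preconditioner for $A$ if it is used to improve the convergence of some basic iterative algorithm, such as conjugate gradients. Let us see how we could use a preconditioner in conjugate gradients. If we apply the basic conjugate gradient algorithm to $\tilde A \tilde x = \tilde b$, we obtain the algorithm

$\tilde s = \tilde b - \tilde A \tilde x = L^{-1} s$
$\tilde p = \tilde s \equiv L^\top p$
$\tilde \gamma = \tilde p^\top \tilde s = p^\top L L^{-1} s = p^\top s$
until convergence do
    $\tilde z = \tilde A \tilde p = L^{-1} A L^{-\top} L^\top p = L^{-1} A p = L^{-1} z$
    $\tilde \alpha = \tilde \gamma / \tilde p^\top \tilde z = \gamma / p^\top L L^{-1} A p = \gamma / p^\top A p = \alpha$
    $\tilde x = \tilde x + \tilde p \tilde \alpha = L^\top (x + p \alpha)$
    $\tilde s = \tilde s - \tilde z \tilde \alpha = L^{-1} (s - z \alpha)$
    $\tilde y = \tilde s = L^\top y$
    $\tilde \delta = \tilde y^\top \tilde s = y^\top L L^{-1} s = y^\top s = \delta$
    $\tilde p = \tilde y + \tilde p \tilde \delta / \tilde \gamma = L^\top (y + p \delta / \gamma)$
    $\tilde \gamma = \tilde \delta = \delta$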

We can compute the same quantities by the following algorithm:

$s = b - A x$
solve $Q p = s$
$\gamma = p^\top s$
until convergence do
    $z = A p$
    $\alpha = \gamma / p^\top z$
    $x = x + p \alpha$
    $s = s - z \alpha$
    solve $Q y = s$
    $\delta = y^\top s$
    $p = y + p \delta / \gamma$
    $\gamma = \delta$
(3.7-5)

In this form of the algorithm, it does not matter if the preconditioner $Q$ is factored in some form $Q = L L^\top$. It is important, however, that $Q$ be symmetric and positive definite, and allow fast solutions of linear systems.
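Algorithm (3.7-5) can be sketched in a few lines of Python. This is an illustration only, not the course program: the function name `pcg`, the absolute residual tolerance, and the way the preconditioner is passed in (as a function `solve_Q` that returns the solution of $Qy = s$) are assumptions made for the example. The test matrix is a diagonally dominant tridiagonal matrix with a strongly varying diagonal, chosen here so that the Jacobi preconditioner $Q = \mathrm{diag}(A)$ is a natural candidate.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, solve_Q, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients (3.7-5) from x = 0; returns (x, iterations)."""
    x = [0.0] * len(b)
    s = list(b)                          # s = b - A x with x = 0
    p = solve_Q(s)                       # solve Q p = s
    gamma = dot(p, s)
    for k in range(max_iter):
        if max(abs(si) for si in s) <= tol:
            return x, k
        z = matvec(A, p)
        alpha = gamma / dot(p, z)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        s = [si - alpha * zi for si, zi in zip(s, z)]
        y = solve_Q(s)                   # solve Q y = s
        delta = dot(y, s)
        p = [yi + (delta / gamma) * pi for yi, pi in zip(y, p)]
        gamma = delta
    return x, max_iter


# Example matrix: diagonal 3, 6, 9, ..., off-diagonals -1 (symmetric positive definite)
n = 8
A = [[3.0 * (i + 1) if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = matvec(A, [1.0] * n)                 # exact solution is (1, ..., 1)


def jacobi(s):
    return [si / A[i][i] for i, si in enumerate(s)]   # Q = diag(A)


def identity(s):
    return list(s)                       # Q = I, i.e. no preconditioning


x_jac, it_jac = pcg(A, b, jacobi)
x_id, it_id = pcg(A, b, identity)
```

Passing the preconditioner as a solve routine mirrors the remark above: the algorithm never needs the factor $L$, only the ability to solve systems with $Q$.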

The preconditioned conjugate gradient algorithm minimizes

$$\tilde \phi(\tilde x) \equiv \tilde r^\top \tilde A^{-1} \tilde r = r^\top L^{-\top} \left( L^{-1} A L^{-\top} \right)^{-1} L^{-1} r = r^\top A^{-1} r ,$$

which is the same objective function minimized by the (unpreconditioned) conjugate gradient algorithm. However, the step directions now lie in the preconditioned Krylov subspace $\langle r_0, A Q^{-1} r_0, \dots, (A Q^{-1})^k r_0 \rangle$. Thus the preconditioned conjugate gradient iteration will find that the error at the $k$-th iteration is bounded by $2 \sigma^k / (1 + \sigma^{2k})$ where

$$\sigma = \frac{\sqrt{\kappa(A Q^{-1})} - 1}{\sqrt{\kappa(A Q^{-1})} + 1} .$$

If the preconditioner $Q$ is such that $\kappa(A Q^{-1}) \ll \kappa(A)$, then preconditioning will substantially reduce the number of iterations.

Golub and van Loan [43, p. 532] suggest several candidates for preconditioners. A very simple preconditioner is $Q = \mathrm{diag}(A)$, the diagonal of $A$; this corresponds to the Jacobi iteration. If $A$ has block tridiagonal form

$$A = \begin{bmatrix} A_1 & B_1 & \dots & 0 \\ B_1^\top & A_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & B_{n_2 - 1} \\ 0 & \dots & B_{n_2 - 1}^\top & A_{n_2} \end{bmatrix} ,$$

where the diagonal blocks $A_1, \dots, A_{n_2}$ are banded, we could try

$$Q = \begin{bmatrix} A_1 & 0 & \dots & 0 \\ 0 & A_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \dots & 0 & A_{n_2} \end{bmatrix} ;$$

this corresponds to the block Jacobi iteration. When $A$ is an M-matrix, we could also let $Q$ be an incomplete factorization of $A$; see section 3.6. Also, if we write $A = D + U + U^\top$ where $D$ is diagonal and $U$ is strictly upper triangular, we can let $Q$ be given by the symmetric successive over-relaxation (SSOR) iteration

$$Q = \left( D + U^\top \omega \right) D^{-1} \left( D + U \omega \right) .$$

A program to perform preconditioned conjugate gradients can be found in Program 3.7-9: IterativeError2D.C. This program calls Fortran subroutine matrix multiply in Program 3.7-10: iterative improvement 2d.m4. Note that this Fortran routine looks much like a finite difference computation on a grid, rather than like a matrix-vector multiplication. The remainder of the conjugate gradient algorithm in procedure runOnce of file IterativeError2D.C uses LAPACK BLAS routines, or loops that ignore the grid structure. Students can execute this program by clicking on Executable 3.7-11: iterative error. Students can select several preconditioners for conjugate gradients, including no preconditioner (i.e. the identity matrix for the preconditioner). The executable selects random values between zero and one for the initial solution values on the grid, and contours the numerical solution for each iteration. By setting the number of grid cells in one coordinate direction to zero, the user can perform a mesh refinement study, and compare the performance of several iterative methods.






3.8 Minimum Residual Methods

Given $b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$ not necessarily symmetric or positive-definite, we may want to solve $Ax = b$ by an iterative process. Such a situation may arise if we are solving a convection-diffusion problem, for example.

3.8.1 Orthomin

Given a symmetric positive-definite matrix $M$ of the same size as $A$, we will solve $Ax = b$ by minimizing

$$\phi(x) \equiv \tfrac{1}{2} \|Ax - b\|_M^2 = \tfrac{1}{2} (Ax - b)^\top M (Ax - b) . \tag{3.8-1}$$

The minimization process will take the form of a recurrence

$$x_{k+1} = x_k + \sum_{j=0}^{k} p_j \alpha_{j,k} , \tag{3.8-2}$$

and the basic idea is contained in the following lemma.

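Lemma 3.8-1: Suppose that $b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$. Given that $M \in \mathbb{R}^{n \times n}$ is symmetric and positive definite and $P = [p_0, \dots, p_{k-1}] \in \mathbb{R}^{n \times k}$, define the Gram matrix $G \in \mathbb{R}^{k \times k}$ by

$$G = (AP)^\top M (AP) .$$

Then

1. $G$ is symmetric.

2. $G$ is positive definite if and only if the columns of $AP$ are linearly independent.

3. If $x \in \mathbb{R}^n$ and $\tilde x = x + P a$ minimizes $\phi(\tilde x) = \tfrac{1}{2} (A \tilde x - b)^\top M (A \tilde x - b)$ over all $a \in \mathbb{R}^k$, then $a$ solves

$$G a = g \quad \text{where} \quad g = -(AP)^\top M (Ax - b) . \tag{3.8-3}$$

4. If the columns of $AP$ are linearly independent and $a$ solves $G a = g$, then $\tilde x = x + P a$ uniquely minimizes $\phi(\tilde x)$ over all possible values for $a$.

5. If $\tilde x = x + P a$ minimizes $\phi(\tilde x)$ over all possible $a \in \mathbb{R}^k$, then the residual $\tilde r = A \tilde x - b$ satisfies

$$(AP)^\top M \tilde r = 0 . \tag{3.8-4}$$

6. The residual at the minimum satisfies the equation

$$\tilde r = (Ax - b) + A P a . \tag{3.8-5}$$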

Proof:

1. The symmetry of $G$ is obvious.

2. Suppose that the columns of $AP$ are linearly independent. If $a \in \mathbb{R}^k$, then

$$a^\top G a = (A P a)^\top M (A P a) \ge 0 ,$$

since $M$ is positive definite. If $a^\top G a = 0$, then the positive-definiteness of $M$ implies that $A P a = 0$; since the columns of $AP$ are linearly independent, we must have that $a = 0$. This proves that $G$ is positive-definite. If the columns of $AP$ are not linearly independent, then there exists $a \ne 0$ so that $A P a = 0$. Then $G a = (AP)^\top M (A P a) = 0$, so $G$ is not positive definite.

3. If $\tilde x = x + P a$ minimizes $\phi(\tilde x)$ over all choices of $a$, then the first-order conditions for the minimum of $\phi$ imply that

$$0 = \nabla_a \phi(x + P a) = \nabla_a \left[ \tfrac{1}{2} (Ax + A P a - b)^\top M (Ax + A P a - b) \right] = (AP)^\top M (Ax - b + A P a) = G a - g .$$

4. If the columns of $AP$ are linearly independent, then $G$ is nonsingular, and $G a = g$ has a unique solution $a$. Since $G = \nabla_a \nabla_a^\top \phi$ is positive-definite, $a$ is the unique minimizer of $\tfrac{1}{2} \|A (x + P a) - b\|_M^2$. This is equivalent to saying that $\tilde x = x + P a$ is the unique minimizer of $\phi(\tilde x)$ over all choices of $a$.

5. If $\tilde x = x + P a$ minimizes $\phi(\tilde x)$ over all choices of $a$, then we have already shown that

$$0 = (AP)^\top M (Ax - b + A P a) = (AP)^\top M \tilde r .$$

6. The equation for the residual is obvious from the equation for $\tilde x$ in the fourth claim.

□

The next lemma will show us how to generate the p-vectors.

Lemma 3.8-2: Suppose that $A \in \mathbb{R}^{n \times n}$, and that $M \in \mathbb{R}^{n \times n}$ is symmetric and positive definite. Suppose that the columns of $P = [p_0, \dots, p_{k-1}] \in \mathbb{R}^{n \times k}$ are such that $(AP)^\top M (AP)$ is diagonal and nonsingular. Given $r \in \mathbb{R}^n$, define $q \in \mathbb{R}^k$ by

$$(AP)^\top M (AP) q = (AP)^\top M (A r) .$$

Let

$$p = -r + P q . \tag{3.8-6}$$

Then the vector $p$ satisfies the conjugacy condition

$$(AP)^\top M A p = 0 ; \tag{3.8-7}$$

further, $(A [P \ p])^\top M (A [P \ p])$ is diagonal.

Proof: First, the definition of $p$ shows that

$$(AP)^\top M (A p) = -(AP)^\top M (A r) + (AP)^\top M (A P q) = 0 .$$

Since $(AP)^\top M (AP)$ is diagonal, so is

$$(A [P \ p])^\top M (A [P \ p]) = \begin{bmatrix} (AP)^\top M (AP) & (AP)^\top M (A p) \\ (A p)^\top M (AP) & (A p)^\top M (A p) \end{bmatrix} = \begin{bmatrix} (AP)^\top M (AP) & 0 \\ 0 & (A p)^\top M (A p) \end{bmatrix} .$$

□

As a result, we obtain the following general form of a minimum residual algorithm, given an initial guess $x_0$:

$r_0 = A x_0 - b$
choose $p_0$ (typically $p_0 = -r_0$)
for $k = 0, 1, \dots$ until convergence
    solve the $(k + 1) \times (k + 1)$ linear system
        $\left[ (A p_i)^\top M (A p_j) \right] \left[ \alpha_{j,k} \right] = - \left[ (A p_i)^\top M r_k \right]$ , $0 \le i, j \le k$
    $x_{k+1} = x_k + \sum_{j=0}^{k} p_j \alpha_{j,k}$
    $r_{k+1} = r_k + \sum_{j=0}^{k} (A p_j) \alpha_{j,k}$
    for $j = 0, \dots, k$
        $\beta_{j,k} = \dfrac{(A p_j)^\top M (A r_{k+1})}{(A p_j)^\top M (A p_j)}$
    $p_{k+1} = -r_{k+1} + \sum_{j=0}^{k} p_j \beta_{j,k}$
(3.8-8)

We need to find ways to simplify the calculation of $\alpha_{j,k}$, $x_{k+1}$ and $r_{k+1}$. We will see that the $\alpha_{j,k}$ solve a diagonal system in which all but one of the entries of the right-hand side are zero.

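Lemma 3.8-3: [7, Lemma 12.1] Suppose that $A \in \mathbb{R}^{n \times n}$ and $M \in \mathbb{R}^{n \times n}$ is symmetric positive definite. Given $b, x_0 \in \mathbb{R}^n$ and $r_0 = A x_0 - b = -p_0$, suppose that we compute the solution vectors $x_1, \dots, x_{k+1}$, the residual vectors $r_1, \dots, r_{k+1}$ and the search directions $p_1, \dots, p_{k+1}$ by the recurrences

$$\text{solve } \left[ (A p_i)^\top M (A p_j) \right] \left[ \alpha_{j,k} \right] = - \left[ (A p_i)^\top M r_k \right] \text{ for } \alpha_{j,k} , \ 0 \le j \le k \tag{3.8-9a}$$

$$x_{k+1} = x_k + \sum_{j=0}^{k} p_j \alpha_{j,k} \tag{3.8-9b}$$

$$r_{k+1} = r_k + \sum_{j=0}^{k} A p_j \alpha_{j,k} \tag{3.8-9c}$$

$$\beta_{j,k} = \frac{(A p_j)^\top M (A r_{k+1})}{(A p_j)^\top M (A p_j)} , \ 0 \le j \le k \tag{3.8-9d}$$

$$p_{k+1} = -r_{k+1} + \sum_{j=0}^{k} p_j \beta_{j,k} . \tag{3.8-9e}$$

Finally, suppose that the matrix $G_\ell \equiv \left[ (A p_i)^\top M (A p_j) \right] \in \mathbb{R}^{(\ell + 1) \times (\ell + 1)}$ is nonsingular for $0 \le \ell \le k$. Then

$$\forall\, 0 \le i \le k , \quad (A p_i)^\top M r_{k+1} = 0 ;$$

$$\forall\, 0 \le i \le k , \quad (A r_i)^\top M r_{k+1} = 0 ;$$

$$\forall\, 0 \le i \le k + 1 , \quad (A p_i)^\top M r_{k+1} = -(A r_i)^\top M r_{k+1} .$$

Proof: The first claim was proved in equation (3.8-4). Since equation (3.8-6) implies that for $0 \le i \le k$, $r_i = \sum_{j=0}^{i-1} p_j \beta_{j,i-1} - p_i$, the orthogonality of $r_{k+1}$ to the vectors $A p_i$ for $0 \le i \le k$ implies that

$$(A r_i)^\top M r_{k+1} = \left( \sum_{j=0}^{i-1} A p_j \beta_{j,i-1} - A p_i \right)^\top M r_{k+1} = 0 .$$

Since equation (3.8-6) implies that for $0 \le i \le k + 1$, $p_i + r_i = \sum_{j=0}^{i-1} p_j \beta_{j,i-1}$, the orthogonality of $r_{k+1}$ to the vectors $A p_i$ for $0 \le i \le k$ implies that

$$(A [p_i + r_i])^\top M r_{k+1} = \left( \sum_{j=0}^{i-1} A p_j \beta_{j,i-1} \right)^\top M r_{k+1} = 0 .$$

□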

Next, we will prove a condition on the matrix $M$ that guarantees that the iterative method will succeed.

Lemma 3.8-4: [7, Theorem 12.2] Suppose that the hypotheses of lemma 3.8-3 are satisfied, and that $M A + A^\top M$ is symmetric and positive definite and that for $0 \le i \le k$, $r_i \ne 0$. Then the vectors $A p_0, \dots, A p_k$ are linearly independent and the matrix $G_k \equiv \left[ (A p_i)^\top M (A p_j) \right]$ is nonsingular.

Proof: We will prove the lemma by contradiction. Suppose without loss of generality that $i = k$ is the smallest index such that $G_i$ is singular. Then lemma 3.8-1 shows that the set $\{ A p_j : 0 \le j \le k \}$ is linearly dependent and $\{ A p_j : 0 \le j < k \}$ is linearly independent. Thus there are scalars $\lambda_j$ so that

$$A p_k = \sum_{j=0}^{k-1} A p_j \lambda_j .$$

Since recurrence (3.8-6) implies that $r_k = -p_k + \sum_{j=0}^{k-1} p_j \beta_{j,k-1}$, we have

$$A r_k = \sum_{j=0}^{k-1} A p_j (\beta_{j,k-1} - \lambda_j) .$$

It follows from lemma 3.8-3 that

$$(A r_k)^\top M r_k = \left( \sum_{j=0}^{k-1} A p_j [\beta_{j,k-1} - \lambda_j] \right)^\top M r_k = 0 .$$

This implies that

$$0 = (A r_k)^\top M r_k + r_k^\top M (A r_k) = r_k^\top (A^\top M + M A) r_k .$$

Since $r_k \ne 0$ and $A^\top M + M A$ is symmetric and positive definite, we have a contradiction. □

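Note that the conjugacy of the search directions (3.8-7) implies that the matrix $G_k$, which is used in lemma 3.8-3 to determine the coefficients $\alpha_{j,k}$, is diagonal. Moreover, the first claim of lemma 3.8-3 shows that $(A p_i)^\top M r_k = 0$ for $i < k$, so only $\alpha_{k,k}$ can be nonzero. In this case, the minimum residual algorithm simplifies to

$r_0 = A x_0 - b$
$p_0 = -r_0$
for $k = 0, 1, \dots$ until convergence
    compute $(A p_k)^\top M (A p_k)$
    $\alpha_{k,k} = - \dfrac{(A p_k)^\top M r_k}{(A p_k)^\top M (A p_k)}$
    $x_{k+1} = x_k + p_k \alpha_{k,k}$
    $r_{k+1} = r_k + (A p_k) \alpha_{k,k}$
    compute $A r_{k+1}$
    for $j = 0, \dots, k$
        $\beta_{j,k} = \dfrac{(A p_j)^\top M (A r_{k+1})}{(A p_j)^\top M (A p_j)}$
    $p_{k+1} = -r_{k+1} + \sum_{j=0}^{k} p_j \beta_{j,k}$
(3.8-10)

The minus sign in $\alpha_{k,k}$ comes from the right-hand side of (3.8-9a). This algorithm at step $k$ requires storage for a total of $2k + 5$ vectors, namely vectors $p_j$ for $0 \le j \le k$, $A p_j$ for $0 \le j \le k$, $x_{k+1}$, $r_{k+1}$ and $A r_{k+1}$.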
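The simplified minimum residual algorithm (3.8-10) with $M = I$ can be sketched as follows. This Python code is an illustration under stated assumptions and is not the domn routine referenced below: the function name `orthomin`, the dense storage, the tolerance and the test matrix are all invented for this example. The test matrix is a nonsymmetric tridiagonal matrix, loosely resembling an upwind convection-diffusion stencil, whose symmetric part $A + A^\top$ is positive definite.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthomin(A, b, x0, tol=1e-10, max_iter=50):
    """Minimum residual iteration (3.8-10) with M = I; returns (x, iterations)."""
    x = list(x0)
    r = [yi - bi for yi, bi in zip(matvec(A, x), b)]    # r = A x - b
    p = [-ri for ri in r]
    P, AP = [], []                   # all search directions p_j and their images A p_j
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, k
        Ap = matvec(A, p)
        P.append(p)
        AP.append(Ap)
        alpha = -dot(Ap, r) / dot(Ap, Ap)               # note the minus sign
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * api for ri, api in zip(r, Ap)]
        Ar = matvec(A, r)
        betas = [dot(APj, Ar) / dot(APj, APj) for APj in AP]
        p = [-ri + sum(beta * Pj[i] for beta, Pj in zip(betas, P))
             for i, ri in enumerate(r)]
    return x, max_iter


# Nonsymmetric example: diagonal 2, subdiagonal -1.5, superdiagonal -0.5
n = 6
A = [[2.0 if i == j else -1.5 if j == i - 1 else -0.5 if j == i + 1 else 0.0
      for j in range(n)] for i in range(n)]
b = matvec(A, [1.0] * n)             # exact solution is (1, ..., 1)
x, iters = orthomin(A, b, [0.0] * n)
```

The lists `P` and `AP` grow by one vector per iteration, which is the storage cost noted above; practical codes usually truncate or restart the recurrence.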

Example 3.8-5: [7, p. 520] Suppose that we want to solve

$$Ax = b .$$

The orthomin iteration chooses $M = I$. Lemma 3.8-4 shows that this algorithm converges if $A + A^\top$ is positive definite. If $A + A^\top$ is not positive definite, we can apply the iteration to

$$\tilde A x \equiv Q A x = Q b \equiv \tilde b ,$$

where $Q$ is appropriately chosen (e.g., $Q = A^\top$). The program domn to implement preconditioned orthomin iteration can be found at Program 3.8-6: Preconditioned Orthomin or locally at Program 3.8-7: domn.f

Example 3.8-8: [7, p. 527] If $A$ is symmetric positive-definite and $M = A^{-1}$, then

$$A^\top M A = A^\top A^{-1} A = A .$$

In this case, the minimum residual algorithm reduces to the standard preconditioned conjugate gradient algorithm.

Example 3.8-9: [7, p. 527] If $M = [\tfrac{1}{2} (A + A^\top)]^{-1}$ is symmetric and positive-definite, we obtain the CGW method of Concus, Golub and Widlund. In this case,

$$A^\top M A = A^\top [\tfrac{1}{2} (A + A^\top)]^{-1} A .$$

The following lemma describes the convergence of the minimum residual algorithm.

http://www.netlib.org/slap/

http://www.math.duke.edu/~johnt/math226/iterative_methods/dlap/domn.f


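Lemma 3.8-10: [7, Theorem 12.5] Suppose that the hypotheses of lemma 3.8-4 are satisfied, and that $r_k \ne 0$. Let

$$ B \equiv M^{1/2} A M^{-1/2} . $$

If $\lambda_{\min}(A)$ is the smallest eigenvalue of $A$, let

$$ \xi \equiv \lambda_{\min}\Big( \tfrac12 [B + B^\top] \Big) \, \lambda_{\min}\Big( \tfrac12 [B^{-1} + B^{-\top}] \Big) . $$

Then $\|r_{k+1}\|_M^2 \le (1 - \xi) \|r_k\|_M^2$. If $\xi < \tfrac12$, then we have the better estimate

$$ \|r_{k+1}\|_M^2 \le \Big( 1 - \frac{\xi}{1 - \xi} \Big) \|r_k\|_M^2 . $$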

3.8.2 GMRES

An alternative approach to the recurrence (3.8-6) for the search directions is to require these vectors to be mutually orthogonal, and generate them by some other process. For example, we could require

$$ p_{k+1} = \Big( A p_k - \sum_{j=0}^{k} p_j \eta_{j,k} \Big) / \eta_{k+1,k} $$

where the scalar coefficients $\eta_{j,k}$ are chosen so that

$$ \|p_j\|_2 = 1 \;\; \forall\, 0 \le j \le k+1 \quad\text{and}\quad p_{k+1}^\top p_j = 0 \;\; \forall\, 0 \le j \le k . $$

We can use the modified Gram-Schmidt process to compute $p_{k+1}$:

$$
\begin{array}{l}
p_{k+1} = A p_k \\
\text{for } 0 \le j \le k \\
\quad \eta_{j,k} = p_j^\top p_{k+1} \\
\quad p_{k+1} \mathrel{-}= p_j \eta_{j,k} \\
\eta_{k+1,k} = \|p_{k+1}\| \\
p_{k+1} = p_{k+1} / \eta_{k+1,k}
\end{array}
\qquad (3.8\text{-}11)
$$

When $p_{k+1}$ is chosen to be in the span $\langle p_0, \ldots, p_k, A p_k \rangle$, this orthogonalization scheme is called the Arnoldi process. Note that if $\|p_{k+1}\| = 0$ and $k + 1 < n$, then $p_{k+1}$ must be generated by some other means (called a restart) so that $p_{k+1}$ is orthogonal to the previous vectors $p_j$. The next lemma summarizes this process.
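The Arnoldi process with modified Gram-Schmidt can be illustrated with a short sketch. The function name `arnoldi`, the rectangular storage of the coefficients $\eta_{i,j}$ in an array `H`, and the choice to raise an error on breakdown (rather than restart) are assumptions made for this illustration.

```python
import numpy as np

def arnoldi(A, p0, m):
    """Run m Arnoldi steps with modified Gram-Schmidt.

    Returns P with m+1 orthonormal columns and the (m+1) x m array H of
    coefficients eta_{i,j}, so that A @ P[:, :m] equals P @ H.
    """
    n = len(p0)
    P = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    P[:, 0] = p0 / np.linalg.norm(p0)
    for k in range(m):
        w = A @ P[:, k]                      # candidate for p_{k+1}
        for j in range(k + 1):
            H[j, k] = P[:, j] @ w            # eta_{j,k}
            w -= H[j, k] * P[:, j]           # subtract the projection at once
        H[k + 1, k] = np.linalg.norm(w)      # eta_{k+1,k}
        if H[k + 1, k] == 0.0:
            raise RuntimeError("breakdown: a restart would be needed")
        P[:, k + 1] = w / H[k + 1, k]
    return P, H
```

The leading square block `H[:m, :m]` is the upper Hessenberg matrix $P^\top A P$ built from the first $m$ directions.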


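Lemma 3.8-11: Let

$$ P_k = [p_0, \ldots, p_k] \in \mathbb{R}^{n \times (k+1)} $$

be the matrix of search directions, and let

$$ H_k = [\eta_{i,j}] \in \mathbb{R}^{(k+1) \times (k+1)} $$

be the upper Hessenberg matrix of coefficients computed in the Arnoldi process (3.8-11). Then

1. $\forall\, 0 \le j \le k$, $p_{j+1} \perp \langle p_0, A p_0, \ldots, A^j p_0 \rangle$;

2. $H_k = P_k^\top A P_k$;

3. $\forall\, 0 \le j \le k$, $A P_k e_j = P_{k+1} H_{k+1} e_j$.

Proof:

1. The first claim is obvious from the orthogonalization process and the construction of the search directions in a Krylov subspace.

2. Next, we will prove the second claim. If $i \le j \le k$, the definition

$$ \eta_{i,j} = p_i^\top \Big( A p_j - \sum_{\ell=0}^{i-1} p_\ell \eta_{\ell,j} \Big) $$

shows that $p_i^\top A p_j = \eta_{i,j}$. If $i = j + 1$, let $\tilde{p}_{j+1} = A p_j - \sum_{i=0}^{j} p_i \eta_{i,j}$ denote the vector before normalization, so that $p_{j+1} = \tilde{p}_{j+1} / \eta_{j+1,j}$. Then

$$ p_{j+1}^\top A p_j = \frac{1}{\eta_{j+1,j}} \tilde{p}_{j+1}^\top \Big[ \tilde{p}_{j+1} + \sum_{i=0}^{j} p_i \eta_{i,j} \Big] = \frac{1}{\eta_{j+1,j}} \|\tilde{p}_{j+1}\|^2 = \|\tilde{p}_{j+1}\| = \eta_{j+1,j} . $$

Finally, if $i > j + 1$, then the first claim shows that $p_i^\top A p_j = 0$.

3. To prove the third claim, note that for $0 \le j \le k$, the Arnoldi process to compute $p_{j+1}$ implies that

$$ A P_k e_j = A p_j = p_{j+1} \eta_{j+1,j} + \sum_{i=0}^{j} p_i \eta_{i,j} = \sum_{i=0}^{j+1} p_i \eta_{i,j} = P_{k+1} H_{k+1} e_j . $$

$\Box$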


We can use these results to construct an iterative algorithm for solving Ax = b.

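Lemma 3.8-12: Suppose we are given $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$ and $x_0 \in \mathbb{R}^n$ so that $r_0 = A x_0 - b \ne 0$. Let $p_0 = -r_0 / \|r_0\|_2$. Suppose that $p_0, p_1, \ldots, p_k$ are mutually orthogonal vectors generated by the Arnoldi process (3.8-11). Define $P_k = [p_0, \ldots, p_k]$ and assume that the upper Hessenberg matrix $H_k = P_k^\top A P_k$ is nonsingular. Define $y_k$ to be the solution of $H_k y_k = e_0 \|r_0\|$. Then

1. $x_k = x_0 + P_k y_k$ and $r_k = A x_k - b$ implies that $r_k \perp \langle r_0, A r_0, \ldots, A^k r_0 \rangle$;

2. there is a constant $\lambda$ so that $r_k = p_{k+1} \lambda$;

3. $r_0, \ldots, r_k$ are mutually orthogonal.

Proof:

1. Note that, since $P_k^\top r_0 = -e_0 \|r_0\|$,

$$ P_k^\top r_k = P_k^\top (r_0 + A P_k y_k) = -e_0 \|r_0\| + H_k y_k = 0 . $$

Since the range of $P_k$ is the Krylov subspace $\langle r_0, A r_0, \ldots, A^k r_0 \rangle$, the first claim is proved.

2. Note that

$$
\begin{aligned}
r_k = r_0 + A P_k y_k &= -p_0 \|r_0\| + \sum_{j=0}^{k} (A P_k e_j)(e_j^\top y_k) \\
&= -P_{k+1} e_0 \|r_0\| + \sum_{j=0}^{k} P_{k+1} H_{k+1} e_j e_j^\top y_k \\
&= P_{k+1} \left( \begin{bmatrix} -e_0 \\ 0 \end{bmatrix} \|r_0\| + \begin{bmatrix} H_k & h_{k+1} \\ \eta_{k+1,k} e_k^\top & \eta_{k+1,k+1} \end{bmatrix} \begin{bmatrix} y_k \\ 0 \end{bmatrix} \right) \\
&= P_{k+1} \begin{bmatrix} -e_0 \|r_0\| + H_k y_k \\ \eta_{k+1,k} e_k^\top y_k \end{bmatrix} = p_{k+1} \eta_{k+1,k} e_k^\top y_k .
\end{aligned}
$$

3. Since the Arnoldi process computes the vectors $p_j$ to be mutually orthogonal, and since we have just shown that each $r_j$ is a scalar multiple of the corresponding $p_{j+1}$ (with $r_0$ a multiple of $p_0$), the third claim is proved.

$\Box$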


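As a result, we have the following algorithm for solving $Ax = b$:

Arnoldi Algorithm 3.8-13:

$$
\begin{array}{l}
r_0 = A x_0 - b \\
p_0 = -r_0 / \|r_0\|_2 \\
\text{for } k = 1, 2, \ldots \text{ until convergence} \\
\quad p_k = A p_{k-1} \\
\quad \text{for } 0 \le j < k \\
\quad\quad \eta_{j,k-1} = p_j^\top p_k \\
\quad\quad p_k \mathrel{-}= p_j \eta_{j,k-1} \\
\quad \eta_{k,k-1} = \|p_k\|_2 \\
\quad p_k = p_k / \eta_{k,k-1} \\
\quad \text{solve } H_k y_k = e_0 \|r_0\|_2 \\
\quad z_k = \sum_{j=0}^{k} p_j e_j^\top y_k \\
\quad r_k = r_0 + A z_k \\
\quad \text{compute } \rho_k = \|r_k\|_2 \\
\quad \text{if } \rho_k \le \varepsilon \text{ break} \\
x_k = x_0 + z_k
\end{array}
$$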

The difficulty in this algorithm lies in four expensive steps, beginning with solving for $y_k$ and ending with computing $\|r_k\|$. Since $\|r_k\|$ is needed for the termination test, it cannot be avoided. Instead, we will see that the cost of computing $y_k$, $z_k$ and $r_k$ can be avoided.

Consider the least squares problem to minimize

$$ \|A P_k y_k + r_0\| = \|P_k H_k y_k + r_0\| $$

over all $y_k \in \mathbb{R}^{k+1}$. The normal equations for this least squares problem are

$$ 0 = H_k^\top P_k^\top (P_k H_k y_k + r_0) = H_k^\top \big( H_k y_k + P_k^\top r_0 \big) . $$

If $H_k$ is nonsingular, then the solution of the normal equations satisfies

$$ H_k y_k = -P_k^\top r_0 = e_0 \|r_0\| . $$

This is the same equation that lemma 3.8-12 showed would generate mutually orthogonal residuals.

Choosing $x_k$ so that $r_k$ is orthogonal to the Krylov subspace $\langle r_0, A r_0, \ldots, A^k r_0 \rangle$ is equivalent to computing $x_k - x_0 = P_k y_k$ as that vector in the Krylov subspace that minimizes

$$
\begin{aligned}
\|A x_k - b\| &= \|A P_k y_k + r_0\| = \big\| A P_k y_k - p_0 \|r_0\| \big\| = \big\| P_k H_k y_k - p_0 \|r_0\| \big\| \qquad (3.8\text{-}12) \\
&= \left\| P_{k+1} \left( H_{k+1} \begin{bmatrix} y_k \\ 0 \end{bmatrix} - \begin{bmatrix} e_0 \\ 0 \end{bmatrix} \|r_0\| \right) \right\| = \left\| H_{k+1} \begin{bmatrix} y_k \\ 0 \end{bmatrix} - \begin{bmatrix} e_0 \\ 0 \end{bmatrix} \|r_0\| \right\| .
\end{aligned}
$$


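Suppose that $H_{k+1} = Q_{k+1} R_{k+1}$ where $Q_{k+1}$ is orthogonal and $R_{k+1}$ is right-triangular. Then

$$ \|A z_k + r_0\| = \left\| R_{k+1} \begin{bmatrix} y_k \\ 0 \end{bmatrix} - Q_{k+1}^\top \begin{bmatrix} e_0 \\ 0 \end{bmatrix} \|r_0\| \right\| . $$

Let us partition the vector inside the norm on the right:

$$ R_{k+1} \begin{bmatrix} y_k \\ 0 \end{bmatrix} - Q_{k+1}^\top \begin{bmatrix} e_0 \\ 0 \end{bmatrix} \|r_0\| = \begin{bmatrix} R & r \\ 0 & \rho \end{bmatrix} \begin{bmatrix} y_k \\ 0 \end{bmatrix} - \begin{bmatrix} g \\ \gamma \end{bmatrix} \|r_0\| . $$

We make this vector as small as possible by choosing $y_k$ to solve $R y_k = g \|r_0\|$. With this choice, there is only one non-zero component left in the vector, so

$$ \|A z_k + r_0\| = |\gamma| \, \|r_0\| . $$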

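Consider the first step, in which

$$ H_1 = \begin{bmatrix} \eta_{0,0} & \eta_{0,1} \\ \eta_{1,0} & \eta_{1,1} \end{bmatrix} . $$

We can compute a plane reflector $G_{01}$ so that

$$ G_{01} \begin{bmatrix} \eta_{0,0} \\ \eta_{1,0} \end{bmatrix} = \begin{bmatrix} \rho_{0,0} \\ 0 \end{bmatrix} . $$

We also compute

$$ g_1 = G_{01} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} \gamma_{0,1} \\ \gamma_{1,1} \end{bmatrix} . $$

Then

$$ G_{01} H_1 = \begin{bmatrix} \rho_{0,0} & \rho_{0,1} \\ 0 & \rho_{1,1} \end{bmatrix} \equiv R $$

is right-triangular. Choose $y_0$ so that

$$ \rho_{0,0} y_0 = \gamma_{0,1} \|r_0\| . $$

Then

$$
\begin{aligned}
\min_{y_0} \big\| A P_0 y_0 - p_0 \|r_0\| \big\|_2 &= \left\| \begin{bmatrix} \eta_{0,0} \\ \eta_{1,0} \end{bmatrix} y_0 - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \|r_0\|_2 \right\|_2 \\
&= \left\| G_{01} \left( \begin{bmatrix} \eta_{0,0} & \eta_{0,1} \\ \eta_{1,0} & \eta_{1,1} \end{bmatrix} \begin{bmatrix} y_0 \\ 0 \end{bmatrix} - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \|r_0\|_2 \right) \right\|_2 \\
&= \left\| \begin{bmatrix} \rho_{0,0} & \rho_{0,1} \\ 0 & \rho_{1,1} \end{bmatrix} \begin{bmatrix} y_0 \\ 0 \end{bmatrix} - \begin{bmatrix} \gamma_{0,1} \\ \gamma_{1,1} \end{bmatrix} \|r_0\|_2 \right\|_2 \\
&= \left\| \begin{bmatrix} 0 \\ \gamma_{1,1} \end{bmatrix} \right\|_2 \|r_0\|_2 = |\gamma_{1,1}| \, \|r_0\|_2
\end{aligned}
$$

because the minimum is achieved at $y_0 = \gamma_{0,1} \|r_0\| / \rho_{0,0}$. Thus $\|r_1\|$ is the absolute value of the last component of $g_1$, times $\|r_0\|$.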

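Inductively, suppose that we have computed plane reflectors $G_{01}, G_{12}, \ldots, G_{k-1,k}$ so that

$$ G_{k-1,k} \cdots G_{0,1} \begin{bmatrix} H_{k-1} \\ \eta_{k,k-1} e_{k-1}^\top \end{bmatrix} = \begin{bmatrix} R_{k-1} \\ 0 \end{bmatrix} $$

is right-trapezoidal. Also assume that we have computed

$$ g_k = G_{k-1,k} \cdots G_{0,1} \begin{bmatrix} e_0 \\ 0 \end{bmatrix} . \qquad (3.8\text{-}13) $$

Here $g_k$ is defined without the factor $\|r_0\|$, as in the first step above; the factor is applied when the residual norm and the solution are formed. If we apply this sequence of plane reflectors at the next step in the algorithm, we get

$$ \begin{bmatrix} G_{k-1,k} \cdots G_{0,1} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} H_{k-1} & h_k \\ \eta_{k,k-1} e_{k-1}^\top & \eta_{k,k} \end{bmatrix} = \begin{bmatrix} R_{k-1} & G_{k-1,k} \cdots G_{0,1} h_k \\ \eta_{k,k-1} e_{k-1}^\top & \eta_{k,k} \end{bmatrix} . $$

We choose the plane reflector $G_{k,k+1}$ to zero the $\eta_{k,k-1} e_{k-1}^\top$ entry in the right-hand side of this expression:

$$ G_{k,k+1} \cdots G_{0,1} \begin{bmatrix} H_{k-1} & h_k \\ \eta_{k,k-1} e_{k-1}^\top & \eta_{k,k} \end{bmatrix} = R_k , $$

where $R_k$ is right-triangular. We also update

$$ g_{k+1} = G_{k,k+1} \cdots G_{0,1} \begin{bmatrix} e_0 \\ 0 \end{bmatrix} = G_{k,k+1} \begin{bmatrix} g_k \\ 0 \end{bmatrix} . $$

Then

$$
\begin{aligned}
\|r_{k+1}\| &= \|A P_k y_k + r_0\| = \|P_k^\top (P_k H_k y_k + r_0)\| = \big\| H_k y_k - e_0 \|r_0\| \big\| \\
&= \big\| G_{k,k+1} \cdots G_{0,1} (H_k y_k - e_0 \|r_0\|) \big\| = \left\| \begin{bmatrix} R_k y_k \\ 0 \end{bmatrix} - g_{k+1} \|r_0\| \right\| .
\end{aligned}
$$

Thus $\|r_{k+1}\|$ is the absolute value of the last entry of $g_{k+1}$ times $\|r_0\|$. This describes how to compute the residual norm in the Arnoldi algorithm.

If we solve the least squares problem 3.8-12 for $y_k$ by using the full $(k + 1) \times k$ matrix $H_k$, the resulting algorithm is the following [74].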


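GMRES Algorithm 3.8-14:

$$
\begin{array}{l}
r_0 = A x_0 - b \\
p_0 = -r_0 / \|r_0\|_2 \\
g_0 = e_0 \\
\text{for } k = 0, 1, \ldots \text{ until convergence} \\
\quad p_{k+1} = A p_k \\
\quad \text{for } 0 \le j \le k \\
\quad\quad \eta_{j,k} = p_j^\top p_{k+1} \\
\quad\quad p_{k+1} \mathrel{-}= p_j \eta_{j,k} \\
\quad \eta_{k+1,k} = \|p_{k+1}\|_2 \\
\quad p_{k+1} = p_{k+1} / \eta_{k+1,k} \\
\quad \text{update the QR factorization of } H_k \\
\quad \text{update the reflection } g_k \text{ of } e_0 \\
\quad \rho_{k+1} = |e_k^\top g_k| \, \|r_0\|_2 \\
\quad \text{if } \rho_{k+1} \le \varepsilon \text{ break} \\
\text{back-solve } R_k y_k = g_k \text{ where } g_k \text{ is defined by (3.8-13)} \\
x_{k+1} = x_0 + P_k y_k \|r_0\|_2
\end{array}
$$

The sign in the back-solve matches the choice $p_0 = -r_0 / \|r_0\|_2$, for which the least squares problem is to minimize $\|H_k y_k - e_0 \|r_0\|\,\|$.

The program dgmres to implement preconditioned GMRES can be found at Program 3.8-15: Preconditioned GMRES or locally at Program 3.8-16: dgmres.f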
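The pieces of the algorithm (Arnoldi step, incremental QR update, residual norm from the last entry of $g$) can be put together in a short unpreconditioned sketch. This is an illustration only, not the dgmres program: it uses plane rotations rather than reflectors to update the QR factorization, performs no restarts, and assumes $A$ is nonsingular. The function name `gmres_sketch` and its arguments are hypothetical.

```python
import numpy as np

def gmres_sketch(A, b, x0, tol=1e-10, max_iter=None):
    """Unrestarted GMRES sketch with residual convention r = A x - b."""
    n = len(b)
    m = n if max_iter is None else max_iter
    r0 = A @ x0 - b
    beta = np.linalg.norm(r0)
    if beta <= tol:
        return np.array(x0, dtype=float), 0
    P = np.zeros((n, m + 1))
    R = np.zeros((m + 1, m))          # Hessenberg columns, rotated into R
    cs, sn = np.zeros(m), np.zeros(m)
    g = np.zeros(m + 1)
    g[0] = 1.0                        # g_0 = e_0
    P[:, 0] = -r0 / beta              # p_0 = -r_0 / ||r_0||
    used = 0
    for k in range(m):
        w = A @ P[:, k]
        for j in range(k + 1):        # modified Gram-Schmidt
            R[j, k] = P[:, j] @ w
            w -= R[j, k] * P[:, j]
        eta = np.linalg.norm(w)
        R[k + 1, k] = eta
        if eta > 0.0:
            P[:, k + 1] = w / eta
        for j in range(k):            # apply earlier rotations to new column
            t = cs[j] * R[j, k] + sn[j] * R[j + 1, k]
            R[j + 1, k] = -sn[j] * R[j, k] + cs[j] * R[j + 1, k]
            R[j, k] = t
        h = np.hypot(R[k, k], R[k + 1, k])
        cs[k], sn[k] = R[k, k] / h, R[k + 1, k] / h
        R[k, k], R[k + 1, k] = h, 0.0  # new rotation zeroes the subdiagonal
        g[k + 1] = -sn[k] * g[k]
        g[k] = cs[k] * g[k]
        used = k + 1
        if abs(g[k + 1]) * beta <= tol or eta == 0.0:
            break                     # residual norm is |last entry of g| * beta
    y = np.linalg.solve(np.triu(R[:used, :used]), g[:used]) * beta
    return x0 + P[:, :used] @ y, used
```

Note that the residual norm is monitored without ever forming $y_k$, $z_k$ or $r_k$ inside the loop; the back-solve happens once, after the termination test succeeds.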

Exercises 3.8

1. Suppose that we want to solve $Ax = b$ where $A$ is nonsingular. Show that we can transform this to a system $\tilde{A} x = \tilde{b}$, where for $\nu > 0$ sufficiently large, $\tilde{A} = A + \nu A^\top A$ is positive definite.

2. Program the orthomin iteration for the two-dimensional heat equation on a unit square, with preconditioning by the identity matrix. Plot the number of iterations needed to reach convergence from $x_0 = 0$ for various values of the timestep and number of grid blocks.

3. Suppose that we choose $\beta_{k-1} = 0$, instead of the choice in lemma ??.

(a) Show that the matrix $[p_i^\top A p_j]$ is upper Hessenberg.

(b) Show that $\forall\, 0 \le j \le k - 2$, $p_j^\top M p_k = 0$.

(c) Describe the orthogonal projection algorithm in this case.

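4. Consider the GCR algorithm [38] for solving $Ax = b$:

$$
\begin{array}{l}
r_0 = A x_0 - b \\
p_0 = -r_0 \\
\text{for } k = 0, 1, \ldots \text{ until convergence} \\
\quad \alpha_k = -r_k^\top A p_k / (A p_k)^\top (A p_k) \\
\quad x_{k+1} = x_k + p_k \alpha_k \\
\quad r_{k+1} = r_k + A p_k \alpha_k \\
\quad p_{k+1} = -r_{k+1} + \sum_{j=0}^{k} p_j \beta_{j,k} \text{ where } \beta_{j,k} \text{ is chosen so that } (A p_{k+1})^\top A (p_k) = 0
\end{array}
\qquad (3.8\text{-}14)
$$

(a) What is the objective function that this algorithm is trying to minimize?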

http://www.netlib.org/slatec/lin/


http://www.math.duke.edu/~johnt/math226/iterative_methods/dlap/dgmres.f


(b) What is the conjugacy condition for the search directions?

(c) Is this a minimum residual or an orthogonal error method?

(d) Show that if $A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$ and $b = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, then the algorithm will break down before finding the solution.

3.9 Nonlinear Systems

In some cases, we may want to solve a nonlinear parabolic partial differential equation. Implicit discretization of this equation will lead to a large nonlinear system of equations

$$ f(u) = 0 . $$

Often these systems are large and sparse (i.e., for each $i$, $e_i^\top f(u)$ depends on very few entries of $u$), so efficient numerical methods for these problems require special formulation.

3.9.1 Newton Algorithms

A natural approach to solving $f(u) = 0$ is to use the Newton iteration

$$ \text{solve } J_k d_k = -f(u_k) , \quad u_{k+1} = u_k + d_k , \quad\text{where}\quad J_k \equiv \frac{\partial f}{\partial u}(u_k) . $$

This requires that we solve a large system of linear equations at each iteration. A natural approach to this problem is to use an iterative method to solve $J_k d_k = -f(u_k)$. This means that we have two nested iterations, over Newton steps and over linear solver iterations, inside each timestep of the integration of the partial differential equation. In order for a second-order implicit method to be competitive with a first-order explicit method, the total number of iterations of both types will need to be at most $O(\Delta x)$.

When $f(u_k)$ is large, we are far from the solution of the nonlinear equations, and there is not much need to solve the linear system $J_k d_k = -f(u_k)$ to high accuracy. See [79] for additional details. However, it may be necessary to provide a global convergence strategy for Newton's method when $f(u_k)$ is large. In practice, the latter problem can be overcome by cautious selection of the timestep in the integration of the parabolic equation. We expect that as the timestep gets smaller, the previous solution (or some extrapolant of previous solutions) of the differential equation becomes a good initial guess for the zero of $f$, and the number of iterations needed to solve $f(u) = 0$ should decrease. For example, one could cut the timestep in half whenever the Newton iteration required more than 4 iterations for convergence. However, it is difficult to determine the relative efficiency of such a scheme compared to explicit integration.


Another difficulty with Newton's method is the need to evaluate the matrix of partial derivatives $J_k$ at each Newton iteration. If analytical values are not easily computed, these derivatives can be computed by finite differences. In order to be efficient, we need to avoid performing $O(n)$ finite difference calculations of the form $J_k e_j \approx (f(u_k + e_j \delta) - f(u_k)) / \delta$ for each column of $J_k$, for a total of $O(n^2)$ work to compute $J_k$. One approach is to use the width of the finite difference stencil in the discretization of the partial differential equation to perform several of these finite differences at once. For example, suppose that we are solving a one-dimensional problem with a stencil of width at most 3. In other words,

$$ \forall\, i \notin \{ j - 1, j, j + 1 \} , \quad e_i^\top f(u + e_j \delta) = e_i^\top f(u) . $$

This means that a perturbation in entry 1 of $u$ causes perturbations in at most entries 1 and 2 of $f$, and so on. As a result, we could perturb entries 1, 4, 7, and so on simultaneously and still compute the correct values for the derivatives in columns 1, 4, 7 and so on of $J_k$. In this case, we could compute all of the columns of $J_k$ by computing at most 4 values of $f$: one at $u$ and three at perturbed values of $u$.
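The grouping of columns described above can be sketched as follows. This is a minimal illustration that assumes a stencil of width at most 3 (so the Jacobian is tridiagonal), a dense output matrix for readability, and a fixed perturbation size; the function name `tridiagonal_fd_jacobian` and the default `delta` are hypothetical choices.

```python
import numpy as np

def tridiagonal_fd_jacobian(f, u, delta=1e-7):
    """Approximate a tridiagonal Jacobian with four evaluations of f.

    Columns j, j+3, j+6, ... are perturbed together; their nonzero rows
    (j-1, j, j+1) do not overlap, so the differences can be separated.
    """
    u = np.asarray(u, dtype=float)
    n = len(u)
    f0 = f(u)                                 # one evaluation at u
    J = np.zeros((n, n))
    for color in range(3):                    # three perturbed evaluations
        cols = np.arange(color, n, 3)
        up = u.copy()
        up[cols] += delta
        df = (f(up) - f0) / delta
        for j in cols:
            lo, hi = max(j - 1, 0), min(j + 2, n)
            J[lo:hi, j] = df[lo:hi]           # rows touched by column j only
    return J
```

For a sparse code one would store only the three diagonals; the dense matrix here is only for clarity.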

3.9.2 Nonlinear Krylov Algorithms

Let us consider a different approach for solving $f(u) = 0$. We will combine the Newton iteration and the linear solver. The approach is due to Brown and Saad [18, 19, 20].

3.9.2.1 Iterative Solution of the Newton Equations

Suppose that we have an initial guess $d_0$ for $d$. Our initial residual for the linear system is

$$ r_0 = f(u) + J d_0 . $$

We will seek approximations for $d$ of the form

$$ d_m = d_0 + z , \quad\text{where}\quad z \in K_m \equiv \langle r_0, J r_0, \ldots, J^{m-1} r_0 \rangle . $$

We could do this either by an Arnoldi iteration, or a GMRES iteration. With Arnoldi's method, we choose $z = z_m$ where (see algorithm 3.8-13)

$$ f(u) + J d_m = r_0 + J z_m \perp K_m , $$

while with GMRES we choose $z = z_m$ where (see algorithm 3.8-14)

$$ \|f(u) + J d_m\|_2 = \|r_0 + J z_m\|_2 = \min_{z \in K_m} \|r_0 + J z\|_2 . $$

3.9.2.2 Avoiding the Jacobian


One of the advantages of the generalized conjugate gradient methods is that we never need to form the Jacobian matrix $J$ explicitly. Rather, these algorithms only require that we evaluate $Jw$ for some vector $w$. This allows us to approximate

$$ J(u) w \approx [f(u + w \delta) - f(u)] / \delta , $$

where $\delta$ is some carefully chosen small scalar. In the discussion that follows, we will assume that $J(u) w$ is always computed in this way, with $\delta = \sqrt{\varepsilon} \, \|u\| / \|w\|$ and with $\varepsilon$ equal to the machine roundoff error. In particular, the Arnoldi algorithm or GMRES algorithm will be assumed to compute $Jw$ in this way.
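The finite difference approximation of $J(u)w$ might be coded as below. This is a sketch under stated assumptions: $\varepsilon$ is taken to be double precision machine epsilon, and the fallbacks used when $u = 0$ or $w = 0$ are choices made for this illustration, not part of the text. The function name is hypothetical.

```python
import numpy as np

def jacobian_vector_product(f, u, w, fu=None):
    """Approximate J(u) w by a forward difference, without forming J."""
    eps = np.finfo(float).eps                 # machine roundoff (assumed double)
    norm_w = np.linalg.norm(w)
    if norm_w == 0.0:
        return np.zeros_like(u, dtype=float)  # J(u) 0 = 0
    norm_u = np.linalg.norm(u)
    # delta = sqrt(eps) ||u|| / ||w||; use ||u|| = 1 if u = 0 (assumption)
    delta = np.sqrt(eps) * (norm_u if norm_u > 0.0 else 1.0) / norm_w
    if fu is None:
        fu = f(u)                             # reuse f(u) when the caller has it
    return (f(u + delta * w) - fu) / delta
```

Passing a precomputed `fu` means each product inside a Krylov iteration costs one new evaluation of $f$.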

Another issue is how long we should continue to build the Krylov expansion for the iterates, versus restarting the Arnoldi or GMRES portion of the iterative process at the current value of the increment $d_m$, or even restarting both the nonlinear iterative process and the generalized conjugate gradient process at the current iterate $u$. As $m$ becomes large, the storage and work for the generalized conjugate gradient iteration increases, in order to handle the upper Hessenberg system for the search vectors.

3.9.2.3 Descent Directions

We must also consider using a global convergence strategy for the iteration. First, let us consider developing a strategy for minimizing

$$ \phi(\sigma) = \tfrac12 \|f(u + p \sigma)\|_2^2 $$

where $p$ is some search direction. Recall that $p$ is a descent direction if and only if

$$ 0 > \frac{d\phi}{d\sigma}(0) = f(u)^\top \frac{\partial f}{\partial u}(u) \, p = f^\top J p . $$

Note that the generalized conjugate gradient methods generate approximations $d$ to the solution of $J d = -f(u)$. If $d$ approximates the solution to this equation with residual

$$ r = J d + f , $$

then

$$ f^\top J d = -f^\top (f - r) . $$

Thus the generalized conjugate gradient search direction $d$ will be a descent direction at Newton iterate $u$ whenever

$$ f^\top r < f^\top f . $$

However, if we start the Arnoldi process at $d = 0$, then we have the following result:


Lemma 3.9-1: [18, Theorem 3.5] Suppose that $J(u) = \frac{\partial f}{\partial u}(u)$ is nonsingular, that $f \ne 0$, and that

$$ d_m = -P_m H_m^{-1} P_m^\top f $$

is the search direction determined by the Arnoldi method with initial guess $d_0 = 0$. If $d_m$ exists, then it is a descent direction for $\phi(\sigma) = \tfrac12 \|f(u + p \sigma)\|^2$, and

$$ f^\top J d_m = -f^\top f . $$

Proof: Note that if $d_0 = 0$, then $r_0 = f$. Since the Arnoldi algorithm chooses $d_m$ so that $r_m = f - J P_m (P_m^\top J P_m)^{-1} P_m^\top f$ is orthogonal to $\langle f, J f, \ldots, J^{m-1} f \rangle$, we must have that

$$ \forall\, m \ge 1 , \quad 0 = f^\top r_m < f^\top f . $$

Since $J d_m = r_m - f$, it follows that $f^\top J d_m = -f^\top f < 0$. $\Box$

In passing, we would like to find a simple way to compute the directional derivative of

$$ \phi_m(\sigma) = \tfrac12 \|f(u + d_m \sigma)\|_2^2 $$

when the search directions are computed by the Arnoldi algorithm. Since we have shown that in the Arnoldi algorithm

$$ J d_m + f = r_m = p_m e_m^\top y_m , $$

it follows that

$$ \frac{d\phi_m}{d\sigma} = f^\top J d_m = f^\top p_m e_m^\top y_m - \|f\|^2 . $$

Next, let us consider search directions computed by the GMRES algorithm.

Lemma 3.9-2: Suppose that $J(u) = \frac{\partial f}{\partial u}(u)$ is nonsingular, and that

$$ d_m = -P_m H_m^{-1} P_m^\top f $$

is the search direction determined by the GMRES method with initial guess $d_0 = 0$. If $d_m$ exists, then it is a descent direction for $\phi(\sigma) = \tfrac12 \|f(u + p \sigma)\|^2$, and

$$ f^\top J d_m = -f^\top f + \|r_m\|^2 . $$

Proof: Since the GMRES algorithm chooses $d_m$ so that $r_m = f - J P_m (P_m^\top J P_m)^{-1} P_m^\top f$ minimizes $\|J P_m y + f\|_2$ over all vectors $y$, it follows that $r_m = J P_m y_m + f \perp \mathcal{R}(J P_m)$. Thus

$$ \forall\, m \ge 1 , \quad 0 = (J P_m)^\top r_m = P_m^\top J^\top (J d_m + f) . $$

Since $r_m - f = J d_m \in \mathcal{R}(J P_m)$,

$$ \forall\, m \ge 1 , \quad (r_m)^\top f = (r_m)^\top (r_m - J d_m) = \|r_m\|_2^2 . $$

Since $f \ne 0$, it follows that

$$ J d_m = J P_m y_m = -J P_m (P_m^\top J^\top J P_m)^{-1} P_m^\top J^\top f \ne 0 . $$

The Pythagorean theorem implies that

$$ \|r_m\|^2 = \|f\|^2 - \|J P_m y_m\|^2 < \|f\|^2 . $$

$\Box$

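We would also like to find a simple way to compute the directional derivative of

$$ \phi_m(\sigma) = \tfrac12 \|f(u + d_m \sigma)\|_2^2 $$

when the search directions are computed by the GMRES algorithm. Recall that we computed a sequence of plane rotations so that

$$ G_{m-1,m} \cdots G_{0,1} \begin{bmatrix} H_{m-1} \\ h_{m,m-1} e_{m-1}^\top \end{bmatrix} = \begin{bmatrix} R_m \\ 0 \end{bmatrix} $$

where $H_m = P_m^\top J P_m$. We also computed

$$ g_m = G_{m-1,m} \cdots G_{0,1} e_0 \|f\|_2 . $$

Then

$$
\begin{aligned}
r_m = J P_m y_m + f &= P_m (H_m y_m + P_m^\top P_m e_0 \|f\|_2) = P_m G_{0,1}^\top \cdots G_{m-1,m}^\top \left( \begin{bmatrix} R_m \\ 0 \end{bmatrix} y_m + g_m \right) \\
&= (P_m G_{0,1}^\top \cdots G_{m-1,m}^\top e_m)(e_m^\top g_m) .
\end{aligned}
$$

It follows that the directional derivative is

$$ \frac{d\phi_m}{d\sigma} = f^\top J d_m = f^\top (r_m - f) = (f^\top P_m G_{0,1}^\top \cdots G_{m-1,m}^\top e_m)(e_m^\top g_m) - \|f\|^2 . $$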


This lemma points out the fact that whenever GMRES generates a nonzero search direction $d_m$, it is a descent direction. The problem is that when $J$ is not positive definite, we may have to take $m$ large before we get a nonzero search direction. However, as $m$ gets large, the computational cost and storage required by the GMRES algorithm gets large.

Brown and Saad suggest the following strategy for solving $f(u) = 0$. First, they use either the Arnoldi algorithm or GMRES to generate a descent direction. Given a scalar $0 < \eta < 1$ and the step number $n$, they choose $m$ as small as possible so that

$$ \|J d_m + f\|_2 < \eta \|f\|_2 \quad\text{and}\quad \|r_m\|_2 < \eta_n \|f\|_2 $$

where $\eta_n < 1$ is a sequence decreasing to 0 as the Newton step number $n \to \infty$. Next, they perform an inexact line search in the search direction $d_m$. In other words, they replace $u$ by $u + d_m \lambda$ where $\lambda$ satisfies the Goldstein-Armijo conditions for a sufficient decrease

$$ \phi(\lambda) \le \phi(0) + \alpha \lambda \frac{d\phi}{d\sigma}(0) \qquad (3.9\text{-}1) $$

and a significant step size

$$ \phi(\lambda) \ge \phi(0) + \beta \lambda \frac{d\phi}{d\sigma}(0) \qquad (3.9\text{-}2) $$

for some scalars $0 < \alpha < \beta < 1$. Their algorithm for conducting this inexact line search is the following:

$$
\begin{array}{l}
\lambda = 1 \\
\text{while the Goldstein-Armijo conditions are not satisfied} \\
\quad \text{if } \lambda \le 1 \text{ and we do not have a sufficient decrease in } \phi \text{ then} \\
\quad\quad \text{let } \lambda^* \text{ minimize the quadratic interpolant to } \phi(0), \phi'(0) \text{ and } \phi(\lambda) \\
\quad\quad \lambda_{hi} = \lambda \\
\quad\quad \lambda = \max\{0.1, \lambda^*\} \\
\quad\quad \lambda_{lo} = \lambda \\
\quad \text{else if } \lambda \ge 1 \text{ and we have a sufficient decrease in } \phi \text{ then} \\
\quad\quad \lambda_{lo} = \lambda \\
\quad\quad \lambda^* = 2\lambda \\
\quad\quad \lambda_{hi} = \lambda \\
\quad \text{else } (\lambda < 1 \text{ or sufficient decrease or } \lambda > 1 \text{ and sufficient decrease}) \\
\quad\quad \text{use linear interpolation between } \lambda_{lo} \text{ and } \lambda_{hi} \text{ to find } \lambda \text{ satisfying the Goldstein-Armijo conditions}
\end{array}
\qquad (3.9\text{-}3)
$$

The algorithm can break down if J is singular; in such a case the root-finding algorithmneeds to be restarted with a different initial guess. Brown and Saad also discuss using amodel trust region approach for global convergence.
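To make the role of the sufficient decrease condition concrete, here is a deliberately simplified backtracking search. It is not the Brown and Saad algorithm: it enforces only the sufficient decrease condition (3.9-1), never lengthens the step beyond $\lambda = 1$, and safeguards the quadratic interpolant by keeping the new step in $[0.1\lambda, 0.5\lambda]$, which is a common convention assumed here rather than taken from the text. The function name and default parameters are hypothetical.

```python
import numpy as np

def backtracking_line_search(phi, dphi0, alpha=1e-4, lam_min=1e-10):
    """Return a step lam in (0, 1] with phi(lam) <= phi(0) + alpha*lam*dphi0.

    phi   : callable, phi(sigma) = 0.5 * ||f(u + d * sigma)||^2
    dphi0 : directional derivative phi'(0), assumed negative
    """
    phi0 = phi(0.0)
    lam = 1.0
    while lam > lam_min:
        phil = phi(lam)
        if phil <= phi0 + alpha * lam * dphi0:
            return lam                        # sufficient decrease achieved
        # minimizer of the quadratic matching phi(0), phi'(0) and phi(lam)
        lam_star = -dphi0 * lam ** 2 / (2.0 * (phil - phi0 - dphi0 * lam))
        # safeguard (assumed): shrink by a factor between 0.1 and 0.5
        lam = min(max(lam_star, 0.1 * lam), 0.5 * lam)
    raise RuntimeError("line search failed to find a sufficient decrease")
```

When the sufficient decrease test fails with $\phi'(0) < 0$, the denominator in `lam_star` is positive, so the interpolated step is well defined.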


The Krylov subspace iterative method for nonlinear equations appears to be available in PETSc. PETSc is available online at Program 3.9-3: Argonne Labs. This algorithm also appears to be part of DASSL, which is available from Program 3.9-4: netlib.

3.10 Algebraic Multigrid

In this section, we will return to iterative techniques for solving linear equations. Unlike our previous iterative methods, the algorithms in this section will take advantage of the structure of the underlying grids.

3.10.1 Algebraic Multigrid Algorithm

Suppose that we want to solve $A_f x_f = b_f$ on some fine grid discretization of a partial differential equation. We will assume that $A_f$ is symmetric and positive definite. If possible, we would like to approximate the solution of this equation by solving a related equation on a coarser grid, $A_c x_c = b_c$. Suppose that for each fine grid we are given a restriction $R$ that maps vectors in the range of $A_f$ to vectors in the domain of $A_c$, and a prolongation $P$ that maps vectors in the range of $A_c$ to vectors in the domain of $A_f$. We assume that $R = P^\top \beta$ for some scalar $\beta$ that depends on the grid and the number of dimensions. We assume that the coarse matrix is determined from the fine matrix by

$$ A_c = R A_f P . $$


http://netlib.org/ode/daskr.tgz


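Also suppose that on each grid we are given a smoother $S_f \approx A_f^{-1}$. The multigrid algorithm is typically begun on the finest level, and takes the following form. The left column lists the steps of the algorithm, and the right column records the resulting error.

$$
\begin{array}{ll}
\text{if there is no finer grid then } r_f^{(0)} = A_f x_f^{(0)} - b_f & = A_f (x_f^{(0)} - x_f) \\
\text{if there exists a coarser grid} & \\
\quad r_f = r_f^{(0)} & \\
\quad d_f = S_f r_f \implies x_f^{(1)} \equiv x_f^{(0)} - d_f^{(1)} & \implies x_f^{(1)} - x_f = (I - S_f A_f)(x_f^{(0)} - x_f) \\
\quad r_f = r_f^{(0)} - A_f d_f \equiv r_f^{(1)} & = A_f (x_f^{(1)} - x_f) \\
\quad r_c^{(0)} = R r_f & = R A_f (x_f^{(1)} - x_f) \\
\quad \text{call coarser multigrid with initial residual } r_c^{(0)} & \implies d_c = V_c R A_f (x_f^{(1)} - x_f) \\
\quad d_f \mathrel{+}= P d_c , \quad x_f^{(2)} \equiv x_f^{(1)} - d_f^{(2)} & \implies x_f^{(2)} - x_f = (I - P V_c R A_f)(x_f^{(1)} - x_f) \\
\quad r_f = r_f^{(0)} - A_f d_f \equiv r_f^{(2)} & = A_f (x_f^{(2)} - x_f) \\
\quad d_f \mathrel{+}= S_f^\top r_f \implies x_f^{(3)} = x_f^{(2)} - S_f^\top r_f^{(2)} & \implies x_f^{(3)} - x_f = (I - S_f^\top A_f)(x_f^{(2)} - x_f) \\
\text{else} & \\
\quad \text{solve the coarsest linear system} & \implies d_c = V_c R r_f^{(1)} \\
\text{if there is no finer level then } x_f \mathrel{-}= d_f &
\end{array}
\qquad (3.10\text{-}1)
$$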

We will discuss choices of the restriction, prolongation and smoother later in this section.
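The recursive structure of the algorithm can be sketched compactly. The code below is an illustration only: it assumes dense matrices, takes the pre-smoother to be one forward Gauss-Seidel sweep (so that $S_f = (D + L)^{-1}$ and $S_f^\top = (D + U)^{-1}$ for symmetric $A_f$), and represents the grid hierarchy as a hypothetical list `levels` of tuples `(A, P, R)` with `P` and `R` set to `None` on the coarsest level.

```python
import numpy as np

def pre_smooth(A, r):
    """Apply S = (D + L)^{-1}: one forward Gauss-Seidel sweep from zero."""
    return np.linalg.solve(np.tril(A), r)

def post_smooth(A, r):
    """Apply S^T = (D + U)^{-1}, the transpose of S when A is symmetric."""
    return np.linalg.solve(np.triu(A), r)

def v_cycle(levels, level, r):
    """Return the correction d for residual r = A x - b, so that x -= d."""
    A, P, R = levels[level]
    if P is None:
        return np.linalg.solve(A, r)          # coarsest level: V_c = A_c^{-1}
    d = pre_smooth(A, r)                      # d_f = S_f r_f
    rc = R @ (r - A @ d)                      # restrict the updated residual
    d = d + P @ v_cycle(levels, level + 1, rc)
    d = d + post_smooth(A, r - A @ d)         # d_f += S_f^T r_f
    return d
```

A solver would repeat `x -= v_cycle(levels, 0, A @ x - b)` until the residual is small, or use `v_cycle` as a preconditioner inside another iteration.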

3.10.2 Multigrid Error

Our description of the multigrid algorithm shows that the errors in the solution at various stages of the algorithm (in the right-hand column) are

$$
\begin{aligned}
x_f^{(1)} - x_f &= (I - S_f A_f)(x_f^{(0)} - x_f) \\
x_f^{(2)} - x_f &= (I - P V_c R A_f)(x_f^{(1)} - x_f) \\
x_f^{(3)} - x_f &= (I - S_f^\top A_f)(x_f^{(2)} - x_f)
\end{aligned}
$$

The matrix $V_c$ corresponds to the action of the multigrid V-cycle on coarser levels of the algorithm, which takes a residual and returns a correction to the solution. On the coarsest grid, we have

$$ V_c = A_c^{-1} . $$

It follows that the error at the end of the algorithm is related to the error at the beginning of the algorithm by

$$ x_f^{(3)} - x_f = (I - S_f^\top A_f)(I - P V_c R A_f)(I - S_f A_f)(x_f^{(0)} - x_f) \equiv V_f A_f (x_f^{(0)} - x_f) . $$
This leads to the following lemma.

Lemma 3.10-1: Consider the multigrid algorithm (3.10-1). Suppose that $A_f \in \mathbb{R}^{n_f \times n_f}$ is symmetric, $n_f = \rho n_c$ and $P \in \mathbb{R}^{n_f \times n_c}$. Given a constant $\beta > 0$ (possibly dependent on $n_f$), let $R = P^\top \beta$. Then $A_c = R A_f P \in \mathbb{R}^{n_c \times n_c}$ is symmetric. Further, if $A_f$ is positive-definite and $P$ has zero null space, then $A_c$ is positive-definite. Finally, if $V_c = A_c^{-1}$ on the coarsest level in the multigrid algorithm, and for each consecutive pair of levels we define $V_f \in \mathbb{R}^{n_f \times n_f}$ by

$$ V_f = (I - S_f^\top A_f)(I - P V_c R A_f)(I - S_f A_f) A_f^{-1} , $$

then $V_f$ is symmetric.

Proof: If $A_f$ is symmetric and $R = P^\top \beta$, then

$$ A_c^\top = (R A_f P)^\top = P^\top A_f R^\top = \frac{1}{\beta} R A_f P \beta = R A_f P = A_c , $$

so $A_c$ is symmetric.

Next, suppose that $A_f$ is symmetric positive-definite, $P$ has zero null space, $R = P^\top \beta$ with $\beta > 0$, and $A_c = R A_f P$. Then for any $x \in \mathbb{R}^{n_c}$ we have

$$ x^\top A_c x = x^\top R A_f P x = \beta (P x)^\top A_f (P x) \ge 0 . $$

Further, if $x^\top A_c x = 0$, then since $\beta > 0$ and $A_f$ is symmetric positive-definite,

$$ 0 = \beta (P x)^\top A_f (P x) \implies (P x)^\top A_f (P x) = 0 \implies P x = 0 . $$

Since $P$ has zero null space, this implies that $x = 0$. Thus $A_c$ is positive-definite.

Suppose that $A_f$ is symmetric positive definite on each level of the multigrid algorithm, and $V_c = A_c^{-1}$ on the coarsest level. Then $V_c$ is symmetric on the coarsest level. We will prove that $V_f$ is symmetric on finer levels by induction.

If $V_c$ is symmetric, then

$$
\begin{aligned}
V_f^\top &= \left[ (I - S_f^\top A_f)(I - P V_c R A_f)(I - S_f A_f) A_f^{-1} \right]^\top \\
&= A_f^{-1} (I - A_f S_f^\top)(I - A_f R^\top V_c P^\top)(I - A_f S_f) \\
&= (I - S_f^\top A_f) A_f^{-1} \Big( I - A_f \beta P V_c R \frac{1}{\beta} \Big)(I - A_f S_f) \\
&= (I - S_f^\top A_f)(I - P V_c R A_f) A_f^{-1} (I - A_f S_f) \\
&= (I - S_f^\top A_f)(I - P V_c R A_f)(I - S_f A_f) A_f^{-1} = V_f .
\end{aligned}
$$

$\Box$

Note that one important consequence of this lemma is that the multigrid V-cycle can be used as a preconditioner in a conjugate gradient iteration.

3.10.3 Coarse Grid Correction

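In a two-level algorithm, we have $V = (R A_f P)^{-1}$. We will begin by studying how the coarse grid solve affects the error on the fine grid.

Lemma 3.10-2: [86, Lemma A.2.1] Suppose that the coarse grid matrix $R A_f P$ is nonsingular, and define

$$ K_f = I - P (R A_f P)^{-1} R A_f \in \mathbb{R}^{n_f \times n_f} . $$

Then $\mathcal{N}(K_f) = \mathcal{R}(P)$ and $\mathcal{R}(K_f) = \mathcal{N}(R A_f)$. Further, $K_f$ is the oblique projection onto $\mathcal{N}(R A_f)$ along $\mathcal{R}(P)$.

Proof: Note that since $R A_f P$ is nonsingular, $\mathcal{N}(P) = \{0\}$ and $\mathcal{N}(R^\top) = \{0\}$. In other words, $P$ and $R$ have full rank. Since

$$ K_f P = \left[ I - P (R A_f P)^{-1} R A_f \right] P = P - P (R A_f P)^{-1} R A_f P = P - P = 0 $$

we see that $\mathcal{R}(P) \subset \mathcal{N}(K_f)$. To see that $\mathcal{N}(K_f) \subset \mathcal{R}(P)$, we note that if $z \in \mathcal{N}(K_f)$ then

$$ 0 = K_f z = z - P (R A_f P)^{-1} R A_f z \implies z = P (R A_f P)^{-1} R A_f z $$

so $z \in \mathcal{R}(P)$. This proves that $\mathcal{N}(K_f) = \mathcal{R}(P)$.

Since

$$ R A_f K_f = R A_f - R A_f P (R A_f P)^{-1} R A_f = R A_f - R A_f = 0 $$

we see that $\mathcal{R}(K_f) \subset \mathcal{N}(R A_f)$. Next, we will show that $\mathcal{N}(R A_f) \subset \mathcal{R}(K_f)$. Note that if $z \in \mathcal{N}(R A_f)$, then

$$ K_f z = \left[ I - P (R A_f P)^{-1} R A_f \right] z = z $$

so $z \in \mathcal{R}(K_f)$. This proves that $\mathcal{R}(K_f) = \mathcal{N}(R A_f)$.

To see that $K_f$ is a projection, we compute

$$
\begin{aligned}
K_f K_f &= \left[ I - P (R A_f P)^{-1} R A_f \right] \left[ I - P (R A_f P)^{-1} R A_f \right] \\
&= I - P (R A_f P)^{-1} R A_f - P (R A_f P)^{-1} R A_f + P (R A_f P)^{-1} R A_f P (R A_f P)^{-1} R A_f \\
&= I - P (R A_f P)^{-1} R A_f - P (R A_f P)^{-1} R A_f + P (R A_f P)^{-1} R A_f \\
&= I - P (R A_f P)^{-1} R A_f = K_f .
\end{aligned}
$$

$\Box$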

Lemma 3.10-3: ([86, Corollary A.2.1]) Suppose that $A_f$ is symmetric positive-definite, that $R = P^\top \beta$ for some scalar $\beta > 0$, and that $R A_f P$ is nonsingular. Then $K_f \equiv I - P (R A_f P)^{-1} R A_f$ is symmetric with respect to the inner product $(u, v)_{A_f} = v^\top A_f u$ generated by $A_f$. Further, $\mathcal{N}(R A_f) = \mathcal{R}(K_f) \perp_{A_f} \mathcal{N}(K_f) = \mathcal{R}(P)$. In fact, $K_f$ is an orthogonal projection with respect to this inner product. In addition,

$$ \|K_f\|_{A_f}^2 \equiv \sup_{x \ne 0} \frac{\|K_f x\|_{A_f}^2}{\|x\|_{A_f}^2} = 1 $$

and

$$ \forall\, x \in \mathbb{R}^{n_f} , \quad \|K_f x\|_{A_f} = \min_{y_c} \|x - P y_c\|_{A_f} . $$

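Proof: For completeness, we first show that $(u, v)_{A_f} \equiv v^\top A_f u$ generates an inner product. Obviously, $(u, v)_{A_f}$ is a real scalar. Since $A_f$ is symmetric,

$$ (u, v)_{A_f} = v^\top A_f u = u^\top A_f v = (v, u)_{A_f} . $$

It is also obvious that $(u, v)_{A_f}$ is linear in each of its arguments. Further, $(u, u)_{A_f} = u^\top A_f u \ge 0$ and $(u, u)_{A_f} = 0$ implies that $u = 0$, since $A_f$ is positive definite. Thus $(u, v)_{A_f}$ is an inner product.

It is easy to see that $K_f$ is symmetric with respect to this inner product. For all $u$ and $v$,

$$
\begin{aligned}
(K_f u, v)_{A_f} &= v^\top A_f K_f u = v^\top A_f \left[ I - P (R A_f P)^{-1} R A_f \right] u \\
&= v^\top A_f u - v^\top A_f P (R A_f P)^{-1} R A_f u \\
&= v^\top A_f u - v^\top A_f^\top P (P^\top A_f^\top R^\top)^{-\top} R A_f u \\
&= v^\top A_f u - v^\top A_f^\top R^\top \frac{1}{\beta} \Big( R \frac{1}{\beta} A_f P \beta \Big)^{-\top} P^\top \beta A_f u \\
&= v^\top A_f u - v^\top A_f^\top R^\top (R A_f P)^{-\top} P^\top A_f u \\
&= v^\top \left[ I - P (R A_f P)^{-1} R A_f \right]^\top A_f u = (u, K_f v)_{A_f} .
\end{aligned}
$$

Next, we show that $\mathcal{N}(R A_f) \perp_{A_f} \mathcal{R}(P)$. Suppose that $z_f \in \mathcal{N}(R A_f)$, which can be rewritten as $R A_f z_f = 0$. Then for any $P x_c \in \mathcal{R}(P)$,

$$ (z_f, P x_c)_{A_f} = (P x_c)^\top A_f z_f = \frac{1}{\beta} x_c^\top R A_f z_f = 0 . $$

Lemma 3.10-2 shows that $K_f$ is a projection, with $\mathcal{R}(K_f) = \mathcal{N}(R A_f)$ and $\mathcal{N}(K_f) = \mathcal{R}(P)$. Since $\mathcal{N}(R A_f) \perp_{A_f} \mathcal{R}(P)$, $K_f$ is an orthogonal projection with respect to the inner product generated by $A_f$.

We claim next that $\mathbb{R}^{n_f} = \mathcal{R}(K_f) \oplus \mathcal{R}(P)$. In other words, for any $x_f \in \mathbb{R}^{n_f}$, there exists $y_f \in \mathbb{R}^{n_f}$ and $x_c \in \mathbb{R}^{n_c}$ so that

$$ x_f = K_f y_f + P x_c . $$

In fact, we can take

$$ x_c = (R A_f P)^{-1} R A_f x_f \quad\text{and}\quad y_f = x_f - P (R A_f P)^{-1} R A_f x_f . $$

Then

$$ R A_f y_f = R A_f x_f - (R A_f P)(R A_f P)^{-1} R A_f x_f = 0 $$

and

$$
\begin{aligned}
K_f y_f + P x_c &= \left[ I - P (R A_f P)^{-1} R A_f \right] y_f + P x_c \\
&= x_f - P (R A_f P)^{-1} R A_f x_f + P (R A_f P)^{-1} R A_f x_f = x_f .
\end{aligned}
$$

Since $\mathbb{R}^{n_f} = \mathcal{R}(K_f) \oplus \mathcal{R}(P)$ and $\mathcal{R}(K_f) \perp_{A_f} \mathcal{R}(P)$, the Pythagorean theorem implies that for all $x_f \in \mathbb{R}^{n_f}$ and all $x_c \in \mathbb{R}^{n_c}$,

$$ \|K_f x_f + P x_c\|_{A_f}^2 = \|K_f x_f\|_{A_f}^2 + \|P x_c\|_{A_f}^2 . $$

Given any $x \in \mathbb{R}^{n_f}$, we can write $x = K_f x_f + P x_c$ and compute

$$ \|K_f x\|_{A_f}^2 = \|K_f x_f\|_{A_f}^2 \le \|K_f x_f\|_{A_f}^2 + \|P x_c\|_{A_f}^2 = \|x\|_{A_f}^2 . $$

Thus $\|K_f x\|_{A_f}^2 \le \|x\|_{A_f}^2$ for all $x$. For all $x \in \mathcal{R}(K_f)$ we have $\|K_f x\|_{A_f}^2 = \|x\|_{A_f}^2$. Thus $\|K_f\|_{A_f}^2 = 1$.

Finally, we will prove the last claim of the lemma. Given any $x \in \mathbb{R}^{n_f}$, we can write $x = K_f x_f + P x_c$. Then for any $y_c \in \mathbb{R}^{n_c}$,

$$
\begin{aligned}
\|K_f x\|_{A_f}^2 = \|K_f x_f\|_{A_f}^2 &\le \|K_f x_f\|_{A_f}^2 + \|P (x_c - y_c)\|_{A_f}^2 \\
&= \|x - P x_c + P (x_c - y_c)\|_{A_f}^2 = \|x - P y_c\|_{A_f}^2 .
\end{aligned}
$$

If $y_c = x_c$, then $\|K_f x\|_{A_f}^2 = \|x - P y_c\|_{A_f}^2$. $\Box$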

This lemma shows that the coarse grid correction $K_f$ minimizes the natural $A_f$ norm of the error in the multigrid iteration, with respect to all variations produced by the prolongation. As a result, if the smoother satisfies $\|I - S_f A_f\|_{A_f} \le 1$ and $\|I - A_f S_f\|_{A_f} \le 1$, then the two-level algebraic multigrid iteration can never diverge.
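The properties of the coarse grid correction can be checked numerically on a small example. The sketch below assumes a one-dimensional model matrix and linear interpolation for $P$; the helper names `linear_interpolation`, `coarse_grid_projector` and `a_norm` are hypothetical and exist only for this illustration.

```python
import numpy as np

def linear_interpolation(nc):
    """Prolongation from nc coarse points to 2*nc + 1 fine points (assumed)."""
    P = np.zeros((2 * nc + 1, nc))
    for I in range(nc):
        P[2 * I, I] = 0.5
        P[2 * I + 1, I] = 1.0
        P[2 * I + 2, I] = 0.5
    return P

def coarse_grid_projector(A, P, beta=0.5):
    """K_f = I - P (R A P)^{-1} R A with R = beta * P^T."""
    R = beta * P.T
    return np.eye(A.shape[0]) - P @ np.linalg.solve(R @ A @ P, R @ A)

def a_norm(A, x):
    """Norm generated by the symmetric positive-definite matrix A."""
    return np.sqrt(x @ A @ x)

nf = 7
A = 2.0 * np.eye(nf) - np.diag(np.ones(nf - 1), 1) - np.diag(np.ones(nf - 1), -1)
P = linear_interpolation(3)
K = coarse_grid_projector(A, P)

x = np.linspace(-1.0, 2.0, nf) ** 2
# best approximation of x from the range of P in the A-norm
y = np.linalg.solve(P.T @ A @ P, P.T @ A @ x)
projection_error = abs(a_norm(A, K @ x) - a_norm(A, x - P @ y))
```

Here `K @ K` agrees with `K` and `A @ K` is symmetric up to roundoff, which is the matrix form of the statement that $K_f$ is an orthogonal projection in the $A_f$ inner product; the value of $\beta$ cancels in $K_f$.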

3.10.4 Pre Smoothing

Lemma 3.10-4: ([86, Corollary A.5.1]) Suppose that $A_f \in \mathbb{R}^{n_f \times n_f}$ is symmetric positive-definite with diagonal part $D$, that $n_f > n_c$, that $P \in \mathbb{R}^{n_f \times n_c}$ and $R \in \mathbb{R}^{n_c \times n_f}$ satisfy $R = P^\top \beta$ for some scalar $\beta > 0$, and that $R A_f P$ is nonsingular. Let $K_f \equiv I - P (R A_f P)^{-1} R A_f$. Then

$$ \|K_f (I - A_f S_f)\|_{A_f} \le \eta $$

if and only if

$$ \forall\, x_f \;\; \exists\, y_c \quad \|(I - A_f S_f) x_f - P y_c\|_{A_f} \le \eta \|x_f\|_{A_f} . $$

Proof: Recall from lemma 3.10-3 that $\forall\, x_f \in \mathbb{R}^{n_f}$, $\|K_f x_f\|_{A_f} = \min_{y_c} \|x_f - P y_c\|_{A_f}$. Suppose that $\|K_f (I - A_f S_f)\|_{A_f} \le \eta$. Then

$$ \min_{y_c} \|(I - A_f S_f) x_f - P y_c\|_{A_f} = \|K_f (I - A_f S_f) x_f\|_{A_f} \le \eta \|x_f\|_{A_f} . $$

To prove the converse, suppose that for all $x_f$ there exists $x_c$ so that

$$ \|(I - A_f S_f) x_f - P x_c\|_{A_f} \le \eta \|x_f\|_{A_f} . $$

Then

$$ \eta \|x_f\|_{A_f} \ge \|(I - A_f S_f) x_f - P x_c\|_{A_f} \ge \min_{y_c} \|(I - A_f S_f) x_f - P y_c\|_{A_f} = \|K_f (I - A_f S_f) x_f\|_{A_f} . $$

$\Box$

This lemma shows that the speed of convergence depends on the cooperation of smoothing and prolongation. What is really important to the performance of the algorithm is how close the range of the prolongation is to the range of $I - A_f S_f$.

One successful approach is to use Gauss-Seidel iteration as a smoother, in the following way. In the pre-smoother, we first treat the coarse points with Gauss-Seidel improvement, then the fine grid points. In the post-smoother, we treat the points in the reverse order.

3.10.5 Prolongation and Restriction

Suppose that $A$ is the matrix associated with some level of the multigrid algorithm, that the diagonal entries of $A$ are all positive and the off-diagonal entries of $A$ are all non-positive. For example, $A$ could be an M-matrix. Subdivide the indices into disjoint sets $C$ and $F$, where $C$ corresponds to points shared with the coarse grid. For each $i \in F$ let $N_i^- = \{ j \ne i : A_{ij} < 0 \}$ be the set of stencil neighbors corresponding to negative off-diagonal entries in the $i$th row of $A$. For some given $\tau \ge 1$, suppose that for all $i \in F$ with $N_i^- \ne \emptyset$ we choose $\Pi_i \subset C \cap N_i^-$ such that

$$ \sum_{j \in N_i^-} |A_{ij}| \le \tau \sum_{j \in \Pi_i} |A_{ij}| . \qquad (3.10\text{-}2) $$

For problems in 1D, it is common to choose $\Pi_i = N_i^-$. However, for problems in 2D for which the coarsened multigrid matrix has a different sparsity pattern than the finest matrix, it may be convenient to choose $\Pi_i$ to be the same set of neighbors as on the finest grid.

For $i \in C$, define the prolongation operator to copy the coarse value to the same fine grid location. For $i \in F$ with $N_i^- \ne \emptyset$, define the prolongation operator $P$ by

$$ (P x)_i = - \frac{\sum_{j \in N_i^-} A_{ij}}{\sum_{j \in \Pi_i} A_{ij}} \; \frac{\sum_{j \in \Pi_i} A_{ij} (P x)_j}{A_{ii}} . $$

After these prolongation steps have been performed, consider $i \in F$ such that $N_i^- \cap C = \emptyset$. If there is a subset $\Pi_i^f \subset N_i^-$ such that (3.10-2) is satisfied, define

$$ (P x)_i = - \frac{\sum_{j \in N_i^-} A_{ij}}{\sum_{j \in \Pi_i^f} A_{ij}} \; \frac{\sum_{j \in \Pi_i^f} A_{ij} (P x)_j}{A_{ii}} . $$

We assume that this process of indirect interpolation can be continued until the prolongation is defined at all $i \in F$.

More general prolongation cases are considered in section A.4.2 of [86]. This includes a discussion of how to handle matrices that are not M-matrices.

3.10.6 Steady-State Diffusion in One Dimension

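Suppose we use finite differences to approximate the solution of the steady state heat equation in one dimension,

$$ - \frac{d}{dx} \Big( \delta(x) \frac{du}{dx} \Big) = f , \quad 0 < x < 1 , \qquad u(0) = 1 , \quad u(1) = 0 . $$

We will use centered differences:

$$ - \Big( \delta_{i+1/2} \frac{u_{i+1} - u_i}{\Delta x} - \delta_{i-1/2} \frac{u_i - u_{i-1}}{\Delta x} \Big) \frac{1}{\Delta x} = f_i , \qquad u_0 = 1 , \quad u_n = 0 . $$

Suppose that our refinement ratio is $\rho = 2$, and $n = 2N$ is even. In the associated linear system, the matrix entries are

$$ A_{ij} = \begin{cases} -\delta_{i-1/2} / \Delta x^2 , & j = i - 1 \\ (\delta_{i-1/2} + \delta_{i+1/2}) / \Delta x^2 , & j = i \\ -\delta_{i+1/2} / \Delta x^2 , & j = i + 1 \end{cases} $$

In order to determine the prolongation, we will follow the ideas in section 3.10.5. We will subdivide the indices for the unknowns into

$$ C = \{ 2I : 1 \le I < N \} \quad\text{and}\quad F = \{ 2I + 1 : 0 \le I < N \} . $$

Note that the finite difference scheme implies that for $i \in F$

$$ N_i^- = \begin{cases} \{ 2 \} , & i = 1 \\ \{ i - 1, i + 1 \} , & i = 2I + 1 \text{ and } 0 < I < N - 1 \\ \{ n - 2 \} , & i = n - 1 \end{cases} $$

Note that for all $i \in F$, $N_i^- \cap C = N_i^-$, so we take $\Pi_i = N_i^-$. Thus for $i \in F$ we define the prolongation weights by

$$ w_{ij} = - \frac{A_{ij}}{A_{ii}} = \begin{cases} \dfrac{\delta_{i-1/2}}{\delta_{i-1/2} + \delta_{i+1/2}} , & j = i - 1 \\[2ex] \dfrac{\delta_{i+1/2}}{\delta_{i-1/2} + \delta_{i+1/2}} , & j = i + 1 \end{cases} $$

For constant diffusion and a uniform grid, we have $w_{i,i\pm1} = \tfrac12$.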

It follows that the prolongation operator P is defined by

e>i Py =

w1,2y1, i = 1

w2I+1,2IyI + w2I+1,2I+2yI+1, i = 2I + 1 and 1 ≤ I < Nw2N+1,2NyN , i = 2N + 1

yI , i = 2I and 1 ≤ I ≤ N

For constant diffusion coefficient $\delta$ and a uniform grid, this corresponds to performing linear interpolation of the coarse data to obtain the fine data. Note that if $y = e$ is the vector of ones, then for $1 \le I < N$ we have

\[
e_{2I+1}^\top P e = w_{2I+1,2I} + w_{2I+1,2I+2}
= \frac{\delta_{2I+1/2}}{\delta_{2I+1/2} + \delta_{2I+3/2}} + \frac{\delta_{2I+3/2}}{\delta_{2I+1/2} + \delta_{2I+3/2}} = 1 ,
\]

and $e_{2I}^\top P e = 1$.

Thus the rows of the prolongation matrix sum to one, except for the first and last rows. In other words, the prolongation of a coarse constant grid function is that fine constant, except at the nodes next to the boundary. In order to determine the coarse matrix, it will be useful to note that the $I$th column of the prolongation matrix is

\[
P e_I = e_{2I} + \sum_{i : 2I \in \Pi_i} e_i w_{i,2I} = e_{2I-1} w_{2I-1,2I} + e_{2I} + e_{2I+1} w_{2I+1,2I} , \quad 1 \le I \le N .
\]

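The restriction operator is the adjoint of the prolongation operator in the following sense:

\[
\forall x \in \mathbb{R}^{n_f} \; \forall y \in \mathbb{R}^{n_c} , \quad \sum_I (Rx)_I y_I \equiv (Rx, y)_c = \beta (x, Py)_f \equiv \beta \sum_i x_i (Py)_i .
\]

This implies that

\[
e_I^\top R x = (Rx, e_I)_c = \beta (x, P e_I)_f = \beta x^\top P e_I = \beta \left( x_{2I-1} w_{2I-1,2I} + x_{2I} + x_{2I+1} w_{2I+1,2I} \right) .
\]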

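This formula is useful for programming the restriction. This formula can be reinterpreted as

\[
R e_i = \frac{1}{2} \begin{cases}
w_{1,2} \, e_1 , & i = 1 \\
w_{2I+1,2I} \, e_I + w_{2I+1,2I+2} \, e_{I+1} , & i = 2I+1 \mbox{ and } 1 \le I < N-1 \\
w_{2N-1,2N-2} \, e_{N-1} , & i = 2N-1 \\
e_I , & i = 2I \mbox{ and } 1 \le I < N
\end{cases}
\]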


So that the restriction is an average on a uniform mesh with constant diffusion, we choose $\beta = 1/2$. The choice of $\beta$ has no effect on the convergence of the multigrid algorithm, so long as it is positive. However, when multigrid is used in combination with domain decomposition for adaptive mesh refinement, the scaling of the coarse grid equations in multigrid can be crucial to the performance of the linear solver.

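We can compute the coarse grid matrix $RAP$ columnwise:

\[
RAPe_I = RA \left( e_{2I-1} w_{2I-1,2I} + e_{2I} + e_{2I+1} w_{2I+1,2I} \right) .
\]

If $1 < I < N$, we have

\begin{align*}
RAPe_I &= R \left[ w_{2I-1,2I} \left( A_{2I-2,2I-1} e_{2I-2} + A_{2I-1,2I-1} e_{2I-1} + A_{2I,2I-1} e_{2I} \right) \right. \\
&\qquad + \left( A_{2I-1,2I} e_{2I-1} + A_{2I,2I} e_{2I} + A_{2I+1,2I} e_{2I+1} \right) \\
&\qquad \left. + \, w_{2I+1,2I} \left( A_{2I,2I+1} e_{2I} + A_{2I+1,2I+1} e_{2I+1} + A_{2I+2,2I+1} e_{2I+2} \right) \right] \\
&= R \left[ w_{2I-1,2I} A_{2I-2,2I-1} e_{2I-2} + \left( w_{2I-1,2I} A_{2I-1,2I-1} + A_{2I-1,2I} \right) e_{2I-1} \right. \\
&\qquad + \left( w_{2I-1,2I} A_{2I,2I-1} + A_{2I,2I} + w_{2I+1,2I} A_{2I,2I+1} \right) e_{2I} \\
&\qquad \left. + \left( A_{2I+1,2I} + w_{2I+1,2I} A_{2I+1,2I+1} \right) e_{2I+1} + w_{2I+1,2I} A_{2I+2,2I+1} e_{2I+2} \right] \\
&= R \left[ w_{2I-1,2I} A_{2I-2,2I-1} e_{2I-2} + \left( w_{2I-1,2I} A_{2I,2I-1} + A_{2I,2I} + w_{2I+1,2I} A_{2I,2I+1} \right) e_{2I} + w_{2I+1,2I} A_{2I+2,2I+1} e_{2I+2} \right] \\
&= \frac{1}{2} \left[ w_{2I-1,2I} A_{2I-2,2I-1} e_{I-1} + \left( w_{2I-1,2I} A_{2I,2I-1} + A_{2I,2I} + w_{2I+1,2I} A_{2I,2I+1} \right) e_I + w_{2I+1,2I} A_{2I+2,2I+1} e_{I+1} \right] .
\end{align*}

The coefficients of $e_{2I-1}$ and $e_{2I+1}$ vanish because $w_{ij} = -A_{ij}/A_{ii}$. Note that the choice of the prolongation weights $w_{ij}$ caused the evaluation of $APe_I$ to involve values at coarse grid nodes ($e_{2I-2}$, $e_{2I}$ and $e_{2I+2}$) only.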

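Note that the $(I-1, I)$ entry of the coarse matrix is

\[
\frac{1}{2} w_{2I-1,2I} A_{2I-2,2I-1}
= - \frac{\delta_{2I-1/2}}{\delta_{2I-3/2} + \delta_{2I-1/2}} \, \frac{\delta_{2I-3/2}}{2 \Delta x^2}
= - \frac{1}{(2 \Delta x)^2} \, \frac{2}{1/\delta_{2I-1/2} + 1/\delta_{2I-3/2}} .
\]

Similarly, the $(I, I)$ entry of the coarse matrix is

\begin{align*}
\frac{1}{2} & \left( w_{2I-1,2I} A_{2I,2I-1} + A_{2I,2I} + w_{2I+1,2I} A_{2I,2I+1} \right) \\
&= \frac{1}{2} \left\{ - \frac{\delta_{2I-1/2}}{\delta_{2I-3/2} + \delta_{2I-1/2}} \, \frac{\delta_{2I-1/2}}{\Delta x^2} + \frac{\delta_{2I-1/2} + \delta_{2I+1/2}}{\Delta x^2} - \frac{\delta_{2I+1/2}}{\delta_{2I+3/2} + \delta_{2I+1/2}} \, \frac{\delta_{2I+1/2}}{\Delta x^2} \right\} \\
&= \frac{1}{2 \Delta x^2} \left\{ \frac{\delta_{2I-1/2}}{\delta_{2I-3/2} + \delta_{2I-1/2}} \left[ -\delta_{2I-1/2} + \delta_{2I-3/2} + \delta_{2I-1/2} \right] + \frac{\delta_{2I+1/2}}{\delta_{2I+3/2} + \delta_{2I+1/2}} \left[ -\delta_{2I+1/2} + \delta_{2I+3/2} + \delta_{2I+1/2} \right] \right\} \\
&= \frac{1}{(2 \Delta x)^2} \left\{ \frac{2}{1/\delta_{2I-1/2} + 1/\delta_{2I-3/2}} + \frac{2}{1/\delta_{2I+1/2} + 1/\delta_{2I+3/2}} \right\} .
\end{align*}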


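In general, the coarse matrix will correspond to a similar finite difference discretization of the steady-state heat equation, with coarse diffusion coefficients chosen to be harmonic averages of fine coefficients.

On a uniform mesh with constant diffusion coefficient, the diagonal entries of the coarse matrix are

\begin{align*}
e_I^\top (RAP) e_I &= \frac{1}{2} \left( w_{2I-1,2I} A_{2I,2I-1} + A_{2I,2I} + w_{2I+1,2I} A_{2I,2I+1} \right) \\
&= \frac{1}{2} \left[ \frac{1}{2} \left( - \frac{\delta}{\Delta x^2} \right) + \frac{2 \delta}{\Delta x^2} + \frac{1}{2} \left( - \frac{\delta}{\Delta x^2} \right) \right] = \frac{2 \delta}{(2 \Delta x)^2}
\end{align*}

and the off-diagonal entries of the coarse matrix are

\[
e_{I+1}^\top (RAP) e_I = \frac{1}{2} w_{2I+1,2I} A_{2I+2,2I+1} = \frac{1}{4} \left( - \frac{\delta}{\Delta x^2} \right) = - \frac{\delta}{(2 \Delta x)^2} .
\]

Thus the multigrid coarsening of the fine matrix produces the same matrix on the coarse grid as the difference scheme.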
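The harmonic-average result can be checked numerically. The following is a minimal sketch, not the book's Multigrid.C: it assumes homogeneous Dirichlet data (so only the unknowns $u_1, \ldots, u_{n-1}$ appear), stores $\delta_{i+1/2}$ in a hypothetical list `delta[i]`, and uses exact rational arithmetic so that the comparison is free of rounding. The function names are illustrative only.

```python
from fractions import Fraction


def fine_matrix(delta, dx):
    """Tridiagonal matrix for unknowns 1..n-1; delta[i] holds delta_{i+1/2}."""
    n = len(delta)
    A = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
    for i in range(1, n):                      # unknown i is stored in row i-1
        A[i - 1][i - 1] = (delta[i - 1] + delta[i]) / dx**2
        if i > 1:
            A[i - 1][i - 2] = -delta[i - 1] / dx**2
        if i < n - 1:
            A[i - 1][i] = -delta[i] / dx**2
    return A


def prolongation(delta):
    """Algebraic multigrid prolongation with weights w_ij = -A_ij / A_ii."""
    n = len(delta)
    N = n // 2
    P = [[Fraction(0)] * (N - 1) for _ in range(n - 1)]
    for I in range(1, N):                      # coarse point I sits at fine index 2I
        P[2 * I - 1][I - 1] = Fraction(1)
    for i in range(1, n, 2):                   # odd fine indices are F-points
        s = delta[i - 1] + delta[i]
        if i - 1 >= 2:
            P[i - 1][(i - 1) // 2 - 1] = delta[i - 1] / s
        if i + 1 <= n - 2:
            P[i - 1][(i + 1) // 2 - 1] = delta[i] / s
    return P


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def galerkin_coarse(delta, dx):
    """Coarse matrix R A P with R = P^T / 2."""
    A, P = fine_matrix(delta, dx), prolongation(delta)
    Pt = [list(col) for col in zip(*P)]
    return [[v / 2 for v in row] for row in matmul(Pt, matmul(A, P))]


def harmonic_coarse(delta, dx):
    """Same difference scheme on the coarse grid with harmonic-average coefficients."""
    N = len(delta) // 2
    dc = [2 / (1 / delta[2 * J] + 1 / delta[2 * J + 1]) for J in range(N)]
    return fine_matrix(dc, 2 * dx)
```

For any positive coefficients, `galerkin_coarse(delta, dx)` and `harmonic_coarse(delta, dx)` should agree entry by entry, including the rows next to the boundary.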

A program to perform the algebraic multigrid algorithm for the steady-state heat equation in 1D is given in Program 3.10-5: Multigrid.C. This program uses a C++ class Level to contain the data arrays on the hierarchy of grids, and to perform the multigrid operations using recursion. In particular, note that the member function Level::setup shows how to compute the coarse grid matrix directly as $R A_f P$ in work proportional to the number of unknowns. The procedures Level::prolong and Level::restrict are chosen to be adjoints of each other, so that the overall multigrid cycle is symmetric. This allows the multigrid V-cycle to be used as a preconditioner for conjugate gradients.

In general, it is pretty cumbersome to compute the coarse matrix by hand. Instead, we can use the basic multigrid operations to compute this matrix. Note that for any coarse index $I$, the prolongation satisfies $e_i^\top P e_I = 0$ for all $|i - 2I| > 1$. Next, note that for any fine index $i$, $e_j^\top A e_i = 0$ for $|j - i| > 1$. Thus $e_j^\top A P e_I = 0$ for $|j - 2I| > 2$. Finally, the restriction shows that for any coarse indices $I$ and $J$, $e_J^\top RAP e_I = 0$ for $|J - I| > 1$. This means that the nonzero entries of $RAPe_I$ do not overlap the nonzero entries of $RAPe_K$ for $|K - I| \ge 3$. This suggests that we define a vector on the coarse level with ones in every third entry, apply the prolongation, followed by the fine matrix, followed by the restriction. The resulting vector contains the values for the columns of $RAP$ corresponding to the locations of the ones. In other words, we can compute the entries of $RAP$ in three such applications of prolongation, fine matrix multiply and restriction. Procedure Level::setup in file Multigrid.C illustrates this computational approach.
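The three-probe idea can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a transcription of Level::setup: the operators are matrix-free closures for the 1D steady-state problem of this section with zero boundary values, `delta[i]` is a hypothetical array holding $\delta_{i+1/2}$, and vectors carry padding entries at the boundary indices.

```python
from fractions import Fraction


def make_operators(delta, dx):
    """Matrix-free P, A and R = P^T / 2; vectors include zero boundary entries."""
    n = len(delta)
    N = n // 2

    def prolong(y):                            # y[I], I = 0..N, with y[0] = y[N] = 0
        x = [Fraction(0)] * (n + 1)
        for I in range(1, N):
            x[2 * I] = y[I]
        for i in range(1, n, 2):               # interpolate at odd fine indices
            x[i] = (delta[i - 1] * x[i - 1] + delta[i] * x[i + 1]) / (delta[i - 1] + delta[i])
        return x

    def apply_A(x):
        r = [Fraction(0)] * (n + 1)
        for i in range(1, n):
            r[i] = (-delta[i - 1] * x[i - 1] + (delta[i - 1] + delta[i]) * x[i]
                    - delta[i] * x[i + 1]) / dx**2
        return r

    def restrict(r):
        c = [Fraction(0)] * (N + 1)
        for I in range(1, N):
            wl = delta[2 * I - 1] / (delta[2 * I - 2] + delta[2 * I - 1])   # w_{2I-1,2I}
            wr = delta[2 * I] / (delta[2 * I] + delta[2 * I + 1])           # w_{2I+1,2I}
            c[I] = (wl * r[2 * I - 1] + r[2 * I] + wr * r[2 * I + 1]) / 2
        return c

    return N, prolong, apply_A, restrict


def coarse_matrix_by_probing(delta, dx):
    """Recover the tridiagonal R A P from three probe vectors."""
    N, prolong, apply_A, restrict = make_operators(delta, dx)
    entries = {}
    for shift in range(3):
        # ones at every third coarse index, zeros elsewhere
        y = [Fraction(1) if 1 <= I < N and I % 3 == shift else Fraction(0)
             for I in range(N + 1)]
        z = restrict(apply_A(prolong(y)))
        for I in range(1, N):
            if I % 3 == shift:                 # column I owns rows I-1, I, I+1 of z
                for J in (I - 1, I, I + 1):
                    if 1 <= J < N:
                        entries[(J, I)] = z[J]
    return entries
```

The returned dictionary maps `(row, column)` to the corresponding entry of $RAP$; the cost is three prolong/multiply/restrict passes regardless of the number of unknowns.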

Figure 3.10-1 shows the maximum error versus iteration number for the multigrid algorithm applied to a one-dimensional steady-state heat equation with random diffusion coefficient uniformly distributed in the unit interval. The discretization involved 1024 grid cells. The initial guess was random and uniformly distributed in the unit interval. The multigrid smoother used one pass of the Gauss-Seidel iteration, and the prolongation was given by algebraic multigrid. The error was reduced by 17 orders of magnitude in 29 iterations. When we switch to the Gauss-Seidel red-black smoother, the error is essentially zero after a single iteration.

Program 3.10-5 source: http://www.math.duke.edu/~johnt/math226/iterative_methods/Multigrid.C

Figure 3.10-1: $L^\infty$ Error in Multigrid versus Step Number: 1D steady-state diffusion with random diffusion coefficient, 1024 grid cells

3.10.7 Heat Equation in One Dimension

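Next, suppose that we want to solve the heat equation

\begin{align*}
\frac{\partial u}{\partial t} - \frac{\partial}{\partial x} \left( \delta \frac{\partial u}{\partial x} \right) &= f , \quad 0 < x < 1 , \; 0 < t \\
u(0, t) = 0 &= u(1, t) , \quad 0 < t \\
u(x, 0) &= u_0(x) , \quad 0 < x < 1 .
\end{align*}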


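Straightforward centered finite differences with backward Euler time integration lead to the scheme

\[
\frac{u_i^{n+1} - u_i^n}{\Delta t} - \delta \left[ \frac{u_{i+1}^{n+1} - u_i^{n+1}}{\Delta x} - \frac{u_i^{n+1} - u_{i-1}^{n+1}}{\Delta x} \right] \frac{1}{\Delta x} = f_i .
\]

This can be rewritten

\[
-\tau u_{i-1}^{n+1} + (1 + 2\tau) u_i^{n+1} - \tau u_{i+1}^{n+1} = u_i^n + f_i \Delta t
\]

where

\[
\tau \equiv \frac{\delta \Delta t}{\Delta x^2}
\]

is the fine grid decay number.

Next, let us describe our multigrid prolongation. Suppose that we choose $C$, $F$ and $\Pi_i$ as in section 3.10.6. We compute the prolongation weights by

\[
w_{ij} = -\frac{A_{ij}}{A_{ii}} = \frac{\tau}{1 + 2\tau} \equiv \omega .
\]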

The prolongation operator $P$ has columns

\[
P e_I = e_{2I} + \sum_{i : 2I \in \Pi_i} e_i w_{i,2I} = \omega e_{2I-1} + e_{2I} + \omega e_{2I+1} , \quad 1 \le I < N .
\]

Note that for small decay numbers, the algebraic multigrid prolongation differs significantly from being an interpolation.

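We will take the restriction operator to be half the transpose of $P$. In other words,

\[
e_I^\top R x = \frac{1}{2} x_{2I} + \frac{\omega}{2} \left( x_{2I-1} + x_{2I+1} \right) .
\]

This implies that

\[
R e_i = \begin{cases}
\frac{\omega}{2} e_1 , & i = 1 \\
\frac{\omega}{2} (e_I + e_{I+1}) , & i = 2I+1 \mbox{ and } 1 \le I < N-1 \\
\frac{\omega}{2} e_{N-1} , & i = 2N-1 \\
\frac{1}{2} e_I , & i = 2I \mbox{ and } 1 \le I < N
\end{cases}
\]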

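Let us compute the coarse grid matrix $RAP$, defined columnwise by

\[
RAPe_I = RA \left( \omega e_{2I-1} + e_{2I} + \omega e_{2I+1} \right) .
\]

For $1 < I < N-1$ we have

\begin{align*}
RAPe_I &= R \left[ \omega \left( -\tau e_{2I-2} + (1+2\tau) e_{2I-1} - \tau e_{2I} \right) + \left( -\tau e_{2I-1} + (1+2\tau) e_{2I} - \tau e_{2I+1} \right) \right. \\
&\qquad \left. + \, \omega \left( -\tau e_{2I} + (1+2\tau) e_{2I+1} - \tau e_{2I+2} \right) \right] \\
&= R \left[ - \frac{\tau^2}{1+2\tau} e_{2I-2} + \left( 1 + 2\tau - \frac{2\tau^2}{1+2\tau} \right) e_{2I} - \frac{\tau^2}{1+2\tau} e_{2I+2} \right] \\
&= - \frac{\tau^2}{2(1+2\tau)} e_{I-1} + \left( \frac{1+2\tau}{2} - \frac{\tau^2}{1+2\tau} \right) e_I - \frac{\tau^2}{2(1+2\tau)} e_{I+1} .
\end{align*}

The coarse matrix will also be tridiagonal, with constant diagonal entries $\frac{1}{2} + \tau - \tau^2/(1+2\tau)$ and constant sub- and super-diagonal entries $-\tau^2/[2(1+2\tau)]$. This means that the multigrid coarsening of the fine grid equations will produce a linear system that does not correspond to the same finite difference scheme on the coarse grid.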

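Instead, suppose that we use a backward Euler discretization of the form

\[
\frac{1}{6} \frac{u_{i-1}^{n+1} - u_{i-1}^n}{\Delta t} + \frac{2}{3} \frac{u_i^{n+1} - u_i^n}{\Delta t} + \frac{1}{6} \frac{u_{i+1}^{n+1} - u_{i+1}^n}{\Delta t}
- \left[ \frac{u_{i+1}^{n+1} - u_i^{n+1}}{\Delta x} - \frac{u_i^{n+1} - u_{i-1}^{n+1}}{\Delta x} \right] \frac{\delta}{\Delta x} = f_i .
\]

This can be rewritten

\[
\left( \frac{1}{6} - \tau \right) u_{i-1}^{n+1} + \left( \frac{2}{3} + 2\tau \right) u_i^{n+1} + \left( \frac{1}{6} - \tau \right) u_{i+1}^{n+1}
= \frac{1}{6} u_{i-1}^n + \frac{2}{3} u_i^n + \frac{1}{6} u_{i+1}^n + f_i \Delta t .
\]

Unfortunately, this discretization does not always lead to an M-matrix in the linear system: for $\tau < 1/6$ the off-diagonal entries $\frac{1}{6} - \tau$ are positive, and our algebraic multigrid prolongation would produce zero values at the fine grid points. On the other hand, this discretization of the heat equation will correspond to a finite element method. Accordingly, we will use the prolongation and restriction commonly employed by finite element multigrid (see section 5.13).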

Suppose that we take the prolongation operator $P$ to have columns

\[
P e_I = e_{2I} + \sum_{i : 2I \in \Pi_i} e_i w_{i,2I} = \frac{1}{2} e_{2I-1} + e_{2I} + \frac{1}{2} e_{2I+1} , \quad 1 \le I < N .
\]

This prolongation corresponds to a linear interpolation of the coarse grid data, and is not the algebraic multigrid prolongation we presented in section 3.10.5. The restriction operator will be taken to be $\frac{1}{2} P^\top$, or

\[
R e_i = \frac{1}{2} \begin{cases}
\frac{1}{2} e_1 , & i = 1 \\
\frac{1}{2} e_I + \frac{1}{2} e_{I+1} , & i = 2I+1 \mbox{ and } 1 \le I < N-1 \\
\frac{1}{2} e_{N-1} , & i = 2N-1 \\
e_I , & i = 2I \mbox{ and } 1 \le I < N
\end{cases}
\]


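Let us compute the coarse grid matrix for $1 < I < N$:

\begin{align*}
RAPe_I &= RA \left\{ \frac{1}{2} e_{2I-1} + e_{2I} + \frac{1}{2} e_{2I+1} \right\} \\
&= R \left\{ \frac{1}{2} \left[ \left( \frac{1}{6} - \tau \right) e_{2I-2} + \left( \frac{2}{3} + 2\tau \right) e_{2I-1} + \left( \frac{1}{6} - \tau \right) e_{2I} \right] \right. \\
&\qquad + \left[ \left( \frac{1}{6} - \tau \right) e_{2I-1} + \left( \frac{2}{3} + 2\tau \right) e_{2I} + \left( \frac{1}{6} - \tau \right) e_{2I+1} \right] \\
&\qquad \left. + \frac{1}{2} \left[ \left( \frac{1}{6} - \tau \right) e_{2I} + \left( \frac{2}{3} + 2\tau \right) e_{2I+1} + \left( \frac{1}{6} - \tau \right) e_{2I+2} \right] \right\} \\
&= R \left\{ \left( \frac{1}{12} - \frac{1}{2} \tau \right) e_{2I-2} + \frac{1}{2} e_{2I-1} + \left( \frac{5}{6} + \tau \right) e_{2I} + \frac{1}{2} e_{2I+1} + \left( \frac{1}{12} - \frac{1}{2} \tau \right) e_{2I+2} \right\} \\
&= \frac{1}{2} \left\{ \left( \frac{1}{12} - \frac{1}{2} \tau \right) e_{I-1} + \frac{1}{4} (e_{I-1} + e_I) + \left( \frac{5}{6} + \tau \right) e_I + \frac{1}{4} (e_I + e_{I+1}) + \left( \frac{1}{12} - \frac{1}{2} \tau \right) e_{I+1} \right\} \\
&= \frac{1}{2} \left\{ \left( \frac{1}{3} - \frac{1}{2} \tau \right) e_{I-1} + \left( \frac{4}{3} + \tau \right) e_I + \left( \frac{1}{3} - \frac{1}{2} \tau \right) e_{I+1} \right\} \\
&= \left( \frac{1}{6} - \frac{1}{4} \tau \right) e_{I-1} + \left( \frac{2}{3} + \frac{1}{2} \tau \right) e_I + \left( \frac{1}{6} - \frac{1}{4} \tau \right) e_{I+1} .
\end{align*}

Thus with this choice of the discretization of the heat equation and of the prolongation and restriction, the coarse grid matrix corresponds to the same discretization of the heat equation, with decay numbers related by $\tau_c = \frac{1}{4} \tau_f$.

A program to perform the algebraic multigrid algorithm for the heat equation in 1D is given in Program 3.10-6: Multigrid.C. This program uses a C++ class Level to contain the data arrays on the hierarchy of grids, and to perform the multigrid operations using recursion. In particular, note that the member function Level::setup shows how to compute the coarse grid matrix directly as $R A_f P$ in work proportional to the number of unknowns. The procedures Level::prolong and Level::restrict are chosen to be adjoints of each other, so that the overall multigrid cycle is symmetric. This allows the multigrid V-cycle to be used as a preconditioner for conjugate gradients.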
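The two coarse-grid stencils derived in this section can be checked with a few lines of exact arithmetic. The sketch below is an illustration only: it assumes a constant-coefficient symmetric tridiagonal fine matrix with stencil `(off, diag, off)`, a prolongation column $\omega e_{2I-1} + e_{2I} + \omega e_{2I+1}$, and $R = \frac{1}{2} P^\top$; the function name `coarse_stencil` is hypothetical.

```python
from fractions import Fraction


def coarse_stencil(off, diag, omega, N=6):
    """Return (off_c, diag_c) taken from an interior column of R A P."""
    n = 2 * N                                   # fine unknowns 1..n-1

    def col_P(I):                               # column I of P as {fine index: value}
        return {2 * I - 1: omega, 2 * I: Fraction(1), 2 * I + 1: omega}

    def apply_A(x):                             # A is symmetric, so A e_i = off, diag, off
        y = {}
        for i, xi in x.items():
            for j, a in ((i - 1, off), (i, diag), (i + 1, off)):
                if 1 <= j <= n - 1:
                    y[j] = y.get(j, 0) + a * xi
        return y

    def row_R(J, x):                            # (R x)_J = (P e_J)^T x / 2
        return sum(pj * x.get(i, 0) for i, pj in col_P(J).items()) / 2

    I = N // 2                                  # an interior coarse index
    APe = apply_A(col_P(I))
    return row_R(I - 1, APe), row_R(I, APe)


tau = Fraction(3)

# Finite differences with backward Euler, algebraic multigrid weights.
fd_off, fd_diag = coarse_stencil(-tau, 1 + 2 * tau, tau / (1 + 2 * tau))

# Finite-element-type time discretization with linear interpolation.
fe_off, fe_diag = coarse_stencil(Fraction(1, 6) - tau, Fraction(2, 3) + 2 * tau, Fraction(1, 2))
```

With these definitions, `fe_off` and `fe_diag` reproduce the fine stencil with $\tau$ replaced by $\tau/4$, while the row sum `fd_diag + 2 * fd_off` differs from the value $\frac{1}{2}$ that the same finite difference scheme (scaled by $\beta = \frac{1}{2}$) would give.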

3.10.8 2D Laplace Equation

Consider the standard five-point discretization of the Laplacian:

\[
- \frac{u_{i+1,j} - 2 u_{ij} + u_{i-1,j}}{\Delta x^2} - \frac{u_{i,j+1} - 2 u_{ij} + u_{i,j-1}}{\Delta y^2} = f_{ij} .
\]

In this case, the set of neighbors of point $(i, j)$ is

\[
N^-_{(i,j)} = \{ (i-1, j) , (i+1, j) , (i, j-1) , (i, j+1) \} .
\]

http://www.math.duke.edu/~johnt/math226/iterative_methods/Multigrid.C


In this case, the coarse points have indices of the form $(2I, 2J)$, and fine points have an odd index in at least one coordinate direction. For $(i, j) = (2I+1, 2J)$ we will prolong via interpolation from coarse points in $\Pi_{(2I+1,2J)} = \{ (2I, 2J), (2I+2, 2J) \}$, and for $(i, j) = (2I, 2J+1)$ we will prolong via interpolation from coarse points in $\Pi_{(2I,2J+1)} = \{ (2I, 2J), (2I, 2J+2) \}$. For $(i, j) = (2I+1, 2J+1)$ we will prolong via interpolation from fine points in $\Pi^f_{(2I+1,2J+1)} = \{ (2I, 2J+1), (2I+2, 2J+1), (2I+1, 2J), (2I+1, 2J+2) \}$.

Thus the prolongation operator is defined by the formulas below, in which $A_{k,\ell}$ appears to abbreviate the entry, in the matrix row for the point on the left-hand side, that multiplies the unknown at $(k, \ell)$:

\begin{align*}
(Py)_{2I,2J} &= y_{I,J} \\
(Py)_{2I+1,2J} &= - \frac{A_{2I,2J} + A_{2I+2,2J} + A_{2I+1,2J-1} + A_{2I+1,2J+1}}{A_{2I,2J} + A_{2I+2,2J}} \; \frac{A_{2I,2J} (Py)_{2I,2J} + A_{2I+2,2J} (Py)_{2I+2,2J}}{A_{2I+1,2J}} \\
&= \frac{1}{2} \left[ (Py)_{2I,2J} + (Py)_{2I+2,2J} \right] \\
(Py)_{2I,2J+1} &= - \frac{A_{2I,2J} + A_{2I,2J+2} + A_{2I-1,2J+1} + A_{2I+1,2J+1}}{A_{2I,2J} + A_{2I,2J+2}} \; \frac{A_{2I,2J} (Py)_{2I,2J} + A_{2I,2J+2} (Py)_{2I,2J+2}}{A_{2I,2J+1}} \\
&= \frac{1}{2} \left[ (Py)_{2I,2J} + (Py)_{2I,2J+2} \right] \\
(Py)_{2I+1,2J+1} &= - \frac{A_{2I,2J+1} + A_{2I+2,2J+1} + A_{2I+1,2J} + A_{2I+1,2J+2}}{A_{2I,2J+1} + A_{2I+2,2J+1} + A_{2I+1,2J} + A_{2I+1,2J+2}} \\
&\qquad \times \frac{A_{2I,2J+1} (Py)_{2I,2J+1} + A_{2I+2,2J+1} (Py)_{2I+2,2J+1} + A_{2I+1,2J} (Py)_{2I+1,2J} + A_{2I+1,2J+2} (Py)_{2I+1,2J+2}}{A_{2I+1,2J+1}} \\
&= \frac{1}{4} \left[ (Py)_{2I,2J+1} + (Py)_{2I+2,2J+1} + (Py)_{2I+1,2J} + (Py)_{2I+1,2J+2} \right]
\end{align*}

Here the formula for the prolongation with two odd indices was derived from indirect interpolation. For Dirichlet boundary conditions, obvious modifications to these expressions must be made near the boundaries.

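We take the restriction to be one fourth the transpose of the prolongation. Then

\[
(Rx)_{I,J} = \frac{1}{4} \left[ x_{2I,2J} + \frac{1}{2} \left( x_{2I-1,2J} + x_{2I+1,2J} + x_{2I,2J-1} + x_{2I,2J+1} \right) + \frac{1}{4} \left( x_{2I-1,2J-1} + x_{2I+1,2J-1} + x_{2I-1,2J+1} + x_{2I+1,2J+1} \right) \right] .
\]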

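It is convenient to describe the difference scheme, prolongation and restriction with stencils. If $\Delta x = \Delta y$, then the finite difference stencil times $\Delta x^2$ is

\[
\begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix}
\]

The prolongation stencil is

\[
\begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 1 & 1/2 \\ 1/4 & 1/2 & 1/4 \end{bmatrix}
\]

In computing the coarse grid matrix $RAP$, we first compute $AP$, for which the stencil times $\Delta x^2$ is

\[
\begin{bmatrix}
 & -1/4 & -1/2 & -1/4 & \\
-1/4 & 0 & 1/2 & 0 & -1/4 \\
-1/2 & 1/2 & 2 & 1/2 & -1/2 \\
-1/4 & 0 & 1/2 & 0 & -1/4 \\
 & -1/4 & -1/2 & -1/4 &
\end{bmatrix}
\]

Applying the restriction to these results gives us the coarse matrix stencil times $(2 \Delta x)^2$:

\[
\begin{bmatrix} -1/4 & -1/2 & -1/4 \\ -1/2 & 3 & -1/2 \\ -1/4 & -1/2 & -1/4 \end{bmatrix}
\]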
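These stencil manipulations are easy to automate. The following sketch assumes an infinite uniform grid with $\Delta x = \Delta y$ (so boundary modifications are ignored), represents a stencil as a dictionary from offsets to exact rational weights, and uses $R = \frac{1}{4} P^\top$. The names `apply_stencil` and `coarse_stencil` are illustrative and not taken from Multigrid.C.

```python
from fractions import Fraction as Fr

# Five-point stencil times dx^2, and the prolongation stencil.
FIVE_POINT = {(0, 0): Fr(4), (1, 0): Fr(-1), (-1, 0): Fr(-1), (0, 1): Fr(-1), (0, -1): Fr(-1)}
P_STENCIL = {(a, b): Fr(1, 2 ** (abs(a) + abs(b))) for a in (-1, 0, 1) for b in (-1, 0, 1)}


def apply_stencil(A, x):
    """y = A x on an infinite grid, for a symmetric constant stencil A."""
    y = {}
    for (i, j), v in x.items():
        for (a, b), c in A.items():
            key = (i + a, j + b)
            y[key] = y.get(key, 0) + c * v
    return y


def coarse_stencil(A):
    """Stencil of R A P times (2 dx)^2, given the stencil of A times dx^2.

    The factor 1/4 in R = P^T / 4 cancels against (2 dx)^2 = 4 dx^2,
    so the scaled coarse stencil is just P^T (A P) evaluated near the origin.
    """
    APe = apply_stencil(A, P_STENCIL)          # A applied to the column of P at coarse (0, 0)
    return {(I, J): sum(w * APe.get((2 * I + a, 2 * J + b), 0)
                        for (a, b), w in P_STENCIL.items())
            for I in (-1, 0, 1) for J in (-1, 0, 1)}


# Nine-point stencil with centre 8/3 and all eight neighbours -1/3.
NINE_POINT = {(a, b): (Fr(8, 3) if (a, b) == (0, 0) else Fr(-1, 3))
              for a in (-1, 0, 1) for b in (-1, 0, 1)}
```

Calling `coarse_stencil(FIVE_POINT)` reproduces the nine-point coarse stencil shown above, and `coarse_stencil(NINE_POINT)` returns the nine-point stencil unchanged.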

Note that the algebraic multigrid coarsening of the five-point discretization of the Laplacian has become a nine-point discretization on the coarse grid. In such a case, it may be convenient to continue to use the fine grid prolongation and restriction on coarser grids. Nevertheless, the formulas for the prolongation must be modified to reflect the fact that the set $N_i^-$ is larger with the coarse matrix.

The standard finite element approximation to the Laplacian, using piecewise bilinear elements and exact integration on a uniform square grid, has stencil times $\Delta x^2$

\[
\begin{bmatrix} -1/3 & -1/3 & -1/3 \\ -1/3 & 8/3 & -1/3 \\ -1/3 & -1/3 & -1/3 \end{bmatrix} .
\]

The finite element prolongation is based on linear interpolation from the coarse space to fine grid locations, so it is the same as the stencil described in the previous paragraph. In this case, the multigrid coarsening of the finite element stencil produces the same stencil on the coarse grid.

3.10.9 Work Estimates

Let us estimate the work involved in the multigrid algorithm.

Lemma 3.10-7: Let $n_k$ be the number of unknowns on level $k$, and let $W_k$ be the work in one iteration of the $k$'th level of the scheme. Suppose that we use a fixed refinement ratio $\rho > 1$ between levels in $D$ spatial dimensions, with $p < \rho^D$ repetitions of the coarser multigrid procedure on each level, and $m$ repetitions of the multigrid procedure on a given level before returning. Finally, suppose that the smoother requires at most $C_S n_k$ operations, the residual computation requires at most $C_r n_k$ operations and the prolongation or restriction requires at most $C_I n_k$ operations on level $k$. Then the total work $\mathcal{W}_k$ in level $k$ of the multigrid algorithm satisfies

\[
\mathcal{W}_k \le \frac{2m (C_S + C_r + C_I + 1)}{(1 - p \rho^{-D})(1 - \rho^{-D})} n_k + W_0 .
\]

Proof: It is easy to see that

\[
W_k \le 2 (C_S + C_r + C_I + 1) n_k + p W_{k-1} = C n_k + p W_{k-1} .
\]

Since we use a fixed refinement ratio $\rho > 1$ between levels,

\[
n_k = \rho^D n_{k-1} .
\]

This recurrent inequality implies that $W_k \le C n_k + p (C n_{k-1} + p W_{k-2})$; continuing in this way we obtain

\[
W_k \le C n_k \sum_{j=0}^{k-1} (p \rho^{-D})^j + p^k W_0 = C n_k \frac{1 - (p/\rho^D)^k}{1 - p/\rho^D} + p^k W_0 .
\]

Then we see that

\[
W_k \le C n_k \frac{1}{1 - p/\rho^D} + p^k W_0 .
\]

Let $\mathcal{W}_k$ be the total work in level $k$ of the multigrid algorithm. Then

\begin{align*}
\mathcal{W}_k &\le m W_k + \mathcal{W}_{k-1} \le m W_k + m W_{k-1} + \mathcal{W}_{k-2} \le \ldots \\
&\le m \sum_{j=1}^k W_j + W_0 \le \frac{mC}{1 - p/\rho^D} \sum_{j=1}^k n_j + W_0 = \frac{mC}{1 - p/\rho^D} n_k \sum_{j=0}^{k-1} \rho^{-Dj} + W_0 \\
&= \frac{mC}{1 - p/\rho^D} n_k \frac{1 - \rho^{-Dk}}{1 - \rho^{-D}} + W_0 \le \frac{mC}{1 - p/\rho^D} n_k \frac{1}{1 - \rho^{-D}} + W_0 .
\end{align*}

□

This shows that the total work to produce an $O(\Delta x_k)$ error in the solution of the linear system is proportional to the product of the number of fine unknowns $n_k$ and the number of multigrid iterations $m$.
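The per-iteration bound in the proof can be illustrated numerically. The sketch below simply iterates the recurrence $W_k = C n_k + p W_{k-1}$ with equality (the worst case allowed by the hypotheses) and compares it with the closed-form bound $C n_k / (1 - p/\rho^D) + p^k W_0$. The parameter values in the usage note are arbitrary illustrations, not measurements.

```python
from fractions import Fraction


def work_per_iteration(k, n0, C, p, rho, D, W0):
    """Iterate W_j = C n_j + p W_{j-1} with n_j = rho^D n_{j-1}; return (W_k, n_k)."""
    W = Fraction(W0)
    n = Fraction(n0)
    for _ in range(k):
        n *= rho**D
        W = C * n + p * W
    return W, n


def work_bound(k, n_k, C, p, rho, D, W0):
    """Closed-form bound from the proof; requires p < rho^D."""
    return C * n_k / (1 - Fraction(p, rho**D)) + p**k * W0
```

For example, a V-cycle in one dimension corresponds to `p=1, rho=2, D=1`, for which the bound is $2 C n_k + W_0$; a W-cycle in two dimensions corresponds to `p=2, rho=2, D=2`.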

3.10.10 Multigrid Debugging Techniques

There are a number of programming errors that can cause multigrid to fail. To remove these errors, it is useful to perform a variety of program tests.

1. Check that the fine matrix produces a symmetric linear operator. In other words, with the right-hand side set to zero, the computation of the residual given $x_f$ should provide $A_f x_f$. Then for random values of $x_f$ and $y_f$ we should have

\[
y_f^\top A_f x_f = (A_f x_f, y_f) = (x_f, A_f y_f) = [A_f y_f]^\top x_f .
\]

We can use a random number generator to produce $x_f$ and $y_f$, apply the residual computation with $b_f = 0$ to get $A_f x_f$ and $A_f y_f$, then form inner products to check the symmetry of $A_f$. If the test fails, then the test should be repeated for $x_f$ and $y_f$ equal to arbitrary axis vectors until the problem is isolated.

2. Check that the restriction is the correct scalar multiple of the transpose of the prolongation. If $x_f$ and $y_c$ are random vectors, we should have

\[
y_c^\top [R x_f] = (R x_f, y_c) = \beta (x_f, P y_c) = \beta [P y_c]^\top x_f .
\]

On a uniform grid with constant coefficients, the prolongation should produce averages of the coarse grid values at intermediate fine grid points, and the restriction should average fine grid values to produce coarse grid values.

3. Check that the coarse matrix is symmetric. This is similar to the symmetry test for $A_f$. However, this test depends on the relationship between the prolongation and restriction, and on the code used to compute the coarse grid matrix from the fine grid matrix. For constant coefficients on uniform grids, we can often design the discretization so that the coarse grid matrix corresponds to the same difference scheme on the coarse grid.

4. Check that the pre-smoother and post-smoother are adjoints of each other. If $x_f$ and $y_f$ are random vectors, we should have

\[
y_f^\top [S_f x_f] = (S_f x_f, y_f) = (x_f, S_f^\top y_f) = [S_f^\top y_f]^\top x_f .
\]

We can apply the pre-smoother to $x_f$ to get $S_f x_f$, and the post-smoother to $y_f$ to get $S_f^\top y_f$. Then we can take appropriate inner products to check the smoothers.

5. Check that the coarse grid projection is a projection. Given a random vector $x_c$, we want to check that

\[
\left[ I - P (R A_f P)^{-1} R A_f \right] P x_c = 0 .
\]

This test begins with a prolongation to compute $P x_c$, then with initial residual set to zero we perform the steps in the multigrid V-cycle that update the residual, restrict, recurse and prolong. Note that the subscript $c$ here corresponds only to the coarsest level. It is generally too hard to check that vectors in the null space of $R A_f$ are unchanged by the coarse grid projection.

6. Check that the coarse grid projection is self-adjoint in the inner product generated by $A_f$. Given random vectors $x_f$ and $y_f$, we compute the coarse grid projections

\[
K_f x_f = \left[ I - P (R A_f P)^{-1} R A_f \right] x_f
\]

and $K_f y_f$. Then we check that

\[
(K_f x_f, y_f)_{A_f} = [A_f y_f]^\top [K_f x_f] = [K_f y_f]^\top [A_f x_f] = (x_f, K_f y_f)_{A_f}
\]

by computing appropriate inner products.

7. Check that the V-cycle is symmetric. If $r_f$ and $s_f$ are random, apply the multigrid V-cycle to compute the resulting corrections $d_f = V_f r_f$ and $e_f = V_f s_f$. Then compare the inner products $s_f^\top d_f$ and $r_f^\top e_f$.

8. Check that the V-cycle reduces the error in the solution. If $x_f$ is random, apply the multigrid V-cycle to initial residual $A_f x_f$ with $b_f = 0$. The resulting vector

\[
V_f x_f = (I - S_f^\top A_f)(I - P V_c R A_f)(I - S_f A_f) x_f
\]

should have components that are significantly smaller than $x_f$. This could also be checked by taking $x_f$ to be an arbitrary axis vector, and checking that $V_f x_f$ has entries that are small compared to one.
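The first two checks in the list above can be sketched as follows. This is an illustrative Python sketch, not the book's C++ program: it assumes the constant-coefficient 1D problem with zero Dirichlet data, linear-interpolation prolongation, and $\beta = 1/2$. The function names are hypothetical, and the defects are scaled by Cauchy-Schwarz bounds so that they can be compared against rounding error.

```python
import random


def apply_A(x, dx):
    """Fine operator (2 x_i - x_{i-1} - x_{i+1}) / dx^2 with zero boundary values."""
    m = len(x)
    return [(2 * x[i] - (x[i - 1] if i > 0 else 0.0)
             - (x[i + 1] if i < m - 1 else 0.0)) / dx**2 for i in range(m)]


def prolong(yc):
    """Coarse unknowns I = 1..N-1 to fine unknowns i = 1..2N-1 (coarse I at fine 2I)."""
    N = len(yc) + 1
    pad = [0.0] + list(yc) + [0.0]
    return [pad[i // 2] if i % 2 == 0 else 0.5 * (pad[(i - 1) // 2] + pad[(i + 1) // 2])
            for i in range(1, 2 * N)]


def restrict(xf, beta=0.5):
    """R = beta * P^T; entry xf[i-1] holds fine unknown i."""
    N = (len(xf) + 1) // 2
    return [beta * (0.5 * xf[2 * I - 2] + xf[2 * I - 1] + 0.5 * xf[2 * I])
            for I in range(1, N)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def symmetry_defect(N, seed=0):
    """Test 1: scaled size of y^T (A x) - x^T (A y) for random x and y."""
    rng = random.Random(seed)
    dx = 1.0 / (2 * N)
    x = [rng.random() for _ in range(2 * N - 1)]
    y = [rng.random() for _ in range(2 * N - 1)]
    Ax, Ay = apply_A(x, dx), apply_A(y, dx)
    return abs(dot(y, Ax) - dot(x, Ay)) / (dot(x, x) * dot(Ay, Ay)) ** 0.5


def adjoint_defect(N, beta=0.5, seed=1):
    """Test 2: scaled size of y_c^T (R x_f) - beta (P y_c)^T x_f."""
    rng = random.Random(seed)
    xf = [rng.random() for _ in range(2 * N - 1)]
    yc = [rng.random() for _ in range(N - 1)]
    Py = prolong(yc)
    return abs(dot(yc, restrict(xf, beta)) - beta * dot(Py, xf)) / (dot(xf, xf) * dot(Py, Py)) ** 0.5
```

Both defects should be at the level of rounding error. A defect of order one usually points to an indexing or scaling mistake in one of the three operators, which can then be isolated with axis vectors as described in item 1.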

Exercises 3.10

1. Modify the one-dimensional steady-state heat equation multigrid algorithm to use Richardson's iteration, Jacobi-$\omega$ iteration, Gauss-Seidel iteration, Gauss-Seidel red-black iteration, and Gauss-Seidel to-and-fro iteration. Discuss how you chose the relaxation parameters for Richardson's iteration and the Jacobi-$\omega$ iteration. Compare the speed of convergence for constant diffusion and random diffusion, using $2^k$ grid cells with $k \ge 8$. (Make sure that your pre-smoother and post-smoother are adjoints of one another.)

2. In section 3.10.6 we described a multigrid algorithm for a finite difference discretization of the steady-state heat equation, and found that the coarse grid matrix $A_c = R A_f P$ involved a similar finite difference discretization of the steady-state heat equation, with harmonic averages of the fine-grid diffusion coefficients. Suppose that we scale the finite difference equations in a way that is natural for the finite element method:

\[
-\delta_{i+\frac{1}{2}} \frac{u_{i+1} - u_i}{\Delta x} + \delta_{i-\frac{1}{2}} \frac{u_i - u_{i-1}}{\Delta x} = f_i \Delta x ,
\]

and suppose that we define the restriction and prolongation as adjoints of each other so that the coarse-grid equations reduce to the same form, with harmonic averages of the fine-grid diffusion coefficients.

(a) How are the restriction and prolongation related in this case? In other words, $R = P^\top \beta$, but what is $\beta$ in this case?

(b) Modify the multigrid algorithm for this restriction and prolongation, and compare its performance to the original algorithm.

3. Use the multigrid algorithm for the steady-state heat equation as a preconditioner for the conjugate gradient algorithm. Plot the log of the $L^\infty$ norm of the error versus iteration number and the log of the $L^\infty$ norm of the error versus the computer time for both the preconditioned conjugate gradient iteration and the original multigrid iteration. Does the use of conjugate gradients speed the convergence?

4. Modify the one-dimensional steady-state heat equation multigrid algorithm to solve the unsteady heat equation. Compare the speed of convergence for decay numbers $\tau = 10^{-1}, 10^0, 10^1$ and $10^2$. Discuss your choice of smoother.

5. Program a multigrid algorithm to solve a 2D steady-state heat equation on the unit square with zero boundary data and fixed forcing $f_{i,j} = 1$. Use the standard 5-point discretization of the Laplace operator. Describe the prolongation and restriction, and explain why they are adjoints. Discuss your choice of smoother, and the convergence results.

Chapter 4

Finite Element Methods

Instead of using finite difference approximations to the partial derivatives in our differential equations, we will approximate the solution of the equation by another function that is easier to use in computation. In practice, we will approximate the solution of these differential equations by piecewise polynomials (splines). In order to understand how well we can approximate some functions by other functions, we will use some results from functional analysis. This approach will have advantages that finite differences cannot offer. For example, we will be able to develop approximations to differential equations with Dirac delta-function forcing. In multiple dimensions, we will develop natural ways to deal with curved boundaries.

This approach will require a number of mathematical developments. We will need to discuss the well-posedness of the formulation, and understand how the solution depends on its data. We will also need to understand how to estimate errors in approximating functions with finite energy by piecewise polynomials. In order to implement the numerical methods, we will need to understand how to compute the equations for the numerical approximation, and to solve these equations efficiently.

4.1 Galerkin Examples

As an example of the techniques we will develop in this chapter, consider the boundary-value problem

\begin{align}
-\nabla_x \cdot k \nabla_x u &= f \mbox{ in } \Omega \subset \mathbb{R}^d \tag{4.1-1a} \\
u &= b_e \mbox{ on } \Gamma_e \subset \partial\Omega \tag{4.1-1b} \\
\mathbf{n} \cdot k \nabla_x u &= b_n \mbox{ on } \partial\Omega \setminus \Gamma_e \tag{4.1-1c}
\end{align}

The subscript “e” indicates that the boundary condition $u = b_e$ is “essential,” meaning that it must be imposed (as a constraint) on the solution of the problem. The subscript “n” indicates that the other boundary condition $\mathbf{n} \cdot k \nabla_x u = b_n$ is “natural,” meaning that it will be enforced naturally on solutions.

4.1.1 Weak Formulation

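Our next step will be to develop the weak form of the boundary-value problem. First, we define the set of test functions.

Definition 4.1-1: The set $C_e^\infty(\Omega)$ consists of all functions $v$ with an arbitrary number of continuous derivatives satisfying $v(x) = 0$ for all $x \in \Gamma_e$.

Next, we use the test functions to develop an alternative formulation of the problem.

Lemma 4.1-2: Suppose that we are given functions $f \in C^0(\Omega)$ and $k \in C^1(\Omega)$ defined in the interior of an open set $\Omega \subset \mathbb{R}^d$. Suppose that the boundary $\partial\Omega$ of $\Omega$ is such that the divergence theorem

\[
\int_\Omega \psi \nabla_x \cdot k \nabla_x \phi \, dx = \int_{\partial\Omega} \psi \, \mathbf{n} \cdot k \nabla_x \phi \, ds - \int_\Omega \nabla_x \psi \cdot k \nabla_x \phi \, dx \tag{4.1-2}
\]

is satisfied for all $\phi \in C^2(\Omega)$ and all $\psi \in C^1(\Omega)$. If $\Gamma_e \subset \partial\Omega$, suppose that we are given functions $b_e \in C(\Gamma_e)$ and $b_n \in C(\partial\Omega \setminus \Gamma_e)$. Then $u \in C^2(\Omega)$ solves the boundary value problem (4.1-1) if and only if $u \in C^2(\Omega)$ is such that $u(x) = b_e(x)$ for all $x \in \Gamma_e$ and $u$ satisfies the weak form

\[
\forall v \in C_e^\infty(\Omega) , \quad A(v, u) \equiv \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx = \int_\Omega v f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v b_n \, ds . \tag{4.1-3}
\]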

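Proof: First, suppose that $u$ solves the boundary-value problem (4.1-1), and $v \in C_e^\infty(\Omega)$. Then the differential equation, the divergence theorem (4.1-2) and the boundary conditions on $v$ and $u$ imply that

\begin{align*}
-\int_\Omega v f \, dx &= \int_\Omega v \nabla_x \cdot k \nabla_x u \, dx = \int_{\partial\Omega} v \, \mathbf{n} \cdot k \nabla_x u \, ds - \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx \\
&= \int_{\Gamma_e} v \, \mathbf{n} \cdot k \nabla_x u \, ds + \int_{\partial\Omega \setminus \Gamma_e} v \, \mathbf{n} \cdot k \nabla_x u \, ds - \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx \\
&= \int_{\partial\Omega \setminus \Gamma_e} v b_n \, ds - \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx .
\end{align*}

The weak form (4.1-3) is an easy consequence of this equation.

To prove the converse, suppose that the weak form is satisfied by $u \in C^2(\Omega)$, and $u$ satisfies the essential boundary condition (4.1-1b). Then for any $v \in C_e^\infty(\Omega)$ the weak form (4.1-3), the divergence theorem (4.1-2) and the boundary condition on $v$ imply that

\begin{align*}
\int_\Omega v f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v b_n \, ds &= \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx \\
&= \int_{\partial\Omega} v \, \mathbf{n} \cdot k \nabla_x u \, ds - \int_\Omega v \nabla_x \cdot k \nabla_x u \, dx \\
&= \int_{\partial\Omega \setminus \Gamma_e} v \, \mathbf{n} \cdot k \nabla_x u \, ds - \int_\Omega v \nabla_x \cdot k \nabla_x u \, dx .
\end{align*}

This can be rewritten in the form

\[
\int_\Omega v (f + \nabla_x \cdot k \nabla_x u) \, dx + \int_{\partial\Omega \setminus \Gamma_e} v (b_n - \mathbf{n} \cdot k \nabla_x u) \, ds = 0 . \tag{4.1-4}
\]

Note that $f + \nabla_x \cdot k \nabla_x u$ is continuous in $\Omega$. If this function were nonzero at some $x \in \Omega$, then there would be an open set $N_x \subset \Omega$ containing $x$ so that $f + \nabla_x \cdot k \nabla_x u$ would be nonzero and have the same sign at all points of $N_x$. We could choose $v \in C_e^\infty(\Omega)$ to be zero outside $N_x$ and positive at $x$. This would contradict the result (4.1-4). Thus $u$ satisfies the partial differential equation (4.1-1a). This result and (4.1-4) now imply that

\[
\int_{\partial\Omega \setminus \Gamma_e} v (b_n - \mathbf{n} \cdot k \nabla_x u) \, ds = 0 . \tag{4.1-5}
\]

If $b_n - \mathbf{n} \cdot k \nabla_x u$ were nonzero at some $x \in \partial\Omega \setminus \Gamma_e$, then it would be nonzero and have the same sign at all points in some open set $S_x \subset \partial\Omega \setminus \Gamma_e$. We could choose $v \in C_e^\infty(\Omega)$ to be zero outside $S_x$ and positive at $x$. This would contradict (4.1-5). Thus $\mathbf{n} \cdot k \nabla_x u = b_n$ at all points of $\partial\Omega \setminus \Gamma_e$. As a result, $u$ satisfies the natural boundary condition (4.1-1c). We have now shown that $u$ solves the problem (4.1-1). □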

The weak form (4.1-3) is far more suitable for our purposes than the original differential equation (4.1-1). We will be able to solve problems with fewer continuity assumptions on the coefficients in the differential equation, the inhomogeneities in the differential equation or boundary conditions, or the boundary of the problem domain. However, at this point we do not want to complicate the discussion by delving further into those issues.

Instead, we would like to relate the weak form to energy minimization principles.


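Lemma 4.1-3: Suppose that $\Omega \subset \mathbb{R}^d$ is open and $\Gamma_e \subset \partial\Omega$. Also assume that $k(x)$ is integrable on $\Omega$ and positive for all $x \in \Omega$, $f$ is integrable on $\Omega$, and $b_n$ is integrable on $\partial\Omega \setminus \Gamma_e$. Let

\[
\mathcal{E}(w) \equiv \frac{1}{2} \int_\Omega \nabla_x w \cdot k \nabla_x w \, dx - \int_\Omega w f \, dx - \int_{\partial\Omega \setminus \Gamma_e} w b_n \, ds . \tag{4.1-6}
\]

Then $u$ satisfies the weak form (4.1-3) and $\mathcal{E}(u)$ is finite if and only if $u$ minimizes $\mathcal{E}(w)$ over all measurable functions with $\mathcal{E}(w)$ finite.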

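Proof: Suppose that $u$ minimizes $\mathcal{E}(w)$ over all measurable functions for which $\mathcal{E}(w)$ is finite. Then for any $v \in C^\infty(\Omega)$ and any $\varepsilon$ we have

\[
\mathcal{E}(u) \le \mathcal{E}(u + v \varepsilon) = \mathcal{E}(u) + \varepsilon \left[ \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx - \int_\Omega v f \, dx - \int_{\partial\Omega \setminus \Gamma_e} v b_n \, ds \right] + \frac{\varepsilon^2}{2} \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx .
\]

If the terms inside the square brackets do not sum to zero (i.e., if $u$ does not satisfy the weak form (4.1-3)), then we can make $\varepsilon$ sufficiently small that the $\varepsilon^2$ term is small compared to the $\varepsilon$ term, and contradict the assumption that $u$ minimizes $\mathcal{E}$. It follows that a minimizer of $\mathcal{E}$ satisfies the weak form (4.1-3).

Next, suppose that $u$ satisfies the weak form (4.1-3) and $\mathcal{E}(u)$ is finite. Then for any $w$ such that $\mathcal{E}(w)$ is finite,

\begin{align*}
\mathcal{E}(w) - \mathcal{E}(u) &= \frac{1}{2} \int_\Omega \nabla_x w \cdot k \nabla_x w - \nabla_x u \cdot k \nabla_x u \, dx - \int_\Omega (w - u) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u) b_n \, ds \\
&= \int_\Omega \nabla_x (w - u) \cdot k \nabla_x u \, dx - \int_\Omega (w - u) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u) b_n \, ds \\
&\qquad + \frac{1}{2} \int_\Omega \nabla_x (w - u) \cdot k \nabla_x (w - u) \, dx \\
&\ge \int_\Omega \nabla_x (w - u) \cdot k \nabla_x u \, dx - \int_\Omega (w - u) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u) b_n \, ds .
\end{align*}

Given any $\varepsilon > 0$, we can approximate $w - u$ by a smooth function $v \in C^\infty$ so that

\[
\left| \int_\Omega \nabla_x (w - u - v) \cdot k \nabla_x u \, dx - \int_\Omega (w - u - v) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u - v) b_n \, ds \right| < \varepsilon .
\]

Then since $u$ satisfies the weak form (4.1-3),

\begin{align*}
\mathcal{E}(w) - \mathcal{E}(u) &\ge \int_\Omega \nabla_x (w - u) \cdot k \nabla_x u \, dx - \int_\Omega (w - u) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u) b_n \, ds \\
&= \left[ \int_\Omega \nabla_x (w - u - v) \cdot k \nabla_x u \, dx - \int_\Omega (w - u - v) f \, dx - \int_{\partial\Omega \setminus \Gamma_e} (w - u - v) b_n \, ds \right] \\
&\qquad + \left[ \int_\Omega \nabla_x v \cdot k \nabla_x u \, dx - \int_\Omega v f \, dx - \int_{\partial\Omega \setminus \Gamma_e} v b_n \, ds \right] > -\varepsilon .
\end{align*}

Since $\varepsilon$ is arbitrary, we must have $\mathcal{E}(w) - \mathcal{E}(u) \ge 0$. Thus $u$ minimizes $\mathcal{E}$. □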

We will discuss general developments of energy minimization in section ??. We will also discuss ways to convert the constrained minimization problem into an unconstrained extremum problem in section 5.14.

Example 4.1-4: Recall that we have previously discussed heat flow in section 1.2.1. Suppose that we have a thermally-conductive material, and we want to find the steady-state distribution of the temperature. We assume that Fourier's Law of Cooling applies, so that the flux of the energy is proportional to the temperature gradient: $\mathbf{f} = -k \nabla_x T$. The proportionality factor $k$ is called the thermal conductivity, and is positive. The steady-state form of conservation of energy gives us the equation $\nabla_x \cdot \mathbf{f} = q$, where $q$ represents internal steady-state sources or sinks of heat, and has the units of energy per volume per time. When we substitute the expressions for the energy and the energy flux, we obtain $\nabla_x \cdot (-k \nabla_x T) = q$.

Let us assume that the boundary of the material is insulated:

\[
0 = \mathbf{n} \cdot (-k \nabla_x T) \mbox{ on } \partial\Omega .
\]

We will multiply the steady-state equation by an arbitrary temperature variation $\delta T$, integrate over the domain, and obtain

\begin{align*}
\int_\Omega \delta T \, q \, dx &= \int_\Omega \delta T \, \nabla_x \cdot (-k \nabla_x T) \, dx = \int_{\partial\Omega} \delta T \, \mathbf{n} \cdot (-k \nabla_x T) \, ds + \int_\Omega (\nabla_x \delta T) \cdot k \nabla_x T \, dx \\
&= \int_\Omega (\nabla_x \delta T) \cdot k \nabla_x T \, dx .
\end{align*}

This equation is equivalent to the first-order condition for the minimum of

\[
\mathcal{E}(T) \equiv \frac{1}{2} \int_\Omega (\nabla_x T) \cdot k \nabla_x T \, dx - \int_\Omega T q \, dx .
\]

Here $\mathcal{E}$ has units of energy times degrees per time.

Example 4.1-5: Let us consider a different physical problem. For single-phase flow in porous media, Darcy's law states that the volumetric flow rate per cross-sectional area is

\[
\mathbf{v} = -\mathbf{K} (\nabla_x p) \frac{1}{\mu} .
\]

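Here $p$ is the fluid pressure, the viscosity $\mu$ is positive, and the permeability tensor $\mathbf{K}$ is assumed to be symmetric and positive-definite. For incompressible flow, we have

\[
\nabla_x \cdot \mathbf{v} = w .
\]

Here $w$ represents source terms, for example well rates. In other words, the pressure satisfies the elliptic equation

\[
-\nabla_x \cdot [\mathbf{K} (\nabla_x p) / \mu] = w .
\]

Suppose that we have no flow at the boundary of the domain:

\[
0 = \mathbf{n} \cdot \mathbf{v} = -\mathbf{n} \cdot \mathbf{K} (\nabla_x p) / \mu \mbox{ on } \partial\Omega .
\]

If we multiply the pressure equation by an arbitrary pressure variation $\delta p$ and integrate over the domain, we obtain

\begin{align*}
\int_\Omega \delta p \, w \, dx &= -\int_\Omega \delta p \, \nabla_x \cdot [\mathbf{K} (\nabla_x p) / \mu] \, dx = -\int_{\partial\Omega} \delta p \, \mathbf{n} \cdot \frac{\mathbf{K}}{\mu} \nabla_x p \, ds + \int_\Omega (\nabla_x \delta p) \cdot \frac{\mathbf{K}}{\mu} \nabla_x p \, dx \\
&= \int_\Omega (\nabla_x \delta p) \cdot \frac{\mathbf{K}}{\mu} \nabla_x p \, dx .
\end{align*}

This equation is equivalent to the first-order condition for the minimum of

\[
\mathcal{E}(p) \equiv \frac{1}{2} \int_\Omega (\nabla_x p) \cdot \frac{\mathbf{K}}{\mu} \nabla_x p \, dx - \int_\Omega p w \, dx .
\]

Here $\mathcal{E}$ has units of energy per time. The first integral represents the rate of work done by the system due to the flow, and the second term represents the rate of work done on the system by the wells.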

Example 4.1-6: For simplicity, we will consider a linearly elastic material, in which thestress is a linear function of the strain. Given the infinitesimal displacement vector u ∈ R3

as a function of space and time, we can define the deformation gradient F(u) = ∂u∂x and

the infinitesimal strain

E(u) =12[F + F>] =

12[∇xu> + (∇xu>)>] .


Both F and E are dimensionless. Next, given the constant bulk modulus κ and theconstant shear modulus we can define the Lame constant λ = κ−2µ/3 and the linearlyelastic stress

S(u) =[F(u) + F>(u)

]µ+ I tr [F(u)]λ .

The material moduli κ, µ and λ all have units of force per area, so the stress S has theseunits. In fact, the pressure in the material is the trace of the stress tensor. It is not hardto see that the stress is a linear function of the deformation gradient:

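$$
\begin{bmatrix} S_{11}\\ S_{21}\\ S_{31}\\ S_{12}\\ S_{22}\\ S_{32}\\ S_{13}\\ S_{23}\\ S_{33} \end{bmatrix}
=
\begin{bmatrix}
\lambda+2\mu & 0 & 0 & 0 & \lambda & 0 & 0 & 0 & \lambda\\
0 & \mu & 0 & \mu & 0 & 0 & 0 & 0 & 0\\
0 & 0 & \mu & 0 & 0 & 0 & \mu & 0 & 0\\
0 & \mu & 0 & \mu & 0 & 0 & 0 & 0 & 0\\
\lambda & 0 & 0 & 0 & \lambda+2\mu & 0 & 0 & 0 & \lambda\\
0 & 0 & 0 & 0 & 0 & \mu & 0 & \mu & 0\\
0 & 0 & \mu & 0 & 0 & 0 & \mu & 0 & 0\\
0 & 0 & 0 & 0 & 0 & \mu & 0 & \mu & 0\\
\lambda & 0 & 0 & 0 & \lambda & 0 & 0 & 0 & \lambda+2\mu
\end{bmatrix}
\begin{bmatrix} F_{11}\\ F_{21}\\ F_{31}\\ F_{12}\\ F_{22}\\ F_{32}\\ F_{13}\\ F_{23}\\ F_{33} \end{bmatrix}
\equiv K
\begin{bmatrix}
\partial u_1/\partial x_1\\ \partial u_2/\partial x_1\\ \partial u_3/\partial x_1\\
\partial u_1/\partial x_2\\ \partial u_2/\partial x_2\\ \partial u_3/\partial x_2\\
\partial u_1/\partial x_3\\ \partial u_2/\partial x_3\\ \partial u_3/\partial x_3
\end{bmatrix}
$$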

In this equation, it is easy to see that the matrix K has units of force per unit area. Newton’s second law says that the rate of change of momentum per volume is equal to the applied force per volume, which is a sum of the external force per volume f and the internal restoring force per volume ∇x · S:

$$
\frac{\partial}{\partial t}\left( \rho \frac{\partial u}{\partial t} \right) = f + \nabla_x \cdot S .
$$

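The steady-state equation is −∇x · S = f.

Suppose that the material is fixed around the boundary, so that the displacement is zero there. If we multiply this equation by an arbitrary variation of the displacement vector δu, assumed to be zero on ∂Ω, and integrate over the domain, we obtain

$$
\begin{aligned}
\int_\Omega \delta u \cdot f \, dx
&= -\int_\Omega (\nabla_x \cdot S) \cdot \delta u \, dx
 = -\int_{\partial\Omega} n \cdot S \, \delta u \, ds
   + \sum_k \int_\Omega (S e_k) \cdot \nabla_x (\delta u_k) \, dx \\
&= \int_\Omega \mathrm{tr}\left[ S \frac{\partial \delta u}{\partial x} \right] dx
 = \int_\Omega \mathrm{tr}\left[ S \, \frac{1}{2}\left( \frac{\partial \delta u}{\partial x} + \left( \frac{\partial \delta u}{\partial x} \right)^{\!\top} \right) \right] dx .
\end{aligned}
$$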

This weak formulation is equivalent to the first-order condition for the minimization of the functional

$$
E(u) = \frac{1}{2} \int_\Omega \mathrm{tr}[S(u) E(u)] \, dx - \int_\Omega u \cdot f \, dx .
$$

Since S has units of energy per volume, E has units of energy. The integral involving the stress is called the strain energy, and the integral involving the body load f is the potential energy. The steady-state displacement minimizes the total energy E.
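The 9 × 9 matrix K in Example 4.1-6 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the values chosen for κ and µ are illustrative only and are not taken from the text. It builds K column by column from the stress formula S = µ(F + F^T) + λ tr(F) I, using the same column-by-column ordering of the entries of F and S as in the display above.

```python
import numpy as np

def stress(F, lam, mu):
    """Linearly elastic stress S = mu (F + F^T) + lam tr(F) I."""
    return mu * (F + F.T) + lam * np.trace(F) * np.eye(3)

def stiffness_matrix(lam, mu):
    """9x9 matrix K mapping (F11, F21, F31, F12, ..., F33) to
    (S11, S21, S31, S12, ..., S33), built one column at a time."""
    K = np.zeros((9, 9))
    for col in range(9):
        F = np.zeros(9)
        F[col] = 1.0
        # order="F" stacks columns, matching the ordering in the text
        S = stress(F.reshape(3, 3, order="F"), lam, mu)
        K[:, col] = S.flatten(order="F")
    return K

kappa, mu = 5.0, 2.0              # illustrative moduli (assumed values)
lam = kappa - 2.0 * mu / 3.0      # Lame constant
K = stiffness_matrix(lam, mu)
print(K)
```

Because the stress is linear in F, applying the formula to each unit deformation gradient recovers one column of K, so no entry has to be typed by hand.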


4.1.2 Galerkin Methods

In order to approximate the solution of Laplace’s equation, we will make two assumptions.

Assumption 4.1-7: Given an open set Ω ⊂ R^d, Γe ⊂ ∂Ω, functions k(x) and f(x) for x ∈ Ω, and function bn(x) for x ∈ ∂Ω \ Γe, we can find a finite dimensional space Ṽ of functions continuous on Ω such that for all v ∈ Ṽ,

$$
E(v) = \frac{1}{2} \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx - \int_\Omega v f \, dx - \int_{\partial\Omega \setminus \Gamma_e} v \, b_n \, ds
$$

is finite and v(x) = 0 for all x ∈ Γe.

Assumption 4.1-8: Suppose that Ω ⊂ R^d is open, Γe ⊂ ∂Ω, and be is defined on Γe. Then we assume that there is a function b̃ defined on Ω so that E(b̃ + v) is finite for all v ∈ Ṽ.

(We have put a tilde over the space Ṽ and the function b̃ because it may be necessary later for one or both to satisfy the essential boundary condition approximately.) The Galerkin method for approximating the solution of the weak form (4.1-3) involves minimizing E over all functions of the form b̃ + v where v ∈ Ṽ, rather than over all functions for which E is finite.

Lemma 4.1-9: Suppose that we are given an open set Ω ⊂ R^d, a nonempty subset Γe ⊂ ∂Ω, functions k(x) and f(x) for x ∈ Ω, and function bn(x) for x ∈ ∂Ω \ Γe. Further, suppose that assumptions 4.1-7 and 4.1-8 are satisfied. Then there exists a unique function ũ ∈ b̃ + Ṽ satisfying the Galerkin equations

$$
\int_\Omega \nabla_x v \cdot k \nabla_x \tilde{u} \, dx = \int_\Omega v f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v \, b_n \, ds \quad \text{for all } v \in \tilde{V} . \tag{4.1-7}
$$

Further, ũ minimizes the energy E over b̃ + Ṽ, where E is defined by (4.1-6).

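Proof: Let v1, . . . , vn be a basis for Ṽ. Then the n × n matrix

$$
\mathbf{A}_{ij} = \int_\Omega \nabla_x v_i \cdot k \nabla_x v_j \, dx
$$

is obviously symmetric. If **u** ∈ R^n, then

$$
\mathbf{u}^\top \mathbf{A} \mathbf{u}
= \sum_{i=1}^n \sum_{j=1}^n \mathbf{u}_i \int_\Omega \nabla_x v_i \cdot k \nabla_x v_j \, dx \, \mathbf{u}_j
= \int_\Omega \nabla_x \Big( \sum_{i=1}^n v_i \mathbf{u}_i \Big) \cdot k \nabla_x \Big( \sum_{j=1}^n v_j \mathbf{u}_j \Big) dx \ge 0 .
$$

In fact, if

$$
0 = \mathbf{u}^\top \mathbf{A} \mathbf{u}
= \int_\Omega \nabla_x \Big( \sum_{i=1}^n v_i \mathbf{u}_i \Big) \cdot k \nabla_x \Big( \sum_{j=1}^n v_j \mathbf{u}_j \Big) dx
$$

then we must have ∇x(Σ_{i=1}^n v_i **u**_i) = 0 almost everywhere in Ω. This implies that Σ_{i=1}^n v_i **u**_i is constant almost everywhere in Ω. Since Γe is nonempty, and vi is zero on Γe for all i, it follows that this constant must be zero. Since the functions vi are linearly independent, this implies that **u** = 0. Thus **A** is symmetric positive-definite. If we define the n-vector

$$
\mathbf{b}_i = \int_\Omega v_i f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v_i \, b_n \, ds - \int_\Omega \nabla_x v_i \cdot k \nabla_x \tilde{b} \, dx
$$

then there is a unique solution **u** to **Au** = **b**. If we define ũ = b̃ + Σ_{j=1}^n v_j **u**_j, then

$$
\begin{aligned}
\int_\Omega \nabla_x v_i \cdot k \nabla_x \tilde{u} \, dx
&= \int_\Omega \nabla_x v_i \cdot k \nabla_x \tilde{b} \, dx + \sum_{j=1}^n \int_\Omega \nabla_x v_i \cdot k \nabla_x v_j \, dx \, \mathbf{u}_j \\
&= \int_\Omega \nabla_x v_i \cdot k \nabla_x \tilde{b} \, dx + \int_\Omega v_i f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v_i \, b_n \, ds - \int_\Omega \nabla_x v_i \cdot k \nabla_x \tilde{b} \, dx \\
&= \int_\Omega v_i f \, dx + \int_{\partial\Omega \setminus \Gamma_e} v_i \, b_n \, ds ,
\end{aligned}
$$

so ũ uniquely satisfies the Galerkin equations. Further, for all v ∈ Ṽ we have

$$
\begin{aligned}
E(\tilde{u} + v)
&= E(\tilde{u}) + \left[ \int_\Omega \nabla_x v \cdot k \nabla_x \tilde{u} \, dx - \int_\Omega v f \, dx - \int_{\partial\Omega \setminus \Gamma_e} v \, b_n \, ds \right] + \frac{1}{2} \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx \\
&= E(\tilde{u}) + \frac{1}{2} \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx \ge E(\tilde{u}) .
\end{aligned}
$$

This shows that ũ minimizes E over b̃ + Ṽ. □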
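The construction in this proof can be tried on a tiny problem. The sketch below is an illustration under assumptions that are not in the text: the model problem is −u'' = 1 on (0, 1) with u(0) = 0 and u'(1) = 0 (so k = 1, f = 1, bn = 0, b̃ = 0), and the Galerkin space is spanned by the monomials v_j(x) = x^j, j = 1, ..., n, which vanish at x = 0. The integrals are evaluated in closed form, and NumPy is assumed to be available.

```python
import numpy as np

def galerkin_monomials(n):
    """Galerkin system for -u'' = 1 on (0,1), u(0) = 0, u'(1) = 0,
    with the (assumed) basis v_j(x) = x**j, j = 1..n."""
    idx = np.arange(1, n + 1)
    # A_ij = int_0^1 (i x^(i-1)) (j x^(j-1)) dx = i j / (i + j - 1)
    A = np.outer(idx, idx) / (idx[:, None] + idx[None, :] - 1.0)
    # b_i = int_0^1 x^i dx = 1 / (i + 1)
    b = 1.0 / (idx + 1.0)
    return A, b

def energy(A, b, c):
    """E(v) = 1/2 c^T A c - b^T c for v = sum_j c_j v_j."""
    return 0.5 * c @ A @ c - b @ c

A, b = galerkin_monomials(3)
np.linalg.cholesky(A)             # raises LinAlgError unless A is positive-definite
coeffs = np.linalg.solve(A, b)    # coefficients of the Galerkin solution
print(coeffs, energy(A, b, coeffs))
```

For this particular choice the exact solution x − x²/2 lies in the Galerkin space, so the computed coefficients reproduce it; perturbing the coefficients in any direction increases the energy, as the last step of the proof predicts.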


Corollary 4.1-10: Suppose that we are given an open set Ω ⊂ R^d, a nonempty subset Γe ⊂ ∂Ω, functions k(x) and f(x) for x ∈ Ω, and function bn(x) for x ∈ ∂Ω \ Γe. Further, suppose that assumptions 4.1-7 and 4.1-8 are satisfied. If u solves the weak form (4.1-3) and ũ solves the Galerkin equations (4.1-7) then the energy E (see equation (4.1-6)) satisfies

$$
E(u) \le E(\tilde{u}) .
$$

Proof: This result is an immediate consequence of the fact that u minimizes E over all functions with finite energy satisfying the essential boundary conditions, and ũ minimizes E over a finite dimensional (affine) subspace of those functions. □

The next lemma tells us a little bit about the error in the Galerkin approximation.

Lemma 4.1-11: Suppose that we are given an open set Ω ⊂ R^d, a nonempty subset Γe ⊂ ∂Ω, functions k(x) and f(x) for x ∈ Ω, and function bn(x) for x ∈ ∂Ω \ Γe. Further, suppose that assumptions 4.1-7 and 4.1-8 are satisfied. Finally, suppose that u solves the weak form (4.1-3) and ũ solves the Galerkin equations (4.1-7). Then the energy in the error ũ − u is the smallest possible:

$$
E(\tilde{u} - u) \le E(w - u) \quad \text{for all } w \in \tilde{b} + \tilde{V} .
$$

Proof: If w ∈ b̃ + Ṽ, then w = ũ + v for some v ∈ Ṽ. Then the Galerkin equations (4.1-7) imply that

$$
\begin{aligned}
E(w - u) &= E([\tilde{u} - u] + v) \\
&= E(\tilde{u} - u) + \int_\Omega \nabla_x v \cdot k \nabla_x [\tilde{u} - u] \, dx - \int_\Omega v f \, dx - \int_{\partial\Omega \setminus \Gamma_e} v \, b_n \, ds + \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx \\
&= E(\tilde{u} - u) + \int_\Omega \nabla_x v \cdot k \nabla_x v \, dx \ge E(\tilde{u} - u) .
\end{aligned}
$$

□

4.2 Finite Element Examples

It is common to choose piecewise polynomials as the finite dimensional subspace Ṽ in the Galerkin method; in such a case, the procedure is called a finite element method. This choice allows us to confront a number of remaining issues in the Galerkin method. For example, we will need to choose basis functions for the finite dimensional space Ṽ of functions satisfying the homogeneous essential boundary condition, and develop a function b̃ satisfying the inhomogeneous essential boundary condition. We will also need to formulate the Galerkin equations as a linear system. This will involve the computation of various integrals, and storing the results in the appropriate matrix and vector entries.

We will illustrate the finite element approach via several examples.

4.2.1 Finite Elements in 1D

In one dimension, our example problem (4.1-1) takes the simpler form

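$$
\begin{aligned}
-\frac{d}{dx}\left( k \frac{du}{dx} \right) &= f \quad \text{in } (x_L, x_R) && \text{(4.2-1a)} \\
u(x_L) &= b_e && \text{(4.2-1b)} \\
k \frac{du}{dx}(x_R) &= b_n && \text{(4.2-1c)}
\end{aligned}
$$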

Here, the essential boundary condition is imposed on the left-hand boundary Γe = {xL}, and the natural boundary condition is imposed on the right-hand boundary. For simplicity, we will choose our Galerkin space Ṽ to consist of all piecewise-linear functions that are zero at x = xL.

4.2.1.1 Piecewise Linear Approximations

We will construct a basis for our piecewise linear functions Ṽ by mappings from canonical basis functions on the unit interval. Our canonical basis functions will be the entries of

$$
v_*(\xi) = \begin{bmatrix} 1 - \xi \\ \xi \end{bmatrix} .
$$

Note that each component of v∗(ξ) is linear, and each component equals one at exactly one endpoint of the unit interval and zero at the other.

Next, we choose a mesh

xL = x0 < x1 < . . . < xN = xR .

The intervals (x_n, x_{n+1}) are called the mesh elements and the points x_n are called the mesh nodes. We also define the element widths

$$
h_{\ell+1/2} = x_{\ell+1} - x_\ell ,
$$

and the element centers

$$
x_{\ell+1/2} = (x_{\ell+1} + x_\ell)/2 .
$$


Next, we define the linear transformations from the unit interval to the ℓth mesh element by

$$
x_\ell(\xi) = \begin{bmatrix} x_\ell & x_{\ell+1} \end{bmatrix} v_*(\xi) .
$$

The nodal basis functions will be chosen to be piecewise linear, one at a single mesh node, and zero at the other mesh nodes. Within element ℓ, only the nodal basis functions associated with x_ℓ and x_{ℓ+1} are nonzero. These two functions can be defined as follows:

$$
\begin{bmatrix} v_\ell(x) \\ v_{\ell+1}(x) \end{bmatrix} = v_*(\xi) = \begin{bmatrix} 1 - \xi \\ \xi \end{bmatrix} \quad \text{where } x = x_\ell(\xi) . \tag{4.2-2}
$$

Given a point x ∈ (x_ℓ, x_{ℓ+1}), we would have to solve x_ℓ(ξ) = x for ξ to evaluate the nodal basis functions v_ℓ(x) = 1 − ξ and v_{ℓ+1}(x) = ξ. Typically, we will need to compute integrals involving the nodal basis functions and their derivatives, and these integrals will be computed by coordinate transformations from the unit interval.
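Evaluating the nodal basis functions at a point amounts to locating the element that contains the point and inverting the linear map x_ℓ(ξ). The following is a minimal sketch of that step; the function name `nodal_basis_values` and the use of a sorted Python list for the mesh are assumptions made for illustration.

```python
import bisect

def nodal_basis_values(mesh, x):
    """Return (ell, v_ell(x), v_{ell+1}(x)) for a point x in the mesh.

    mesh is the increasing list of nodes x_0 < x_1 < ... < x_N.
    """
    if not mesh[0] <= x <= mesh[-1]:
        raise ValueError("x lies outside the mesh")
    # locate the element [x_ell, x_{ell+1}] containing x
    ell = min(bisect.bisect_right(mesh, x) - 1, len(mesh) - 2)
    h = mesh[ell + 1] - mesh[ell]          # element width h_{ell+1/2}
    xi = (x - mesh[ell]) / h               # solves x_ell(xi) = x for xi
    return ell, 1.0 - xi, xi

mesh = [0.0, 0.5, 2.0]
print(nodal_basis_values(mesh, 1.25))
```

The two returned values always sum to one, which reflects the fact that the nodal basis functions on an element form a partition of unity.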

It is easy to see that each v_n(x) is piecewise linear, since each is a composition of piecewise linear functions. It is also easy to see that the set {v_0, . . . , v_N} is linearly independent, since v_j(x_i) = δ_ij. Note that for 1 ≤ j ≤ N, v_j(x_L) = 0, so each of these basis functions satisfies the homogeneous Dirichlet boundary condition. Assuming that the functions f and k in the two-point boundary-value problem (4.2-1) are bounded, it is easy to see that each of the nodal basis functions has finite energy. We let Ṽ be the span of v_1, . . . , v_N.

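Note that the function b̃(x) = b_e v_0(x) satisfies the essential boundary condition. Our finite element approximation ũ(x) will take the form

$$
\tilde{u}(x) = \tilde{b}(x) + \sum_{j=1}^N v_j(x) \omega_j
$$

where the coefficients ω_j will be determined by the Galerkin equations

$$
\int_{x_L}^{x_R} \frac{dv_i}{dx} k \frac{d\tilde{u}}{dx} \, dx = \int_{x_L}^{x_R} v_i f \, dx + v_i(x_R) b_n .
$$

Thus the undetermined coefficients ω_j will be determined by solving the linear system **Au** = **b** where

$$
\mathbf{A}_{ij} = \int_{x_L}^{x_R} \frac{dv_i}{dx} k \frac{dv_j}{dx} \, dx , \quad
\mathbf{b}_i = \int_{x_L}^{x_R} v_i f \, dx + v_i(x_R) b_n - \int_{x_L}^{x_R} \frac{dv_i}{dx} k \frac{dv_0}{dx} \, dx \, b_e , \quad
\mathbf{u}_j = \omega_j .
$$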

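Note that **A** = **A**^T, so **A** ∈ R^{N×N} is symmetric. Also note that **A**_ij = 0 for |i − j| > 1; in other words, **A** is tridiagonal. Finally, note that for any **u** ∈ R^N,

$$
\mathbf{u}^\top \mathbf{A} \mathbf{u} = B\Big( \sum_{j=1}^N v_j \mathbf{u}_j , \sum_{j=1}^N v_j \mathbf{u}_j \Big) \ge 0 .
$$

Further, if

$$
0 = \mathbf{u}^\top \mathbf{A} \mathbf{u} = B\Big( \sum_{j=1}^N v_j \mathbf{u}_j , \sum_{j=1}^N v_j \mathbf{u}_j \Big) = \int_{x_L}^{x_R} k \left[ \frac{d}{dx} \sum_{j=1}^N v_j \mathbf{u}_j \right]^2 dx
$$

then we must have that Σ_{j=1}^N v_j **u**_j is constant. Since this function is zero at x = x_L, the constant must be zero. It follows that **A** is positive-definite.

4.2.1.2 Frontal Assembly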

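The entries of **A** and **b** can be assembled by computing integrals over elements:

$$
\begin{aligned}
\mathbf{A}_{ij} &= \int_{x_L}^{x_R} v_i' k v_j' \, dx = \sum_{\ell=0}^{N-1} \int_{x_\ell}^{x_{\ell+1}} v_i' k v_j' \, dx \\
\mathbf{b}_i &= \int_{x_L}^{x_R} v_i f \, dx = \sum_{\ell=0}^{N-1} \int_{x_\ell}^{x_{\ell+1}} v_i f \, dx
\end{aligned}
$$

In turn, the integrals in these sums are zero if ℓ ∉ {i − 1, i} or ℓ ∉ {j − 1, j}. Thus, to compute all terms in the matrix **A** that involve integrals in the ℓth element, we need only compute the symmetric part of the 2 × 2 matrix

$$
\begin{bmatrix}
A_{\ell+1/2}(v_\ell, v_\ell) & A_{\ell+1/2}(v_\ell, v_{\ell+1}) \\
A_{\ell+1/2}(v_{\ell+1}, v_\ell) & A_{\ell+1/2}(v_{\ell+1}, v_{\ell+1})
\end{bmatrix}
$$

and the array

$$
\begin{bmatrix}
\int_{x_\ell}^{x_{\ell+1}} v_\ell f \, dx \\
\int_{x_\ell}^{x_{\ell+1}} v_{\ell+1} f \, dx
\end{bmatrix} .
$$

There are several approaches to computing these arrays.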

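If k and f are constant, then we can compute the integrals exactly. We evaluate the integrals by a change of variables to a unit interval, so that we can compute the integrals in terms of the canonical basis function. In this case, we obtain element-wise contributions to the matrix **A** of the form

$$
\begin{aligned}
\begin{bmatrix}
A_{\ell+1/2}(v_\ell, v_\ell) & A_{\ell+1/2}(v_\ell, v_{\ell+1}) \\
A_{\ell+1/2}(v_{\ell+1}, v_\ell) & A_{\ell+1/2}(v_{\ell+1}, v_{\ell+1})
\end{bmatrix}
&= \int_{x_\ell}^{x_{\ell+1}} \begin{bmatrix} v_\ell' \\ v_{\ell+1}' \end{bmatrix} k \begin{bmatrix} v_\ell' & v_{\ell+1}' \end{bmatrix} dx \\
&= \int_0^1 \frac{dv_*(\xi)}{d\xi} \left( \frac{dx_\ell(\xi)}{d\xi} \right)^{-1} k(x_\ell(\xi)) \left( \frac{dx_\ell(\xi)}{d\xi} \right)^{-1} \left[ \frac{dv_*(\xi)}{d\xi} \right]^\top \left| \frac{dx_\ell(\xi)}{d\xi} \right| d\xi \\
&= \int_0^1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} \frac{k(x_\ell(\xi))}{h_{\ell+1/2}} \begin{bmatrix} -1 & 1 \end{bmatrix} d\xi
 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{k}{h_{\ell+1/2}} .
\end{aligned}
$$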

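We also have element-wise contributions to the right-hand side vector **b** of the form

$$
\begin{aligned}
\begin{bmatrix} b_{\ell+1/2}(v_\ell) \\ b_{\ell+1/2}(v_{\ell+1}) \end{bmatrix}
&= \begin{bmatrix} \int_{x_\ell}^{x_{\ell+1}} v_\ell f \, dx \\ \int_{x_\ell}^{x_{\ell+1}} v_{\ell+1} f \, dx \end{bmatrix}
 = \int_{x_\ell}^{x_{\ell+1}} \begin{bmatrix} v_\ell(x) \\ v_{\ell+1}(x) \end{bmatrix} f \, dx \\
&= \int_0^1 v_*(\xi) f(x_\ell(\xi)) \left| \frac{dx_\ell(\xi)}{d\xi} \right| d\xi
 = \int_0^1 \begin{bmatrix} 1 - \xi \\ \xi \end{bmatrix} d\xi \, h_{\ell+1/2} \, f
 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac{h_{\ell+1/2} f}{2} .
\end{aligned}
$$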

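The element-wise contributions to the matrix **A** and right-hand side **b** are then stored in the proper locations:

$$
\mathbf{A} = \begin{bmatrix}
\ddots & A_{\ell-1/2}(v_{\ell-1}, v_\ell) & & \\
\ddots & A_{\ell-1/2}(v_\ell, v_\ell) + A_{\ell+1/2}(v_\ell, v_\ell) & A_{\ell+1/2}(v_\ell, v_{\ell+1}) & \\
 & A_{\ell+1/2}(v_{\ell+1}, v_\ell) & A_{\ell+1/2}(v_{\ell+1}, v_{\ell+1}) + A_{\ell+3/2}(v_{\ell+1}, v_{\ell+1}) & \ddots \\
 & & A_{\ell+3/2}(v_{\ell+2}, v_{\ell+1}) & \ddots
\end{bmatrix}
$$

$$
\mathbf{b} = \mathbf{e}_N b_n + \begin{bmatrix}
\vdots \\
\cdots + b_{\ell-1/2}(v_{\ell-1}) \\
b_{\ell-1/2}(v_\ell) + b_{\ell+1/2}(v_\ell) \\
b_{\ell+1/2}(v_{\ell+1}) + \cdots \\
\vdots
\end{bmatrix}
$$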

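A more common approach is to use Gaussian quadrature, because it applies to variable coefficients k and f. In this example, a single Gauss quadrature point in each element is sufficient to preserve the accuracy of the finite element method; this corresponds to using the midpoint rule. The element-wise contributions to the matrix **A** are

$$
\begin{aligned}
\begin{bmatrix}
A_{\ell+1/2}(v_\ell, v_\ell) & A_{\ell+1/2}(v_\ell, v_{\ell+1}) \\
A_{\ell+1/2}(v_{\ell+1}, v_\ell) & A_{\ell+1/2}(v_{\ell+1}, v_{\ell+1})
\end{bmatrix}
&= \int_{x_\ell}^{x_{\ell+1}} \begin{bmatrix} v_\ell' \\ v_{\ell+1}' \end{bmatrix} k \begin{bmatrix} v_\ell' & v_{\ell+1}' \end{bmatrix} dx \\
&= \int_0^1 \frac{dv_*(\xi)}{d\xi} \left( \frac{dx_\ell(\xi)}{d\xi} \right)^{-1} k(x_\ell(\xi)) \left( \frac{dx_\ell(\xi)}{d\xi} \right)^{-1} \left[ \frac{dv_*(\xi)}{d\xi} \right]^\top \left| \frac{dx_\ell(\xi)}{d\xi} \right| d\xi \\
&\approx \frac{dv_*}{d\xi}\big(\tfrac12\big) \, \frac{1}{h_{\ell+1/2}} \, k\big(x_\ell(\tfrac12)\big) \, \frac{1}{h_{\ell+1/2}} \left[ \frac{dv_*}{d\xi}\big(\tfrac12\big) \right]^\top h_{\ell+1/2} \\
&= \begin{bmatrix} -1 \\ 1 \end{bmatrix} \frac{k(x_{\ell+1/2})}{h_{\ell+1/2}} \begin{bmatrix} -1 \\ 1 \end{bmatrix}^\top
 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{k(x_{\ell+1/2})}{h_{\ell+1/2}} .
\end{aligned}
$$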


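Similarly, the element-wise contributions to the right-hand side vector **b** are

$$
\begin{aligned}
\begin{bmatrix} b_{\ell+1/2}(v_\ell) \\ b_{\ell+1/2}(v_{\ell+1}) \end{bmatrix}
&= \int_{x_\ell}^{x_{\ell+1}} \begin{bmatrix} v_\ell \\ v_{\ell+1} \end{bmatrix} f \, dx
 = \int_0^1 v_*(\xi) f(x_\ell(\xi)) \left| \frac{dx_\ell(\xi)}{d\xi} \right| d\xi \\
&\approx v_*\big(\tfrac12\big) f\big(x_\ell(\tfrac12)\big) h_{\ell+1/2}
 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac{h_{\ell+1/2} f(x_{\ell+1/2})}{2} .
\end{aligned}
$$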
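The element-by-element assembly with midpoint quadrature can be sketched in a few lines. This is an illustration only, not the program described in the text: it assumes NumPy, stores the matrix densely, and uses a direct solve instead of conjugate gradients. The names `assemble_1d` and `solve_1d` are hypothetical.

```python
import numpy as np

def assemble_1d(nodes, k, f, be, bn):
    """Assemble A u = b for -(k u')' = f, u(x_L) = be, k u'(x_R) = bn,
    using piecewise linear elements and one-point (midpoint) quadrature."""
    N = len(nodes) - 1                       # number of elements
    A = np.zeros((N + 1, N + 1))             # temporarily includes node 0
    b = np.zeros(N + 1)
    for ell in range(N):
        h = nodes[ell + 1] - nodes[ell]      # element width h_{ell+1/2}
        xm = 0.5 * (nodes[ell] + nodes[ell + 1])   # element center
        A_loc = k(xm) / h * np.array([[1.0, -1.0], [-1.0, 1.0]])
        b_loc = 0.5 * h * f(xm) * np.ones(2)
        dofs = [ell, ell + 1]
        A[np.ix_(dofs, dofs)] += A_loc       # scatter into the global arrays
        b[dofs] += b_loc
    b[N] += bn                               # natural boundary condition
    # essential boundary condition: move the known value to the right-hand side
    return A[1:, 1:], b[1:] - A[1:, 0] * be

def solve_1d(nodes, k, f, be, bn):
    """Return the nodal values of the finite element approximation."""
    A, b = assemble_1d(nodes, k, f, be, bn)
    omega = np.linalg.solve(A, b)            # dense solve, for illustration only
    return np.concatenate(([be], omega))

nodes = np.linspace(0.0, 1.0, 9)
u = solve_1d(nodes, lambda x: 1.0, lambda x: 1.0, be=0.0, bn=0.0)
print(u)
```

With k = 1 and f = 1 the midpoint rule is exact, and the computed nodal values agree with the analytical solution x − x²/2 up to rounding error.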

A program to implement a finite element approximation to the solution of a two-point boundary value problem has been provided. This program consists of several pieces:

• Program 4.2-1: FiniteElementMain.C C++ main program for solving the two-point boundary value problem and plotting the results interactively.

• Program 4.2-2: finite_element.f Fortran 77 subroutines for computing the canonical basis functions, mesh, initial values, solution of linear system via conjugate gradients, and values of the finite element approximation.

• Program 4.2-3: input Input file for FiniteElementMain.C

• Program 4.2-4: perlmake Perl macros and subroutines to make executables.

If the input file sets nelements to a value greater than one, then the main program will solve a boundary value problem with that number of elements. If nelements is equal to one, then the main program will perform a mesh refinement study, beginning with 2 elements and refining repeatedly by a factor of 2.

The file finite_element.f contains several subroutines and functions. Subroutine canonical evaluates the Gaussian quadrature points and associated weights that are appropriate for the method, and then computes the canonical basis functions and their derivatives at the quadrature points. Function approximation evaluates the finite element approximation at a point x, once the linear system has been solved for the coefficients of the basis functions. Subroutine grid determines a uniform mesh, and sets the array of nodes for each element. Function solution computes the analytical solution of the boundary value problem at a point x. Subroutine mult multiplies the stiffness matrix times an arbitrary vector of coefficients of the basis functions, and sets the resulting vector to zero at Dirichlet boundary conditions. This annihilation is important for performing the conjugate gradient iteration: no change is made to the solution vector at Dirichlet boundary conditions during the conjugate gradient iteration. Subroutine initialize sets the flags for the Dirichlet boundary conditions, computes the coefficients of the differential operator at the Gaussian quadrature points, selects initial values for the unknown coefficients of the basis functions, and computes the initial residual. Subroutine pre is a preconditioner for conjugate gradients; in this case, the preconditioner is the identity. Subroutine precg is adapted from a program of the same name available at netlib.org; it performs preconditioned conjugate gradients to solve a linear system. In this case, the computation of the initial residual was performed before calling precg, because that calculation is done without use of Dirichlet boundary condition flags. In addition, some basic linear algebra subroutines were used to replace the corresponding loops in the original version of precg. Subroutine stopit is used by precg to determine when to terminate the iteration.

http://www.math.duke.edu/~johnt/math226/bvp2/FiniteElementMain.C

http://www.math.duke.edu/~johnt/math226/bvp2/finite_element.f

http://www.math.duke.edu/~johnt/math226/bvp2/input

http://www.math.duke.edu/~johnt/math226/bvp2/perlmake

Numerical results with this program are presented in figure 4.2-1. The results show that the L2 error is proportional to Δx², as is the L∞ error at the element boundaries.

(a) L2 error at Gauss quadrature pts. = O(Δx²) (b) L∞ error at mesh points = O(Δx²)

Figure 4.2-1: Errors in Continuous Piecewise Linear Finite Elements: log base 10 of errors versus log base 10 of number of basis functions

4.2.1.3 Finite Differences

It is important to note that the finite element equations can be implemented as if they were finite differences. For example, for 1 ≤ i < N the ith equation in the Galerkin equations is

$$
\mathbf{A}_{i,i-1} \mathbf{u}_{i-1} + \mathbf{A}_{i,i} \mathbf{u}_i + \mathbf{A}_{i,i+1} \mathbf{u}_{i+1} = \mathbf{b}_i .
$$


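If we use single-point Gaussian quadrature to compute the integrals in the finite element method, then for 1 ≤ i < N this equation can be rewritten as

$$
-\frac{k(x_{i-1/2})}{h_{i-1/2}} \mathbf{u}_{i-1}
+ \left[ \frac{k(x_{i-1/2})}{h_{i-1/2}} + \frac{k(x_{i+1/2})}{h_{i+1/2}} \right] \mathbf{u}_i
- \frac{k(x_{i+1/2})}{h_{i+1/2}} \mathbf{u}_{i+1}
= \frac{f(x_{i-1/2}) h_{i-1/2}}{2} + \frac{f(x_{i+1/2}) h_{i+1/2}}{2} .
$$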

For i = N, the corresponding Galerkin equation is

$$
-\frac{k(x_{N-1/2})}{h_{N-1/2}} \mathbf{u}_{N-1} + \frac{k(x_{N-1/2})}{h_{N-1/2}} \mathbf{u}_N = \frac{f(x_{N-1/2}) h_{N-1/2}}{2} + b_n .
$$

Note that the finite element method has discretized the Neumann boundary condition without the use of a half-cell at the right-hand boundary.
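The finite difference form lends itself to a matrix-free implementation with tridiagonal elimination. The following sketch is an assumption-laden illustration rather than the author's program: it takes a uniform mesh, applies the Dirichlet value u_0 = be by moving it to the right-hand side, and solves the resulting tridiagonal system directly. The function name `solve_fd` is hypothetical.

```python
def solve_fd(N, k, f, be, bn, xL=0.0, xR=1.0):
    """Solve the finite-difference form of the Galerkin equations on a
    uniform mesh with N elements by tridiagonal elimination."""
    h = (xR - xL) / N
    x = [xL + i * h for i in range(N + 1)]
    mid = [0.5 * (x[i] + x[i + 1]) for i in range(N)]
    kh = [k(xm) / h for xm in mid]            # k(x_{i+1/2}) / h_{i+1/2}
    fh = [0.5 * h * f(xm) for xm in mid]      # f(x_{i+1/2}) h_{i+1/2} / 2
    # equation i (i = 1..N) is stored at list index i - 1
    lower = [-kh[i - 1] for i in range(1, N + 1)]
    diag = [kh[i - 1] + (kh[i] if i < N else 0.0) for i in range(1, N + 1)]
    upper = [-kh[i] if i < N else 0.0 for i in range(1, N + 1)]
    rhs = [fh[i - 1] + (fh[i] if i < N else bn) for i in range(1, N + 1)]
    rhs[0] -= lower[0] * be                   # known Dirichlet value u_0 = be
    for i in range(1, N):                     # forward elimination
        m = lower[i] / diag[i - 1]
        diag[i] -= m * upper[i - 1]
        rhs[i] -= m * rhs[i - 1]
    u = [0.0] * N                             # back substitution
    u[-1] = rhs[-1] / diag[-1]
    for i in range(N - 2, -1, -1):
        u[i] = (rhs[i] - upper[i] * u[i + 1]) / diag[i]
    return x, [be] + u

x, u = solve_fd(8, lambda s: 1.0, lambda s: 1.0, be=0.0, bn=0.0)
print(u)
```

The last equation uses only the flux data bn and the final element, which is the point of the remark above: no half-cell is needed at the Neumann boundary.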

Exercises 4.2.1

1. Consider the two-point boundary-value problem

$$
-\frac{d}{dx}\left( \frac{du}{dx} \right) = \pi^2 \cos(\pi x) , \quad 0 < x < 1
$$

$$
u(0) = 1 , \quad u(1) = -1 .
$$

(a) Find the analytical solution of this problem.

(b) Program the finite element method for this problem.

(c) Plot the log of the error in the solution at the mesh points versus the log of the number of basis functions, for 2^n elements, 1 ≤ n ≤ 10. What is the slope of this curve (i.e. the order of convergence)?

(d) Plot the log of the error in the derivative of the solution at the mesh points versus the log of the number of basis functions, for 2^n elements, 1 ≤ n ≤ 10. Note that there are two values for the derivative at each mesh point, associated with either of the two elements containing the mesh point. What is the slope of these curves (i.e. the order of convergence)?

2. Consider the two-point boundary-value problem

$$
-\frac{d}{dx}\left( p \frac{du}{dx} \right) + r u = f , \quad 0 < x < 1
$$

$$
u(0) = 0 , \quad u(1) = 0 .
$$

Suppose that f(x) is a Dirac delta function associated with some point ξ ∈ (0, 1).

(a) If p(x) ≡ 1 and r(x) ≡ 0, find the analytical solution of this problem.

(b) Describe the finite element method for this problem, and the corresponding finite difference equations.

(c) Suppose that ξ = 1/2, and consider uniform meshes with an even number of elements. Program the finite element method for this problem, and plot the log of the error in the solution at the mesh points versus the log of the number of basis functions.

(d) Suppose that ξ = 1/2, and consider uniform meshes with an odd number of elements. Program the finite element method for this problem, and plot the log of the error in the solution at the mesh points versus the log of the number of basis functions.

Exercises 4.2.1

1. Determine the finite difference equations for the two-point boundary-value problem (5.3-2), arising from the finite element method using continuous linear basis functions and midpoint rule quadrature.

2. Repeat the previous exercise for the trapezoidal rule.

3. Program the finite element method for problem (5.3-2). Take p(x) = 1, r(x) = 0 and f(x) = 1. Plot the log of the error in the solution at the element midpoints versus the number of nodal basis functions as you refine the mesh. Also plot the error in the solution at the natural boundary condition.

4. The one-dimensional beam bending problem [88] takes the form

$$
\frac{d^2}{dx^2}\left( p(x) \frac{d^2 u}{dx^2} \right) = f(x) , \quad 0 < x < L
$$

$$
u(0) = 0 = u(L)
$$

$$
\frac{d^2 u}{dx^2}(0) = 0 = \frac{d^2 u}{dx^2}(L)
$$

Here u(x) is the displacement of the beam, p(x) is the flexural rigidity, f(x) is the applied load, and L is the length of the beam. The boundary conditions correspond to zero displacement at the ends of the beam. Determine the weak form of the beam bending problem. Which of the boundary conditions are essential, and which are natural?

5. Bessel’s equation is

$$
\frac{d}{dx}\left( x \frac{du}{dx} \right) + x u = 0
$$

On the half-line x ≥ 0 it has two solutions, J0(x) and Y0(x). The former satisfies J0(0) = 1 and J0′(0) = 0, while the latter satisfies Y0(0) = ∞. The former has an infinite number of real positive zeros, the first of which is α + 1 ≈ 2.4048.

(a) Describe the weak form of Bessel’s equation with specified boundary values.

(b) What are natural boundary conditions for this equation?

(c) Describe the finite element method for this problem on a general interval 0 < x < L, using specified values at the boundaries, piecewise linear basis functions and midpoint rule quadrature.

(d) Program this finite element method for Bessel’s equation on the interval 0 < x < 1 with u(0) = 1 and u(1) = 0. Perform a mesh refinement study.

(e) Program this finite element method for Bessel’s equation on the interval 0 < x < 2.4048 with u(0) = 1 and u(2.4048) = 0. Perform a mesh refinement study.

4.2.2 Triangular Finite Elements

Next, let us develop an example of finite elements in two dimensions. Suppose that Ω ⊂ R^2 is the rectangle

$$
\Omega = \{ (x_1, x_2) : 0 < x_1 < 3 , \; 0 < x_2 < 2 \} .
$$

Let us generate a simple triangular mesh, as in figure 4.2-2. It is common to generate an element-node array (or list) to describe the mesh. In this data structure, we make an array or list of all nodes associated with each element. For our simple mesh example, the element-node array is given by Table 4.2-1. Note that in the element-node list, the nodes for any individual element are listed in counter-clockwise order, beginning with the node having smallest index.

In order to further simplify the example, we will assume that Γe = {(x1, x2) : x1 = 0 or x2 = 0}, so that there are Dirichlet boundary conditions on the bottom and left, and Neumann boundary conditions on the top and right. We will choose our finite element space Ṽ to consist of all piecewise-linear functions that are zero on Γe.

Figure 4.2-2: Triangular Mesh (12 triangular elements, 12 nodes)

4.2.2.1 Shape Functions


element   nodes
   1      1  2  5
   2      2  5  6
   3      2  3  6
   4      3  6  7
   5      3  4  7
   6      4  7  8
   7      5  6  9
   8      6  9 10
   9      6  7 10
  10      7 10 11
  11      7  8 11
  12      8 11 12

Table 4.2-1: Element-Node Array for Triangular Mesh Example

In order to construct a basis for our piecewise linear functions Ṽ, we will define canonical basis functions on the canonical element

$$
T_* \equiv \{ (\xi_1, \xi_2) : 0 \le \xi_1 , \xi_2 \le 1 \text{ and } \xi_1 + \xi_2 \le 1 \} .
$$

Next, we define the canonical shape functions

$$
v_*(\xi) \equiv \begin{bmatrix} \xi_1 \\ \xi_2 \\ 1 - \xi_1 - \xi_2 \end{bmatrix} .
$$

Note that each component of v∗ is linear, and

$$
v_*(1, 0) = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} , \quad
v_*(0, 1) = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} , \quad
v_*(0, 0) = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} .
$$

Thus each canonical shape function is nonzero at exactly one vertex of the canonical element T∗. Further, the components of v∗ are associated with the vertices in counter-clockwise order around the boundary of T∗, beginning with the vertex (1, 0).

In the Deal.II code, the basic properties of the various element shapes can be found in GeometryInfo.H. The form of the data for an individual element depends on the number of spatial dimensions; these are found in Line.H, Quad.H and Hexahedron.H. The actual data for an individual element can be found in struct CellData in Tria.H. This information is taken from the global data stored in TriangulationLevel.H, which depends on the level of mesh refinement. The array of mesh information for various levels of refinement is stored in class Triangulation, which can be found in Tria.H. The actual generation of grids is problem-dependent. Deal.II provides a couple of basic alternatives in GridGenerator.H.

In Deal.II, the shape functions are defined in a variety of classes derived from class FiniteElement in FiniteElement.H. This class is derived from FiniteElementBase, which stores several tables for the shape functions, and is in turn derived from FiniteElementData, which stores basic information about the shape function on an individual element. The most common shape functions are implemented in FE_Q.H.

http://www.math.duke.edu/~johnt/math226/deal/base/GeometryInfo.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Line.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Quad.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Hexahedron.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Tria.H

http://www.math.duke.edu/~johnt/math226/deal/grid/TriangulationLevel.H

The space Ṽ will consist of all continuous piecewise-linear functions on the triangular elements that are zero on Γe. A basis for these functions will be associated with node numbers 6, 7, 8, 10, 11 and 12. For example, the shape function for node 6 is chosen to be one at node 6, zero at all other nodes, linear on each triangle in the mesh and continuous over Ω.

In order to describe the nodal basis functions, we will construct a linear transformation from the canonical element to each triangle in the mesh. Given an arbitrary element T_ℓ in our triangular mesh, let

$$
X_\ell \equiv \begin{bmatrix} x_{n_1} & x_{n_2} & x_{n_3} \end{bmatrix} \in \mathbb{R}^{2 \times 3}
$$

be the array of coordinates of the three vertices of T_ℓ. Here the indices n1, n2, n3 are the node indices in the element-to-node array for element ℓ in table 4.2-1. Then the linear transformation x_ℓ : T∗ → T_ℓ mapping the canonical element to this particular element is

$$
x_\ell(\xi) = X_\ell v_*(\xi) .
$$

The nodal basis functions associated with the vertices of a mesh element T_ℓ are defined as follows:

$$
v_\ell(x_\ell(\xi)) = v_*(\xi) . \tag{4.2-3}
$$

Given a point x ∈ T_ℓ we would have to solve x = x_ℓ(ξ) for ξ to compute the nodal basis functions associated with T_ℓ. Typically, we will need to compute integrals involving the nodal basis functions and their derivatives, and these integrals will be computed by coordinate transformations from the canonical element. The canonical element, and the mapping to an arbitrary mesh triangle, are depicted in figure 4.2-3.
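The mapping x_ℓ(ξ) = X_ℓ v∗(ξ) and its inverse can be sketched as follows. This is an illustration assuming NumPy; the function names are hypothetical. The inverse uses the identity x_ℓ(ξ) = x_{n3} + [x_{n1} − x_{n3}, x_{n2} − x_{n3}] ξ, which follows directly from the definition of v∗.

```python
import numpy as np

def canonical_shape(xi):
    """Canonical shape functions v*(xi) on the reference triangle."""
    return np.array([xi[0], xi[1], 1.0 - xi[0] - xi[1]])

def map_to_element(X, xi):
    """x_ell(xi) = X_ell v*(xi), with X the 2x3 array of vertex coordinates."""
    return X @ canonical_shape(xi)

def nodal_basis_at(X, x):
    """Solve x = x_ell(xi) for xi, then return v_ell(x) = v*(xi)."""
    J = np.column_stack((X[:, 0] - X[:, 2], X[:, 1] - X[:, 2]))
    xi = np.linalg.solve(J, np.asarray(x, dtype=float) - X[:, 2])
    return canonical_shape(xi)

# vertices (1,0), (0,1), (1,1), stored as the columns of X
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
print(map_to_element(X, (1.0 / 3.0, 1.0 / 3.0)))   # centroid of the triangle
print(nodal_basis_at(X, (2.0 / 3.0, 2.0 / 3.0)))   # each basis function is 1/3 there
```

Each nodal basis function equals one at its own vertex and zero at the other two, so `nodal_basis_at` applied to a vertex returns a unit vector.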

As in the one-dimensional case, it is easy to see that each of the nodal basis functions is piecewise linear, since each is a composition of piecewise linear functions. It is easy to see that the set of nodal basis functions is linearly independent, since v_i(x_j) = δ_ij. Note that the nodal basis functions for nodes 6, 7, 8, 10, 11 and 12 are zero on the bottom and left sides of the domain Ω, so they satisfy the homogeneous Dirichlet boundary condition. Assuming that the functions f and k in the partial differential equation (4.1-1) are bounded, it is easy to see that each of the nodal basis functions has finite energy.

In Deal.II, the mappings from the canonical element to individual elements are described by class Mapping in Mapping.H. The most common choice of mapping can be found in MappingQ1.H.

http://www.math.duke.edu/~johnt/math226/deal/grid/Tria.H

http://www.math.duke.edu/~johnt/math226/deal/grid/GridGenerator.H

http://www.math.duke.edu/~johnt/math226/deal/dofs/FiniteElement.H

http://www.math.duke.edu/~johnt/math226/deal/dofs/FiniteElementBase.H

http://www.math.duke.edu/~johnt/math226/deal/dofs/FiniteElementData.H

http://www.math.duke.edu/~johnt/math226/deal/fe/FE_Q.H

http://www.math.duke.edu/~johnt/math226/deal/dofs/Mapping.H

Figure 4.2-3: Canonical Element and Transformation to Mesh Element

4.2.2.2 Boundary Conditions

Since the numerical solution ũ involves possibly non-zero boundary values, we will write it in the form

$$
\tilde{u}(x) = \tilde{b}(x) + \sum_{n \in \{6,7,8,10,11,12\}} v_n(x) \omega_n , \tag{4.2-4a}
$$

where the function approximately satisfying the Dirichlet boundary values is

$$
\tilde{b}(x) = \sum_{n \in \{1,2,3,4,5,9\}} v_n(x) b_e(x_n) . \tag{4.2-4b}
$$

Here b_e(x_n) is the value of the Dirichlet boundary data at boundary node x_n. Note that this expression for ũ has the form ũ = b̃ + w, where w ∈ Ṽ and b̃ approximately satisfies the inhomogeneous Dirichlet boundary condition at the boundary nodes.

Deal.II uses class Boundary in Boundary.H to represent the effects of the boundary on the grid elements. Some example boundary shapes are represented in BoundaryLib.H. Constraints on shape functions due to boundary conditions are handled in FiniteElementBase.H.

http://www.math.duke.edu/~johnt/math226/deal/dofs/MappingQ1.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Boundary.H

http://www.math.duke.edu/~johnt/math226/deal/grid/BoundaryLib.H

4.2.2.3 Frontal Assembly

If we combine the Galerkin equations (4.1-7) with the piecewise linear finite element approximation (4.2-4) we obtain a linear system of the form

$$
\begin{aligned}
\sum_{n \in \{6,7,8,10,11,12\}} \int_\Omega \nabla_x v_m(x) \cdot k(x) \nabla_x v_n(x) \, dx \; \omega_n
&= \int_\Omega v_m(x) f(x) \, dx + \int_{\partial\Omega \setminus \Gamma_e} v_m(x) b_n(x) \, ds \\
&\quad - \sum_{n \in \{1,2,3,4,5,9\}} \int_\Omega \nabla_x v_m(x) \cdot k(x) \nabla_x v_n(x) \, b_e(x_n) \, dx
\end{aligned}
\tag{4.2-5}
$$

for each m ∈ {6, 7, 8, 10, 11, 12}.
Each of these integrals will be decomposed into a sum of integrals over the individualelements:

∫Ω∇xVm(x) · k(x)∇xVn(x) dx =

∑`

∫T`

∇xVm(x) · ∇xVn(x) dx ,∫Ωvm(x)f(x) dx =

∑`

∫T`

vm(x)f(x) dx .∫∂Ω\Γe

vm(x)bn(x) ds =∑`

∫∂T`∩(∂Ω\Γe)

vm(x)bn(x) ds .

Note that every (nonzero) integral of the form∫T`∇xvm(x) · k(x)∇xvn(x)dx will contribute

to row m and column n of the stiffness matrix. These integrals can be nonzero only ifboth nodes m and n are nodes of T`. Similarly, every (nonzero) integral of the form∫T`Vm(x)f(x) dx will contribute to entry m of the right-hand side. These integrals can

be nonzero only if node m is a node of T`. Finally, every (nonzero) integral of the form∫∂T`∩(∂Ω\Γe)

vm(x)bn(x) ds will contribute to entry m of the right-hand side. These integralscan be nonzero only if the boundary of T` intersects the right or top boundary of Ω andnode m lies in that intersection of boundaries.

Now we can perform some elementary calculus to compute the contributions to the stiffness matrix. First, we note that the Jacobian of the coordinate transformation from the canonical element T∗ to the mesh element T_ℓ with node number list N_ℓ = {n1, n2, n3} is

$$
J_\ell = \frac{\partial x_\ell}{\partial \xi} = X_\ell \frac{\partial v_*}{\partial \xi}
= X_\ell \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix}
= \begin{bmatrix} x_{n_1} - x_{n_3} , & x_{n_2} - x_{n_3} \end{bmatrix} .
$$

Note that |J_ℓ| = 2|T_ℓ| is twice the area of the triangle T_ℓ. Using the well-known properties of the determinant, we can then show that

$$
\begin{aligned}
\det J_\ell &= \det [x_1 - x_3 , \; x_2 - x_3] \\
&= \det [(x_1 - x_3) - (x_2 - x_3) , \; x_2 - x_3] = \det [x_1 - x_2 , \; x_2 - x_3] \\
&= -\det [x_1 - x_2 , \; x_3 - x_2] = \det [x_3 - x_2 , \; x_1 - x_2] \\
&= \det [x_1 - x_3 , \; x_2 - x_3 - (x_1 - x_3)] = \det [x_1 - x_3 , \; x_2 - x_1] \\
&= -\det [x_3 - x_1 , \; x_2 - x_1] = \det [x_2 - x_1 , \; x_3 - x_1] .
\end{aligned}
$$

In other words, the determinant det J_ℓ = |J_ℓ| does not depend on which node is listed first, so long as the cyclic order of the nodes is preserved. It is important here that |J_ℓ| > 0; this requires that the nodes be entered into the element-to-node array in counter-clockwise order around the perimeter of T_ℓ. Other than this requirement, there are no restrictions on the ordering.
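The determinant identities above are easy to confirm numerically. The following sketch, assuming NumPy and using hypothetical function names, computes the Jacobian for one triangle and compares the result for a cyclic permutation of the vertices and for an interchange of two vertices.

```python
import numpy as np

def jacobian(X):
    """J_ell = [x_n1 - x_n3, x_n2 - x_n3] for the 2x3 vertex array X."""
    return np.column_stack((X[:, 0] - X[:, 2], X[:, 1] - X[:, 2]))

def signed_area(X):
    """Half the Jacobian determinant: positive for counter-clockwise vertices."""
    return 0.5 * np.linalg.det(jacobian(X))

X = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])      # vertices (0,0), (1,0), (0,1), counter-clockwise
cyclic = X[:, [1, 2, 0]]             # cyclic permutation of the vertices
swapped = X[:, [1, 0, 2]]            # two vertices interchanged (clockwise order)

print(signed_area(X), signed_area(cyclic), signed_area(swapped))
```

A negative value signals that an element's nodes were listed clockwise, which is a convenient check to run on an element-node array before assembly.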

Let v_ℓ, defined by equation (4.2-3), be the array of nodal shape functions associated with the element T_ℓ. We can use the definition (4.2-3) of these nodal shape functions in terms of the canonical basis functions to compute the array of gradients of v_ℓ:

$$
\frac{\partial v_\ell}{\partial x} = \frac{\partial v_*}{\partial \xi} \frac{\partial \xi}{\partial x}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix} J_\ell^{-1} .
$$

Note that this array is constant, which proves our earlier claim that the shape functions v_n(x) are piecewise linear.

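The array of stiffness matrix integrals associated with element T_ℓ is

$$
\begin{aligned}
\int_{T_\ell} \frac{\partial v_\ell}{\partial x} k \left( \frac{\partial v_\ell}{\partial x} \right)^\top dx
&= \int_{T_*} \left( \frac{\partial v_*}{\partial \xi} J_\ell^{-1} \right) k(x_\ell(\xi)) \left( \frac{\partial v_*}{\partial \xi} J_\ell^{-1} \right)^\top |J_\ell| \, d\xi \\
&= \frac{\partial v_*}{\partial \xi} J_\ell^{-1} \int_{T_*} k(x_\ell(\xi)) \, d\xi \; |J_\ell| \left( \frac{\partial v_*}{\partial \xi} J_\ell^{-1} \right)^\top
\end{aligned}
$$

In these computations, we took special advantage of the fact that ∂v∗/∂ξ and J_ℓ are constant, so that we could pull these terms outside the integrals. If k is constant, we can compute the integral involving it exactly:

$$
\int_{T_*} k(x_\ell(\xi)) \, d\xi = k/2 .
$$

If k is not constant, we can use a quadrature rule, such as

$$
\int_{T_*} k(x_\ell(\xi)) \, d\xi \approx k\left( x_\ell\left( \tfrac13 , \tfrac13 \right) \right) / 2 .
$$

This quadrature rule approximates the integral by the value of the function at the centroid of the triangle times the area of the triangle.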
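The element stiffness array with the centroid rule can be sketched as below. This is an illustration assuming NumPy; the name `element_stiffness` is hypothetical, and the absolute value of the determinant is used so that the sketch also tolerates clockwise vertex ordering.

```python
import numpy as np

DV_DXI = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [-1.0, -1.0]])    # d v*/d xi, constant on the canonical element

def element_stiffness(X, k):
    """3x3 element stiffness array for the triangle with 2x3 vertex array X,
    using the one-point centroid rule for the coefficient k(x)."""
    J = X @ DV_DXI                            # Jacobian of x_ell(xi)
    grads = DV_DXI @ np.linalg.inv(J)         # row i is the gradient of shape function i
    centroid = X @ np.full(3, 1.0 / 3.0)      # x_ell(1/3, 1/3)
    k_integral = 0.5 * k(centroid)            # approximates the integral of k over T*
    return (grads @ grads.T) * k_integral * abs(np.linalg.det(J))

X = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])               # unit right triangle
A_loc = element_stiffness(X, lambda x: 1.0)
print(A_loc)
```

For the unit right triangle with k = 1 the result has a diagonal entry of 1 at the right-angle vertex and 1/2 at the other two, and every row sums to zero because the three shape functions add up to a constant.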

In Deal.II, the arrays of values of the shape functions and their derivatives are stored in class FEValues, which can be found in FEValues.H. The integrals are approximated by quadrature rules, which are described by class Quadrature in Quadrature.H. Some common quadrature rules are implemented in derived classes, found in QuadratureLib.H.

There are three different kinds of contributions to the right-hand side in equation (4.2-5). The third kind involves integrals over the elements, of the same form as in the stiffness matrix, so we will not discuss these further. The first kind can be computed from the components of the integrals

$$
\int_{T_\ell} v_\ell(x) f(x) \, dx = \int_{T_*} v_*(\xi) f(x_\ell(\xi)) |J_\ell| \, d\xi
$$

If f is constant, we can compute the integral involving it exactly:

$$
\int_{T_*} v_*(\xi) f(x_\ell(\xi)) \, d\xi = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \frac{f}{6} .
$$

If f is not constant, we can use a quadrature rule, such as

$$
\int_{T_*} v_*(\xi) f(x_\ell(\xi)) \, d\xi \approx v_*\left( \tfrac13 , \tfrac13 \right) f\left( x_\ell\left( \tfrac13 , \tfrac13 \right) \right) \frac{1}{2}
= \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \frac{f\left( x_\ell\left( \tfrac13 , \tfrac13 \right) \right)}{6} .
$$

Note that in this quadrature rule, f is evaluated at the centroid of the triangle T_ℓ.

The remaining contribution to the right-hand side in (4.2-5) involves components of the integral

$$
\int_{\partial T_\ell \cap (\partial\Omega \setminus \Gamma_e)} v_m(x) b_n(x) \, ds .
$$

If T_ℓ lies next to the right or top boundary of Ω, then this integral can be rewritten as a sum over select sides of T_ℓ.

http://www.math.duke.edu/~johnt/math226/deal/dofs/FEValues.H

http://www.math.duke.edu/~johnt/math226/deal/base/Quadrature.H

http://www.math.duke.edu/~johnt/math226/deal/base/QuadratureLib.H

4.2.2.4 Stiffness Matrix Assembly

This manner of evaluating the integrals allows us to re-order the calculations by elements:

    loop over elements T_ℓ, with nodes n_1, n_2, n_3
        compute volume(T_ℓ) = |T_ℓ|
        loop over quadrature points x_q within T_ℓ
            compute element volume times quadrature weight |T_ℓ| w_q
            loop over canonical shape functions v*_i, i = 1, 2, 3
                loop over canonical shape functions v*_j, j = 1, 2, 3
                    compute the contribution to entry i, j of the local stiffness matrix
                    store the result in entry n_i, n_j of the global stiffness matrix

In any case, the three innermost loops are usually very short. For piecewise linear basis functions, there is one quadrature point, and the number of basis functions per element is one plus the number of dimensions. Computers employing pipelining or vector processing may not perform well on such short loops.
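The element-by-element loop can be sketched as follows. This is a minimal illustration, not the Deal.II code: the name `assemble`, the dense storage, the single centroid quadrature point with weight 1, and the omission of Dirichlet constraints and boundary integrals are all simplifying assumptions.

```python
import numpy as np

# Reference gradients of the canonical shape functions (xi_1, xi_2, 1 - xi_1 - xi_2).
DV_REF = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])


def assemble(nodes, elements, k, f):
    """Assemble a dense stiffness matrix and load vector for linear triangles.

    nodes:    n x 2 array of coordinates.
    elements: iterable of node-index triples (n1, n2, n3).
    k, f:     callables evaluated at the quadrature point.
    """
    nodes = np.asarray(nodes, dtype=float)
    n = len(nodes)
    K = np.zeros((n, n))
    rhs = np.zeros(n)
    for tri in elements:                                   # loop over elements
        verts = nodes[list(tri)]
        J = np.column_stack((verts[0] - verts[2], verts[1] - verts[2]))
        area = 0.5 * abs(np.linalg.det(J))                 # |T_l|
        grads = DV_REF @ np.linalg.inv(J)                  # constant gradients
        xq, wq = verts.mean(axis=0), 1.0                   # one quadrature point
        for i in range(3):                                 # loop over shape functions
            # every linear shape function equals 1/3 at the centroid
            rhs[tri[i]] += area * wq * f(xq) / 3.0
            for j in range(3):
                local_ij = area * wq * k(xq) * (grads[i] @ grads[j])
                K[tri[i], tri[j]] += local_ij              # scatter to global entry
    return K, rhs


# Unit square split into two triangles.
square_nodes = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
square_elements = [(0, 1, 2), (0, 2, 3)]
K_sq, rhs_sq = assemble(square_nodes, square_elements, lambda x: 1.0, lambda x: 1.0)
```

A production code would store the matrix in a sparse format and would eliminate the constrained nodes; the dense array is used here only to keep the scatter step visible.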

It is interesting to note that neither the stiffness matrix nor the right-hand side will use any integrals involving element 1. This is because the Dirichlet boundary conditions constrain all three nodes of this element. This causes a poor approximation to the solution of the partial differential equation, due to our poor treatment of one of the corners in the domain. On the other hand, the entries of the stiffness matrix in the rows for nodes 6 and 7 will involve contributions from six elements, namely 2, 3, 4, 7, 8 and 9.

One feature of the Deal.II code makes it difficult for students to read: the use of iterators and accessors. Such a programming style is favored by C++ purists [85, Chapter 19], because it allows general access to information without requiring the user to know the form of the data. Additional discussion of iterators and accessors in the standard C++ templates can be found in [50, Chapter 8]. This style often comes at the cost of lower execution speed. In Deal.II, the class TriaIterator and related classes in TriaIterator.H are designed to increment and decrement grid cells (as well as faces and vertices), and to provide de-references to instances of class TriaAccessor. The class TriaAccessor and related classes in TriaAccessor.H provide ways to pull information out of a TriangulationLevel, which is where the global information for the grid is stored. An example of the use of these iterators and accessors can be found in the implementation of Triangulation::create_triangulation in file Tria.C.

Since the assembly of the stiffness matrix is problem-dependent, Deal.II leaves this to the user. An example of the assembly of the stiffness matrix for the Laplace problem can be found in procedure LaplaceProblem::assemble_system in LaplaceProblem.C.

4.2.2.5 Linear System Solution

http://www.math.duke.edu/~johnt/math226/deal/grid/TriaIterator.H

http://www.math.duke.edu/~johnt/math226/deal/grid/TriaAccessor.H

http://www.math.duke.edu/~johnt/math226/deal/grid/Tria.C

http://www.math.duke.edu/~johnt/math226/deal/examples/LaplaceProblem.C

After the stiffness matrix and right-hand side are assembled, we have a 6 × 6 stiffness matrix and 6 entries of the right-hand side vector. We solve this linear system for the unknown values of the solution at nodes 6, 7, 8, 10, 11 and 12.

Deal.II provides a number of linear system solvers, such as Richardson, Conjugate Gradients, GMRES, and multigrid. Most of the examples use conjugate gradients, and none of the examples use multigrid.
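For readers who have not seen the conjugate gradient iteration written out, here is a minimal unpreconditioned sketch for a symmetric positive-definite matrix. It is not the Deal.II solver class; the function name, the stopping rule, and the default iteration cap are assumptions of this sketch.

```python
import numpy as np


def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A (textbook CG iteration)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    rr = r @ r
    if max_iter is None:
        max_iter = 10 * len(b)         # assumed cap, generous for small systems
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * max(1.0, np.linalg.norm(b)):
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)          # step length along p
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p      # next A-conjugate direction
        rr = rr_new
    return x


# A 6 x 6 model system, the same size as the one described in the text.
A_model = 2.0 * np.eye(6) - np.diag(np.ones(5), 1) - np.diag(np.ones(5), -1)
x_cg = conjugate_gradients(A_model, np.ones(6))
```

In exact arithmetic the iteration terminates in at most as many steps as there are unknowns, so a system this small is solved almost immediately.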

4.3 Elliptic Equations

The physical problems in examples 4.1-4 to 4.1-6 suggest that we examine more general problems. In this section, we will review the theory concerning the solution of more general elliptic partial differential equations.

We will assume that the reader is familiar with some basic notions. For example, the reader should understand the meaning of an open set, as well as its complement, closure and boundary. We expect the reader to know that by definition, bounded subsets of $R^d$ are contained in a ball of finite radius, and that a subset of $R^d$ is compact if and only if it is closed and bounded. The reader should also know what it means for a set to be connected. These issues are discussed in books on mathematical topology, such as [56]. Most of these issues are also discussed in books on functional analysis, such as [59, 72, 93].

We assume that the reader understands the meaning of the support of a function. For functions $f : R^d \to R^m$, the reader should know that $f \in C^\infty$ means that $D^\alpha f$ is continuous for all multi-indices $\alpha$, and that $f \in C_0^\infty$ means that $f \in C^\infty$ has compact support. The reader should be familiar with the Lebesgue integral. These issues are discussed in [59, 69, 73, 93].

4.3.1 Multi-Indices

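Definition 4.3-1: A multi-index is any vector $\alpha$ with $d$ nonnegative integer components. If $\alpha$ is a multi-index, then $|\alpha| \equiv \sum_{i=1}^d \alpha_i$ and $\alpha! \equiv \prod_{i=1}^d \alpha_i!$. If $x \in R^d$, then $x^\alpha \equiv \prod_{i=1}^d x_i^{\alpha_i}$. If $\alpha$ and $\beta$ are multi-indices then $\alpha^\beta \equiv \prod_{i=1}^d \alpha_i^{\beta_i}$, and $\beta \le \alpha$ if and only if for all $i$ we have $\beta_i \le \alpha_i$. If $\beta \le \alpha$, then

\[
\binom{\alpha}{\beta} = \frac{\alpha!}{\beta! \, (\alpha - \beta)!} .
\]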

Example 4.3-2: Using multi-indices, the binomial expansion can be written in the form

\[
(x + y)^\alpha = \sum_{\beta \le \alpha} \binom{\alpha}{\beta} x^{\alpha - \beta} y^\beta .
\]

http://www.math.duke.edu/~johnt/math226/deal/solver/SolverRichardson.H

http://www.math.duke.edu/~johnt/math226/deal/solver/SolverCG.H

http://www.math.duke.edu/~johnt/math226/deal/solver/SolverGMRES.H

Example 4.3-3: It is easy to write the multi-dimensional form of Taylor's formula by using multi-indices. Let the vector differential operator $D$ be defined by $D_i = \partial / \partial x_i$. Then for any multi-index $\alpha$,

\[
D^\alpha u = \prod_{i=1}^d D_i^{\alpha_i} u = \frac{\partial^{\alpha_1}}{\partial x_1^{\alpha_1}} \cdots \frac{\partial^{\alpha_d} u}{\partial x_d^{\alpha_d}} .
\]

Suppose that $f : R^d \to R^m$. Then Taylor's formula can be written

\[
f(x + y) = \sum_{|\alpha| < k} \frac{1}{\alpha!} y^\alpha D^\alpha f(x) + k \sum_{|\alpha| = k} \frac{1}{\alpha!} y^\alpha \int_0^1 s^{k-1} D^\alpha f(x + y[1 - s]) \, ds . \qquad (4.3\text{-}1)
\]
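The multi-index operations are easy to experiment with. The sketch below represents a multi-index as a tuple of nonnegative integers; all function names are hypothetical. It checks the binomial expansion numerically in the form $\sum_{\beta \le \alpha} \binom{\alpha}{\beta} x^{\alpha-\beta} y^\beta$.

```python
from itertools import product
from math import factorial, prod


def order(alpha):
    """|alpha| = sum of the components."""
    return sum(alpha)


def mi_factorial(alpha):
    """alpha! = product of the component factorials."""
    return prod(factorial(a) for a in alpha)


def power(x, alpha):
    """x^alpha = product of x_i ** alpha_i."""
    return prod(xi ** ai for xi, ai in zip(x, alpha))


def binomial(alpha, beta):
    """Multi-index binomial coefficient; requires beta <= alpha componentwise."""
    diff = tuple(a - b for a, b in zip(alpha, beta))
    return mi_factorial(alpha) // (mi_factorial(beta) * mi_factorial(diff))


def binomial_expansion(x, y, alpha):
    """Sum over all beta <= alpha of binom(alpha, beta) x^(alpha-beta) y^beta."""
    total = 0
    for beta in product(*(range(a + 1) for a in alpha)):
        diff = tuple(a - b for a, b in zip(alpha, beta))
        total += binomial(alpha, beta) * power(x, diff) * power(y, beta)
    return total


alpha = (2, 3)
x, y = (2, 3), (5, 7)
lhs = power(tuple(a + b for a, b in zip(x, y)), alpha)   # (x + y)^alpha
rhs = binomial_expansion(x, y, alpha)
```

With integer inputs both sides are computed exactly, so they can be compared with `==`.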

4.3.2 Elliptic Differential Operators

For further discussion of the following material, see Agmon [3] or Lions and Magenes [60].

Definition 4.3-4: Suppose that $\Omega \subset R^d$ is open, that $\alpha \in Z^d$ is a multi-index, that $\ell$ is a positive integer, and that for all $|\alpha| \le \ell$, $a_\alpha : \Omega \to R$. Define the differential operator

\[
A(x, D) \equiv \sum_{|\alpha| \le \ell} a_\alpha(x) D^\alpha . \qquad (4.3\text{-}2)
\]

Then the principal part of $A$ is the polynomial $p_A(x, \xi) \equiv \sum_{|\alpha| = \ell} a_\alpha(x) \xi^\alpha$ generated by the highest-order derivative terms in $A$. The differential operator $A$ is elliptic at $x \in \Omega$ iff $p_A(x, \xi) \ne 0$ for all $\xi \ne 0$. If $\Omega$ is bounded, the differential operator $A$ is uniformly elliptic if and only if there exist constants $0 < \underline{C} \le \overline{C}$ such that for all $\xi \in R^d$ and for all $x \in \Omega$,

\[
\underline{C} |\xi|^\ell \le |p_A(x, \xi)| \le \overline{C} |\xi|^\ell .
\]

Note that if $A$ is an elliptic operator of order $\ell$ at $x \in \Omega$, then $\ell$ is even. This is because odd-order polynomials always have at least one real root.

Example 4.3-5: The Laplacian $Au \equiv -\nabla_x \cdot \nabla_x u$ has principal part $p_A(\xi) = -\|\xi\|_2^2$. Since $p_A(\xi) = 0 \Longrightarrow \xi = 0$, the Laplacian is elliptic. It is easy to see that the Laplacian is also uniformly elliptic, with $\underline{C} = 1 = \overline{C}$.

Example 4.3-6: The wave operator $Au \equiv \frac{\partial^2 u}{\partial x_0^2} - \nabla_x \cdot \nabla_x u$ has principal part $p_A(\xi) \equiv \xi_0^2 - \sum_{j=1}^d \xi_j^2$. Since $p_A(1, 1, 0, \ldots, 0) = 0$, the wave operator is not elliptic.
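The two principal parts can be compared numerically. The sketch below evaluates each symbol on random unit vectors; the function names and the sampling strategy are assumptions of this illustration. Sampling can only suggest ellipticity, whereas a single nonzero root, such as the one exhibited for the wave operator, disproves it.

```python
import numpy as np


def laplacian_symbol(xi):
    """Principal part of -div grad: p_A(xi) = -|xi|^2."""
    xi = np.asarray(xi, dtype=float)
    return -np.sum(xi ** 2)


def wave_symbol(xi):
    """Principal part of the wave operator: xi_0^2 - sum_{j>=1} xi_j^2."""
    xi = np.asarray(xi, dtype=float)
    return xi[0] ** 2 - np.sum(xi[1:] ** 2)


def min_abs_symbol_on_sphere(symbol, dim, samples=2000, seed=0):
    """Smallest |p_A(xi)| over random unit vectors xi (evidence, not proof)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(size=(samples, dim))
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)
    return min(abs(symbol(v)) for v in xi)


laplace_min = min_abs_symbol_on_sphere(laplacian_symbol, dim=3)
wave_root = wave_symbol([1.0, 1.0, 0.0])
```

For the Laplacian the sampled minimum equals 1 up to rounding, consistent with the uniform ellipticity bounds; for the wave operator the vector (1, 1, 0) is a nonzero root of the principal part.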

Definition 4.3-7: If $\Omega \subset R^d$ is open and bounded, the differential operator $A(x, D)$ of order $\ell$ defined in equation (4.3-2) is $k$-smooth if and only if for all $|\alpha| > \ell - k$, $a_\alpha \in C^{|\alpha| - \ell + k}(\Omega)$ and for all $|\alpha| \le \ell - k$ there exists $C_\alpha > 0$ so that for all $x \in \Omega$ we have $|a_\alpha(x)| \le C_\alpha$.


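Definition 4.3-8: If $\Omega \subset R^d$ is open and bounded, the adjoint of the differential operator $A(x, D)$ of order $\ell$ defined in equation (4.3-2) is $A^*$, defined by

\[
\forall u \in C^\ell(\Omega) \ \forall \phi \in C_0^\infty(\Omega), \quad \int_\Omega (A^* \phi) u \, dx \equiv (A^* \phi, u) = (\phi, Au) \equiv \int_\Omega \phi (Au) \, dx .
\]

Lemma 4.3-9: If $\Omega \subset R^d$ is open and bounded, and $A(x, D) \equiv \sum_{|\alpha| \le \ell} a_\alpha(x) D^\alpha$ is $\ell$-smooth, then for all $\phi \in C^\ell(\Omega)$ the adjoint satisfies

\[
A^*(x, D) \phi = \sum_{|\alpha| \le \ell} (-1)^{|\alpha|} D^\alpha (a_\alpha \phi) .
\]

Proof: Use integration by parts. □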

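Example 4.3-10: Let us compute the adjoint of the Laplacian. Suppose that $u, \delta u \in C_0^\infty(\Omega)$. Then

\[
\begin{aligned}
\int_\Omega \delta u (\nabla_x \cdot \nabla_x u) \, dx
&= \int_{\partial\Omega} \delta u (n \cdot \nabla_x u) \, ds - \int_\Omega (\nabla_x \delta u) \cdot (\nabla_x u) \, dx \\
&= \int_{\partial\Omega} \delta u (n \cdot \nabla_x u) \, ds - \int_{\partial\Omega} (n \cdot \nabla_x \delta u) u \, ds + \int_\Omega (\nabla_x \cdot \nabla_x \delta u) u \, dx \\
&= \int_\Omega (\nabla_x \cdot \nabla_x \delta u) u \, dx .
\end{aligned}
\]

Thus the adjoint of the Laplacian is the Laplacian.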

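Lemma 4.3-11: (Green's Formula) Suppose that $\Omega \subset R^d$, the operator $A(x, D)$ is of order $\ell$ and is $k$-smooth for some $k \ge \ell$, and $\partial\Omega$ is $C^{\ell-1}$. Then for $0 \le j < \ell$ there are linear differential operators $N_j$ such that

\[
\forall u, \delta u \in C^\infty(\Omega), \quad (A^* \delta u, u) - (\delta u, Au) = \sum_{j=0}^{\ell-1} \int_{\partial\Omega} (N_{\ell-j-1} \delta u) \frac{\partial^j u}{\partial n^j} \, ds .
\]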

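Example 4.3-12: Let us compute Green's formula for the Laplacian. Suppose that $u, \delta u \in C^\infty(\Omega)$. Then

\[
\begin{aligned}
\int_\Omega \delta u (\nabla_x \cdot \nabla_x u) \, dx
&= \int_{\partial\Omega} \delta u (n \cdot \nabla_x u) \, ds - \int_\Omega (\nabla_x \delta u) \cdot (\nabla_x u) \, dx \\
&= \int_{\partial\Omega} \delta u (n \cdot \nabla_x u) \, ds - \int_{\partial\Omega} (n \cdot \nabla_x \delta u) u \, ds + \int_\Omega (\nabla_x \cdot \nabla_x \delta u) u \, dx .
\end{aligned}
\]

It follows that

\[
\begin{aligned}
(A^* \delta u, u) - (\delta u, Au)
&\equiv \int_\Omega (\nabla_x \cdot \nabla_x \delta u) u \, dx - \int_\Omega \delta u (\nabla_x \cdot \nabla_x u) \, dx \\
&= \int_{\partial\Omega} (n \cdot \nabla_x \delta u) u \, ds - \int_{\partial\Omega} \delta u (n \cdot \nabla_x u) \, ds \\
&\equiv \int_{\partial\Omega} N_1(\delta u) u \, ds + \int_{\partial\Omega} N_0(\delta u) \frac{\partial u}{\partial n} \, ds .
\end{aligned}
\]

Thus $N_1 = n \cdot \nabla_x = \frac{\partial}{\partial n}$, and, matching the sign of the second boundary term, $N_0$ is the negative of the identity operator.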
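Green's formula for the Laplacian can be checked exactly in one dimension, where $\Omega = (a, b)$, the outer normal is $-1$ at $a$ and $+1$ at $b$, and the identity reads $\int_a^b (\delta u'' \, u - \delta u \, u'') \, dx = [\delta u' \, u - \delta u \, u']_a^b$. The sketch below uses NumPy's polynomial class so that derivatives and integrals are exact up to rounding; the helper name and the test polynomials are arbitrary choices for this illustration.

```python
from numpy.polynomial import Polynomial


def green_defect(du, u, a=0.0, b=1.0):
    """Difference between the volume and boundary sides of Green's formula in 1D.

    du, u: numpy Polynomial objects playing the roles of delta-u and u.
    """
    # Volume side: integral over (a, b) of (du'' u - du u'').
    antiderivative = (du.deriv(2) * u - du * u.deriv(2)).integ()
    volume = antiderivative(b) - antiderivative(a)
    # Boundary side: (du' u - du u') evaluated at b minus its value at a,
    # which accounts for the outer normals +1 at b and -1 at a.
    boundary_fn = du.deriv() * u - du * u.deriv()
    boundary = boundary_fn(b) - boundary_fn(a)
    return volume - boundary


u_poly = Polynomial([1.0, 2.0, 0.0, 3.0])
du_poly = Polynomial([0.0, 1.0, -1.0, 2.0, 0.5])
defect = green_defect(du_poly, u_poly)
```

The defect vanishes for any pair of polynomials, since the integrand is the derivative of the boundary expression.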

4.3.3 Dirichlet Problems for Elliptic Operators

Definition 4.3-13: Suppose that $\Omega \subset R^d$ is open, that $m > 0$, and that $A(x, D)$ is a strongly elliptic operator of order $2m$ in $\Omega$. Without loss of generality, assume that $A$ has been normalized so that for all $\xi \ne 0$ the principal part $p_A(x, \xi)$ of $A$ satisfies $(-1)^m p_A(x, \xi) > 0$. Suppose that we are given functions $f : \Omega \to R$ and $b_j : \partial\Omega \to R$ for $0 \le j < m$, and suppose that there is a function $b(x) \in C^{m-1}(\Omega)$ so that for all $0 \le j < m$ and for all $x \in \partial\Omega$, $\frac{\partial^j b}{\partial n^j}(x) = b_j(x)$. Then the Dirichlet problem is to solve

\[
\begin{aligned}
A(x, D) u &= f \quad \forall x \in \Omega , \\
\frac{\partial^j u}{\partial n^j} &= b_j \quad \forall\, 0 \le j < m \ \ \forall x \in \partial\Omega .
\end{aligned}
\]

Note that the assumption that the function $b$ exists is a restriction on both the boundary values $b_j$ and the boundary $\partial\Omega$.

If $u$ solves the Dirichlet problem, then Green's formula shows that for all $\delta u \in C^\infty(\Omega)$,

\[
(A^* \delta u, u) - (\delta u, f) = \sum_{j=0}^{2m-1} \int_{\partial\Omega} (N_{2m-j-1} \delta u) b_j \, ds = (A^* \delta u, b)_{W_2^0(\Omega)} - (\delta u, Ab)_{W_2^0(\Omega)} .
\]

This implies that for all $\delta u \in C^\infty(\Omega)$,

\[
(A^* \delta u, u - b) = (\delta u, f) - (\delta u, Ab) .
\]

This is the approach in distribution theory.

Rather than integrate by parts $2m$ times to transfer all derivatives from $u$ to $\delta u$, let us integrate by parts $m$ times. In order to avoid contributions from boundary terms, we assume that the test functions satisfy $\frac{\partial^j \delta u}{\partial n^j} = 0$ for all $0 \le j < m$. If $u$ solves the Dirichlet problem and $\delta u$ has zero normal derivatives up to order $m - 1$, then integration by parts will determine coefficients $a_{\alpha\beta}$ so that

\[
(\delta u, Au) = \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha\beta} \, D^\alpha \delta u \, D^\beta u \, dx \equiv \mathcal{A}(\delta u, u) . \qquad (4.3\text{-}3)
\]

We call $\mathcal{A}(\delta u, u)$ a Dirichlet bilinear form corresponding to the elliptic operator $A$.

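Definition 4.3-14: Suppose that $\Omega \subset R^d$ is open, and that $b$ has $m - 1$ continuous derivatives on $\Omega$. Let the bilinear form $\mathcal{A}$ be given by

\[
\mathcal{A}(\delta u, u) = \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha\beta} \, D^\alpha \delta u \, D^\beta u \, dx .
\]

If $u$ has $m - 1$ continuous derivatives on $\Omega$, then $u$ is a generalized solution of the Dirichlet problem if and only if

\[
\begin{aligned}
\mathcal{A}(\delta u, u) &= (\delta u, f) \quad \forall \delta u \in C_0^\infty(\Omega) , \\
\frac{\partial^j (u - b)}{\partial n^j} &= 0 \quad \forall\, 0 \le j < m \text{ and } \forall x \in \partial\Omega .
\end{aligned}
\]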

Definition 4.3-15: The bilinear form $\mathcal{A}$, given by (4.3-3), is uniformly strongly elliptic if and only if there exists a positive constant $E$ such that for all $\xi \in R^d$ and for all $x \in \Omega$ we have

\[
\sum_{|\alpha| = m = |\beta|} a_{\alpha\beta}(x) \xi^{\alpha + \beta} \ge E |\xi|^{2m} . \qquad (4.3\text{-}4)
\]

The ellipticity constant of $\mathcal{A}$ is the largest constant $E$ for which the uniformly strongly elliptic inequality (4.3-4) is valid. The bilinear form $\mathcal{A}(w, u)$ corresponding to the differential operator $A$ of order $2m$ is symmetric if and only if for all $u, w \in C^\infty(\Omega)$ we have $\mathcal{A}(w, u) = \mathcal{A}(u, w)$.

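Example 4.3-16: The Dirichlet problem for the Laplacian is

\[
\begin{aligned}
-\nabla_x \cdot \nabla_x u &= f \quad \forall x \in \Omega , \\
u &= b \quad \forall x \in \partial\Omega ,
\end{aligned}
\]

and the Dirichlet bilinear form for the Laplacian is $\mathcal{A}(\delta u, u) = \int_\Omega \nabla_x \delta u \cdot \nabla_x u \, dx$. The function $u$ is a generalized solution of the Dirichlet problem for the Laplacian if and only if for all $\delta u \in C_0^\infty(\Omega)$ we have $\mathcal{A}(\delta u, u) = (\delta u, f)$ and for all $x \in \partial\Omega$ we have $u(x) = b(x)$. The ellipticity constant for the Laplacian is $E = 1$.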

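Other kinds of boundary conditions are also possible. For example, the problem

\[
\begin{aligned}
\forall x \in \Omega , \quad & -\nabla_x \cdot \nabla_x u(x) + a(x) u(x) = f(x) \\
\forall x \in \partial\Omega , \quad & n \cdot \nabla_x u(x) + a(x) u(x) = b(x)
\end{aligned}
\]

has the following generalized form: find $u \in C^2(\Omega)$ so that for all $\delta u \in C^\infty(\Omega)$

\[
B(\delta u, u) \equiv \int_\Omega (\nabla_x \delta u) \cdot (\nabla_x u) \, dx + \int_{\partial\Omega} \delta u \, a \, u \, ds = \int_\Omega \delta u \, f \, dx + \int_{\partial\Omega} \delta u \, b \, ds \equiv \lambda(\delta u) .
\]

In this case, no a priori conditions are imposed on boundary values of the solution $u$ or its variation $\delta u$; instead, the boundary condition is enforced by the weak formulation. For this reason, the boundary conditions involving the normal derivative are called natural boundary conditions.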

4.4 Elliptic Regularity

Before we can discuss the convergence of finite element methods for elliptic boundary value problems, it will be useful to discuss the behavior of the analytical solutions. In particular, we would like to know how the smoothness of the solution depends on the smoothness of the coefficients in the differential equation, the smoothness of the inhomogeneities (the right-hand side of the differential equation and boundary conditions), and the smoothness of the boundary of the domain. These are difficult issues, especially for three-dimensional domains with corners. We will present some of the basic ideas here, and direct the reader to other sources for details.

4.4.1 Function Norms

In this section, we will quickly review some basic notions from functional analysis. We begin with the following technical definition.

Definition 4.4-1: If $E \subset [-\infty, \infty]$, then the infimum of $E$ is the greatest lower bound on members of $E$, and the supremum of $E$ is the least upper bound. If $\Omega \subset R^d$ and $g : \Omega \to R_+$ is measurable with Lebesgue measure $\mu$, let

\[
S = \{ \alpha \in R_+ : \mu(g^{-1}((\alpha, \infty])) = 0 \} .
\]

If $S \ne \emptyset$ then the essential supremum of $g$ is

\[
\operatorname{ess\,sup}_\Omega (g) \equiv \inf(S) ;
\]

if $S = \emptyset$ then $\operatorname{ess\,sup}_\Omega (g) \equiv \infty$.

This definition allows us to define the following norms on functions, and related normed linear spaces.

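Definition 4.4-2: If $\Omega \subset R^d$ and $f : \Omega \to R^m$ has Lebesgue-integrable $p$-norm, then

\[
\|f\|_{L^p(\Omega)} \equiv \left[ \int_\Omega \|f(x)\|_p^p \, dx \right]^{1/p}
\quad \text{and} \quad
\|f\|_{L^\infty(\Omega)} \equiv \operatorname{ess\,sup}_\Omega (\|f(x)\|_\infty) .
\]

If $\Omega \subset R^d$ is open, the Lebesgue spaces for $p \in [1, \infty]$ are

\[
L^p(\Omega) = \{ f : \Omega \to R^m \text{ with } \|f\|_{L^p(\Omega)} < \infty \} .
\]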

Lemma 4.4-3: (Minkowski's Inequality) If $p \in [1, \infty]$, $\Omega \subset R^d$ is open and $f, g \in L^p(\Omega)$, then $f + g \in L^p(\Omega)$ and the triangle inequality is satisfied:

\[
\|f + g\|_{L^p(\Omega)} \le \|f\|_{L^p(\Omega)} + \|g\|_{L^p(\Omega)} .
\]

Proof: See, for example, [73, p. 62]. □

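Lemma 4.4-4: (Hölder's Inequality) If $p \in [1, \infty]$, $q = p/(p - 1)$, $f \in L^p(\Omega)$ and $g \in L^q(\Omega)$, then $fg \in L^1(\Omega)$ and

\[
\left| \int_\Omega f g \, dx \right| \le \|f\|_{L^p(\Omega)} \|g\|_{L^q(\Omega)} .
\]

Some special cases of Hölder's inequality are commonly used: if $p = 1$ then

\[
\left| \int_\Omega f g \, dx \right| \le \|f\|_{L^1(\Omega)} \|g\|_{L^\infty(\Omega)} ;
\]

if $p = 2$ then we have Schwarz's inequality

\[
\left| \int_\Omega f g \, dx \right| \le \|f\|_{L^2(\Omega)} \|g\|_{L^2(\Omega)} .
\]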
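A quick numerical experiment can make Hölder's inequality concrete. The sketch below replaces the integrals over $\Omega = (0, 1)$ by midpoint-rule sums; the function names, the sample functions and the grid size are assumptions of this illustration. Because the weighted sums themselves satisfy the discrete Hölder inequality, the computed gap is nonnegative up to rounding for every conjugate pair.

```python
import numpy as np


def lp_norm(values, weights, p):
    """Weighted discrete L^p norm; p = inf gives the maximum absolute value."""
    if np.isinf(p):
        return np.max(np.abs(values))
    return (weights @ np.abs(values) ** p) ** (1.0 / p)


def holder_gap(f, g, p, n=2000):
    """Right side minus left side of Hölder's inequality on (0, 1), midpoint rule."""
    x = (np.arange(n) + 0.5) / n
    w = np.full(n, 1.0 / n)
    if p == 1:
        q = np.inf
    elif np.isinf(p):
        q = 1.0
    else:
        q = p / (p - 1.0)
    lhs = abs(w @ (f(x) * g(x)))
    rhs = lp_norm(f(x), w, p) * lp_norm(g(x), w, q)
    return rhs - lhs


gaps = [holder_gap(lambda x: np.sin(3 * x) + x, lambda x: np.exp(-x), p)
        for p in (1, 1.5, 2, 4, np.inf)]
```

Taking $g$ proportional to $f$ with $p = 2$ makes the gap vanish, which is the equality case of Schwarz's inequality.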

Definition 4.4-5: If $\Omega \subset R^d$ is open, the set of locally integrable functions in $\Omega$ is

\[
L^1_{loc}(\Omega) = \{ f : \Omega \to R^m \text{ such that } \forall \text{ compact } K \subset \Omega , \ f \in L^1(K) \} .
\]

This definition allows us to consider functions with bad behavior near the boundary of $\Omega$, such as functions with boundary layers or singularities on the boundary.

4.4.2 Function Spaces

Definition 4.4-6: If $U$ is a real linear space with a norm $\| \cdot \|$, then a sequence $\{u_k\} \subset U$ is a Cauchy sequence if and only if for all $\varepsilon > 0$ there is an integer $n > 0$ so that for all $j, k > n$ we have $\|u_j - u_k\| < \varepsilon$. The normed real linear space $B$ is a Banach space if and only if every Cauchy sequence in $B$ converges to a member of $B$. Alternatively, we say that the normed real linear space $B$ is complete if and only if all Cauchy sequences in $B$ converge to a member of $B$.

Lemma 4.4-7: If $\Omega \subset R^d$ is an open measurable set and $1 \le p \le \infty$, then $L^p(\Omega)$ is a Banach space.

Proof: See, for example, [71, p. 66] or [93, p. 53]. □


Definition 4.4-8: A Hilbert space is a Banach space with norm induced by an inner product. If $H$ is a Hilbert space, then $\lambda : H \to R$ is a linear functional iff the following conditions are satisfied:

linearity: $\forall h_1, h_2 \in H \ \forall \alpha_1, \alpha_2 \in R \quad \lambda(h_1 \alpha_1 + h_2 \alpha_2) = \lambda(h_1) \alpha_1 + \lambda(h_2) \alpha_2$;

boundedness: $\exists c \ge 0 \ \forall h \in H \quad |\lambda(h)| \le c \|h\|$.

If $\lambda$ is a linear functional on a Hilbert space $H$, then its norm is

\[
\|\lambda\| \equiv \sup_{h \in H, \, h \ne 0} \frac{|\lambda(h)|}{\|h\|} .
\]

Example 4.4-9: Let $\Omega \subset R^d$ be open and measurable. We can define the inner product

\[
(f, g) = \int_\Omega f(x) g(x) \, dx
\]

on functions $f$ and $g$ in $C_0^\infty(\Omega)$. However, $C_0^\infty(\Omega)$ is not a Hilbert space, because limits of Cauchy sequences in the $L^2(\Omega)$ norm, which corresponds to this inner product, do not necessarily have compact support. Furthermore, derivatives of these limits are not necessarily continuous.

Definition 4.4-10: If $H$ is a real linear space with an inner product, the completion of $H$ is the set of all limits of Cauchy sequences in $H$. If $H$ is a Hilbert space and $D \subset H$, then $D$ is dense in $H$ iff $H$ is the completion of $D$.

It is often useful to know a dense subset of a Hilbert space. This is because it might be easier to deal with derivatives and boundary values of the functions in the dense subset, and take limits to handle the remaining functions.

Lemma 4.4-11: If $\Omega \subset R^d$ is open and $1 \le p < \infty$, then $C_0^\infty(\Omega)$ is dense in $L^p(\Omega)$.


Riesz Representation Theorem 4.4-12: If $H$ is a Hilbert space and $\lambda$ is a linear functional on $H$, then there is a unique $h_\lambda \in H$ so that for all $h \in H$ we have $\lambda(h) = (h, h_\lambda)$. Further, $\|\lambda\| = \|h_\lambda\|$.

Proof: See, for example, [71, p. 130] or [93, p. 90]. □

Example 4.4-13: It is easy to see that $R^d$ is a Hilbert space, with the inner product given by the usual vector dot product. If $h \in R^d$ is nonzero, for all $x \in R^d$ define $\lambda(x)$ to be the signed length of the projection of $x$ onto $h$. Then $\lambda : R^d \to R$. It is easy to see that $\lambda$ is linear. Since projection cannot increase the length of a vector, $|\lambda(x)| \le \|x\|$; thus $\lambda$ is bounded. It follows that $\lambda$ is a linear functional on $R^d$. The Riesz representation theorem implies that there is a unique $h_\lambda \in R^d$ so that for all $x \in R^d$, $\lambda(x) = (x, h_\lambda)$. In this case, it is easy to see that $h_\lambda = h / \|h\|$.
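In $R^d$ the Riesz representer can be recovered mechanically: its $i$-th component is the value of the functional on the $i$-th standard basis vector. The sketch below does this for the projection functional of the example; the function names and the particular vector $h$ are assumptions of this illustration.

```python
import numpy as np


def projection_length(x, h):
    """Signed length of the projection of x onto the nonzero vector h."""
    return float(np.dot(x, h) / np.linalg.norm(h))


def riesz_representer(functional, dim):
    """Representer of a linear functional on R^dim with the dot product.

    Its i-th component is the functional applied to the i-th unit vector.
    """
    return np.array([functional(e) for e in np.eye(dim)])


h = np.array([3.0, 4.0, 0.0])
h_lambda = riesz_representer(lambda x: projection_length(x, h), dim=3)

# The functional agrees with the inner product against the representer.
x_test = np.array([1.0, -2.0, 5.0])
value_direct = projection_length(x_test, h)
value_riesz = float(x_test @ h_lambda)
```

Here the representer is the unit vector in the direction of $h$, so the functional has norm 1.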

Definition 4.4-14: If $H$ is a Hilbert space and $B : H \times H \to R$, then $B$ is a bilinear form iff $B$ is linear in each of its arguments. The bilinear form $B$ is coercive if and only if there exists a constant $C_B > 0$ such that for all $h \in H$

\[
B(h, h) \ge C_B \|h\|^2 . \qquad (4.4\text{-}1)
\]

The bilinear form $B$ is bounded if and only if there exists a constant $c$ so that for all $u, v \in H$

\[
|B(v, u)| \le c \|v\| \|u\| . \qquad (4.4\text{-}2)
\]

Lax-Milgram Theorem 4.4-15: Suppose that $H$ is a Hilbert space with a bounded coercive bilinear form $B$, meaning that

\[
\begin{aligned}
& \exists \overline{c} > 0 \ \forall u, v \in H , \quad |B(v, u)| \le \overline{c} \|v\| \|u\| \\
& \exists \underline{c} > 0 \ \forall v \in H , \quad B(v, v) \ge \underline{c} \|v\|^2 .
\end{aligned}
\]

Also suppose that $\lambda$ is a bounded linear functional on $H$, meaning that $\lambda$ is linear and

\[
\exists c_\lambda \ \forall v \in H , \quad |\lambda(v)| \le c_\lambda \|v\| .
\]

Then there is a unique $h_\lambda \in H$ such that for all $h \in H$

\[
\lambda(h) = B(h, h_\lambda) ,
\]

and

\[
\underline{c} \|h_\lambda\| \le \|\lambda\| \le \overline{c} \|h_\lambda\| .
\]

Proof: See [93, p. 92]. □

The Lax-Milgram theorem is more general than the Riesz representation theorem. The Riesz representation theorem applies to coercive bilinear forms that are also symmetric, so that they induce an inner product on a Hilbert space. On the other hand, the Lax-Milgram theorem does not require the bilinear form to be symmetric. Unlike the Riesz representation theorem, the Lax-Milgram theorem cannot say precisely what the norm of the representing function $h_\lambda \in H$ must be, but it can place lower and upper bounds on its norm. We will use the Riesz representation theorem and the Lax-Milgram theorem in section 4.5.2 to prove that certain elliptic boundary value problems have solutions, and to describe how those solutions depend on their data.
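The finite-dimensional case shows what the theorem asserts. Take $H = R^n$ with the dot product, a nonsymmetric matrix $A$ whose symmetric part is positive definite, the bilinear form $B(v, u) = v^\top A u$, and the functional $\lambda(h) = h^\top \ell$. The sketch below, whose names and sample data are assumptions, solves for the representer and checks the two-sided bound; in this setting the coercivity constant is the smallest eigenvalue of the symmetric part of $A$ and the boundedness constant is the spectral norm of $A$.

```python
import numpy as np


def lax_milgram_check(A, ell):
    """Representer and constants for B(v, u) = v^T A u and lambda(h) = h^T ell."""
    A = np.asarray(A, dtype=float)
    ell = np.asarray(ell, dtype=float)
    h_lambda = np.linalg.solve(A, ell)                   # lambda(h) = B(h, h_lambda)
    c_lower = np.linalg.eigvalsh(0.5 * (A + A.T)).min()  # coercivity constant
    c_upper = np.linalg.norm(A, 2)                       # boundedness constant
    return h_lambda, c_lower, c_upper


A = np.array([[2.0, 1.0], [-1.0, 3.0]])   # nonsymmetric, symmetric part diag(2, 3)
ell = np.array([1.0, 2.0])
h_lambda, c_lower, c_upper = lax_milgram_check(A, ell)
norm_lambda = np.linalg.norm(ell)         # norm of the functional h -> h^T ell
norm_h = np.linalg.norm(h_lambda)
```

The skew-symmetric part of $A$ does not affect coercivity, which is why a nonsymmetric form can still satisfy the hypotheses.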

4.4.3 Sobolev Spaces

For a reference to the discussion in this section, see [2, 60]. The use of Sobolev norms will allow us to discuss variable-coefficient elliptic boundary value problems in section 4.5 by means of norms with constant coefficients. In this way, the norms are problem-independent and the discussions are simplified.

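Definition 4.4-16: Suppose that $\Omega \subset R^d$ is open, $k$ is a nonnegative integer, and $f \in C^\infty(\Omega)$. For any $p \in [1, \infty)$ the Sobolev norm is

\[
\|f\|_{W_p^k(\Omega)} = \left\{ \sum_{|\alpha| \le k} \|D^\alpha f\|_{L^p(\Omega)}^p \right\}^{1/p} ,
\]

while for $p = \infty$ the Sobolev norm is

\[
\|f\|_{W_\infty^k(\Omega)} = \max_{|\alpha| \le k} \|D^\alpha f\|_{L^\infty(\Omega)} .
\]

The Sobolev space $W_p^k(\Omega)$ is the Banach space given by the completion of $C^\infty(\Omega)$ with respect to the norm $\| \cdot \|_{W_p^k(\Omega)}$. Similarly, $W_{0,p}^k(\Omega)$ is the completion of $C_0^\infty(\Omega)$ with respect to $\| \cdot \|_{W_p^k(\Omega)}$. For $p \in [1, \infty)$, the Sobolev seminorm is

\[
|f|_{W_p^k(\Omega)} = \left\{ \sum_{|\alpha| = k} \|D^\alpha f\|_{L^p(\Omega)}^p \right\}^{1/p} ;
\]

while for $p = \infty$ the corresponding Sobolev seminorm is

\[
|f|_{W_\infty^k(\Omega)} = \max_{|\alpha| = k} \|D^\alpha f\|_{L^\infty(\Omega)} .
\]

For $p = 2$, the Sobolev space $W_2^k(\Omega) = H^k(\Omega)$ is a Hilbert space with Sobolev inner product

\[
(f, g)_{H^k(\Omega)} = \sum_{|\alpha| \le k} \int_\Omega D^\alpha f \, D^\alpha g \, dx .
\]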
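As a small worked example of these definitions, the sketch below computes the $H^1$ norm and seminorm of a smooth function on an interval by Gauss-Legendre quadrature. The function name is hypothetical, and the derivative is supplied by the caller rather than computed, which is an assumption made to keep the example short.

```python
import numpy as np


def h1_norm_interval(u, du, a=0.0, b=1.0, order=20):
    """Return (H^1 norm, H^1 seminorm) of u on (a, b).

    u, du: vectorized callables for the function and its derivative.
    order: number of Gauss-Legendre points.
    """
    nodes, weights = np.polynomial.legendre.leggauss(order)
    x = 0.5 * (b - a) * nodes + 0.5 * (a + b)   # map nodes from (-1, 1) to (a, b)
    w = 0.5 * (b - a) * weights
    l2_sq = w @ u(x) ** 2                       # |alpha| = 0 term
    semi_sq = w @ du(x) ** 2                    # |alpha| = 1 term
    return float(np.sqrt(l2_sq + semi_sq)), float(np.sqrt(semi_sq))


norm_h1, seminorm_h1 = h1_norm_interval(
    lambda x: np.sin(np.pi * x),
    lambda x: np.pi * np.cos(np.pi * x),
)
```

For $u(x) = \sin(\pi x)$ on $(0, 1)$ the exact values are $\|u\|_{H^1}^2 = \tfrac12 + \tfrac{\pi^2}{2}$ and $|u|_{H^1}^2 = \tfrac{\pi^2}{2}$.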


Example 4.4-17: If $\Omega = R^d$, then we can use Fourier transforms to define equivalent norms on $W_2^k(\Omega)$. If $u \in W_2^0(\Omega) = L^2(\Omega)$, then the Fourier transform is defined by

\[
\hat{u}(y) = \frac{1}{(2\pi)^{d/2}} \int_{R^d} e^{-i x \cdot y} u(x) \, dx .
\]

The inverse Fourier transform is $\check{u}(x) = \hat{u}(-x)$ and satisfies $u(x) = \check{\hat{u}}(x)$, $\widehat{D^\alpha u}(y) = (iy)^\alpha \hat{u}(y)$ and $D^\alpha \hat{u}(y) = \widehat{(-ix)^\alpha u}(y)$. Since the Fourier transform turns differentiation into multiplication by a polynomial, it is easy to see that an equivalent definition of $\| \cdot \|_{W_2^k(\Omega)}$ is

\[
\|u\|_{W_2^k(\Omega)} = \left[ \int_{R^d} (1 + \|y\|_2^2)^k |\hat{u}(y)|^2 \, dy \right]^{1/2} .
\]

Thus functions with finite $W_2^k(\Omega)$-norm on $R^d$ are such that their Fourier transform decays sufficiently rapidly at infinity, so that the integral above is finite.

4.4.4 Sobolev Imbedding Theorems

Because including additional derivatives in the Sobolev norms imposes additional restrictions on functions, the following result is obvious.

Lemma 4.4-18: If $1 \le p \le \infty$ and $m \ge k$, then $W_p^m(\Omega) \subset W_p^k(\Omega)$.

Recall that for all $u \in W_p^k(\Omega)$ there is a sequence $\{u_j\} \subset C^\infty(\Omega)$ such that for all multi-indices $|\alpha| \le k$, $\{D^\alpha u_j\}$ is a Cauchy sequence in $W_p^0(\Omega)$, and $u_j \to u$. Since $W_p^0(\Omega)$ is complete, it follows that for all $|\alpha| \le k$ there is a function $u_\alpha \in W_p^0(\Omega)$ such that $D^\alpha u_j \to u_\alpha$. However, it is not necessarily true that $u_\alpha = D^\alpha u$. We will have to impose additional restrictions on $\Omega$ to avoid this problem.

Definition 4.4-19: The set

\[
K(a; h, \theta) \equiv \{ z \in R^d : \|z\|_2 \le h \text{ and } \|z\|_2 \|a\|_2 \cos\theta \le z \cdot a \}
\]

is a cone with axis $a$, height $h$ and angle $\theta$. The open set $\Omega \subset R^d$ satisfies the cone property if there is an angle $\theta$ and height $h$ such that for all $x \in \Omega$ there is some axis $a_x$ such that the translated cone $x + K(a_x, h, \theta)$ is contained in $\Omega$.

Definition 4.4-20: The open set $\Omega \subset R^d$ satisfies the cone condition if and only if there exists a cone height $h$ and cone angle $\theta$ such that for all $x \in \Omega$ there is a cone axis $a_x$ such that $x + K(a_x, h, \theta) \subset \Omega$.


Definition 4.4-21: The open set $\Omega \subset R^d$ satisfies the strong cone condition if and only if there exists a collection $\{O_i : 1 \le i \le N\}$ of open sets such that $\partial\Omega \subset \cup_{i=1}^N O_i$, and corresponding cones $K_i$ with vertices at the origin such that for all $x \in \Omega \cap O_i$ we have $x + K_i \subset \Omega$.

Definition 4.4-22: The open set $\Omega \subset R^d$ satisfies the uniform cone condition if and only if there exists a sequence of open sets $\{O_i\}_{i=1}^\infty \subset R^d$ and a corresponding sequence $\{K(a_i, h_i, \theta)\}_{i=1}^\infty \subset R^d$ of open cones such that

1. every compact subset of $R^d$ intersects at most a finite number of the $O_i$,

2. $\partial\Omega \subset \bigcup_{i=1}^\infty O_i$,

3. there is some finite number $R$ such that for all $i$ the diameter of $O_i$ is less than $R$,

4. there is a $\delta > 0$ such that $\{ x \in \Omega : \operatorname{dist}(x, \partial\Omega) < \delta \} \subset \bigcup_{i=1}^\infty O_i$,

5. for all $i$, $Q_i \equiv \bigcup_{x \in \Omega \cap O_i} (x + K(a_i, h_i, \theta)) \subset \Omega$, and

6. there exists an integer $N$ so that any $N + 1$ of the sets $Q_i$ has empty intersection.

The uniform cone condition prevents cusps in the boundary of the domain.

Note that Sobolev spaces are formed by taking limits of $C^\infty$ functions. However, some continuity is lost in taking these limits; for example, $W_2^0(\Omega)$ includes step functions. As we require more derivatives to be $p$-integrable, more continuity remains in the completion. These notions are made precise by the next theorem.

Sobolev's Imbedding Theorem 4.4-23: Suppose that $\Omega \subset R^d$ is open and satisfies the cone condition. Suppose that $m$ is a nonnegative integer, and $1 \le p < \infty$. Adams [2, p. 83] says that $\Omega \subset R^d$ satisfies the strong local Lipschitz condition if and only if there exist $\delta > 0$, $M > 0$, a locally finite open cover $\{O_j\}$ of $\partial\Omega$, and functions $f_j$ of $d - 1$ variables such that

1. there exists $R < \infty$ so that every collection of $R + 1$ of the sets $O_j$ has empty intersection,

2. for all $x, y \in \Omega_\delta \equiv \{ z \in \Omega : \operatorname{dist}(z, \partial\Omega) < \delta \}$ with $\|x - y\| < \delta$ there exists $O_j$ such that $\operatorname{dist}(x, \partial O_j) > \delta$ and $\operatorname{dist}(y, \partial O_j) > \delta$,

3. for all $j$ and for all $\xi, \rho \in R^{d-1}$ we have $\|f_j(\xi) - f_j(\rho)\| \le M \|\xi - \rho\|$,

4. for some Cartesian coordinate system $\zeta \in O_j$, $\Omega \cap O_j$ is represented by the inequality $\zeta_d < f_j(\zeta_1, \ldots, \zeta_{d-1})$.

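Adams [2, p. 85] takes integers $j \ge 0$ and $m \ge 1$, and $1 \le p < \infty$, and states the following.

Part I: Suppose that $\Omega$ satisfies the cone condition.

A: If $mp > d$, or $m = d$ and $p = 1$, then $\|u\|_{W_\infty^j(\Omega)} \le C \|u\|_{W_p^{j+m}(\Omega)}$, and if $q \in [p, \infty]$ then $\|u\|_{W_q^j(\Omega)} \le C \|u\|_{W_p^{j+m}(\Omega)}$.

B: If $mp = d$ then $\|u\|_{W_q^j(\Omega)} \le C \|u\|_{W_p^{j+m}(\Omega)}$.

C: If $mp < d$ and $q \in [p, dp/(d - mp)]$ then $\|u\|_{W_q^j(\Omega)} \le C \|u\|_{W_p^{j+m}(\Omega)}$.

Part II: If $\Omega$ satisfies the strong local Lipschitz condition, $mp > d > (m - 1)p$ and $\lambda \in (0, m - d/p]$, then $u \in W_p^{j+m}(\Omega)$ implies that $u \in C^j(\Omega)$ and

\[
\|u\|_{W_\infty^j(\Omega)} + \max_{|\alpha| \le j} \sup_{x, y \in \Omega, \, x \ne y} \frac{\|D^\alpha u(x) - D^\alpha u(y)\|}{\|x - y\|^\lambda} \le C \|u\|_{W_p^{j+m}(\Omega)} .
\]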

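1. If $p \le q < \infty$, and $k$ is an integer satisfying the constraint

\[
m + d \left( \frac1p - \frac1q \right) \le k < m + \frac{d}{p}
\]

then there is a constant $C > 0$ such that for all $u \in W_p^k(\Omega)$

\[
\|u\|_{W_q^m(\Omega)} \le C \|u\|_{W_p^k(\Omega)} .
\]

Further, for all $u \in W_p^k(\Omega)$ there is a function $\tilde{u} \in W_q^m(\Omega)$ such that $\|u - \tilde{u}\|_{W_p^k(\Omega)} = 0$.

2. If $k$ is an integer satisfying the constraint

\[
\begin{cases}
k \ge m + d , & \text{if } p = 1 \\
k > m + d/p , & \text{if } 1 < p < \infty
\end{cases}
\]

then there is a constant $C > 0$ such that for all $u \in W_p^k(\Omega)$

\[
\|u\|_{W_\infty^m(\Omega)} \le C \|u\|_{W_p^k(\Omega)} .
\]

Further, for all $u \in W_p^k(\Omega)$ there is a function $\tilde{u} \in C^m(\Omega)$ such that $\|u - \tilde{u}\|_{W_p^k(\Omega)} = 0$.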



In our later discussions, we will ignore the distinction between $u$ and $\tilde{u}$, since they differ only on sets of measure zero. In practice, the most common use of Sobolev's Imbedding Theorem will be in the case $p = 2$; in this case, we see that a function in $W_2^k(\Omega) = H^k(\Omega)$ is $m$-times continuously differentiable for any $m < k - d/2$.

Example 4.4-24: Sobolev's imbedding theorem says that in one dimension, functions in $H^1(\Omega)$ are continuous. In two and three dimensions, functions in $H^1(\Omega)$ are not necessarily continuous. In these cases, we must assume at least 2 Sobolev derivatives.

Theorem (Sobolev's Inequality) 4.4-25: Suppose that $\Omega \subset R^d$ is open and satisfies the uniform cone condition, and that $1 \le p < \infty$. If $p = 1$ assume that $k \ge d$, otherwise $k > d/p$. Then there exists $C > 0$ such that for all $r \ge 1$, for all $|\alpha| \le k$ and for all $x \in \Omega$

\[
|D^\alpha u(x)| \le C r^{-(k - |\alpha| - d/p)} |u|_{W_p^k(\Omega)} + r^k \|u\|_{W_p^0(\Omega)} . \qquad (4.4\text{-}3)
\]

A related inequality is due to Aronszajn and Smith [77].

Theorem 4.4-26: Suppose that $\Omega \subset R^d$ is open, bounded and satisfies the strong cone condition (see definition 4.4-21). Then for all $m > 0$ and for all $p \in [1, \infty]$ there is a constant $C$ such that for all $u \in W_p^m(\Omega)$ we have

\[
\|u\|_{W_p^m(\Omega)} \le C \left[ |u|_{W_p^m(\Omega)} + \|u\|_{W_p^0(\Omega)} \right] .
\]

4.4.5 Hilbert Scales and Fractional-Order Sobolev Spaces

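If $\Omega = R^d$, we could have defined $\|u\|_{W_2^k(\Omega)} = \left[ \int_\Omega (1 + \|\xi\|_2^2)^k \|\hat{u}(\xi)\|_2^2 \, d\xi \right]^{1/2}$ for any real number $k$. More generally, we could define fractional order Sobolev norms without using Fourier transforms. For $0 \le s < 1$ we could define the seminorm

\[
|u|_{W_2^s(\Omega)}^2 = \int_\Omega \int_\Omega \frac{\|u(x) - u(y)\|_2^2}{\|x - y\|_2^{d + 2s}} \, dx \, dy ,
\]

and for arbitrary $s > 0$ with $\lfloor s \rfloor$ representing the largest integer less than or equal to $s$,

\[
\|u\|_{W_2^s(\Omega)} = \left\{ \|u\|_{W_2^{\lfloor s \rfloor}(\Omega)}^2 + \sum_{|\alpha| = \lfloor s \rfloor} |D^\alpha u|_{W_2^{s - \lfloor s \rfloor}(\Omega)}^2 \right\}^{1/2} .
\]

This is an example of the following more general idea.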

Definition 4.4-27: Let $\Omega \subset R^d$ be open, and let $\{ \| \cdot \|_r : r \in R \}$ be a continuum of norms generated by inner products defined on $C^\infty(\Omega)$, such that

\[
\forall u \in C^\infty(\Omega) \ \forall r_2 < r_1 \quad \|u\|_{r_2} \le \|u\|_{r_1} .
\]

For each $r \in R$, let $H_r$ be the Hilbert space formed by taking the completion of $C^\infty(\Omega)$ in $\| \cdot \|_r$. Then the continuum of Hilbert spaces $\{H_r\}$ is a Hilbert scale iff the following conditions are satisfied:

1. $\forall r_1 > r_2$, $H_{r_1}$ is dense in $H_{r_2}$;

2. (Rellich's Lemma) if the sequence $\{u_j\}$ is uniformly bounded in $H_{r_1}$ then for any $r_2 < r_1$ this sequence has a convergent subsequence;

3. if $r_1 > r > r_2$ and $u \in H_{r_1}$ then

\[
\|u\|_r^{r_1 - r_2} \le \|u\|_{r_1}^{r - r_2} \|u\|_{r_2}^{r_1 - r} ;
\]

4. $\forall r > 0$, $H_{-r}$ is the $H_0$ dual of $H_r$; in other words, an equivalent norm on $H_{-r}$ is

\[
\|u\|_{-r} = \sup_{v \in H_r} \frac{|(u, v)|}{\|v\|_r} .
\]

For proofs of the following results, see Krein and Petunin [58], or Adams [2, p. 188f]. However, Adams discusses scales of Hilbert spaces principally for $\Omega = R^d$.

Lemma 4.4-28: If $\Omega \subset R^d$ and $1 \le p \le \infty$, the dual of $W_p^k(\Omega)$ is $W_{p/(p-1)}^{-k}(\Omega)$.

Lemma 4.4-29: If $\Omega \subset R^d$ is open and satisfies the uniform cone condition, then the Sobolev spaces $H^r(\Omega)$ form a Hilbert scale.


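Lemma 4.4-30: Suppose that $\{H_r\}$ and $\{\tilde{H}_r\}$ are two Hilbert scales. Also suppose that $T : H_{r_1} \to \tilde{H}_{r_1}$ and $T : H_{r_2} \to \tilde{H}_{r_2}$ is linear and bounded:

\[
\text{for } r = r_1 \text{ or } r_2 , \quad \|T\|_{H_r \to \tilde{H}_r} \equiv \sup_{v \in H_r, \, \|v\|_{H_r} > 0} \frac{\|Tv\|_{\tilde{H}_r}}{\|v\|_{H_r}} < \infty .
\]

Then for all $r \in (r_1, r_2)$, $T$ is a continuous linear operator from $H_r$ to $\tilde{H}_r$, and

\[
\|T\|_{H_r \to \tilde{H}_r}^{r_1 - r_2} \le \|T\|_{H_{r_1} \to \tilde{H}_{r_1}}^{r - r_2} \|T\|_{H_{r_2} \to \tilde{H}_{r_2}}^{r_1 - r} .
\]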

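Example 4.4-31: Suppose that $\Omega \subset R^d$ satisfies the uniform cone condition, $1 \le p \le \infty$, and $k$ is a positive integer satisfying

\[
\begin{cases}
p = 1 \Longrightarrow k \ge d \\
p > 1 \Longrightarrow k > d/p .
\end{cases}
\]

Then theorem 4.4-25 (Sobolev's inequality) shows that the Dirac $\delta$-function is a bounded linear operator on $W_p^k(\Omega)$. In other words, $\delta \in W_p^{-k}(\Omega)$. In particular, the Dirac delta-function is in $H^{-k}(\Omega)$ for any $k > d/2$; for any $\varepsilon > 0$, $\delta \in H^{-d/2 - \varepsilon}(\Omega)$.

Example 4.4-32: Negative Sobolev norms are useful, because they provide bounds on averages of functions. For example, if $\Omega$ is bounded, $1 < p < \infty$ and $u \in W_p^{-1}(\Omega)$, then

\[
\|u\|_{W_p^{-1}(\Omega)} \equiv \sup_{v \in W_p^1(\Omega)} \frac{|(v, u)_{L^0(\Omega)}|}{\|v\|_{W_p^1(\Omega)}} \ge \frac{|(1, u)_{L^0(\Omega)}|}{\|1\|_{W_p^1(\Omega)}} = \frac{\left| \int_\Omega u \, dx \right|}{\int_\Omega dx} .
\]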

4.4.6 Sobolev Trace Theorems

Definition 4.4-33: If $\Omega \subset R^d$, $x \in \partial\Omega$, $n$ is the unit outer normal to $\Omega$ at $x$ and $k \ge 0$, then the $k$'th normal derivative of $u$ at $x$ is

\[
\frac{\partial^k u}{\partial n^k} \equiv \sum_{|\alpha| = k} \frac{k!}{\alpha!} n^\alpha D^\alpha u .
\]

Note that we have to make further restrictions on $\partial\Omega$ so that the outer normal makes sense. For example, if the boundary of $\Omega$ involves a slit, it is not possible to move in a direction normal from the boundary along the slit into $\Omega^c$. For other technical reasons, we will also want to make sure that $\Omega$ is connected.

Trace Theorem 4.4-34: Suppose that $\Omega \subset R^n$ is open, connected and not locally on one side of its boundary. Then for every integer $k \ge 1$ there exists $C > 0$ such that for all $j \in \left( 0, \lfloor k - \tfrac12 \rfloor \right)$ (where $\lfloor k - \tfrac12 \rfloor$ is the greatest integer less than or equal to $k - \tfrac12$) and for all $u \in W_2^k(\Omega)$

\[
\left\| \frac{\partial^j u}{\partial n^j} \right\|_{H^{k - j - \frac12}(\partial\Omega)} \le C \|u\|_{W_2^k(\Omega)} .
\]

Proof: See Aziz [8, p. 32], or Lions and Magenes [60, p. 39]. On page 15, Aziz assumes that there are a finite number of open connected sets $I_s \subset R^{d-1}$, $1 \le s \le \nu$, such that $I_s^c \subset I_s$. Further, for each $s \in \{1, \ldots, \nu\}$ there is a function $\phi_s : I_s \to R$ such that

\[
O_s \equiv \{ (y, \phi_s(y)) : y \in I_s \} \subset \partial\Omega ,
\]

and

\[
\cup_{s=1}^\nu O_s = \partial\Omega .
\]

Further, Aziz assumes that $\Omega$ is locally on one side of $\partial\Omega$. When he uses the term "smooth domain," he assumes that the functions $\phi_s$ are infinitely differentiable. The domain $\Omega$ is Lipschitzian if every $\phi_s$ is Lipschitz continuous. For all discussions of Sobolev spaces, Aziz assumes that $\Omega$ is smooth.

On p. 34, Lions and Magenes assume that the boundary of $\Omega$ is a $(d - 1)$-dimensional infinitely differentiable variety, and that $\Omega$ is locally on one side of its boundary. They also assume that $\Omega$ is bounded. □

In other words, the boundary values of functions in Sobolev spaces lie in fractional order Sobolev spaces on the boundary; boundary values have one half fewer derivatives than the function in the interior of the domain. A converse of this theorem, which says that for any $b_e \in H^{k - 1/2}(\partial\Omega)$ there is a $u \in H^k(\Omega)$ so that $\|u\|_{H^k(\Omega)} \le \|b_e\|_{H^{k - 1/2}(\partial\Omega)}$, is discussed in [8, p. 33].

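Definition 4.4-35: If $\Omega \subset R^d$ and $u : \Omega \to R^m$, the Lipschitz norm of $u$ is

\[
\|u\|_{Lip(\Omega)} = \|u\|_{L^\infty(\Omega)} + \sup_{x, y \in \Omega ; \, x \ne y} \frac{\|u(x) - u(y)\|_2}{\|x - y\|_2} .
\]

Definition 4.4-36: [16, p. 33] If $\phi : R^{d-1} \to R$ is Lipschitz continuous, then $\Omega = \{ (x, \xi) : \phi(x) < \xi \}$ is the graph of $\phi$.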


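Note that if $\Omega$ is the graph of a Lipschitz continuous function $\phi$ and $u : \partial\Omega \to R$, then the integral of $u$ over $\partial\Omega$ is

\[
\int_{\partial\Omega} u(x) \, ds = \int_{R^{d-1}} u(x, \phi(x)) \sqrt{1 + \|\nabla_x \phi\|_2^2} \, dx .
\]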
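For $d = 2$ the boundary is a curve, and the surface integral above reduces to an ordinary integral with the arc-length factor $\sqrt{1 + \phi'(x)^2}$. The sketch below evaluates it by Gauss-Legendre quadrature over a finite interval $[a, b]$; restricting to an interval (as if $u$ vanished elsewhere on the boundary), the function name, and the caller-supplied derivative `dphi` are assumptions of this illustration.

```python
import numpy as np


def graph_boundary_integral(u, phi, dphi, a, b, order=20):
    """Integrate u over the piece of the graph of phi lying above [a, b] (d = 2).

    u:    callable u(x, y) on the boundary curve, vectorized.
    phi:  callable describing the boundary y = phi(x).
    dphi: callable for the derivative of phi.
    """
    nodes, weights = np.polynomial.legendre.leggauss(order)
    x = 0.5 * (b - a) * nodes + 0.5 * (a + b)   # map nodes from (-1, 1) to (a, b)
    w = 0.5 * (b - a) * weights
    arc_factor = np.sqrt(1.0 + dphi(x) ** 2)    # ds = sqrt(1 + phi'^2) dx
    return float(w @ (u(x, phi(x)) * arc_factor))


# Length of the segment y = x for 0 <= x <= 1, and the integral of y along it.
length = graph_boundary_integral(
    lambda x, y: np.ones_like(x), lambda x: x, lambda x: np.ones_like(x), 0.0, 1.0
)
moment = graph_boundary_integral(
    lambda x, y: y, lambda x: x, lambda x: np.ones_like(x), 0.0, 1.0
)
```

For a Lipschitz function that is only piecewise smooth, one would split the interval at the kinks so that each piece is integrated with a smooth arc-length factor.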

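Definition 4.4-37: Let $\Omega \subset \mathbb{R}^d$. Then $\Omega$ has a Lipschitz boundary if and only if there exist a countable set $\{O_i\}$ of open sets $O_i \subset \mathbb{R}^d$ such that $O_i \cap \partial\Omega \ne \emptyset$, a number $\varepsilon > 0$ and integers $M$ and $N$ such that

1. for all $x \in \partial\Omega$ there exists an integer $i$ so that the ball centered at $x$ with radius $\varepsilon$ is contained in $O_i$:
\[ \{ y : \|y - x\|_2 < \varepsilon \} \subset O_i , \]

2. no more than $N$ of the sets $O_i$ have a nonempty intersection,

3. for all $i$ there exists a Lipschitz function $\phi_i : \mathbb{R}^{d-1} \to \mathbb{R}$ such that
\[ \Omega_i \equiv \{ (y, \eta) : y \in \mathbb{R}^{d-1}, \ \eta \in (-\infty, \phi_i(y)) \} \]
is the graph of $\phi_i$,

4. $O_i \cap \Omega = O_i \cap \Omega_i$, and

5. $\forall i \quad \|\phi_i\|_{\mathrm{Lip}(\mathbb{R}^{d-1})} \le M$.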

See also Babuška and Aziz, Thm 2.4.1 p. 32 and Thm 2.4.3 p. 33:

Trace Theorem 4.4-38: For every $\Omega \subset \mathbb{R}^d$ open with Lipschitz boundary and every $1 \le p \le \infty$ there exists $C > 0$ such that for all $v \in W^1_p(\Omega)$
\[ \|v\|_{W^0_p(\partial\Omega)} \le C \, \|v\|^{1-1/p}_{W^0_p(\Omega)} \, \|v\|^{1/p}_{W^1_p(\Omega)} . \]

Proof: See Grisvard [45]. □

From our results on Hilbert scales in section 4.4.5, we see that this statement is consistent with stating that the boundary values of $v \in W^1_p(\Omega)$ have $1 - 1/p$ derivatives.

The trace theorems do have other important consequences for Dirichlet boundary conditions that are easy to understand.

Definition 4.4-39: If $\Omega \subset \mathbb{R}^d$ is open with Lipschitz boundary, then $W^k_{0,p}(\Omega)$ is the completion of $C^\infty_0(\Omega)$ in the norm $\| \cdot \|_{W^k_p(\Omega)}$.


Lemma 4.4-40: Suppose that $\Omega \subset \mathbb{R}^d$ is open with Lipschitz boundary. If $p \in [1, \infty]$ and $k > d/p$, then for all $u \in W^k_{0,p}(\Omega)$, for all $j \in (0, \lfloor k - d/p \rfloor)$ and for all $x \in \partial\Omega$ we have $\frac{\partial^j u}{\partial n^j}(x) = 0$.

Proof: See Lions and Magenes [60, p. 62], except that Lions and Magenes make additional assumptions on $\Omega$. □

This lemma says that limits in $W^k_2(\Omega)$ of smooth functions that are zero near the boundary of $\Omega$ (in an appropriate sense) have fewer than $k - d/2$ zero normal derivatives on the boundary.

The next lemma allows us to bound lower-order derivatives of limits in $W^j_2(\Omega)$ of $C^\infty_0$ functions, in terms of higher-order derivatives.

Lemma 4.4-41: (Poincaré's Inequality) Suppose $\Omega \subset \mathbb{R}^d$ is an open set, and suppose that there is a constant $L_\Omega > 0$ and a line $\ell_\Omega \subset \mathbb{R}^d$ such that for all lines $\ell'$ parallel to $\ell_\Omega$ the length of the line segment $\ell' \cap \Omega$ is at most $L_\Omega$. Then for all $m > 0$ there is a constant $C_{m,d}$ independent of $c$ and $L_\Omega$ such that for all $u \in C^m_0(\Omega)$ and for all $j \in [0, m)$ we have
\[ |u|_{W^j_2(\Omega)} \le C_{m,d} \, L_\Omega^{m-j} \, |u|_{W^m_2(\Omega)} . \]

Proof: See Agmon [3, p. 73]. We will prove the result for $j = 0$ and $m = 1$.

For all $\ell'$ parallel to $\ell_\Omega$, let $x_0$ and $y$ be any points so that both $x_0$ and $x_0 + y$ lie in $\ell' \cap \partial\Omega$, and so that $\ell' \cap \Omega$ is contained in the line segment from $x_0$ to $x_0 + y$. Given any $u \in C^1_0(\Omega)$, define
\[ f(t) = u(x_0 + y t / \|y\|) . \]
Next, note that $f(0) = 0 = f(\|y\|)$ and that $f'(t) = \nabla_x u \cdot y / \|y\|$. Since the fundamental theorem of calculus implies that $f(t) = \int_0^t f'(\tau) \, d\tau$, the Cauchy-Schwarz inequality implies that
\[ |f(t)|^2 \le \int_0^t 1^2 \, d\tau \int_0^t f'(\tau)^2 \, d\tau \le L_\Omega \int_{-\infty}^{\infty} |f'(\tau)|^2 \, d\tau . \]
Since $x_0$ was an arbitrary endpoint, express $\int_\Omega u(x)^2 \, dx$ as an iterated integral with one of the integrations taken in the direction of $\ell_\Omega$. Then it follows that
\[ \int_\Omega u(x)^2 \, dx \le L_\Omega^2 \int_\Omega \|\nabla_x u\|^2 \, dx . \]
□
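The case $j = 0$, $m = 1$ of the Poincaré inequality is easy to test in one dimension, where $\Omega = (0, L)$ has width $L$. The sketch below is only an illustration under assumed choices: the helper names `poincare_sides` and `check`, the midpoint quadrature, and the two test functions $\sin(\pi x / L)$ and $x (L - x)$ (both zero at the endpoints) are not from the text.

```python
import math

def poincare_sides(u, du, L, n=4000):
    """Return (integral of u^2, L^2 * integral of u'^2) over (0, L) by the midpoint rule."""
    h = L / n
    lhs = 0.0
    grad = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        lhs += u(x) ** 2 * h
        grad += du(x) ** 2 * h
    return lhs, L ** 2 * grad

def check(L):
    """Evaluate both sides of the inequality for two functions vanishing at 0 and L."""
    results = []
    # u(x) = sin(pi x / L)
    results.append(poincare_sides(lambda x: math.sin(math.pi * x / L),
                                  lambda x: math.pi / L * math.cos(math.pi * x / L), L))
    # u(x) = x (L - x)
    results.append(poincare_sides(lambda x: x * (L - x),
                                  lambda x: L - 2.0 * x, L))
    return results
```

For each pair returned by `check(L)`, the first entry should not exceed the second; the factor $L^2$ shows how the bound scales with the width of the domain.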


See comments about Dirichlet condition in [11, p. 33].

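Lemma 4.4-42: (Friedrichs' Inequality) Suppose that $\Omega \subset \mathbb{R}^d$ is a convex open set with finite radius $\rho_\Omega$, meaning that for all $x, z \in \Omega$ we have $\|x - z\|_\infty \le 2 \rho_\Omega$. Then for all $p \in [1, \infty]$ and for all $u \in W^1_p(\Omega)$,
\[ \left\| \frac{\int_\Omega u(y) \, dy}{\int_\Omega dy} - u \right\|_{W^0_p(\Omega)} \le 2 \rho_\Omega \, |u|_{W^1_p(\Omega)} . \]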

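Proof: Since $C^\infty(\Omega)$ is dense in $W^1_p(\Omega)$, we may assume that $u \in C^\infty(\Omega)$.

First, we note that for any $x \in \Omega$
\[ \int_\Omega u(y) \, dy - u(x) \int_\Omega dy = \int_\Omega u(y) - u(x) \, dy \]
so Taylor's theorem implies that
\[ = \int_\Omega \int_0^1 \nabla_x u(x + [y - x] s) \cdot (y - x) \, ds \, dy \]
and a change of variables of integration implies that
\[ = \int_\Omega \int_{\min\{ s \in (0,1) \,:\, x + (z - x)/s \in \Omega \}}^{1} \nabla_x u(z) \cdot (z - x) \, s^{d-1} \, ds \, dz . \]
If $1/p + 1/q = 1$, we can take absolute values and obtain
\[ \left| \int_\Omega u(y) \, dy - u(x) \int_\Omega dy \right| \le \frac{1}{d} \left| \int_\Omega \nabla_x u(z) \cdot (z - x) \, dz \right| \le \frac{1}{d} \int_\Omega \|\nabla_x u(z)\|_{L^p(\Omega)} \|z - x\|_{L^q(\Omega)} \, dz . \]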

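If $p = \infty$, then $q = 1$ and
\begin{align*}
\operatorname*{ess\,sup}_{x \in \Omega} \left| \frac{\int_\Omega u(y) \, dy}{\int_\Omega dy} - u(x) \right|
&\le \frac{1}{d} \operatorname*{ess\,sup}_{x \in \Omega} \frac{\int_\Omega \|\nabla_x u(z)\|_{L^\infty(\Omega)} \|z - x\|_{L^1(\Omega)} \, dz}{\int_\Omega dy} \tag{4.4-4} \\
&\le \frac{1}{d} |u|_{W^1_\infty(\Omega)} \operatorname*{ess\,sup}_{x \in \Omega} \frac{\int_\Omega \|z - x\|_{L^1(\Omega)} \, dz}{\int_\Omega dy} \tag{4.4-5} \\
&\le |u|_{W^1_\infty(\Omega)} \operatorname*{ess\,sup}_{x \in \Omega} \frac{\int_\Omega \|z - x\|_{L^\infty(\Omega)} \, dz}{\int_\Omega dy} \le 2 \rho_\Omega |u|_{W^1_\infty(\Omega)} . \tag{4.4-6}
\end{align*}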


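On the other hand, if $p < \infty$ then
\begin{align*}
\int_\Omega \left| \frac{\int_\Omega u(y) \, dy}{\int_\Omega dy} - u(x) \right|^p dx
&\le \frac{1}{d^p} \left[ \frac{\int_\Omega \|\nabla_x u(z)\|_{L^p(\Omega)} \|z - x\|_{L^q(\Omega)} \, dz}{\int_\Omega dy} \right]^p \tag{4.4-7} \\
&\le \frac{1}{d^p} \left[ \frac{ \left\{ \int_\Omega \|\nabla_x u\|^p_{L^p(\Omega)} \, dz \right\}^{1/p} \left\{ \int_\Omega \|z - x\|^q_{L^q(\Omega)} \, dz \right\}^{1/q} }{\int_\Omega dy} \right]^p \tag{4.4-8} \\
&= d^{-p} |u|^p_{W^1_p(\Omega)} \left[ \frac{ \left\{ \int_\Omega d \, \|z - x\|^q_{L^\infty(\Omega)} \, dz \right\}^{1/q} }{\int_\Omega dy} \right]^p \tag{4.4-9} \\
&\le d^{-1} |u|^p_{W^1_p(\Omega)} \left[ \frac{ \left\{ \int_\Omega (2 \rho_\Omega)^q \, dz \right\}^{1/q} }{\int_\Omega dy} \right]^p \tag{4.4-10} \\
&= d^{-1} (2 \rho_\Omega)^p |u|^p_{W^1_p(\Omega)} \tag{4.4-11}
\end{align*}
Since $d \ge 1$, the result follows. □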

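See also Friedrichs' inequality [16, p. 104]
\[ \forall \Omega \subset \mathbb{R}^d \ \exists C > 0 \ \forall u \in W^1_p(\Omega) \quad \|u - \bar{u}\|_{W^1_p(\Omega)} \le C |u|_{W^1_p(\Omega)} , \]
where $\bar{u}$ denotes the average of $u$ over $\Omega$, and a different version [11, p. 33].

Example 4.4-43: Since the Laplacian has principal part $p_\Delta(\xi) = \|\xi\|_2^2$, the Laplacian is uniformly strongly elliptic with ellipticity constant equal to 1. The associated bilinear form
\[ A(\delta u, u) = \int_\Omega \nabla_x \delta u \cdot \nabla_x u \, dx \]
is obviously symmetric. Since
\[ A(\delta u, \delta u) = \int_\Omega \nabla_x \delta u \cdot \nabla_x \delta u \, dx = \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx , \]
the Poincaré inequality (lemma 4.4-41), in the form $\int_\Omega \delta u^2 \, dx \le L_\Omega^2 \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx$, shows that $A$ is coercive:
\begin{align*}
A(\delta u, \delta u) &= \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx = \frac{1}{1 + L_\Omega^2} \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx + \frac{L_\Omega^2}{1 + L_\Omega^2} \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx \\
&\ge \frac{1}{1 + L_\Omega^2} \left[ \int_\Omega \|\nabla_x \delta u\|_2^2 \, dx + \int_\Omega \delta u^2 \, dx \right] = \frac{1}{1 + L_\Omega^2} \|\delta u\|^2_{W^1_2(\Omega)} .
\end{align*}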


4.5 Elliptic Differential Operators

Recall that we discussed elliptic differential operators and general Dirichlet problems in section 4.3.2. That discussion included the description of weak forms for differential equations, and coercivity.

4.5.1 Gårding's Inequality

The following result is useful in describing the well-posedness of elliptic boundary value problems. This lemma will allow us to discuss elliptic operators that are not necessarily coercive.

Lemma 4.5-1: (Gårding's Inequality) Suppose that $\Omega \subset \mathbb{R}^d$ is bounded and open, and suppose that the bilinear form
\[ B(v, u) = \sum_{|\beta| \le m} \sum_{|\alpha| \le m} \int_\Omega a_{\alpha,\beta} D^\alpha v \, D^\beta u \, dx \]
is uniformly strongly elliptic (see definition ??) with ellipticity constant $E$. Suppose that $a_{\alpha,\beta}$ is continuous on $\Omega$ for all $|\alpha| = m = |\beta|$, and that $a_{\alpha,\beta}$ is bounded for all $|\alpha| + |\beta| \le m$. Then there exist constants $\gamma > 0$ and $\lambda \ge 0$ so that for all $v \in C^m_0(\Omega)$ we have
\[ B(v, v) \ge \gamma E \|v\|^2_{W^m_2(\Omega)} - \lambda \|v\|^2_{W^0_2(\Omega)} . \]

Proof: See Agmon [3, p. 78], or Brenner and Scott [16, p. 136]. □

Elliptic boundary value problems described by weak forms for which the bilinear form satisfies Gårding's inequality can have multiple solutions. Fortunately, Gårding's inequality guarantees that both this elliptic boundary value problem and its adjoint have finite dimensional solution spaces.

Lemma 4.5-2: Suppose that $\Omega \subset \mathbb{R}^d$ is bounded and open, and suppose that the bilinear form $B$ associated with the elliptic operator $A$ satisfies Gårding's inequality (lemma 4.5-1). Then
\[ N(A) \equiv \{ z \in W^m_{0,2}(\Omega) : \forall v \in W^m_{0,2}(\Omega) \ B(v, z) = 0 \} \]
and
\[ N^*(A) \equiv \{ z \in W^m_{0,2}(\Omega) : \forall v \in W^m_{0,2}(\Omega) \ B(z, v) = 0 \} \]
are finite dimensional with the same dimension.

Proof: See Agmon [3, p. 102]. □


4.5.2 Existence and Uniqueness for Weak Problems

Finally, we have assembled most of the important results we need to discuss the well-posedness of elliptic boundary value problems. Our first goal will be to examine the solution of self-adjoint coercive Dirichlet problems.

Lemma 4.5-3: Let $\Omega \subset \mathbb{R}^d$ be open, and let $A(x, D)$ be an elliptic operator with associated bilinear form
\[ A(v, u) = \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha,\beta} D^\alpha v \, D^\beta u \, dx . \]
Suppose that the coefficients $a_{\alpha,\beta}$ are uniformly bounded, meaning that there exists a constant $C_a$ such that for all $x \in \Omega$ and all multi-indices $\alpha$ and $\beta$ with $\max\{|\alpha|, |\beta|\} \le m$,
\[ |a_{\alpha,\beta}(x)| \le C_a . \]
Also suppose that $A$ is symmetric and coercive: there is a constant $C_A$ such that for any $v \in H^m(\Omega)$ we have
\[ A(v, v) \ge C_A \|v\|^2_{H^m(\Omega)} . \]
If $f \in W^{-m}_2(\Omega)$ and $b \in W^m_2(\Omega)$, then the Dirichlet problem
\begin{align*}
A(v, u) &= (v, f) \quad \forall v \in H^m_0(\Omega) \\
u - b &\in H^m_0(\Omega)
\end{align*}
has a unique solution $u \in H^m(\Omega)$. Further, $u$ depends continuously on the data:
\[ \|u\|_{H^m(\Omega)} \le \frac{1}{C_A} \|f\|_{H^{-m}(\Omega)} + \left( 1 + \frac{C_a}{C_A} \right) \|b\|_{H^m(\Omega)} . \]
Alternatively, $u$ solves the Dirichlet problem if and only if $u$ is a minimum point of the total energy
\[ E(w) \equiv \frac12 A(w, w) - (w, f) + A(w, b) . \]
Finally, if $u$ minimizes $E$ then for any other $w \in b + H^m_0(\Omega)$ we have
\[ E(w) = E(u) + \frac12 A(w - u, w - u) . \]

Proof: Since $A$ is bilinear, symmetric and coercive, $A$ is an inner product on $H^m_0(\Omega)$. Let us define the linear functional $\lambda(v) \equiv (v, f) - A(v, b)$ for all $v \in H^m_0(\Omega)$. Then for all $v \in H^m_0(\Omega)$
\begin{align*}
|\lambda(v)| &\le \|f\|_{H^{-m}(\Omega)} \|v\|_{H^m(\Omega)} + C_a \|b\|_{H^m(\Omega)} \|v\|_{H^m(\Omega)} \\
&\le \frac{\|f\|_{H^{-m}(\Omega)} + C_a \|b\|_{H^m(\Omega)}}{\sqrt{C_A}} \sqrt{A(v, v)} .
\end{align*}
This shows that $\lambda$ is a bounded linear functional on $H^m_0(\Omega)$, in the norm generated by the inner product $A$. The Riesz Representation Theorem 4.4-12 now shows that there exists a unique $w \in H^m_0(\Omega)$ satisfying
\[ A(v, w) = \lambda(v) \]
for all $v \in H^m_0(\Omega)$, and that
\[ \sqrt{A(w, w)} = \|\lambda\|_A \equiv \sup_{v \in H^m_0(\Omega), \, v \ne 0} \frac{|\lambda(v)|}{\sqrt{A(v, v)}} \le \frac{\|f\|_{H^{-m}(\Omega)} + C_a \|b\|_{H^m(\Omega)}}{\sqrt{C_A}} . \]
Next, we use the coercivity of $A$ to obtain
\[ \|w\|_{H^m(\Omega)} \le \frac{1}{\sqrt{C_A}} \sqrt{A(w, w)} \le \left( \|f\|_{H^{-m}(\Omega)} + C_a \|b\|_{H^m(\Omega)} \right) \frac{1}{C_A} . \]
Finally, we let $u = w + b$ to prove continuous dependence on the data.

Next, let $w \in b + H^m_0(\Omega)$. Since $u$ solves the Dirichlet problem and $w - u \in H^m_0(\Omega)$, $A(w - u, u) = \lambda(w - u) = (w - u, f) - A(w - u, b)$. It follows that
\begin{align*}
E(w) &= E(u + [w - u]) \\
&= \frac12 A(u + [w - u], u + [w - u]) - (u + [w - u], f) + A(u + [w - u], b) \\
&= \left\{ \frac12 A(u, u) - (u, f) + A(u, b) \right\} + \left\{ A(w - u, u) - (w - u, f) + A(w - u, b) \right\} + \frac12 A(w - u, w - u) \\
&= E(u) + \frac12 A(w - u, w - u) .
\end{align*}
Since $A(w - u, w - u) \ge C_A \|w - u\|^2_{H^m(\Omega)} \ge 0$, this shows that $u$ minimizes $E$ over $b + H^m_0(\Omega)$, and completes the proof. □
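The energy identity has a simple finite-dimensional analogue that can be checked directly. The sketch below is an assumption-laden illustration, not the author's program: it replaces the bilinear form by a symmetric positive-definite matrix `K` (here the usual tridiagonal stiffness matrix for $-u''$ on a uniform mesh), takes homogeneous boundary data $b = 0$, and uses hypothetical helpers `matvec`, `dot`, `energy` and `solve`.

```python
def matvec(K, x):
    """Dense matrix-vector product."""
    return [sum(kij * xj for kij, xj in zip(row, x)) for row in K]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def energy(K, f, w):
    """Discrete total energy E(w) = 1/2 w^T K w - f^T w (boundary data b = 0)."""
    return 0.5 * dot(w, matvec(K, w)) - dot(w, f)

def solve(K, f):
    """Gaussian elimination without pivoting (adequate for this SPD matrix)."""
    n = len(f)
    A = [row[:] + [fi] for row, fi in zip(K, f)]
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

n = 5
h = 1.0 / (n + 1)
# stiffness matrix for -u'' with piecewise linear elements on a uniform mesh
K = [[(2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0) / h
      for j in range(n)] for i in range(n)]
f = [h for _ in range(n)]                 # load vector for the right-hand side 1
u = solve(K, f)                           # discrete solution: K u = f
w = [ui + 0.1 * (i + 1) for i, ui in enumerate(u)]   # any other admissible vector
diff = [wi - ui for wi, ui in zip(w, u)]
gap = energy(K, f, w) - energy(K, f, u)   # should equal 1/2 diff^T K diff
```

Here `gap` plays the role of $E(w) - E(u)$; it is nonnegative and agrees with $\frac12 (w - u)^\top K (w - u)$ up to rounding, mirroring the last step of the proof.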

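Example 4.5-4: Suppose we want to solve
\begin{align*}
- \nabla_x \cdot \nabla_x u + a u &= f \quad \forall x \in \Omega \\
u &= b_e \quad \forall x \in \partial\Omega
\end{align*}
where $\Omega$ has width at most $L_\Omega$, and there are constants $\overline{a} \ge \underline{a} \ge 0$ so that for all $x \in \Omega$ we have $a(x) \in [\underline{a}, \overline{a}]$. Given any function $b \in H^1(\Omega)$ so that $b(x) = b_e(x)$ for all $x \in \partial\Omega$, we can find $u \in b + H^1_0(\Omega)$ so that for all $v \in H^1_0(\Omega)$
\[ A(v, u - b) \equiv \int_\Omega \nabla_x v \cdot \nabla_x u + a v u \, dx = \int_\Omega v f \, dx - A(v, b) \equiv \lambda(v) . \]
If $\underline{a}$ is nonnegative, then the Poincaré inequality (lemma 4.4-41)
\[ |v|_{H^0(\Omega)} \le L_\Omega |v|_{H^1(\Omega)} \]
shows that for any $\mu \ge 0$
\begin{align*}
A(v, v) &\ge |v|^2_{H^1(\Omega)} + \underline{a} |v|^2_{H^0(\Omega)} \\
&= \mu |v|^2_{H^1(\Omega)} + (1 - \mu) |v|^2_{H^1(\Omega)} + \underline{a} |v|^2_{W^0_2(\Omega)} \\
&\ge \left[ \underline{a} + \frac{\mu}{L_\Omega^2} \right] |v|^2_{H^0(\Omega)} + (1 - \mu) |v|^2_{H^1(\Omega)} .
\end{align*}
If $\underline{a} > 1$, we can take $\mu = 0$, while if $\underline{a} \le 1$, we can solve $1 - \mu = \underline{a} + \mu / L_\Omega^2$ for $\mu \ge 0$. In either case, we get
\[ A(v, v) \ge \left[ 1 - \frac{L_\Omega^2}{1 + L_\Omega^2} \max\{ 1 - \underline{a}, 0 \} \right] \|v\|^2_{H^1(\Omega)} \equiv C_A \|v\|^2_{H^1(\Omega)} , \]
so $A$ is coercive. It follows from the previous lemma that the boundary value problem has a unique solution $u \in H^1(\Omega)$ depending continuously on its data. Since the function $b \in H^1(\Omega)$ satisfying the prescribed boundary data was otherwise arbitrary, it follows that
\[ \|u\|_{W^1_2(\Omega)} \le \frac{1}{C_A} \|f\|_{W^{-1}_2(\Omega)} + \left( 1 + \frac{\max\{1, \overline{a}\}}{C_A} \right) \inf_{b \in W^1_2(\Omega) \,:\, b = b_e \mbox{ on } \partial\Omega} \|b\|_{W^1_2(\Omega)} . \]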
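The choice of $\mu$ in this example can be checked with a few lines of arithmetic. The sketch below assumes the lower bound on the reaction coefficient is passed in as `a_low`; the function names are hypothetical. It balances the two coefficients in the last inequality of the coercivity argument and compares the result with the stated constant $C_A$.

```python
def coercivity_constant(L, a_low):
    """C_A = 1 - L^2 / (1 + L^2) * max(1 - a_low, 0), as stated in the example."""
    return 1.0 - L ** 2 / (1.0 + L ** 2) * max(1.0 - a_low, 0.0)

def balanced_mu(L, a_low):
    """mu >= 0 solving 1 - mu = a_low + mu / L^2 when a_low <= 1, otherwise mu = 0."""
    if a_low > 1.0:
        return 0.0
    return L ** 2 * (1.0 - a_low) / (1.0 + L ** 2)

def lower_bound(L, a_low, mu):
    """Smaller of the coefficients multiplying |v|_{H^0}^2 and |v|_{H^1}^2."""
    return min(a_low + mu / L ** 2, 1.0 - mu)
```

For example, with `L = 2.0` and `a_low = 0.3` the balanced value is `mu = 0.56`, and both coefficients equal `0.44`, which is also the value returned by `coercivity_constant(2.0, 0.3)`.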

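Example 4.5-5: This problem does not involve a Dirichlet boundary condition. Suppose that we want to solve
\begin{align*}
- \nabla_x \cdot \nabla_x u + a u &= f \quad \forall x \in \Omega \\
n \cdot \nabla_x u + \alpha u &= b_n \quad \forall x \in \partial\Omega .
\end{align*}
Here we assume that there is a constant $\underline{a} > 0$ so that for all $x \in \Omega$ we have $a(x) \ge \underline{a}$. We also assume that for all $x \in \partial\Omega$ we have $\alpha(x) \ge 0$. The weak form of this problem is to find $u \in H^1(\Omega)$ so that for all $v \in H^1(\Omega)$ we have
\[ A(v, u) \equiv \int_\Omega \nabla_x v \cdot \nabla_x u + a \, v \, u \, dx + \int_{\partial\Omega} v \alpha u \, ds = \int_\Omega v f \, dx + \int_{\partial\Omega} v b_n \, ds \equiv \lambda(v) . \]
Unlike the previous lemma, we do not require $v \in H^1_0(\Omega)$, because we have a boundary condition involving a normal derivative. Note that
\[ A(v, v) = \int_\Omega \nabla_x v \cdot \nabla_x v + a \, v^2 \, dx + \int_{\partial\Omega} \alpha v^2 \, ds \ge \min\{1, \underline{a}\} \|v\|^2_{H^1(\Omega)} \equiv C_A \|v\|^2_{H^1(\Omega)} , \]
so $A$ is coercive. It follows from the proof of the previous lemma that the boundary value problem has a unique solution $u \in H^1(\Omega)$ such that
\[ \|u\|_{H^1(\Omega)} \le \frac{1}{C_A} \sup_{v \in H^1(\Omega), \, v \ne 0} \frac{ \left| \int_\Omega v f \, dx + \int_{\partial\Omega} v b_n \, ds \right| }{ \|v\|_{H^1(\Omega)} } . \]
Recall that the trace theorem 4.4-34 implies that there exists a constant $C_{\partial\Omega}$ so that for all $v \in H^1(\Omega)$
\[ \|v\|_{H^{1/2}(\partial\Omega)} \le C_{\partial\Omega} \|v\|_{H^1(\Omega)} . \]
It follows from the definition of the negative Sobolev norms that
\[ \|u\|_{H^1(\Omega)} \le \frac{1}{\min\{1, \underline{a}\}} \left[ \|f\|_{H^{-1}(\Omega)} + \|b_n\|_{H^{-\frac12}(\partial\Omega)} \right] . \]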

The next lemma allows us to handle some differential equations that are not self-adjoint.

Lemma 4.5-6: Let $\Omega \subset \mathbb{R}^d$ be open, and let $A(x, D)$ be an elliptic operator with associated bilinear form
\[ A(v, u) = \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha,\beta} D^\alpha v \, D^\beta u \, dx . \]
Suppose that there is a constant $C_a > 0$ so that for all $x \in \Omega$ and for all $|\alpha|, |\beta| \le m$ we have $|a_{\alpha,\beta}(x)| \le C_a$. Also suppose that the bilinear form $A$ is coercive with coercivity constant $C_A$. If $f \in H^{-m}(\Omega)$ and $b \in H^m(\Omega)$, then the Dirichlet problem to find $u \in b + H^m_0(\Omega)$ so that for all $v \in H^m_0(\Omega)$
\[ A(v, u - b) = (v, f) - A(v, b) \]
has a unique solution satisfying
\[ \|u\|_{H^m(\Omega)} \le \frac{1}{C_A} \left( \|f\|_{H^{-m}(\Omega)} + C_a \|b\|_{H^m(\Omega)} \right) . \tag{4.5-1} \]

Proof: This proof is similar to that of the previous lemma, but it uses the Lax-Milgram theorem 4.4-15, which does not require symmetry of the bilinear form. □


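Example 4.5-7: Suppose $\Omega \subset \mathbb{R}^d$ has width $L_\Omega$, and that we want to solve the steady-state diffusion-convection-reaction problem
\begin{align*}
- \nabla_x \cdot D \nabla_x u + \mathbf{v} \cdot \nabla_x u + r u &= f \quad \forall x \in \Omega \\
u &= 0 \quad \forall x \in \partial\Omega .
\end{align*}
Initially, we assume that $D > 0$ and $\mathbf{v} \ne 0$. Here $\mathbf{v}$ denotes the convection velocity, and $v$ denotes a test function. The weak form of this problem is to find $u \in H^1_0(\Omega)$ so that for all $v \in H^1_0(\Omega)$
\[ A(v, u) \equiv \int_\Omega \nabla_x v \cdot D \nabla_x u + v \, \mathbf{v} \cdot \nabla_x u + v r u \, dx = \int_\Omega v f \, dx \equiv (v, f) . \]
In particular, the associated quadratic form is
\[ A(w, w) = \int_\Omega \begin{bmatrix} \nabla_x w^\top & w \end{bmatrix} \begin{bmatrix} I D & \mathbf{v}/2 \\ \mathbf{v}^\top/2 & r \end{bmatrix} \begin{bmatrix} \nabla_x w \\ w \end{bmatrix} dx . \]
Let us search for a coercivity constant for $A$. For any $\mu \in [0, 1]$ the Poincaré inequality implies that
\begin{align*}
A(w, w) &= \int_\Omega \mu \nabla_x w \cdot D \nabla_x w \, dx + \int_\Omega (1 - \mu) \nabla_x w \cdot D \nabla_x w + w \, \mathbf{v} \cdot \nabla_x w + r w^2 \, dx \\
&\ge \frac{\mu D}{L_\Omega^2} \int_\Omega w^2 \, dx + \int_\Omega (1 - \mu) \nabla_x w \cdot D \nabla_x w + w \, \mathbf{v} \cdot \nabla_x w + r w^2 \, dx \\
&= \int_\Omega \begin{bmatrix} L_\Omega \nabla_x w^\top & w \end{bmatrix} \begin{bmatrix} I D (1 - \mu) / L_\Omega^2 & \mathbf{v} / (2 L_\Omega) \\ \mathbf{v}^\top / (2 L_\Omega) & r + \mu D / L_\Omega^2 \end{bmatrix} \begin{bmatrix} L_\Omega \nabla_x w \\ w \end{bmatrix} dx .
\end{align*}
The coercivity constant for $A$ is given by any lower bound on the smallest eigenvalue of the matrix in this quadratic form.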

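Consider the eigenvector problem
\[ \begin{bmatrix} I \alpha & b \\ b^\top & \gamma \end{bmatrix} \begin{bmatrix} y \\ \eta \end{bmatrix} = \begin{bmatrix} y \\ \eta \end{bmatrix} \lambda , \]
where $\alpha > 0$, $\gamma > 0$ and $b \ne 0$. If the first block component $y$ of the eigenvector is orthogonal to $b$, then $\eta = 0$ and $b^\top y = 0$. This gives us one eigenvalue $\lambda = \alpha$, with multiplicity one less than the dimension of $b$. On the other hand, if $y = b$ then the first equation implies that $\eta = \lambda - \alpha$, and the second equation implies that $b^\top b + \gamma \eta = \eta \lambda$. Putting these equations together gives us a quadratic equation for the eigenvalue $\lambda$:
\[ 0 = \lambda^2 - \lambda (\alpha + \gamma) + \alpha \gamma - b^\top b . \]
The solutions of this quadratic are
\[ \lambda = \frac12 \left[ \alpha + \gamma \pm \sqrt{ (\alpha - \gamma)^2 + 4 b^\top b } \right] . \]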
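The eigenvalue formula for the bordered matrix can be verified by substituting the eigenvectors back into the matrix. The sketch below is a hedged illustration with hypothetical helper names and arbitrarily chosen sample values of $\alpha$, $\gamma$ and $b$; it builds the block matrix, forms the eigenvector with first block $b$ and last entry $\lambda - \alpha$, and measures the residual.

```python
import math

def block_matrix(alpha, b, gamma):
    """Assemble [[alpha*I, b], [b^T, gamma]] as a dense list of rows."""
    d = len(b)
    M = [[alpha if i == j else 0.0 for j in range(d)] + [b[i]] for i in range(d)]
    M.append(list(b) + [gamma])
    return M

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def extreme_eigenvalues(alpha, b, gamma):
    """Roots of lambda^2 - lambda (alpha + gamma) + alpha gamma - b^T b = 0."""
    bb = sum(bi * bi for bi in b)
    root = math.sqrt((alpha - gamma) ** 2 + 4.0 * bb)
    return 0.5 * (alpha + gamma - root), 0.5 * (alpha + gamma + root)

def residuals(alpha, b, gamma):
    """Max-norm of M v - lambda v for the two eigenpairs with y = b."""
    M = block_matrix(alpha, b, gamma)
    out = []
    for lam in extreme_eigenvalues(alpha, b, gamma):
        vec = list(b) + [lam - alpha]        # y = b, eta = lambda - alpha
        Mv = matvec(M, vec)
        out.append(max(abs(mi - lam * vi) for mi, vi in zip(Mv, vec)))
    return out

alpha, gamma, b = 2.0, 1.0, [0.3, -0.4]      # sample values, not from the text
lam_min, lam_max = extreme_eigenvalues(alpha, b, gamma)
```

With these sample values $b^\top b = 0.25 < \alpha \gamma = 2$, so the smaller root `lam_min` is positive; enlarging $b$ until $b^\top b > \alpha \gamma$ makes it negative.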


Thus the smaller of these two eigenvalues is maximized when $\gamma = \alpha$, but this eigenvalue is always smaller than $\alpha$. Also, the smaller eigenvalue is positive when
\[ b^\top b < \alpha \gamma . \]
In our quadratic form for the Dirichlet problem, we have $\alpha = (1 - \mu) D / L_\Omega^2$, $b = \mathbf{v} / (2 L_\Omega)$ and $\gamma = r + \mu D / L_\Omega^2$. Choosing $\gamma = \alpha$ to maximize the smallest eigenvalue leads to
\[ \mu = \frac12 \left( 1 - \frac{L_\Omega^2 r}{D} \right) . \]
Since we have required $\mu \in [0, 1]$, this choice is permissible if and only if $-D / L_\Omega^2 \le r \le D / L_\Omega^2$. In this case, the smallest eigenvalue of the quadratic form will be
\[ \lambda = \alpha - \|b\| = \frac12 \left[ \frac{D}{L_\Omega^2} + r - \frac{\|\mathbf{v}\|}{L_\Omega} \right] . \]
In order that the smallest eigenvalue of the quadratic form be positive, we would require
\[ \|\mathbf{v}\| < \frac{D}{L_\Omega} + r L_\Omega . \]
On the other hand, if $r \ge D / L_\Omega^2$, then we must choose $\mu = 0$, which implies that the product of the two eigenvalues of the quadratic form is $D r / L_\Omega^2 - \|\mathbf{v}\|^2 / (4 L_\Omega^2)$. Requiring this product to be positive leads to
\[ \|\mathbf{v}\| < 2 \sqrt{r D} , \]
in which case the quadratic form will be coercive. If $r < -D / L_\Omega^2$, then we would have to take $\mu = 1$, which would imply that $\alpha = 0$ and cause the quadratic form to be indefinite.

In summary, if the inequalities $D > 0$, $-D / L_\Omega^2 < r$ and
\[ \|\mathbf{v}\| < \begin{cases} 2 \sqrt{D r} , & r \ge D / L_\Omega^2 \\ \dfrac{D}{L_\Omega} + r L_\Omega , & -D / L_\Omega^2 < r \le D / L_\Omega^2 \end{cases} \]
are satisfied, then the quadratic form will be positive-definite. Slight modifications of these inequalities will produce a coercivity condition. The previous lemma will then guarantee the existence of the solution to the problem. Otherwise, we may not be able to guarantee the existence and uniqueness of the solution.

4.5.3 Higher-Order Dependence on the Data

Next, let us discuss higher-order dependence on the data.


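Definition 4.5-8: The bilinear form
\[ B(v, u) \equiv \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha,\beta}(x) D^\alpha v \, D^\beta u \, dx \]
is right $j$-smooth iff

1. $a_{\alpha,\beta}$ is bounded in $\Omega$ for all $|\alpha|, |\beta| \le m$, and

2. $a_{\alpha,\beta} \in C^{|\alpha| + j - m}(\Omega)$ for all $|\beta| \le m$ and for all $|\alpha| > m - j$.

Definition 4.5-9: Suppose that $\Omega \subset \mathbb{R}^d$ is open and bounded. For an integer $k \ge 1$, $\Omega$ is of class $C^k$ if and only if for all $x \in \partial\Omega$ there exists an open set $\Theta$ containing $x$ and there exists a function $f_\Theta : \Omega \cap \Theta \to$ a hemisphere, with $f_\Theta \in C^k(\Theta)$ and points in $\partial\Omega \cap \Theta$ being mapped to the flat part of the hemisphere.

Lemma 4.5-10: [3, p. 128]. If $\Omega$ is of class $C^k$ for any $k \ge 1$, then $\Omega$ satisfies the strong cone property.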


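Lemma 4.5-11: [3, p. 129] Suppose that $\Omega \subset \mathbb{R}^d$ is open, bounded, and of class $C^{2m}$. Next, suppose that the bilinear form
\[ A(v, u) \equiv \sum_{|\alpha|, |\beta| \le m} \int_\Omega a_{\alpha,\beta}(x) D^\alpha v \, D^\beta u \, dx \]
satisfies Gårding's inequality (lemma 4.5-1) and is right $j$-smooth for some integer $j \in [1, m]$. Then

1. there exists a constant $C_j > 0$ such that for all $f \in H^0(\Omega)$ and for all $b \in H^{m+j}(\Omega)$ the solution $u \in b + H^m_0(\Omega)$ of the weak elliptic problem
\[ \forall v \in H^m_0(\Omega) \quad A(v, u) = (v, f) \]
satisfies
\[ \|u\|_{H^{m+j}(\Omega)} \le C_j \left\{ \|f\|_{H^0(\Omega)} + \|b\|_{H^{m+j}(\Omega)} + \|u\|_{H^0(\Omega)} \right\} . \]

2. If $k \ge 0$, $\Omega$ is of class $C^{2m+k}$ and $A$ is right $(m + k)$-smooth, then there exists a constant $C_k > 0$ such that for all $f \in H^k(\Omega)$ and for all $b \in H^{2m+k}(\Omega)$ the solution $u \in b + H^{2m+k}_0(\Omega)$ of the weak elliptic problem
\[ \forall v \in H^m_0(\Omega) \quad A(v, u) = (v, f) \]
satisfies
\[ \|u\|_{H^{2m+k}(\Omega)} \le C_k \left\{ \|f\|_{H^k(\Omega)} + \|b\|_{H^{2m+k}(\Omega)} + \|u\|_{H^0(\Omega)} \right\} \]
and
\[ \inf_{z \in N(A)} \|u + z\|_{H^{2m+k}(\Omega)} \le C_k \left\{ \|f\|_{H^k(\Omega)} + \|b\|_{H^{2m+k}(\Omega)} \right\} . \]

Next, let us remark that the smoothness of the boundary of $\Omega$ can affect the smoothness of the solution of Dirichlet problems.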


Lemma 4.5-12: Suppose that $\Omega \subset \mathbb{R}^2$ and $\partial\Omega$ has a corner of angle $0 < \alpha < 2\pi$ with $\alpha \ne \pi$. Then for all $j \in (1, \pi/\alpha)$ there exists a constant $C > 0$ such that for all $f \in H^{j-2}(\Omega)$ the solution $u \in H^j_0(\Omega)$ of
\[ \int_\Omega \nabla_x v \cdot \nabla_x u \, dx = \int_\Omega v f \, dx \quad \forall v \in H^j_0(\Omega) \]
satisfies
\[ \|u\|_{W^j_2(\Omega)} \le C \|f\|_{W^{j-2}_2(\Omega)} . \]
This and more general results can be found in [45].

Example 4.5-13: A rectangular domain has corners with angle $\alpha = \pi/2$. In general, solutions of Laplace's equation in rectangular domains are in $W^{2-\varepsilon}_{0,2}(\Omega)$ for any $\varepsilon > 0$, no matter how smooth $f$ may be.

4.6 Piecewise Polynomial Approximation

4.6.1 Bramble-Hilbert Lemma

In section 4.2, we discussed an example of the finite element method for a particular partial differential equation. In that example, we used piecewise polynomials to approximate the solution of the partial differential equation. In this section, we will discuss fairly general results regarding the errors in piecewise polynomial approximation.

We begin with a technical lemma.


Lemma 4.6-1: Let $k > 0$ be an integer, and define the sets of $d$-dimensional multi-indices
\[ L_k = \{ \alpha : |\alpha| = k = \max_i \alpha_i \} \quad \mbox{and} \quad M_k = \{ \alpha : |\alpha| = k \} . \]
Let $K$ be any set of multi-indices so that $L_k \subset K \subset M_k$, and let
\[ P_K = \{ q \in P_k : D^\tau q = 0 \mbox{ for all } \tau \in K \} . \]
Then for all $u \in W^k_p(\Omega)$ there exists a polynomial $q \in P_K$ such that for all $|\gamma| < k$ we have
\[ (1, D^\gamma (u + q)) = 0 . \tag{4.6-1} \]
If, in addition, we have that $M_k \setminus K = \{ \gamma(1), \ldots, \gamma(s) \} \ne \emptyset$ then define $K(0) \equiv K$ and for $j \in [1, s]$ we define $K(j) \equiv K \cup \{ \gamma(1), \ldots, \gamma(j) \}$. Then $q$ can be chosen so that for all $j \in [1, s]$ and for all $r \in P_{K(j-1)}$ we have
\[ (D^{\gamma(j)} r, D^{\gamma(j)} (u + q)) = 0 . \tag{4.6-2} \]

Proof: First, suppose that $M_k \setminus K = \{ \gamma(1), \ldots, \gamma(s) \}$ is non-empty. For each $j \in [1, s]$, define the set of polynomials $P_{K(j)}$ as above. Since $K(j-1) \subset K(j)$, it follows that $P_{K(j)} \subset P_{K(j-1)}$.

Let
\[ P_{\gamma(j)} = \{ D^{\gamma(j)} q : q \in P_{K(j-1)} \} . \]
Note that since $P_{K(j)} \subset P_{K(j-1)}$, it follows that $P_{\gamma(j)} \subset P_{\gamma(j-1)}$.

Let $\{ r_\ell \}$ be an orthonormal basis for $P_{\gamma(1)}$. Let $q_0 \in P_{K(0)}$ be such that
\[ D^{\gamma(1)} q_0 = \sum_\ell r_\ell \, (r_\ell, D^{\gamma(1)} u) . \]
It follows that for all $\ell$
\[ (r_\ell, D^{\gamma(1)} (u - q_0)) = 0 . \]
Since $\{ r_\ell \}$ is a basis for $P_{\gamma(1)}$, we have that for all $r \in P_{\gamma(1)}$,
\[ (r, D^{\gamma(1)} (u - q_0)) = 0 . \]
Arguing in the same way, we can find $q_1 \in P_{K(1)}$ so that for all $r \in P_{\gamma(2)}$ we have
\[ (r, D^{\gamma(2)} (u - q_0 - q_1)) = 0 . \]
Note that by the definition of $P_{K(1)}$ we have $D^{\gamma(1)} q_1 = 0$, so for all $r \in P_{\gamma(1)}$ we have
\[ (r, D^{\gamma(1)} (u - q_0 - q_1)) = 0 . \]
Continuing on in this way, we can find $q \in P_K$ so that for all $j \in [1, s]$ and for all $r \in P_{K(j-1)}$ we have
\[ (D^{\gamma(j)} r, D^{\gamma(j)} (u - q)) = 0 . \]
If $K = M_k$, we take $q = 0$.

Next, for all $|\alpha| = k - 1$, let
\[ c_\alpha = (1, D^\alpha (u - q)) / \alpha! . \]
Then for all $|\beta| = k - 1$ we have
\[ \left( 1, D^\beta \left\{ u - q - \sum_{|\alpha| = k-1} c_\alpha x^\alpha \right\} \right) = 0 . \]
Since $D^{\gamma(1)} \sum_{|\alpha| = k-1} c_\alpha x^\alpha = 0$, the modified polynomial still satisfies (4.6-2). We can continue on to lower multi-index lengths to find a polynomial that also satisfies (4.6-1). □

The following result will be the basis for our estimates of the error in piecewise polynomial approximation.

(Bramble-Hilbert) Lemma 4.6-2: [15]. Suppose that $\Omega \subset \mathbb{R}^d$ is open, bounded and satisfies the strong cone condition. Given an integer $k > 0$, suppose that $K$ is a set of multi-indices such that
\[ \{ \alpha : |\alpha| = k \mbox{ and } \max_i \alpha_i = k \} \subset K \subset \{ \alpha : |\alpha| = k \} . \]
Define the following set of polynomials
\[ P_K = \{ v : D^\alpha v = 0 \mbox{ for all } \alpha \in K \} . \]
Then for all $p \in [1, \infty)$ there is a constant $C_{K,p} > 0$ so that for all $u \in W^k_p(\Omega)$
\[ \sum_{\tau \in K} \|D^\tau u\|_{L^p(\Omega)} \le \inf_{v \in P_K} \|u - v\|_{W^k_p(\Omega)} \le C_{K,p} \sum_{\tau \in K} \|D^\tau u\|_{L^p(\Omega)} . \]


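Proof: We will prove the left-hand inequality first. Since $D^\tau q = 0$ for all $q \in P_K$ and for all $\tau \in K$, it follows that for all $u \in W^k_p(\Omega)$ and all $q \in P_K$
\[ \sum_{\tau \in K} \|D^\tau u\|_{L^p(\Omega)} = \sum_{\tau \in K} \|D^\tau (u + q)\|_{L^p(\Omega)} \le \|u + q\|_{W^k_p(\Omega)} . \]
Taking the infimum over $q \in P_K$ gives us the left-hand inequality in lemma 4.6-2.

Suppose that $u \in W^k_p(\Omega)$. Then lemma 4.6-1 implies that there is a polynomial $q \in P_K$ so that (4.6-2) and (4.6-1) hold. We may replace $u$ by $u + q$ and assume that
\[ \forall j \in [1, s] \ \forall r \in P_{K(j-1)} \quad (D^{\gamma(j)} r, D^{\gamma(j)} u) = 0 \tag{4.6-3} \]
and
\[ \forall |\gamma| < k \quad (1, D^\gamma u) = 0 . \tag{4.6-4} \]
All that remains is to prove the right-hand inequality. We will prove it by contradiction. If this inequality is false, then there is a sequence $\{ u_n \}$ so that
\begin{align*}
& \forall j \in [1, s] \ \forall r \in P_{K(j-1)} \quad (D^{\gamma(j)} r, D^{\gamma(j)} u_n) = 0 \tag{4.6-5} \\
& \forall |\gamma| < k \quad (1, D^\gamma u_n) = 0 \tag{4.6-6} \\
& \|u_n\|_{W^k_p(\Omega)} = 1 \\
& \forall \tau \in K \quad \lim_{n \to \infty} \|D^\tau u_n\|_{L^p(\Omega)} = 0 \tag{4.6-7}
\end{align*}
Then Rellich's lemma (proved for $p = 2$ in [3, p. 30]) implies that there is a subsequence that converges in $W^{k-1}_p(\Omega)$. We will also denote this subsequence by $\{ u_n \}$. Next, lemma 4.4-26 implies that $u_n \to \tilde{u}$ in $W^k_p(\Omega)$, where $\tilde{u}$ satisfies (4.6-5) and (4.6-6).

If $K = M_k$, then (4.6-7) implies that $D^\gamma \tilde{u} = 0$ almost everywhere, for all $|\gamma| = k$. If $K \ne M_k$, we will show by induction that this result is still true.

If $K \ne M_k$, we claim that $D^{\gamma(1)} \tilde{u} = 0$. For any $\tau \in K(0) = K$, let
\[ \beta_\tau + \gamma(1) = \beta^*_\tau + \tau \]
where $|\beta_\tau|$ is minimal. Then for all $\phi \in C^\infty_0(\Omega)$ equation (4.6-7) implies that
\[ (D^{\beta_\tau} \phi, D^{\gamma(1)} \tilde{u}) = (D^{\beta^*_\tau} \phi, D^\tau \tilde{u}) = 0 . \]
Since $D^{\gamma(1)} \tilde{u} \in N(D^{\beta_\tau})$ for all $\tau \in K(0) = K$, it follows that $D^{\gamma(1)} \tilde{u}$ is a polynomial, namely
\[ D^{\gamma(1)} \tilde{u} = \sum_{\beta \,:\, \forall \tau \in K(0) \ \beta \not\ge \beta_\tau} a_\beta x^\beta . \]
Let $\beta$ be one such multi-index in this sum. Then for all $\tau \in K(0)$ there is an index $j$ such that $\beta_j < (\beta_\tau)_j$. Since $|\beta_\tau|$ is minimal, it follows that $(\beta_\tau)_j > 0$ and $(\beta^*_\tau)_j = 0$. Thus $(\beta_\tau)_j = \tau_j - \gamma(1)_j$, and $\beta_j < \tau_j - \gamma(1)_j$. Thus
\begin{align*}
\sum_{\beta \,:\, \forall \tau \in K(0) \ \beta \not\ge \beta_\tau} a_\beta x^\beta
&= D^{\gamma(1)} \sum_{\beta \,:\, \forall \tau \in K(0) \ \beta \not\ge \beta_\tau} a_\beta \frac{\beta!}{(\beta + \gamma(1))!} x^{\beta + \gamma(1)} \\
&= D^{\gamma(1)} \sum_{\alpha \,:\, \forall \tau \in K(0) \ \alpha \not\ge \tau} a_{\alpha - \gamma(1)} \frac{(\alpha - \gamma(1))!}{\alpha!} x^\alpha .
\end{align*}
In other words, $D^{\gamma(1)} \tilde{u} = D^{\gamma(1)} q$ for some $q \in P_{K(0)}$. It follows that $D^{\gamma(1)} \tilde{u}$ is almost everywhere equal to an element of $P_{\gamma(1)}$. But then (4.6-5) implies that $D^{\gamma(1)} \tilde{u} = 0$ almost everywhere. We can continue on for other $j \in [1, s]$ to see that for all $|\gamma| = k$ we have $D^\gamma \tilde{u} = 0$ almost everywhere.

Next, let $|\beta| = k - 1$, $|\alpha| = 1$ and $\phi \in C^\infty_0(\Omega)$. Then
\[ (D^\alpha \phi, D^\beta \tilde{u}) = (\phi, D^{\alpha + \beta} \tilde{u}) = 0 , \]
so $D^\beta \tilde{u}$ is constant almost everywhere. Then (4.6-6) implies that $D^\beta \tilde{u} = 0$ almost everywhere. Continuing on in this way, we see that $D^\gamma \tilde{u} = 0$ almost everywhere for all $|\gamma| \le k$. It follows that $\|\tilde{u}\|_{W^k_p(\Omega)} = 0$, which contradicts the assumption that $\|u_n\|_{W^k_p(\Omega)} = 1$ for all $n$ and that $u_n \to \tilde{u}$ in $W^k_p(\Omega)$. This contradiction proves the right-hand inequality in lemma 4.6-2. □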

The following definition will assist our application of the Bramble-Hilbert lemma.

Definition 4.6-3: The function $F : W^k_p(\Omega) \to \mathbb{R}$ is a bounded sublinear functional on $W^k_p(\Omega)$ if and only if it is nonnegative
\[ F(u) \ge 0 \mbox{ for all } u \in W^k_p(\Omega) , \]
sublinear
\[ F(u \alpha + v \beta) \le F(u) |\alpha| + F(v) |\beta| \mbox{ for all } u, v \in W^k_p(\Omega) \mbox{ and for all } \alpha, \beta \in \mathbb{R} , \]
and bounded
\[ \mbox{there exists } C_F > 0 \mbox{ such that for all } u \in W^k_p(\Omega) \quad F(u) \le C_F \|u\|_{W^k_p(\Omega)} . \]

Lemma 4.6-4: Suppose that the hypotheses of the Bramble-Hilbert lemma are satisfied, and that $F$ is a bounded sublinear functional on $W^k_p(\Omega)$ such that $F(p) = 0$ for all $p \in P_K$. Then for all $u \in W^k_p(\Omega)$
\[ F(u) \le C_F C_{K,p} \sum_{\alpha \in K} \|D^\alpha u\|_{W^0_p(\Omega)} . \]

Proof: If $u \in W^k_p(\Omega)$, then the definition of a bounded sublinear functional implies that for all $p \in P_K$
\[ F(u) = F([u - p] + p) \le F(u - p) + F(p) = F(u - p) \le C_F \|u - p\|_{W^k_p(\Omega)} . \]
Since $p$ was arbitrary, the Bramble-Hilbert lemma now implies the final result. □

Here is an example of bilinear approximation on a square.

Example 4.6-5: Let $\Omega = \{ x \in \mathbb{R}^2 : 0 < x_1, x_2 < 1 \}$ be the unit square, and let
\[ K = \left\{ \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \begin{bmatrix} 0 \\ 2 \end{bmatrix} \right\} . \]
Then the Bramble-Hilbert lemma takes
\[ P_K = \{ a + b x_1 + c x_2 + d x_1 x_2 : a, b, c, d \in \mathbb{R} \} . \]
Given $u \in W^2_2(\Omega)$, the Sobolev Imbedding Theorem 4.4-23 shows that $u$ is continuous and that the values at the corners of $\Omega$ are meaningful. Let $\tilde{u} \in P_K$ be the bilinear interpolant to $u \in W^2_2(\Omega)$:
\begin{align*}
\tilde{u}(x_1, x_2) &= u(0,0) + \{ u(1,0) - u(0,0) \} x_1 + \{ u(0,1) - u(0,0) \} x_2 \\
&\quad + \{ u(1,1) - u(1,0) - u(0,1) + u(0,0) \} x_1 x_2 .
\end{align*}
Now let $F(u) = \|u - \tilde{u}\|_{W^0_2(\Omega)}$. Note that $F(u)$ is nonnegative and sublinear. Sobolev's inequality (4.4-3) shows that $F$ is a bounded sublinear functional on $W^2_2(\Omega)$. Since bilinear interpolation reproduces bilinear functions exactly, we have $\tilde{p} = p$ for all $p \in P_K$. In other words, $F(p) = 0$ for all $p \in P_K$. It follows from lemma 4.6-4 that there is a constant $C > 0$ such that for all $u \in W^2_2(\Omega)$ we have
\[ \|u - \tilde{u}\|_{W^0_2(\Omega)} \le C \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|_{W^0_2(\Omega)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|_{W^0_2(\Omega)} \right\} . \]
Note that the mixed second-order derivative does not appear in the right-hand side of this error estimate.
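The key property used in this example, that bilinear interpolation at the corners reproduces every function in $P_K$, can be checked directly. The following sketch is illustrative only; the function name `bilinear_interpolant` and the sample polynomial are assumptions, not part of the text.

```python
def bilinear_interpolant(u):
    """Return the bilinear interpolant of u at the four corners of the unit square."""
    u00, u10, u01, u11 = u(0.0, 0.0), u(1.0, 0.0), u(0.0, 1.0), u(1.0, 1.0)

    def interp(x1, x2):
        return (u00 + (u10 - u00) * x1 + (u01 - u00) * x2
                + (u11 - u10 - u01 + u00) * x1 * x2)

    return interp

def p(x1, x2):
    # a sample member of P_K: a + b x1 + c x2 + d x1 x2
    return 2.0 - 3.0 * x1 + 0.5 * x2 + 4.0 * x1 * x2

p_tilde = bilinear_interpolant(p)
```

Evaluating `p_tilde` and `p` at any point of the square gives the same value up to rounding, which is the statement $F(p) = 0$ for $p \in P_K$. A function with a nonzero pure second derivative, such as $x_1^2$, is not reproduced.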

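Example 4.6-6: Next, suppose that $\Omega$ is divided into a union of disjoint squares of length $h$. Given $u \in W^2_2(\Omega)$, define $\tilde{u}$ to be the piecewise bilinear interpolant to $u$ at the corners of the squares. Then $\tilde{u} \in C^0(\Omega) \cap W^1_2(\Omega)$. Let $S$ be one of these squares, say $S = (0, h) \times (0, h)$. Define
\[ F_h(u) = \|u - \tilde{u}\|_{W^0_2(S)} . \]
Using the discussion in the previous example, we see that $F_h$ is a bounded sublinear functional. To see how $F_h$ depends on $h$, let
\[ v(y_1, y_2) = u(h y_1, h y_2) \quad \mbox{and} \quad \tilde{v}(y_1, y_2) = \tilde{u}(h y_1, h y_2) , \]
where $y$ lies in the unit square $(0,1) \times (0,1)$. Next, define
\[ F(v) = \|v - \tilde{v}\|_{W^0_2((0,1) \times (0,1))} . \]
An easy change of variables shows that
\[ F_h(u) = \left\{ \int_0^h \int_0^h [u(x) - \tilde{u}(x)]^2 \, dx_1 \, dx_2 \right\}^{1/2} = \left\{ h^2 \int_0^1 \int_0^1 [u(y h) - \tilde{u}(y h)]^2 \, dy_1 \, dy_2 \right\}^{1/2} = h F(v) . \]
Since the previous example showed that there exists a constant $C > 0$ so that for all $v \in W^2_2((0,1) \times (0,1))$
\[ F(v) \le C \left\{ \left\| \frac{\partial^2 v}{\partial y_1^2} \right\|_{W^0_2((0,1) \times (0,1))} + \left\| \frac{\partial^2 v}{\partial y_2^2} \right\|_{W^0_2((0,1) \times (0,1))} \right\} , \]
it follows that
\[ \frac{1}{h} F_h(u) = F(v) \le C \left\{ \left\| \frac{\partial^2 v}{\partial y_1^2} \right\|_{W^0_2((0,1) \times (0,1))} + \left\| \frac{\partial^2 v}{\partial y_2^2} \right\|_{W^0_2((0,1) \times (0,1))} \right\} = C h \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|_{W^0_2(S)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|_{W^0_2(S)} \right\} . \]
In other words,
\[ \|u - \tilde{u}\|_{W^0_2(S)} \le C h^2 \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|_{W^0_2(S)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|_{W^0_2(S)} \right\} . \]
Summing the squares of these estimates over all of the squares $S_i$ implies that
\[ \|u - \tilde{u}\|^2_{W^0_2(\Omega)} \le C^2 h^4 \sum_{S_i} \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|_{W^0_2(S_i)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|_{W^0_2(S_i)} \right\}^2 \le 2 C^2 h^4 \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|^2_{W^0_2(\Omega)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|^2_{W^0_2(\Omega)} \right\} . \]
It is also possible to show that for a possibly different constant $C$,
\[ \|u - \tilde{u}\|_{W^1_2(\Omega)} \le C h \left\{ \left\| \frac{\partial^2 u}{\partial x_1^2} \right\|_{W^0_2(\Omega)} + \left\| \frac{\partial^2 u}{\partial x_2^2} \right\|_{W^0_2(\Omega)} \right\} . \]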
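The $h^2$ behavior of the mean-square interpolation error can be observed numerically. The sketch below is an illustration under stated assumptions: the domain is taken to be the unit square divided into $n \times n$ cells, the test function is $\sin(\pi x_1) \sin(\pi x_2)$, and the error norm is approximated by a tensor-product three-point Gauss rule on each cell. The names `l2_error`, `errors` and `ratios` are hypothetical.

```python
import math

# three-point Gauss-Legendre rule on [-1, 1]: (node, weight) pairs
GAUSS = [(-math.sqrt(0.6), 5.0 / 9.0), (0.0, 8.0 / 9.0), (math.sqrt(0.6), 5.0 / 9.0)]

def l2_error(u, n):
    """Approximate the L2 norm of u minus its piecewise bilinear interpolant on an n-by-n mesh."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x0, y0 = i * h, j * h
            u00, u10 = u(x0, y0), u(x0 + h, y0)
            u01, u11 = u(x0, y0 + h), u(x0 + h, y0 + h)
            for s, ws in GAUSS:
                for t, wt in GAUSS:
                    xi, eta = 0.5 * (1.0 + s), 0.5 * (1.0 + t)   # local coordinates in (0, 1)
                    interp = (u00 + (u10 - u00) * xi + (u01 - u00) * eta
                              + (u11 - u10 - u01 + u00) * xi * eta)
                    err = u(x0 + xi * h, y0 + eta * h) - interp
                    total += 0.25 * ws * wt * h * h * err * err
    return math.sqrt(total)

def u(x1, x2):
    return math.sin(math.pi * x1) * math.sin(math.pi * x2)

errors = [l2_error(u, n) for n in (4, 8, 16)]
ratios = [errors[k] / errors[k + 1] for k in range(2)]
```

Each halving of $h$ should reduce the error by a factor close to 4, consistent with the $C h^2$ bound; the ratios are not expected to be exactly 4 on coarse meshes.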

In order to apply this approach to estimating errors on general polygonal subdivisions of problem domains, we will need to consider some additional details of the polygon. In particular, we will need to understand how the Jacobians of the coordinate transformations from the mesh elements to the canonical elements behave. We will discuss these details further later in the chapter.

Note that if $\partial\Omega$ is a piecewise polynomial of degree $k$, $u = 0$ on $\partial\Omega$, and our piecewise polynomial approximant reproduces polynomials of degree $k$ exactly, then the piecewise polynomial approximant will be zero on all of $\partial\Omega$.

4.6.2 Polynomial Approximation on Star-Shaped Domains

For certain domains, we can construct polynomial approximations to functions in Sobolev spaces, and estimate the errors in the approximations.

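Lemma 4.6-7: Define
\[ \psi(y) \equiv \begin{cases} e^{-1/(1 - \|y\|^2)} , & \|y\| < 1 \\ 0 , & \|y\| \ge 1 \end{cases} \tag{4.6-8} \]
\[ C_\psi \equiv \int_{\|y\| < 1} \psi(y) \, dy \approx \begin{cases} 0.44399382 , & d = 1 \\ 0.46651239 , & d = 2 \\ 0.44108889 , & d = 3 \end{cases} \tag{4.6-9} \]
For any $x_0 \in \mathbb{R}^d$ and any $\rho > 0$ define
\[ \phi_{x_0,\rho}(x) \equiv \frac{1}{C_\psi \rho^d} \psi((x - x_0)/\rho) . \tag{4.6-10} \]
Then
\[ \int_{\mathbb{R}^d} \phi_{x_0,\rho}(x) \, dx = 1 \]
and
\[ \forall x \in \mathbb{R}^d , \quad 0 \le \phi_{x_0,\rho}(x) \le \frac{1}{e C_\psi} \rho^{-d} . \]

Proof: It is easy to see that for all $y$ we have $0 \le \psi(y) \le 1/e$. The bound on $\phi_{x_0,\rho}$ follows easily from this fact. Further,
\[ \int_{\mathbb{R}^d} \phi_{x_0,\rho}(x) \, dx = \int_{\|x - x_0\| < \rho} \frac{\psi((x - x_0)/\rho)}{C_\psi \rho^d} \, dx = \frac{1}{C_\psi} \int_{\|y\| < 1} \psi(y) \, dy = 1 . \]
□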
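The normalizing constant $C_\psi$ can be approximated by reducing the integral over the full unit ball $\|y\| < 1$ to a radial integral. The sketch below is a hedged illustration, assuming composite Simpson quadrature in the radius and the standard formula $2 \pi^{d/2} / \Gamma(d/2)$ for the surface measure of the unit sphere; the helper names are hypothetical.

```python
import math

def psi_radial(r):
    """The cut-off profile exp(-1 / (1 - r^2)) for r < 1, and 0 otherwise."""
    return math.exp(-1.0 / (1.0 - r * r)) if r < 1.0 else 0.0

def simpson(g, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * g(a + i * h)
    return s * h / 3.0

def sphere_area(d):
    """Surface measure of the unit sphere in R^d (equal to 2 when d = 1)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)

def c_psi(d, n=2000):
    """Integral of psi over the whole unit ball in R^d, via polar coordinates."""
    return sphere_area(d) * simpson(lambda r: r ** (d - 1) * psi_radial(r), 0.0, 1.0, n)

def phi(x, x0, rho):
    """The scaled cut-off function phi_{x0,rho}(x) for points given as sequences."""
    d = len(x)
    t = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, x0))) / rho
    return psi_radial(t) / (c_psi(d) * rho ** d)
```

The maximum of `phi` occurs at `x = x0`, where it equals $1 / (e C_\psi \rho^d)$, matching the bound in the lemma.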

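Lemma 4.6-8: Suppose that $m > 0$ is an integer, $x, x_0, z \in \mathbb{R}^d$, $\rho > 0$ and $\phi_{x_0,\rho}$ is the cut-off function defined in equation (4.6-10). For all $u \in C^\infty(\mathbb{R}^d)$ define the averaged Taylor polynomial
\[ P^m_{x_0,\rho} u(x) \equiv \int_{\|y - x_0\| < \rho} \sum_{|\alpha| < m} \frac{1}{\alpha!} D^\alpha u(y) (x - y)^\alpha \phi_{x_0,\rho}(y) \, dy \tag{4.6-11} \]
and the set
\[ S_{x_0,\rho}(x) \equiv \{ x + (y - x) s : 0 < s < 1 \mbox{ and } \|y - x_0\| < \rho \} , \]
which is the convex hull of the point $x$ and the ball $\{ y : \|y - x_0\| < \rho \}$. Then $P^m_{x_0,\rho} u(x)$ is a polynomial of degree at most $m - 1$ in $x$, and for all $|\beta| < m$
\[ D^\beta P^m_{x_0,\rho} u(x) = P^{m - |\beta|}_{x_0,\rho} D^\beta u(x) , \]
and for all $|\beta| \ge m$ we have $D^\beta P^m_{x_0,\rho} u(x) \equiv 0$. Finally,
\[ |u(x) - P^m_{x_0,\rho} u(x)| \le \frac{m}{d e C_\psi} \left( 1 + \frac{\|x - x_0\|}{\rho} \right)^d \sum_{|\alpha| = m} \frac{1}{\alpha!} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z)| \frac{|(x - z)^\alpha|}{\|x - z\|^d} \, dz . \]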

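Proof: First, we show that $P^m_{x_0,\rho} u$ is a polynomial of degree at most $m - 1$. Using the binomial expansion 4.3-2, we see that
\[ P^m_{x_0,\rho} u(x) = \sum_{|\alpha| < m} \sum_{\beta \le \alpha} \frac{1}{\alpha!} \binom{\alpha}{\beta} x^\beta \int_{\|y - x_0\| < \rho} D^\alpha u(y) (-y)^{\alpha - \beta} \phi_{x_0,\rho}(y) \, dy . \]
This is obviously a polynomial of degree at most $m - 1$ in $x$.

Next, let us compute the derivatives of $P^m_{x_0,\rho} u$. If $|\beta| < m$, then for all $x \in \mathbb{R}^d$
\begin{align*}
D^\beta P^m_{x_0,\rho} u(x) &= \int_{\|y - x_0\| < \rho} \sum_{|\alpha| < m, \, \alpha \ge \beta} \frac{1}{(\alpha - \beta)!} D^\alpha u(y) (x - y)^{\alpha - \beta} \phi_{x_0,\rho}(y) \, dy \\
&= \int_{\|y - x_0\| < \rho} \sum_{|\gamma| < m - |\beta|} \frac{1}{\gamma!} D^{\gamma + \beta} u(y) (x - y)^\gamma \phi_{x_0,\rho}(y) \, dy \\
&= P^{m - |\beta|}_{x_0,\rho} D^\beta u(x) .
\end{align*}
If $|\beta| \ge m$ it is obvious that $D^\beta P^m_{x_0,\rho} u(x) \equiv 0$.

Let us prove the final claim. Using Taylor's theorem (4.3-1) and the change of variables $z = x + (y - x) s$, we can estimate
\begin{align*}
\left| u(x) - P^m_{x_0,\rho} u(x) \right|
&= \left| \int_{\|y - x_0\| < \rho} \left\{ u(x) - \sum_{|\alpha| < m} \frac{1}{\alpha!} D^\alpha u(y) (x - y)^\alpha \right\} \phi_{x_0,\rho}(y) \, dy \right| \\
&= \left| \int_{\|y - x_0\| < \rho} \sum_{|\alpha| = m} \frac{m}{\alpha!} (x - y)^\alpha \int_0^1 s^{m-1} D^\alpha u(x + [y - x] s) \, ds \, \phi_{x_0,\rho}(y) \, dy \right| \\
&= \left| \sum_{|\alpha| = m} \frac{m}{\alpha!} \int_{S_{x_0,\rho}(x)} D^\alpha u(z) (x - z)^\alpha \int_{\min\{ s \in [0,1] \,:\, \|x + (z - x)/s - x_0\| < \rho \}}^{1} s^{-1-d} \phi_{x_0,\rho}\left( x + \frac{z - x}{s} \right) ds \, dz \right| \\
&\le m \sum_{|\alpha| = m} \frac{1}{\alpha!} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z) (x - z)^\alpha| \int_{\|z - x\| / (\rho + \|x - x_0\|)}^{1} s^{-1-d} \phi_{x_0,\rho}\left( x + \frac{z - x}{s} \right) ds \, dz \\
&\le \frac{m}{e C_\psi \rho^d} \sum_{|\alpha| = m} \frac{1}{\alpha!} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z) (x - z)^\alpha| \int_{\|z - x\| / (\rho + \|x - x_0\|)}^{1} s^{-1-d} \, ds \, dz \\
&\le \frac{m}{d e C_\psi \rho^d} \sum_{|\alpha| = m} \frac{1}{\alpha!} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z) (x - z)^\alpha| \left[ \left( \frac{\rho + \|x - x_0\|}{\|z - x\|} \right)^d - 1 \right] dz \\
&\le \frac{m}{d e C_\psi} \left( 1 + \frac{\|x - x_0\|}{\rho} \right)^d \sum_{|\alpha| = m} \frac{1}{\alpha!} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z)| \frac{|(x - z)^\alpha|}{\|z - x\|^d} \, dz .
\end{align*}
□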
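One consequence of definition (4.6-11) is that the averaged Taylor polynomial reproduces polynomials of degree less than $m$, because the Taylor expansion about each $y$ is then exact and the weight integrates to one. The sketch below checks this in one dimension. It is only an illustration: polynomials are represented by hypothetical coefficient lists, the integral is approximated by Simpson's rule, and the normalizing constant is computed with the same rule so that the discrete weights sum to one.

```python
import math

def psi(t):
    """Cut-off profile exp(-1 / (1 - t^2)) for |t| < 1, and 0 otherwise."""
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

def simpson(g, a, b, n=400):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * g(a + i * h)
    return s * h / 3.0

def derivative(coeffs):
    """Coefficients of the derivative of sum_k coeffs[k] y^k."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def evaluate(coeffs, y):
    return sum(c * y ** k for k, c in enumerate(coeffs))

def averaged_taylor(coeffs, m, x0, rho, x):
    """Quadrature approximation of P^m_{x0,rho} u(x) for the polynomial u given by coeffs (d = 1)."""
    c_psi = simpson(psi, -1.0, 1.0)
    derivs = [coeffs]
    for _ in range(1, m):
        derivs.append(derivative(derivs[-1]))

    def integrand(y):
        # Taylor polynomial of degree m - 1 about y, evaluated at x
        taylor = sum(evaluate(derivs[k], y) * (x - y) ** k / math.factorial(k)
                     for k in range(m))
        return taylor * psi((y - x0) / rho) / (c_psi * rho)

    return simpson(integrand, x0 - rho, x0 + rho)
```

For instance, `averaged_taylor([1.0, -2.0, 3.0], 3, 0.5, 0.25, 2.0)` should return the value of $1 - 2x + 3x^2$ at $x = 2$ up to rounding, even though the averaging ball $|y - 0.5| < 0.25$ does not contain that point.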


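Lemma 4.6-9: Suppose that $\Omega \subset \{ x : \|x - c\| \le \rho_\Omega \} \subset \mathbb{R}^d$, $|\alpha| > 0$, $p \in [1, \infty]$ and $f \in L^p(\Omega)$. Let $C_d$ be the surface measure of the unit sphere in $\mathbb{R}^d$:
\[ C_d \equiv \begin{cases} 2 , & d = 1 \\ 2\pi , & d = 2 \\ 4\pi , & d = 3 \end{cases} \tag{4.6-12} \]
Then
\[ \left\| \int_\Omega |f(z)| \frac{|(x - z)^\alpha|}{\|x - z\|^d} \, dz \right\|_{L^p(\Omega)} \le C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \|f\|_{L^p(\Omega)} . \]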

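Proof: First, we note that the geometric mean is bounded above by the arithmetic mean [46, p. 17]. Thus for any $y \ge 0$, the Schwarz inequality implies that
\[ y^\alpha = \prod_{i=1}^d y_i^{\alpha_i} \le \left[ \sum_{i=1}^d y_i \frac{\alpha_i}{|\alpha|} \right]^{|\alpha|} \le \left\{ \left[ \sum_{i=1}^d y_i^2 \right]^{1/2} \left[ \sum_{i=1}^d \left( \frac{\alpha_i}{|\alpha|} \right)^2 \right]^{1/2} \right\}^{|\alpha|} \le \left\{ \|y\| \left[ \sum_{i=1}^d \frac{\alpha_i}{|\alpha|} \right]^{1/2} \right\}^{|\alpha|} = \|y\|^{|\alpha|} . \]
To prove the lemma, first consider the case $p = \infty$. Then
\begin{align*}
\operatorname*{ess\,sup}_{x \in \Omega} \int_\Omega |f(z)| \frac{|(x - z)^\alpha|}{\|x - z\|^d} \, dz
&\le \operatorname*{ess\,sup}_{x \in \Omega} \int_\Omega |f(z)| \|x - z\|^{|\alpha| - d} \, dz \le \|f\|_{L^\infty(\Omega)} \int_{\|z - c\| < \rho_\Omega} \|x - z\|^{|\alpha| - d} \, dz \\
&\le \|f\|_{L^\infty(\Omega)} \int_{\|x - z\| < 2 \rho_\Omega} \|x - z\|^{|\alpha| - d} \, dz \\
&= \|f\|_{L^\infty(\Omega)} C_d \int_0^{2 \rho_\Omega} r^{|\alpha| - 1} \, dr = \|f\|_{L^\infty(\Omega)} C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} .
\end{align*}
Next, consider the case $p = 1$:
\[ \int_\Omega \int_\Omega |f(z)| \frac{|(x - z)^\alpha|}{\|x - z\|^d} \, dz \, dx \le \int_\Omega |f(z)| \int_\Omega \|x - z\|^{|\alpha| - d} \, dx \, dz \le C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \int_\Omega |f(z)| \, dz . \]
Finally, suppose that $1 < p < \infty$ and $q = p/(p - 1)$. Then $1/p + 1/q = 1$, and Hölder's inequality implies
\begin{align*}
\int_\Omega \left[ \int_\Omega |f(z)| \frac{|(x - z)^\alpha|}{\|x - z\|^d} \, dz \right]^p dx
&\le \int_\Omega \left[ \int_\Omega \|x - z\|^{(|\alpha| - d)/q} \|x - z\|^{(|\alpha| - d)/p} |f(z)| \, dz \right]^p dx \\
&\le \int_\Omega \left[ \int_\Omega \|x - z\|^{|\alpha| - d} \, dz \right]^{p/q} \left[ \int_\Omega \|x - z\|^{|\alpha| - d} |f(z)|^p \, dz \right] dx \\
&\le \left[ C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \right]^{p/q} \int_\Omega \int_\Omega \|x - z\|^{|\alpha| - d} \, dx \, |f(z)|^p \, dz \\
&\le \left[ C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \right]^{p/q} \left[ C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \right] \int_\Omega |f(z)|^p \, dz = \left[ C_d \frac{(2 \rho_\Omega)^{|\alpha|}}{|\alpha|} \right]^p \|f\|^p_{L^p(\Omega)} .
\end{align*}
□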
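The first step of this proof, the bound $y^\alpha \le \|y\|^{|\alpha|}$ for nonnegative $y$, is easy to spot-check. The following sketch is only an illustration over an assumed small grid of sample points; the function names and the grid are hypothetical.

```python
import itertools
import math

def monomial(y, alpha):
    """y^alpha = product of y_i ** alpha_i for a multi-index alpha."""
    return math.prod(yi ** ai for yi, ai in zip(y, alpha))

def norm(y):
    return math.sqrt(sum(yi * yi for yi in y))

def worst_ratio(d, order, grid=(0.0, 0.3, 1.0, 2.5)):
    """Largest value of y^alpha / ||y||^|alpha| over sample points y >= 0 and all |alpha| = order."""
    worst = 0.0
    for alpha in itertools.product(range(order + 1), repeat=d):
        if sum(alpha) != order:
            continue
        for y in itertools.product(grid, repeat=d):
            if norm(y) == 0.0:
                continue
            worst = max(worst, monomial(y, alpha) / norm(y) ** order)
    return worst
```

The returned ratio should never exceed one; it equals one when $y$ lies along a coordinate axis and $\alpha$ puts all of its weight on that coordinate, which is why the first inequality in the chain cannot be strict in general.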

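Lemma 4.6-10: Suppose that $\Omega \subset \mathbb{R}^d$, and there is a ball $\{ \|x - x_0\| \le \rho \} \subset \Omega$ so that for all $y \in \Omega$ the convex hull $S_{x_0,\rho}(y) = \{ y + (x - y) s : 0 < s < 1 \mbox{ and } \|x - x_0\| < \rho \}$ is contained in $\Omega$. Also suppose that there exists $c \in \mathbb{R}^d$ and $\rho_\Omega > 0$ so that $\Omega \subset \{ x : \|x - c\| < \rho_\Omega \}$. Suppose that $p \in [1, \infty]$ and $u \in W^m_p(\Omega)$. Let $P^m_{x_0,\rho} u$ be the averaged Taylor polynomial of $u$, defined by (4.6-11). Then
\[ \|u - P^m_{x_0,\rho} u\|_{L^p(\Omega)} \le \frac{C_d}{e C_\psi} (2 \rho_\Omega)^m \left( 1 + \frac{2 \rho_\Omega}{\rho} \right)^d |u|_{W^m_p(\Omega)} . \]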

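Proof: If $p = \infty$, then

\[
\begin{aligned}
\operatorname*{ess\,sup}_{x\in\Omega} |u(x) - P^m_{x_0,\rho}u(x)|
&\le \operatorname*{ess\,sup}_{x\in\Omega} \frac{m}{d e C_\psi} \left( 1 + \frac{\|x - x_0\|}{\rho} \right)^d \sum_{|\alpha|=m} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z)| \frac{|(x-z)^\alpha|}{\|x-z\|^d}\,dz \\
&\le \frac{m}{d e C_\psi} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \sum_{|\alpha|=m} \operatorname*{ess\,sup}_{x\in\Omega} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z)|\, \|x-z\|^{|\alpha|-d}\,dz \\
&\le \frac{m}{d e C_\psi} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \sum_{|\alpha|=m} \|D^\alpha u\|_{L^\infty(\Omega)} \frac{C_d}{|\alpha|} (2\rho_\Omega)^{|\alpha|} \\
&= \frac{C_d}{d e C_\psi} (2\rho_\Omega)^m \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d |u|_{W^m_\infty(\Omega)} .
\end{aligned}
\]

Since $d \ge 1$, the claimed result has been proved.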

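Next, suppose that $1 \le p < \infty$. First, we note that for any array of values indexed by $\alpha$, Holder’s inequality implies

\[
\left[ \sum_{|\alpha|=m} w_\alpha \right]^p \le \left[ \sum_{|\alpha|=m} w_\alpha^p \right] \left[ \sum_{|\alpha|=m} 1^q \right]^{p/q} = d^{p/q} \sum_{|\alpha|=m} w_\alpha^p .
\]

Also, since $q = p/(p-1)$, $p/q = p-1$. Then

\[
\begin{aligned}
\|u - P^m_{x_0,\rho}u\|^p_{L^p(\Omega)}
&= \int_\Omega \left| u(x) - P^m_{x_0,\rho}u(x) \right|^p dx \\
&\le \int_\Omega \left[ \frac{m}{d e C_\psi} \left( 1 + \frac{\|x - x_0\|}{\rho} \right)^d \sum_{|\alpha|=m} \int_{S_{x_0,\rho}(x)} |D^\alpha u(z)| \frac{|(x-z)^\alpha|}{\|x-z\|^d}\,dz \right]^p dx \\
&\le \left[ \frac{m}{d e C_\psi} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \int_\Omega \left[ \sum_{|\alpha|=m} \int_\Omega |D^\alpha u(z)|\, \|x-z\|^{|\alpha|-d}\,dz \right]^p dx \\
&\le \frac{1}{d} \left[ \frac{m}{e C_\psi} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \sum_{|\alpha|=m} \int_\Omega \left[ \int_\Omega |D^\alpha u(z)|\, \|x-z\|^{m-d}\,dz \right]^p dx \\
&\le \frac{1}{d} \left[ \frac{m}{e C_\psi} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \sum_{|\alpha|=m} \left[ \frac{C_d}{|\alpha|} (2\rho_\Omega)^{|\alpha|} \right]^p \|D^\alpha u\|^p_{L^p(\Omega)} \\
&= \frac{1}{d} \left[ \frac{C_d}{e C_\psi} (2\rho_\Omega)^m \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p |u|^p_{W^m_p(\Omega)} .
\end{aligned}
\]

Since $d \ge 1$, the claimed result follows. □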

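Lemma 4.6-11: Suppose that $\Omega \subset \mathbb{R}^d$, and there is a ball $\{x : \|x - x_0\| \le \rho\} \subset \Omega$ so that for all $y \in \Omega$ the convex hull $S_{x_0,\rho}(y) = \{y + (x-y)s : 0 < s < 1 \text{ and } \|x - x_0\| < \rho\}$ is contained in $\Omega$. Also suppose that there exists $c \in \mathbb{R}^d$ and $\rho_\Omega > 0$ so that $\Omega \subset \{x : \|x - c\| < \rho_\Omega\}$. Suppose that $p \in [1,\infty]$ and $u \in W^m_p(\Omega)$. Let $P^m_{x_0,\rho}u$ be the averaged Taylor polynomial of $u$, defined by (4.6-11). Then for $1 \le k \le m$

\[
\left| u - P^m_{x_0,\rho}u \right|_{W^k_p(\Omega)} \le \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d |u|_{W^m_p(\Omega)} .
\]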


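Proof: First, suppose that $p = \infty$. Then the proof of the previous lemma shows that

\[
\begin{aligned}
\left| u - P^m_{x_0,\rho}u \right|_{W^k_p(\Omega)}
&= \sum_{|\beta|=k} \operatorname*{ess\,sup}_{x\in\Omega} \left| D^\beta \{ u - P^m_{x_0,\rho}u \}(x) \right| \\
&= \sum_{|\beta|=k} \operatorname*{ess\,sup}_{x\in\Omega} \left| D^\beta u(x) - P^{m-k}_{x_0,\rho} D^\beta u(x) \right| \\
&\le \frac{C_d}{d e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \sum_{|\beta|=k} \left| D^\beta u \right|_{W^{m-k}_\infty(\Omega)} \\
&= \frac{C_d}{d e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \sum_{|\beta|=k} \sum_{|\alpha|=m-k} \operatorname*{ess\,sup}_{x\in\Omega} \left| D^{\alpha+\beta} u(x) \right| \\
&= \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \sum_{|\gamma|=m} \operatorname*{ess\,sup}_{x\in\Omega} |D^\gamma u(x)| \\
&= \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d |u|_{W^m_\infty(\Omega)} .
\end{aligned}
\]

Next, suppose that $1 \le p < \infty$. Then

\[
\begin{aligned}
\left| u - P^m_{x_0,\rho}u \right|^p_{W^k_p(\Omega)}
&= \sum_{|\beta|=k} \int_\Omega \left| D^\beta \{ u - P^m_{x_0,\rho}u \}(x) \right|^p dx \\
&= \sum_{|\beta|=k} \int_\Omega \left| D^\beta u(x) - P^{m-k}_{x_0,\rho} D^\beta u(x) \right|^p dx \\
&= \sum_{|\beta|=k} \left\| D^\beta u - P^{m-k}_{x_0,\rho} D^\beta u \right\|^p_{L^p(\Omega)} \\
&\le \frac{1}{d} \left[ \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \sum_{|\beta|=k} \left| D^\beta u \right|^p_{W^{m-k}_p(\Omega)} \\
&= \frac{1}{d} \left[ \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \sum_{|\beta|=k} \sum_{|\alpha|=m-k} \int_\Omega \left| D^{\alpha+\beta} u(x) \right|^p dx \\
&= \left[ \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p \sum_{|\gamma|=m} \int_\Omega |D^\gamma u(x)|^p\,dx \\
&= \left[ \frac{C_d}{e C_\psi} (2\rho_\Omega)^{m-k} \left( 1 + \frac{2\rho_\Omega}{\rho} \right)^d \right]^p |u|^p_{W^m_p(\Omega)} .
\end{aligned}
\]

□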

The last two lemmas show that if we approximate a function $u \in W^m_p(\Omega)$ by a set of polynomials of degree at most $m-1$, then the $W^k_p(\Omega)$ error for $0 \le k < m$ can be bounded by a dimensionally-dependent constant (independent of $\Omega$) times two factors dependent on the shape of $\Omega$. The first factor involves the radius $\rho_\Omega$ of the smallest circumscribing ball containing $\Omega$, and the second factor involves the ratio of the radius of the circumscribing ball to the radius of the largest ball contained in $\Omega$ such that all convex hulls of that ball with a point in $\Omega$ are also contained in $\Omega$. For convex $\Omega$, the latter is the ratio of the radius of the circumscribing ball to the radius of the inscribed ball. If the latter ratio is bounded, then the $W^k_p(\Omega)$ error in piecewise polynomial approximation is proportional to $\rho_\Omega^{m-k}$. For example, if $\Omega$ is an element in a finite element mesh, then $\rho_\Omega$ would be the mesh width; if all elements had a bounded ratio of circumscribed to inscribed radii, with maximum mesh circumscribing radius $h$, then the $W^k_p(\Omega)$ error in approximating $u \in W^m_p(\Omega)$ by piecewise polynomials of degree at most $m-1$ would be $O(h^{m-k})$.
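The $O(h^{m-k})$ scaling can be observed numerically. The following is a minimal sketch, not taken from the text: it replaces the averaged Taylor polynomial by the piecewise linear interpolant (an assumption made for simplicity, so $m = 2$), uses $u(x) = \sin x$ on $(0,1)$ as a hypothetical test function, and estimates the observed orders for $k = 0$ and $k = 1$ by halving the mesh width.

```python
import math

def interp_errors(n, f=math.sin, df=math.cos):
    """L2 and H1-seminorm errors of piecewise linear interpolation of f on [0, 1]
    with n equal cells, using 3-point Gauss quadrature on each cell."""
    h = 1.0 / n
    nodes = (-math.sqrt(0.6), 0.0, math.sqrt(0.6))   # Gauss nodes on [-1, 1]
    weights = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)
    err0 = err1 = 0.0
    for j in range(n):
        xl, xr = j * h, (j + 1) * h
        slope = (f(xr) - f(xl)) / h
        for t, w in zip(nodes, weights):
            x = 0.5 * (xl + xr) + 0.5 * h * t
            p = f(xl) + slope * (x - xl)              # interpolant value at x
            err0 += 0.5 * h * w * (f(x) - p) ** 2
            err1 += 0.5 * h * w * (df(x) - slope) ** 2
    return math.sqrt(err0), math.sqrt(err1)

e0_coarse, e1_coarse = interp_errors(16)
e0_fine, e1_fine = interp_errors(32)
rate0 = math.log2(e0_coarse / e0_fine)   # observed order for k = 0
rate1 = math.log2(e1_coarse / e1_fine)   # observed order for k = 1
print(rate0, rate1)
```

Under these assumptions the observed orders should be close to $m - k = 2$ and $m - k = 1$, respectively.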

4.7 Galerkin Methods

In the discussion that follows, we can consider the general second-order self-adjoint elliptic Dirichlet problem described in equations (??). That problem involves finding $u \in H^m_0(\Omega)$ so that for all $\delta u \in H^m_0(\Omega)$

\[
A(\delta u, u) \equiv \sum_{|\beta|\le m} \sum_{|\alpha|\le m} \int_\Omega a_{\alpha,\beta}\, D^\alpha \delta u \, D^\beta u \, dx = \lambda(\delta u) = \int_\Omega \delta u \, f \, dx .
\]

We can consider generalizing this problem to different boundary conditions, but the general formulation of such problems is difficult to state. For example, the biharmonic equation $\Delta\Delta u = f$ could involve Dirichlet boundary data describing values for $u$ and $n \cdot \nabla_x u$, or Neumann boundary data prescribing values for $\Delta u$ on the boundary and some combination of the Dirichlet boundary data.

Suppose that we can find a finite-dimensional subspace $\mathcal{V} \subset H^m_0(\Omega)$ with good approximation properties for solutions of the Dirichlet problem. For example, if $\Omega$ is a polygon, we might be able to find piecewise polynomials that have zero Dirichlet data on $\partial\Omega$ and good approximation properties in the interior. In general, our finite-dimensional subspace $\mathcal{V}$ will satisfy the homogeneous essential boundary conditions approximately.


4.7.1 Well-Posedness of Galerkin Equations

A Galerkin approximation to this Dirichlet problem is $\tilde u \in \mathcal{V}$, such that for all $\delta u \in \mathcal{V}$

\[
A(\delta u, \tilde u) = (\delta u, f) . \tag{4.7-1}
\]


Although this statement is very succinct, it raises a number of issues.

Cea’s Lemma 4.7-1: Let $A : H^m(\Omega) \times H^m(\Omega) \to \mathbb{R}$ be bilinear. Let $H^m_e(\Omega)$ be the set of all limits with respect to the $H^m(\Omega)$ norm of $C^\infty$ functions satisfying the essential boundary conditions (i.e., those boundary conditions involving derivatives of order less than $m$). Suppose that $\lambda$ is a bounded linear functional on $H^m_e(\Omega)$. Suppose that $u \in H^m(\Omega)$ is such that $u$ satisfies the essential boundary conditions, and such that for all $v \in H^m_e(\Omega)$ we have

\[ A(v, u) = \lambda(v) . \]

Let $\mathcal{V} \subset H^m_e(\Omega)$ be finite dimensional with basis $\{v_n(x) : 1 \le n \le N\}$. Finally, suppose that there exists $b \in H^m(\Omega)$ and $\tilde u \in b + \mathcal{V}$ so that for all $v \in \mathcal{V}$ we have that $\tilde u$ satisfies the Galerkin equations

\[ A(v, \tilde u) = \lambda(v) . \]

Then

1. Define the stiffness matrix $\mathbf{A}$ and vector $\mathbf{f}$ by

\[ \mathbf{A}_{mn} = A(v_m, v_n) \quad\text{and}\quad \mathbf{f}_m = \lambda(v_m) - A(v_m, b) . \]

Then the Galerkin approximation

\[ \tilde u(x) = b(x) + \sum_{n=1}^N v_n(x)\, \mathbf{u}_n \]

solves the Galerkin equations if and only if $\mathbf{u} \in \mathbb{R}^N$ solves the linear system $\mathbf{A}\mathbf{u} = \mathbf{f}$.

2. If the bilinear form $A$ is coercive (see equation (4.4-1)), then the stiffness matrix is nonsingular.

3. If $A$ is symmetric, then the stiffness matrix is symmetric.

4. If $A$ is symmetric and coercive, then $\tilde u$ minimizes

\[ E(w) \equiv \tfrac12 A(w, w) - \lambda(w) \]

over all $w \in b + \mathcal{V}$, and $E(u) \le E(\tilde u)$.

5. The error $u - \tilde u$ is $A$-orthogonal to $\mathcal{V}$, meaning that for all $v \in \mathcal{V}$,

\[ A(v, u - \tilde u) = 0 . \]

6. If $A$ is coercive and bounded, meaning that

\[ \exists C_a \ \forall u, v \in H^m_e(\Omega) , \quad |A(v, u)| \le C_a \|v\|_{H^m(\Omega)} \|u\|_{H^m(\Omega)} , \]

then the error in the Galerkin approximation satisfies

\[ \|u - \tilde u\|_{H^m(\Omega)} \le \frac{C_a}{C_A} \inf_{v \in \mathcal{V}} \|u - b - v\|_{H^m(\Omega)} . \tag{4.7-2} \]


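Proof: Since we can write

\[ \tilde u(x) = b(x) + \sum_{n=1}^N v_n(x)\, \mathbf{u}_n , \]

the Galerkin equations imply that

\[ \sum_{n=1}^N A(v_m, v_n)\, \mathbf{u}_n = \lambda(v_m) - A(v_m, b) \quad \forall\, 1 \le m \le N . \]

This is identical to the linear system $\mathbf{A}\mathbf{u} = \mathbf{f}$.

Next, we will show that if $A$ is coercive, then the stiffness matrix $\mathbf{A}$ is positive-definite. First notice that

\[
\mathbf{u}^\top \mathbf{A} \mathbf{u} = \sum_{m=1}^N \sum_{n=1}^N \mathbf{u}_m \mathbf{A}_{m,n} \mathbf{u}_n
= A\Bigl( \sum_{m=1}^N v_m \mathbf{u}_m , \sum_{n=1}^N v_n \mathbf{u}_n \Bigr)
\ge C_A \Bigl\| \sum_{n=1}^N v_n \mathbf{u}_n \Bigr\|^2_{H^m(\Omega)} .
\]

This shows that $\forall \mathbf{u} \in \mathbb{R}^N$, $\mathbf{u}^\top \mathbf{A} \mathbf{u} \ge 0$. Further, if $\mathbf{u}^\top \mathbf{A} \mathbf{u} = 0$, then $\sum_{n=1}^N v_n \mathbf{u}_n = 0$. The linear independence of the basis functions $v_n$ then implies that $\mathbf{u} = 0$. In other words, we have shown that if $\mathbf{A}\mathbf{u} = 0$, then $\mathbf{u} = 0$. Since $\mathbf{A}$ is a square matrix, this proves that it is nonsingular. As a result, if $A$ is coercive, then the Galerkin equations have a unique solution.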

It is easy to see that if $A$ is symmetric, then the stiffness matrix $\mathbf{A}$ is symmetric:

\[ \mathbf{A}_{m,n} = A(v_m, v_n) = A(v_n, v_m) = \mathbf{A}_{n,m} . \]

An easy modification of the proof of lemma 4.5-3 shows that if $A$ is symmetric and coercive then $u$ minimizes

\[ E(w) \equiv \tfrac12 A(w, w) - \lambda(w) \]

over all $w \in u + H^m_e(\Omega)$. Since $u - b \in H^m_e(\Omega)$, it follows that

\[ \tilde u \in b + H^m_e(\Omega) = \bigl[ u + (b - u) \bigr] + H^m_e(\Omega) = u + H^m_e(\Omega) . \]

This implies that

\[ E(u) \le E(\tilde u) . \]

Thus the Galerkin approximation cannot have lower total energy than the true solution.


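Let us prove that the Galerkin approximation minimizes the total energy over the admissible functions. For an arbitrary $\delta u \in \mathcal{V}$ the definition of the total energy

\[ E(\tilde u + \delta u) = \tfrac12 A(\tilde u + \delta u, \tilde u + \delta u) - \lambda(\tilde u + \delta u) , \]

symmetry of $A$

\[ = E(\tilde u) + A(\delta u, \tilde u) - \lambda(\delta u) + \tfrac12 A(\delta u, \delta u) , \]

the Galerkin equations

\[ = E(\tilde u) + \tfrac12 A(\delta u, \delta u) \]

and coercivity

\[ \ge E(\tilde u) \]

imply that the Galerkin approximation $\tilde u$ minimizes the total energy.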

Next, let us show that the error in the Galerkin approximation is $A$-orthogonal to $\mathcal{V}$. For any $v \in \mathcal{V} \subset H^m_e(\Omega)$, we can subtract the Galerkin equations for $\tilde u$ from the weak form for $u$ to get:

\[ A(v, u - \tilde u) = 0 . \]

Finally, let us prove inequality (??). If the bilinear form $A$ is coercive and the coefficients of the differential operator are uniformly bounded, then we can still make a useful statement about the error in the Galerkin approximation.

Coercivity shows that

\[ C_A \|u - \tilde u\|^2_{H^m(\Omega)} \le A(u - \tilde u, u - \tilde u) ; \]

then the Galerkin equations imply that for all $v \in \mathcal{V}$

\[ = A(u - b - v, u - \tilde u) ; \]

then the boundedness of the bilinear form $A$ implies that

\[ \le C_a \|u - b - v\|_{H^m(\Omega)} \|u - \tilde u\|_{H^m(\Omega)} . \]

If $\|u - \tilde u\|_{H^m(\Omega)} = 0$, then inequality (??) is satisfied trivially. Otherwise, we can cancel $\|u - \tilde u\|_{H^m(\Omega)} \ne 0$ on both sides and take the infimum to get (??). □

Inequality (??) says that the Sobolev norm of the error in the Galerkin approximation is at worst a constant factor (i.e., independent of $\mathcal{V}$) times the smallest possible norm of the error in approximation using the subspace $\mathcal{V}$.
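As an illustration of the linear system and the energy inequality in Cea's lemma, here is a minimal sketch under stated assumptions: the model problem $-u'' = 1$ on $(0,1)$ with $u(0) = u(1) = 0$ (so $b = 0$), piecewise linear hat functions on a uniform mesh as the basis $v_n$, and a tridiagonal (Thomas) solver. The function names are hypothetical and none of these choices come from the text.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system; the inputs are not modified."""
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    for i in range(1, n):
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def galerkin_poisson(n_cells):
    """Galerkin coefficients for -u'' = 1, u(0) = u(1) = 0, with hat functions
    on a uniform mesh (n_cells - 1 interior unknowns)."""
    h = 1.0 / n_cells
    n = n_cells - 1
    diag = [2.0 / h] * n            # A(v_j, v_j)
    off = [-1.0 / h] * (n - 1)      # A(v_j, v_{j+1}) = A(v_{j+1}, v_j)
    load = [h] * n                  # lambda(v_j) = integral of v_j
    coeffs = solve_tridiagonal(off, diag, off, load)
    return coeffs, load

coeffs, load = galerkin_poisson(8)
# Because A u = f, the energy (1/2) u^T A u - f^T u equals -(1/2) f^T u.
energy_galerkin = -0.5 * sum(f * u for f, u in zip(load, coeffs))
energy_exact = -1.0 / 24.0          # E(u) for the exact solution u(x) = x (1 - x) / 2
print(energy_galerkin, energy_exact)
```

In this sketch the stiffness matrix is symmetric because the sub- and super-diagonals coincide, and the computed energy of the Galerkin approximation is not below the energy of the exact solution, in line with item 4 of the lemma.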

4.7.2 Approximation Assumption

Suppose that the finite-dimensional subspace $\mathcal{V}$ in our Galerkin approximation satisfies the following approximation assumption.

Assumption 4.7-2: Given $\Omega \subset \mathbb{R}^d$ and integers $k > m > 0$, the finite-dimensional subspace $\mathcal{V}^k_h$ in the Galerkin approximation satisfies

\[
\exists h_0 > 0 \ \exists C > 0 \ \forall v \in W^k_2(\Omega) \cap W^m_{e,2}(\Omega) \ \forall\, 0 < h < h_0 : \quad
\inf_{V \in \mathcal{V}^k_h} \left\{ \|v - V\|_{W^0_2(\Omega)} + h^m \|v - V\|_{W^m_2(\Omega)} \right\} \le C_{m,k} h^k \|v\|_{W^k_2(\Omega)} . \tag{4.7-3}
\]

In general, the integer $k$ will be one plus the order of the polynomials that are reproduced exactly by our subspace, and the parameter $h$ will be related to a mesh size. Note that the approximation assumption does not depend on the differential equation, but it does depend on the domain $\Omega$. The approximation assumption has been proved for triangular meshes in [13], and can be derived from Hilbert scale arguments applied to our results for Taylor polynomials in lemma 4.6-11.


4.7.3 Hm Error Estimates

Lemma 4.7-3: Suppose that $k \ge m > 0$, and that $\Omega \subset \mathbb{R}^d$ is of class $C^k$ (see definition 4.5-9). Let

\[ A(v, u) \equiv \sum_{|\beta|\le m} \sum_{|\alpha|\le m} \int_\Omega a_{\alpha,\beta}\, D^\alpha v \, D^\beta u \, dx \]

where the coefficients $a_{\alpha,\beta}$ are right $(k-m)$-smooth (see definition 4.5-8). Suppose that $A$ is coercive with coercivity constant $C_A$ (see equation (4.4-1)) and bounded with constant $c$ (see equation (4.4-2)), leading to a higher-order regularity result with constant $C$ (see lemma 4.5-11). Suppose that $f \in W^{k-2m}_2(\Omega)$. Let $H^m_e(\Omega)$ be the Sobolev space formed by taking the completion of $C^\infty$ functions satisfying the homogeneous essential boundary conditions. For all $v \in H^m_e(\Omega)$ define the linear functional

\[ \lambda(v) \equiv \int_\Omega v f \, dx . \]

Suppose that $u \in H^m(\Omega)$ is such that for all $v \in H^m_e(\Omega)$

\[ A(v, u) = \lambda(v) . \]

Suppose that we are given a function $b \in H^k(\Omega)$ satisfying the essential boundary conditions. Suppose that we are given a finite-dimensional subspace $\mathcal{V}^k_h \subset H^m_e(\Omega)$ satisfying the approximation assumption (4.7-3). Finally, suppose that $\tilde u \in b + \mathcal{V}^k_h \subset H^m(\Omega)$ is the Galerkin approximation for this problem. Then the error in the Galerkin approximation satisfies

\[ \|u - \tilde u\|_{H^m(\Omega)} \le C C_{m,k} \frac{c}{C_A} h^{k-m} \left[ \|f\|_{W^{k-2m}_2(\Omega)} + \|b\|_{W^k_2(\Omega)} \right] . \tag{4.7-4} \]

Proof: We use Cea’s lemma 4.7-1

\[ \|u - \tilde u\|_{W^m_2(\Omega)} \le \frac{c}{C_A} \inf_{v \in \mathcal{V}^k_h} \|u - v\|_{H^m(\Omega)} , \]

the approximation assumption (4.7-3)

\[ \le C_{m,k} \frac{c}{C_A} h^{k-m} \|u\|_{H^k(\Omega)} \]

and the higher-order regularity result in lemma 4.5-11

\[ \le C C_{m,k} \frac{c}{C_A} h^{k-m} \left[ \|f\|_{H^{k-2m}(\Omega)} + \|b\|_{H^k(\Omega)} \right] . \]

□

Since the approximation assumption requires that $k > m$, we can let the mesh size $h \to 0$ and prove convergence of the Galerkin approximation in $H^m(\Omega)$.

4.7.4 Convergence for Rough Problems

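Lemma 4.7-4: Let $m > 0$ and suppose that $\Omega \subset \mathbb{R}^d$ is open, bounded and of class $C^{m+1}$ (see definition 4.5-9). Let

\[ A(v, u) \equiv \sum_{|\beta|\le m} \sum_{|\alpha|\le m} \int_\Omega a_{\alpha,\beta}\, D^\alpha v \, D^\beta u \, dx \]

where the coefficients $a_{\alpha,\beta}$ are 1-smooth (see definition 4.5-8), and $A$ is coercive (see equation (4.4-1)). Suppose that $f \in H^{-m}(\Omega)$, and $b \in H^{m+1}(\Omega)$ satisfies the given essential boundary conditions. Since $C^\infty$ is dense in the Sobolev spaces $H^m(\Omega)$ and $H^{m+1}(\Omega)$, let the sequences $\{f_n\} \subset C^\infty(\Omega)$ and $\{b_n\} \subset C^\infty(\Omega)$ converge to $f$ and $b$, respectively. Suppose that $\mathcal{V}^k_h \subset H^m(\Omega)$ satisfies the approximation assumption (4.7-3) with $k \ge m + 1$. Then for any $\varepsilon > 0$ there exists an $h_\varepsilon > 0$ and an $n_\varepsilon > 0$ such that if $h < h_\varepsilon$, $n > n_\varepsilon$ and $\tilde u_n \in b_n + \mathcal{V}^k_h$ solves the Galerkin equations

\[ A(v, \tilde u_n) = (v, f_n) \]

for all $v \in \mathcal{V}^k_h$, then $\|u - \tilde u_n\|_{H^m(\Omega)} \le \varepsilon$.

Proof: For each $n$, let $u_n \in b_n + H^{m+1}_e(\Omega)$ solve the Dirichlet problem

\[ A(v, u_n) = (v, f_n) \]

for all $v \in H^{m+1}_e(\Omega)$. Now let $\tilde u_n \in b_n + \mathcal{V}^k_h$ be the Galerkin approximation described in the assumptions. Then the regularity result (4.5-1) implies that

\[ \|u - u_n\|_{H^m(\Omega)} \le C_m \left( \|f - f_n\|_{H^{-m}(\Omega)} + \|b - b_n\|_{H^m(\Omega)} \right) , \]

and the higher-order regularity result (??) implies that

\[ \|u_n - b_n\|_{H^{m+1}(\Omega)} \le C_{m+1} \|f_n\|_{H^{-m+1}(\Omega)} . \]

Further, the Galerkin $H^m(\Omega)$ error estimate (4.7-4) implies that

\[ \|u_n - \tilde u_n\|_{H^m(\Omega)} \le C h \|u_n - b_n\|_{H^{m+1}(\Omega)} . \]

It follows that

\[
\begin{aligned}
\|u - \tilde u_n\|_{H^m(\Omega)} &\le \|u - u_n\|_{H^m(\Omega)} + \|u_n - \tilde u_n\|_{H^m(\Omega)} \\
&\le C_m \left( \|f - f_n\|_{H^{-m}(\Omega)} + \|b - b_n\|_{H^m(\Omega)} \right) + C h \|u_n - b_n\|_{H^{m+1}(\Omega)} \\
&\le C_m \left( \|f - f_n\|_{H^{-m}(\Omega)} + \|b - b_n\|_{H^m(\Omega)} \right) + C C_{m+1} h \|f_n\|_{H^{-m+1}(\Omega)} .
\end{aligned}
\]

Given $\varepsilon > 0$, we choose $n$ sufficiently large so that $C_m \|f - f_n\|_{H^{-m}(\Omega)} < \varepsilon/3$ and $C_m \|b - b_n\|_{H^m(\Omega)} < \varepsilon/3$. Then we choose $h$ sufficiently small in the Galerkin approximation so that $C C_{m+1} h \|f_n\|_{H^{-m+1}(\Omega)} < \varepsilon/3$. This shows that for any $\varepsilon > 0$, we can find a Galerkin approximation $\tilde u_n$ so that $\|u - \tilde u_n\|_{H^m(\Omega)} < \varepsilon$. □

This proves the convergence of the Galerkin approximation for rough inhomogeneities $f$ and $b$, minimally smooth coefficients $a_{\alpha,\beta}$ and a minimally smooth domain $\Omega$.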

4.7.5 H0 Error Estimates

There are circumstances under which the Galerkin approximation converges at a higher rate in $H^0(\Omega)$ than in $H^m(\Omega)$. In other words, the error in the function values converges faster than the error in the $m$-th order derivatives.

Lemma 4.7-5: Suppose that $k \ge 2m$, and $\Omega \subset \mathbb{R}^d$ is open, bounded and of class $C^k$ (see definition 4.5-9). Also suppose that the bilinear form $A$ is coercive (see equation (4.4-1)) and bounded (see equation (4.4-2)), and such that $A$ and its adjoint have coefficients that are right $k$-smooth (see definition 4.5-8). Suppose that the inhomogeneity in the partial differential equation is $f \in H^{k-2m}(\Omega)$ and there is a function $b \in H^k(\Omega)$ representing the boundary conditions so that $u - b \in H^m_e(\Omega)$. Finally, assume that the approximation assumption (4.7.5.1) is satisfied. Then there is an $h_0 > 0$ so that for all $h \in (0, h_0)$ and for all $j \in [0, m]$ there is a constant $C$ so that for all $f \in H^{k-2m}(\Omega)$ and for all $b \in H^k(\Omega)$ the error in the Galerkin approximation satisfies

\[ \|u - \tilde u\|_{H^j(\Omega)} \le C h^{k-j} \left\{ \|f\|_{H^{k-2m}(\Omega)} + \|b\|_{H^k(\Omega)} \right\} . \]

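Proof: Recall that for all $v \in \mathcal{V}^k_h$ the Galerkin approximation $\tilde u \in b + \mathcal{V}^k_h$ satisfies the Galerkin equations

\[ A(v, u - \tilde u) = 0 . \]

Suppose that we solve the following adjoint problem for $y \in H^{2m}(\Omega) \cap H^m_e(\Omega)$ involving the error in the Galerkin approximation as the inhomogeneity. For all $v \in H^m_e(\Omega)$,

\[ A(y, v) = (u - \tilde u, v) . \]

Then for any $v \in \mathcal{V}^k_h$,

\[
\|u - \tilde u\|^2_{H^0(\Omega)} = (u - \tilde u, u - \tilde u)_{H^0(\Omega)} = A(y, u - \tilde u) = A(y - v, u - \tilde u)
\le C \|y - v\|_{H^m(\Omega)} \|u - \tilde u\|_{H^m(\Omega)} .
\]

The proof of the $H^m(\Omega)$ error estimate in lemma 4.7-3 for $y$ shows that

\[ \inf_{v \in \mathcal{V}^k_h} \|y - v\|_{H^m(\Omega)} \le C h^m \|u - \tilde u\|_{H^0(\Omega)} . \]

If we substitute this result into the previous inequality and cancel $\|u - \tilde u\|_{H^0(\Omega)}$, it follows that

\[ \|u - \tilde u\|_{H^0(\Omega)} \le C h^m \|u - \tilde u\|_{H^m(\Omega)} . \tag{4.7-5} \]

Since $A$ is right $k$-smooth, then we can use the $H^m(\Omega)$ error estimate (4.7-4) and our new $H^0(\Omega)$ error estimate (4.7-5) to show that

\[ \|u - \tilde u\|_{H^0(\Omega)} \le C h^k \left( \|f\|_{H^{k-2m}(\Omega)} + \|b\|_{H^k(\Omega)} \right) . \tag{4.7-6} \]

Note that we can use Hilbert scale arguments to show that $\|u - \tilde u\|_{H^j(\Omega)} = O(h^{k-j})$, for $0 \le j \le m$. □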
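The difference between the $H^m$ and $H^0$ rates can be checked on a small model problem. The sketch below is an illustration under stated assumptions, not part of the text: it solves $-u'' = \pi^2 \sin(\pi x)$ on $(0,1)$ with homogeneous Dirichlet data using piecewise linear elements on a uniform mesh (so $m = 1$, $k = 2$), integrates with a 3-point Gauss rule, and estimates the observed orders by halving $h$.

```python
import math

GAUSS = ((-math.sqrt(0.6), 5.0 / 9.0), (0.0, 8.0 / 9.0), (math.sqrt(0.6), 5.0 / 9.0))

def galerkin_nodal_values(n):
    """Nodal values of the piecewise linear Galerkin approximation to
    -u'' = pi^2 sin(pi x), u(0) = u(1) = 0, on n equal cells."""
    h = 1.0 / n
    load = [0.0] * (n + 1)
    for j in range(n):                       # loop over cells [x_j, x_{j+1}]
        for t, w in GAUSS:
            s = 0.5 * (1.0 + t)              # local coordinate in [0, 1]
            fx = math.pi ** 2 * math.sin(math.pi * (j + s) * h) * 0.5 * h * w
            load[j] += fx * (1.0 - s)
            load[j + 1] += fx * s
    m = n - 1                                # number of interior unknowns
    diag = [2.0 / h] * m
    rhs = load[1:n]
    for i in range(1, m):                    # forward elimination, rows (-1, 2, -1) / h
        factor = (-1.0 / h) / diag[i - 1]
        diag[i] -= factor * (-1.0 / h)
        rhs[i] -= factor * rhs[i - 1]
    x = [0.0] * m
    x[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):           # back substitution
        x[i] = (rhs[i] + x[i + 1] / h) / diag[i]
    return [0.0] + x + [0.0]

def errors(n):
    """Return (H0 error, H1-seminorm error) against u(x) = sin(pi x)."""
    h = 1.0 / n
    u = galerkin_nodal_values(n)
    e0 = e1 = 0.0
    for j in range(n):
        slope = (u[j + 1] - u[j]) / h
        for t, w in GAUSS:
            s = 0.5 * (1.0 + t)
            x = (j + s) * h
            e0 += 0.5 * h * w * (math.sin(math.pi * x) - u[j] - slope * s * h) ** 2
            e1 += 0.5 * h * w * (math.pi * math.cos(math.pi * x) - slope) ** 2
    return math.sqrt(e0), math.sqrt(e1)

coarse, fine = errors(16), errors(32)
rate_h0 = math.log2(coarse[0] / fine[0])   # observed order in H^0
rate_h1 = math.log2(coarse[1] / fine[1])   # observed order in the H^1 seminorm
print(rate_h0, rate_h1)
```

For this smooth problem the observed orders should be close to $k = 2$ in $H^0$ and $k - m = 1$ in $H^1$, consistent with (4.7-6) and (4.7-4).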

4.7.5.1 H−p Error Estimates

There are circumstances under which the Galerkin approximation converges at an even higher rate in $H^{-p}(\Omega)$ than in $H^0(\Omega)$. This implies that repeated averages of the error converge more rapidly than the error.

Lemma 4.7-6: Suppose that $p \ge 0$, $m > 0$ and $k \ge 2m + p$. Assume that $\Omega \subset \mathbb{R}^d$ is open, bounded and of class $C^k$ (see definition 4.5-9). Also suppose that the bilinear form $A$ is coercive (see equation (4.4-1)) and such that $A$ and its adjoint have coefficients that are right $k$-smooth (see definition 4.5-8). Further, we assume that the inhomogeneity in the weak form is $f \in H^{k-2m}(\Omega)$, and we have a function $b \in H^k(\Omega)$ so that $u - b \in H^m_e(\Omega)$. Finally, we assume that the approximating space $\mathcal{V}^k_h$ satisfies the approximation assumption. Then the error in the Galerkin approximation $\tilde u \in b + \mathcal{V}^k_h$ is such that there exists $h_0 > 0$ so that for all $h \in (0, h_0)$ and for all $j \in [-p, m]$ there is a constant $C > 0$ so that for all $f \in H^{k-2m}(\Omega)$ and all $b \in H^k(\Omega)$

\[ \|u - \tilde u\|_{H^j(\Omega)} \le C h^{k-j} \left\{ \|f\|_{H^{k-2m}(\Omega)} + \|b\|_{H^k(\Omega)} \right\} . \]


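Proof: Given any $v \in H^p(\Omega)$, define $\varphi_v \in H^{2m+p}(\Omega) \cap H^m_e(\Omega)$ so that for all $w \in H^m_e(\Omega)$

\[ A(\varphi_v, w) = (v, w) . \]

If $\Omega$ is of class $C^{p+2m}$, then our elliptic regularity result shows that

\[ \|\varphi_v\|_{p+2m} \le C \|v\|_p . \]

Then for any $\tilde v \in \mathcal{V}^k_h$, the definition of the $H^{-p}(\Omega)$ norm, the definition of $\varphi_v$, the Galerkin equations for $\tilde u$ and the boundedness of $A$ imply that

\[
\begin{aligned}
\|u - \tilde u\|_{H^{-p}(\Omega)} &\equiv \sup_{v \in H^p(\Omega)} \frac{|(v, u - \tilde u)|}{\|v\|_{H^p(\Omega)}}
= \sup_{v \in H^p(\Omega)} \frac{|A(\varphi_v, u - \tilde u)|}{\|v\|_{H^p(\Omega)}}
= \sup_{v \in H^p(\Omega)} \frac{|A(\varphi_v - \tilde v, u - \tilde u)|}{\|v\|_{H^p(\Omega)}} \\
&\le C \|u - \tilde u\|_{H^m(\Omega)} \sup_{v \in H^p(\Omega)} \frac{\|\varphi_v - \tilde v\|_{H^m(\Omega)}}{\|v\|_{H^p(\Omega)}} .
\end{aligned}
\]

Now we take the infimum over $\tilde v$, use the approximation assumption and the higher-order regularity estimate in lemma 4.5-11 applied to $\varphi_v$ to get

\[
\begin{aligned}
\|u - \tilde u\|_{H^{-p}(\Omega)} &\le C \|u - \tilde u\|_{H^m(\Omega)} \sup_{v \in H^p(\Omega)} \frac{\inf_{\tilde v \in \mathcal{V}^k_h} \|\varphi_v - \tilde v\|_{H^m(\Omega)}}{\|v\|_{H^p(\Omega)}} \\
&\le C \|u - \tilde u\|_{H^m(\Omega)} \sup_{v \in H^p(\Omega)} \frac{h^{p+m} \|\varphi_v\|_{H^{2m+p}(\Omega)}}{\|v\|_{H^p(\Omega)}}
\le C h^{p+m} \|u - \tilde u\|_{H^m(\Omega)} \\
&\le C h^{p+k} \|u\|_{H^k(\Omega)} .
\end{aligned}
\]

The result follows from the higher-order elliptic regularity estimates for $u$. □

For smooth differential equations on smooth domains with smooth boundary conditions and smooth inhomogeneities, the highest possible convergence rate would be $O(h^{2(k-m)})$ in $H^{2m-k}(\Omega)$. These bounds on negative norms indicate that averages of the error in the Galerkin approximation converge faster than the error itself. This indicates that the error oscillates around zero, so there should be points at which the error is zero. These are superconvergence points, and can be computationally useful.

4.7.5.2 Convergence Without Coercivity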

The Rayleigh-Ritz-Galerkin process is the same for problems with positive weak forms,even if they are not self-adjoint.


Lemma 4.7-7: Suppose that

1. the bilinear form $B$ is positive:

\[ \exists c > 0 \ \forall v \in H^1_0(0, L), \quad B(v, v) \ge c \|v\|_1^2 ; \]

2. $\mathcal{M} \subset H^1_0(0, L)$ is finite dimensional;

3. $\lambda$ is a bounded linear functional on $H^1_0(0, L)$.

Then for any bounded linear functional $\lambda$ on $H^1_0(0, L)$, there is a unique $U \in \mathcal{M}$ satisfying the Rayleigh-Ritz-Galerkin equations

\[ \forall V \in \mathcal{M} , \quad B(V, U) = \lambda(V) . \]

Proof: If $\{V_1, \ldots, V_N\}$ is a basis for $\mathcal{M}$, then the Rayleigh-Ritz-Galerkin equations are equivalent to solving the linear system

\[ \forall\, 1 \le i \le N , \quad \sum_{j=1}^N B(V_i, V_j)\, \upsilon_j = \lambda(V_i) . \]

This system always has a unique solution if and only if the homogeneous system

\[ \forall\, 1 \le i \le N , \quad \sum_{j=1}^N B(V_i, V_j)\, \upsilon_j = 0 \]

has only the zero vector as its solution. To see that this is true, multiply by $\upsilon_i$ and sum over $i$ and use the positivity of $B$:

\[ 0 = B\Bigl( \sum_{i=1}^N V_i \upsilon_i , \sum_{j=1}^N V_j \upsilon_j \Bigr) \ge c \Bigl\| \sum_{j=1}^N V_j \upsilon_j \Bigr\|_1^2 . \]

Since the basis functions $V_j$ are linearly independent, the $\upsilon_j$ are all zero, and linear systems for the Rayleigh-Ritz-Galerkin equations have a unique solution when $B$ is positive. □
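A small computation can make the role of positivity (as opposed to symmetry) concrete. The sketch below rests on assumptions that are not in the text: the form $B(v,u) = \int_0^1 v'u' + 2q\,v\,u'\,dx$ with a constant coefficient $q$, hat functions on a uniform mesh, and a hypothetical helper `convection_diffusion_matrix`. The resulting matrix is not symmetric, yet its quadratic form is positive because the convection term integrates to zero for functions in $H^1_0$.

```python
import random

def convection_diffusion_matrix(n_cells, q):
    """Matrix entries B(V_i, V_j) for B(v, u) = int_0^1 v'u' + 2 q v u' dx with
    interior hat functions on a uniform mesh and a constant coefficient q."""
    h = 1.0 / n_cells
    n = n_cells - 1
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        B[i][i] = 2.0 / h
        if i + 1 < n:
            B[i][i + 1] = -1.0 / h + q   # uses int V_i V_{i+1}' dx = 1/2
            B[i + 1][i] = -1.0 / h - q   # uses int V_{i+1} V_i' dx = -1/2
    return B

def quadratic_form(B, x):
    """Evaluate x^T B x."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))

B = convection_diffusion_matrix(8, q=5.0)
random.seed(0)
samples = [[random.uniform(-1.0, 1.0) for _ in range(7)] for _ in range(100)]
min_form = min(quadratic_form(B, x) for x in samples)
is_nonsymmetric = B[0][1] != B[1][0]
print(is_nonsymmetric, min_form > 0.0)
```

Since the quadratic form is positive for every nonzero sample vector, the homogeneous system has only the zero solution in this example, which is the argument used in the proof above.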



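1. The bilinear functional $B$ is bounded and positive:

\[ \exists \bar c > 0 \ \forall v, u \in H^1_0(0, L) , \quad |B(v, u)| \le \bar c \|v\|_1 \|u\|_1 \]

and

\[ \exists \underline{c} > 0 \ \forall v \in H^1_0(0, L) , \quad B(v, v) \ge \underline{c} \|v\|_1^2 ; \]

2. The linear functional $\lambda$ is bounded:

\[ \exists C_\lambda > 0 \ \forall v \in H^1_0(0, L) , \quad |\lambda(v)| \le C_\lambda \|v\|_1 ; \]

3. $u \in H^1_0(0, L)$ solves the weak problem

\[ \forall v \in H^1_0(0, L) , \quad B(v, u) = \lambda(v) ; \]

4. $\mathcal{M} \subset H^1_0(0, L)$ is finite dimensional;

5. $U \in \mathcal{M}$ solves the Rayleigh-Ritz-Galerkin equation

\[ \forall V \in \mathcal{M} , \quad B(V, U) = \lambda(V) . \]

Then

1. The error $u - U$ in the Rayleigh-Ritz-Galerkin approximation satisfies the orthogonality condition

\[ \forall V \in \mathcal{M} , \quad B(V, u - U) = 0 . \]

2. The Rayleigh-Ritz-Galerkin approximation $U$ is within a constant factor of being the best possible approximation to $u$ from the subspace $\mathcal{M}$:

\[ \|u - U\|_1 \le \frac{\bar c}{\underline{c}} \inf_{V \in \mathcal{M}} \|u - V\|_1 . \]

Proof: Subtracting the Rayleigh-Ritz-Galerkin equation for $U$ from the weak equation for $u$ gives us the orthogonality condition. Since $B$ is both positive and bounded, we see that for any $V \in \mathcal{M}$,

\[ \underline{c} \|u - U\|_1^2 \le B(u - U, u - U) = B(u - V, u - U) \le \bar c \|u - V\|_1 \|u - U\|_1 . \]

It follows that

\[ \|u - U\|_1 \le \frac{\bar c}{\underline{c}} \inf_{V \in \mathcal{M}} \|u - V\|_1 . \]

□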

We can also estimate error in H0.



1. The bilinear form $B$ is defined by

\[ B(v, u) = \int_0^1 v' p u' + v\, 2q\, u' + v r u \, dx \]

where $p$, $q$ and $r$ are 1-smooth.

2. The bilinear functional $B$ is bounded and positive:

\[ \exists \bar c > 0 \ \forall v, u \in H^1_0(0, L) , \quad |B(v, u)| \le \bar c \|v\|_1 \|u\|_1 \]

and

\[ \exists \underline{c} > 0 \ \forall v \in H^1_0(0, L) , \quad B(v, v) \ge \underline{c} \|v\|_1^2 ; \]

3. The linear functional $\lambda$ is bounded:

\[ \exists C_\lambda > 0 \ \forall v \in H^1_0(0, L) , \quad |\lambda(v)| \le C_\lambda \|v\|_1 ; \]

4. $u \in H^1_0(0, L)$ solves the weak problem

\[ \forall v \in H^1_0(0, L) , \quad B(v, u) = \lambda(v) ; \]

5. $\mathcal{M} \subset H^1_0(0, L)$ is finite dimensional, and approximates functions $y \in H^2 \cap H^1_0(0, L)$ sufficiently well:

\[ \exists C_M > 0 \ \forall y \in H^2 \cap H^1_0(0, L) , \quad \inf_{V \in \mathcal{M}} \|y - V\|_1 \le C_M |y|_2 ; \]

6. $U \in \mathcal{M}$ solves the Rayleigh-Ritz-Galerkin equation

\[ \forall V \in \mathcal{M} , \quad B(V, U) = \lambda(V) . \]

Then the $H^0$ norm of the error is proportional to the $H^1$ norm of the error; in other words, there is a constant $C > 0$ independent of $\mathcal{M}$ and $\lambda$ so that

\[ \|u - U\|_0 \le \bar c\, C_M C \|u - U\|_1 . \]

Proof: Let $y \in H^1_0(0, L)$ solve the adjoint problem

\[ \forall w \in H^1_0(0, L) , \quad B(y, w) = (u - U, w) . \]

Since the coefficients in the boundary-value problem are 1-smooth, the elliptic regularity lemma ?? shows that there is a constant $C$ so that

\[ \|y\|_2 \le C \|u - U\|_0 . \]

The Rayleigh-Ritz-Galerkin equations imply that

\[ \forall V \in \mathcal{M} , \quad \|u - U\|_0^2 = (u - U, u - U) = B(y - V, u - U) \le \bar c \|y - V\|_1 \|u - U\|_1 . \]

Thus

\[ \|u - U\|_0^2 \le \bar c \|u - U\|_1 \inf_{V \in \mathcal{M}} \|y - V\|_1 \le \bar c\, C_M \|y\|_2 \|u - U\|_1 \le \bar c\, C_M C \|u - U\|_0 \|u - U\|_1 . \]

Cancelling $\|u - U\|_0$ on both sides gives us the claimed result. □

For finite element approximations, we expect that $C_M$ is proportional to the mesh width. In these approximations, the error in $H^0$ is therefore one power of the mesh width smaller than the error in $H^1$.

4.7.6 Non-Positive Weak Forms

In some cases, the bilinear form may not be positive. We may still be able to compute solutions to such problems, but the discussion becomes even more complicated. However, if Garding’s inequality

\[ \exists c_1 > 0 \ \exists c_0 \ge 0 \ \forall v \in H^1_0(0, L) , \quad B(v, v) \ge c_1 \|v\|_1^2 - c_0 \|v\|_0^2 \tag{4.7-7} \]

is satisfied, then we will still be able to prove some interesting results.

1. The bilinear form $B$ is given by

\[ \forall v, u \in H^1_0(0, L) , \quad B(v, u) = \int_0^1 v' p u' + v\, 2q\, u' + v r u \, dx ; \]

2. there exists $c_1$ so that for all $x \in (0, L)$, $p(x) > c_1 > 0$;

3. there exists $c_0 \ge 0$ so that for all $x \in (0, L)$,

\[ c_0 \ge \frac{q(x)^2}{p(x) - c_1} + c_1 - r(x) . \tag{4.7-8} \]

Then $B$ satisfies Garding’s inequality (4.7-7) with constants $c_1$ and $c_0$.

Proof: For the bilinear form $B$ in this lemma, Garding’s inequality is equivalent to requiring that

\[ \int_0^L \begin{bmatrix} v' & v \end{bmatrix} \begin{bmatrix} p - c_1 & q \\ q & r + c_0 - c_1 \end{bmatrix} \begin{bmatrix} v' \\ v \end{bmatrix} dx \ge 0 . \]

The matrix inside this integral will have nonnegative eigenvalues if and only if its diagonal entries and its determinant are nonnegative. Since $p > c_1$, the first diagonal entry is positive. Condition (4.7-8) implies that the second diagonal entry satisfies

\[ r + c_0 - c_1 \ge \frac{q^2}{p - c_1} \ge 0 . \]

That the determinant is nonnegative follows directly from (4.7-8). □
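Condition (4.7-8) is easy to evaluate for given coefficients. The following sketch uses hypothetical coefficients $p(x) = 1 + x$, $q(x) = 2$, $r(x) = -3$ on $(0,1)$ with $c_1 = 1/2$, and samples the interval at finitely many points as a stand-in for the supremum; these choices and the helper names are assumptions for illustration only.

```python
def garding_c0(p, q, r, c1, points):
    """Smallest c0 >= 0 that satisfies (4.7-8) at the given sample points."""
    worst = 0.0
    for x in points:
        if p(x) <= c1:
            raise ValueError("condition p(x) > c1 fails at x = %r" % x)
        worst = max(worst, q(x) ** 2 / (p(x) - c1) + c1 - r(x))
    return worst

def matrix_is_psd(p, q, r, c0, c1, x):
    """Test the 2x2 matrix [[p - c1, q], [q, r + c0 - c1]] at x through its
    diagonal entries and determinant, with a small roundoff tolerance."""
    a, b, d = p(x) - c1, q(x), r(x) + c0 - c1
    return a >= 0.0 and d >= 0.0 and a * d - b * b >= -1e-12

p = lambda x: 1.0 + x
q = lambda x: 2.0
r = lambda x: -3.0
c1 = 0.5
points = [i / 100.0 for i in range(101)]
c0 = garding_c0(p, q, r, c1, points)
all_psd = all(matrix_is_psd(p, q, r, c0, c1, x) for x in points)
print(c0, all_psd)
```

Here the largest value of the right-hand side of (4.7-8) occurs at $x = 0$, and with that choice of $c_0$ the matrix in the proof is positive semidefinite at every sample point, even though $r < 0$ makes $B$ itself potentially non-positive.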

If Garding’s inequality is satisfied, it is possible that there are nonzero functions $z(x)$ such that

\[ \forall w \in H^1_0(0, L) , \quad B(z, w) = 0 . \]

We will write the left nullspace of $B$ in the form

\[ N_B \equiv \{ z \in H^1_0(0, L) : \forall w \in H^1_0(0, L) , \ B(z, w) = 0 \} . \]

Suppose that $\lambda(v)$ is a bounded linear functional on $H^1_0(0, L)$. Then the Fredholm Theorem of the Alternative says that the weak equation

\[ \forall v \in H^1_0(0, L) , \quad B(v, u) = \lambda(v) \]

has a solution if and only if $\lambda$ satisfies the following annihilation condition:

\[ \forall z \in N_B , \quad \lambda(z) = 0 . \]

For more details, see Agmon [3] or Lions and Magenes [60]. The former shows that the left and right nullspaces are finite dimensional, and have the same dimension; he also shows that if $\lambda$ annihilates the left nullspace, then the weak problem has at least one solution. The latter show that when the weak problem has multiple solutions, the smallest solution depends continuously on the data.

Example 4.7-11: Suppose we want to solve

\[ -u''(x) - \pi^2 u(x) = f(x) , \quad x \in (0, 1) , \qquad u(0) = 0 = u(1) . \]

The weak form of this problem is

\[ B(v, u) \equiv \int_0^1 v'u' - \pi^2 v u \, dx = \int_0^1 v f \, dx \equiv \lambda(v) . \]

This problem has many solutions for $f = 0$, namely $z(x) = \alpha \sin(\pi x)$ for any constant $\alpha$. On the other hand, this problem has no solutions for $f = \sin(\pi x)$; this is because

\[ \lambda(\sin(\pi x)) = \int_0^1 \sin(\pi x) f(x) \, dx > 0 . \]

The Fredholm Theorem of the Alternative implies that this boundary value problem has a solution if and only if $f$ satisfies

\[ \int_0^1 f(x) \sin(\pi x) \, dx = 0 . \]

1. The bilinear form $B$ satisfies Garding’s inequality

\[ \exists c_1 > 0 \ \exists c_0 \ge 0 \ \forall v \in H^1_0(0, L) , \quad B(v, v) \ge c_1 \|v\|_1^2 - c_0 \|v\|_0^2 ; \]

2. for all $f \in H^0(0, L)$ there exists a unique solution $w \in H^2(0, L) \cap H^1_0(0, L)$ to the adjoint problem

\[ \forall v \in H^1_0(0, L) , \quad B(w, v) = (f, v) \]

and the solution of the adjoint problem satisfies the elliptic regularity estimate

\[ \exists C > 0 \ \forall f \in H^0(0, L) , \quad \|w\|_2 \le C \|f\|_0 ; \]

3. $\mathcal{M}_h \subset H^1_0(0, L)$ is finite dimensional for some sequence of parameters $h \to 0$, and satisfies the approximation assumption

\[ \exists C_M > 0 \ \forall y \in H^2(0, L) \cap H^1_0(0, L) , \quad \inf_{V \in \mathcal{M}_h} \|y - V\|_1 \le C_M h \|y\|_2 . \]

Then for $h$ sufficiently small, the Rayleigh-Ritz-Galerkin equations

\[ \forall V \in \mathcal{M}_h , \quad B(V, U) = (V, f) \]

have a unique solution.

Proof: Suppose that $U_1$ and $U_2$ are two solutions to the Rayleigh-Ritz-Galerkin equations, and let $Z = U_1 - U_2$. Then

\[ \forall V \in \mathcal{M}_h , \quad B(V, Z) = 0 . \]

In particular, $B(Z, Z) = 0$. Let $w \in H^2(0, L) \cap H^1_0(0, L)$ solve the adjoint problem

\[ \forall v \in H^1_0(0, L) , \quad B(w, v) = (Z, v) . \]

Then for all $V \in \mathcal{M}_h$,

\[ \|Z\|_0^2 = (Z, Z) = B(w, Z) = B(w - V, Z) \le c \|w - V\|_1 \|Z\|_1 , \]

where $c$ is the boundedness constant of $B$. Thus our approximation assumption and our regularity assumption imply that

\[ \|Z\|_0^2 \le c \|Z\|_1 \inf_{V \in \mathcal{M}_h} \|w - V\|_1 \le c \|Z\|_1 C_M h \|w\|_2 \le c \|Z\|_1 C_M C h \|Z\|_0 . \]

Cancelling $\|Z\|_0$, we obtain

\[ \|Z\|_0 \le c\, C_M C h \|Z\|_1 . \]

This implies that

\[ 0 = B(Z, Z) \ge c_1 \|Z\|_1^2 - c_0 \|Z\|_0^2 \ge \left\{ c_1 - c_0 (c\, C_M C h)^2 \right\} \|Z\|_1^2 . \]

For $h$ sufficiently small, this implies that $\|Z\|_1 = 0$, so the finite element equations have a unique solution. □

We can prove error estimates for non-self-adjoint problems satisfying Garding’s inequality.



1. The bilinear form

\[ B(v, u) \equiv \int_0^L v' p u' + v\, 2q\, u' + v r u \, dx \]

is bounded

\[ \exists c > 0 \ \forall v, u \in H^1_0(0, L) , \quad |B(v, u)| \le c \|v\|_1 \|u\|_1 , \]

and satisfies Garding’s inequality

\[ \exists c_1 > 0 \ \exists c_0 \ge 0 \ \forall v \in H^1_0(0, L) , \quad B(v, v) \ge c_1 \|v\|_1^2 - c_0 \|v\|_0^2 ; \]

2. The coefficients $p$, $q$ and $r$ are $\ell$-smooth, where $\ell \ge 1$;

3. for all $f \in H^0(0, L)$ there exist unique solutions $u, w \in H^1_0(0, L)$ to the weak problems

\[ \forall v \in H^1_0(0, L) , \quad B(v, u) = (v, f) , \]

\[ \forall v \in H^1_0(0, L) , \quad B(w, v) = (f, v) ; \]

4. we have the elliptic regularity estimates

\[ \exists C > 0 \ \forall f \in H^0(0, L) , \quad \|w\|_2 \le C \|f\|_0 ; \]

\[ \exists C > 0 \ \forall f \in H^0(0, L) , \quad \|u\|_2 \le C \|f\|_0 ; \]

5. $\mathcal{M}_h \subset H^1_0(0, L)$ is finite dimensional for some sequence of parameters $h \to 0$, and satisfies the approximation assumption

\[ \exists C_M > 0 \ \forall y \in H^2(0, L) \cap H^1_0(0, L) , \quad \inf_{V \in \mathcal{M}_h} \|y - V\|_1 \le C_M h \|y\|_2 ; \]

6. $h$ satisfies $c_0 (c\, C_M C h)^2 < c_1$, so that the Rayleigh-Ritz-Galerkin equations

\[ \forall V \in \mathcal{M}_h , \quad B(V, U) = (V, f) \]

have a unique solution.

Then

\[ \|u - U\|_0 \le c\, C_M C h \|u - U\|_1 . \]

If $h$ is even smaller, so that $h \le \sqrt{c_1 / (2 c_0)} / (c\, C_M C)$, then

\[ \|u - U\|_1 \le \frac{2 c\, C C_M}{c_1} h \|f\|_0 , \]

and

\[ \|u - U\|_0 \le \frac{2 (c\, C_M C h)^2}{c_1} \|f\|_0 . \]

Proof: Let $w \in H^1_0(0, L)$ solve the weak problem

\[ \forall v \in H^1_0(0, L) , \quad B(w, v) = (u - U, v) . \]

Then for all $V \in \mathcal{M}_h$, the orthogonality of the error to $\mathcal{M}_h$ and the boundedness of $B$ imply that

\[ \|u - U\|_0^2 = B(w, u - U) = B(w - V, u - U) \le c \|w - V\|_1 \|u - U\|_1 . \]

Then our approximation assumption and elliptic regularity show that

\[ \|u - U\|_0^2 \le c\, C_M h \|w\|_2 \|u - U\|_1 \le c\, C_M C h \|u - U\|_0 \|u - U\|_1 . \]

Cancelling $\|u - U\|_0$, we obtain the first claim:

\[ \|u - U\|_0 \le c\, C_M C h \|u - U\|_1 . \]

From our proof of the uniqueness of the finite element solution, for all $V \in \mathcal{M}_h$ we have

\[
\left( c_1 - c_0 (c\, C_M C h)^2 \right) \|u - U\|_1^2 \le c_1 \|u - U\|_1^2 - c_0 \|u - U\|_0^2 \le B(u - U, u - U)
= B(u - V, u - U) \le c \|u - V\|_1 \|u - U\|_1 .
\]

By taking $h \le \sqrt{c_1 / (2 c_0)} / (c\, C_M C)$, we get the final claims. □

4.7.6.1 L∞ Error Estimates

See [11, p. 90], [16, p. 209], [25, p. 165].

Chapter 5

Finite Element Implementations

5.1 Computational Issues

There are several important issues to consider before applying a Galerkin method to an elliptic partial differential equation. First, we need to choose the finite-dimensional subspace $\mathcal{V}$ of approximating functions. This involves a choice of basis functions, mesh generation, and the ordering of the mesh elements and unknowns. In order to generate an accurate scheme, the basis functions must have good approximation properties. In order to generate an efficient scheme, the basis functions must be computationally simple. Efficiency would also be improved if the stiffness matrix were as sparse as possible.

After we select the approximating subspace $\mathcal{V}$, we need to form the stiffness matrix and the right-hand side. The computation of both quantities involves integrating basis functions for $\mathcal{V}$ and their derivatives with the coefficients in the differential equation. These integrals can be performed efficiently by transforming all integrals to a canonical element, applying an appropriately accurate quadrature rule, and assembling the integrals over the mesh elements into a global stiffness matrix.
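The three steps named above (map to a canonical element, apply a quadrature rule, assemble) can be sketched in one dimension. This is a minimal illustration under assumptions that are not in the text: the form $\int p\,v'u'\,dx$, piecewise linear shape functions $1 - \xi$ and $\xi$ on the canonical element $[0,1]$, a two-point Gauss rule, and a dense matrix stored as nested lists for readability (a practical code would use a sparse format).

```python
import math

def assemble_stiffness(mesh, p):
    """Assemble the global matrix for int p(x) v'(x) u'(x) dx with piecewise
    linear elements, mapping each cell to the canonical element [0, 1]."""
    n = len(mesh)
    K = [[0.0] * n for _ in range(n)]
    offset = 0.5 / math.sqrt(3.0)
    quad = ((0.5 - offset, 0.5), (0.5 + offset, 0.5))   # two-point Gauss rule on [0, 1]
    dshape = (-1.0, 1.0)       # derivatives of the shape functions 1 - xi and xi
    for e in range(n - 1):
        h = mesh[e + 1] - mesh[e]
        local = [[0.0, 0.0], [0.0, 0.0]]
        for xi, w in quad:
            x = mesh[e] + h * xi
            for a in range(2):
                for b in range(2):
                    # chain rule d/dx = (1/h) d/dxi twice, times dx = h dxi
                    local[a][b] += w * p(x) * dshape[a] * dshape[b] / h
        for a in range(2):      # scatter the element matrix into the global matrix
            for b in range(2):
                K[e + a][e + b] += local[a][b]
    return K

K = assemble_stiffness([0.0, 0.25, 0.5, 1.0], lambda x: 1.0)
row_sums = [sum(row) for row in K]
print(K[2][2], row_sums)
```

No boundary conditions are imposed in this sketch, so each row of the assembled matrix sums to zero (constants lie in its nullspace); essential boundary conditions would be applied afterwards.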

Once the stiffness matrix and right-hand side have been determined, we need to solve the linear system for the unknown coefficients in the representation of the Galerkin approximation as a linear combination of the basis functions in $\mathcal{V}$. This could involve any of the techniques discussed in chapter 3.

In performing these computational steps, we need to consider several issues. First, we need to understand the maximum order of accuracy that the problem will admit. The smoothness of the coefficients in the differential operator, the right-hand side, the boundary values and the boundary itself will all affect the outcome.

Next, we need to make an initial decision of how small a mesh we will need to obtain the desired accuracy. Our error estimates will not be much help here, since they all involve unknown constant factors. Later, we will need to develop a posteriori error estimates so that we can decide if a finer computational mesh is required.

A related issue is to decide how the mesh will conform to the domain. In other words, we will have to decide how to generate mesh elements that approximate $\partial\Omega$ with sufficient accuracy.

Other decisions will affect the computational efficiency. We will need to decide whether to use regular or irregular data structures. This decision will affect the data flow from memory to the central processing unit, as well as the success of the iterative method for solving the linear system. On a distributed memory machine, the organization of the data structures will affect the interprocessor communication.

Finally, we will need to select numerical integration techniques to compute the integrals in the Galerkin equations. These quadrature rules need to preserve both the order of accuracy of the method, and the positive-definiteness of the linear system when the bilinear form is coercive.

A general rule of thumb is that for linear problems, most of the computational time is spent in the iterative method for solving the linear system (at least in two and three dimensions). For nonlinear problems, the repeated assemblage of the stiffness matrix could become a significant expense.

5.2 Finite Element Assumptions

The finite element method refers to a Galerkin method in which the finite dimensionalsubspace V has a particular form.

Definition 5.2-1: Given a problem domain Ω ⊂ Rd, a finite element (K,N, k,V,N )consists of the following:

1. an element domain K ⊂ Rd with piecewise smooth boundary,

2. an integer dimension N , prescribing the number of degrees of freedom in the elementapproximation,

3. an integer smoothness k, prescribing the number of pointwise derivatives requiredby the finite element approximation,

4. a set of N linearly independent shape functions Vn : K → R, 1 ≤ n ≤ N withspan V, and

5. a set of d linearly independent linear functionals N = νn : Ck(K) → R , 1 ≤ n ≤ N,called the nodal variables.

For a single scalar elliptic equation, the nodal variables could be the unknown coefficients

5.2. FINITE ELEMENT ASSUMPTIONS 305

of the shape functions in the representation of the solution of the differential equation asa linear combination of the shape functions. For higher-order shape functions, the nodalvariables could also be the unknown coefficients of derivatives of the shape functions indifferent directions. In our examples of finite elements, the shape functions will be eitherpiecewise polynomials, or transformations of piecewise polynomials from some canonicalcoordinate system.

Definition 5.2-2: Given a finite element $(K, N, k, \mathcal{V}, \mathcal{N})$ and a function $v \in C^k(K)$, suppose that there is a set of functions

\[ \Phi = \{\phi_n : 1 \le n \le N\} \subset \mathcal{V} \]

satisfying

\[ \forall\, 1 \le m, n \le N , \quad \nu_n(\phi_m) = \delta_{m,n} . \]

Then the local interpolant is

\[ I_K(v) \equiv \sum_{n=1}^{N} \nu_n(v)\, \phi_n \in \mathcal{V} . \]

The local interpolant is a linear function that maps members of $C^k(K)$ to linear combinations of the shape functions.

It is not always the case that the nodal values $\nu_n$ are values of a given function or its derivatives. For the examples we will consider in this section, $\Phi$ will always be easy to determine.

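Note the following:

Lemma 5.2-3: If $(K, N, k, \mathcal{V}, \mathcal{N})$ is a finite element, $v \in C^k(K)$ and

\[ I_K(v) \equiv \sum_{n=1}^{N} \nu_n(v)\, \phi_n \]

is the local interpolant to $v$, then

\[ \forall\, 1 \le n \le N , \quad \nu_n(I_K(v)) = \nu_n(v) . \]

Proof: Since the nodal variables $\nu_n$ are linear functionals and $\nu_n(\phi_m) = \delta_{m,n}$,

\[ \forall\, 1 \le n \le N , \quad \nu_n(I_K(v)) = \sum_{m=1}^{N} \nu_m(v)\, \nu_n(\phi_m) = \nu_n(v) . \]

□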
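The lemma can be checked numerically on a small example. The following Python sketch is an illustration only, not part of the programs that accompany the text: it assumes the element domain K = [0, 1], piecewise linear shape functions, and point evaluations at the two endpoints as nodal variables. The names `nodal_variables`, `dual_basis` and `local_interpolant` are hypothetical.

```python
# Hypothetical P1 element on K = [0, 1]: the nodal variables are point
# evaluations at the endpoints, and the dual basis is {1 - x, x}.
nodes = [0.0, 1.0]
nodal_variables = [lambda v, z=z: v(z) for z in nodes]
dual_basis = [lambda x: 1.0 - x, lambda x: x]


def local_interpolant(v):
    """Return I_K(v) = sum_n nu_n(v) * phi_n as a callable."""
    coeffs = [nu(v) for nu in nodal_variables]
    return lambda x: sum(c * phi(x) for c, phi in zip(coeffs, dual_basis))


def v(x):
    return x ** 3 + 2.0


Iv = local_interpolant(v)
# Lemma 5.2-3: applying each nodal variable to I_K(v) reproduces nu_n(v).
residuals = [abs(nu(Iv) - nu(v)) for nu in nodal_variables]
```

Here every entry of `residuals` should vanish up to rounding, even though `Iv` and `v` differ in the interior of the element.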

We presented several examples of finite element spaces in chapter ?? on two-point boundary value problems for ordinary differential equations. In this chapter, we will confine our discussion to problems in two and three dimensions.

5.3 Finite Elements in 1D

It is common to choose piecewise polynomials as the finite dimensional subspace $M_0$ used in the Rayleigh-Ritz-Galerkin method; in such a case, the procedure is called a finite element method. The simplest example of a finite element method for a second-order two-point boundary-value problem is to use piecewise linear functions to approximate the solution.

5.3.0.2 Piecewise Linear Approximations

We define a canonical basis function

\[ V(\xi) = \max\{0,\ 1 - |\xi|\} . \]

Note that $V$ is piecewise linear, $V(0) = 1$ and $V(\xi) = 0$ for $|\xi| \ge 1$. This function is illustrated in figure 5.3-1.

Given a mesh

\[ 0 = x_0 < x_1 < \ldots < x_N = 1 \]

we define the linear transformations from a mesh cell to a unit interval by

\[ \xi_j(x) = \begin{cases} \dfrac{x - x_j}{x_{j+1} - x_j}, & x_j \le x \\[2ex] \dfrac{x - x_j}{x_j - x_{j-1}}, & x_j \ge x \end{cases} \]

and the nodal basis functions

\[ V_j(x) = V(\xi_j(x)) , \quad 0 \le j \le N . \tag{5.3-1} \]

It is easy to see that each $V_j(x)$ is piecewise linear. It is also easy to see that the set $\{V_0, \ldots, V_N\}$ is linearly independent, since

\[ V_j(x_i) = \delta_{ij} . \]
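The definition (5.3-1) translates almost directly into code. The sketch below is a minimal Python illustration under stated assumptions: the mesh is a list of increasing nodes, and the function names `hat`, `xi` and `nodal_basis` are hypothetical. At the two boundary nodes, where one of the neighboring cells does not exist, the sketch returns a value outside the support of V; that convention is an assumption of the sketch, not something specified in the text.

```python
def hat(xi_value):
    """Canonical basis function V(xi) = max{0, 1 - |xi|}."""
    return max(0.0, 1.0 - abs(xi_value))


def xi(mesh, j, x):
    """Map x to the canonical coordinate xi_j(x) for node j."""
    last = len(mesh) - 1
    if x >= mesh[j]:
        if j == last:
            # No cell to the right of the last node (assumed convention).
            return 0.0 if x == mesh[j] else 2.0
        return (x - mesh[j]) / (mesh[j + 1] - mesh[j])
    if j == 0:
        # No cell to the left of the first node (assumed convention).
        return -2.0
    return (x - mesh[j]) / (mesh[j] - mesh[j - 1])


def nodal_basis(mesh, j, x):
    """Nodal basis function V_j(x) = V(xi_j(x))."""
    return hat(xi(mesh, j, x))


mesh = [0.0, 0.1, 0.4, 1.0]
# Tabulate V_j(x_i); this should be the identity matrix.
table = [[nodal_basis(mesh, j, x_i) for x_i in mesh] for j in range(len(mesh))]
```

The rows of `table` reproduce the Kronecker delta property, and at any point of the mesh interval the basis functions sum to one.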


Figure 5.3-1: Canonical Basis Function for Piecewise Linear Finite Elements


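Next, we note that if $p$ and $r$ are bounded, then $B(V_j, V_j) < \infty$: for $1 \le j < N$,

\[
\begin{aligned}
\int_0^1 V_j(x)^2\, dx
&= \int_{x_{j-1}}^{x_j} V\!\left(\frac{x - x_j}{x_j - x_{j-1}}\right)^{2} dx + \int_{x_j}^{x_{j+1}} V\!\left(\frac{x - x_j}{x_{j+1} - x_j}\right)^{2} dx \\
&= \int_{-1}^{0} V(\xi)^2 (x_j - x_{j-1})\, d\xi + \int_0^1 V(\xi)^2 (x_{j+1} - x_j)\, d\xi \\
&= (x_j - x_{j-1}) \int_{-1}^{0} (1 + \xi)^2\, d\xi + (x_{j+1} - x_j) \int_0^1 (1 - \xi)^2\, d\xi \\
&= \frac{x_{j+1} - x_{j-1}}{3} ,
\end{aligned}
\]

and

\[
\begin{aligned}
\int_0^1 V_j'(x)^2\, dx
&= \int_{x_{j-1}}^{x_j} \left(\frac{dV}{d\xi} \frac{d\xi_j}{dx}\right)^{2} dx + \int_{x_j}^{x_{j+1}} \left(\frac{dV}{d\xi} \frac{d\xi_j}{dx}\right)^{2} dx \\
&= \int_{x_{j-1}}^{x_j} \left(\frac{1}{x_j - x_{j-1}}\right)^{2} dx + \int_{x_j}^{x_{j+1}} \left(-\frac{1}{x_{j+1} - x_j}\right)^{2} dx \\
&= \frac{1}{x_j - x_{j-1}} + \frac{1}{x_{j+1} - x_j} .
\end{aligned}
\]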

Note that for $1 \le j < N$, $V_j(0) = 0 = V_j(1)$, so each of these basis functions satisfies the homogeneous Dirichlet boundary conditions.

We let $M_0$ be the set of all linear combinations of $V_1, \ldots, V_{N-1}$. Then $M_0$ is the finite dimensional subspace of all piecewise linear functions on the given mesh that vanish at the boundary.

5.3.0.3 Finite Element Linear Systems

To compute the finite element approximation $U \in M_0$ to the solution $u$ of the two-point boundary value problem, we need to develop equations for the undetermined coefficients in the linear combination

\[ U(x) = \sum_{j=1}^{N-1} V_j(x)\, \upsilon_j . \]

Note that no matter what the values of $\upsilon_1, \ldots, \upsilon_{N-1}$ may be, we have $U(0) = 0$ and $U(1) = 0$; thus $U(x)$ satisfies the boundary values specified in the problem (??).

The undetermined coefficients $\upsilon_j$ will be determined by solving a linear system of equations. Recall from section ?? that we need to compute the matrix entries

\[ B_{ij} = B(V_i, V_j) = \int_0^1 V_i'\, p\, V_j' + V_i\, r\, V_j\, dx \]

and right-hand side entries

\[ f_i = (V_i, f) = \int_0^1 V_i f\, dx . \]

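Note that $B = B^\top$, so $B \in \mathbb{R}^{(N-1)\times(N-1)}$ is symmetric. Also note that $B_{ij} = 0$ for $|i - j| > 1$; in other words, $B$ is tridiagonal. Finally, note that for any $u \in \mathbb{R}^{N-1}$,

\[ u^\top B u = B\Big(\sum_{j=1}^{N-1} V_j u_j ,\ \sum_{j=1}^{N-1} V_j u_j\Big) \ge 0 , \]

with equality only for $u = 0$ when the bilinear form is coercive; thus $B$ is positive-definite.

The intervals $(x_j, x_{j+1})$ are called the mesh elements and the points $x_j$ are called the mesh nodes. We also define the element widths

\[ h_{j+\frac12} = x_{j+1} - x_j . \]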

5.3.0.4 Frontal Assembly

Note that the entries of $B$ and $f$ can be assembled by computing integrals over elements. For example, we can define

\[ B_{k+\frac12}(V_i, V_j) = \int_{x_k}^{x_{k+1}} V_i'\, p\, V_j' + V_i\, r\, V_j\, dx . \]

To compute all terms in the matrix $B$ that involve integrals in the $k$th element, we need only compute the symmetric part of the $2 \times 2$ matrix

\[ \begin{bmatrix} B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+1}) \\ B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+1}) \end{bmatrix} \]

and the array

\[ \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx \end{bmatrix} . \]

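There are several approaches to computing these arrays.

If $p$ and $r$ are constant, then we can compute the integrals exactly. We evaluate the integrals by a change of variables to a unit interval, so that we can compute the integrals in terms of the canonical basis function. In this case, we obtain element-wise contributions to the matrix $B$ of the form

\[
\begin{aligned}
&\begin{bmatrix} B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+1}) \\ B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+1}) \end{bmatrix} \\
&\quad = \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k' p V_k' + V_k r V_k\, dx & \int_{x_k}^{x_{k+1}} V_k' p V_{k+1}' + V_k r V_{k+1}\, dx \\ \int_{x_k}^{x_{k+1}} V_k' p V_{k+1}' + V_k r V_{k+1}\, dx & \int_{x_k}^{x_{k+1}} V_{k+1}' p V_{k+1}' + V_{k+1} r V_{k+1}\, dx \end{bmatrix} \\
&\quad = \begin{bmatrix} \int_{x_k}^{x_{k+1}} (V_k')^2\, dx & \int_{x_k}^{x_{k+1}} V_k' V_{k+1}'\, dx \\ \int_{x_k}^{x_{k+1}} V_k' V_{k+1}'\, dx & \int_{x_k}^{x_{k+1}} (V_{k+1}')^2\, dx \end{bmatrix} p
 + \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k^2\, dx & \int_{x_k}^{x_{k+1}} V_k V_{k+1}\, dx \\ \int_{x_k}^{x_{k+1}} V_k V_{k+1}\, dx & \int_{x_k}^{x_{k+1}} V_{k+1}^2\, dx \end{bmatrix} r \\
&\quad = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{p}{h_{k+\frac12}}
 + \begin{bmatrix} \int_0^1 (1 - \xi)^2\, d\xi & \int_0^1 (1 - \xi)\xi\, d\xi \\ \int_0^1 \xi(1 - \xi)\, d\xi & \int_0^1 \xi^2\, d\xi \end{bmatrix} r\, h_{k+\frac12} \\
&\quad = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{p}{h_{k+\frac12}}
 + \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \frac{r\, h_{k+\frac12}}{6} .
\end{aligned}
\]

We also have element-wise contributions to the right-hand side vector $f$ of the form

\[ \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac12 f\, h_{k+\frac12} . \]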
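The closed-form element contributions for constant coefficients are short enough to code directly. The following Python sketch is illustrative only; the function names `element_matrix_exact` and `element_load_exact` are hypothetical, and the sketch assumes that p, r and f are constants on the element of width h.

```python
def element_matrix_exact(p, r, h):
    """2x2 element matrix [1 -1; -1 1] p/h + [2 1; 1 2] r h/6."""
    stiff = p / h
    mass = r * h / 6.0
    return [
        [stiff + 2.0 * mass, -stiff + mass],
        [-stiff + mass, stiff + 2.0 * mass],
    ]


def element_load_exact(f, h):
    """Element load vector [1; 1] f h/2 for constant f."""
    return [0.5 * f * h, 0.5 * f * h]


# Pure diffusion on an element of width 1/2, then pure reaction on a unit element.
stiffness_only = element_matrix_exact(1.0, 0.0, 0.5)
mass_only = element_matrix_exact(0.0, 6.0, 1.0)
```

The first call isolates the stiffness part of the element matrix and the second isolates the mass part, so each can be compared against the matrices displayed above.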

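A more common approach, which applies to variable coefficients $p$ and $r$, is to use Gaussian quadrature. In this case, a single Gauss quadrature point in each element is sufficient to preserve the accuracy of the finite element method; this corresponds to using the midpoint rule. In general,

\[
\begin{aligned}
B_{k+\frac12}(V_i, V_j)
&= \int_{x_k}^{x_{k+1}} V_i'(x)\, p(x)\, V_j'(x) + V_i(x)\, r(x)\, V_j(x)\, dx \\
&= \int_0^1 V_i'(x_k + \xi h_{k+\frac12})\, p(x_k + \xi h_{k+\frac12})\, V_j'(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \\
&\quad + \int_0^1 V_i(x_k + \xi h_{k+\frac12})\, r(x_k + \xi h_{k+\frac12})\, V_j(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi .
\end{aligned}
\]

Note that

\[ \begin{bmatrix} V_k'(x_k + \xi h_{k+\frac12}) \\ V_{k+1}'(x_k + \xi h_{k+\frac12}) \end{bmatrix} = \begin{bmatrix} V'(\xi) \\ V'(\xi - 1) \end{bmatrix} \frac{1}{h_{k+\frac12}} \]

and

\[ \begin{bmatrix} V_k(x_k + \xi h_{k+\frac12}) \\ V_{k+1}(x_k + \xi h_{k+\frac12}) \end{bmatrix} = \begin{bmatrix} V(\xi) \\ V(\xi - 1) \end{bmatrix} . \]

Also recall that the midpoint rule approximates

\[ \int_0^1 g(\xi)\, d\xi \approx g(\tfrac12) . \]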


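Thus,

\[
\begin{aligned}
&\begin{bmatrix} B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+1}) \\ B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+1}) \end{bmatrix} \\
&\quad \approx \begin{bmatrix} V'(\tfrac12) \\ V'(-\tfrac12) \end{bmatrix} \left[ \frac{p(x_k + \tfrac12 h_{k+\frac12})}{h_{k+\frac12}} \right] \begin{bmatrix} V'(\tfrac12) & V'(-\tfrac12) \end{bmatrix}
 + \begin{bmatrix} V(\tfrac12) \\ V(-\tfrac12) \end{bmatrix} \left[ r(x_k + \tfrac12 h_{k+\frac12})\, h_{k+\frac12} \right] \begin{bmatrix} V(\tfrac12) & V(-\tfrac12) \end{bmatrix} \\
&\quad = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{p(x_{k+\frac12})}{h_{k+\frac12}}
 + \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \frac{r(x_{k+\frac12})\, h_{k+\frac12}}{4} .
\end{aligned}
\]

We also have element-wise contributions to the right-hand side vector $f$ of the form

\[
\begin{aligned}
\begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx \end{bmatrix}
&= \begin{bmatrix} \int_0^1 V(\xi)\, f(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \\ \int_0^1 V(\xi - 1)\, f(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \end{bmatrix} \\
&\approx \begin{bmatrix} V(\tfrac12) \\ V(-\tfrac12) \end{bmatrix} f(x_{k+\frac12})\, h_{k+\frac12}
 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac{f(x_{k+\frac12})\, h_{k+\frac12}}{2} .
\end{aligned}
\]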
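For variable coefficients, the midpoint-rule element contributions can be sketched as follows. This Python fragment is an illustration under stated assumptions: the coefficients p, r and f are passed as callables, the element is the interval (x_left, x_right), and the function names are hypothetical.

```python
def element_matrix_midpoint(p, r, x_left, x_right):
    """Midpoint-rule element matrix [1 -1; -1 1] p/h + [1 1; 1 1] r h/4."""
    h = x_right - x_left
    x_mid = 0.5 * (x_left + x_right)
    stiff = p(x_mid) / h
    mass = r(x_mid) * h / 4.0
    return [
        [stiff + mass, -stiff + mass],
        [-stiff + mass, stiff + mass],
    ]


def element_load_midpoint(f, x_left, x_right):
    """Midpoint-rule element load [1; 1] f(x_mid) h/2."""
    h = x_right - x_left
    value = 0.5 * f(0.5 * (x_left + x_right)) * h
    return [value, value]


matrix = element_matrix_midpoint(lambda x: 1.0 + x, lambda x: 4.0, 0.0, 1.0)
load = element_load_midpoint(lambda x: 2.0 * x, 0.0, 1.0)
```

Note that the midpoint rule replaces the exact mass matrix weights (2, 1, 1, 2)/6 with (1, 1, 1, 1)/4; only the stiffness part agrees with the exact integration when p is constant.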

A program to implement a finite element approximation to the solution of a two-point boundary value problem has been provided. This program consists of several pieces:

• Program 5.3-1: FiniteElementMain.C C++ main program for solving the two-point boundary value problem and plotting the results interactively.

• Program 5.3-2: finite element.f Fortran 77 subroutines for computing the canonical basis functions, mesh, initial values, solution of linear system via conjugate gradients, and values of the finite element approximation.

• Program 5.3-3: input Input file for FiniteElementMain.C (http://www.math.duke.edu/~johnt/math226/bvp2/input)

• Program 5.3-4: perlmake Perl macros and subroutines to make executables (http://www.math.duke.edu/~johnt/math226/bvp2/perlmake)

If the input file sets nelements to a value greater than one, then the main program will solve a boundary value problem with that number of elements. If nelements is equal to one, then the main program will perform a mesh refinement study, beginning with 2 elements and refining repeatedly by a factor of 2.

The file finite element.f contains several subroutines and functions. Subroutine canonical evaluates the Gaussian quadrature points and associated weights that are appropriate for the method, and then computes the canonical basis functions and their derivatives at the quadrature points. Function approximation evaluates the finite element approximation at a point x, once the linear system has been solved for the coefficients of the basis functions. Subroutine grid determines a uniform mesh, and sets the array of nodes for each element. Function solution computes the analytical solution of the boundary value problem at a point x. Subroutine mult multiplies the stiffness matrix times an arbitrary vector of coefficients of the basis functions, and sets the resulting vector to zero at Dirichlet boundary conditions. This annihilation is important for performing the conjugate gradient iteration: no change is made to the solution vector at Dirichlet boundary conditions during the conjugate gradient iteration. Subroutine initialize sets the flags for the Dirichlet boundary conditions, computes the coefficients of the differential operator at the Gaussian quadrature points, selects initial values for the unknown coefficients of the basis functions, and computes the initial residual. Subroutine pre is a preconditioner for conjugate gradients; in this case, the preconditioner is the identity. Subroutine precg is adapted from a program of the same name available at netlib.org; it performs preconditioned conjugate gradients to solve a linear system. In this case, the computation of the initial residual was performed before calling precg, because that calculation is done without use of Dirichlet boundary condition flags. In addition, some basic linear algebra subroutines were used to replace the corresponding loops in the original version of precg. Subroutine stopit is used by precg to determine when to terminate the iteration.

Numerical results with this program are presented in figure 5.3-2. The results show that the $L^2$ error is proportional to $\Delta x^2$, as is the $L^\infty$ error at the element boundaries.

5.3.0.5 Finite Differences

It is important to note that the finite element equations can be implemented as if they were finite differences. For example, if $p$, $r$ and $f$ are constant and the integrals are computed exactly, then the finite element equations are equivalent to the finite difference equations

\[
-p\left( \frac{u_{k+1} - u_k}{h_{k+\frac12}} - \frac{u_k - u_{k-1}}{h_{k-\frac12}} \right)
+ r\left( h_{k+\frac12} \frac{u_{k+1} + 2u_k}{6} + h_{k-\frac12} \frac{2u_k + u_{k-1}}{6} \right)
= f\, \frac{h_{k+\frac12} + h_{k-\frac12}}{2}
\quad \text{for } 1 \le k < N , \quad u_0 = 0 = u_N .
\]
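The finite difference form with constant coefficients leads to a tridiagonal linear system that can be solved directly. The Python sketch below is an illustration, not the program described in the text: it assumes constant p, r and f, homogeneous Dirichlet data, a possibly non-uniform mesh with at least two elements, and it uses the Thomas algorithm (tridiagonal Gaussian elimination without pivoting) in place of conjugate gradients. The function name `solve_dirichlet` is hypothetical.

```python
def solve_dirichlet(mesh, p, r, f):
    """Solve the constant-coefficient finite difference equations with u_0 = u_N = 0."""
    N = len(mesh) - 1
    h = [mesh[k + 1] - mesh[k] for k in range(N)]  # h[k] is h_{k+1/2}
    lower, diag, upper, rhs = [], [], [], []
    for k in range(1, N):
        hm, hp = h[k - 1], h[k]
        lower.append(-p / hm + r * hm / 6.0)             # coefficient of u_{k-1}
        diag.append(p / hm + p / hp + r * (hm + hp) / 3.0)  # coefficient of u_k
        upper.append(-p / hp + r * hp / 6.0)             # coefficient of u_{k+1}
        rhs.append(f * (hm + hp) / 2.0)
    n = N - 1
    # Forward elimination.
    for i in range(1, n):
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    # Back substitution.
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] - upper[i] * u[i + 1]) / diag[i]
    return [0.0] + u + [0.0]


mesh = [0.0, 0.1, 0.35, 0.6, 1.0]
u = solve_dirichlet(mesh, 1.0, 0.0, 1.0)
```

For p = 1, r = 0 and f = 1 the analytical solution is u(x) = x(1 − x)/2, and in this special case the computed nodal values agree with it to rounding error, even on the non-uniform mesh used here.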

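If instead we use the midpoint rule, we obtain

\[
-\left( p_{k+\frac12} \frac{u_{k+1} - u_k}{h_{k+\frac12}} - p_{k-\frac12} \frac{u_k - u_{k-1}}{h_{k-\frac12}} \right)
+ \left( r_{k+\frac12} h_{k+\frac12} \frac{u_{k+1} + u_k}{4} + r_{k-\frac12} h_{k-\frac12} \frac{u_k + u_{k-1}}{4} \right)
= f_{k+\frac12} \frac12 h_{k+\frac12} + f_{k-\frac12} \frac12 h_{k-\frac12}
\quad \text{for } 1 \le k < N , \quad u_0 = 0 = u_N .
\]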

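Finally, if we use the trapezoidal rule we obtain

\[
-\left( \frac{p_{k+1} + p_k}{2} \frac{u_{k+1} - u_k}{h_{k+\frac12}} - \frac{p_k + p_{k-1}}{2} \frac{u_k - u_{k-1}}{h_{k-\frac12}} \right)
+ r_k \frac{h_{k+\frac12} + h_{k-\frac12}}{2}\, u_k
= f_k \frac{h_{k+\frac12} + h_{k-\frac12}}{2}
\quad \text{for } 1 \le k < N , \quad u_0 = 0 = u_N .
\]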


(a) $L^2$ error at Gauss quadrature points = $O(\Delta x^2)$ (b) $L^\infty$ error at mesh points = $O(\Delta x^2)$

Figure 5.3-2: Errors in Continuous Piecewise Linear Finite Elements: log base 10 of errors versus log base 10 of number of basis functions


Exercises 5.3.0

1. Consider the two-point boundary-value problem

\[ -\frac{d}{dx}\left(\frac{du}{dx}\right) = \pi^2 \cos(\pi x) , \quad 0 < x < 1 \]

\[ u(0) = 1 , \quad u(1) = -1 . \]


(b) Program the finite element method for this problem.


(d) Plot the log of the error in the derivative of the solution at the mesh points versus the log of the number of basis functions, for $2^n$ elements, $1 \le n \le 10$. Note that there are two values for the derivative at each mesh point, associated with either of the two elements containing the mesh point. What is the slope of these curves (i.e. the order of convergence)?


\[ -\frac{d}{dx}\left(p \frac{du}{dx}\right) + r u = f , \quad 0 < x < 1 \]

\[ u(0) = 0 , \quad u(1) = 0 . \]

Suppose that $f(x)$ is a Dirac delta function associated with some point $\xi \in (0, 1)$.

(a) If $p(x) \equiv 1$ and $r(x) \equiv 0$, find the analytical solution of this problem.

(b) Describe the finite element method for this problem, and the corresponding finite difference equations.

(c) Suppose that $\xi = 1/2$, and consider uniform meshes with an even number of elements. Program the finite element method for this problem, and plot the log of the error in the solution at the mesh points versus the log of the number of basis functions.

(d) Suppose that $\xi = 1/2$, and consider uniform meshes with an odd number of elements. Program the finite element method for this problem, and plot the log of the error in the solution at the mesh points versus the log of the number of basis functions.

5.3.1 Essential and Natural Boundary Conditions

So far, our discussions of weak forms and Rayleigh-Ritz-Galerkin methods have assumed that we have homogeneous Dirichlet boundary conditions. Fortunately, we can handle other types of boundary conditions as well.

Suppose that we want to solve a two-point boundary value problem with inhomogeneous boundary conditions,

\[ -\frac{d}{dx}\left(p \frac{du}{dx}\right) + r u = f , \quad 0 < x < 1 \]

\[ u(0) = a , \quad u(1) = b . \]

The weak form of this problem requires us to find $u$ with finite total energy so that $u(0) = a$, $u(1) = b$ and

\[ \forall v \in C_0^\infty(0, 1) , \quad B(v, u) \equiv \int_0^1 v' p u' + v r u\, dx = \int_0^1 v f\, dx \equiv (v, f) . \]

The test functions $v$ are variations of $u$ in our energy minimization process (see section ??); since $u$ has fixed values at $x = 0$ and $x = 1$, its variation $v$ is required to have zero values at the endpoints. The boundary conditions on $u$ are essential, because we must impose them on the candidates for the solution, and we must impose the homogeneous form of these boundary conditions on the variations $v$ of the solution.

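On the other hand, suppose we want to solve a two-point boundary value problem with a Neumann boundary condition,

\[ -\frac{d}{dx}\left(p \frac{du}{dx}\right) + r u = f , \quad 0 < x < 1 \tag{5.3-2a} \]

\[ u(0) = a , \quad p \frac{du}{dx}(1) = b . \tag{5.3-2b} \]

When we multiply by a test function and integrate by parts, we obtain the weak form

\[
\begin{aligned}
\forall v \in C^\infty(0, 1) \text{ with } v(0) = 0 , \quad B(v, u) \equiv \int_0^1 v' p u' + v r u\, dx
&= \int_0^1 v f\, dx + v p u' \Big|_0^1 \\
&= \int_0^1 v f\, dx + v(1)\, b \equiv \lambda(v) .
\end{aligned}
\tag{5.3-3}
\]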

Note that we require $u(0) = a$ and $v(0) = 0$; the Dirichlet boundary condition at $x = 0$ is essential. However, the Neumann boundary condition at $x = 1$ is a natural boundary condition; it enters into the weak form and does not impose any additional restrictions on the solution $u$ or its variations $v$. It is easy to see that if $u \in C^2(0, 1)$ satisfies the weak form (5.3-3) and $u(0) = a$, then $u$ solves the two-point boundary value problem (5.3-2).
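The claim that a smooth solution of the weak form solves the boundary value problem can be made explicit by reversing the integration by parts. The following derivation is a sketch that assumes $u \in C^2$ and that $p$ is continuously differentiable, so that all terms are classically defined. For any smooth $v$ with $v(0) = 0$,

\[
\begin{aligned}
0 &= B(v, u) - \lambda(v)
 = \int_0^1 v' p u' + v r u - v f\, dx - v(1)\, b \\
&= \int_0^1 v \left[ -(p u')' + r u - f \right] dx + v(1) \left[ p(1) u'(1) - b \right] .
\end{aligned}
\]

Choosing test functions with $v(1) = 0$ shows that $-(p u')' + r u = f$ in $(0, 1)$; then choosing any $v$ with $v(1) \ne 0$ leaves $p(1) u'(1) = b$. Thus the Neumann condition is recovered from the weak form rather than imposed on the trial functions.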

The development of finite element methods for essential and natural boundary conditions is straightforward. Let us consider problem (5.3-2), because it involves both essential and natural boundary conditions. We will use piecewise linear nodal basis functions, as in section 5.3.0.2. We will write

\[ U(x) = V_0(x)\, a + \sum_{j=1}^{N} V_j(x)\, \upsilon_j . \]

Note that $U(0) = a$, because $V_j(0) = 0$ for $j \ge 1$. The remaining coefficients $\upsilon_j$, $j \ge 1$, will be determined by solving a linear system. The Rayleigh-Ritz-Galerkin equations will require that

\[ B(V_i, U) = \lambda(V_i) , \quad 1 \le i \le N . \]


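For simplicity, we will assume that $p$, $r$ and $f$ are constant. In elements $(x_k, x_{k+1})$ for $1 \le k < N$, we will compute the element-wise contribution to the stiffness matrix

\[
\begin{aligned}
&\begin{bmatrix} B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+1}) \\ B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+1}) \end{bmatrix} \\
&\quad = \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k' p V_k' + V_k r V_k\, dx & \int_{x_k}^{x_{k+1}} V_k' p V_{k+1}' + V_k r V_{k+1}\, dx \\ \int_{x_k}^{x_{k+1}} V_k' p V_{k+1}' + V_k r V_{k+1}\, dx & \int_{x_k}^{x_{k+1}} V_{k+1}' p V_{k+1}' + V_{k+1} r V_{k+1}\, dx \end{bmatrix} \\
&\quad = \begin{bmatrix} \int_{x_k}^{x_{k+1}} (V_k')^2\, dx & \int_{x_k}^{x_{k+1}} V_k' V_{k+1}'\, dx \\ \int_{x_k}^{x_{k+1}} V_k' V_{k+1}'\, dx & \int_{x_k}^{x_{k+1}} (V_{k+1}')^2\, dx \end{bmatrix} p
 + \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k^2\, dx & \int_{x_k}^{x_{k+1}} V_k V_{k+1}\, dx \\ \int_{x_k}^{x_{k+1}} V_k V_{k+1}\, dx & \int_{x_k}^{x_{k+1}} V_{k+1}^2\, dx \end{bmatrix} r \\
&\quad = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{p}{h_{k+\frac12}}
 + \begin{bmatrix} \int_0^1 (1 - \xi)^2\, d\xi & \int_0^1 (1 - \xi)\xi\, d\xi \\ \int_0^1 (1 - \xi)\xi\, d\xi & \int_0^1 \xi^2\, d\xi \end{bmatrix} r\, h_{k+\frac12} \\
&\quad = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \frac{p}{h_{k+\frac12}}
 + \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \frac{r\, h_{k+\frac12}}{6} .
\end{aligned}
\]

For $k = 0$, we only need to compute

\[ \begin{bmatrix} B_{k+\frac12}(V_{k+1}, V_{k+1}) \end{bmatrix} = \begin{bmatrix} 1 \end{bmatrix} \frac{p}{h_{k+\frac12}} + \begin{bmatrix} 2 \end{bmatrix} \frac{r\, h_{k+\frac12}}{6} . \]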

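In addition, we need to compute element-wise contributions to the right-hand side. For $1 \le k < N - 1$,

\[ \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx + V_k(1)\, b \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx + V_{k+1}(1)\, b \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac12 f\, h_{k+\frac12} . \]

On the other hand, for $k = 0$ we compute

\[ \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx + V_{k+1}(1)\, b \end{bmatrix} = \begin{bmatrix} 1 \end{bmatrix} \frac12 f\, h_{k+\frac12} , \]

and for $k = N - 1$ we compute

\[ \begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx + V_k(1)\, b \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx + V_{k+1}(1)\, b \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \frac12 f\, h_{k+\frac12} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} b . \]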

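The finite element equations can be rewritten in the form of finite differences. We have

\[
-p\left( \frac{u_{k+1} - u_k}{h_{k+\frac12}} - \frac{u_k - u_{k-1}}{h_{k-\frac12}} \right)
+ r\left( h_{k+\frac12} \frac{u_{k+1} + 2u_k}{6} + h_{k-\frac12} \frac{2u_k + u_{k-1}}{6} \right)
= f\, \frac{h_{k+\frac12} + h_{k-\frac12}}{2}
\quad \text{for } 1 \le k < N ,
\]

and

\[
-b + p\, \frac{u_k - u_{k-1}}{h_{k-\frac12}} + r\, h_{k-\frac12} \frac{2u_k + u_{k-1}}{6} = f\, \frac{h_{k-\frac12}}{2} , \quad k = N .
\]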
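These two families of equations, together with $u_0 = a$, again form a tridiagonal system. The Python sketch below illustrates one way to assemble and solve it; it is not the author's program. It assumes constant p, r and f, moves the known value $u_0 = a$ to the right-hand side of the first equation, and uses tridiagonal elimination without pivoting. The function name `solve_mixed` is hypothetical.

```python
def solve_mixed(mesh, p, r, f, a, b):
    """Solve with u(0) = a (essential) and p u'(1) = b (natural)."""
    N = len(mesh) - 1
    h = [mesh[k + 1] - mesh[k] for k in range(N)]  # h[k] is h_{k+1/2}
    lower, diag, upper, rhs = [], [], [], []
    for k in range(1, N + 1):
        hm = h[k - 1]
        lo = -p / hm + r * hm / 6.0
        d = p / hm + r * hm / 3.0
        up = 0.0
        g = f * hm / 2.0
        if k < N:
            hp = h[k]
            d += p / hp + r * hp / 3.0
            up = -p / hp + r * hp / 6.0
            g += f * hp / 2.0
        else:
            g += b  # natural boundary condition enters the load only
        if k == 1:
            g -= lo * a  # essential boundary value moved to the right-hand side
        lower.append(lo)
        diag.append(d)
        upper.append(up)
        rhs.append(g)
    # Tridiagonal elimination and back substitution.
    for i in range(1, N):
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * N
    u[-1] = rhs[-1] / diag[-1]
    for i in range(N - 2, -1, -1):
        u[i] = (rhs[i] - upper[i] * u[i + 1]) / diag[i]
    return [a] + u


mesh = [0.0, 0.2, 0.5, 1.0]
u = solve_mixed(mesh, 1.0, 0.0, 2.0, 1.0, -1.0)
```

With p = 1, r = 0, f = 2, a = 1 and b = −1 the analytical solution is u(x) = 1 + x − x², and the nodal values returned by the sketch match it to rounding error. Note that the unknown at the Neumann end is solved for like any interior unknown.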

Exercises 5.3.1

1. Determine the finite difference equations for the two-point boundary-value problem (5.3-2), arising from the finite element method using continuous linear basis functions and midpoint rule quadrature.

2. Repeat the previous exercise for the trapezoidal rule.

3. Program the finite element method for problem (5.3-2). Take $p(x) = 1$, $r(x) = 0$ and $f(x) = 1$. Plot the log of the error in the solution at the element midpoints versus the number of nodal basis functions as you refine the mesh. Also plot the error in the solution at the natural boundary condition.

4. The one-dimensional beam bending problem [88] takes the form

\[ \frac{d^2}{dx^2}\left( p(x) \frac{d^2 u}{dx^2} \right) = f(x) , \quad 0 < x < L \]

\[ u(0) = 0 = u(L) \]

\[ \frac{d^2 u}{dx^2}(0) = 0 = \frac{d^2 u}{dx^2}(L) \]

Here $u(x)$ is the displacement of the beam, $p(x)$ is the flexural rigidity, $f(x)$ is the applied load, and $L$ is the length of the beam. The boundary conditions correspond to zero displacement at the ends of the beam. Determine the weak form of the beam bending problem. Which of the boundary conditions are essential, and which are natural?

5. Bessel’s equation is

\[ \frac{d}{dx}\left( x \frac{du}{dx} \right) + x u = 0 \]

On the half-line $x \ge 0$ it has two solutions, $J_0(x)$ and $Y_0(x)$. The former satisfies $J_0(0) = 1$ and $J_0'(0) = 0$, while the latter satisfies $Y_0(0) = \infty$. The former has an infinite number of real positive zeros, the first of which is α + 1 ≈ 2.4048.

(a) Describe the weak form of Bessel’s equation with specified boundary values.

(b) What are natural boundary conditions for this equation?

(c) Describe the finite element method for this problem on a general interval $0 < x < L$, using specified values at the boundaries, piecewise linear basis functions and midpoint rule quadrature.

(d) Program this finite element method for Bessel’s equation on the interval $0 < x < 1$ with $u(0) = 1$ and $u(1) = 0$. Perform a mesh refinement study.

(e) Program this finite element method for Bessel’s equation on the interval $0 < x < 2.4048$ with $u(0) = 1$ and $u(2.4048) = 0$. Perform a mesh refinement study.

5.3.2 Higher Order Finite Elements

One very important task in using higher-order piecewise polynomials in finite element methods is to construct the basis functions for the approximating space. We will provide several examples.

5.3.2.1 Continuous Piecewise Quadratic Splines

Since a piecewise quadratic on a mesh with $N$ elements has $3N$ parameters, we have $3N$ unknowns in the spline $s(x)$. Continuity of $s$ at the internal nodes $x_1, \ldots, x_{N-1}$ gives us $N - 1$ equations. We have $2N + 1$ remaining degrees of freedom, which are specified by

1. interpolating $f$ at the nodes $x_0, \ldots, x_N$ (which gives us $N + 1$ equations), and

2. interpolating $f$ at the element midpoints $\frac{x_i + x_{i+1}}{2}$, $0 \le i < N$ (which gives us $N$ equations).

Thus we have $(N - 1) + (N + 1) + N = 3N$ conditions to specify the $3N$ parameters uniquely.

There are two kinds of basis functions for piecewise quadratic splines. The first kind is associated with the nodes $x_0, \ldots, x_N$. The corresponding canonical basis function $B_0(\xi)$ is one at $\xi = 0$, and zero at $\xi = \pm 1, \pm\frac12$. The second kind is associated with the midpoints $\frac12(x_i + x_{i+1})$, and the corresponding canonical basis function $B_{\frac12}(\xi)$ is one at $\xi = \frac12$ and zero at $\xi = 0, 1$.

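We can compute the canonical basis functions by Newton interpolation. Note that $B_0(\xi)$ is an even function of $\xi$, so we only need to determine it for $0 \le \xi \le 1$. In this interval, the divided difference table is

\[
\begin{array}{c|ccc}
\xi & B_0 & & \\ \hline
0 & 1 & & \\
 & & -2 & \\
\tfrac12 & 0 & & 2 \\
 & & 0 & \\
1 & 0 & &
\end{array}
\]

Therefore, for $0 \le \xi \le 1$, $B_0(\xi) = 1 - 2\xi + 2\xi(\xi - \frac12) = (1 - 2\xi)(1 - \xi)$. Thus we define

\[ B_0(\xi) = \begin{cases} (1 - 2|\xi|)(1 - |\xi|), & |\xi| \le 1 \\ 0, & |\xi| \ge 1 \end{cases} . \]

Similarly, the divided difference table for $B_{\frac12}(\xi)$ is

\[
\begin{array}{c|ccc}
\xi & B_{\frac12} & & \\ \hline
0 & 0 & & \\
 & & 2 & \\
\tfrac12 & 1 & & -4 \\
 & & -2 & \\
1 & 0 & &
\end{array}
\]

Therefore,

\[ B_{\frac12}(\xi) = \begin{cases} 4\xi(1 - \xi), & 0 \le \xi \le 1 \\ 0, & 0 \ge \xi \text{ or } 1 \le \xi \end{cases} . \]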

These two canonical basis functions are shown in figure 5.3-3. The actual basis functions are defined in terms of the canonical basis functions by a change of coordinates:

\[ V_i(x) = B_0(\xi_i(x)) \quad \text{where} \quad \xi_i(x) \equiv \begin{cases} (x - x_i)/(x_{i+1} - x_i), & x \ge x_i \\ (x - x_i)/(x_i - x_{i-1}), & x \le x_i \end{cases} , \]

and

\[ V_{i+\frac12}(x) = B_{\frac12}(\xi_i(x)) . \]

(a) $B_0$ (b) $B_{\frac12}$

Figure 5.3-3: Canonical Basis Functions for Piecewise Quadratic Finite Elements
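The interpolation conditions on the two canonical quadratic basis functions are easy to verify numerically. The Python sketch below is an illustration; the function names `B0` and `B_half` are hypothetical stand-ins for the canonical functions defined above.

```python
def B0(xi):
    """Canonical node basis function: 1 at xi = 0, zero at xi = +-1/2, +-1."""
    a = abs(xi)
    return (1.0 - 2.0 * a) * (1.0 - a) if a <= 1.0 else 0.0


def B_half(xi):
    """Canonical midpoint basis function: 1 at xi = 1/2, zero at xi = 0, 1."""
    return 4.0 * xi * (1.0 - xi) if 0.0 <= xi <= 1.0 else 0.0


def element_shape_values(xi):
    """Values of the three basis functions that are nonzero on one element."""
    return [B0(xi), B_half(xi), B0(xi - 1.0)]


values_at_nodes = [element_shape_values(q) for q in (0.0, 0.5, 1.0)]
```

Evaluated at the left node, midpoint and right node of the canonical element, the three functions give the rows of the identity matrix, and at any point of the element they sum to one.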

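The solution of the differential equation can be written in terms of the basis functions in the form

\[ U(x) = \sum_{j=0}^{N} V_j(x)\, \upsilon_j + \sum_{j=0}^{N-1} V_{j+\frac12}(x)\, \upsilon_{j+\frac12} . \]

For any given value of $x$, at most 3 terms in the sums are nonzero. That is, if $x_j \le x \le x_{j+1}$ then we compute

\[
\begin{aligned}
U(x) &= V_j(x)\, \upsilon_j + V_{j+1}(x)\, \upsilon_{j+1} + V_{j+\frac12}(x)\, \upsilon_{j+\frac12} \\
&= B_0(\xi_j(x))\, \upsilon_j + B_0(\xi_j(x) - 1)\, \upsilon_{j+1} + B_{\frac12}(\xi_j(x))\, \upsilon_{j+\frac12} .
\end{aligned}
\]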

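For finite element approximations, we need to compute element-wise contributions to the stiffness matrix:

\[
\begin{bmatrix}
B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+\frac12}) & B_{k+\frac12}(V_k, V_{k+1}) \\
B_{k+\frac12}(V_{k+\frac12}, V_k) & B_{k+\frac12}(V_{k+\frac12}, V_{k+\frac12}) & B_{k+\frac12}(V_{k+\frac12}, V_{k+1}) \\
B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+\frac12}) & B_{k+\frac12}(V_{k+1}, V_{k+1})
\end{bmatrix}
\]

The entries of this matrix have the form

\[
\begin{aligned}
&\int_{x_k}^{x_{k+1}} V_i'(x)\, p(x)\, V_j'(x) + V_i(x)\, r(x)\, V_j(x)\, dx \\
&\quad = \int_0^1 \Big[ V_i'(x_k + \xi h_{k+\frac12})\, p(x_k + \xi h_{k+\frac12})\, V_j'(x_k + \xi h_{k+\frac12}) \\
&\qquad\qquad + V_i(x_k + \xi h_{k+\frac12})\, r(x_k + \xi h_{k+\frac12})\, V_j(x_k + \xi h_{k+\frac12}) \Big] h_{k+\frac12}\, d\xi
\end{aligned}
\]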

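Using our quadrature rule, we obtain

\[
\begin{aligned}
&\begin{bmatrix}
B_{k+\frac12}(V_k, V_k) & B_{k+\frac12}(V_k, V_{k+\frac12}) & B_{k+\frac12}(V_k, V_{k+1}) \\
B_{k+\frac12}(V_{k+\frac12}, V_k) & B_{k+\frac12}(V_{k+\frac12}, V_{k+\frac12}) & B_{k+\frac12}(V_{k+\frac12}, V_{k+1}) \\
B_{k+\frac12}(V_{k+1}, V_k) & B_{k+\frac12}(V_{k+1}, V_{k+\frac12}) & B_{k+\frac12}(V_{k+1}, V_{k+1})
\end{bmatrix} \\
&\quad \approx
\begin{bmatrix} B_0'(q_1) & B_0'(q_2) \\ B_{\frac12}'(q_1) & B_{\frac12}'(q_2) \\ B_0'(q_1 - 1) & B_0'(q_2 - 1) \end{bmatrix}
\begin{bmatrix} \dfrac{p(x_k + q_1 h_{k+\frac12})\, \omega_1}{h_{k+\frac12}} & 0 \\ 0 & \dfrac{p(x_k + q_2 h_{k+\frac12})\, \omega_2}{h_{k+\frac12}} \end{bmatrix}
\begin{bmatrix} B_0'(q_1) & B_{\frac12}'(q_1) & B_0'(q_1 - 1) \\ B_0'(q_2) & B_{\frac12}'(q_2) & B_0'(q_2 - 1) \end{bmatrix} \\
&\qquad +
\begin{bmatrix} B_0(q_1) & B_0(q_2) \\ B_{\frac12}(q_1) & B_{\frac12}(q_2) \\ B_0(q_1 - 1) & B_0(q_2 - 1) \end{bmatrix}
\begin{bmatrix} r(x_k + q_1 h_{k+\frac12})\, \omega_1 h_{k+\frac12} & 0 \\ 0 & r(x_k + q_2 h_{k+\frac12})\, \omega_2 h_{k+\frac12} \end{bmatrix}
\begin{bmatrix} B_0(q_1) & B_{\frac12}(q_1) & B_0(q_1 - 1) \\ B_0(q_2) & B_{\frac12}(q_2) & B_0(q_2 - 1) \end{bmatrix}
\end{aligned}
\]

Here we are using the Gaussian quadrature rule

\[ \int_0^1 g(x)\, dx \approx \sum_{\ell=1}^{2} g(q_\ell)\, \omega_\ell \equiv g\!\left(\frac{1 - \sqrt{1/3}}{2}\right) \frac12 + g\!\left(\frac{1 + \sqrt{1/3}}{2}\right) \frac12 ; \]

in other words, $q_1 = (1 - \sqrt{1/3})/2$, $q_2 = (1 + \sqrt{1/3})/2$, and $\omega_1 = 1/2 = \omega_2$. We also have element-wise contributions to the right-hand side vector $f$ of the form

\[
\begin{aligned}
\begin{bmatrix} \int_{x_k}^{x_{k+1}} V_k f\, dx \\ \int_{x_k}^{x_{k+1}} V_{k+\frac12} f\, dx \\ \int_{x_k}^{x_{k+1}} V_{k+1} f\, dx \end{bmatrix}
&= \begin{bmatrix} \int_0^1 V_k(x_k + \xi h_{k+\frac12})\, f(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \\ \int_0^1 V_{k+\frac12}(x_k + \xi h_{k+\frac12})\, f(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \\ \int_0^1 V_{k+1}(x_k + \xi h_{k+\frac12})\, f(x_k + \xi h_{k+\frac12})\, h_{k+\frac12}\, d\xi \end{bmatrix} \\
&\approx \begin{bmatrix} B_0(q_1) & B_0(q_2) \\ B_{\frac12}(q_1) & B_{\frac12}(q_2) \\ B_0(q_1 - 1) & B_0(q_2 - 1) \end{bmatrix}
\begin{bmatrix} f(x_k + q_1 h_{k+\frac12})\, \omega_1 h_{k+\frac12} \\ f(x_k + q_2 h_{k+\frac12})\, \omega_2 h_{k+\frac12} \end{bmatrix}
\end{aligned}
\]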
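The quadrature-based element matrix for piecewise quadratics can be sketched in a few lines. The Python code below is an illustration, not the accompanying Fortran routine: it assumes the two-point Gauss rule on the unit interval, coefficients p and r supplied as callables, and local unknowns ordered as (left node, midpoint, right node). The names `dB0`, `dB_half` and `element_matrix` are hypothetical.

```python
import math

# Two-point Gauss rule on [0, 1].
Q = [(1.0 - math.sqrt(1.0 / 3.0)) / 2.0, (1.0 + math.sqrt(1.0 / 3.0)) / 2.0]
W = [0.5, 0.5]


def B0(xi):
    a = abs(xi)
    return (1.0 - 2.0 * a) * (1.0 - a) if a <= 1.0 else 0.0


def dB0(xi):
    """Derivative of B0 with respect to xi (odd function of xi)."""
    a = abs(xi)
    if a > 1.0:
        return 0.0
    return (4.0 * a - 3.0) * (1.0 if xi >= 0.0 else -1.0)


def B_half(xi):
    return 4.0 * xi * (1.0 - xi) if 0.0 <= xi <= 1.0 else 0.0


def dB_half(xi):
    return 4.0 - 8.0 * xi if 0.0 <= xi <= 1.0 else 0.0


def element_matrix(p, r, x_left, h):
    """3x3 element matrix by two-point Gauss quadrature."""
    A = [[0.0] * 3 for _ in range(3)]
    for q, w in zip(Q, W):
        x = x_left + q * h
        V = [B0(q), B_half(q), B0(q - 1.0)]
        dV = [dB0(q), dB_half(q), dB0(q - 1.0)]
        for i in range(3):
            for j in range(3):
                A[i][j] += dV[i] * p(x) * w / h * dV[j] + V[i] * r(x) * w * h * V[j]
    return A


A = element_matrix(lambda x: 1.0, lambda x: 0.0, 0.0, 0.5)
```

For p = 1 and r = 0 the derivative products are quadratic in ξ, so the two-point rule integrates them exactly and `A` equals (1/(3h)) times the matrix with rows (7, −8, 1), (−8, 16, −8), (1, −8, 7). The mass terms involve quartic products, which this rule integrates only approximately.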

Numerical results with the programs Program 5.3-5: FiniteElementMain.C and Program 5.3-6: finite element.f are shown in figures 5.3-4 and 5.3-5. The former of these two figures shows the error in the finite element approximation as a function of $x$, and the error, scaled so that the max absolute value is 1 in each element, and mapped spatially to the unit interval. In this figure, it is clear that the finite element approximation is more accurate at the element boundaries $x_i$ and the element midpoints. In figure 5.3-5 we show the results of a mesh refinement study with continuous piecewise quadratics. These results demonstrate that for continuous piecewise quadratics the $L^2$ error is proportional to $\Delta x^3$, and the $L^\infty$ error at the element boundaries and element centers is proportional to $\Delta x^4$. The higher rate of convergence of the solution at the mesh points is called superconvergence.

5.3.2.2 Continuous Piecewise Cubic Splines

Since a cubic has 4 coefficients, a piecewise cubic on a mesh of $n$ elements has $4n$ unknowns. Continuity of the piecewise cubic at the internal nodes $x_1, \ldots, x_{n-1}$ gives us $n - 1$ equations. We have $3n + 1$ remaining degrees of freedom, which are specified by

1. interpolating function values at $x_0, \ldots, x_n$ (which gives us $n + 1$ equations),

2. interpolating function values at $\frac{2x_i + x_{i+1}}{3}$, $0 \le i < n$ (which gives us $n$ equations), and

3. interpolating function values at $\frac{x_i + 2x_{i+1}}{3}$, $0 \le i < n$ (which gives us $n$ equations).

These conditions uniquely determine the piecewise cubic.

The canonical basis functions are easily determined from the Newton divided difference table to be

\[ v_0(\xi) = \begin{cases} \frac12 (1 - |\xi|)(1 - 3|\xi|)(2 - 3|\xi|), & |\xi| \le 1 \\ 0, & |\xi| \ge 1 \end{cases} , \]

\[ v_{\frac13}(\xi) = \begin{cases} \frac92\, \xi (2 - 3\xi)(1 - \xi), & 0 \le \xi \le 1 \\ 0, & 0 \ge \xi \text{ or } 1 \le \xi \end{cases} , \]

and

\[ v_{\frac23}(\xi) = v_{\frac13}(1 - \xi) . \]
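As a quick consistency check on these formulas, the following Python sketch evaluates the four cubic basis functions that are nonzero on one canonical element at the four interpolation points. The function names `v0`, `v13` and `v23` are hypothetical labels for the canonical functions above.

```python
def v0(xi):
    """Canonical node function: 1 at xi = 0, zero at |xi| = 1/3, 2/3, 1."""
    a = abs(xi)
    return 0.5 * (1.0 - a) * (1.0 - 3.0 * a) * (2.0 - 3.0 * a) if a <= 1.0 else 0.0


def v13(xi):
    """Canonical interior function: 1 at xi = 1/3, zero at xi = 0, 2/3, 1."""
    return 4.5 * xi * (2.0 - 3.0 * xi) * (1.0 - xi) if 0.0 <= xi <= 1.0 else 0.0


def v23(xi):
    """Mirror image of v13: 1 at xi = 2/3."""
    return v13(1.0 - xi)


def element_shape_values(xi):
    """Order: left node, 1/3 point, 2/3 point, right node."""
    return [v0(xi), v13(xi), v23(xi), v0(xi - 1.0)]


points = [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0]
table = [element_shape_values(q) for q in points]
```

Up to rounding, `table` is the 4 × 4 identity matrix, which confirms that each function equals one at its own interpolation point and vanishes at the other three.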




(a) error (b) scaled error mapped to canonical element

Figure 5.3-4: Errors in Continuous Piecewise Quadratic Finite Elements: 10 elements


(a) $L^2$ error at Gauss quadrature points = $O(\Delta x^3)$ (b) $L^\infty$ error at mesh points = $O(\Delta x^4)$

Figure 5.3-5: Errors in Continuous Piecewise Quadratic Finite Elements: log base 10 of errors versus log base 10 of number of basis functions


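The actual basis functions are defined in terms of the canonical basis functions by a change of coordinates:

\[ V_i(x) = v_0(\xi_i(x)) \quad \text{where} \quad \xi_i(x) \equiv \begin{cases} (x - x_i)/(x_{i+1} - x_i), & x \ge x_i \\ (x - x_i)/(x_i - x_{i-1}), & x \le x_i \end{cases} , \]

\[ V_{i+\frac13}(x) = v_{\frac13}\big((x - x_i)/(x_{i+1} - x_i)\big) , \]

and

\[ V_{i+\frac23}(x) = v_{\frac23}\big((x - x_i)/(x_{i+1} - x_i)\big) . \]

These canonical basis functions are shown in figure 5.3-6.

The piecewise cubic can be written in terms of the basis functions in the form

\[ U(x) = \sum_{j=0}^{n} V_j(x)\, \upsilon_j + \sum_{j=0}^{n-1} \left[ V_{j+\frac13}(x)\, \upsilon_{j+\frac13} + V_{j+\frac23}(x)\, \upsilon_{j+\frac23} \right] . \]

For any given value of $x$, at most 4 terms in the sums are nonzero. That is, if $x_j \le x \le x_{j+1}$ then we compute

\[
\begin{aligned}
U(x) &= V_j(x)\, \upsilon_j + V_{j+1}(x)\, \upsilon_{j+1} + V_{j+\frac13}(x)\, \upsilon_{j+\frac13} + V_{j+\frac23}(x)\, \upsilon_{j+\frac23} \\
&= v_0(\xi_j(x))\, \upsilon_j + v_0(\xi_j(x) - 1)\, \upsilon_{j+1} + v_{\frac13}(\xi_j(x))\, \upsilon_{j+\frac13} + v_{\frac23}(\xi_j(x))\, \upsilon_{j+\frac23} .
\end{aligned}
\]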

Quadratures involving cubics can use the Gaussian quadrature rule

\[ \int_0^1 g(x)\, dx \approx \sum_{\ell=1}^{3} g(q_\ell)\, \omega_\ell \equiv g\!\left(\frac{1 - \sqrt{3/5}}{2}\right) \frac{5}{18} + g\!\left(\frac12\right) \frac49 + g\!\left(\frac{1 + \sqrt{3/5}}{2}\right) \frac{5}{18} . \]

This is exact for $g \in P^5$.

Numerical results with the programs Program 5.3-7: FiniteElementMain.C and Program 5.3-8: finite element.f are shown in figures 5.3-7 and 5.3-8. The former of these two figures shows the error in the finite element approximation as a function of $x$, and the error, scaled so that the max absolute value is 1 in each element, and mapped spatially to the unit interval. In this figure, it is clear that the finite element approximation is more accurate at the element boundaries $x_i$ and two points in the interior of each mesh element. In figure 5.3-8 we show the results of a mesh refinement study with continuous piecewise cubics. These results demonstrate that for continuous piecewise cubics the $L^2$ error is proportional to $\Delta x^4$, and the $L^\infty$ error at the element boundaries is proportional to $\Delta x^6$. This is another example of superconvergence. In general, we expect that continuous piecewise polynomials of degree $k > 1$ lead to superconvergent finite element approximations that are accurate to order $2k$. (These orders are valid for $k = 1$, but do not represent superconvergence.)
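The degree of exactness of the three-point Gauss rule on the unit interval, stated before the numerical results above, can be checked directly on monomials. The Python sketch below is illustrative; the function name `gauss3` is hypothetical.

```python
import math


def gauss3(g):
    """Three-point Gauss rule on [0, 1] with weights 5/18, 4/9, 5/18."""
    s = math.sqrt(3.0 / 5.0)
    return (5.0 * g((1.0 - s) / 2.0) + 8.0 * g(0.5) + 5.0 * g((1.0 + s) / 2.0)) / 18.0


# Quadrature error for the monomials x^m, whose exact integral is 1/(m+1).
errors = [abs(gauss3(lambda x, m=m: x ** m) - 1.0 / (m + 1)) for m in range(8)]
```

The entries of `errors` for degrees 0 through 5 are at rounding level, while the entry for degree 6 is visibly nonzero, consistent with exactness for polynomials of degree at most five.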




(a) v0 (b) v1 (c) v2

Figure 5.3-6: Canonical Basis Functions for Continuous Piecewise Cubic Finite Elements



Figure 5.3-7: Errors in Continuous Piecewise Cubic Finite Elements: 10 elements


(a) $L^2$ error at Gauss quadrature points = $O(\Delta x^4)$ (b) $L^\infty$ error at mesh points = $O(\Delta x^6)$

Figure 5.3-8: Errors in Continuous Piecewise Cubic Finite Elements: log base 10 of errors versus log base 10 of number of basis functions


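5.3.2.3 $C^1$ Piecewise Cubic Splines

There are $4n$ parameters and $2(n - 1)$ continuity constraints. This leaves $2n + 2$ degrees of freedom, determined by interpolating function values and function slopes at $x_0, \ldots, x_n$.

There are two canonical basis functions. The first, $v(\xi)$, is an even function such that $v(0) = 1$ and $v(1) = 0$, $v'(0) = 0 = v'(1)$. The Newton divided difference table for this function in the interval $[0, 1]$ is

\[
\begin{array}{c|cccc}
\xi & v & & & \\ \hline
0 & 1 & & & \\
 & & 0 & & \\
0 & 1 & & -1 & \\
 & & -1 & & 2 \\
1 & 0 & & 1 & \\
 & & 0 & & \\
1 & 0 & & &
\end{array}
\]

Thus for $0 \le \xi \le 1$, $v(\xi) = 1 - \xi^2 + 2\xi^2(\xi - 1) = (1 - \xi)^2 (1 + 2\xi)$. In other words, $v(\xi)$ is defined by

\[ v(\xi) = \begin{cases} (1 - |\xi|)^2 (1 + 2|\xi|), & |\xi| \le 1 \\ 0, & |\xi| \ge 1 \end{cases} . \]

The second canonical basis function, $s(\xi)$, is an odd function such that $s'(0) = 1$ and $s(0) = 0 = s(1)$, $s'(1) = 0$. The divided difference table for the second canonical basis function is

\[
\begin{array}{c|cccc}
\xi & s & & & \\ \hline
0 & 0 & & & \\
 & & 1 & & \\
0 & 0 & & -1 & \\
 & & 0 & & 1 \\
1 & 0 & & 0 & \\
 & & 0 & & \\
1 & 0 & & &
\end{array}
\]

Thus for $0 \le \xi \le 1$, $s(\xi) = \xi - \xi^2 + \xi^2(\xi - 1) = \xi(1 - \xi)^2$. In other words, $s(\xi)$ is defined by

\[ s(\xi) = \begin{cases} \xi (1 - |\xi|)^2, & |\xi| \le 1 \\ 0, & |\xi| \ge 1 \end{cases} . \]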
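The value and slope conditions on the two Hermite canonical functions can be confirmed with a short computation. The Python sketch below is illustrative only; the derivative formulas `dv` and `ds` are worked out by hand from the definitions above, and all four function names are hypothetical.

```python
def v(xi):
    """Even Hermite function: v(0) = 1, v(+-1) = 0, zero slope at 0 and +-1."""
    a = abs(xi)
    return (1.0 - a) ** 2 * (1.0 + 2.0 * a) if a <= 1.0 else 0.0


def dv(xi):
    """Derivative of v: -6 xi (1 - |xi|) on [-1, 1]."""
    a = abs(xi)
    return -6.0 * xi * (1.0 - a) if a <= 1.0 else 0.0


def s(xi):
    """Odd Hermite function: s(0) = 0 = s(+-1), s'(0) = 1, s'(+-1) = 0."""
    a = abs(xi)
    return xi * (1.0 - a) ** 2 if a <= 1.0 else 0.0


def ds(xi):
    """Derivative of s: (1 - |xi|)(1 - 3|xi|) on [-1, 1]."""
    a = abs(xi)
    return (1.0 - a) * (1.0 - 3.0 * a) if a <= 1.0 else 0.0


def hermite_interpolant(g0, g1, dg0, dg1, xi):
    """Cubic on the canonical element matching values and slopes at xi = 0, 1."""
    return g0 * v(xi) + g1 * v(xi - 1.0) + dg0 * s(xi) + dg1 * s(xi - 1.0)


# Interpolating g(xi) = xi^3 (values 0, 1; slopes 0, 3) should reproduce g.
value = hermite_interpolant(0.0, 1.0, 0.0, 3.0, 0.5)
```

Since the Hermite data determine a cubic uniquely, interpolating a cubic reproduces it; here `value` should equal 0.5³ = 0.125.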

These canonical basis functions are shown in figure 5.3-9.

The piecewise cubic can be written in terms of the basis functions in the form

\[ U(x) = \sum_{j=0}^{n} \left[ V_j(x)\, \upsilon_j + S_j(x)\, \upsilon_j' \right] . \]

(a) $v$ (b) $s$

Figure 5.3-9: Canonical Basis Functions for Continuously Differentiable Piecewise Cubic Finite Elements

For any given value of $x$, at most 4 terms in the sum are nonzero. That is, if $x_j \le x \le x_{j+1}$ then we compute

\[
\begin{aligned}
U(x) &= V_j(x)\, \upsilon_j + V_{j+1}(x)\, \upsilon_{j+1} + S_j(x)\, \upsilon_j' + S_{j+1}(x)\, \upsilon_{j+1}' \\
&= v(\xi_j(x))\, \upsilon_j + v(\xi_j(x) - 1)\, \upsilon_{j+1} + s(\xi_j(x))\, \upsilon_j' + s(\xi_j(x) - 1)\, \upsilon_{j+1}' .
\end{aligned}
\]

Numerical results with the programs Program 5.3-9: FiniteElementMain.C and Program 5.3-10: finite element.f are shown in figures 5.3-10 and 5.3-11. The former of these two figures shows the error in the finite element approximation as a function of $x$, and the error, scaled so that the max absolute value is 1 in each element, and mapped spatially to the unit interval. In this figure, it is clear that the finite element approximation is more accurate at two points in the interior of each mesh element. In figure 5.3-11 we show the results of a mesh refinement study with piecewise Hermite cubics. These results demonstrate that for continuous piecewise cubics the $L^2$ error is proportional to $\Delta x^4$, and the $L^\infty$ error at the zeros of the quadratic Legendre polynomial is proportional to $\Delta x^4$. DeBoor and Swartz [32] show that Hermite finite element methods of order $k \ge 3$ are superconvergent with order $2k - 2$ at the zeros of the Legendre polynomial of order $k - 1$.


Figure 5.3-10: Errors in C1 Piecewise Cubic Finite Elements: 10 elements




(a) $L^2$ error at Gauss quadrature points (b) $L^\infty$ error at mesh points

Figure 5.3-11: Errors in $C^1$ Piecewise Cubic Finite Elements: log base 10 of errors versus log base 10 of number of basis functions; both errors are $O(\Delta x^4)$


5.3.2.4 $C^2$ Piecewise Cubic Splines

There are $4n$ parameters and $3(n - 1)$ continuity constraints. This leaves $n + 3$ degrees of freedom, determined by interpolating $f$ at $x_0, \ldots, x_n$ and 2 other conditions, one at each of the endpoints $x_0$ and $x_n$. One common choice for the endpoint conditions is

\[ s''(x_0) = 0 = s''(x_n) ; \]

another common endpoint condition is

\[ s'(x_0) = f'(x_0) , \quad s'(x_n) = f'(x_n) . \]

On a non-uniform mesh, it is not possible to determine local basis functions for the spline. As a result, these piecewise cubics are seldom used in practical finite element computations.

Exercises 5.3.2

1. Consider continuous piecewise quartic polynomials.

(a) Determine the degrees of freedom and appropriate interpolation points.

(b) Determine canonical basis functions for this interpolation, as well as the derivatives of these functions.

(c) Describe the Gaussian quadrature rule that is exact for polynomials of degree at most seven on the unit interval.

(d) Program the finite element method for continuous piecewise quartics.

(e) Perform a mesh refinement study: plot the log of the error at the quadrature points as a function of the log of the number of basis functions, and plot the log of the error at the mesh points.

2. Repeat the previous exercise for continuously differentiable piecewise quartic polynomials.


− d

dx(du

dx) = π2 cos(πx) , 0 < x < 1

u(0) = 1 ,du

dx(1) = 0 .


(b) Program the finite element method for this problem, using continuous piecewise cubics. Describecarefully how you treat the boundary conditions.


(d) Plot the log of the error in the derivative of the solution at the mesh points versus the log ofthe number of basis functions, for 2n elements, 1 ≤ n ≤ 10. What is the slope of this curve (i.e.the order of convergence)?

4. Consider the problem in the previous exercise, approximated by C1 (Hermite) cubics.

(a) Program the finite element method for this problem, and plot the log of the error in the solution at the quadrature points versus the log of the number of basis functions. What is the slope of this curve?

(b) Suppose that instead of treating the right-hand boundary condition as natural, we impose this boundary condition directly on the finite element approximation. Program the corresponding numerical method, plot the errors in mesh refinement as before, and find the slope of the curve.


5.3.3 Hierarchical Elements

Babuska and a succession of students have developed hierarchical spaces for finite element approximations [9, 87]. Instead of varying the mesh width $\Delta x$, it is often useful to vary the order of the finite element approximating functions. Efficient implementations of this approach require nested piecewise polynomial spaces. Babuska accomplished this by employing integrals of Legendre polynomials, since the Legendre polynomials are orthogonal with respect to a constant weight function. For constant coefficient problems, there are significant advantages to this approach, as we will see.

It is well-known that orthogonal polynomials can be generated via three-term recurrences. It is also well-known that the orthogonal polynomial of degree $n$ has $n$ simple zeros in the interval of integration for the inner product. In particular, Legendre polynomials are orthogonal on $I = (-1, 1)$ with respect to the $L^2$ inner product, and can be generated by the three-term recurrence

$$ P_0(\xi) = 1, \quad P_1(\xi) = \xi, \quad \forall n \ge 1 \quad P_{n+1}(\xi) = \frac{2n+1}{n+1}\,\xi P_n(\xi) - \frac{n}{n+1}\,P_{n-1}(\xi) . $$

With this normalization of the Legendre polynomials,

$$ \int_{-1}^{1} P_n(\xi)^2 \, d\xi = \frac{2}{2n+1} , $$

$$ \forall\, -1 \le \xi \le 1 , \ |P_n(\xi)| \le 1 , \qquad P_n(\pm 1) = (\pm 1)^n , \qquad P'_{n+1}(\xi) - P'_{n-1}(\xi) = (2n+1) P_n(\xi) . $$
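The recurrence and the normalization identities can be checked in exact rational arithmetic. The following is a minimal sketch, not taken from the book's programs; the helper names `legendre_coeffs`, `integrate_product` and `evaluate` are hypothetical.

```python
from fractions import Fraction


def legendre_coeffs(n_max):
    """Coefficient lists (lowest degree first) of P_0..P_{n_max} from the three-term recurrence."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for n in range(1, n_max):
        xi_pn = [Fraction(0)] + polys[n]             # coefficients of xi * P_n
        pnm1 = polys[n - 1] + [Fraction(0)] * 2      # P_{n-1}, padded to the same length
        polys.append([Fraction(2 * n + 1, n + 1) * a - Fraction(n, n + 1) * b
                      for a, b in zip(xi_pn, pnm1)])
    return polys[: n_max + 1]


def integrate_product(p, q):
    """Exact integral over (-1, 1) of the product of two polynomials."""
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:                     # odd powers integrate to zero
                total += a * b * Fraction(2, i + j + 1)
    return total


def evaluate(p, x):
    """Evaluate a polynomial given by its coefficient list."""
    return sum(c * x ** k for k, c in enumerate(p))
```

Because the arithmetic is rational, the orthogonality relations and the values $P_n(\pm 1)$ hold exactly rather than to rounding error.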

Suppose that we want to find a quadrature rule of the form

$$ \int_{-1}^{1} f(\xi) \, d\xi \approx \sum_{j=0}^{n} f(\xi_j)\, \omega_j . $$

If we want this rule to be exact for all polynomials of degree as high as possible for a given number of quadrature points, then we use Gaussian quadrature. In Gauss-Lobatto quadrature, we require that $\xi_0 = -1$ and $\xi_n = 1$, so that the endpoints of the integration interval are quadrature points. Given this constraint on the quadrature points, it is possible to show [57] that the quadrature rule will have the highest possible order for a given number of quadrature points if $\{\xi_j\}_{j=1}^{n-1}$ are chosen to be the zeros of $P_n'(\xi)$, and the coefficients in the quadrature rule are given by

$$ \omega_0 = \omega_n = \frac{2}{n(n+1)} \quad \text{and} \quad \forall 1 \le j < n \quad \omega_j = \frac{2}{n(n+1) P_n(\xi_j)^2} . $$

With these choices, the Gauss-Lobatto quadrature rule with $n + 1$ quadrature points is exact for all $f \in \mathcal{P}_{2n-1}$.
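As an illustration, the nodes and weights can be computed directly from these formulas. The sketch below is one possible approach, not the book's program: it finds the zeros of $P_n'$ by Newton's method started from Chebyshev-Lobatto points (an assumed choice of initial guess), using the Legendre differential equation to evaluate $P_n''$. The function names are hypothetical.

```python
import math


def legendre_pair(n, xi):
    """Return (P_{n-1}(xi), P_n(xi)) from the three-term recurrence, for n >= 1."""
    p_prev, p = 1.0, xi
    for m in range(1, n):
        p_prev, p = p, ((2 * m + 1) * xi * p - m * p_prev) / (m + 1)
    return p_prev, p


def gauss_lobatto(n):
    """Nodes and weights of the (n+1)-point Gauss-Lobatto rule on [-1, 1], for n >= 1."""
    nodes = [-1.0]
    for j in range(1, n):
        xi = -math.cos(math.pi * j / n)              # assumed starting guess
        for _ in range(100):                         # Newton iteration on P_n'
            p_prev, p = legendre_pair(n, xi)
            dp = n * (p_prev - xi * p) / (1.0 - xi * xi)
            # (1 - xi^2) P_n'' = 2 xi P_n' - n (n + 1) P_n
            d2p = (2.0 * xi * dp - n * (n + 1) * p) / (1.0 - xi * xi)
            step = dp / d2p
            xi -= step
            if abs(step) < 1e-15:
                break
        nodes.append(xi)
    nodes.append(1.0)
    weights = [2.0 / (n * (n + 1) * legendre_pair(n, x)[1] ** 2) for x in nodes]
    return nodes, weights
```

For $n = 1$ this reproduces the trapezoidal rule, and for $n = 2$ it reproduces Simpson's rule on $(-1, 1)$.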

We are now ready to describe our hierarchical elements. Piecewise linear functions contribute the first two shape functions:

$$ V_0^*(\xi) = \tfrac{1}{2}(1 - \xi) = 1 - \tfrac{1}{2}\int_{-1}^{\xi} P_0(t) \, dt , $$

$$ V_1^*(\xi) = \tfrac{1}{2}(1 + \xi) = \tfrac{1}{2}\int_{-1}^{\xi} P_0(t) \, dt . $$

The other basis functions are defined by

$$ \forall 2 \le j \quad V_j^*(\xi) = \sqrt{\frac{2j-1}{2}} \int_{-1}^{\xi} P_{j-1}(t) \, dt . $$

Let us discuss some of the properties of these shape functions. Note that

$$ \forall 2 \le j \quad V_j^*(-1) = 0 = V_j^*(1) . $$

The value at $-1$ is obvious by the definition of these shape functions; the value at $1$ follows from the orthogonality of $P_{j-1}$ to $P_0(\xi) \equiv 1$.

Suppose that we want to approximate the solution to the differential equation

$$ -\frac{d}{dx}\left(D(x)\frac{du}{dx}\right) = f , \quad 0 < x < L . $$

In the stiffness matrix for this problem, we map a mesh element $(x_k, x_{k+1})$ to the canonical element $(-1, 1)$ by the transformation

$$ x(\xi) = x_k + \frac{\Delta x_k}{2}(\xi + 1) , $$

where $\Delta x_k = x_{k+1} - x_k$. The element contribution to the stiffness matrix is

$$ \int_{x_k}^{x_{k+1}} \frac{dV_i}{dx} D(x) \frac{dV_j}{dx} \, dx = \frac{2}{\Delta x_k} \int_{-1}^{1} \frac{dV_i^*}{d\xi} D(x(\xi)) \frac{dV_j^*}{d\xi} \, d\xi . $$

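Note that if $D$ is constant, then for $2 \le i, j$ and $i \ne j$ this integral will be zero; this is because

$$ \forall 2 \le i < j \quad \int_{-1}^{1} \frac{dV_i^*}{d\xi} \frac{dV_j^*}{d\xi} \, d\xi = \sqrt{\frac{2j-1}{2}} \sqrt{\frac{2i-1}{2}} \int_{-1}^{1} P_{i-1}(\xi) P_{j-1}(\xi) \, d\xi = 0 . $$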


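Also note that

$$ \forall 2 \le j \quad \int_{-1}^{1} \frac{dV_0^*}{d\xi} \frac{dV_j^*}{d\xi} \, d\xi = -\frac{1}{2} \sqrt{\frac{2j-1}{2}} \int_{-1}^{1} P_0(\xi) P_{j-1}(\xi) \, d\xi = 0 $$

and

$$ \forall 2 \le j \quad \int_{-1}^{1} \frac{dV_1^*}{d\xi} \frac{dV_j^*}{d\xi} \, d\xi = \frac{1}{2} \sqrt{\frac{2j-1}{2}} \int_{-1}^{1} P_0(\xi) P_{j-1}(\xi) \, d\xi = 0 . $$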

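The diagonal terms in the stiffness matrix for constant coefficient $D$ would be

$$ \forall 2 \le j \quad \frac{2D}{\Delta x_k} \int_{-1}^{1} \left(\frac{dV_j^*}{d\xi}\right)^2 d\xi = \frac{2D}{\Delta x_k}\, \frac{2j-1}{2} \int_{-1}^{1} P_{j-1}(\xi)^2 \, d\xi = \frac{2D}{\Delta x_k} . $$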

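In summary, for constant coefficient $D$ the element-wise contribution to the stiffness matrix is

$$ \left[ \int_{x_k}^{x_{k+1}} \frac{dV_i}{dx} D(x) \frac{dV_j}{dx} \, dx \right] = \begin{pmatrix} \tfrac{1}{2} & -\tfrac{1}{2} & 0 & \cdots & 0 \\ -\tfrac{1}{2} & \tfrac{1}{2} & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \frac{2D}{\Delta x_k} . $$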
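The structure of this matrix can be reproduced by integrating products of the shape function derivatives $dV_j^*/d\xi$, which are scaled Legendre polynomials. The following is a minimal sketch under that observation; `hierarchical_stiffness` and its helpers are hypothetical names, not routines from hierarchical.f.

```python
import math


def legendre_coeffs(n_max):
    """Coefficient lists (lowest degree first) of P_0..P_{n_max}."""
    polys = [[1.0], [0.0, 1.0]]
    for n in range(1, n_max):
        xi_pn = [0.0] + polys[n]
        pnm1 = polys[n - 1] + [0.0, 0.0]
        polys.append([((2 * n + 1) * a - n * b) / (n + 1) for a, b in zip(xi_pn, pnm1)])
    return polys[: n_max + 1]


def integrate_product(p, q):
    """Integral over (-1, 1) of the product of two polynomials."""
    return sum(a * b * 2.0 / (i + j + 1)
               for i, a in enumerate(p) for j, b in enumerate(q) if (i + j) % 2 == 0)


def hierarchical_stiffness(order, D, dx):
    """Element stiffness matrix for V*_0..V*_order with constant coefficient D."""
    P = legendre_coeffs(max(order - 1, 1))
    # dV*_0/dxi = -1/2, dV*_1/dxi = 1/2, dV*_j/dxi = sqrt((2j-1)/2) P_{j-1} for j >= 2
    dV = [[-0.5], [0.5]] + [[math.sqrt((2 * j - 1) / 2.0) * c for c in P[j - 1]]
                            for j in range(2, order + 1)]
    return [[2.0 * D / dx * integrate_product(dV[i], dV[j])
             for j in range(order + 1)] for i in range(order + 1)]
```

With `order = 1` this reduces to the familiar piecewise linear element matrix $(D/\Delta x_k)\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$.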

In the right-hand side, we need to compute terms of the form

$$ \int_{x_k}^{x_{k+1}} V_i(x) f(x) \, dx = \frac{1}{2} \Delta x_k \int_{-1}^{1} V_i^*(\xi) f(x(\xi)) \, d\xi . $$

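For constant $f$, we use the fact that for $2 \le j$

$$ V_j^*(\xi) = \sqrt{\frac{2j-1}{2}} \int_{-1}^{\xi} P_{j-1}(t) \, dt = \frac{P_j(\xi) - P_{j-2}(\xi)}{\sqrt{2(2j-1)}} $$

and the fact that the Legendre polynomials $P_j$ are orthogonal to constants for $j > 0$. Thus the element-wise contribution to the right-hand side with constant $f$ is

$$ \left[ \int_{x_k}^{x_{k+1}} V_i f(x) \, dx \right] = \begin{pmatrix} 1 \\ 1 \\ -\sqrt{\tfrac{2}{3}} \\ 0 \\ \vdots \\ 0 \end{pmatrix} \frac{f \, \Delta x_k}{2} . $$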
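The load vector for constant $f$ can be checked the same way, by integrating each shape function over $(-1, 1)$. This is a sketch under the stated formulas; the name `hierarchical_load` is hypothetical.

```python
import math


def legendre_coeffs(n_max):
    """Coefficient lists (lowest degree first) of P_0..P_{n_max}."""
    polys = [[1.0], [0.0, 1.0]]
    for n in range(1, n_max):
        xi_pn = [0.0] + polys[n]
        pnm1 = polys[n - 1] + [0.0, 0.0]
        polys.append([((2 * n + 1) * a - n * b) / (n + 1) for a, b in zip(xi_pn, pnm1)])
    return polys[: n_max + 1]


def integral(p):
    """Integral over (-1, 1) of a polynomial given by its coefficient list."""
    return sum(c * 2.0 / (k + 1) for k, c in enumerate(p) if k % 2 == 0)


def hierarchical_load(order, f, dx):
    """Element load vector for V*_0..V*_order with constant source f."""
    P = legendre_coeffs(max(order, 1))
    shapes = [[0.5, -0.5], [0.5, 0.5]]               # V*_0 and V*_1
    for j in range(2, order + 1):
        scale = 1.0 / math.sqrt(2.0 * (2 * j - 1))
        pj, pjm2 = P[j], P[j - 2] + [0.0, 0.0]
        shapes.append([scale * (a - b) for a, b in zip(pj, pjm2)])  # (P_j - P_{j-2}) / sqrt(2(2j-1))
    return [0.5 * f * dx * integral(s) for s in shapes]
```

Only the first three entries are nonzero, so raising the order on an element adds no new load terms when $f$ is constant.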

There are excellent theoretical reasons for using hierarchical finite element methods. Suppose that we want to solve a two-point boundary value problem with smooth data.

For the h-version of the Galerkin method, we use a fixed polynomial degree $k$ and vary the number of elements $N = L/\Delta x$. The norm of the error for a second-order differential equation is

$$ \|u - U\|_1 = O(\Delta x^k) \approx C N^{-k} . $$

To see this graphically, we could plot $\ln \|u - U\|_1$ versus $\ln d$, where $d = kN$ is the total number of degrees of freedom in the finite element approximation. We would find that the curve has slope $-k$:

$$ \ln \|u - U\|_1 \approx \ln(C N^{-k}) = \ln C - k \ln N = (\ln C + k \ln k) - k \ln d . $$

In the p-version of the Galerkin method, we use a fixed mesh with $N = L/\Delta x$ elements, and vary the degree $k$. The natural norm of the error for problems with smooth data is again

$$ \|u - U\|_1 = O(\Delta x^k) \approx C N^{-k} , $$

and the number of degrees of freedom is again $d = kN$, but the logarithm of the error as a function of the number of degrees of freedom is

$$ \ln \|u - U\|_1 \approx \ln(C N^{-k}) = \ln C - k \ln N = \ln C - \frac{\ln N}{N}\, d . $$

In other words, the error approaches zero exponentially fast in terms of the number of degrees of freedom.

For problems with non-smooth data, the h-version of the Galerkin method would have error satisfying

$$ \|u - U\|_1 = O(\Delta x^\beta) , $$

where $\beta \le k$ is determined by the smoothness of the data. However, by using adaptively refined meshes, it is possible to obtain

$$ \|u - U\|_1 = O(d^{-k}) , $$

which implies that

$$ \ln \|u - U\|_1 \approx c - k \ln d . $$

On the other hand, if the data has a discontinuity at a mesh point in the p-version of Galerkin's method, the error behaves as

$$ \|u - U\|_1 = O(d^{-2\beta}) , $$

which implies that

$$ \ln \|u - U\|_1 \approx c - 2\beta \ln d . $$

This is twice the rate of the h-version. If the discontinuity is not located at a mesh point, then the p-version and the h-version have the same dependence on their degrees of freedom.

If both the mesh and order of elements are varied, which is called the hp-version of Galerkin's method, then the error behaves as

$$ \ln \|u - U\|_1 \approx c - \alpha d^{\gamma} , $$

where $\alpha$ and $\gamma$ are positive constants. Typically, $\gamma = \tfrac{1}{2}$ is found in practice, indicating an exponential rate of convergence.

A program to implement hierarchical finite elements for a two-point boundary-value problem can be found in Program 5.3-11: HierarchicalMain.C. This C++ program uses Fortran routines in Program 5.3-12: hierarchical.f to implement the hierarchical finite element method. Figure 5.3-12 shows computational results for the hierarchical finite element method computed by this program. This figure displays the $L^\infty$ error at the Gauss-Lobatto quadrature points versus the number of basis functions, as the number of elements is held fixed and the order is increased. Note that the limiting accuracy of the machine is reached with far fewer hierarchical basis functions than with any of the refinements in space for fixed order of polynomials.

Exercises 5.3

1. Describe the Gauss-Lobatto quadrature rules that are exact for $\mathcal{P}_{2n+1}$ in the cases $n = 0, 1$ and $2$. In other words, what are $\xi_j$ and $\omega_j$ in each case?

5.4 Approximation Theory

5.4.1 Piecewise Linear Approximation

It is fairly easy to estimate the error in piecewise linear interpolation.

http://www.math.duke.edu/~johnt/math226/bvp2/HierarchicalMain.C

http://www.math.duke.edu/~johnt/math226/bvp2/hierarchical.f


Figure 5.3-12: Errors in Hierarchical Finite Elements at Quadrature Points: log base 10 of $L^\infty$ errors versus log base 10 of number of basis functions



1. $u'' \in L^2(a, b)$,

2. $a = x_0 < x_1 < \ldots < x_n = b$ is a mesh with maximum cell width

$$ h \equiv \max_{1 \le i \le n} |x_i - x_{i-1}| , $$

and

3. $W$ is the continuous piecewise linear interpolant to $u$ on this mesh.

Then

$$ \|u - W\|_0 \le |u|_2 \frac{h^2}{\sqrt{90}} , $$

and

$$ |u - W|_1 \le |u|_2 \frac{h}{\sqrt{6}} . \qquad (5.4\text{-}1) $$

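Proof: Note that for any $1 \le i \le n$ and $x \in [x_{i-1}, x_i]$,

$$ \begin{aligned}
u'(x) - W'(x) &= u'(x) - \frac{u(x_i) - u(x_{i-1})}{x_i - x_{i-1}} = \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} [u'(x) - u'(t)] \, dt \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} \int_t^x u''(s) \, ds \, dt \\
&= \frac{1}{x_i - x_{i-1}} \left[ \int_{x_{i-1}}^{x} \int_t^x u''(s) \, ds \, dt + \int_x^{x_i} \int_t^x u''(s) \, ds \, dt \right] \\
&= \frac{1}{x_i - x_{i-1}} \left[ \int_{x_{i-1}}^{x} \int_{x_{i-1}}^{s} u''(s) \, dt \, ds - \int_x^{x_i} \int_s^{x_i} u''(s) \, dt \, ds \right] \\
&= \int_{x_{i-1}}^{x} u''(s) \frac{s - x_{i-1}}{x_i - x_{i-1}} \, ds - \int_x^{x_i} u''(s) \frac{x_i - s}{x_i - x_{i-1}} \, ds \\
&= \int_{x_{i-1}}^{x_i} u''(s) \chi_1(s, x) \, ds
\end{aligned} $$

where

$$ \chi_1(s, x) = \begin{cases} (s - x_{i-1})/(x_i - x_{i-1}), & s < x \\ -(x_i - s)/(x_i - x_{i-1}), & s > x \end{cases} = \begin{cases} V_i(s), & s < x \\ -V_{i-1}(s), & s > x \end{cases} $$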


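Here $V_i$ is the continuous piecewise linear nodal basis function, defined in equation (5.3-1). Similarly,

$$ \begin{aligned}
u(x) - W(x) &= u(x) - u(x_{i-1}) - \frac{u(x_i) - u(x_{i-1})}{x_i - x_{i-1}} (x - x_{i-1}) \\
&= \int_{x_{i-1}}^{x} \left[ u'(r) - \frac{u(x_i) - u(x_{i-1})}{x_i - x_{i-1}} \right] dr \\
&= \int_{x_{i-1}}^{x} \frac{1}{x_i - x_{i-1}} \left[ \int_{x_{i-1}}^{x_i} [u'(r) - u'(t)] \, dt \right] dr \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} \int_{x_{i-1}}^{x_i} \int_t^r u''(s) \, ds \, dt \, dr \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} \left[ \int_{x_{i-1}}^{r} \int_t^r u''(s) \, ds \, dt + \int_r^{x_i} \int_t^r u''(s) \, ds \, dt \right] dr \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} \left[ \int_{x_{i-1}}^{r} u''(s) \int_{x_{i-1}}^{s} dt \, ds - \int_r^{x_i} u''(s) \int_s^{x_i} dt \, ds \right] dr \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} \int_{x_{i-1}}^{r} u''(s)(s - x_{i-1}) \, ds \, dr - \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} \int_r^{x_i} u''(s)(x_i - s) \, ds \, dr \\
&= \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x} u''(s)(s - x_{i-1}) \int_s^x dr \, ds \\
&\quad - \frac{1}{x_i - x_{i-1}} \left[ \int_{x_{i-1}}^{x} u''(s)(x_i - s) \int_{x_{i-1}}^{s} dr \, ds + \int_x^{x_i} u''(s)(x_i - s) \int_{x_{i-1}}^{x} dr \, ds \right] \\
&= \int_{x_{i-1}}^{x} u''(s) \frac{(s - x_{i-1})(x - s)}{x_i - x_{i-1}} \, ds - \int_{x_{i-1}}^{x} u''(s) \frac{(x_i - s)(s - x_{i-1})}{x_i - x_{i-1}} \, ds \\
&\quad - \int_x^{x_i} u''(s) \frac{(x_i - s)(x - x_{i-1})}{x_i - x_{i-1}} \, ds \\
&= -\int_{x_{i-1}}^{x} u''(s) \frac{(s - x_{i-1})(x_i - x)}{x_i - x_{i-1}} \, ds - \int_x^{x_i} u''(s) \frac{(x_i - s)(x - x_{i-1})}{x_i - x_{i-1}} \, ds \\
&= \int_{x_{i-1}}^{x_i} u''(s) \chi_0(s, x) \, ds
\end{aligned} $$

where

$$ \chi_0(s, x) = \begin{cases} -(x_i - x)(s - x_{i-1})/(x_i - x_{i-1}), & s < x \\ -(x - x_{i-1})(x_i - s)/(x_i - x_{i-1}), & s > x \end{cases} = -(x_i - x_{i-1}) \begin{cases} V_i(s) V_{i-1}(x), & s < x \\ V_{i-1}(s) V_i(x), & s > x \end{cases} $$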

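By summing over the mesh cells and applying Schwarz's inequality, we obtain

$$ \begin{aligned}
\|u' - W'\|_0^2 &= \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} u''(s) \chi_1(s, x) \, ds \right]^2 dx \\
&\le \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} u''(s)^2 \, ds \int_{x_{i-1}}^{x_i} \chi_1(s, x)^2 \, ds \right] dx \\
&\le |u|_2^2 \max_{1 \le i \le n} \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_1(s, x)^2 \, ds \, dx ,
\end{aligned} $$

and

$$ \|u - W\|_0^2 = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} u''(s) \chi_0(s, x) \, ds \right]^2 dx \le |u|_2^2 \max_{1 \le i \le n} \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_0(s, x)^2 \, ds \, dx . $$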

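All that remains is to compute

$$ \begin{aligned}
\int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_1(s, x)^2 \, ds \, dx &= \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x} \left[ \frac{s - x_{i-1}}{x_i - x_{i-1}} \right]^2 ds \, dx + \int_{x_{i-1}}^{x_i} \int_x^{x_i} \left[ -\frac{x_i - s}{x_i - x_{i-1}} \right]^2 ds \, dx \\
&= \int_{x_{i-1}}^{x_i} \int_0^{x - x_{i-1}} \left[ \frac{t}{x_i - x_{i-1}} \right]^2 dt \, dx + \int_{x_{i-1}}^{x_i} \int_0^{x_i - x} \left[ \frac{t}{x_i - x_{i-1}} \right]^2 dt \, dx \\
&= \int_{x_{i-1}}^{x_i} \frac{1}{3} \frac{(x - x_{i-1})^3}{(x_i - x_{i-1})^2} \, dx + \int_{x_{i-1}}^{x_i} \frac{1}{3} \frac{(x_i - x)^3}{(x_i - x_{i-1})^2} \, dx \\
&= \int_0^{x_i - x_{i-1}} \frac{1}{3} \frac{y^3}{(x_i - x_{i-1})^2} \, dy + \int_0^{x_i - x_{i-1}} \frac{1}{3} \frac{y^3}{(x_i - x_{i-1})^2} \, dy \\
&= \frac{1}{12} \frac{(x_i - x_{i-1})^4}{(x_i - x_{i-1})^2} + \frac{1}{12} \frac{(x_i - x_{i-1})^4}{(x_i - x_{i-1})^2} = (x_i - x_{i-1})^2 / 6
\end{aligned} $$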


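and

$$ \begin{aligned}
\int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_0(s, x)^2 \, ds \, dx &= \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x} \left[ -\frac{(x_i - x)(s - x_{i-1})}{x_i - x_{i-1}} \right]^2 ds \, dx \\
&\quad + \int_{x_{i-1}}^{x_i} \int_x^{x_i} \left[ -\frac{(x - x_{i-1})(x_i - s)}{x_i - x_{i-1}} \right]^2 ds \, dx \\
&= \int_{x_{i-1}}^{x_i} \frac{(x_i - x)^2}{(x_i - x_{i-1})^2} \int_0^{x - x_{i-1}} t^2 \, dt \, dx + \int_{x_{i-1}}^{x_i} \frac{(x - x_{i-1})^2}{(x_i - x_{i-1})^2} \int_0^{x_i - x} t^2 \, dt \, dx \\
&= \int_{x_{i-1}}^{x_i} \frac{(x_i - x)^2}{(x_i - x_{i-1})^2} \frac{(x - x_{i-1})^3}{3} \, dx + \int_{x_{i-1}}^{x_i} \frac{(x - x_{i-1})^2}{(x_i - x_{i-1})^2} \frac{(x_i - x)^3}{3} \, dx \\
&= \frac{1}{3} \int_0^{x_i - x_{i-1}} \frac{(x_i - x_{i-1} - y)^2}{(x_i - x_{i-1})^2} y^3 \, dy + \frac{1}{3} \int_0^{x_i - x_{i-1}} \frac{(x_i - x_{i-1} - y)^2}{(x_i - x_{i-1})^2} y^3 \, dy \\
&= \frac{2}{3} \int_0^{x_i - x_{i-1}} y^3 \left[ 1 - \frac{2y}{x_i - x_{i-1}} + \frac{y^2}{(x_i - x_{i-1})^2} \right] dy \\
&= \frac{2}{3} \left[ \frac{(x_i - x_{i-1})^4}{4} - \frac{2 (x_i - x_{i-1})^5}{5 (x_i - x_{i-1})} + \frac{(x_i - x_{i-1})^6}{6 (x_i - x_{i-1})^2} \right] = (x_i - x_{i-1})^4 / 90 .
\end{aligned} $$

□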
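The two bounds can be checked numerically for a smooth function. The sketch below assumes $u(x) = \sin(\pi x)$ on $(0, 1)$ with a uniform mesh, for which $|u|_2 = \pi^2/\sqrt{2}$, and measures the interpolation errors with a three-point Gauss rule on each cell; these choices and the name `interpolation_errors` are illustrative only.

```python
import math

# three-point Gauss-Legendre rule on (-1, 1)
GAUSS = [(-math.sqrt(0.6), 5.0 / 9.0), (0.0, 8.0 / 9.0), (math.sqrt(0.6), 5.0 / 9.0)]


def interpolation_errors(u, du, n, a=0.0, b=1.0):
    """Return (||u - W||_0, ||u' - W'||_0) for the piecewise linear interpolant W on n uniform cells."""
    h = (b - a) / n
    err0 = err1 = 0.0
    for i in range(n):
        x0 = a + i * h
        x1 = x0 + h
        slope = (u(x1) - u(x0)) / h                  # W' on this cell
        for t, w in GAUSS:
            x = x0 + 0.5 * h * (t + 1.0)
            err0 += 0.5 * h * w * (u(x) - u(x0) - slope * (x - x0)) ** 2
            err1 += 0.5 * h * w * (du(x) - slope) ** 2
    return math.sqrt(err0), math.sqrt(err1)


def bounds(h, seminorm2):
    """Right-hand sides |u|_2 h^2 / sqrt(90) and |u|_2 h / sqrt(6) from the lemma."""
    return seminorm2 * h * h / math.sqrt(90.0), seminorm2 * h / math.sqrt(6.0)
```

On each refinement the measured errors should stay below the bounds and shrink by factors of about four and two, respectively.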



1. $u' \in L^2(a, b)$,

2. $a = x_0 < x_1 < \ldots < x_n = b$ is a mesh with maximum cell width

$$ h \equiv \max_{1 \le i \le n} |x_i - x_{i-1}| , $$

and

3. $W$ is the continuous piecewise linear interpolant to $u$ on this mesh.

Then

$$ \|u - W\|_0 \le \frac{h}{\sqrt{2}} |u|_1 , $$

and

$$ \|u' - W'\|_0 \le 2 |u|_1 . $$

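Proof: We have

$$ \begin{aligned}
u(x) - W(x) &= u(x) - u(x_{i-1}) - \frac{x - x_{i-1}}{x_i - x_{i-1}} [u(x_i) - u(x_{i-1})] \\
&= \int_{x_{i-1}}^{x} u'(t) \, dt - \frac{x - x_{i-1}}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} u'(t) \, dt \\
&= \int_{x_{i-1}}^{x_i} \chi_0(x, t) u'(t) \, dt
\end{aligned} $$

where

$$ \chi_0(x, t) \equiv \begin{cases} (x_i - x)/(x_i - x_{i-1}), & x_{i-1} < t < x \\ -(x - x_{i-1})/(x_i - x_{i-1}), & x < t < x_i \end{cases} = \begin{cases} V_{i-1}(x), & x_{i-1} < t < x \\ -V_i(x), & x < t < x_i \end{cases} $$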

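It follows that

$$ \begin{aligned}
\int_{x_{i-1}}^{x_i} [u(x) - W(x)]^2 \, dx &= \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} \chi_0(x, t) u'(t) \, dt \right]^2 dx \\
&\le \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} \chi_0(x, t)^2 \, dt \right] \left[ \int_{x_{i-1}}^{x_i} [u'(t)]^2 \, dt \right] dx \\
&= \left[ \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_0(x, t)^2 \, dt \, dx \right] \int_{x_{i-1}}^{x_i} [u'(t)]^2 \, dt
\end{aligned} $$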

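Thus we evaluate

$$ \begin{aligned}
\int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} \chi_0(x, t)^2 \, dt \, dx &= \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x} \left[ \frac{x_i - x}{x_i - x_{i-1}} \right]^2 dt \, dx + \int_{x_{i-1}}^{x_i} \int_x^{x_i} \left[ \frac{x - x_{i-1}}{x_i - x_{i-1}} \right]^2 dt \, dx \\
&= \int_{x_{i-1}}^{x_i} \frac{(x_i - x)^2 (x - x_{i-1})}{(x_i - x_{i-1})^2} \, dx + \int_{x_{i-1}}^{x_i} \frac{(x_i - x)(x - x_{i-1})^2}{(x_i - x_{i-1})^2} \, dx \\
&= \int_{x_{i-1}}^{x_i} \frac{(x_i - x)(x - x_{i-1})}{x_i - x_{i-1}} \, dx \\
&= (x_i - x_{i-1})^2 \int_0^1 (1 - y) y \, dy = \frac{1}{6} (x_i - x_{i-1})^2 \le \frac{1}{2} (x_i - x_{i-1})^2
\end{aligned} $$

This proves the first of the two claimed inequalities.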

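To prove the second claim, note that

$$ u'(x) - W'(x) = u'(x) - \frac{u(x_i) - u(x_{i-1})}{x_i - x_{i-1}} = \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} [u'(x) - u'(t)] \, dt $$

Squaring and integrating leads to

$$ \begin{aligned}
\int_{x_{i-1}}^{x_i} [u'(x) - W'(x)]^2 \, dx &= \frac{1}{(x_i - x_{i-1})^2} \int_{x_{i-1}}^{x_i} \left[ \int_{x_{i-1}}^{x_i} [u'(x) - u'(t)] \, dt \right]^2 dx \\
&\le \frac{1}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} [u'(x) - u'(t)]^2 \, dt \, dx \\
&\le \frac{2}{x_i - x_{i-1}} \int_{x_{i-1}}^{x_i} \int_{x_{i-1}}^{x_i} [u'(x)^2 + u'(t)^2] \, dt \, dx \\
&= 2 \left[ \int_{x_{i-1}}^{x_i} u'(x)^2 \, dx + \int_{x_{i-1}}^{x_i} u'(t)^2 \, dt \right] \\
&= 4 \int_{x_{i-1}}^{x_i} u'(x)^2 \, dx
\end{aligned} $$

□

Now we can combine our results in this section and the previous sections to obtain the following convergence result for piecewise linear finite element approximations.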



1. there are positive numbers $p_{\min} \le p_{\max}$ and nonnegative numbers $r_{\min} \le r_{\max}$ so that for all $0 < x < L$, $p_{\min} \le p(x) \le p_{\max}$ and $r_{\min} \le r(x) \le r_{\max}$;

2. $p$ and $r$ are 1-smooth;

3. $f \in H^0(0, L)$;

4. $u$ solves the two-point boundary value problem

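$$ \forall v \in H_0^1(0, L) , \quad B(v, u) \equiv \int_0^L \left[ \frac{dv}{dx}\, p\, \frac{du}{dx} + v\, r\, u \right] dx = \int_0^L v f \, dx \equiv (v, f) ; $$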

5. the constants $\underline{c}$ and $\overline{c}$ satisfy $0 < \underline{c} \le 2 p_{\min}/(2 + L^2)$ and $\overline{c} \ge \max\{p_{\max}, r_{\max}\}$;

6. $M \subset H_0^1(0, L)$ is the set of all piecewise linear functions that are zero at $x = 0$ and $x = L$, on a mesh of width $h$;

7. $U \in M$ satisfies the Rayleigh-Ritz-Galerkin equations

$$ \forall V \in M , \quad B(V, U) = \lambda(V) . $$

Then there is a constant $c$ so that the error satisfies

$$ \|u - U\|_1 \le c h \|f\|_0 $$

and

$$ \|u - U\|_0 \le c h^2 \|f\|_0 $$

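Proof: Lemma ?? shows that the weak problem is well-posed, and lemma ?? shows that there exists a constant $C$ depending on $p$ and $r$ so that

$$ \|u\|_2 \le C \|f\|_0 . $$

Lemma ?? shows that the error in the Galerkin approximation satisfies

$$ \|u - U\|_1 \le \frac{\overline{c}}{\underline{c}} \inf_{W \in M} \|u - W\|_1 , $$

and our analysis of piecewise linear approximation shows that if $W$ is the piecewise linear interpolant to $u$ then

$$ \|u - W\|_1 \le \sqrt{\frac{1}{6} + \frac{h^2}{90}}\; h\, |u|_2 . $$

It follows that

$$ \|u - U\|_1 \le C \frac{\overline{c}}{\underline{c}} \sqrt{\frac{1}{6} + \frac{h^2}{90}}\; h \|f\|_0 , $$

proving the first claim. Lemma ?? shows that

$$ \|u - U\|_0 \le \overline{c}\, C \sqrt{\frac{1}{6} + \frac{h^2}{90}}\; h \|u - U\|_1 $$

This can be combined with the first result to show that

$$ \|u - U\|_0 \le C^2 \frac{\overline{c}^2}{\underline{c}} \left( \frac{1}{6} + \frac{h^2}{90} \right) h^2 \|f\|_0 . $$

□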

If the coefficients $p$ and $r$ are not 1-smooth but merely bounded, or if $f$ is not square-integrable but is in $H_0^{-1}$ (for example, $f$ is a point-load), then inequality ?? applies, and we obtain

$$ \|u - U\|_1 \le \frac{\overline{c}}{\underline{c}} \min_{W \in M} \|u - W\|_1 \le \frac{\overline{c}}{\underline{c}} \|u\|_1 \le \frac{1}{\underline{c}^2} \|f\|_{-1} . $$

In this case, the derivative of the finite element solution does not necessarily converge to the derivative of the true solution.

However, we do obtain convergence of the finite element solution. Recall that in order to obtain $H^0$ error estimates for the Rayleigh-Ritz-Galerkin procedure, we made approximation assumption ??. In the case of continuous piecewise linear elements, this approximation assumption has been established in the form (5.4-1). It follows that if the coefficients $p$ and $r$ are 1-smooth, then

$$ \|u - U\|_0 \le \overline{c} \sqrt{\frac{1}{6}}\; h \|u - U\|_1 . $$

If $f \in H_0^{-1}$, then the appropriate error estimate is

$$ \|u - U\|_0 \le \frac{\overline{c}}{\underline{c}^2} \sqrt{\frac{1}{6}}\; h \|f\|_{-1} . $$

In other words, if $p \in C^1$ is positive and $r$ is bounded and nonnegative, then the error in the finite element solution is one order of $h$ better than the error in the finite element derivative. If in addition $f$ is square-integrable, then the error in the finite element solution is $O(h^2)$, while if $f$ involves $\delta$-functions then the error is $O(h)$. Even in the latter case, if we can arrange that the point load be located at a mesh node, we can still regain second-order accuracy.


5.4.2 Continuous Piecewise Quadratic Approximation

In order to estimate the error in continuous piecewise quadratic splines we can prove that

$$ u''(x) - p''(x) = \int_a^b u^{(3)}(s) \chi_2(s, x) \, ds , $$

where

$$ \chi_2(s, x) = \begin{cases} 2(s - a)^2/(b - a)^2, & a < s < x < b \text{ and } s < (a + b)/2 \\ 2(s - a)^2/(b - a)^2 - 1, & a < x < s < (a + b)/2 \\ 1 - 2(b - s)^2/(b - a)^2, & (a + b)/2 < s < x < b \\ -2(b - s)^2/(b - a)^2, & a < x < s < b \text{ and } s > (a + b)/2 \end{cases} $$

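We can also prove that

$$ u'(x) - p'(x) = \int_a^b u^{(3)}(s) \chi_1(s, x) \, ds , $$

where $a < x < b$ and

$$ \chi_1(s, x) = \begin{cases} -\tfrac{1}{2}(s - a)^2/(b - a), & a < s < x < b \\ -\tfrac{1}{2}(b - s)^2/(b - a), & a < x < s < b \end{cases} + \begin{cases} (s - a)^2 (2x - a - b)/(b - a), & a < s < (a + b)/2 \\ (b - s)^2 (a + b - 2x)/(b - a), & (a + b)/2 < s < b \end{cases} $$

From this, we can show that

$$ u(x) - p(x) = \int_a^b u^{(3)}(s) \chi_0(s, x) \, ds , $$

where

$$ \chi_0(s, x) = \int_a^x \chi_1(s, t) \, dt . $$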

These kinds of results are special cases of the more general Peano-Kernel Theorem [29]. Instead of working out these kernel functions by hand, we will use a more general approach, discussed in the next section.

5.4.3 Higher Order Piecewise Polynomial Approximation

Here is a general result for spline approximation.


Definition 5.4-4: The function $F : H^k(0, 1) \to \mathbb{R}$ is a bounded sublinear functional on $H^k$ if and only if

1. $F$ is nonnegative: $\forall u \in H^k(0, 1)$, $F(u) \ge 0$;

2. $F$ is sublinear:

$$ \forall u, w \in H^k(0, 1) \ \forall \alpha, \beta \in \mathbb{R} , \quad F(u\alpha + w\beta) \le F(u) |\alpha| + F(w) |\beta| ; $$

3. $F$ is bounded: $\exists C > 0 \ \forall u \in H^k(0, 1)$, $F(u) \le C \|u\|_k$.

Lemma 5.4-5: (Bramble-Hilbert [15]) If $F$ is a bounded sublinear functional on $H^{k+1}(0, 1)$ such that

$$ \forall p \in \mathcal{P}_k , \quad F(p) = 0 , $$

then

$$ \exists C > 0 \ \forall u \in H^{k+1}(0, 1) , \quad F(u) \le C |u|_{k+1} . $$

The difficulty is that this lemma does not provide a value for $C$, so it cannot be used to estimate errors in actual approximations.

Example 5.4-6: Suppose that we can interpolate functions by polynomials of degree $k$ in such a way that the interpolation reproduces polynomials of degree $k$ exactly, and uses values of derivatives of $u$ of order at most $k$. For any $u \in H^k$ let $p_u$ be the polynomial interpolant to $u$, and define

$$ F(u) \equiv \|u - p_u\|_0 . $$

Then $F$ is obviously nonnegative and sublinear. The boundedness of $F$ follows from the Sobolev lemma ??. It follows from the Bramble-Hilbert lemma that there is a constant $C_0$ such that for all $u \in H^{k+1}$,

$$ \|u - p_u\|_0 = F(u) \le C_0 |u|_{k+1} . $$

Example 5.4-7: Alternatively, for $0 \le \ell \le k$ we could define

$$ F_\ell(u) \equiv |u - p_u|_\ell . $$

The same discussion as before shows that there is a constant $C_\ell$ such that for all $u \in H^{k+1}$,

$$ |u - p_u|_\ell = F_\ell(u) \le C_\ell |u|_{k+1} . $$


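Example 5.4-8: Suppose that we use piecewise polynomial interpolation on a mesh. Let

$$ \begin{aligned}
h_{j+\frac12} &= x_{j+1} - x_j , \\
\xi_{j+\frac12}(x) &= (x - x_j)/(x_{j+1} - x_j) , \\
\tilde{u}(\xi_{j+\frac12}) &= u(x_j + \xi_{j+\frac12} h_{j+\frac12}) , \\
\tilde{p}_u(\xi_{j+\frac12}) &= p_u(x_j + \xi_{j+\frac12} h_{j+\frac12}) .
\end{aligned} $$

Then for $0 \le m \le k$

$$ \begin{aligned}
\int_{x_j}^{x_{j+1}} \left[ \frac{d^m u}{dx^m} - \frac{d^m p_u}{dx^m} \right]^2 dx &= \int_0^1 h_{j+\frac12}^{-2m} \left[ \frac{d^m \tilde{u}}{d\xi^m} - \frac{d^m \tilde{p}_u}{d\xi^m} \right]^2 h_{j+\frac12} \, d\xi_{j+\frac12} \\
&\le h_{j+\frac12}^{1-2m} C_m \int_0^1 \left[ \frac{d^{k+1} \tilde{u}}{d\xi^{k+1}} \right]^2 d\xi_{j+\frac12} \\
&= h_{j+\frac12}^{2(k-m+1)} C_m \int_{x_j}^{x_{j+1}} \left[ \frac{d^{k+1} u}{dx^{k+1}} \right]^2 dx
\end{aligned} $$

Now let

$$ h = \max_j (x_{j+1} - x_j) . $$

Summing over all the mesh intervals and taking the square root, we obtain

$$ \forall 0 \le m < k \ \exists C_m > 0 \ \forall u \in H^{k+1} , \quad |u - p_u|_m \le C_m h^{k+1-m} |u|_{k+1} . $$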

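Example 5.4-9: Consider the two-point boundary-value problem

$$ -\frac{d}{dx}\left(p(x)\frac{du}{dx}\right) = \pi^2 \sin(\pi x) , \quad 0 < x < 1 , $$

$$ u(0) = 0 = u(1) , $$

where $0 < \alpha < 1$ and

$$ p(x) = \begin{cases} 1, & x < \alpha \\ 2, & x > \alpha \end{cases} $$

The solution of this problem is

$$ u(x) = \begin{cases} \sin(\pi x) - \dfrac{x \sin(\pi\alpha)}{1 + \alpha}, & x < \alpha \\[1ex] \dfrac{1}{2}\sin(\pi x) + \dfrac{(1 - x)\sin(\pi\alpha)}{2 + 2\alpha}, & x > \alpha \end{cases} $$

Since $p(x)$ is discontinuous, it is not even 1-smooth (see lemma ??), so we have no reason to expect that $u \in H^k(\Omega) \cap H_0^1(\Omega)$ for $k > 1$. This means that the $H^0$ error estimate of lemma ?? does not apply to this problem.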


In practice, with $\alpha$ not on an element boundary, we see roughly first-order convergence no matter what the order of the approximating polynomials in the finite element method. Figure 5.4-1 shows the errors with various finite element choices for $\alpha = \sqrt{0.24}$. The Fortran code to implement these computations can be found in Program 5.4-10: rough_finite_element.f.

5.4.4 Barycentric Coordinates

In order to define the shape functions for various polynomial degrees in a canonical interval, triangle or tetrahedron, we will use barycentric coordinates. In $d$ dimensions, we will represent these coordinates by $d + 1$ functions $\beta_1(\xi), \ldots, \beta_{d+1}(\xi)$ satisfying

$$ \forall \xi , \quad \sum_{i=1}^{d+1} \beta_i(\xi) = 1 . $$

For simplicity, we will choose the canonical barycentric coordinates to be

$$ b(\xi) = \begin{pmatrix} \beta_1(\xi) \\ \vdots \\ \beta_d(\xi) \\ \beta_{d+1}(\xi) \end{pmatrix} = \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_d \\ 1 - \sum_{i=1}^{d} \xi_i \end{pmatrix} . $$

Note that the interior of the element can then be written $T_* = \{\xi : b(\xi) > 0\}$. Also note that any function $g(\xi)$ can be written in terms of the barycentric coordinates:

$$ g(\xi) = g(b(\xi)) . $$

This representation is not unique. Finally, note the following.

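Lemma 5.4-11: For any multi-index $\alpha \in \mathbb{Z}_+^{d+1}$ with $1 \le d \le 3$,

$$ \int_{b(\xi) > 0} b^\alpha \, d\xi = \frac{\alpha!}{(d + |\alpha|)!} . $$

Proof: It is well-known that

$$ \int_0^1 x^m (1 - x)^n \, dx = \frac{\Gamma(m + 1)\Gamma(n + 1)}{\Gamma(m + n + 2)} $$

where the $\Gamma$ function is defined by

$$ \Gamma(n) \equiv \int_0^\infty x^{n-1} e^{-x} \, dx . $$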

http://www.math.duke.edu/~johnt/math226/bvp2/rough_finite_element.f



(a) C0 piecewise linears $\approx O(\Delta x^{1.23})$ (b) C0 piecewise quadratics $\approx O(\Delta x^{0.945})$

(c) C0 piecewise cubics $\approx O(\Delta x^{0.735})$ (d) C1 piecewise cubics $\approx O(\Delta x^{1.07})$

Figure 5.4-1: L2 Errors in Finite Elements for Problem with Discontinuous Coefficient: log base 10 of errors versus log base 10 of number of basis functions


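Further, for all integers $n \ge 0$ we have

$$ \Gamma(n + 1) = n! . $$

Finally, if $m$ and $n$ are integers, then

$$ \frac{m!\, n!}{(m + n + 1)!} = \int_0^1 x^m (1 - x)^n \, dx = \sum_{k=0}^{n} \binom{n}{k} \int_0^1 x^m (-x)^k \, dx = \sum_{k=0}^{n} \binom{n}{k} \frac{(-1)^k}{m + k + 1} . $$

As a result, the lemma is now easy to prove in one dimension:

$$ \int_{b(\xi) > 0} b(\xi)^\alpha \, d\xi = \int_0^1 \xi_1^{\alpha_1} (1 - \xi_1)^{\alpha_2} \, d\xi_1 = \frac{\alpha_1!\, \alpha_2!}{(\alpha_1 + \alpha_2 + 1)!} = \frac{\alpha!}{(1 + |\alpha|)!} . $$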

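In two dimensions,

$$ \begin{aligned}
\int_{b(\xi) > 0} b(\xi)^\alpha \, d\xi &= \int_0^1 \int_0^{1 - \xi_1} \xi_1^{\alpha_1} \xi_2^{\alpha_2} (1 - \xi_1 - \xi_2)^{\alpha_3} \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_3} \binom{\alpha_3}{k} (-1)^k \int_0^1 \xi_1^{\alpha_1} (1 - \xi_1)^{\alpha_3 - k} \int_0^{1 - \xi_1} \xi_2^{\alpha_2 + k} \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_3} \binom{\alpha_3}{k} \frac{(-1)^k}{\alpha_2 + k + 1} \int_0^1 \xi_1^{\alpha_1} (1 - \xi_1)^{\alpha_2 + \alpha_3 + 1} \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_3} \binom{\alpha_3}{k} \frac{(-1)^k}{\alpha_2 + k + 1} \, \frac{\alpha_1!\, (\alpha_2 + \alpha_3 + 1)!}{(\alpha_1 + \alpha_2 + \alpha_3 + 2)!} \\
&= \frac{\alpha_2!\, \alpha_3!}{(\alpha_2 + \alpha_3 + 1)!} \, \frac{\alpha_1!\, (\alpha_2 + \alpha_3 + 1)!}{(\alpha_1 + \alpha_2 + \alpha_3 + 2)!} = \frac{\alpha_1!\, \alpha_2!\, \alpha_3!}{(\alpha_1 + \alpha_2 + \alpha_3 + 2)!} = \frac{\alpha!}{(|\alpha| + 2)!}
\end{aligned} $$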


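Finally, in three dimensions,

$$ \begin{aligned}
\int_{b(\xi) > 0} b(\xi)^\alpha \, d\xi &= \int_0^1 \int_0^{1 - \xi_1} \int_0^{1 - \xi_1 - \xi_2} \xi_1^{\alpha_1} \xi_2^{\alpha_2} \xi_3^{\alpha_3} (1 - \xi_1 - \xi_2 - \xi_3)^{\alpha_4} \, d\xi_3 \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_4} \binom{\alpha_4}{k} (-1)^k \int_0^1 \xi_1^{\alpha_1} \int_0^{1 - \xi_1} \xi_2^{\alpha_2} (1 - \xi_1 - \xi_2)^{\alpha_4 - k} \int_0^{1 - \xi_1 - \xi_2} \xi_3^{\alpha_3 + k} \, d\xi_3 \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_4} \binom{\alpha_4}{k} \frac{(-1)^k}{\alpha_3 + k + 1} \int_0^1 \xi_1^{\alpha_1} \int_0^{1 - \xi_1} \xi_2^{\alpha_2} (1 - \xi_1 - \xi_2)^{\alpha_3 + \alpha_4 + 1} \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_4} \binom{\alpha_4}{k} \frac{(-1)^k}{\alpha_3 + k + 1} \sum_{\ell=0}^{\alpha_3 + \alpha_4 + 1} \binom{\alpha_3 + \alpha_4 + 1}{\ell} (-1)^\ell \int_0^1 \xi_1^{\alpha_1} (1 - \xi_1)^{\alpha_3 + \alpha_4 + 1 - \ell} \int_0^{1 - \xi_1} \xi_2^{\alpha_2 + \ell} \, d\xi_2 \, d\xi_1 \\
&= \sum_{k=0}^{\alpha_4} \binom{\alpha_4}{k} \frac{(-1)^k}{\alpha_3 + k + 1} \sum_{\ell=0}^{\alpha_3 + \alpha_4 + 1} \binom{\alpha_3 + \alpha_4 + 1}{\ell} \frac{(-1)^\ell}{\alpha_2 + \ell + 1} \int_0^1 \xi_1^{\alpha_1} (1 - \xi_1)^{\alpha_2 + \alpha_3 + \alpha_4 + 2} \, d\xi_1 \\
&= \left[ \sum_{k=0}^{\alpha_4} \binom{\alpha_4}{k} \frac{(-1)^k}{\alpha_3 + k + 1} \right] \left[ \sum_{\ell=0}^{\alpha_3 + \alpha_4 + 1} \binom{\alpha_3 + \alpha_4 + 1}{\ell} \frac{(-1)^\ell}{\alpha_2 + \ell + 1} \right] \frac{\alpha_1!\, (\alpha_2 + \alpha_3 + \alpha_4 + 2)!}{(\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + 3)!} \\
&= \left[ \frac{\alpha_3!\, \alpha_4!}{(\alpha_3 + \alpha_4 + 1)!} \right] \left[ \frac{\alpha_2!\, (\alpha_3 + \alpha_4 + 1)!}{(\alpha_2 + \alpha_3 + \alpha_4 + 2)!} \right] \frac{\alpha_1!\, (\alpha_2 + \alpha_3 + \alpha_4 + 2)!}{(\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + 3)!} \\
&= \frac{\alpha_1!\, \alpha_2!\, \alpha_3!\, \alpha_4!}{(\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + 3)!} = \frac{\alpha!}{(|\alpha| + 3)!}
\end{aligned} $$

□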
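The two-dimensional case of the lemma can be confirmed in exact arithmetic by expanding $(1 - \xi_1 - \xi_2)^{\alpha_3}$ with the multinomial theorem and integrating monomials by iterated integration, without using the Beta-function identity. This is an independent check written for illustration; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def monomial_integral(p, q):
    """Exact integral of xi1^p xi2^q over the canonical triangle."""
    # the inner integral over xi2 gives (1 - xi1)^(q+1) / (q+1); expand it binomially in xi1
    return sum(Fraction((-1) ** k * comb(q + 1, k), (q + 1) * (p + k + 1))
               for k in range(q + 2))


def barycentric_integral(a1, a2, a3):
    """Exact integral of b1^a1 b2^a2 b3^a3 over the triangle, with b3 = 1 - xi1 - xi2."""
    total = Fraction(0)
    for i in range(a3 + 1):                          # power of (-xi1) in the expansion of b3^a3
        for j in range(a3 - i + 1):                  # power of (-xi2)
            coeff = factorial(a3) // (factorial(i) * factorial(j) * factorial(a3 - i - j))
            total += (-1) ** (i + j) * coeff * monomial_integral(a1 + i, a2 + j)
    return total


def lemma_value(alpha, d):
    """The closed form alpha! / (d + |alpha|)! from the lemma."""
    numerator = 1
    for a in alpha:
        numerator *= factorial(a)
    return Fraction(numerator, factorial(d + sum(alpha)))
```

For example, $\alpha = (0, 0, 0)$ gives the area $1/2$ of the canonical triangle.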

5.4.5 Coordinate Transformations

We will map from the canonical element $T_*$ to a given curvilinear element $T_e$ by means of a coordinate transformation $x_e(\xi)$. These coordinate transformations will always map vertices to vertices and sides to sides.

Example 5.4-12: If

$$ X_e = [x_{n_1}, x_{n_2}, x_{n_3}] $$

is an array of vertices of a triangle $T_e$ with the vertices numbered counter-clockwise around the perimeter of $T_e$, then we can define a linear transformation

$$ x_e(\xi) = X_e b(\xi) . $$

It should be obvious that this coordinate transformation maps vertices of the canonical triangle $T_*$ to vertices of $T_e$. Along the side $\beta_3 = 0$ we have $\beta_2 = 1 - \beta_1$ and

$$ x_e(\xi) = x_{n_1} \beta_1 + x_{n_2} (1 - \beta_1) $$

so this side of the canonical triangle is mapped to the side of $T_e$ opposite the vertex $x_{n_3}$. Furthermore, this mapping is identical on sides shared by adjacent triangles. Similar arguments can be used to show that coordinate transformations using higher-order polynomials, of the sorts discussed below, lead to identical mappings on element boundaries.

So that this coordinate transformation is invertible, we require that the Jacobian

$$ J_e \equiv \frac{\partial x_e}{\partial \xi} $$

must be nonsingular throughout the canonical element $T_*$. In order that the order of the finite element approximation is not reduced, the inverse of this Jacobian cannot be allowed to be too large relative to the mesh width. We will discuss these error estimates later.

We also require that the determinant of the Jacobian must be positive. For piecewise linear coordinate transformations, this requires that the vertices of the triangle be ordered properly. For triangles, the vertices must be ordered counter-clockwise, corresponding to the ordering of the canonical vertices chosen by our form of the barycentric coordinates.

Higher-order polynomials could be used to map the canonical triangle into a curvilinear triangle. If the polynomials used for the coordinate transformation are the same as those used in the finite element approximation of the solution of the differential equation, then the coordinate transformation is called isoparametric. If the coordinate transformation uses lower order polynomials than the finite element approximation, then it is called subparametric.

The order of the polynomials used in the coordinate transformation does not affect the order of error in the finite element approximation, except for the treatment of boundary conditions. In particular, it is acceptable to use fewer than the full set of shape functions needed for a given order of approximation in the coordinate transformation. For example, we could handle a triangle with one curved side by using higher-order shape functions associated with the curved side, and linear functions for the other sides.

Given a vector of canonical shape functions $V^*(\xi)$, we will define the vector of shape functions within an element $T_e$ by

$$ V(x_e(\xi)) = V^*(\xi) $$

If the coordinate transformation $x_e$ is not linear, then the shape functions $V_n(x)$ are not necessarily polynomials in $x$, even though the canonical shape functions are polynomials in $\xi$.

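The coordinate transformation is useful in computing various integrals that occur in the finite element method. For quadratures used in the right-hand side of the Galerkin equations, we would compute

$$ \int_{T_e} V_m(x) f(x) \, dx = \int_{T_*} V_m^*(\xi) f(x_e(\xi)) |J_e| \, d\xi $$

For the stiffness matrix we will need values for integrals of the form

$$ \begin{aligned}
\int_{T_e} \nabla_x V_m(x) \cdot D(x) \nabla_x V_n(x) \, dx &= \int_{T_e} \frac{\partial V_m}{\partial x} D(x) \left( \frac{\partial V_n}{\partial x} \right)^\top dx \\
&= \int_{T_*} \frac{\partial V_m^*}{\partial \xi} J_e^{-1} D(x_e(\xi)) J_e^{-\top} \left( \frac{\partial V_n^*}{\partial \xi} \right)^\top |J_e| \, d\xi
\end{aligned} $$

Here $|J_e|$ denotes the determinant of the Jacobian. Note that we do not invert the relationship $x_e(\xi)$ for these computations; rather, we invert the Jacobian $J_e$. Even so, these integrals would be very difficult to compute without the use of numerical integration.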
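For a linear coordinate transformation and piecewise linear shape functions, the Jacobian and the shape function gradients are constant, so the stiffness integral reduces to the area of $T_*$ times a constant integrand. The following sketch works this out for a constant $2 \times 2$ coefficient matrix $D$; the function name and argument layout are hypothetical and it is not one of the book's programs.

```python
def linear_triangle_stiffness(vertices, D=((1.0, 0.0), (0.0, 1.0))):
    """Element stiffness matrix for linear shape functions on a triangle with constant D."""
    (x1, y1), (x2, y2), (x3, y3) = vertices
    # x_e(xi) = x1*xi1 + x2*xi2 + x3*(1 - xi1 - xi2), so J = [x1 - x3, x2 - x3]
    J = ((x1 - x3, x2 - x3), (y1 - y3, y2 - y3))
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    if det <= 0.0:
        raise ValueError("vertices must be ordered counter-clockwise")
    Jinv = ((J[1][1] / det, -J[0][1] / det), (-J[1][0] / det, J[0][0] / det))
    dV = ((1.0, 0.0), (0.0, 1.0), (-1.0, -1.0))      # rows dV*_m/dxi for b = (xi1, xi2, 1 - xi1 - xi2)
    # row vector (dV*_m/dxi) J^{-1} is the gradient of V_m with respect to x
    grads = [(g[0] * Jinv[0][0] + g[1] * Jinv[1][0],
              g[0] * Jinv[0][1] + g[1] * Jinv[1][1]) for g in dV]
    factor = 0.5 * det                               # area of T_* times |J_e|
    return [[factor * sum(grads[m][i] * D[i][j] * grads[n][j]
                          for i in range(2) for j in range(2))
             for n in range(3)] for m in range(3)]
```

Each row of the result sums to zero because the three shape functions sum to one, and a clockwise vertex ordering is rejected because it makes the Jacobian determinant negative.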

5.5 Triangular Finite Elements

It will suffice to consider the canonical triangle

$$ T_* \equiv \{ (\xi_1, \xi_2) : 0 \le \xi_1, \xi_2 \le 1 \text{ and } \xi_1 + \xi_2 \le 1 \} . $$

We will define canonical shape functions on this canonical triangle, and use transformations to define shape functions on arbitrary curvilinear triangles. We will denote the set of shape functions by $V_j^k$ where $j$ is the order of the polynomials in the shape functions, and $k$ is the number of derivatives used by the shape functions in determining interpolants.

5.5.1 Quadrature Rules for Triangles

Generally, we will use shape functions that span the set of all polynomials of degree at most $k$. In order to compute stiffness matrices and the right-hand sides in the Galerkin equations, we will need to approximate integrals by quadrature rules. Suppose that our bilinear form $A$ involves derivatives of order at most $m$. In order for the finite element method to be convergent, the quadrature rule needs to be exact for polynomials of degree at most $k - m$. However, in order for the quadrature rule to preserve the order of the method with exact integration, we need for the quadrature rule to be exact for polynomials of degree at most $2k - m$.


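Some useful quadrature rules can be found in [1, 84]. For example,

$$ \int_{T_*} g(b(\xi)) \, d\xi \approx g\left(\tfrac13, \tfrac13, \tfrac13\right) \int_{T_*} d\xi = \frac{1}{2}\, g\left(\tfrac13, \tfrac13, \tfrac13\right) \qquad (5.5\text{-}1) $$

is exact for polynomials of degree at most one. Also

$$ \begin{aligned}
\int_{T_*} g(b(\xi)) \, d\xi \approx{} & \left[ g(1, 0, 0) + g(0, 1, 0) + g(0, 0, 1) \right] \frac{1}{40} \\
& + \left[ g\left(\tfrac12, \tfrac12, 0\right) + g\left(0, \tfrac12, \tfrac12\right) + g\left(\tfrac12, 0, \tfrac12\right) \right] \frac{1}{15} + g\left(\tfrac13, \tfrac13, \tfrac13\right) \frac{9}{40}
\end{aligned} \qquad (5.5\text{-}2) $$

is exact for polynomials of degree at most three. Finally,

$$ \begin{aligned}
\int_{T_*} g(b(\xi)) \, d\xi \approx{} & \frac{155 - \sqrt{15}}{2400} \left[ g(1 - 2\gamma_1, \gamma_1, \gamma_1) + g(\gamma_1, 1 - 2\gamma_1, \gamma_1) + g(\gamma_1, \gamma_1, 1 - 2\gamma_1) \right] \\
& + \frac{155 + \sqrt{15}}{2400} \left[ g(1 - 2\gamma_2, \gamma_2, \gamma_2) + g(\gamma_2, 1 - 2\gamma_2, \gamma_2) + g(\gamma_2, \gamma_2, 1 - 2\gamma_2) \right] \\
& + \frac{9}{80}\, g\left(\tfrac13, \tfrac13, \tfrac13\right)
\end{aligned} \qquad (5.5\text{-}3) $$

where

$$ \gamma_1 \equiv \frac{6 - \sqrt{15}}{21} , \quad \gamma_2 \equiv \frac{6 + \sqrt{15}}{21} , $$

is exact for polynomials of degree at most five.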
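The stated degrees of exactness can be verified against the closed form $\int_{T_*} b^\alpha \, d\xi = \alpha!/(|\alpha| + 2)!$ from Lemma 5.4-11, since products of barycentric coordinates of total degree at most $k$ span the polynomials of degree at most $k$. The sketch below encodes the three rules as lists of (point, weight) pairs; the function names are hypothetical.

```python
import math
from math import factorial


def exact(a, b, c):
    """Exact integral of b1^a b2^b b3^c over the canonical triangle (Lemma 5.4-11 with d = 2)."""
    return factorial(a) * factorial(b) * factorial(c) / factorial(a + b + c + 2)


def cyclic(x, y, z):
    """The three cyclic permutations of a barycentric point."""
    return [(x, y, z), (z, x, y), (y, z, x)]


def rule_degree1():
    return [((1 / 3, 1 / 3, 1 / 3), 0.5)]


def rule_degree3():
    rule = [(p, 1 / 40) for p in cyclic(1.0, 0.0, 0.0)]
    rule += [(p, 1 / 15) for p in cyclic(0.5, 0.5, 0.0)]
    rule.append(((1 / 3, 1 / 3, 1 / 3), 9 / 40))
    return rule


def rule_degree5():
    s = math.sqrt(15.0)
    g1, g2 = (6.0 - s) / 21.0, (6.0 + s) / 21.0
    rule = [(p, (155.0 - s) / 2400.0) for p in cyclic(1.0 - 2.0 * g1, g1, g1)]
    rule += [(p, (155.0 + s) / 2400.0) for p in cyclic(1.0 - 2.0 * g2, g2, g2)]
    rule.append(((1 / 3, 1 / 3, 1 / 3), 9 / 80))
    return rule


def max_error(rule, degree):
    """Largest quadrature error over all barycentric monomials of total degree <= degree."""
    worst = 0.0
    for a in range(degree + 1):
        for b in range(degree + 1 - a):
            for c in range(degree + 1 - a - b):
                approx = sum(w * p[0] ** a * p[1] ** b * p[2] ** c for p, w in rule)
                worst = max(worst, abs(approx - exact(a, b, c)))
    return worst
```

In each rule the weights sum to $1/2$, the area of the canonical triangle.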

5.5.2 Linear Lagrange Element

We have already seen the simplest shape functions on triangles. A general linear function of two variables has $\sum_{i=0}^{1} (i + 1) = 3$ coefficients. The three shape functions are

$$ V^*(\xi) \equiv \begin{pmatrix} V^*_{1,0,0}(\xi) \\ V^*_{0,1,0}(\xi) \\ V^*_{0,0,1}(\xi) \end{pmatrix} = \begin{pmatrix} \xi_1 \\ \xi_2 \\ 1 - \xi_1 - \xi_2 \end{pmatrix} = b(\xi) . $$

In other words, the shape functions are the barycentric coordinates. This choice of piecewise linear elements is sometimes called Courant triangles or Turner triangles.

Given a continuous function $v$, the nodal variables are $\nu_n(v) = v(x_n)$, and the functions $\phi_n(x)$ in the local interpolant are the basis functions $V_n(x)$.

It is easy to show that these shape functions are linearly independent. Let $V^*$ be the array of shape functions. If we have a linear combination of these functions that sums to zero,

$$ \forall \xi \in T_* , \quad \sum_{n=1}^{3} V^*_n(\xi) w_n = 0 $$

then the value at any of the nodes is zero:

$$ \forall 1 \le m \le 3 , \quad 0 = \sum_{n=1}^{3} V^*_n(\xi_m) w_n = V^*_m(\xi_m) w_m = w_m $$

Here we have used the fact that exactly one of the shape functions is one at node $\xi_m$, and the others are zero there. Thus we have proved that all of the coefficients $w_m$ in the linear combination are zero, so the shape functions are linearly independent.

In order to approximate finite element integrals involving linear Lagrange elements, it suffices to use the quadrature rule (5.5-1).

5.5.3 Quadratic Lagrange Element

A general quadratic function of two variables has $\sum_{i=0}^{2} (i + 1) = 6$ coefficients. We define 6 shape functions on the canonical triangle via barycentric coordinates:

$$ V^*(\xi) \equiv \begin{pmatrix} V^*_{1,0,0}(\xi) \\ V^*_{0,1,0}(\xi) \\ V^*_{0,0,1}(\xi) \\ V^*_{\frac12,\frac12,0}(\xi) \\ V^*_{0,\frac12,\frac12}(\xi) \\ V^*_{\frac12,0,\frac12}(\xi) \end{pmatrix} = \begin{pmatrix} 2\beta_1(\beta_1 - \frac12) \\ 2\beta_2(\beta_2 - \frac12) \\ 2\beta_3(\beta_3 - \frac12) \\ 4\beta_1\beta_2 \\ 4\beta_2\beta_3 \\ 4\beta_3\beta_1 \end{pmatrix} . $$

The nodal variables for these shape functions are just evaluation at the canonical nodes

$$ \Xi^* = \{ (i_1, i_2)/2 : i_1, i_2 \in \mathbb{Z} \text{ and } 0 \le i_1, i_2 \le i_1 + i_2 \le 2 \} $$

These are just the vertices and the midpoints of the sides. (See figure 5.5-1.) The functions $\phi_i$ in the local interpolant (see definition 5.2-2) are just the shape functions. Each of these shape functions is one at exactly one of these locations, and zero at the others. As a result, these shape functions are obviously linearly independent.
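The nodal property is easy to confirm by evaluating the six shape functions at the six nodes, which should give the identity matrix. A minimal sketch, with nodes listed in the same order as the shape functions and given in barycentric coordinates (the names are hypothetical):

```python
def quadratic_shape_functions(b1, b2, b3):
    """The six quadratic Lagrange shape functions at barycentric point (b1, b2, b3)."""
    return [2.0 * b1 * (b1 - 0.5), 2.0 * b2 * (b2 - 0.5), 2.0 * b3 * (b3 - 0.5),
            4.0 * b1 * b2, 4.0 * b2 * b3, 4.0 * b3 * b1]


# vertices first, then midpoints of the sides, matching the ordering of the shape functions
NODES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
         (0.5, 0.5, 0.0), (0.0, 0.5, 0.5), (0.5, 0.0, 0.5)]


def nodal_matrix():
    """Matrix whose (m, n) entry is shape function n evaluated at node m."""
    return [quadratic_shape_functions(*node) for node in NODES]
```

The shape functions also sum to one at every point, since $2(\beta_1 + \beta_2 + \beta_3)^2 - (\beta_1 + \beta_2 + \beta_3) = 1$.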


Figure 5.5-1: Canonical Element Nodes for Quadratic Lagrange Element

5.5.4 Cubic Lagrange Element

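A general cubic function of two variables has $\sum_{i=0}^{3} (i + 1) = 10$ coefficients. We define 10 shape functions on the canonical triangle in terms of the barycentric coordinates by

$$ V^*(\xi) = \begin{pmatrix} V^*_{1,0,0}(\xi) \\ V^*_{0,1,0}(\xi) \\ V^*_{0,0,1}(\xi) \\ V^*_{\frac23,\frac13,0}(\xi) \\ V^*_{0,\frac23,\frac13}(\xi) \\ V^*_{\frac13,0,\frac23}(\xi) \\ V^*_{\frac13,\frac23,0}(\xi) \\ V^*_{0,\frac13,\frac23}(\xi) \\ V^*_{\frac23,0,\frac13}(\xi) \\ V^*_{\frac13,\frac13,\frac13}(\xi) \end{pmatrix} = \begin{pmatrix} \frac92 \beta_1 (\frac13 - \beta_1)(\frac23 - \beta_1) \\ \frac92 \beta_2 (\frac13 - \beta_2)(\frac23 - \beta_2) \\ \frac92 \beta_3 (\frac13 - \beta_3)(\frac23 - \beta_3) \\ \frac{27}{2} \beta_1 \beta_2 (\beta_1 - \frac13) \\ \frac{27}{2} \beta_2 \beta_3 (\beta_2 - \frac13) \\ \frac{27}{2} \beta_3 \beta_1 (\beta_3 - \frac13) \\ \frac{27}{2} \beta_1 \beta_2 (\beta_2 - \frac13) \\ \frac{27}{2} \beta_2 \beta_3 (\beta_3 - \frac13) \\ \frac{27}{2} \beta_3 \beta_1 (\beta_1 - \frac13) \\ 27 \beta_1 \beta_2 \beta_3 \end{pmatrix} . $$

The nodal variables for these shape functions are just evaluation at the nodes

$$ \Xi = \{ (i_1, i_2)/3 : i_1, i_2 \in \mathbb{Z} \text{ and } 0 \le i_1, i_2 \le i_1 + i_2 \le 3 \} $$

The functions $\phi_i$ in the local interpolant are just the shape functions. Each of these shape functions is one at exactly one of these nodes, and zero at the others. As a result, these shape functions are obviously linearly independent.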

[Figure 5.5-2: Canonical Element Nodes for Cubic Lagrange Element. The vertices of the canonical triangle are labeled (0,0), (1,0) and (0,1).]

For quadratures involving these shape functions, we will use (5.5-2).

5.5.5 General Lagrange Elements

In general, a polynomial of degree k in two variables has $\sum_{i=0}^{k}(i+1) = \tfrac12(k+1)(k+2)$ coefficients. The nodes are the points (in barycentric coordinates)

\[
N \in \{ (i_1, i_2, i_3)/k : (i_1, i_2, i_3) \in \mathbb{Z}^3 ,\ 0 \le i_1, i_2, i_3 \text{ and } i_1 + i_2 + i_3 = k \} .
\]

Of these,

• 3 nodes are at the corners; these correspond to nodes in the finite element mesh that are shared by possibly more than 2 elements.

• 3(k − 1) nodes are along the sides but not at the corners; these correspond to nodes in the finite element mesh that are shared by at most two elements.

• ½(k − 1)(k − 2) nodes are interior nodes; these do not belong to more than one element in a finite element mesh.
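The three counts in the list above can be confirmed by enumerating the nodes directly. This is a small illustrative sketch; the function name `lagrange_node_counts` is hypothetical and not part of the text.

```python
def lagrange_node_counts(k):
    """Count the nodes of the degree-k Lagrange triangle by type (k >= 1).

    A node is the integer triple (i1, i2, i3) with i1 + i2 + i3 = k,
    representing the barycentric point (i1, i2, i3) / k.
    """
    nodes = [(i1, i2, k - i1 - i2)
             for i1 in range(k + 1) for i2 in range(k + 1 - i1)]
    corners = sum(1 for n in nodes if n.count(0) == 2)   # two zero coordinates
    sides = sum(1 for n in nodes if n.count(0) == 1)     # exactly one zero
    interior = sum(1 for n in nodes if n.count(0) == 0)  # no zero coordinate
    return len(nodes), corners, sides, interior
```

For k = 3 this returns `(10, 3, 6, 1)`: ten nodes in all, three at the corners, six on the sides and one in the interior.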


5.5.6 Hermite Cubics

The higher-order Lagrange elements have many nodes at the sides and interior of the triangles, which means that these nodes are not shared by many elements. As a result, the dimension of the space goes up rapidly with the number of elements and the order of the approximation. Another problem with the Lagrange elements is that they are continuous, but not continuously differentiable.

In order to avoid these problems, we can extend the idea of Hermite interpolation to the development of finite element spaces. On our canonical element, our shape functions will be written in terms of the barycentric coordinates as

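\[
\mathbf{V}^*(\xi) \equiv
\begin{bmatrix}
V^*_{1,0,0}(\xi) \\ V^*_{0,1,0}(\xi) \\ V^*_{0,0,1}(\xi) \\ V^*_{\frac13,\frac13,\frac13}(\xi) \\
S^*_{1,0,0}(\xi) \\ T^*_{1,0,0}(\xi) \\ S^*_{0,1,0}(\xi) \\ T^*_{0,1,0}(\xi) \\ S^*_{0,0,1}(\xi) \\ T^*_{0,0,1}(\xi)
\end{bmatrix}
=
\begin{bmatrix}
\beta_1^2(3 - 2\beta_1) - 7\beta_1\beta_2\beta_3 \\
\beta_2^2(3 - 2\beta_2) - 7\beta_1\beta_2\beta_3 \\
1 - \beta_1^2(3 - 2\beta_1) - \beta_2^2(3 - 2\beta_2) - 13\beta_1\beta_2\beta_3 \\
27\beta_1\beta_2\beta_3 \\
\beta_1(\beta_1^2 - \beta_1 + 2\beta_2\beta_3) \\
[\beta_1^2(1 - \beta_1 + \beta_2) - 3\beta_1\beta_2\beta_3]/\sqrt2 \\
[-\beta_2^2(1 + \beta_1 - \beta_2) + 3\beta_1\beta_2\beta_3]/\sqrt2 \\
\beta_2(-\beta_2^2 + \beta_2 - 2\beta_1\beta_3) \\
\beta_2[\beta_1^2 - (1 - \beta_2)^2] + 3\beta_1\beta_2\beta_3 \\
\beta_1[(1 - \beta_1)^2 - \beta_2^2] - 3\beta_1\beta_2\beta_3
\end{bmatrix} .
\]

Here $\beta_1 = \xi_1$, $\beta_2 = \xi_2$ and $\beta_3 = 1 - \xi_1 - \xi_2$. Because $\sum_{i=1}^{3}\beta_i^2(3 - 2\beta_i) + 6\beta_1\beta_2\beta_3 = 1$, the third entry is the same as $\beta_3^2(3 - 2\beta_3) - 7\beta_1\beta_2\beta_3$.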

In this case, the set of nodes is (1, 0, 0), (0, 1, 0), (0, 0, 1), (1/3, 1/3, 1/3). The first four of these shape functions are one at exactly one of the nodes, are zero at the other nodes, and have zero gradient at the corners. The next six shape functions are zero at the nodes. Their gradients satisfy

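\[
\nabla_\xi S^*_{1,0,0}\left(\begin{bmatrix}1\\0\end{bmatrix}\right) = \begin{bmatrix}1\\0\end{bmatrix} \equiv t^*_{31} = \nabla_\xi T^*_{0,0,1}\left(\begin{bmatrix}0\\0\end{bmatrix}\right)
\]
\[
\nabla_\xi S^*_{0,1,0}\left(\begin{bmatrix}0\\1\end{bmatrix}\right) = \begin{bmatrix}-1\\1\end{bmatrix}\frac{1}{\sqrt2} \equiv t^*_{12} = \nabla_\xi T^*_{1,0,0}\left(\begin{bmatrix}1\\0\end{bmatrix}\right)
\]
\[
\nabla_\xi S^*_{0,0,1}\left(\begin{bmatrix}0\\0\end{bmatrix}\right) = \begin{bmatrix}0\\-1\end{bmatrix} \equiv t^*_{23} = \nabla_\xi T^*_{0,1,0}\left(\begin{bmatrix}0\\1\end{bmatrix}\right)
\]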
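The gradient conditions above can be checked numerically with central differences. The sketch below is an illustration under stated assumptions, not code from the text: it takes β1 = ξ1, β2 = ξ2, β3 = 1 − ξ1 − ξ2, it assumes that the T function at vertex (0, 1, 0) is β2(−β2² + β2 − 2β1β3), and the names `hermite_ST` and `gradients` are hypothetical.

```python
import math

R2 = math.sqrt(2.0)

def hermite_ST(x1, x2):
    """Values of S_{100}, T_{100}, S_{010}, T_{010}, S_{001}, T_{001} at (x1, x2)."""
    b1, b2, b3 = x1, x2, 1.0 - x1 - x2
    bub = b1 * b2 * b3
    return [
        b1 * (b1 * b1 - b1 + 2 * b2 * b3),
        (b1 * b1 * (1 - b1 + b2) - 3 * bub) / R2,
        (-b2 * b2 * (1 + b1 - b2) + 3 * bub) / R2,
        b2 * (-b2 * b2 + b2 - 2 * b1 * b3),
        b2 * (b1 * b1 - (1 - b2) ** 2) + 3 * bub,
        b1 * ((1 - b1) ** 2 - b2 * b2) - 3 * bub,
    ]

def gradients(x1, x2, h=1e-5):
    """Central-difference gradients (d/dxi1, d/dxi2) of the six functions."""
    fx = [(p - m) / (2 * h) for p, m in
          zip(hermite_ST(x1 + h, x2), hermite_ST(x1 - h, x2))]
    fy = [(p - m) / (2 * h) for p, m in
          zip(hermite_ST(x1, x2 + h), hermite_ST(x1, x2 - h))]
    return list(zip(fx, fy))

# Canonical tangents t31, t12, t23 and the canonical vertices 1, 2, 3.
T31, T12, T23 = (1.0, 0.0), (-1 / R2, 1 / R2), (0.0, -1.0)
ZERO = (0.0, 0.0)
EXPECTED = {
    (1.0, 0.0): [T31, T12, ZERO, ZERO, ZERO, ZERO],  # vertex 1
    (0.0, 1.0): [ZERO, ZERO, T12, T23, ZERO, ZERO],  # vertex 2
    (0.0, 0.0): [ZERO, ZERO, ZERO, ZERO, T23, T31],  # vertex 3
}
```

At each vertex exactly two of the six functions have a nonzero gradient: the S function picks up the tangent of the side arriving at the vertex and the T function the tangent of the side leaving it.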

The gradients of these six basis functions at the other vertices are all zero.

First, let us consider the use of these Hermite cubic shape functions for coordinate transformations of the form

\[
x_e(\xi) = \mathbf{X}_e \mathbf{V}^*(\xi)
\]

where

\[
\mathbf{X}_e = [x_{n_1}, x_{n_2}, x_{n_3}, x_c, y_{n_1}, z_{n_1}, y_{n_2}, z_{n_2}, y_{n_3}, z_{n_3}] .
\]


[Figure 5.5-3: Canonical Element Nodes for Hermite Cubic Element. The vertices of the canonical triangle are labeled (0,0), (1,0) and (0,1).]


Here the vertices of the curvilinear triangle are $x_{n_1}$, $x_{n_2}$ and $x_{n_3}$, and the centroid is $x_c$. The other vectors are to be determined. Note that $x_e$ maps the vertices of the canonical triangle $T_*$ to the vertices of $T_e$, and the centroid of $T_*$ to $x_c$. All that remains is to determine the last six columns of $\mathbf{X}_e$. Suppose that $t_{n_1,2}$ is the unit tangent vector to the side opposite $x_{n_3}$ at $x_{n_1}$ and oriented away from $x_{n_1}$, and $t_{3,n_1}$ is the unit tangent vector to the side opposite $x_{n_2}$ at $x_{n_1}$ and oriented towards $x_{n_1}$. If $\lambda_j$ is the arc length of the side of the curvilinear triangle opposite vertex $x_{n_j}$, then we want

t3,n1δλ2 ≈ xe([10

]+ t∗31δ)− xe(

[10

])

≈ Xe∂V∗

∂ξ([10

])t∗31δ = [yn1(t

∗31)

> + zn1(t∗12)

>]t∗31δ = [yn1 − zn1 ]δ

and

tn1,2δλ3 ≈ xe([10

]+ t∗12δ)− xe(

[10

])

≈ Xe∂V∗

∂ξ([10

])t∗12δ = [yn1(t

∗31)

> + zn1(t∗12)

>]t∗12δ = [−yn1 + zn12]δ

We can solve this system to get

yn1 = t3,n12λ2 + tn1,2λ3

zn1 = tn1,2λ3 + t3,n12λ2

In other words,

X>e =

x>n1

x>n2

x>n3

x>c2λ2t>3,n1

+ λ3t>n1,2

λ3t>n1,2+ λ2t3,n12

>

2λ3t>1,n2+ λ1t>n2,3

λ1t>n2,3+ 2λ3t>1,n2

2λ1t>2,n3+ λ2t>n3,1

λ2t>n3,1+ 2λ1t>2,n3

This determines the form of the coordinate transformation.

The preceding form of the Hermite cubics is most useful for coordinate transformation. Next, let us discuss the use of the Hermite cubic shape functions for approximating solutions of differential equations. To simplify the discussion, we will assume that all of our elements are triangles. We would like to determine shape functions V(x) in each triangle, corresponding to nodal values at the vertices and the centroid, and gradients with respect to x at the vertices. Let

Xe = [xn1 , xn2 , xn3 ]

be the array of vertices of the element, so that

xe(ξ) = Xeb(ξ)

is the coordinate transformation. Since this coordinate transformation is linear, the canonical shape functions V∗(ξ) must provide a basis. Thus we can write

V(x(ξ)) = AV∗(ξ)

where the matrix A is to be determined. Since all but the first shape function in x are zero at $x_{n_1}$,

\[
e_1 = \mathbf{V}(x_{n_1}) = A \mathbf{V}^*\left(\begin{bmatrix}1\\0\end{bmatrix}\right) = A e_1 ,
\qquad e_1 = [1, 0, \ldots, 0]^\top .
\]

This tells us that the first column of A is the first column of the identity matrix. Similar results hold for the next three columns, corresponding to evaluation at the other vertices and the centroid. Taking the derivative at the first vertex leads to

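\[
\begin{bmatrix} 0 \\ I \\ 0 \end{bmatrix}
= \frac{\partial \mathbf{V}}{\partial x}(x_{n_1})
= A \frac{\partial \mathbf{V}^*}{\partial \xi}\left(\begin{bmatrix}1\\0\end{bmatrix}\right) \left(\frac{\partial x_e}{\partial \xi}\right)^{-1}
= A \begin{bmatrix} 0 \\ \begin{matrix} 1 & 0 \\ -1/\sqrt2 & 1/\sqrt2 \end{matrix} \\ 0 \end{bmatrix} \left(\frac{\partial x_e}{\partial \xi}\right)^{-1} .
\]

Here the first zero block has four rows (the value functions), the 2 × 2 blocks occupy rows five and six (the functions S and T at the first vertex), and the last zero block has four rows.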

Similar results hold for the derivatives at the other vertices. It follows that

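\[
\mathbf{V}(x_e(\xi)) =
\begin{bmatrix}
V^*_{1,0,0}(\xi) \\ V^*_{0,1,0}(\xi) \\ V^*_{0,0,1}(\xi) \\ V^*_{\frac13,\frac13,\frac13}(\xi) \\[4pt]
\left[\dfrac{\partial x_e}{\partial \xi}\right] \begin{bmatrix} 1 & 0 \\ 1 & \sqrt2 \end{bmatrix} \begin{bmatrix} S^*_{1,0,0}(\xi) \\ T^*_{1,0,0}(\xi) \end{bmatrix} \\[8pt]
\left[\dfrac{\partial x_e}{\partial \xi}\right] \begin{bmatrix} -\sqrt2 & -1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} S^*_{0,1,0}(\xi) \\ T^*_{0,1,0}(\xi) \end{bmatrix} \\[8pt]
\left[\dfrac{\partial x_e}{\partial \xi}\right] \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} S^*_{0,0,1}(\xi) \\ T^*_{0,0,1}(\xi) \end{bmatrix}
\end{bmatrix}
\]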

A common variant of the Hermite cubics is the space Z3. In this finite element space, the shape functions for an individual curvilinear triangle are given by

• four functions that are one at exactly one of the nodes, zero at the other nodes, and have zero gradient at the corners,

• three functions that are zero at all of the nodes and have gradient equal to the first axis vector at exactly one of the corners, and

• three functions that are zero at all of the nodes and have gradient equal to the second axis vector at exactly one of the corners.

This is a subspace of C0 cubics on a mesh; the dimension is decreased by the C1 constraint at the corners.

Occasionally, authors will suggest removing the node at the centroid by requiring (for example) that the $x_1^2 x_2$ and $x_1 x_2^2$ terms have the same coefficients. Unfortunately, this damages the accuracy of the finite element approximation, and in some cases the resulting interpolant may not be uniquely determined.

5.5.7 Hierarchical Triangular Elements

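Recall from section 5.3.3 that the hierarchical basis functions in one dimension are

\[
V^*_0(\xi) = \tfrac12(1 - \xi) = 1 - \tfrac12 \int_{-1}^{\xi} P_0(t)\, dt
\]
\[
V^*_1(\xi) = \tfrac12(1 + \xi) = \tfrac12 \int_{-1}^{\xi} P_0(t)\, dt
\]
\[
V^*_j(\xi) = \sqrt{\frac{2j-1}{2}} \int_{-1}^{\xi} P_{j-1}(t)\, dt = \frac{P_j(\xi) - P_{j-2}(\xi)}{\sqrt{2(2j-1)}} , \quad j \ge 2 .
\]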

Here $P_j(\xi)$ is the Legendre polynomial of order j, defined by the recurrence

\[
P_0(x) = 1 , \quad P_1(x) = x , \quad \forall n \ge 1 \quad P_{n+1}(x) = \frac{2n+1}{n+1}\, x P_n(x) - \frac{n}{n+1} P_{n-1}(x) .
\]

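Note that $V^*_j(\xi) = 0$ for ξ = ±1 and j ≥ 2. As a result, we can factor

\[
\varphi_{j-2}(\xi) \equiv \frac{4}{1 - \xi^2} V^*_j(\xi) = 2\sqrt{\frac{2}{2j-1}}\; \frac{P_j(\xi) - P_{j-2}(\xi)}{1 - \xi^2} .
\]

The first two of these functions are

\[
\varphi_0(\xi) = -\sqrt6 , \qquad \varphi_1(\xi) = -\sqrt{10}\, \xi .
\]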
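The first two functions φ0 and φ1 can be reproduced from the Legendre recurrence. The sketch below is illustrative only; the function names `legendre`, `hier_V` and `phi` are assumptions, and φ is computed directly from its definition as 4 V∗ divided by 1 − ξ².

```python
import math

def legendre(n, x):
    """Legendre polynomial P_n(x) from the three-term recurrence."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for m in range(1, n):
        p_prev, p = p, ((2 * m + 1) * x * p - m * p_prev) / (m + 1)
    return p

def hier_V(j, x):
    """Hierarchical basis function V*_j(x) for j >= 2."""
    return (legendre(j, x) - legendre(j - 2, x)) / math.sqrt(2 * (2 * j - 1))

def phi(j, x):
    """phi_j(x) = 4 V*_{j+2}(x) / (1 - x^2), valid for -1 < x < 1."""
    return 4.0 * hier_V(j + 2, x) / (1.0 - x * x)
```

Evaluating `phi(0, x)` at any interior point gives −√6, and `phi(1, x)` gives −√10 x, in agreement with the formulas above.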

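Since the one-dimensional hierarchical basis functions satisfy the recurrence

\[
V^*_{j+1}(\xi) = \frac{\sqrt{4j^2 - 1}}{j+1}\, \xi V^*_j(\xi) - \frac{j-2}{j+1} \sqrt{\frac{2j+1}{2j-3}}\; V^*_{j-1}(\xi) , \quad j \ge 2 ,
\]

the functions $\varphi_j(\xi)$ satisfy the recurrence

\[
\varphi_j(\xi) = \frac{\sqrt{4(j+1)^2 - 1}}{j+2}\, \xi \varphi_{j-1}(\xi) - \frac{j-1}{j+2} \sqrt{\frac{2j+3}{2j-1}}\; \varphi_{j-2}(\xi) .
\]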

There are $\sum_{i=0}^{k}(i+1) = \tfrac12(k+1)(k+2)$ degrees of freedom in a polynomial of degree at most k in two variables. The hierarchical shape functions corresponding to polynomials of degree at most one are the barycentric coordinates

β1 , β2 , β3 .

Note that each of these functions is linear, and nonzero at exactly one vertex of the canonical triangle.

To extend to polynomial approximation of degree at most 2, we need to add 3 additional shape functions. These are taken to be

β1β2φ0(β2 − β1) , β2β3φ0(β3 − β2) , β3β1φ0(β1 − β3) .

Note that each of these shape functions is zero at all of the vertices, so they are linearly independent of the linear shape functions. Each of these quadratic functions is nonzero on exactly one side of the triangle, and no two of them share that side, so these are linearly independent functions.


In order to extend to polynomials of degree at most 3, we need to define 4 additional shape functions. These are taken to be

β1β2φ1(β2 − β1) , β2β3φ1(β3 − β2) , β3β1φ1(β1 − β3) ,

and

β1β2β3 .

Note that the last of these four is zero on all sides of the triangle, so it is linearly independent of the linear and quadratic shape functions. The first three of these shape functions are nonzero on at most one side of the triangle. Their linear independence of the linear shape functions is determined by their zero values at the vertices, and their linear independence of the quadratic shape functions is determined by the linear independence of the polynomials φj .

In general there are k + 1 hierarchical shape functions of order k. Three of these are

β1β2φk−2(β2 − β1) , β2β3φk−2(β3 − β2) , β3β1φk−2(β1 − β3) ,

so that the triangular shape functions agree with the hierarchical shape functions on quadrilaterals on the sides. The other k − 2 of these shape functions are not well-described in [87]. For k = 4 they take these two additional shape functions to be

β1β2β3P1(β2 − β1) , β1β2β3P1(2β3 − 1) ,

where Pn(x) is the Legendre polynomial of degree n.

Quadrature rules for the hierarchical elements can be found in [33].

Exercises 5.5

1. Suppose that the domain Ω ⊂ R2 can be subdivided into a rectangular array of m2 squares, each of which is subdivided into 2 triangles. What is the dimension of the finite element space using Lagrange elements of order k on this mesh?

5.6 Quadrilateral Finite Elements

We will choose our canonical quadrilateral to be the square

\[
Q_* = \{ \xi \in \mathbb{R}^2 : -1 \le \xi_1, \xi_2 \le 1 \} .
\]

We will define canonical shape functions on this canonical quadrilateral, and use coordinate transformations to define shape functions on arbitrary curvilinear quadrilaterals. We will denote the set of shape functions by $V^k_j$, where j is the order of polynomials used in either coordinate direction in the shape functions, and k is the number of derivatives used in determining interpolants using these shape functions.


5.6.1 Quadrature Rules for Quadrilaterals

The easiest way to define quadrature rules on our canonical quadrilateral is to use Gaussian quadrature in each coordinate direction. For example, the one-dimensional rule

\[
\int_{-1}^{1} g(\xi)\, d\xi \approx 2 g(0)
\]

gives rise to the two-dimensional product rule

\[
\int_{-1}^{1}\int_{-1}^{1} g(\xi)\, d\xi_1\, d\xi_2 \approx 4 g\left(\begin{bmatrix}0\\0\end{bmatrix}\right) \qquad (5.6\text{-}1)
\]

This quadrature rule is exact for functions g that are linear in each of ξ1 and ξ2. Unfortunately, this quadrature rule will not be useful for quadrilateral finite elements.

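The next one-dimensional rule

\[
\int_{-1}^{1} g(\xi)\, d\xi \approx g\left(-\sqrt{\tfrac13}\right) + g\left(\sqrt{\tfrac13}\right)
\]

gives rise to the two-dimensional product rule

\[
\int_{-1}^{1}\int_{-1}^{1} g(\xi)\, d\xi_1\, d\xi_2 \approx
g\left(\begin{bmatrix}-\sqrt{\tfrac13}\\ -\sqrt{\tfrac13}\end{bmatrix}\right)
+ g\left(\begin{bmatrix}\sqrt{\tfrac13}\\ -\sqrt{\tfrac13}\end{bmatrix}\right)
+ g\left(\begin{bmatrix}-\sqrt{\tfrac13}\\ \sqrt{\tfrac13}\end{bmatrix}\right)
+ g\left(\begin{bmatrix}\sqrt{\tfrac13}\\ \sqrt{\tfrac13}\end{bmatrix}\right) \qquad (5.6\text{-}2)
\]

This quadrature rule is exact for functions g that are cubic (and therefore also for those that are quadratic) in each of ξ1 and ξ2.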
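The product rules can be sketched in a few lines. This is an illustrative implementation, not code from the text; the name `gauss_product` and the table `GAUSS_RULES` are hypothetical, and the three-point rule with nodes ±√(3/5) and weights 5/9, 8/9, 5/9 is included for comparison.

```python
import math

# One-dimensional Gauss rules on [-1, 1]: number of points -> (nodes, weights).
GAUSS_RULES = {
    1: ([0.0], [2.0]),
    2: ([-math.sqrt(1.0 / 3.0), math.sqrt(1.0 / 3.0)], [1.0, 1.0]),
    3: ([-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)],
        [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]),
}

def gauss_product(g, n_points=2):
    """Approximate the integral of g(xi1, xi2) over the canonical square."""
    nodes, weights = GAUSS_RULES[n_points]
    return sum(w1 * w2 * g(x1, x2)
               for x1, w1 in zip(nodes, weights)
               for x2, w2 in zip(nodes, weights))

# A function that is cubic in xi1 and quadratic in xi2; its exact integral is 40/9.
approx = gauss_product(lambda x1, x2: x1 ** 3 * x2 ** 2 + x1 ** 2 * x2 ** 2 + 1.0)
```

The two-point product rule reproduces 40/9 up to rounding, while the one-point rule applied to the same integrand returns 4, because it cannot see the quadratic terms.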

5.6.2 Tensor Product Linears: V01

A general function that is linear in each coordinate has 2² = 4 coefficients. Accordingly, we will define the array of four shape functions

\[
\mathbf{V}^*(\xi) \equiv
\begin{bmatrix}
(1 - \xi_1)(1 - \xi_2) \\ (1 + \xi_1)(1 - \xi_2) \\ (1 - \xi_1)(1 + \xi_2) \\ (1 + \xi_1)(1 + \xi_2)
\end{bmatrix} \frac14
= \left( \begin{bmatrix} 1 - \xi_1 \\ 1 + \xi_1 \end{bmatrix} \frac12 \right) \otimes \left( \begin{bmatrix} 1 - \xi_2 \\ 1 + \xi_2 \end{bmatrix} \frac12 \right) .
\]

These functions are zero at all but one of the corners of the canonical quadrilateral Q∗, and are one at distinct corners of Q∗. As a result, these shape functions are linearly independent, and their nodal variables are evaluation at the corners.

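We can also write

\[
\mathbf{V}^*(\xi) =
\begin{bmatrix} \tfrac12 \\ \tfrac12 \\ \tfrac12 \\ \tfrac12 \end{bmatrix} \frac12
+ \begin{bmatrix} -\tfrac12 & -\tfrac12 \\ \tfrac12 & -\tfrac12 \\ -\tfrac12 & \tfrac12 \\ \tfrac12 & \tfrac12 \end{bmatrix}
\begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} \frac12
+ \begin{bmatrix} \tfrac12 \\ -\tfrac12 \\ -\tfrac12 \\ \tfrac12 \end{bmatrix} \frac{\xi_1 \xi_2}{2}
\equiv t\, \frac12 + S \xi\, \frac12 + h\, \frac{\xi_1 \xi_2}{2} .
\]

Here the unit vector t corresponds to a translation mode, the matrix S has unit vectors for columns and corresponds to shear modes, and the unit vector h corresponds to the hourglass mode.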

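Note that V∗(ξ) is not linear in the canonical coordinate ξ, so its derivative is not constant:

\[
\frac{\partial \mathbf{V}^*}{\partial \xi} = S\, \frac12 + h\, \frac12\, \tilde\xi^\top ,
\qquad \text{where} \qquad
\tilde\xi \equiv \begin{bmatrix} \xi_2 \\ \xi_1 \end{bmatrix} .
\]

As a result, coordinate transformations involving tensor product linear functions produce non-constant Jacobians. If the coordinate transformation is

\[
x_e(\xi) = \mathbf{X}_e \mathbf{V}^*(\xi) ,
\]

then the Jacobian is

\[
J_e(\xi) = \frac{\partial x_e}{\partial \xi} = \mathbf{X}_e \frac{\partial \mathbf{V}^*}{\partial \xi} = \mathbf{X}_e S\, \frac12 + \mathbf{X}_e h\, \frac12\, \tilde\xi^\top .
\]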

The corners of the original quadrilateral Qe need to be chosen carefully so that the coordinate transformation is reasonable. Consider the set of all points xe(ξ) where ξ2 = −1. In the x coordinates, these points have the form

\[
\mathbf{X}_e \mathbf{V}^*(\xi_1, -1) = \mathbf{X}_e \begin{bmatrix} 1 - \xi_1 \\ 1 + \xi_1 \\ 0 \\ 0 \end{bmatrix} \frac12
= \begin{bmatrix} x_{11} \\ x_{21} \end{bmatrix} \frac{1 - \xi_1}{2} + \begin{bmatrix} x_{12} \\ x_{22} \end{bmatrix} \frac{1 + \xi_1}{2} .
\]

This is a straight line. After we apply this approach to the other sides of the canonical square, we see that the boundary of the canonical square could map either to a quadrilateral or to a "bow-tie." If Qe is a quadrilateral and the columns of Xe are the corners of Qe chosen according to an "S"-shaped path around the boundary of Qe, then Je(ξ) will have positive determinant for all ξ in the canonical square Q∗.

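In order to apply the bilinear element to a finite element method, we need to compute the contribution from the element to the stiffness matrix. In this case, we compute

\[
\int_{Q_e} \frac{\partial \mathbf{V}^*}{\partial x} D(x) \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx
= \int_{-1}^{1}\int_{-1}^{1} \frac{\partial \mathbf{V}^*}{\partial \xi} J_e(\xi)^{-1} D(x_e(\xi)) J_e(\xi)^{-\top} \left(\frac{\partial \mathbf{V}^*}{\partial \xi}\right)^{\!\top} |J_e(\xi)|\, d\xi_1\, d\xi_2 .
\]

In practice, such integrals will be approximated by quadrature rules, such as (5.6-2). We also need to compute the contribution of the bilinear element to the right-hand side:

\[
\int_{Q_e} \mathbf{V}^* f\, dx = \int_{-1}^{1}\int_{-1}^{1} \mathbf{V}^*(\xi) f(x_e(\xi)) |J_e(\xi)|\, d\xi_1\, d\xi_2 .
\]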
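The stiffness integral above can be approximated with the 2 × 2 product Gauss rule. The following is a minimal sketch under stated assumptions, not the author's program: the coefficient matrix D is taken to be constant, the corners are ordered (−1,−1), (1,−1), (−1,1), (1,1) as in the shape function array, and the names `shape_gradients` and `element_stiffness` are hypothetical.

```python
import math

# Signs (s1, s2) of the canonical corners, in the ordering of the shape functions:
# V_a(xi) = (1 + s1 xi1)(1 + s2 xi2) / 4.
SIGNS = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def shape_gradients(xi1, xi2):
    """4 x 2 array of derivatives of the bilinear shape functions w.r.t. xi."""
    return [(s1 * (1 + s2 * xi2) / 4.0, s2 * (1 + s1 * xi1) / 4.0)
            for s1, s2 in SIGNS]

def element_stiffness(X, D):
    """4 x 4 element stiffness matrix; X lists the 4 corners (x, y), D is 2 x 2."""
    r = math.sqrt(1.0 / 3.0)
    K = [[0.0] * 4 for _ in range(4)]
    for xi1 in (-r, r):
        for xi2 in (-r, r):          # all four Gauss weights equal one
            G = shape_gradients(xi1, xi2)
            # Jacobian J[i][j] = d x_i / d xi_j.
            J = [[sum(X[a][i] * G[a][j] for a in range(4)) for j in range(2)]
                 for i in range(2)]
            det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
            Jinv = [[J[1][1] / det, -J[0][1] / det],
                    [-J[1][0] / det, J[0][0] / det]]
            # Gradients with respect to x: (dV/dxi) J^{-1}.
            Gx = [[sum(G[a][j] * Jinv[j][i] for j in range(2)) for i in range(2)]
                  for a in range(4)]
            for a in range(4):
                for b in range(4):
                    K[a][b] += abs(det) * sum(Gx[a][i] * D[i][k] * Gx[b][k]
                                              for i in range(2) for k in range(2))
    return K
```

For the canonical square itself with D equal to the identity, the rule is exact and the first row of the result is 2/3, −1/6, −1/6, −1/3. On a distorted quadrilateral the integrand is a rational function, so the quadrature is only approximate, but each row still sums to zero because constants have zero gradient.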


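If we were to compute the stiffness matrix analytically, we would see why the isoparametric coordinate transformation leads to non-polynomial approximations. Note that xe(ξ) is a function of ξ, with Jacobian

\[
J_e(\xi) = \frac{\partial x_e}{\partial \xi} = \mathbf{X}_e \frac{\partial \mathbf{V}^*}{\partial \xi} = \mathbf{X}_e S\, \frac12 + \mathbf{X}_e h\, \frac12\, \tilde\xi^\top ,
\qquad \tilde\xi \equiv \begin{bmatrix} \xi_2 \\ \xi_1 \end{bmatrix} .
\]

As a result,

\[
J_e(0) = \mathbf{X}_e S\, \frac12 ,
\]

and whenever Je(0) is nonsingular we can write

\[
J_e(\xi) = J_e(0) + \mathbf{X}_e h\, \frac12\, \tilde\xi^\top = J_e(0)\left[ I + J_e(0)^{-1} \mathbf{X}_e h\, \frac12\, \tilde\xi^\top \right] \equiv J_e(0)\left[ I + y_0 \tilde\xi^\top \right] .
\]

Here y0 solves

\[
J_e(0)\, y_0 = \mathbf{X}_e h\, \frac12 .
\]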

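It is interesting to note that Xeh = 0 if and only if the columns of Xe correspond to the vertices of a parallelogram. Also note that if Qe is a parallelogram, then 4 det Je(0) is the (signed) area of the parallelogram, since the canonical square has area 4. Finally, recall that the Sherman-Morrison-Woodbury formula shows that

\[
(I + y_0 \tilde\xi^\top)^{-1} = I - y_0 \frac{1}{1 + \tilde\xi \cdot y_0} \tilde\xi^\top ,
\]

and that

\[
\det(I + y_0 \tilde\xi^\top) = 1 + \tilde\xi \cdot y_0 ,
\qquad \text{where } \tilde\xi = [\xi_2, \xi_1]^\top .
\]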
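The factorization of the Jacobian determinant can be checked on a sample quadrilateral. The sketch below is illustrative and rests on stated assumptions: the corners are ordered as in the shape function array, the swapped coordinate vector is (ξ2, ξ1), and the names `jacobian`, `det2` and `hourglass_vector` are hypothetical.

```python
SIGNS = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def jacobian(X, xi1, xi2):
    """2 x 2 Jacobian of the bilinear map with corners X at the point (xi1, xi2)."""
    G = [(s1 * (1 + s2 * xi2) / 4.0, s2 * (1 + s1 * xi1) / 4.0) for s1, s2 in SIGNS]
    return [[sum(X[a][i] * G[a][j] for a in range(4)) for j in range(2)]
            for i in range(2)]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def hourglass_vector(X):
    """Solve J_e(0) y0 = X_e h / 2, where h has entries s1 * s2 / 2."""
    J0 = jacobian(X, 0.0, 0.0)
    rhs = [sum(X[a][i] * s1 * s2 for a, (s1, s2) in enumerate(SIGNS)) / 4.0
           for i in range(2)]
    d = det2(J0)
    return [(J0[1][1] * rhs[0] - J0[0][1] * rhs[1]) / d,
            (J0[0][0] * rhs[1] - J0[1][0] * rhs[0]) / d]

def det_from_factorization(X, xi1, xi2):
    """det J_e(0) * (1 + xi_tilde . y0), with xi_tilde = (xi2, xi1)."""
    y0 = hourglass_vector(X)
    return det2(jacobian(X, 0.0, 0.0)) * (1.0 + xi2 * y0[0] + xi1 * y0[1])
```

For a general quadrilateral `det_from_factorization` agrees with `det2(jacobian(...))` at every point, and for a parallelogram `hourglass_vector` returns the zero vector, so the Jacobian is constant.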

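In the discussion that follows, we will assume that det Je(0) > 0.

Let us re-examine the contributions to the stiffness matrix. With $\tilde\xi = [\xi_2, \xi_1]^\top$, we compute

\[
\int_{Q_e} \frac{\partial \mathbf{V}^*}{\partial x} D(x) \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx
= \int_{-1}^{1}\int_{-1}^{1} \frac{\partial \mathbf{V}^*}{\partial \xi} J_e(\xi)^{-1} D(x_e(\xi)) J_e(\xi)^{-\top} \left(\frac{\partial \mathbf{V}^*}{\partial \xi}\right)^{\!\top} |J_e(\xi)|\, d\xi_1\, d\xi_2
\]
\[
= \int_{-1}^{1}\int_{-1}^{1} [S + h \tilde\xi^\top] \frac12 \left[ I - y_0 \frac{1}{1 + \tilde\xi \cdot y_0} \tilde\xi^\top \right] J_e(0)^{-1} D(x_e(\xi)) J_e(0)^{-\top}
\left[ I - \tilde\xi \frac{1}{1 + \tilde\xi \cdot y_0} y_0^\top \right] \frac12 [S^\top + \tilde\xi h^\top]\, |J_e(0)| (1 + \tilde\xi \cdot y_0)\, d\xi_1\, d\xi_2 .
\]

This expression suggests that we define the "bent hourglass mode"

\[
\tilde h = h - S y_0 = \left[ I - S J_e(0)^{-1} \mathbf{X}_e \tfrac12 \right] h ,
\]

because $[S + h \tilde\xi^\top]\left[ I - y_0 \tilde\xi^\top / (1 + \tilde\xi \cdot y_0) \right] = S + \tilde h\, \tilde\xi^\top / (1 + \tilde\xi \cdot y_0)$.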


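Then we can write

\[
\int_{Q_e} \frac{\partial \mathbf{V}^*}{\partial x} D(x) \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx
= \int_{-1}^{1}\int_{-1}^{1} \left[ S + \tilde h \frac{1}{1 + \tilde\xi \cdot y_0} \tilde\xi^\top \right] \frac12 J_e(0)^{-1} D(x_e(\xi)) J_e(0)^{-\top} \frac12 \left[ S^\top + \tilde\xi \frac{1}{1 + \tilde\xi \cdot y_0} \tilde h^\top \right] |J_e(0)| (1 + \tilde\xi \cdot y_0)\, d\xi_1\, d\xi_2 .
\]

If D is constant, then the terms that are linear in $\tilde\xi$ integrate to zero and we obtain

\[
\int_{Q_e} \frac{\partial \mathbf{V}^*}{\partial x} D \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx
= S J_e(0)^{-1} D J_e(0)^{-\top} S^\top |J_e(0)|
+ \frac14 \tilde h \int_{-1}^{1}\int_{-1}^{1} \frac{\tilde\xi \cdot J_e(0)^{-1} D J_e(0)^{-\top} \tilde\xi}{1 + \tilde\xi \cdot y_0}\, d\xi_1\, d\xi_2\; |J_e(0)|\, \tilde h^\top .
\]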

5.6.3 Biquadratics

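In this case, the array of shape functions is

\[
\mathbf{V}^*(\xi) =
\begin{bmatrix} -\tfrac12 \xi_1 (1 - \xi_1) \\ (1 + \xi_1)(1 - \xi_1) \\ \tfrac12 \xi_1 (1 + \xi_1) \end{bmatrix}
\otimes
\begin{bmatrix} -\tfrac12 \xi_2 (1 - \xi_2) \\ (1 + \xi_2)(1 - \xi_2) \\ \tfrac12 \xi_2 (1 + \xi_2) \end{bmatrix} .
\]

The nodes are the integer lattice points in the canonical square Q∗, namely the points with ξ1, ξ2 ∈ {−1, 0, 1}. For quadratures, we can use the tensor product of the Gaussian quadrature rule

\[
\int_{-1}^{1} f(\xi)\, d\xi \approx \frac59 f\left(-\sqrt{\tfrac35}\right) + \frac89 f(0) + \frac59 f\left(\sqrt{\tfrac35}\right) .
\]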

5.6.4 Bicubics


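\[
\mathbf{V}^*(\xi) =
\begin{bmatrix}
-\tfrac{1}{16}(1 + 3\xi_1)(1 - 3\xi_1)(1 - \xi_1) \\
\tfrac{9}{16}(1 + \xi_1)(1 - 3\xi_1)(1 - \xi_1) \\
\tfrac{9}{16}(1 + \xi_1)(1 + 3\xi_1)(1 - \xi_1) \\
-\tfrac{1}{16}(1 + 3\xi_1)(1 - 3\xi_1)(1 + \xi_1)
\end{bmatrix}
\otimes
\begin{bmatrix}
-\tfrac{1}{16}(1 + 3\xi_2)(1 - 3\xi_2)(1 - \xi_2) \\
\tfrac{9}{16}(1 + \xi_2)(1 - 3\xi_2)(1 - \xi_2) \\
\tfrac{9}{16}(1 + \xi_2)(1 + 3\xi_2)(1 - \xi_2) \\
-\tfrac{1}{16}(1 + 3\xi_2)(1 - 3\xi_2)(1 + \xi_2)
\end{bmatrix} .
\]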


5.6.5 Hermite Bicubics


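\[
\mathbf{V}^*(\xi) =
\begin{bmatrix}
\tfrac14 (2 + \xi_1)(1 - \xi_1)^2 \\
\tfrac14 (1 + \xi_1)(1 - \xi_1)^2 \\
\tfrac14 (2 - \xi_1)(1 + \xi_1)^2 \\
-\tfrac14 (1 - \xi_1)(1 + \xi_1)^2
\end{bmatrix}
\otimes
\begin{bmatrix}
\tfrac14 (2 + \xi_2)(1 - \xi_2)^2 \\
\tfrac14 (1 + \xi_2)(1 - \xi_2)^2 \\
\tfrac14 (2 - \xi_2)(1 + \xi_2)^2 \\
-\tfrac14 (1 - \xi_2)(1 + \xi_2)^2
\end{bmatrix} .
\]

In each factor, the first and third functions have value one at ξ = −1 and ξ = 1, respectively, and the second and fourth have derivative one there; the sign of the first entry is chosen so that its value at ξ = −1 is +1.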

5.6.6 Serendipity Quadratics

The shape functions are quadratics in each of ξ1 and ξ2, but such that the $\xi_1^2 \xi_2^2$ term vanishes. This constraint does not destroy the accuracy of the element. The shape functions are

\[
\mathbf{V}^*(\xi) =
\begin{bmatrix}
-\tfrac14 (1 - \xi_1)(1 - \xi_2)(1 + \xi_1 + \xi_2) \\
\tfrac12 (1 + \xi_1)(1 - \xi_1)(1 - \xi_2) \\
-\tfrac14 (1 + \xi_1)(1 - \xi_2)(1 - \xi_1 + \xi_2) \\
\tfrac12 (1 - \xi_1)(1 + \xi_2)(1 - \xi_2) \\
\tfrac12 (1 + \xi_1)(1 + \xi_2)(1 - \xi_2) \\
-\tfrac14 (1 - \xi_1)(1 + \xi_2)(1 + \xi_1 - \xi_2) \\
\tfrac12 (1 + \xi_1)(1 - \xi_1)(1 + \xi_2) \\
-\tfrac14 (1 + \xi_1)(1 + \xi_2)(1 - \xi_1 - \xi_2)
\end{bmatrix} .
\]

In general, we can define serendipity elements of order p in each variable, with 4p nodes on the boundary of each element.

5.6.7 Hierarchical Quadrilateral Elements

We have discussed the development of hierarchical shape functions in one dimension above in section 5.3.3. The difficulty in extending the approach to multiple dimensions is that we have to assume a fixed-order coordinate transformation to the canonical square. Typically, either a bilinear or biquadratic transformation is used. We will discuss the development of hierarchical quadrilateral elements using a bilinear map for the coordinate transformation.

Recall that in one dimension we developed hierarchical basis functions of the form

\[
V^*_0(\xi) = \tfrac12(1 - \xi) = 1 - \tfrac12 \int_{-1}^{\xi} P_0(t)\, dt ,
\]
\[
V^*_1(\xi) = \tfrac12(1 + \xi) = \tfrac12 \int_{-1}^{\xi} P_0(t)\, dt ,
\]
\[
\forall\, 2 \le j \quad V^*_j(\xi) = \sqrt{\frac{2j-1}{2}} \int_{-1}^{\xi} P_{j-1}(t)\, dt .
\]


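In two dimensions, our basis functions are

\[
\mathbf{V}^*(\xi) =
\begin{bmatrix} V^*_0(\xi_1) \\ V^*_1(\xi_1) \\ V^*_2(\xi_1) \\ \vdots \\ V^*_p(\xi_1) \end{bmatrix}
\otimes
\begin{bmatrix} V^*_0(\xi_2) \\ V^*_1(\xi_2) \\ V^*_2(\xi_2) \\ \vdots \\ V^*_p(\xi_2) \end{bmatrix} .
\]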

Note that there are

• 4 corner nodes

• 4(p− 1) side nodes other than at the corners, and

• (p− 1)2 interior nodes.

In order to evaluate the integrals, Szabo and Babuska [87] suggest reformulating the one-dimensional basis functions into the form

\[
V^*_j(\xi) = \frac{\prod_{i=0,\, i \ne j-2}^{p-2} (\xi - \eta_i)\, (1 - \xi)(1 + \xi)}{\prod_{i=0,\, i \ne j-2}^{p-2} (\eta_{j-2} - \eta_i)\, (1 - \eta_{j-2})(1 + \eta_{j-2})} , \quad 2 \le j \le p .
\]

Here η0, . . . , ηp−2 are the zeros of $P'_p$, the derivative of the pth Legendre polynomial; the factor with i = j − 2 is omitted from both products so that the denominator is nonzero. Thus $V^*_j$ is one at exactly one of the Gauss-Lobatto quadrature points. Since Gauss-Lobatto quadrature with p + 1 points is exact for polynomials of degree 2p − 1, this is sufficient to integrate the entries of the stiffness matrix.

5.7 Three-Dimensional Finite Elements

5.7.1 3D Rectangular Finite Elements

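We will briefly discuss the development of three-dimensional rectangular elements. Our canonical element is the cube

\[
e_* = \{ \xi \in \mathbb{R}^3 : \max_{1 \le i \le 3} |\xi_i| \le 1 \} .
\]

For trilinear elements, our basis functions are

\[
\mathbf{V}^*(\xi) = \begin{bmatrix} 1 - \xi_1 \\ 1 + \xi_1 \end{bmatrix} \frac12 \otimes \begin{bmatrix} 1 - \xi_2 \\ 1 + \xi_2 \end{bmatrix} \frac12 \otimes \begin{bmatrix} 1 - \xi_3 \\ 1 + \xi_3 \end{bmatrix} \frac12
\]
\[
= \begin{bmatrix} 1\\1\\1\\1\\1\\1\\1\\1 \end{bmatrix} \frac18
+ \begin{bmatrix} -1\\1\\-1\\1\\-1\\1\\-1\\1 \end{bmatrix} \frac{\xi_1}{8}
+ \begin{bmatrix} -1\\-1\\1\\1\\-1\\-1\\1\\1 \end{bmatrix} \frac{\xi_2}{8}
+ \begin{bmatrix} -1\\-1\\-1\\-1\\1\\1\\1\\1 \end{bmatrix} \frac{\xi_3}{8}
+ \begin{bmatrix} 1\\-1\\-1\\1\\1\\-1\\-1\\1 \end{bmatrix} \frac{\xi_1 \xi_2}{8}
+ \begin{bmatrix} 1\\1\\-1\\-1\\-1\\-1\\1\\1 \end{bmatrix} \frac{\xi_2 \xi_3}{8}
+ \begin{bmatrix} 1\\-1\\1\\-1\\-1\\1\\-1\\1 \end{bmatrix} \frac{\xi_3 \xi_1}{8}
+ \begin{bmatrix} -1\\1\\1\\-1\\1\\-1\\-1\\1 \end{bmatrix} \frac{\xi_1 \xi_2 \xi_3}{8} .
\]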

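In order to simplify the discussion of the computation of the integrals involved in the finite element method, we will define several arrays. The array of shear modes is

\[
S = \begin{bmatrix}
-1 & -1 & -1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ 1 & 1 & -1 \\ -1 & -1 & 1 \\ 1 & -1 & 1 \\ -1 & 1 & 1 \\ 1 & 1 & 1
\end{bmatrix} \frac18 ,
\]

the array of low-order hourglass modes is

\[
A = \begin{bmatrix}
1 & 1 & 1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \\ -1 & -1 & 1 \\ -1 & 1 & -1 \\ 1 & -1 & -1 \\ 1 & 1 & 1
\end{bmatrix} \frac18 ,
\]

and the high-order hourglass mode is

\[
a = \begin{bmatrix} -1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 \end{bmatrix}^\top \frac18 .
\]

We also define

\[
L(\xi) = \begin{bmatrix} 0 & \xi_3 & \xi_2 \\ \xi_3 & 0 & \xi_1 \\ \xi_2 & \xi_1 & 0 \end{bmatrix}
\qquad \text{and} \qquad
q(\xi) = \begin{bmatrix} \xi_2 \xi_3 \\ \xi_3 \xi_1 \\ \xi_1 \xi_2 \end{bmatrix} .
\]

With this notation, we can write

\[
\frac{\partial \mathbf{V}^*}{\partial \xi} = S + A L(\xi) + a\, q(\xi)^\top .
\]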

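A general curvilinear element has the nodes

\[
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & x_{16} & x_{17} & x_{18} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & x_{26} & x_{27} & x_{28} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35} & x_{36} & x_{37} & x_{38}
\end{bmatrix} .
\]

In order to transform coordinates between the curvilinear element and the canonical element, we will use the isoparametric transformation

\[
x(\xi) = X \mathbf{V}^*(\xi) .
\]

Note that x(ξ) is a nonlinear function of ξ, with Jacobian

\[
J(\xi) = \frac{\partial x}{\partial \xi} = X \frac{\partial \mathbf{V}^*}{\partial \xi} = X S + X A L(\xi) + X a\, q(\xi)^\top
= J(0) \left[ I + J(0)^{-1} X A L(\xi) + J(0)^{-1} X a\, q(\xi)^\top \right] .
\]

This suggests that we define

\[
Y \equiv J(0)^{-1} X A \qquad \text{and} \qquad y = J(0)^{-1} X a .
\]

Then we can write

\[
J(\xi) = J(0) \left[ I + Y L(\xi) + y\, q(\xi)^\top \right] .
\]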

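Note that if the curvilinear element is a parallelepiped, then Y = 0 and y = 0.

This preparation allows us to compute the array of contributions to the stiffness matrix:

\[
\int\!\!\int\!\!\int_e \frac{\partial \mathbf{V}^*}{\partial x} D(x) \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx_1\, dx_2\, dx_3
= \int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} \frac{\partial \mathbf{V}^*}{\partial \xi} J(\xi)^{-1} D(x(\xi)) J(\xi)^{-\top} \left(\frac{\partial \mathbf{V}^*}{\partial \xi}\right)^{\!\top} |J(\xi)|\, d\xi_1\, d\xi_2\, d\xi_3
\]
\[
= \int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} [S + A L(\xi) + a q(\xi)^\top] [I + Y L(\xi) + y q(\xi)^\top]^{-1} J(0)^{-1} D(x(\xi)) J(0)^{-\top}
[I + Y L(\xi) + y q(\xi)^\top]^{-\top} [S + A L(\xi) + a q(\xi)^\top]^\top\, |J(0)|\, |I + Y L(\xi) + y q(\xi)^\top|\, d\xi_1\, d\xi_2\, d\xi_3 .
\]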

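If the coefficient matrix D(x) is not constant, then it is useful to approximate

\[
\int\!\!\int\!\!\int_e \frac{\partial \mathbf{V}^*}{\partial x} D(x) \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx_1\, dx_2\, dx_3
\approx \Bigl[ \sum_{\xi_3 = \pm\sqrt{1/3}}\; \sum_{\xi_2 = \pm\sqrt{1/3}}\; \sum_{\xi_1 = \pm\sqrt{1/3}} [S + A L(\xi) + a q(\xi)^\top] [I + Y L(\xi) + y q(\xi)^\top]^{-1} J(0)^{-1} D(x(\xi)) J(0)^{-\top}
[I + Y L(\xi) + y q(\xi)^\top]^{-\top} [S + A L(\xi) + a q(\xi)^\top]^\top\, |I + Y L(\xi) + y q(\xi)^\top| \Bigr] |J(0)| .
\]

If the coefficient matrix D(x) is constant, then it is efficient to compute

\[
\tilde D \equiv J(0)^{-1} D J(0)^{-\top} |J(0)|
\]

and use it in the quadrature rule

\[
\int\!\!\int\!\!\int_e \frac{\partial \mathbf{V}^*}{\partial x} D \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx_1\, dx_2\, dx_3
\approx \sum_{\xi_3 = \pm\sqrt{1/3}}\; \sum_{\xi_2 = \pm\sqrt{1/3}}\; \sum_{\xi_1 = \pm\sqrt{1/3}} [S + A L(\xi) + a q(\xi)^\top] [I + Y L(\xi) + y q(\xi)^\top]^{-1} \tilde D\,
[I + Y L(\xi) + y q(\xi)^\top]^{-\top} [S + A L(\xi) + a q(\xi)^\top]^\top\, |I + Y L(\xi) + y q(\xi)^\top| .
\]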

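If the curvilinear element is a parallelepiped and D is constant, then Y = 0 and y = 0, and we can compute the integral exactly. With $\tilde D \equiv J(0)^{-1} D J(0)^{-\top} |J(0)|$, the terms that are odd in some ξi integrate to zero over the cube, while $\int \xi_i^2 = \tfrac83$ and $\int \xi_i^2 \xi_j^2 = \tfrac89$ for i ≠ j, so

\[
\int\!\!\int\!\!\int_e \frac{\partial \mathbf{V}^*}{\partial x} D \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx_1\, dx_2\, dx_3
= 8\, S \tilde D S^\top
+ A \begin{bmatrix}
\tilde D_{22} + \tilde D_{33} & \tilde D_{21} & \tilde D_{31} \\
\tilde D_{12} & \tilde D_{33} + \tilde D_{11} & \tilde D_{32} \\
\tilde D_{13} & \tilde D_{23} & \tilde D_{11} + \tilde D_{22}
\end{bmatrix} A^\top \frac83
+ a\, (\tilde D_{11} + \tilde D_{22} + \tilde D_{33})\, a^\top \frac89 .
\]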

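Similarly, the contribution to the right-hand side of the finite element equations is

\[
\int\!\!\int\!\!\int_e \mathbf{V}^* f\, dx_1\, dx_2\, dx_3 = \int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} \mathbf{V}^*(\xi) f(x(\xi))\, |J(0)|\, |I + Y L(\xi) + y q(\xi)^\top|\, d\xi_1\, d\xi_2\, d\xi_3
\]
\[
\approx \Bigl[ \sum_{\xi_3 = \pm\sqrt{1/3}}\; \sum_{\xi_2 = \pm\sqrt{1/3}}\; \sum_{\xi_1 = \pm\sqrt{1/3}} \mathbf{V}^*(\xi) f(x(\xi))\, |I + Y L(\xi) + y q(\xi)^\top| \Bigr] |J(0)| .
\]

If the inhomogeneity f is constant and the element e is a parallelepiped, then we can compute

\[
\int\!\!\int\!\!\int_e \mathbf{V}^* f\, dx_1\, dx_2\, dx_3 = \int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} \mathbf{V}^*(\xi)\, f\, |J(0)|\, d\xi_1\, d\xi_2\, d\xi_3
= \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^\top f\, |J(0)| ,
\]

since each trilinear shape function integrates to one over the canonical cube. The entries sum to 8 f |J(0)|, which is f times the volume of the element.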

5.7.2 3D Tetrahedral Finite Elements

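The natural generalization of a triangle to three dimensions is a tetrahedron. Our canonical tetrahedral element is

\[
e_* = \Bigl\{ \xi \in \mathbb{R}^3 : \xi \ge 0 \text{ and } \sum_{i=1}^{3} \xi_i \le 1 \Bigr\} .
\]

This allows us to define canonical shape functions in terms of the barycentric coordinates:

\[
\mathbf{V}^*(\xi) = \begin{bmatrix} \xi_1 \\ \xi_2 \\ \xi_3 \\ 1 - \xi_1 - \xi_2 - \xi_3 \end{bmatrix}
= \begin{bmatrix} \beta_1(\xi) \\ \beta_2(\xi) \\ \beta_3(\xi) \\ \beta_4(\xi) \end{bmatrix} .
\]

It follows that

\[
\frac{\partial \mathbf{V}^*}{\partial \xi} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix} \equiv S
\]

is constant.

The curvilinear element is assumed to have vertices

\[
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} \\
x_{21} & x_{22} & x_{23} & x_{24} \\
x_{31} & x_{32} & x_{33} & x_{34}
\end{bmatrix} .
\]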

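This leads to the isoparametric transformation

\[
x(\xi) = X \mathbf{V}^*(\xi)
\]

with Jacobian

\[
J = \frac{\partial x}{\partial \xi} = X S .
\]

In this case, the Jacobian is also constant, which greatly simplifies the integrals in the finite element method.

The contribution to the stiffness matrix is

\[
\int\!\!\int\!\!\int_e \frac{\partial \mathbf{V}^*}{\partial x} D \left(\frac{\partial \mathbf{V}^*}{\partial x}\right)^{\!\top} dx_1\, dx_2\, dx_3
= \int\!\!\int\!\!\int_{e_*} S J^{-1} D(x(\xi)) J^{-\top} S^\top |J|\, d\xi_1\, d\xi_2\, d\xi_3
\approx S J^{-1} D\Bigl( x\bigl( [\tfrac14, \tfrac14, \tfrac14]^\top \bigr) \Bigr) J^{-\top} S^\top \frac{|J|}{6} .
\]

The contribution to the right-hand side is

\[
\int\!\!\int\!\!\int_e \mathbf{V}^* f\, dx_1\, dx_2\, dx_3
= \int\!\!\int\!\!\int_{e_*} \mathbf{V}^*(\xi) f(x(\xi)) |J|\, d\xi_1\, d\xi_2\, d\xi_3
\approx \bigl[ \mathbf{V}^* f \bigr]\bigl( [\tfrac14, \tfrac14, \tfrac14]^\top \bigr) \frac{|J|}{6} .
\]

Here 1/6 is the volume of the canonical tetrahedron, and the integrands are evaluated at its centroid.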
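Because the Jacobian is constant, the tetrahedral stiffness contribution reduces to a few lines of arithmetic. The following is a minimal sketch, not the author's program: it assumes D has already been evaluated at the centroid, builds the inverse Jacobian from cross products of the edge vectors, and uses the hypothetical names `cross` and `tet_stiffness`.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def tet_stiffness(X, D):
    """4 x 4 stiffness contribution S J^{-1} D J^{-T} S^T |J| / 6.

    X lists the four vertices (x, y, z); D is the 3 x 3 coefficient matrix
    evaluated at the centroid of the element.
    """
    # Columns of J = X S are the edge vectors x_i - x_4, i = 1, 2, 3.
    c = [[X[i][k] - X[3][k] for k in range(3)] for i in range(3)]
    n = cross(c[1], c[2])
    det = sum(c[0][k] * n[k] for k in range(3))
    # Row i of J^{-1} is (c_{i+1} x c_{i+2}) / det J, indices taken mod 3.
    Jinv = [[v / det for v in cross(c[(i + 1) % 3], c[(i + 2) % 3])]
            for i in range(3)]
    # Rows of S J^{-1}: gradients of the four barycentric coordinates.
    G = Jinv + [[-sum(Jinv[i][k] for i in range(3)) for k in range(3)]]
    return [[abs(det) / 6.0 * sum(G[a][i] * D[i][k] * G[b][k]
                                  for i in range(3) for k in range(3))
             for b in range(4)] for a in range(4)]
```

For the canonical tetrahedron with D equal to the identity, the Jacobian is the identity and the result is S Sᵀ / 6, whose diagonal is 1/6, 1/6, 1/6, 1/2.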

5.7.3 3D Wedge Elements

Our canonical wedge element is

\[
e_* = \{ \xi \in \mathbb{R}^3 : -1 \le \xi_3 \le 1 \text{ and } 0 \le \xi_1, \xi_2 \text{ and } \xi_1 + \xi_2 \le 1 \} .
\]

Our canonical shape functions are

\[
\mathbf{V}^*(\xi) = \begin{bmatrix} \xi_1 \\ \xi_2 \\ 1 - \xi_1 - \xi_2 \end{bmatrix} \otimes \begin{bmatrix} 1 - \xi_3 \\ 1 + \xi_3 \end{bmatrix} \frac12 .
\]

A useful quadrature rule is to give equal weight to function values at the points $\xi = [\tfrac13, \tfrac13, \pm\sqrt{\tfrac13}]^\top$.

Sometimes people will form a wedge or tetrahedron by coalescing points in a 3D curvilinear rectangle. This leads to singularities in the Jacobian of the coordinate transformation, and may not be desirable.


5.8 Error Estimates for Linear Elements

The following discussion is adapted from Shewchuk [76].

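Lemma 5.8-1: Suppose that we are given a triangle (d = 2) or tetrahedron (d = 3) $T \subset \mathbb{R}^d$ with vertices

\[
X = \begin{bmatrix} x_1 & \ldots & x_{d+1} \end{bmatrix} = \begin{bmatrix} x_{1,1} & \ldots & x_{1,d+1} \\ \vdots & & \vdots \\ x_{d,1} & \ldots & x_{d,d+1} \end{bmatrix}
\]

such that X has rank d. Suppose that $f(x) \in C^2(T)$ satisfies

\[
\exists c_{f,T}\ \forall x \in T\ \forall s \in \mathbb{R}^d \quad |s^\top (\nabla_x \nabla_x^\top f)(x)\, s| \le c_{f,T}\, s^\top s .
\]

Let r be the radius of the smallest circle (d = 2) or sphere (d = 3) that contains the triangle or tetrahedron, and let $\hat x$ be the corresponding center. Let $p_{f,T}(x)$ be the linear function that interpolates f at the vertices of T. Then

\[
\forall x \in T \quad |f(x) - p_{f,T}(x)| \le \frac{c_{f,T}\, r^2}{2} .
\]

This estimate is the best possible: if $\hat x$ is the center of the circumscribing circle or sphere for T, then the function

\[
\phi(x) = \frac{c_{f,T}}{2} \left[ r^2 - \| x - \hat x \|^2 \right]
\]

is such that its interpolant $p_{\phi,T}(x) \equiv 0$ and

\[
|\phi(\hat x) - p_{\phi,T}(\hat x)| = \frac{c_{f,T}\, r^2}{2} .
\]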

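Proof: Recall the canonical barycentric coordinates

\[
\forall \xi \in \mathbb{R}^d \quad b(\xi) = \begin{bmatrix} \xi_1 \\ \vdots \\ \xi_d \\ 1 - \xi_1 - \ldots - \xi_d \end{bmatrix}
\]

and the canonical triangle or tetrahedron

\[
T_* \equiv \{ \xi : b(\xi) \ge 0 \} .
\]

Note that, with e the vector of ones,

\[
\forall \xi \in \mathbb{R}^d \quad e^\top b(\xi) \equiv 1
\]

and that

\[
\frac{\partial b}{\partial \xi} = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \\ -1 & \ldots & -1 \end{bmatrix}
\]

is constant. We can define the coordinate transformation

\[
\forall \xi \in T_* , \quad x(\xi) \equiv X b(\xi)
\]

and find that x(ξ) ∈ T if and only if ξ ∈ T∗. Also note that

\[
\frac{\partial x(\xi)}{\partial \xi} = X \frac{\partial b}{\partial \xi}
\]

is constant. It follows that

\[
\forall \xi_1, \xi_2 \in T_* , \quad x(\xi_2) - x(\xi_1) = \frac{\partial x}{\partial \xi} (\xi_2 - \xi_1) .
\]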

Let

\[
\mathbf{f} = \begin{bmatrix} f(X e_1) & \ldots & f(X e_{d+1}) \end{bmatrix}^\top
\]

be the array of values of f at the vertices of T. Then the linear function that interpolates f(x) at the vertices of T is

\[
\forall \xi \in \mathbb{R}^d , \quad p_{f,T}(\xi) = \mathbf{f}^\top b(\xi) .
\]

Define the interpolation error

\[
\forall \xi \in T_* , \quad \varepsilon(\xi) \equiv f(x(\xi)) - p_{f,T}(\xi)
\]

and note that the second derivative of ε is the second derivative of f. The interpolation error is implicitly a function of x ∈ T through the isoparametric transformation x(ξ). We will introduce the notation

\[
H(\nu) \equiv \int_0^1 \int_0^1 \nabla_x \nabla_x^\top f\bigl( x( \xi [1 - \tau_2] + [\xi \tau_1 + \nu (1 - \tau_1)] \tau_2 ) \bigr)\, d\tau_2\, (1 - \tau_1)\, d\tau_1
\]


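and compute

\[
\varepsilon(\xi) - \varepsilon(\nu) = \int_0^1 \frac{d \varepsilon(\xi \tau_1 + \nu [1 - \tau_1])}{d \tau_1}\, d\tau_1 = \int_0^1 \nabla_\xi \varepsilon(\xi \tau_1 + \nu [1 - \tau_1]) \cdot (\xi - \nu)\, d\tau_1
\]
\[
= \int_0^1 \Bigl[ \nabla_\xi \varepsilon(\xi) + \int_0^1 \frac{d \nabla_\xi \varepsilon(\xi [1 - \tau_2] + [\xi \tau_1 + \nu (1 - \tau_1)] \tau_2)}{d \tau_2}\, d\tau_2 \Bigr] \cdot (\xi - \nu)\, d\tau_1
\]
\[
= \nabla_\xi \varepsilon(\xi) \cdot [\xi - \nu] - [\xi - \nu]^\top \left(\frac{\partial x}{\partial \xi}\right)^{\!\top} H(\nu) \frac{\partial x}{\partial \xi} [\xi - \nu]
\]
\[
= \nabla_\xi \varepsilon(\xi) \cdot [\xi - \nu] - [x(\xi) - x(\nu)]^\top H(\nu) [x(\xi) - x(\nu)] .
\]

In particular, if ν is a vertex of the canonical triangle or tetrahedron T∗ (i.e., if ν is an axis vector or zero), then we have ε(ν) = 0. Thus for any vertex ν of T∗ we have

\[
\varepsilon(\xi) = \nabla_\xi \varepsilon(\xi) \cdot [\xi - \nu] - [x(\xi) - x(\nu)]^\top H(\nu) [x(\xi) - x(\nu)] \qquad (5.8\text{-}1)
\]

Note that

\[
|s^\top H(\nu) s| \le c_{f,T}\, s^\top s \int_0^1 \int_0^1 d\tau_2\, (1 - \tau_1)\, d\tau_1 = \frac{c_{f,T}}{2}\, s^\top s .
\]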

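Let νi = ei for 1 ≤ i ≤ d and νd+1 = 0 denote the vertices of the canonical triangle or tetrahedron T∗. Since the barycentric coordinates sum to 1, and since $\sum_{i=1}^{d+1} \nu_i e_i^\top b(\xi) = \xi$, equation (5.8-1) implies that

\[
\varepsilon(\xi) = e^\top b(\xi)\, \varepsilon(\xi)
= \sum_{i=1}^{d+1} e_i^\top b(\xi) \Bigl\{ \nabla_\xi \varepsilon(\xi) \cdot (\xi - \nu_i) - [x(\xi) - x(\nu_i)]^\top H(\nu_i) [x(\xi) - x(\nu_i)] \Bigr\}
\]
\[
= \nabla_\xi \varepsilon(\xi) \cdot \Bigl\{ \xi \sum_{i=1}^{d+1} e_i^\top - \sum_{i=1}^{d+1} \nu_i e_i^\top \Bigr\} b(\xi)
- \sum_{i=1}^{d+1} e_i^\top b(\xi) [x(\nu_i) - x(\xi)]^\top H(\nu_i) [x(\nu_i) - x(\xi)]
\]
\[
= \nabla_\xi \varepsilon(\xi) \cdot \{ \xi - \xi \} - \sum_{i=1}^{d+1} e_i^\top b(\xi) [x(\nu_i) - x(\xi)]^\top H(\nu_i) [x(\nu_i) - x(\xi)]
\]
\[
= - \sum_{i=1}^{d+1} e_i^\top b(\xi) [x(\nu_i) - x(\xi)]^\top H(\nu_i) [x(\nu_i) - x(\xi)] .
\]

This result easily implies that

\[
|\varepsilon(\xi)| \le \frac{c_{f,T}}{2} \sum_{i=1}^{d+1} \| x(\nu_i) - x(\xi) \|^2 e_i^\top b(\xi) \equiv \beta(\xi) .
\]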


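In order to determine a bound for β, we will compute its extrema. Note that

\[
\frac{2}{c_{f,T}} \frac{\partial \beta}{\partial \xi}
= \sum_{i=1}^{d+1} \| x(\nu_i) - x(\xi) \|^2 e_i^\top \frac{\partial b}{\partial \xi} - 2 \sum_{i=1}^{d+1} e_i^\top b(\xi) [x(\nu_i) - x(\xi)]^\top \frac{\partial x}{\partial \xi}
\]
\[
= \sum_{i=1}^{d+1} \| x(\nu_i) - x(\xi) \|^2 e_i^\top \frac{\partial b}{\partial \xi} - 2 [X b(\xi) - x(\xi)]^\top \frac{\partial x}{\partial \xi}
= \sum_{i=1}^{d+1} \| x(\nu_i) - x(\xi) \|^2 e_i^\top \frac{\partial b}{\partial \xi}
\]
\[
= \sum_{i=1}^{d} \| x(\nu_i) - x(\xi) \|^2 e_i^\top - \| x(\nu_{d+1}) - x(\xi) \|^2 e^\top .
\]

Here we used $\sum_i e_i^\top b(\xi)\, x(\nu_i) = X b(\xi) = x(\xi)$; in the last line, $e_i$ and $e$ are the axis vectors and the vector of ones in $\mathbb{R}^d$. Thus at an extremum of β, we must have

\[
\forall\, 1 \le i \le d , \quad \| x(\nu_i) - x(\xi) \|^2 = \| x(\nu_{d+1}) - x(\xi) \|^2 .
\]

This says that x(ξ) is equidistant from the vertices of T, which means that x(ξ) is the center of the circumscribing circle. It is now easy to see that β(ξ) reaches its maximum at the center of the circumscribing circle. Further, if $x(\xi_{circ})$ is the center of the circumscribing circle, for which the radius is $r_{circ}$, then

\[
\beta(\xi) \le \beta(\xi_{circ}) = \frac{c_{f,T}}{2} \sum_{i=1}^{d+1} \| x(\nu_i) - x(\xi_{circ}) \|^2 e_i^\top b(\xi_{circ})
= \frac{c_{f,T}}{2} r_{circ}^2 \sum_{i=1}^{d+1} e_i^\top b(\xi_{circ}) = \frac{c_{f,T}\, r_{circ}^2}{2} \qquad (5.8\text{-}2)
\]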

If $\xi_{circ} \notin T_*$ then the bound (5.8-2) will be larger than necessary. In this case, we note that β(ξ) is a quadratic in x(ξ) of the form

\[
\beta(\xi) = \frac{c_{f,T}}{2} \Bigl\{ \sum_{i=1}^{d+1} e_i^\top b(\xi) \| x(\nu_i) \|^2 - 2 \sum_{i=1}^{d+1} e_i^\top b(\xi)\, x(\nu_i)^\top x(\xi) + \sum_{i=1}^{d+1} e_i^\top b(\xi) \| x(\xi) \|^2 \Bigr\}
\]
\[
= \frac{c_{f,T}}{2} \Bigl\{ \sum_{i=1}^{d+1} e_i^\top b(\xi) \| x(\nu_i) \|^2 - 2 [X b(\xi)]^\top x(\xi) + \| x(\xi) \|^2 \Bigr\}
= \frac{c_{f,T}}{2} \Bigl\{ \sum_{i=1}^{d+1} e_i^\top b(\xi) \| x(\nu_i) \|^2 - \| x(\xi) \|^2 \Bigr\} .
\]

In other words, the quadratic terms do not involve any mixed products. After inverting x(ξ) to determine β(x) ≡ β(ξ(x)), we see that β(x) is zero at the vertices of T. Further, the gradient of β(x) is zero at the center of the circumscribing circle $x_{circ} \equiv x(\xi_{circ})$, and $\beta(x_{circ}) = c_{f,T}\, r_{circ}^2 / 2$. Note that the gradient condition is not independent of the zero value conditions at the vertices. There are d + 1 conditions describing the zero values at the vertices, d conditions describing the zero gradient at $x_{circ}$, and one condition describing the value at $x_{circ}$; these are 2d + 1 independent conditions. These conditions determine the quadratic of this form uniquely. Another way to write this same quadratic is

\[
\beta(x) = \frac{c_{f,T}}{2} \left[ r_{circ}^2 - \| x - x_{circ} \|^2 \right] .
\]

If $x_{circ} \notin T$, then the minimum circle or sphere containing T is smaller than the circumscribing circle. The maximum value of β(x) over x ∈ T occurs at the point $\hat x \in T$ that minimizes $\| x - x_{circ} \|^2$. Since $x_{circ}$ is equidistant from all of the vertices of T, $\hat x$ cannot be a vertex of T, and cannot lie on an edge of T if T is a tetrahedron. Rather, $\hat x$ lies in some side of T: for some 1 ≤ i ≤ d + 1 we must have $\hat x = x(\xi)$ where $e_i^\top b(\xi) = 0$ and $e_j^\top b(\xi) > 0$ for j ≠ i. The theory for least squares problems shows us that $x_{circ} - \hat x$ is orthogonal to this side of T. Since the Pythagorean theorem shows that for all j ≠ i

\[
r_{circ}^2 = \| X e_j - x_{circ} \|^2 = \| X e_j - \hat x + \hat x - x_{circ} \|^2 = \| X e_j - \hat x \|^2 + \| \hat x - x_{circ} \|^2 ,
\]

we see that the vertices in this side of T are all at the same distance r, where $r^2 \equiv r_{circ}^2 - \| \hat x - x_{circ} \|^2$, from $\hat x$. These vertices and $\hat x$ determine a circle (d = 2) or sphere (d = 3). Rajan [67] used duality theory for quadratic programming to show that $\max_{x \in T} \beta(x) = c_{f,T}\, r^2 / 2$, where r is the radius of the minimum containment circle (d = 2) or sphere (d = 3), and that the maximum occurs at the center of this circle or sphere. This completes the proof. □

This lemma shows that piecewise linear interpolation can be accurate in approximating function values, so long as the triangles are chosen sufficiently small. Note that there are no restrictions on the angles of the triangles. This does not mean, however, that the gradient of the function is well-approximated by the gradient of the piecewise linear function. Accurate approximation of the gradient will require that the largest angle in the element not be too large.
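The bound of Lemma 5.8-1 can be observed numerically on a single triangle. The sketch below is illustrative only and makes two assumptions that are not in the text: the triangle is acute, so that the smallest containing circle is the circumscribed circle, and the test function is f(x) = (c/2)‖x‖², whose Hessian is c times the identity. The function names are hypothetical.

```python
import math

def barycentric(X, x):
    """Barycentric coordinates of the point x in the triangle with vertices X."""
    (x1, y1), (x2, y2), (x3, y3) = X
    det = (x1 - x3) * (y2 - y3) - (x2 - x3) * (y1 - y3)
    b1 = ((x[0] - x3) * (y2 - y3) - (x2 - x3) * (x[1] - y3)) / det
    b2 = ((x1 - x3) * (x[1] - y3) - (x[0] - x3) * (y1 - y3)) / det
    return b1, b2, 1.0 - b1 - b2

def circumcircle(X):
    """Center and radius of the circle through the three vertices."""
    (ax, ay), (bx, by), (cx, cy) = X
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

def interpolation_error(f, X, x):
    """f(x) minus the linear interpolant of f at the vertices, evaluated at x."""
    b = barycentric(X, x)
    return f(x) - sum(bi * f(v) for bi, v in zip(b, X))

C = 3.0                                        # bound on the Hessian of f
f = lambda x: 0.5 * C * (x[0] ** 2 + x[1] ** 2)
TRI = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]     # an acute triangle
center, r = circumcircle(TRI)
bound = 0.5 * C * r * r

# Largest error over a grid of barycentric points inside the triangle.
n = 40
worst = max(abs(interpolation_error(
                f, TRI,
                tuple(sum(w * v[k] for w, v in zip((i / n, j / n, 1 - (i + j) / n), TRI))
                      for k in range(2))))
            for i in range(n + 1) for j in range(n + 1 - i))
```

For this triangle the circumcenter is (0.5, 0.375) with r = 0.625, so the bound is 0.5859375. The sampled error never exceeds it, and the error at the circumcenter attains it, which illustrates that the estimate cannot be improved.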

Lemma 5.8-2: Suppose that we are given a triangle (d = 2) or tetrahedron (d = 3) T with vertices

\[
X = \begin{bmatrix} x_1 & \ldots & x_{d+1} \end{bmatrix} = \begin{bmatrix} x_{1,1} & \ldots & x_{1,d+1} \\ \vdots & & \vdots \\ x_{d,1} & \ldots & x_{d,d+1} \end{bmatrix}
\]

such that X has rank d. Suppose that $f(x) \in C^2(T)$ satisfies

\[
\exists c_{f,T}\ \forall x \in T\ \forall s \in \mathbb{R}^d \quad |s^\top (\nabla_x \nabla_x^\top f)(x)\, s| \le c_{f,T}\, s^\top s .
\]

Let $p_{f,T}(x)$ be the linear function that interpolates f at the vertices of T. For d = 2, let αi be the length of the side opposite vertex i in T, divided by twice the area; for d = 3, let αi be the area of the face opposite vertex i in T, divided by 3 times the volume:

\[
\alpha_i = \Bigl\| \left(\frac{\partial x}{\partial \xi}\right)^{\!-\top} \nabla_\xi b^\top e_i \Bigr\| .
\]

Then for all x ∈ T,

\[
\| \nabla_x [f(x) - p_{f,T}(x)] \| \le c_{f,T} \max_j \frac{\sum_{i=1}^{d+1} \| x_j - x_i \| \alpha_i}{\sum_{i=1}^{d+1} \alpha_i}
+ \frac{c_{f,T}}{2}\, \frac{\sum_{i=1}^{d+1} \sum_{j<i} \| x_i - x_j \|^2 \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j} .
\]

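Proof: Define the interpolation error

\[
\forall \xi \in T_* \quad \varepsilon(\xi) \equiv f(x(\xi)) - p_{f,T}(\xi)
\]

and note that the proof of the previous lemma, together with $\sum_{i=1}^{d+1} e_i^\top \partial b / \partial \xi = 0$ and $\sum_{i=1}^{d+1} \nu_i e_i^\top \partial b / \partial \xi = I$, implies that

\[
0 = \Bigl[ \sum_{i=1}^{d+1} \nabla_x b^\top e_i \Bigr] \varepsilon(\xi)
= \sum_{i=1}^{d+1} \Bigl[ e_i^\top \frac{\partial b}{\partial \xi} \left(\frac{\partial x}{\partial \xi}\right)^{\!-1} \Bigr]^\top \Bigl\{ \nabla_\xi \varepsilon \cdot (\xi - \nu_i) - [x(\xi) - x(\nu_i)]^\top H(\nu_i) [x(\xi) - x(\nu_i)] \Bigr\}
\]
\[
= \left(\frac{\partial x}{\partial \xi}\right)^{\!-\top} \Bigl[ \sum_{i=1}^{d+1} (\xi - \nu_i) e_i^\top \frac{\partial b}{\partial \xi} \Bigr]^\top \nabla_\xi \varepsilon
- \sum_{i=1}^{d+1} \Bigl[ e_i^\top \frac{\partial b}{\partial \xi} \left(\frac{\partial x}{\partial \xi}\right)^{\!-1} \Bigr]^\top [x(\xi) - x(\nu_i)]^\top H(\nu_i) [x(\xi) - x(\nu_i)]
\]
\[
= - \left(\frac{\partial x}{\partial \xi}\right)^{\!-\top} \nabla_\xi \varepsilon
- \sum_{i=1}^{d+1} \Bigl[ e_i^\top \frac{\partial b}{\partial \xi} \left(\frac{\partial x}{\partial \xi}\right)^{\!-1} \Bigr]^\top [x(\xi) - x(\nu_i)]^\top H(\nu_i) [x(\xi) - x(\nu_i)]
\]
\[
= - \nabla_x \varepsilon
- \sum_{i=1}^{d+1} \Bigl[ e_i^\top \frac{\partial b}{\partial \xi} \left(\frac{\partial x}{\partial \xi}\right)^{\!-1} \Bigr]^\top [x(\xi) - x(\nu_i)]^\top H(\nu_i) [x(\xi) - x(\nu_i)] .
\]

It follows that

\[
\| \nabla_x \varepsilon \| \le \frac{c_{f,T}}{2} \sum_{i=1}^{d+1} \| x(\xi) - x(\nu_i) \|^2\, \Bigl\| \left(\frac{\partial x}{\partial \xi}\right)^{\!-\top} \nabla_\xi b^\top e_i \Bigr\|
= \frac{c_{f,T}}{2} \sum_{i=1}^{d+1} \| x(\xi) - x(\nu_i) \|^2 \alpha_i \equiv \gamma(x) .
\]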

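For d = 2, we have

\[
\left(\frac{\partial x}{\partial \xi}\right)^{\!-\top} \nabla_\xi b^\top =
\begin{bmatrix}
x_{2,2} - x_{2,3} & x_{2,3} - x_{2,1} & x_{2,1} - x_{2,2} \\
-x_{1,2} + x_{1,3} & -x_{1,3} + x_{1,1} & -x_{1,1} + x_{1,2}
\end{bmatrix} \frac{1}{\left| \frac{\partial x}{\partial \xi} \right|} .
\]

The columns of this array are rotations by 90 degrees of the sides opposite the vertex corresponding to the column index, divided by twice the area of the triangle. Thus for d = 2, αi is the length of the side opposite vertex i in T, divided by twice the area. For d = 3, we have

\[
(\nabla_\xi b^\top)^\top \left(\frac{\partial x}{\partial \xi}\right)^{\!-1}
= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix}
\left( \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix} \right)^{\!-1}
= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix}
\begin{bmatrix} x_1 - x_4 & x_2 - x_4 & x_3 - x_4 \end{bmatrix}^{-1}
\]
\[
= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix}
\begin{bmatrix}
[(x_2 - x_4) \times (x_3 - x_4)]^\top \\
[(x_3 - x_4) \times (x_1 - x_4)]^\top \\
[(x_1 - x_4) \times (x_2 - x_4)]^\top
\end{bmatrix} \frac{1}{\left| \frac{\partial x}{\partial \xi} \right|}
= \begin{bmatrix}
[(x_2 - x_4) \times (x_3 - x_4)]^\top \\
[(x_3 - x_4) \times (x_1 - x_4)]^\top \\
[(x_1 - x_4) \times (x_2 - x_4)]^\top \\
[(x_1 - x_2) \times (x_3 - x_2)]^\top
\end{bmatrix} \frac{1}{6 |T|} .
\]

Thus for d = 3, αi is twice the area of the face opposite vertex i in T, divided by 6 times the volume.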

The minimum of γ(x) occurs at the zero of its gradient, which is

\[ x_{\mathrm{in}} = \frac{\sum_{i=1}^{d+1} x(\nu_i) \alpha_i}{\sum_{i=1}^{d+1} \alpha_i} . \]


This turns out to be the center of the inscribed circle or sphere. Further,

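\[
\begin{aligned}
\gamma(x_{\mathrm{in}}) \frac{2}{c_{f,T}}
&= \sum_{i=1}^{d+1} \left\| x(\nu_i) - \frac{\sum_{j=1}^{d+1} x(\nu_j) \alpha_j}{\sum_{j=1}^{d+1} \alpha_j} \right\|^2 \alpha_i \\
&= \frac{\sum_{i=1}^{d+1} \big\| \sum_{j=1}^{d+1} (x(\nu_i) - x(\nu_j)) \alpha_j \big\|^2 \alpha_i}{(\sum_{j=1}^{d+1} \alpha_j)^2} \\
&= \frac{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \sum_{k=1}^{d+1} (x(\nu_i) - x(\nu_j)) \cdot (x(\nu_i) - x(\nu_k)) \, \alpha_j \alpha_k \alpha_i}{(\sum_{j=1}^{d+1} \alpha_j)^2} \\
&= \frac{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \sum_{k=1}^{d+1} [\|x(\nu_i)\|^2 - x(\nu_j) \cdot x(\nu_i) - x(\nu_i) \cdot x(\nu_k) + x(\nu_j) \cdot x(\nu_k)] \, \alpha_j \alpha_k \alpha_i}{(\sum_{j=1}^{d+1} \alpha_j)^2} \\
&= \frac{1}{2} \frac{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \sum_{k=1}^{d+1} [\|x(\nu_i)\|^2 + \|x(\nu_j)\|^2 - 2 x(\nu_i) \cdot x(\nu_j)] \, \alpha_j \alpha_k \alpha_i}{(\sum_{j=1}^{d+1} \alpha_j)^2} \\
&= \frac{1}{2} \frac{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \sum_{k=1}^{d+1} \|x(\nu_i) - x(\nu_j)\|^2 \, \alpha_j \alpha_k \alpha_i}{(\sum_{j=1}^{d+1} \alpha_j)^2} \\
&= \frac{1}{2} \frac{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \|x(\nu_i) - x(\nu_j)\|^2 \, \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j} \\
&= \frac{\sum_{i=1}^{d+1} \sum_{j<i} \|x(\nu_i) - x(\nu_j)\|^2 \, \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j}
\end{aligned}
\]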

For any x ∈ T we have

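\[ \|x - x_{\mathrm{in}}\| = \left\| x - \frac{\sum_{i=1}^{d+1} x(\nu_i) \alpha_i}{\sum_{i=1}^{d+1} \alpha_i} \right\| = \frac{\big\| \sum_{i=1}^{d+1} [x - x(\nu_i)] \alpha_i \big\|}{\sum_{i=1}^{d+1} \alpha_i} \le \frac{\sum_{i=1}^{d+1} \|x - x(\nu_i)\| \alpha_i}{\sum_{i=1}^{d+1} \alpha_i} \]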

Also note that this distance is maximized at one of the vertices of T . Since

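\[ \nabla_x \varepsilon(x) - \nabla_x \varepsilon(x_{\mathrm{in}}) = \int_0^1 \frac{d}{d\tau} \nabla_x \varepsilon(x_{\mathrm{in}}(1 - \tau) + x\tau) \, d\tau = \int_0^1 \frac{\partial^2 \varepsilon}{\partial x^2}\Big|_{x_{\mathrm{in}}(1-\tau) + x\tau} (x - x_{\mathrm{in}}) \, d\tau \]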


it follows that

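\[
\begin{aligned}
\|\nabla_x \varepsilon(x)\| &\le \|\nabla_x \varepsilon(x) - \nabla_x \varepsilon(x_{\mathrm{in}})\| + \|\nabla_x \varepsilon(x_{\mathrm{in}})\| \\
&\le \left\| \int_0^1 \frac{\partial^2 \varepsilon}{\partial x^2}\Big|_{x_{\mathrm{in}}(1-\tau) + x\tau} (x - x_{\mathrm{in}}) \, d\tau \right\| + \gamma(x_{\mathrm{in}}) \\
&\le c_{f,T} \max_j \|x(\nu_j) - x_{\mathrm{in}}\| + \frac{c_{f,T}}{2} \frac{\sum_{i=1}^{d+1} \sum_{j<i} \|x(\nu_i) - x(\nu_j)\|^2 \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j} \\
&\le c_{f,T} \max_j \frac{\sum_{i=1}^{d+1} \|x(\nu_j) - x(\nu_i)\| \alpha_i}{\sum_{i=1}^{d+1} \alpha_i} + \frac{c_{f,T}}{2} \frac{\sum_{i=1}^{d+1} \sum_{j<i} \|x(\nu_i) - x(\nu_j)\|^2 \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j} \qquad \Box
\end{aligned}
\]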

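This error bound can be simplified for $d = 2$. Let

\[ \alpha_i = \frac{L_i}{2A} \]

where $L_i$ is the length of the side of the triangle opposite vertex $i$, and $A$ is the area of the triangle. For simplicity, assume that the lengths of the sides have been ordered

\[ L_1 \le L_2 \le L_3 . \]

Thus the angles $\theta_i$ at the vertices are ordered in the same way. Also, let $P = \sum_{i=1}^3 L_i$ be the perimeter of the triangle, and let $A = \frac{1}{2} L_1 L_2 \sin\theta_3$ be the area of the triangle. Then

\[
\begin{aligned}
\frac{\sum_{i=1}^{d+1} \sum_{j<i} \|x(\nu_i) - x(\nu_j)\|^2 \alpha_j \alpha_i}{\sum_{j=1}^{d+1} \alpha_j}
&= \frac{1}{2AP} \left\{ \|x(\nu_2) - x(\nu_1)\|^2 L_2 L_1 + \|x(\nu_3) - x(\nu_1)\|^2 L_3 L_1 + \|x(\nu_3) - x(\nu_2)\|^2 L_3 L_2 \right\} \\
&= \frac{1}{2AP} \left\{ L_3^2 L_2 L_1 + L_2^2 L_3 L_1 + L_1^2 L_3 L_2 \right\} = \frac{L_1 L_2 L_3}{2A}
\end{aligned}
\]

and

\[ \frac{\sum_{i=1}^{d+1} \|x(\nu_j) - x(\nu_i)\| \alpha_i}{\sum_{i=1}^{d+1} \alpha_i} = \frac{2 \prod_{i \ne j} L_i}{P} \]

In other words,

\[
\begin{aligned}
\|\nabla_x \varepsilon(x)\| &\le \frac{2 c_{f,T}}{P} \max_j \prod_{i \ne j} L_i + \frac{c_{f,T}}{4A} \prod_{i=1}^3 L_i \\
&= c_{f,T} \left[ \frac{2 L_2 L_3}{P} + \frac{L_1 L_2 L_3}{4A} \right] = c_{f,T} L_3 \left[ \frac{2 L_2}{L_1 + L_2 + L_3} + \frac{1}{2 \sin\theta_3} \right]
\end{aligned}
\]

Thus the gradient of the error should be small, provided that the largest angle in the triangle is not near $\pi$.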
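The simplification above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `gradient_error_factor` and the sample triangle are illustrative assumptions. It evaluates the general bound (divided by $c_{f,T}$) using $\alpha_i = L_i/(2A)$, and compares it with the closed form $L_3[2L_2/P + 1/(2\sin\theta_3)]$.

```python
import math

def gradient_error_factor(x1, x2, x3):
    """Return (general, closed_form): the gradient error bound divided by c_{f,T}.

    Hypothetical helper for illustration; vertices are 2D tuples.
    """
    verts = [x1, x2, x3]

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # L[i] is the length of the side opposite vertex i
    L = [dist(verts[(i + 1) % 3], verts[(i + 2) % 3]) for i in range(3)]
    area = 0.5 * abs((x2[0] - x1[0]) * (x3[1] - x1[1])
                     - (x3[0] - x1[0]) * (x2[1] - x1[1]))
    alpha = [Li / (2.0 * area) for Li in L]
    total = sum(alpha)

    # first term: max over vertices j of the alpha-weighted mean distance
    first = max(
        sum(dist(verts[j], verts[i]) * alpha[i] for i in range(3)) / total
        for j in range(3)
    )
    # second term: one half of the pairwise sum over j < i
    second = 0.5 * sum(
        dist(verts[i], verts[j]) ** 2 * alpha[i] * alpha[j]
        for i in range(3) for j in range(i)
    ) / total
    general = first + second

    # closed form with sides ordered L1 <= L2 <= L3 and A = (1/2) L1 L2 sin(theta3)
    L1, L2, L3 = sorted(L)
    sin_theta3 = 2.0 * area / (L1 * L2)
    closed = L3 * (2.0 * L2 / (L1 + L2 + L3) + 1.0 / (2.0 * sin_theta3))
    return general, closed
```

For the 3-4-5 right triangle with vertices (0,0), (4,0), (0,3), both expressions evaluate to 35/6. As the largest angle approaches $\pi$, $\sin\theta_3 \to 0$ and the second term grows without bound.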


5.9 Condition Number Estimates for Linear Elements

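Lemma 5.9-1: Suppose that we are given a triangle ($d = 2$) or tetrahedron ($d = 3$) $T$ in $\mathbb{R}^d$ with area ($d = 2$) or volume ($d = 3$) $|T| > 0$. Let $\beta_1 < \ldots < \beta_{d+1}$ be the eigenvalues of

\[ B \equiv \int_{T_*} b(\xi) b(\xi)^\top \, d\xi \]

Suppose that $f(x)$ is linear, with values

\[ \mathbf{f} = \begin{bmatrix} f_1 \\ \vdots \\ f_{d+1} \end{bmatrix} \]

at the vertices of $T$. Then

\[ \beta_1 \, d! \, |T| \, \|\mathbf{f}\|^2 \le \int_T f(x)^2 \, dx \le \beta_{d+1} \, d! \, |T| \, \|\mathbf{f}\|^2 \]

Proof: Since $f$ is linear, $f(x(\xi)) = \mathbf{f}^\top b(\xi)$ where $b(\xi)$ are the barycentric coordinates for the canonical element $T_*$. Note that

\[ \int_T f(x)^2 \, dx = \int_{T_*} [\mathbf{f}^\top b(\xi)]^2 \Big|\frac{\partial x}{\partial \xi}\Big| \, d\xi = \Big|\frac{\partial x}{\partial \xi}\Big| \, \mathbf{f}^\top \int_{T_*} b(\xi) b(\xi)^\top \, d\xi \, \mathbf{f} = d! \, |T| \, \mathbf{f}^\top B \mathbf{f} . \]

The quadratic form can be bounded above and below by the eigenvalues of the matrix:

\[ \beta_1 \|\mathbf{f}\|^2 \le \mathbf{f}^\top B \mathbf{f} \le \beta_{d+1} \|\mathbf{f}\|^2 . \qquad \Box \]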

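If $d = 2$ it is straightforward to compute

\[ B = \begin{bmatrix} \frac{1}{12} & \frac{1}{8} & -\frac{1}{24} \\ \frac{1}{8} & \frac{1}{4} & -\frac{1}{24} \\ -\frac{1}{24} & -\frac{1}{24} & \frac{1}{12} \end{bmatrix} \]

with eigenvalues $\beta_1 \approx 0.011674$ and $\beta_3 \approx 0.32985$. For $d = 3$ we have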

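\[ B = \begin{bmatrix} \frac{1}{60} & \frac{1}{40} & \frac{1}{30} & -\frac{1}{30} \\ \frac{1}{40} & \frac{1}{20} & \frac{1}{15} & -\frac{7}{120} \\ \frac{1}{30} & \frac{1}{15} & \frac{1}{10} & -\frac{3}{40} \\ -\frac{1}{30} & -\frac{7}{120} & -\frac{3}{40} & \frac{1}{12} \end{bmatrix} \]

with eigenvalues $\beta_1 \approx 0.0022997$ and $\beta_4 \approx 0.22818$.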
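The quoted extreme eigenvalues can be reproduced with a few lines of code. The sketch below is an illustration only: it uses a cyclic Jacobi iteration written in plain Python (an assumed choice of method, not the author's), applied to the matrices $B$ exactly as printed above. The names `jacobi_eigenvalues`, `B2` and `B3` are hypothetical.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-24):
    """Eigenvalues of a small symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # rotation angle that zeroes the (p, q) entry
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # columns: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):          # rows: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

# Matrices B as stated in the text for d = 2 and d = 3
B2 = [[1/12, 1/8, -1/24],
      [1/8, 1/4, -1/24],
      [-1/24, -1/24, 1/12]]
B3 = [[1/60, 1/40, 1/30, -1/30],
      [1/40, 1/20, 1/15, -7/120],
      [1/30, 1/15, 1/10, -3/40],
      [-1/30, -7/120, -3/40, 1/12]]

beta2 = jacobi_eigenvalues(B2)
beta3 = jacobi_eigenvalues(B3)
print("d = 2:", beta2[0], beta2[-1])
print("d = 3:", beta3[0], beta3[-1])
```

The ratio $\beta_{d+1}/\beta_1$ is the quantity that enters the condition number estimates in the results that follow.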

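Lemma 5.9-2: Suppose that $\Omega \subset \mathbb{R}^d$ is a union of disjoint triangles or tetrahedra $T_e$, $1 \le e \le E$, such that no vertex of these elements is shared by fewer than $s_{\min}$ elements, or more than $s_{\max}$ elements. Suppose that we order the vertices $x_n$, $1 \le n \le N$. If $f(x)$ is piecewise linear on these triangles, let $\mathbf{f} \in \mathbb{R}^N$ be the vector of values $f(x_n)$. If $\beta_1$ and $\beta_{d+1}$ are as described in lemma 5.9-1 then

\[ \beta_1 \, d! \, s_{\min} \min_e |T_e| \, \|\mathbf{f}\|^2 \le \int_\Omega f(x)^2 \, dx \le \beta_{d+1} \, d! \, s_{\max} \max_e |T_e| \, \|\mathbf{f}\|^2 \]

Proof: Using lemma 5.9-1 we compute

\[ \int_\Omega f(x)^2 \, dx = \sum_{e=1}^E \int_{T_e} f(x)^2 \, dx \le \sum_{e=1}^E \beta_{d+1} \, d! \, |T_e| \sum_{x_n \text{ vertex of } T_e} f(x_n)^2 \le \beta_{d+1} \, d! \, s_{\max} \max_e |T_e| \, \|\mathbf{f}\|^2 \]

and

\[ \int_\Omega f(x)^2 \, dx = \sum_{e=1}^E \int_{T_e} f(x)^2 \, dx \ge \sum_{e=1}^E \beta_1 \, d! \, |T_e| \sum_{x_n \text{ vertex of } T_e} f(x_n)^2 \ge \beta_1 \, d! \, s_{\min} \min_e |T_e| \, \|\mathbf{f}\|^2 \qquad \Box \]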

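Corollary 5.9-3: Suppose that $\Omega \subset \mathbb{R}^d$ is a union of disjoint triangles or tetrahedra $T_e$, $1 \le e \le E$, such that no vertex of these triangles is shared by fewer than $s_{\min}$ elements, or more than $s_{\max}$ elements. Suppose that we order the vertices $x_n$, $1 \le n \le N$. Let $V_n(x)$ be the piecewise linear function such that $V_n(x_m) = \delta_{mn}$. Define the Gram matrix $G \in \mathbb{R}^{N \times N}$ to have entries

\[ e_m^\top G e_n = \int_\Omega V_m(x) V_n(x) \, dx \]

If $\beta_1$ and $\beta_{d+1}$ are as described in lemma 5.9-1 then the eigenvalues of $G$ lie between $\beta_1 \, d! \, s_{\min} \min_e |T_e|$ and $\beta_{d+1} \, d! \, s_{\max} \max_e |T_e|$, and the condition number of $G$ satisfies

\[ \kappa(G) \le \frac{\beta_{d+1} \, s_{\max} \max_e |T_e|}{\beta_1 \, s_{\min} \min_e |T_e|} \]

Proof: If $f(x)$ is piecewise linear and $\mathbf{f}$ is the vector of vertex values $f(x_n)$, then $\mathbf{f}^\top G \mathbf{f} = \int_\Omega f(x)^2 \, dx$. The Rayleigh quotient $\mathbf{f}^\top G \mathbf{f} / \mathbf{f}^\top \mathbf{f}$ can be used to bound the eigenvalues. Further, we can bound the norm of $G$ by

\[ \|G\| = \max_{\mathbf{f} \ne 0} \frac{\mathbf{f}^\top G \mathbf{f}}{\mathbf{f}^\top \mathbf{f}} \le \beta_{d+1} \, d! \, s_{\max} \max_e |T_e| \]

Since $G$ is symmetric and nonsingular, we can let $\mathbf{h} = G^{1/2} \mathbf{f}$ and see that

\[ \beta_1 \, d! \, s_{\min} \min_e |T_e| \, \mathbf{h}^\top G^{-1} \mathbf{h} \le \mathbf{h}^\top \mathbf{h} \]

This implies that

\[ \|G^{-1}\| = \max_{\mathbf{h} \ne 0} \frac{\mathbf{h}^\top G^{-1} \mathbf{h}}{\mathbf{h}^\top \mathbf{h}} \le \frac{1}{\beta_1 \, d! \, s_{\min} \min_e |T_e|} \qquad \Box \]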

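Lemma 5.9-4: Suppose that we are given a triangle $T$ in two dimensions with vertices

\[ X = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \]

such that $X$ has rank 2. Let the angles at these vertices be $\theta_1, \theta_2, \theta_3$ and let the lengths of the sides opposite these vertices be $\ell_1, \ell_2, \ell_3$. If the array of nodal basis functions on $T$ is $V(x(\xi)) = b(\xi)$, then the stiffness matrix for the Laplacian on $T$ is

\[
\begin{aligned}
\int_T \frac{\partial V}{\partial x} \Big(\frac{\partial V}{\partial x}\Big)^\top dx
&= \frac{1}{2} \begin{bmatrix} \cot\theta_2 + \cot\theta_3 & -\cot\theta_3 & -\cot\theta_2 \\ -\cot\theta_3 & \cot\theta_3 + \cot\theta_1 & -\cot\theta_1 \\ -\cot\theta_2 & -\cot\theta_1 & \cot\theta_1 + \cot\theta_2 \end{bmatrix} \\
&= \frac{1}{8|T|} \begin{bmatrix} 2\ell_1^2 & \ell_3^2 - \ell_1^2 - \ell_2^2 & \ell_2^2 - \ell_1^2 - \ell_3^2 \\ \ell_3^2 - \ell_1^2 - \ell_2^2 & 2\ell_2^2 & \ell_1^2 - \ell_2^2 - \ell_3^2 \\ \ell_2^2 - \ell_1^2 - \ell_3^2 & \ell_1^2 - \ell_2^2 - \ell_3^2 & 2\ell_3^2 \end{bmatrix}
\end{aligned}
\]

The characteristic polynomial for this matrix is

\[ 0 = \lambda \Big[ \lambda^2 - (\cot\theta_1 + \cot\theta_2 + \cot\theta_3) \lambda + \frac{3}{4} \Big] = \lambda \Big[ \lambda^2 - \frac{\ell_1^2 + \ell_2^2 + \ell_3^2}{4|T|} \lambda + \frac{3}{4} \Big] \]

Finally, the largest eigenvalue of the local stiffness matrix satisfies

\[ \frac{\ell_1^2 + \ell_2^2 + \ell_3^2}{8|T|} \le \lambda_{\max} \le \frac{\ell_1^2 + \ell_2^2 + \ell_3^2}{4|T|} \]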


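Proof: Let us recall some facts about triangles. The area of a triangle with side of length $\ell_1$ and adjacent angles $\theta_2$ and $\theta_3$ is

\[ |T| = \frac{\ell_1^2}{2} \, \frac{\tan\theta_2 \tan\theta_3}{\tan\theta_2 + \tan\theta_3} = \frac{\ell_1^2}{2} \, \frac{\sin\theta_2 \sin\theta_3}{\sin\theta_1} \]

The area of a triangle with sides $\ell_1$ and $\ell_2$ intersecting with angle $\theta_3$ is

\[ |T| = \frac{1}{2} \ell_1 \ell_2 \sin\theta_3 \]

Heron's formula for the area of a triangle with sides of length $\ell_1, \ell_2, \ell_3$ and perimeter $p = \ell_1 + \ell_2 + \ell_3$ is

\[ |T| = \frac{1}{4} \sqrt{p (p - 2\ell_1)(p - 2\ell_2)(p - 2\ell_3)} \]

Multiplying out the terms in Heron's formula leads to the equation

\[ 16|T|^2 = (\ell_1 + \ell_2 + \ell_3)(\ell_1 + \ell_2 - \ell_3)(\ell_2 + \ell_3 - \ell_1)(\ell_3 + \ell_1 - \ell_2) = 2(\ell_1^2 \ell_2^2 + \ell_2^2 \ell_3^2 + \ell_3^2 \ell_1^2) - \ell_1^4 - \ell_2^4 - \ell_3^4 \]

It will also be helpful to recall the law of cosines:

\[ \ell_3^2 = \ell_1^2 + \ell_2^2 - 2 \ell_1 \ell_2 \cos\theta_3 \]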

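Let us turn now to the proof of the lemma. Recall that $V(x(\xi)) = b(\xi)$ satisfies

\[ \Big(\frac{\partial V}{\partial x}\Big)^\top = \Big( \frac{\partial V^*}{\partial \xi} \Big[\frac{\partial x}{\partial \xi}\Big]^{-1} \Big)^\top = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x_2 - x_3 & x_3 - x_1 & x_1 - x_2 \end{bmatrix} \frac{1}{|\frac{\partial x}{\partial \xi}|} \]

Also recall that the determinant of the Jacobian of the coordinate transformation is twice the area of the triangle $|T|$. It follows that the element stiffness matrix is

\[
\begin{aligned}
\iint_T \frac{\partial V}{\partial x} \Big(\frac{\partial V}{\partial x}\Big)^\top dx_1 \, dx_2
&= \Big( \frac{\partial V^*}{\partial \xi} \Big[\frac{\partial x}{\partial \xi}\Big]^{-1} \Big) \iint_{T_*} d\xi_1 \, d\xi_2 \, \Big|\frac{\partial x}{\partial \xi}\Big| \, \Big( \frac{\partial V^*}{\partial \xi} \Big[\frac{\partial x}{\partial \xi}\Big]^{-1} \Big)^\top \\
&= \frac{1}{2|T|} \begin{bmatrix} (x_2 - x_3)^\top \\ (x_3 - x_1)^\top \\ (x_1 - x_2)^\top \end{bmatrix} \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \frac{1}{2}(2|T|) \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x_2 - x_3 & x_3 - x_1 & x_1 - x_2 \end{bmatrix} \frac{1}{2|T|} \\
&= \frac{1}{4|T|} \begin{bmatrix} (x_2 - x_3)^\top \\ (x_3 - x_1)^\top \\ (x_1 - x_2)^\top \end{bmatrix} \begin{bmatrix} x_2 - x_3 & x_3 - x_1 & x_1 - x_2 \end{bmatrix}
\end{aligned}
\]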


Note that, for example,

\[ (x_2 - x_3)^\top (x_2 - x_3) / (2|T|) = \ell_1^2 / (2|T|) = \frac{\tan\theta_2 + \tan\theta_3}{\tan\theta_2 \tan\theta_3} = \cot\theta_2 + \cot\theta_3 \]

and from the law of cosines

\[ (x_2 - x_3)^\top (x_3 - x_1) = -\ell_1 \ell_2 \cos\theta_3 = \frac{1}{2} (\ell_3^2 - \ell_1^2 - \ell_2^2) \]

These results prove the matrix forms for the local stiffness matrix in the statement of the lemma.

Direct computation of the characteristic polynomial leads to the form

\[ 0 = \lambda \Big[ \lambda^2 - \frac{\ell_1^2 + \ell_2^2 + \ell_3^2}{4|T|} \lambda - \frac{3}{64|T|^2} (\ell_1^4 + \ell_2^4 + \ell_3^4 - 2\ell_1^2 \ell_2^2 - 2\ell_2^2 \ell_3^2 - 2\ell_3^2 \ell_1^2) \Big] \]

Heron's formula leads to the result claimed in the lemma. The inequalities for the largest eigenvalue of the stiffness matrix follow from the quadratic formula. $\Box$
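As a sanity check on lemma 5.9-4, the sketch below builds the element stiffness matrix from the edge vectors, rebuilds it from the cotangents of the vertex angles, and evaluates the nonzero eigenvalues from the quadratic factor $\lambda^2 - (\mathrm{trace})\lambda + 3/4$. The function names are hypothetical and the code is an illustration under the lemma's conventions (side $i$ is opposite vertex $i$), not the author's program.

```python
import math

def triangle_stiffness(x1, x2, x3):
    """Stiffness matrix (1/(4|T|)) E^T E with E = [x2-x3, x3-x1, x1-x2]."""
    v = [x1, x2, x3]
    edges = [(v[(i + 1) % 3][0] - v[(i + 2) % 3][0],
              v[(i + 1) % 3][1] - v[(i + 2) % 3][1]) for i in range(3)]
    area = 0.5 * abs(edges[0][0] * edges[1][1] - edges[0][1] * edges[1][0])
    K = [[(edges[i][0] * edges[j][0] + edges[i][1] * edges[j][1]) / (4.0 * area)
          for j in range(3)] for i in range(3)]
    return K, area

def cotangent_stiffness(x1, x2, x3):
    """Same matrix assembled from the cotangents of the vertex angles."""
    v = [x1, x2, x3]
    cot = []
    for i in range(3):
        p, q, r = v[i], v[(i + 1) % 3], v[(i + 2) % 3]
        a = (q[0] - p[0], q[1] - p[1])
        b = (r[0] - p[0], r[1] - p[1])
        cot.append((a[0] * b[0] + a[1] * b[1]) / abs(a[0] * b[1] - a[1] * b[0]))
    c1, c2, c3 = cot
    return [[0.5 * (c2 + c3), -0.5 * c3, -0.5 * c2],
            [-0.5 * c3, 0.5 * (c3 + c1), -0.5 * c1],
            [-0.5 * c2, -0.5 * c1, 0.5 * (c1 + c2)]]

def nonzero_eigenvalues(K):
    """Roots of lambda^2 - trace*lambda + 3/4, the nonzero eigenvalues of K."""
    trace = K[0][0] + K[1][1] + K[2][2]
    root = math.sqrt(trace * trace - 3.0)
    return 0.5 * (trace - root), 0.5 * (trace + root)

K, area = triangle_stiffness((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
K_cot = cotangent_stiffness((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
lam_min, lam_max = nonzero_eigenvalues(K)
print(K, lam_min, lam_max)
```

For the unit right triangle the trace is $(\ell_1^2 + \ell_2^2 + \ell_3^2)/(4|T|) = 2$, so the nonzero eigenvalues are 1/2 and 3/2, and $\lambda_{\max}$ lies between half the trace and the trace as the lemma states.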

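Lemma 5.9-5: Suppose that $\Omega \subset \mathbb{R}^2$ is a union of disjoint triangles $T_e$, $1 \le e \le E$, such that no vertex of these elements is shared by fewer than $s_{\min}$ elements, or more than $s_{\max}$ elements. Suppose that there is a line $L$ such that for all lines $L'$ parallel to $L$ the length of the line segment $L' \cap \Omega$ is at most $R$. If $f(x)$ is piecewise linear on $\Omega$, then

\[ \frac{2 \beta_1 s_{\min}}{R^2} \min_e |T_e| \, \|\mathbf{f}\|^2 \le \int_\Omega \nabla_x f \cdot \nabla_x f(x) \, dx \le s_{\max} \max_e \sum_{i: \ell_{ie} \text{ is length of a side of } T_e} \frac{\ell_{ie}^2}{4|T_e|} \, \|\mathbf{f}\|^2 \]

Proof: Using the Poincare inequality (??) we see that

\[ \int_\Omega \nabla_x f \cdot \nabla_x f(x) \, dx \ge \frac{1}{R^2} \int_\Omega f(x)^2 \, dx \]

Lemma 5.9-2 completes the lower bound in the lemma. Using lemma 5.9-4 we compute

\[ \int_\Omega \nabla_x f \cdot \nabla_x f(x) \, dx = \sum_e \int_{T_e} \nabla_x f \cdot \nabla_x f(x) \, dx \le \sum_e \frac{\ell_{1e}^2 + \ell_{2e}^2 + \ell_{3e}^2}{4|T_e|} \sum_{x_i \text{ a vertex of } T_e} f(x_i)^2 \]

The claimed upper bound follows immediately. $\Box$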


Corollary 5.9-6: Suppose that $\Omega \subset \mathbb{R}^2$ is a union of disjoint triangles $T_e$, $1 \le e \le E$, such that no vertex of these elements is shared by fewer than $s_{\min}$ elements, or more than $s_{\max}$ elements. Suppose that there is a line $L$ such that for all lines $L'$ parallel to $L$ the length of the line segment $L' \cap \Omega$ is at most $R$. The condition number of the stiffness matrix $A$ satisfies

\[ \kappa(A) \le \frac{s_{\max} R^2}{8 s_{\min} \beta_1} \, \frac{\max_e \big( \sum_{i: \ell_{ie} \text{ is length of a side of } T_e} \ell_{ie}^2 \big) / |T_e|}{\min_e |T_e|} \]

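Lemma 5.9-7: Suppose that $\Omega \subset \mathbb{R}^2$ is a union of disjoint triangles $T_e$, $1 \le e \le E$, such that no vertex of these triangles is shared by fewer than $s_{\min}$ triangles. Then

\[ \sup_{u \text{ piecewise linear on } \Omega, \; u = 0 \text{ on } \partial\Omega} \frac{\int_\Omega \|\nabla_x u\|^2 \, dx}{\int_\Omega u^2 \, dx} \ge \frac{1}{2 s_{\min}} \, \frac{\max_e \big( \max_{\ell = \text{ length of side of } T_e} \ell^2 / |T_e| \big)}{\min_e |T_e|} \]

Proof: Let $u(x)$ be piecewise linear on the union of triangles $\Omega$, and suppose that $u(x)$ is one at exactly one interior vertex $x$ of $\Omega$. Let $T$ be a triangle that has $x$ as its $n$th vertex. Then

\[ \int_T \|\nabla_x u\|^2 \, dx = e_n^\top \frac{1}{8|T|} \begin{bmatrix} 2\ell_1^2 & \ell_3^2 - \ell_1^2 - \ell_2^2 & \ell_2^2 - \ell_1^2 - \ell_3^2 \\ \ell_3^2 - \ell_1^2 - \ell_2^2 & 2\ell_2^2 & \ell_1^2 - \ell_2^2 - \ell_3^2 \\ \ell_2^2 - \ell_1^2 - \ell_3^2 & \ell_1^2 - \ell_2^2 - \ell_3^2 & 2\ell_3^2 \end{bmatrix} e_n = \frac{\ell_n^2}{4|T|} \]

Furthermore,

\[ \int_T u^2 \, dx = e_n^\top \begin{bmatrix} \frac{1}{12} & \frac{1}{8} & -\frac{1}{24} \\ \frac{1}{8} & \frac{1}{4} & -\frac{1}{24} \\ -\frac{1}{24} & -\frac{1}{24} & \frac{1}{12} \end{bmatrix} e_n \, 2|T| \le \frac{1}{2} |T| \]

The result follows from these formulas. $\Box$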


Corollary 5.9-8: Suppose that $\Omega \subset \mathbb{R}^2$ is a union of disjoint triangles $T_e$, $1 \le e \le E$, such that no vertex of these triangles is shared by fewer than $s_{\min}$ triangles or more than $s_{\max}$ of these triangles. Let $\beta_1$ be the constant defined in lemma 5.9-1. Let $\ell_{ie}$, $1 \le i \le 3$ be the lengths of the sides of $T_e$, and

h ≡ maxemax1≤i≤3

`ie

Further, suppose that∃C1 ∀1 ≤ e ≤ E |Te| ≥ C1h

2

and

∃C2 ∀1 ≤ e ≤ E

3∑i=1

`2ie ≤ C2|Te|

Then √3

8C1smin≤ h2 sup

f∈C0(Ω):f linear on each Te

∫Ω ‖∇xf‖2 dx∫

Ω f2 dx

≤ smax

smin

C2

2β1C1

Proof: From lemmas 5.9-5 and 5.9-2 we have

∫Ω‖∇xf‖2 dx ≤ smax max

e

∑3i=1 `

2ie

|Te|

∫Ω f

2 dx4β1smin mine |Te|

The upper bound in the claim follows from this inequality and the inequalitiesin the assumption. From lemma 5.9-7 and lemma 5.9-2 we see that

supf∈C0(Ω):f linear on each Te

∫Ω ‖∇xf‖2 dx∫

Ω f2 dx

≥maxe

∑3i=1 `

2ie/|Te|

2smin mine |Te|

The numerator is smallest for equilateral triangles, and the lower bound in theclaim follows from the assumptions. 2


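Lemma 5.9-9: Suppose that we are given a tetrahedron $T \subset \mathbb{R}^3$ with vertices

\[ X = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix} \]

such that $X$ has rank 3. Let the lengths of the edges be $\ell_{ij} = \|x_i - x_j\|$, let the area of the side opposite vertex $x_i$ be $A_i$, and let the volume of the tetrahedron be $|T|$. If the array of nodal basis functions on $T$ is $V(x(\xi)) = b(\xi)$, then the stiffness matrix for the Laplacian on $T$ is

\[
\int_T \frac{\partial V}{\partial x} \Big(\frac{\partial V}{\partial x}\Big)^\top dx = \frac{1}{36|T|}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{bmatrix}
\begin{bmatrix} [(x_2 - x_4) \times (x_3 - x_4)]^\top \\ [(x_3 - x_4) \times (x_1 - x_4)]^\top \\ [(x_1 - x_4) \times (x_2 - x_4)]^\top \end{bmatrix}
\begin{bmatrix} (x_2 - x_4) \times (x_3 - x_4) & (x_3 - x_4) \times (x_1 - x_4) & (x_1 - x_4) \times (x_2 - x_4) \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\]

The characteristic polynomial for this matrix is

\[ 0 = \lambda \Big[ \lambda^3 - \frac{\sum_{i=1}^4 A_i^2}{9|T|} \lambda^2 + \frac{\sum_{1 \le i < j \le 4} \ell_{ij}^2}{36} \lambda - \frac{|T|}{9} \Big] . \]

Finally, the largest eigenvalue of the local stiffness matrix satisfies

\[ \frac{\sum_{i=1}^4 A_i^2}{27|T|} \le \lambda_{\max} \le \frac{\sum_{i=1}^4 A_i^2}{9|T|} \]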

Proof: Before we begin, note that the area of the side opposite vertex xi isAi, where

A1 =12‖(x2 − x4)× (x3 − x4)‖ ,

A2 =12‖(x3 − x1)× (x4 − x1)‖ ,

A3 =12‖(x4 − x2)× (x1 − x2)‖ ,

A4 =12‖(x1 − x3)× (x2 − x3)‖ ,


Further, the volume of the tetrahedron is

|T | = 16(x1 − x4) · (x2 − x4)× (x3 − x4)

=16(x2 − x1) · (x3 − x1)× (x4 − x1)

=16(x3 − x2) · (x4 − x2)× (x1 − x2)

=16(x4 − x3) · (x1 − x3)× (x2 − x3) .

Recall that x(ξ) = Xb(ξ) where b(ξ) is the vector of barycentric coordinates, so

∂x∂ξ

= X∂b

∂ξ=[x1 − x4 x2 − x4 x3 − x4

]Also recall that V (x(ξ)) = b(ξ) satisfies

∂V

∂x=∂b

∂ξ[∂x∂ξ

]−1

=[

0 1−1 0

] [x2 − x3 x3 − x1 x1 − x2

] 1|∂x∂ξ |

Since(∂x∂ξ

)−>|∂x∂ξ| =

[(x2 − x4)× (x3 − x4) (x3 − x4)× (x1 − x4) (x1 − x4)× (x2 − x4)

]it is easy to compute the element stiffness matrix∫

T

∂V

∂x(∂V

∂x)> dx

=∂b

∂ξ[∂x∂ξ

]−1

∫T∗

dξ|∂x∂ξ|(∂b

∂ξ[∂x∂ξ

]−1

)>

=1

36|T |

1 0 00 1 00 0 1−1 −1 −1

[(x2 − x4)× (x3 − x4)]>

[(x3 − x4)× (x1 − x4)]>

[(x1 − x4)× (x2 − x4)]>

[(x2 − x4)× (x3 − x4) (x3 − x4)× (x1 − x4) (x1 − x4)× (x2 − x4)

] 1 0 0 −10 1 0 −10 0 1 −1


A very tedious calculation shows that the characteristic polynomial for this matrix is as claimed in the lemma. The bounds on the largest eigenvalue follow from the fact that the coefficient of the $\lambda^3$ term in the characteristic polynomial is the sum of the eigenvalues. $\Box$
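The "very tedious calculation" can at least be spot-checked on a single element. The sketch below is illustrative only: the helper names are hypothetical, and the check is carried out for the unit tetrahedron alone. It assembles the $4 \times 4$ element stiffness matrix from the cross products in the lemma and compares $\det(K - \lambda I)$ with the polynomial $\lambda[\lambda^3 - (\sum A_i^2 / 9|T|)\lambda^2 + (\sum_{i<j} \ell_{ij}^2 / 36)\lambda - |T|/9]$.

```python
import math

def sub(a, b):
    return tuple(a[k] - b[k] for k in range(3))

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def tet_stiffness(x1, x2, x3, x4):
    """Return (K, volume, face_areas) for the tetrahedron with the given vertices."""
    n = [cross(sub(x2, x4), sub(x3, x4)),
         cross(sub(x3, x4), sub(x1, x4)),
         cross(sub(x1, x4), sub(x2, x4))]
    # fourth row of the product with [I; -1 -1 -1]
    n.append(tuple(-(n[0][k] + n[1][k] + n[2][k]) for k in range(3)))
    vol = abs(dot(sub(x1, x4), n[0])) / 6.0
    K = [[dot(n[i], n[j]) / (36.0 * vol) for j in range(4)] for i in range(4)]
    areas = [0.5 * math.sqrt(dot(ni, ni)) for ni in n]
    return K, vol, areas

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    size, d = len(a), 1.0
    for c in range(size):
        piv = max(range(c, size), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, size):
            f = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= f * a[c][k]
    return d

def claimed_polynomial(lam, verts, vol, areas):
    edge_sq = sum(dot(sub(verts[i], verts[j]), sub(verts[i], verts[j]))
                  for i in range(4) for j in range(i))
    c2 = sum(A * A for A in areas) / (9.0 * vol)
    return lam * (lam ** 3 - c2 * lam ** 2 + edge_sq / 36.0 * lam - vol / 9.0)

verts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
K, vol, areas = tet_stiffness(*verts)

def char_poly(lam):
    shifted = [[K[i][j] - (lam if i == j else 0.0) for j in range(4)] for i in range(4)]
    return det(shifted)
```

For this element the eigenvalues are 0, 1/6, 1/6 and 2/3, the trace is $\sum A_i^2 / (9|T|) = 1$, and the largest eigenvalue indeed lies between one third of the trace and the trace.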

Lemma 5.9-10: Suppose that $\Omega \subset \mathbb{R}^3$ is a union of disjoint tetrahedra $T_e$, $1 \le e \le E$, such that no vertex of these elements is shared by fewer than $s_{\min}$ elements, or more than $s_{\max}$ elements. Suppose that there is a line $L$ such that for all lines $L'$ parallel to $L$ the length of the line segment $L' \cap \Omega$ is at most $R$. If $f(x)$ is piecewise linear on $\Omega$ and $\mathbf{f}$ is the vector of values of $f$ at the vertices, then

2β1smin

R2mine|Te|‖f‖2 ≤

∫Ω∇xf · ∇xf(x) dx ≤ smax max

e

∑i:Aie is area of a side of Te

A2ie

9|Te|‖f‖2

Proof: Using the Poincare inequality (??) we see that∫Ω∇xf · ∇xf(x) dx ≥ 1

R2

∫Ωf(x)2 dx

Lemma 5.9-2 completes the lower bound in the lemma. Using lemma 5.9-9 wecompute

∫Ω∇xf ·∇xf(x) dx =

∑e

∫Te

∇xf ·∇xf(x) dx ≤∑e

∑4i=1A

2ie

9|Te|∑

xi a vertex of Te

f(xi)2

The claimed upper bound follows immediately. 2

Corollary 5.9-11: Suppose that Ω ⊂ R3 is a union of disjoint tetrahedra Te, 1 ≤ e ≤ E,such that no vertex of these elements is shared by fewer than smin elements, or more thansmax elements. Suppose that there is a line L such that for all lines L′ parallel to L thelength of the line segment L′ ∩ Ω is at most R. The condition number of the stiffnessmatrix A satisfies

κ(A) ≤ smaxR2

18sminβ1

maxe(∑

i:Aie is area of a side of TeA2ie)/(|Te|)

mine |Te|‖


Lemma 5.9-12: Suppose that $\Omega \subset \mathbb{R}^3$ is a union of disjoint tetrahedra $T_e$, $1 \le e \le E$, such that no vertex of these tetrahedra is shared by fewer than $s_{\min}$ tetrahedra. Then

supu piecewise linear on Ω,u=0 on ∂Ω

∫Ω ‖∇xu‖2 dx∫

Ω u2 dx

≥ 10216smin

maxe(maxA= length of side of TeA2

|Te|)

mine |Te|

Proof: Let $u(x)$ be piecewise linear on the union of tetrahedra $\Omega$, and suppose that $u(x)$ is one at exactly one interior vertex $x$ of $\Omega$. Let $T$ be a tetrahedron that has $x$ as its $n$th vertex. Then

∫T‖∇xu‖2 dx =

136|T |

e>n

1 0 00 1 00 0 1−1 −1 −1

[(x2 − x4)× (x3 − x4)]>

[(x3 − x4)× (x1 − x4)]>

[(x1 − x4)× (x2 − x4)]>

[(x2 − x4)× (x3 − x4) (x3 − x4)× (x1 − x4) (x1 − x4)× (x2 − x4)

]1 0 0 −10 1 0 −10 0 1 −1

en =A2n

36|T |

Furthermore,

∫Tu2 dx = e>n

160

140

130 − 1

30140

120

115 − 7

120130

115

110 − 3

40− 1

30 − 7120 − 3

40112

en6|T | ≤ 610|T |

The result follows from these formulas. 2


Corollary 5.9-13: Suppose that $\Omega \subset \mathbb{R}^3$ is a union of disjoint tetrahedra $T_e$, $1 \le e \le E$, such that no vertex of these elements is shared by fewer than $s_{\min}$ or more than $s_{\max}$ of these elements. Let $\beta_1$ be the constant defined in lemma 5.9-1. Let $A_{ie}$, $1 \le i \le 4$ be the areas of the sides of $T_e$, and let $h$ be the maximum of the length of any edge of any element $T_e$. Further, suppose that

∃C1 ∀1 ≤ e ≤ E |Te| ≥ C1h3

and

∃C2 ∀1 ≤ e ≤ E

4∑i=1

A2ie ≤ C2|Te|

Then √2

96C1smin≤ h2 sup

f∈C0(Ω):f linear on each Te

∫Ω ‖∇xf‖2 dx∫

Ω f2 dx

≤ smax

smin

C2

54β1C1

Proof: From lemmas 5.9-10 and 5.9-2 we have∫Ω‖∇xf‖2 dx ≤ smax max

e

∑4i=1A

2ie

|Te|

∫Ω f

2 dxβ16smin mine |Te|

The upper bound in the claim follows from this inequality and the inequalitiesin the assumption. From lemma 5.9-12 and lemma 5.9-2 we see that

supf∈C0(Ω):f linear on each Te

∫Ω ‖∇xf‖2 dx∫

Ω f2 dx

≥10 maxe

∑4i=1A

2ie/|Te|

216smin mine |Te|

The numerator is smallest for equilateral tetrahedra, and the lower bound inthe claim follows from the assumptions. 2

5.10 Interpolation Error

5.10.1 Low-Order Polynomials on Triangles

Let us illustrate the issues in estimating the errors in finite element approximation by considering some simple cases on triangles. Suppose that we wanted to estimate the errors in approximating $u \in W_2^1(\Omega)$ by piecewise constant functions on triangles. If $T_e$ is a triangle in our triangulation, we will let $h_e$ be the length of the longest side of $T_e$. In addition, we will assume that $\forall e \;\; h_e \le h$.

On the canonical triangle, the Bramble-Hilbert lemma shows us that

\[ \exists C > 0 \;\; \forall u \in H_2^1(T_*) \quad \inf_{p \in P_0(T_*)} \|u - p\|_{H_2^0(T_*)} \le C |u|_{H_2^1(T_*)} . \]


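Since $\nabla_\xi u = J_e^\top \nabla_x u$ where $J_e$ is constant within each element $T_e$, if $x_1$, $x_2$ and $x_3$ are the vertices of element $e$ then we can compute the determinant

\[ |J_e| = \det[x_1 - x_3, x_2 - x_3] = (x_1 - x_3)^\top \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} (x_2 - x_3) \le \|x_1 - x_3\| \left\| \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} (x_2 - x_3) \right\| = \|x_1 - x_3\| \|x_2 - x_3\| \]

and the Frobenius norm

\[ \|J_e\|_F^2 = \|x_1 - x_3\|^2 + \|x_2 - x_3\|^2 . \tag{5.10-1} \]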

It follows that

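\[
\begin{aligned}
\inf_{p \in P_0(T_e)} \|u - p\|^2_{H_2^0(T_e)} &= \inf_{p \in P_0(T_e)} \int_{T_e} [u(x) - p(x)]^2 \, dx \\
&= \inf_{p \in P_0(T_*)} \int_{T_*} [u(x(\xi)) - p(x(\xi))]^2 |J_e| \, d\xi \\
&= |J_e| \inf_{p \in P_0(T_*)} \|u - p\|^2_{H_2^0(T_*)} \le C |J_e| \, |u|^2_{H_2^1(T_*)} \\
&= C |J_e| \int_{T_*} \nabla_\xi u \cdot \nabla_\xi u \, d\xi = C \int_{T_e} \nabla_x u \cdot J_e J_e^\top \nabla_x u \, dx \\
&= C \int_{T_e} \|J_e^\top \nabla_x u\|^2 \, dx \le C \|J_e\|_F^2 \, |u|^2_{H_2^1(T_e)} \le 2 C h_e^2 \, |u|^2_{H_2^1(T_e)} .
\end{aligned}
\]

We can sum over all elements and then take the square root to get

\[ \inf_{p \in V_h^k} \|u - p\|_{W_2^0(\Omega)} \le \sqrt{2C} \, h \, \|u\|_{W_2^1(\Omega)} . \]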

The difficulty here is that this result gives us a false sense of security. While $h$ may measure the size of the largest side of any triangle in the grid, there could be many short sides. For example, if $\Omega$ is the unit square then all of our triangles could be right triangles with hypotenuse of length $h$ and one side of length $h \sin\theta$ where $\theta$ is arbitrarily small. If the small side is aligned with the second coordinate axis, then the number of triangles along the first coordinate axis is $1/(h \cos\theta)$, and the number of triangles along the second coordinate axis is $1/(h \sin\theta)$. The total number of triangles would be $4/(h^2 \sin 2\theta)$. Depending on the relative size of $\sin\theta$ and $h$, the error in the piecewise constant polynomial approximation might not be proportional to the square root of the number of grid triangles. In other words, there may be no connection between the accuracy of the method and the work involved in obtaining that accuracy.
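To make the point concrete, the following short sketch tabulates the triangle count for a fixed hypotenuse $h$ as the small angle $\theta$ shrinks. The function names are hypothetical, and the counts are treated as real numbers rather than rounded to integers. The error bound, which depends only on $h$, does not change, while the amount of work grows like $1/\sin 2\theta$.

```python
import math

def thin_mesh_triangle_count(h, theta):
    """Triangles covering the unit square with right triangles of hypotenuse h
    whose legs are h*cos(theta) and h*sin(theta)."""
    cells_x = 1.0 / (h * math.cos(theta))   # rectangles along the first axis
    cells_y = 1.0 / (h * math.sin(theta))   # rectangles along the second axis
    return 2.0 * cells_x * cells_y          # two triangles per rectangle

def closed_form_count(h, theta):
    """The same count written as 4 / (h^2 sin(2 theta))."""
    return 4.0 / (h * h * math.sin(2.0 * theta))

h = 0.1
for theta in (math.pi / 4, 0.1, 0.01):
    print(f"theta = {theta:.4f}: about {thin_mesh_triangle_count(h, theta):.0f} "
          f"triangles, error bound still proportional to h = {h}")
```

With $h = 0.1$, the isotropic case $\theta = \pi/4$ needs 400 triangles, whereas $\theta = 0.01$ needs roughly 20000 for the same bound.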


5.10.2 General Interpolation Errors

However, we would also like the error in the derivatives of the finite element approximations to converge. Let h be the length of the longest side of any triangle Te in the mesh, and let h be the perimeter minus the longest side. Heron’s formula for the area of a triangle shows that if `j , 1 ≤ j ≤ 3 are the lengths of the sides of Te and p = `1 + `2 + `3 is the perimeter, then

|Je| =12

√p(p− 2`1)(p− 2`2)(p− 2`3) ≥

12

√hh3

Let x1, x2 and x3 be the vertices of Te. Then

J−1e =

[x22 − x32 −x21 + x31

−x12 + x32 x11 − x31

]1|Je|

so

‖J−1e ‖2

F ≤2h2

|Je|2≤ 2h2

14hh

3=

8hh3 .

Recall that inequality (5.10-1) bounds the Frobenius norm of the Jacobian Je.Given any triangle in the mesh, the Bramble-Hilbert lemma shows us that

∃C > 0∀u ∈ H22 (T∗) inf

p∈P1(T∗)|u− p|H1

2 (T∗) ≤ C|u|H22 (T∗) .

Since ∇ξu = J>e ∇xu where Je is constant, for any element Te in the grid,

infp∈P1(Te)

|u− p|2H12 (Te)

= infp∈P1(Te)

∫Te

‖∇x[u(x)− p(x)]‖2 dx

= infp∈P1(T∗)

∫T∗

‖J−>e ∇ξ[u(x(ξ))− p(x(ξ))]‖22|Je| dξ

= ‖J−>e ‖22|Je| inf

p∈P1(T∗)|u− p|2H0

2 (T∗)≤ C‖J−>e ‖2

2|Je|‖u‖2H2

2 (T∗)

≤ C‖J−>e ‖22|Je|

∫T∗

d∑i=1

[e>i (∇ξ∇>ξ u)ei]2 dξ

= C‖J−>e ‖22

∫e

d∑i=1

[ei(J>e (∇x∇x>u)Je)ei]2 dx

= C‖J−1e ‖2

2‖Je‖4F

∫Te

‖∇x∇x>u‖2

F dx

≤ C‖J−1e ‖2

2‖Je‖4F |u|2H2

2 (Te)≤ C

16h3e

h3e

|u|2H22 (Te)

.


We can sum over all elements and then take the square root to get

infp∈Vk

h

‖u− p‖W 02 (Ω) ≤ C

h3/2

h3/2‖u‖W 2

2 (Ω) .

At this point, convergence depends on how the ratio h/h behaves as the mesh is refined.In general, the Jacobian of the coordinate transformation is not constant, and a more

general approach must be applied to estimate the errors. Let xe(ξ) be the function thatmaps a given element Te ⊂ Rd into a canonical element T∗. Suppose that h is the largestpossible length of a side of Te and h is the smallest possible length of a side. If u ∈ Hk+1

2 (Te)and α is a multi-index satisfying |α| < k + 1, then

infp∈Pk(Te)

‖Dα(u− p)‖H02 (Te) ≤ Ch

d/2h−|α|‖Dα(u− p)‖H0

2 (T∗)

≤ Chd/2h−|α||u|Hk+1

2 (T∗)

≤ Chk+1+d/2

h−|α|−d/2|u|Hk+12 (Te)

.

Note that these estimates depend on bounds for |∂x∂ξ | and |(∂x∂ξ )−1|; they are invalid if |∂x∂ξ | = 0

in T∗.In order for Dα(u− p) → 0 as h→ 0, we will use the following assumptions:

Assumption 5.10-1: The finite element space is quasi-uniform, meaning that

∃ρ > 0∀e he/he ≤ ρ diam(Ω) .

Assumption 5.10-2: The finite element space in Rd has a uniform mesh map, meaningthat

∃0 < α < β ∀Te ∀ξ ∈ T∗ αhde ≤ det |∂xe(ξ)∂ξ

| ≤ βhe . .

It is possible to show that a triangular mesh is quasi-uniform and has a uniform mesh map if all angles in the triangles are bounded uniformly below by some angle θ > 0. For quadrilaterals, we have a uniform mesh map if all angles are bounded uniformly above by some angle θ < π. Note that in this statement, a bow-tie quadrilateral is assumed to involve two angles greater than π.

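A discussion in Strang [81] draws the following conclusion. If our finite element space is quasi-uniform with a uniform mesh map and reproduces exactly all polynomials of degree at most $k$ on $\bigcup T_e$, then

\[ \exists C \;\; \forall u \in H_2^{k+1}(\textstyle\bigcup T_e) \;\; \forall h \;\; \forall |\alpha| \le k , \quad \inf_{p \in V_h^k} \|D^\alpha (u - p)\|_{H_2^0(\bigcup T_e)} \le C h^{k+1-|\alpha|} \, |u|_{H_2^{k+1}(\bigcup T_e)} . \]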


In order for the finite element space to be useful in approximating the weak solution of an elliptic equation involving derivatives of order at most $2m$, it is also necessary that the finite element space reproduce exactly all polynomials of degree at most $m$.

In the case that isoparametric transformations are used to define the mesh map $x_e(\xi)$, the error estimates depend on the size of the derivatives in the mesh map $x_e(\xi)$. Assuming that linear combinations of the canonical basis functions reproduce all polynomials of degree at most $k$, that all angles in the elements $T_e$ are bounded uniformly away from 0 and $\pi$, and that the sides of the elements $T_e$ are graphs of polynomials with uniformly bounded derivatives, then

\[ \exists C \;\; \forall u \in H_2^{k+1}(\textstyle\bigcup T_e) \;\; \forall h \;\; \forall |\alpha| \le 1 \quad \inf_{p \in V_h^k} \|D^\alpha (u - p)\|_{H_2^0(\bigcup T_e)} \le C h^{k+1-|\alpha|} \, |u|_{H_2^{k+1}(\bigcup T_e)} . \]

A proof of this result can be found in Strang [81, p. 160].

In order to decide between competing elements of the same polynomial order, it is useful to know the constants involved in the error estimates. A good estimate can be obtained by measuring the error in the approximation of polynomials of order $k + 1$. This approach is discussed in Strang [81, p. 147ff].

For alternate discussions of the error in finite element interpolation, see [53, p. 84] or [16, p. 103]. The latter reference includes a constructive proof of the Bramble-Hilbert lemma, but does not really describe how to compute the constants in the error estimates.

5.11 Numerical Quadrature

Numerical quadrature is absolutely essential to the computational success of the finite element method, but is not well-described in the literature. Finite element pioneers such as O.C. Zienkiewicz [94, p. 156] have presented very simple rules for linear elasticity due to Irons [51, 52]:

Convergence of the finite element process will occur in elastic displacement analysis problems if the integration is sufficient to evaluate exactly the volume of the element.

Zienkiewicz and others have noted that better accuracy is sometimes achieved with quadrature rules of lower order than that needed to preserve the accuracy of the finite element approximation. Bathe and Wilson [10, section 4.7, pp. 162-165] list some additional examples of success with reduced order quadrature and make suggestions of orders for quadrature rules on curvilinear quadrilaterals.

Strang [81, p. 182] presents a minimal condition for convergence of the finite element method in somewhat more general terms. Suppose that the weak form of the problem


involves a coercive bilinear form B satisfying

∃c > 0 ∀v ∈ HMess(Ω) , B(v, v) ≥ c‖v‖2

m

and the numerical quadrature rule produces an approximate bilinear form BQ satisfying

∃cQ > 0 ∀V ∈ V ⊂ HMess(Ω) , BQ(V, V ) ≥ cQ‖V ‖2

m

Then convergence in Hmess(Ω) occurs if and only if

∃CV > 0 ∀p ∈ Pm ∀V ∈ V , |B(p, V )− BQ(p, V )| ≤ CVh

where $h$ is the mesh size. Strang also claims that the integrals of the squares of the $m$th derivatives of the shape functions must be integrated exactly to maintain the full accuracy of the method.

Ciarlet [25, section 4.1, pp. 178-207] presents a more careful analysis of numerical integration under the assumption of linear coordinate transformations. We will modify some of his proofs below.


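Lemma 5.11-1: Suppose that we have a finite element $(\Omega, N, k, \mathcal{V}, \mathcal{N})$ with linearly independent canonical shape functions $V_*(\xi)$ defined on the canonical element $T_*$. Also suppose that we have a quadrature rule

\[ \int_{T_*} \phi(\xi) \, d\xi \approx \sum_{q=1}^Q \phi(\xi_q) w_q \]

with positive weights

\[ \forall 1 \le q \le Q , \quad w_q > 0 . \]

Further, suppose that the quadrature points satisfy the unisolvency condition: if $p(\xi)$ is a linear combination of the canonical shape functions $V_*(\xi)$, then

\[ \big[ \forall 1 \le q \le Q \;\; \nabla_\xi p(\xi_q) = 0 \big] \implies \nabla_\xi p(\xi) \equiv 0 \tag{5.11-1} \]

Suppose that the bilinear form $B$ is uniformly strongly elliptic:

\[ B(v, u) = \int_\Omega \nabla_x v \cdot D(x) \nabla_x u \, dx \]

where $D(x)$ satisfies

\[ \exists \delta > 0 \;\; \forall x \in \Omega \;\; \forall y \in \mathbb{R}^d , \quad y^\top D(x) y \ge \delta \, y^\top y \]

Finally, suppose that $\Omega = \cup_{e=1}^E T_e$, where each element $T_e$ has maximum radius $h_e$, and can be mapped from the canonical element $T_*$ by coordinate transformations $x_e(\xi)$ satisfying the following inequalities:

\[ \exists \underline{c}, \overline{c} > 0 \;\; \forall 1 \le e \le E \;\; \forall \xi \in T_* \;\; \forall y \in \mathbb{R}^d , \quad \underline{c} \|y\|^2 \le h_e^{d-2} \Big\| \Big(\frac{\partial x_e}{\partial \xi}\Big)^\top y \Big\|^2 \frac{1}{|\frac{\partial x_e}{\partial \xi}|} \le \overline{c} \|y\|^2 \tag{5.11-2} \]

Then the approximate bilinear form

\[ B_Q(V, W) = \sum_{e=1}^E \sum_{q=1}^Q \nabla_x V(x_e(\xi_q)) \cdot D(x_e(\xi_q)) \nabla_x W(x_e(\xi_q)) \, w_q \Big| \frac{\partial x_e}{\partial \xi}(\xi_q) \Big| \]

is uniformly coercive:

\[ \exists \beta > 0 \;\; \forall h > 0 \;\; \forall V \in \mathcal{V} , \quad B_Q(V, V) \ge \beta |V|_1^2 \]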


Proof: Let $S_*$ be the set of all linear combinations of the canonical shape functions $V_*(\xi)$. First, we will prove that

\[ \exists C_{\mathrm{uni}} > 0 \;\; \forall v_* \in S_* , \quad \sum_{q=1}^Q \|\nabla_\xi v_*(\xi_q)\|^2 w_q \ge C_{\mathrm{uni}} \int_{T_*} \|\nabla_\xi v_*\|^2 \, d\xi \tag{5.11-3} \]

Note that if $v_* \in S_*$, then

\[ \nu(\nabla_\xi v_*) \equiv \sqrt{ \sum_{q=1}^Q \|\nabla_\xi v_*(\xi_q)\|^2 w_q } \]

is nonnegative and satisfies homogeneity

\[ \forall \alpha \in \mathbb{R} \;\; \forall v_* \in S_* , \quad \nu(\nabla_\xi v_* \alpha) = \nu(\nabla_\xi v_*) |\alpha| \]

and the triangle inequality

\[ \forall v_*, w_* \in S_* , \quad \nu(\nabla_\xi v_* + \nabla_\xi w_*) \le \nu(\nabla_\xi v_*) + \nu(\nabla_\xi w_*) \]

In order to prove that $\nu$ is a norm on the gradients of functions in $S_*$, we need only prove that $\nu(\nabla_\xi v_*) = 0 \implies \nabla_\xi v_* = 0$. But the positivity of the quadrature weights shows that $\nu(\nabla_\xi v_*) = 0$ implies that $\nabla_\xi v_*(\xi_q) = 0$ for $1 \le q \le Q$, so the unisolvency condition (5.11-1) implies that $\nabla_\xi v_* \equiv 0$. Thus $\nu$ is a norm on the gradients of functions in $S_*$. Since $S_*$ is finite-dimensional, and all norms on a finite dimensional space are equivalent, inequality (5.11-3) must hold. Note that the constant $C_{\mathrm{uni}}$ depends on the shape functions $V_*$, the canonical element $T_*$ and the quadrature rule (5.11-1), but not on the mesh size $h$ associated with the elements $T_e$.

We can now examine the numerical stiffness matrix for an element as follows. Suppose that $T_e$ is an element in $\Omega$, and that we can map from $T_*$ to $T_e$ by means of the coordinate transformation $x_e(\xi)$. Then the vector of shape functions on $T_e$ is $V(x_e(\xi)) = V_*(\xi)$. Using inequalities (5.11-2) and (5.11-3) we can see that


the approximate contribution to the stiffness matrix from Te satisfies

Q∑q=1

[∂V∂x

D(x)(∂V∂x

)>|∂xe∂ξ

|]x=xe(ξq)wq

=Q∑q=1

[∂V∗

∂ξ(∂xe∂ξ

)−1D(xe)(∂xe∂ξ

)−>|∂xe∂ξ

|(∂V∗

∂ξ)>]ξ=ξqwq

≥δchd−2e

Q∑q=1

[∂V∗

∂ξ(∂V∗

∂ξ)>]ξ=ξqwq

≥δCunic

hd−2e

∫T∗

∂V∗

∂ξ(∂V∗

∂ξ)> dξ

=δCunic

hd−2e

∫Te

∂V∂x

∂xe∂ξ

(∂xe∂ξ

)>(∂V∂x

)>1

|∂xe∂ξ |

dx

≥δCunicc

∫Te

∂V∂x

(∂V∂x

)> dx

The claimed result follows by summing over all of the elements. 2

Lemma 5.11-2: If the shape functions $V_*(\xi)$ in lemma 5.11-1 are polynomials of degree at most $k$ and the quadrature rule (5.11-1) is exact for all polynomials of degree at most $2k - 2$, then the quadrature points satisfy the unisolvency condition (5.11-1).

Proof: If $v_*(\xi) \in S_*$ is a linear combination of the shape functions, then $\nabla_\xi v_*$ is a polynomial of degree at most $k - 1$, so the quadrature rule computes the integral of the square of its norm exactly:

\[ \int_{T_*} \|\nabla_\xi v_*\|^2 \, d\xi = \sum_{q=1}^Q \|\nabla_\xi v_*(\xi_q)\|^2 w_q \]

From this expression, it is easy to see that if $\nabla_\xi v_*$ is zero at each of the quadrature points, then $\nabla_\xi v_* \equiv 0$ in $T_*$. $\Box$

Example 5.11-3: Consider single-point Gaussian quadrature for numerical quadratures with tensor-product linear functions in two dimensions. The gradient of an arbitrary linear combination $v(\xi) \equiv \tau + s^\top \xi + \eta \xi_1 \xi_2$ of the canonical shape functions is

\[ \nabla_\xi v = s^\top + \eta \begin{bmatrix} \xi_2 & \xi_1 \end{bmatrix} \]

Single-point Gaussian quadrature would evaluate this gradient at $\xi = 0$ to get

\[ \nabla_\xi v(0) = s^\top \]

Thus single-point Gaussian quadrature fails the unisolvency condition (5.11-1), because the shape function could have zero gradient at $\xi = 0$ and still have a nonzero value for the hourglass mode $\eta$. This is in spite of the fact that single-point Gaussian quadrature computes the area of parallelograms exactly.

Single-point quadrature cannot be used for tensor product linear functions in three dimensions for similar reasons.
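The failure is easy to observe numerically. The sketch below is an illustration under stated assumptions: the canonical square is taken to be $[-1,1]^2$, so that the single Gauss point is $\xi = 0$ with weight 4, and the function names are hypothetical. It integrates $\|\nabla_\xi v\|^2$ for the pure hourglass mode $v = \xi_1 \xi_2$ with the one-point rule and with the $2 \times 2$ Gauss rule.

```python
import math

def grad_v(xi, s, eta):
    """Gradient of v = tau + s . xi + eta * xi1 * xi2 at the point xi."""
    return (s[0] + eta * xi[1], s[1] + eta * xi[0])

def quadrature_energy(points, weights, s, eta):
    """Quadrature approximation of the integral of |grad v|^2 over [-1, 1]^2."""
    total = 0.0
    for xi, w in zip(points, weights):
        g = grad_v(xi, s, eta)
        total += w * (g[0] ** 2 + g[1] ** 2)
    return total

# One-point Gauss rule on the assumed reference square [-1, 1]^2
one_point = ([(0.0, 0.0)], [4.0])

# 2 x 2 tensor-product Gauss rule: points +-1/sqrt(3), unit weights
g = 1.0 / math.sqrt(3.0)
two_by_two = ([(-g, -g), (g, -g), (-g, g), (g, g)], [1.0, 1.0, 1.0, 1.0])

hourglass = ((0.0, 0.0), 1.0)   # s = 0, eta = 1
energy_1pt = quadrature_energy(*one_point, *hourglass)
energy_2x2 = quadrature_energy(*two_by_two, *hourglass)
print(energy_1pt, energy_2x2)
```

The exact value of $\int \eta^2 (\xi_1^2 + \xi_2^2) \, d\xi$ on this square is $8\eta^2/3$, which the $2 \times 2$ rule reproduces, consistent with lemma 5.11-2. The one-point rule returns zero, so the approximate bilinear form cannot be coercive on the hourglass mode.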

The following lemma is due to Strang [80].

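Lemma 5.11-4: Suppose that $H$ is a Banach space, and that $\mathcal{V} \subset H$ is finite dimensional. Let $B : H \times H \to \mathbb{R}$ be bilinear and bounded:

\[ \exists c \;\; \forall v, w \in H , \quad |B(v, w)| \le c \|v\|_H \|w\|_H \]

Let $B_Q : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ be bilinear and coercive:

\[ \exists c_Q \;\; \forall V \in \mathcal{V} , \quad B_Q(V, V) \ge c_Q \|V\|_H^2 \]

Suppose that $\lambda : H \to \mathbb{R}$ and $\lambda_Q : \mathcal{V} \to \mathbb{R}$ are linear functionals. Let $u \in H$ satisfy the weak form

\[ \forall v \in H , \quad B(v, u) = \lambda(v) \tag{5.11-4} \]

and let $U \in \mathcal{V}$ satisfy the Galerkin equations with numerical quadrature

\[ \forall V \in \mathcal{V} , \quad B_Q(V, U) = \lambda_Q(V) \tag{5.11-5} \]

Then the error $u - U$ satisfies

\[ \|u - U\|_H \le \inf_{V \in \mathcal{V}} \left\{ \Big(1 + \frac{c}{c_Q}\Big) \|u - V\|_H + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|B(W, V) - B_Q(W, V)|}{\|W\|_H} \right\} + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|\lambda(W) - \lambda_Q(W)|}{\|W\|_H} \]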


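Proof: Using the coercivity of $B_Q$ and the boundedness of $B$, it is easy to see that for any $V \in \mathcal{V}$

\[
\begin{aligned}
c_Q \|U - V\|_H^2 &\le B_Q(U - V, U - V) \\
&= B(U - V, u - V) + B(U - V, V) - B_Q(U - V, V) + \lambda_Q(U - V) - \lambda(U - V) \\
&\le c \|U - V\|_H \|u - V\|_H + |B(U - V, V) - B_Q(U - V, V)| + |\lambda_Q(U - V) - \lambda(U - V)|
\end{aligned}
\]

We can divide by $c_Q \|U - V\|_H$ and bound the last two terms by suprema over $W \in \mathcal{V}$ to get

\[ \|U - V\|_H \le \frac{c}{c_Q} \|u - V\|_H + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|B(W, V) - B_Q(W, V)|}{\|W\|_H} + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|\lambda(W) - \lambda_Q(W)|}{\|W\|_H} \]

Then the triangle inequality implies that

\[
\begin{aligned}
\|u - U\|_H &\le \|u - V\|_H + \|U - V\|_H \\
&\le \Big(1 + \frac{c}{c_Q}\Big) \|u - V\|_H + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|B(W, V) - B_Q(W, V)|}{\|W\|_H} + \frac{1}{c_Q} \sup_{W \in \mathcal{V}, W \ne 0} \frac{|\lambda(W) - \lambda_Q(W)|}{\|W\|_H}
\end{aligned}
\]

We can take the infimum over $V \in \mathcal{V}$ to get the final result. $\Box$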

This lemma says that if the quadrature rule produces a coercive approximation to the differential operator, then the errors in the Galerkin approximation are small if we can approximate the solution $u$ and control the errors in numerical integration.

Lemma ?? shows how to estimate the errors $B - B_Q$ and $\lambda - \lambda_Q$ in one dimension. We will describe how to extend those results to $d$ dimensions. Suppose that our canonical shape functions are polynomials of degree at most $k$, and our quadrature rule is exact for polynomials of degree at most $r$, where $r > k \ge \lfloor \frac{d}{2} \rfloor$ and $\lfloor \frac{d}{2} \rfloor$ is the largest integer less than $\frac{d}{2}$. Given a non-constant polynomial $W^*(\xi)$ of degree at most $k$, let

F 1W ∗(u) ≡

|∫T∗

∇ξW∗ · u dξ −

∑Qq=1 ∇ξW

∗(ξq) · u(ξq)wq||W ∗|1

Note that F 1W ∗(u) = 0 whenever u is a polynomial of degree at most r − k + 1. Also note

that F 1W ∗ the Sobolev Imbedding Theorem 4.4-23 shows that if r − k + 1 > 1 + d/2 then


FW ∗W 1 is a bounded sublinear functional on W s(Ω). By lemma 4.6-4 there is a constantC1W ∗ such that for all u ∈W r−k+2(Ω) we have F 1

W ∗(u) ≤ C1W ∗ |u|r−k+2; in other words,

∀W ∗ ∈ Pk(T∗) ∃C1W ∗ > 0 ∀u ∈W r−k+2(Ω)(T∗) ,

|∫T∗

∇ξW∗ · u dξ −

Q∑q=1

∇ξW∗(ξq) · u(ξq)wq| ≤ C1

W ∗ |W ∗|1|u|r−k+2

We can determine the constants C1W ∗ for a basis of these polynomials, and take the maximum

to replace C1W ∗ with a constant that is independent of W ∗.

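A similar argument, working with

\[ F^0_{W^*}(u) \equiv \frac{ \left| \int_{T_*} W^* u \, d\xi - \sum_{q=1}^Q W^*(\xi_q) u(\xi_q) w_q \right| }{ \|W^*\|_0 } , \]

shows that

\[ \forall W^* \in P_k(T_*) \ \exists C^0_{W^*} > 0 \ \forall u \in W^{r-k+1}(T_*) , \quad \left| \int_{T_*} W^* u \, d\xi - \sum_{q=1}^Q W^*(\xi_q) u(\xi_q) w_q \right| \le C^0_{W^*} \|W^*\|_0 |u|_{r-k+1} . \]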

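By changing coordinates, assuming smooth and regular coordinate mappings, for any element $T_e$ with radius $h_e$ we get

\[ \exists C_1 > 0 \ \forall W \in P_k(T_e) \ \forall u \in W^{r-k+2}(T_e) , \quad h_e^{2-d} \left| \int_{T_e} \nabla_x W \cdot u \, dx - \sum_{q=1}^Q \nabla_x W(x_q) \cdot u(x_q) w_q \right| \le C_1 h_e^{1-d/2} |W|_1 \, h_e^{r-k+2-d/2} |u|_{r-k+2} \]

and

\[ \exists C_0 > 0 \ \forall W \in P_k(T_e) \ \forall u \in W^{r-k+1}(T_e) , \quad h_e^{-d} \left| \int_{T_e} W u \, dx - \sum_{q=1}^Q W(x_q) u(x_q) w_q \right| \le C_0 h_e^{-d/2} \|W\|_0 \, h_e^{r-k+1-d/2} |u|_{r-k+1} . \]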

If the coefficient matrix $D$ in the bilinear form $B(v, w) = \int_\Omega \nabla_x v \cdot D(x) \nabla_x w \, dx$ is sufficiently smooth (i.e., has $r - k + 2$ continuous derivatives) then we can take $u(\xi) = D(x_e(\xi)) \nabla V(x_e(\xi))$ in the former of these inequalities to bound

\[ \sup_{W \in \mathcal{V}, W \ne 0} \frac{|B(W,V) - B_Q(W,V)|}{\|W\|_1} \le C_D \max_e h_e^{r-k+1} \|V\|_{r-k+1} \tag{5.11-6} \]

(See the proof of lemma ?? in one dimension for more details.) Similarly, if the function $f$ in the linear functional $\lambda(v) = \int_\Omega v(x) f(x) \, dx$ is sufficiently smooth (i.e., $f$ has $r - k + 1$ continuous derivatives) then

\[ \frac{|\lambda(W) - \lambda_Q(W)|}{\|W\|_H} \le C_f \max_e h_e^{r-k+1} \tag{5.11-7} \]

Suppose that the shape functions are polynomials of degree at most $k$, and reproduce all polynomials of degree at most $\ell \le k$ exactly. Also suppose that the order of polynomials integrated exactly by the quadrature rule is $r$. Then the order of the error due to numerical integration (namely $r - k + 1$; see inequalities (5.11-6) and (5.11-7)) preserves the order of the error in the finite element method with exact integration (namely $\ell$) provided that $r$ satisfies

\[ r \ge k + \ell - 1 . \]

Example 5.11-5: Consider the use of piecewise linear functions on triangles as our finite element approximating functions. These have the general form

\[ p(x, y) = a + bx + cy \]

so

\[ \nabla p = \begin{bmatrix} b \\ c \end{bmatrix} . \]

Since the gradient of $p$ involves two coefficients, we would need at least one quadrature point in order to enforce a unisolvency condition on the gradient.

Now linear functions reproduce polynomials of degree $\ell = 1$ exactly, and involve polynomials of degree $k = 1$. To maintain the order of convergence, the quadrature rule needs to be exact for polynomials of degree $r \ge k + \ell - 1 = 1$. Single-point Gaussian quadrature will maintain the order of convergence in this case.

Example 5.11-6: Consider the use of piecewise bilinear functions on quadrilaterals as our finite element approximating functions. These have the general form

\[ p(x, y) = a + bx + cy + dxy \]

so

\[ \nabla p = \begin{bmatrix} b + dy \\ c + dx \end{bmatrix} . \]

Since the gradient of $p$ involves three coefficients, we would need at least two distinct quadrature points in order to enforce a unisolvency condition on the gradient (two conditions at each of the two quadrature points).

Now bilinear functions reproduce polynomials of degree $\ell = 1$ exactly, and involve polynomials of degree $k = 2$. To maintain the order of convergence, the quadrature rule needs to be exact for polynomials of degree $r \ge k + \ell - 1 = 2$. Tensor product two-point Gaussian quadrature will maintain the order of convergence in this case, because it integrates cubics exactly.

Example 5.11-7: Consider the use of piecewise bi-quadratic functions on quadrilaterals as our finite element approximating functions. These have the general form

\[ p(x, y) = a + bx + cy + dx^2 + exy + fy^2 + gx^2 y + hxy^2 + kx^2 y^2 \]

so

\[ \nabla p = \begin{bmatrix} b + 2dx + ey + 2gxy + hy^2 + 2kxy^2 \\ c + ex + 2fy + gx^2 + 2hxy + 2kx^2 y \end{bmatrix} . \]

Since the gradient of $p$ involves eight coefficients, we would need at least four distinct quadrature points in order to enforce a unisolvency condition on the gradient (two conditions at each of the four quadrature points).

Now biquadratic functions reproduce polynomials of degree $\ell = 2$ exactly, and involve polynomials of degree $k = 4$. To maintain the order of convergence, the quadrature rule needs to be exact for polynomials of degree $r \ge k + \ell - 1 = 5$. Tensor product three-point Gaussian quadrature will maintain the order of convergence in this case, because it integrates quintics exactly.

Example 5.11-8: Consider the use of piecewise trilinear functions on hexahedrons as our finite element approximating functions. These have the general form

\[ p(x, y, z) = a + bx + cy + dz + exy + fyz + gzx + hxyz \]

so

\[ \nabla p = \begin{bmatrix} b + ey + gz + hyz \\ c + ex + fz + hxz \\ d + fy + gx + hxy \end{bmatrix} . \]

Since the gradient of $p$ involves seven coefficients, we would need at least three distinct quadrature points in order to enforce a unisolvency condition on the gradient (three conditions at each of the three quadrature points).

Now trilinear functions reproduce polynomials of degree $\ell = 1$ exactly, and involve polynomials of degree $k = 3$. To maintain the order of convergence, the quadrature rule needs to be exact for polynomials of degree $r \ge k + \ell - 1 = 3$. Tensor product two-point Gaussian quadrature will maintain the order of convergence in this case, because it integrates cubics exactly.
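The degree count $r \ge k + \ell - 1$ used in these examples can be checked with a short script. The following Python sketch is illustrative only: the helper names `required_degree`, `gauss_points_needed` and `gauss_is_exact` are hypothetical, and it relies on NumPy's `numpy.polynomial.legendre.leggauss`, which returns the $n$-point Gauss rule on $[-1, 1]$ (exact for polynomials of degree $2n - 1$). The count of points per direction is meaningful for tensor-product elements; triangles use their own quadrature rules.

```python
import numpy as np

def required_degree(k, ell):
    """Smallest quadrature exactness degree r allowed by r >= k + ell - 1."""
    return k + ell - 1

def gauss_points_needed(r):
    """Smallest n with 2n - 1 >= r (n-point Gauss is exact to degree 2n - 1)."""
    return max(1, (r + 2) // 2)

def gauss_is_exact(n, degree, tol=1e-12):
    """Test n-point Gauss quadrature on the monomial x**degree over [-1, 1]."""
    x, w = np.polynomial.legendre.leggauss(n)
    exact = 2.0 / (degree + 1) if degree % 2 == 0 else 0.0
    return abs(np.dot(w, x**degree) - exact) < tol

# (element, k, ell) as quoted in Examples 5.11-5 through 5.11-8
examples = [("linear", 1, 1), ("bilinear", 2, 1),
            ("biquadratic", 4, 2), ("trilinear", 3, 1)]
for name, k, ell in examples:
    r = required_degree(k, ell)
    n = gauss_points_needed(r)
    ok = all(gauss_is_exact(n, m) for m in range(r + 1))
    print(name, "r >=", r, "Gauss points per direction:", n, "exact:", ok)
```

The script reports one, two, three and two Gauss points per direction for the four examples, in agreement with the rules named in the text.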

5.12 Interpolated Boundary Conditions

Suppose that we want to solve an elliptic equation on a domain $\Omega$ with a curved boundary. In this case, it may be impossible to find piecewise polynomials that satisfy the boundary conditions exactly, even if the boundary conditions are homogeneous. Instead, we will approximate the solution of the weak form of the differential equation by using a finite-dimensional space of (mapped) piecewise polynomials that interpolate the boundary conditions at appropriate boundary nodes. In this case, the boundary conditions will be satisfied by the numerical solution at a finite set of points on the boundary, and the rules of the ordinary Galerkin theory are violated. For more details of this approach, see [75] or [23]. This approach is one of two alternatives that we will discuss; the other is to use isoparametric transformations.


\[ \forall v \in W^1_{2,0}(\Omega) , \quad B(v, u) = (v, f) \]

where

\[ B(v, u) = \sum_{|\alpha|, |\beta| \le 1} (c_{\alpha,\beta} D^\beta v, D^\alpha u) . \]

We assume that the coefficients $c_{\alpha,\beta}$ are all smooth, and that the bilinear form is coercive. A finite element approximation $U$ can be defined by a similar variational equation with respect to a finite dimensional subspace $\mathcal{V}^k_h \subset W^1_2(\Omega)$. To define $\mathcal{V}^k_h$, we divide $\Omega$ into a collection of non-overlapping triangular elements $e$. The elements are of two types when $\partial\Omega$ is curved. In the interior, we can use elements with straight sides, such as triangles. Along the boundary, however, the elements will have at least one side that is replaced by a segment of $\partial\Omega$. We assume that the subdivision is quasi-uniform. Then $\mathcal{V}^k_h$ will denote the space of continuous piecewise polynomials of degree at most $k$ on the corresponding mesh. Note that these functions are polynomials even in the boundary elements. (For quadrilateral meshes, the coordinate transformation prevents the interpolants from being polynomials in $x$.)

When $h$ is sufficiently small relative to the radius of curvature of $\partial\Omega$, the functions in $\mathcal{V}^k_h$ can be determined by their nodal values, namely the values at the 3 vertices of a given triangle, at $k - 1$ points on the interior of each element side (including the curved boundary segments), and at $\frac{1}{2}(k - 2)(k - 1)$ points in the interior of each element (when $k \ge 3$).

It is worthwhile to choose the boundary nodes carefully. Let $e$ be a boundary element and assume that the coordinate system has been chosen so that the boundary vertices of $e$ lie on the $x_1$ axis, with one of the vertices at the origin. We assume that $h$ is small enough that $e \cap \partial\Omega$ can be represented as

\[ e \cap \partial\Omega = \{ (x_1, \rho(x_1)) : 0 \le x_1 \le \bar{x}_1 \} . \]

Let $p$ be a polynomial of degree at most $k$ that approximates $\rho$ accurately:

\[ \sup_{0 \le x_1 \le \bar{x}_1} |\rho(x_1) - p(x_1)| \le C \bar{x}_1^{k+1} \]

and such that

\[ p(0) = 0 = p(\bar{x}_1) . \]

We will place boundary nodes at the points

\[ (\xi_i \bar{x}_1, p(\xi_i \bar{x}_1)) , \quad 1 \le i \le k - 1 , \]

where the values $0 = \xi_1 < \ldots < \xi_{k-1} = 1$ will be described below. These boundary nodes are allowed to be slightly off $\partial\Omega$ in order to make the method computationally simple. Alternatively, we could have parameterized $\partial e$ by arc length $\sigma$ for $\sigma_1 \le \sigma \le \sigma_2$, and placed the nodes along $\partial\Omega$ at arc lengths $\sigma_1 + \xi_i(\sigma_2 - \sigma_1)$.

The optimal choice of the numbers $\xi_i$ is based on the Lobatto quadrature rule (sometimes called Radau quadrature). These quadrature points are zeros of

\[ \xi (1 - \xi) P_k'(\xi) = c_k \frac{d^k}{d\xi^k} \left[ \xi^k (1 - \xi)^k \right] . \]

Here $P_k$ is the $k$'th Legendre polynomial. The weights are

\[ w_1 = w_{k-1} = \frac{2}{n(n+1)} , \qquad w_i = \frac{2}{n(n+1) P_n(\xi_i)^2} , \quad 2 \le i \le k - 2 . \]

For more discussion of Lobatto quadrature, see [30] or [84].
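For readers who want to generate such nodes numerically, here is a minimal Python sketch. It is not taken from the references cited above: it uses the textbook Gauss-Lobatto construction on $[-1, 1]$ with $n$ points (endpoints plus the zeros of $P_{n-1}'$, weights $2 / (n(n-1) P_{n-1}(x_i)^2)$) and maps the result to $[0, 1]$. The function name `lobatto` and the indexing by total point count $n$ are assumptions of the sketch and may differ from the indexing used in the text.

```python
import numpy as np
from numpy.polynomial import legendre as L

def lobatto(n):
    """Gauss-Lobatto nodes and weights with n >= 2 points, mapped to [0, 1]."""
    c = np.zeros(n)
    c[-1] = 1.0                                  # Legendre coefficients of P_{n-1}
    interior = L.legroots(L.legder(c))           # zeros of P_{n-1}'
    x = np.concatenate(([-1.0], np.sort(interior.real), [1.0]))
    w = 2.0 / (n * (n - 1) * L.legval(x, c) ** 2)
    return 0.5 * (x + 1.0), 0.5 * w              # affine map from [-1, 1] to [0, 1]

xi, w = lobatto(4)
print(xi, w)
# An n-point Lobatto rule is exact for polynomials of degree 2n - 3.
print(np.dot(w, xi**5), 1.0 / 6.0)
```

With three points the rule reduces to Simpson's rule on $[0, 1]$, which gives a quick sanity check.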

For homogeneous Dirichlet boundary conditions, each function in $\mathcal{V}^k_h$ vanishes at the Lobatto quadrature points of $\partial e$ in each boundary element $e$. For inhomogeneous boundary conditions, each Galerkin approximation interpolates the Dirichlet boundary data at the boundary nodes. In this case the resulting Galerkin approximation $U$ satisfies

\[ \exists C \ \forall h \ \forall -1 \le s \le k - 1 , \quad \|u - U\|_{W^s_2(\Omega)} \le C h^{k+1+s} . \]

This is proved in [75]. A weaker form of the proof appears in [16].

In order to understand the error in interpolating the boundary conditions, we will first prove the following general error estimate.


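Lemma 5.12-1: [16, p. 196] Suppose that $H$ is a Hilbert space, $\mathcal{V}, \mathcal{V}_h \subset H$ are subspaces, and $\lambda$ is a continuous linear functional on $H$:

\[ \exists \Lambda > 0 \ \forall v \in H , \quad |\lambda(v)| \le \Lambda \|v\| . \]

Also suppose that $a : H \times H \to \mathbb{R}$ is a bilinear form that is both bounded on $H$,

\[ \exists C > 0 \ \forall v, w \in H , \quad |a(v, w)| \le C \|v\| \|w\| , \]

and coercive on $\mathcal{V}_h$,

\[ \exists \gamma > 0 \ \forall V \in \mathcal{V}_h , \quad a(V, V) \ge \gamma \|V\|^2 . \]

Let $u \in H$ solve

\[ \forall v \in \mathcal{V} , \quad a(v, u) = \lambda(v) , \]

and let $U \in \mathcal{V}_h$ approximate $u$ by solving

\[ \forall V \in \mathcal{V}_h , \quad a(V, U) = \lambda(V) . \]

Then the error satisfies

\[ \|u - U\| \le \left( 1 + \frac{C}{\gamma} \right) \inf_{V \in \mathcal{V}_h} \|u - V\| + \frac{1}{\gamma} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, u - U)|}{\|W\|} , \]

and

\[ \max \left\{ \inf_{V \in \mathcal{V}_h} \|u - V\| , \ \frac{1}{C} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, u - U)|}{\|W\|} \right\} \le \|u - U\| . \]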

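Proof: The proof is straightforward: for all $V \in \mathcal{V}_h$,

\[ \begin{aligned}
\|u - U\| &\le \|u - V\| + \|V - U\| \le \|u - V\| + \frac{1}{\gamma} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, V - U)|}{\|W\|} \\
&\le \|u - V\| + \frac{1}{\gamma} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, V - u)|}{\|W\|} + \frac{1}{\gamma} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, u - U)|}{\|W\|} \\
&\le \|u - V\| + \frac{C}{\gamma} \|V - u\| + \frac{1}{\gamma} \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, u - U)|}{\|W\|} .
\end{aligned} \]

After we take the infimum over all $V$, we obtain the first result. To obtain the second result, note that $\inf_{V \in \mathcal{V}_h} \|u - V\| \le \|u - U\|$ because $U \in \mathcal{V}_h$, and that the boundedness of $a$ implies

\[ \forall W \in \mathcal{V}_h , W \ne 0 , \quad \frac{1}{C} \frac{|a(W, u - U)|}{\|W\|} \le \|u - U\| . \]

□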

Note that this result shows that the two terms in the upper bound on the error $\|u - U\|$ provide both upper and lower bounds on the error. The first term in the error bound, $\inf \|u - V\|$, measures the error in approximating functions in $H$ by functions in $\mathcal{V}_h$. The second term measures the extent to which the approximation $U$ violates the variational form for the true solution.

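Next, let us discuss how we might use this error estimate. For a second-order elliptic problem, we will take $H = W^1_2(\Omega)$. For homogeneous boundary conditions, the true solution $u$ lies in the Sobolev space $W^1_{2,0}(\Omega)$, and the finite element space $\mathcal{V}_h$ is not a subspace of this space. We will let $\mathcal{V}_h$ be the set of all continuous piecewise polynomials on $\Omega$ with respect to some triangulation, which are zero at the Lobatto boundary nodes. The only troublesome term in the error estimate is the supremum. For concreteness, suppose that

\[ a(v, u) = \int_\Omega \nabla_x v \cdot D \nabla_x u \, dx \]

where $D$ is positive definite and uniformly bounded, and

\[ \lambda(v) = \int_\Omega v f \, dx . \]

Note that the true solution $u$ solves

\[ \forall v \in \mathcal{V} , \quad a(v, u) = \lambda(v) . \]

If $u$ is sufficiently smooth, then integration by parts, the differential equation $-\nabla_x \cdot D \nabla_x u = f$ and the Galerkin equations $a(W, U) = \lambda(W)$ give

\[ \begin{aligned}
\forall W \in \mathcal{V}_h , \quad a(W, u - U) &= \int_\Omega \nabla_x W \cdot D \nabla_x (u - U) \, dx \\
&= \int_{\partial\Omega} W \, n \cdot D \nabla_x u \, ds - \int_\Omega W \, \nabla_x \cdot D \nabla_x u \, dx - \int_\Omega W f \, dx \\
&= \int_{\partial\Omega} W \, n \cdot D \nabla_x u \, ds .
\end{aligned} \]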

Scott shows that if our finite element space reproduces polynomials of degree $k$ exactly, then for $h$ sufficiently small

\[ \sup_{W \in \mathcal{V}_h, W \ne 0} \frac{|a(W, u - U)|}{\|W\|} \le C h^{k+1/2} \|u\|_{W^{k+1}_2(\Omega)} . \]

The other term in the error estimate is $O(h^k)$.


5.13 Finite Elements and Multigrid

5.13.1 Full Multigrid

The discussion in this section is due to [14, 62].

We will assume that we have a finite nested chain of finite dimensional vector spaces

\[ \mathcal{V}_C \subset \ldots \subset \mathcal{V}_c \subset \mathcal{V}_f \subset \ldots \subset \mathcal{V}_F \subset H^1_0(\Omega) \tag{5.13-1} \]

We will use the term "coarser" to refer to whichever of a pair of these subspaces is included in the other, the "finer" subspace. If $\mathcal{V}_c$ consists of piecewise linear functions on a union of triangles or tetrahedra, then $\mathcal{V}_f$ could be formed by subdividing each triangle into 4 triangles (or each tetrahedron into 8 tetrahedra) by connecting the midpoints of the sides, and considering all piecewise linear functions on the smaller triangles or tetrahedra.

Given $b, u^{(0)}$ in the finest space $\mathcal{V}_F$, and some symmetric positive-definite linear operator $A_F : \mathcal{V}_F \to \mathcal{V}_F$, we want to solve $A_F u = b$. Although these equations are posed in terms of functions, they will have interpretations in terms of vectors and matrices. We will describe the linear operator $A_F$ in section 5.13.3, and will make the matrix-vector interpretations in section 5.13.6. The properties of the linear operator $A_F$ will depend on the elliptic boundary value problem from which it is derived; we will describe our assumptions on this problem in section 5.13.2.

Given an initial guess $u^{(0)}$, we will approximately solve $A_F u = b$ by performing the iteration

for $k \ge 0$
    $r_F^{(k)} = A_F u^{(k)} - b$
    $u^{(k+1)} = u^{(k)} - M_F(r_F^{(k)})$

Here the multigrid operator $M_F$ is defined recursively by the following multigrid algorithm

if $\mathcal{V}_f \ne \mathcal{V}_C$ (5.13-2)
    $d_f^{(0)} = 0$ (5.13-3)
    $d_f^{(1)} = d_f^{(0)} + S_f^\top r_f$ (5.13-4)
    $d_f^{(2)} = d_f^{(1)} + M_c(R_{cf}(r_f - A_f d_f^{(1)}))$ (5.13-5)
    $M_f(r_f) \equiv d_f^{(3)} = d_f^{(2)} + S_f(r_f - A_f d_f^{(2)})$ (5.13-6)
else (5.13-7)
    $M_C(r_C) \equiv A_C^{-1} r_C$ . (5.13-8)


It is obvious that $M_f$ is linear for each subspace $\mathcal{V}_f$ in the chain (5.13-1). It is also obvious that if $M_F$ is nonsingular and $u^{(k)}$ converges to $u$, then $A_F u = b$.

This algorithm uses two additional linear operators. The smoother $S_f : \mathcal{V}_f \to \mathcal{V}_f$ is defined for all spaces $\mathcal{V}_f$, except possibly for the coarsest $\mathcal{V}_C$. The restriction $R_{cf} : \mathcal{V}_f \to \mathcal{V}_c$ is defined between all successive pairs of spaces in the chain (5.13-1). We will discuss the smoother operator in section 5.13.4, and the restriction operator in section 5.13.3.

The error in the approximate solution of the linear system by the multigrid iteration on each finite dimensional vector space in the chain is itself a linear operator. We will develop a recursive formula for this error operator, and estimate its norm in section 5.13.1.
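To make the recursion (5.13-2) through (5.13-8) concrete, here is a minimal matrix sketch in Python. It is an illustration under stated assumptions, not the implementation developed later in the text: the model problem is $-u'' = f$ on $(0, 1)$ with piecewise linear elements on uniform meshes, the nodal inner product is used (so $A_f$ becomes the scaled tridiagonal matrix and the restriction is half the transpose of linear interpolation), and the smoother is a Richardson step $S_f = I / \rho(A_f)$, which is symmetric. The names `laplacian`, `prolongation`, `multigrid` and `solve` are hypothetical.

```python
import numpy as np

def laplacian(n):
    """Matrix of A for -u'' with n interior nodes and the nodal inner product."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    return (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h**2

def prolongation(nc):
    """Linear interpolation from nc coarse interior nodes to 2*nc + 1 fine nodes."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

def multigrid(r, n):
    """Apply M_f to a residual r on the level with n interior nodes."""
    A = laplacian(n)
    if n == 1:                                   # coarsest space: exact solve
        return np.linalg.solve(A, r)
    S = 1.0 / np.max(np.linalg.eigvalsh(A))      # Richardson smoother, S^T = S
    nc = (n - 1) // 2
    P = prolongation(nc)
    R = 0.5 * P.T                                # adjoint of P for nodal inner products in 1D
    d1 = S * r                                   # pre-smoothing, as in (5.13-4)
    d2 = d1 + P @ multigrid(R @ (r - A @ d1), nc)  # coarse correction, as in (5.13-5)
    return d2 + S * (r - A @ d2)                 # post-smoothing, as in (5.13-6)

def solve(b, iterations=30):
    """Outer iteration u <- u - M_F(A_F u - b) starting from u = 0."""
    n = b.size
    A = laplacian(n)
    u = np.zeros(n)
    for _ in range(iterations):
        u = u - multigrid(A @ u - b, n)
    return u

b = np.ones(31)                                  # levels 31 -> 15 -> 7 -> 3 -> 1
u = solve(b)
print(np.linalg.norm(laplacian(31) @ u - b) / np.linalg.norm(b))
```

The level sizes must have the form $2^m - 1$ so that each coarse mesh is obtained by dropping every other node. Recomputing $\rho(A_f)$ on every call keeps the sketch short; a practical code would store the operators per level.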

5.13.2 Weak Formulation of the Problem

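We will consider two inner products on any space $\mathcal{V}$ in the chain (5.13-1). The first is the usual $L^2$ inner product:

\[ \forall V, W \in \mathcal{V} , \quad (V, W)_{\mathcal{V}} \equiv \int_\Omega V(x) W(x) \, dx . \]

The second inner product works with those spaces $\mathcal{V}$ that have nodal basis functions $V_n(x)$, $1 \le n \le N$ and associated nodes $x_n \in \mathbb{R}^d$, $1 \le n \le N$ so that

\[ \forall 1 \le m, n \le N , \quad V_n(x_m) = \delta_{mn} . \]

In such a case, we will let $h$ be the maximum length of any mesh element associated with $\mathcal{V}$, and define

\[ \forall V, W \in \mathcal{V} , \quad (V, W)_{\mathcal{V}} \equiv h^d \sum_{n=1}^N V(x_n) W(x_n) . \tag{5.13-9} \]

In either case, the norm associated with the inner product is

\[ \forall V \in \mathcal{V} , \quad \|V\|_{\mathcal{V}}^2 \equiv (V, V)_{\mathcal{V}} . \]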


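\[ \begin{aligned}
-\nabla_x \cdot D \nabla_x u + r u &= f \quad \text{in } \Omega \subset \mathbb{R}^d \\
u &= 0 \quad \text{on } \partial\Omega
\end{aligned} \]

We assume that the coefficient array $D(x) \in \mathbb{R}^{d \times d}$ is symmetric positive-definite and bounded,

\[ \exists \overline{\delta} \ge \underline{\delta} > 0 \ \forall x \in \Omega \ \forall u \in \mathbb{R}^d , u \ne 0 , \quad \underline{\delta} \le \frac{u^\top D(x) u}{u^\top u} \le \overline{\delta} , \tag{5.13-10} \]

and that the coefficient $r(x)$ is bounded and nonnegative,

\[ \exists \overline{\rho} \ge \underline{\rho} \ge 0 \ \forall x \in \Omega , \quad \underline{\rho} \le r(x) \le \overline{\rho} . \tag{5.13-11} \]

It follows that

\[ \forall v, w \in H^1(\Omega) , \quad B(v, w) \equiv \int_\Omega \nabla_x v \cdot D \nabla_x w + v r w \, dx \tag{5.13-12} \]

satisfies

\[ \forall V \in H^1(\Omega) , \quad \underline{\delta} |V|_1^2 \le B(V, V) \equiv \|V\|_B^2 \le (\overline{\delta} + \overline{\rho}) |V|_1^2 . \tag{5.13-13} \]

The Poincaré inequality (??) shows that $B$ is an inner product on $H^1_0(\Omega)$.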

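Also suppose that we have a finite dimensional subspace $\mathcal{V} \subset H^1_0(\Omega)$. Then the Galerkin method for approximating the solution to our problem seeks $U \in \mathcal{V}$ so that

\[ \forall V \in \mathcal{V} , \quad B(V, U) \equiv \int_\Omega \nabla_x V \cdot D \nabla_x U + V r U \, dx = \int_\Omega V f \, dx \equiv (f, V) . \tag{5.13-14} \]

We assume that $\mathcal{V}$ satisfies the approximation property

\[ \exists \delta_{\mathcal{V}} \ \forall w \in H^1_0(\Omega) , \quad \inf_{V \in \mathcal{V}} \|w - V\|_0 \le \delta_{\mathcal{V}} |w|_1 . \tag{5.13-15} \]

We expect that $\delta_{\mathcal{V}}$ is small, on the order of some positive power of the mesh width.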

5.13.3 Linear Operators

5.13.3.1 Differential Operator

Let us begin by defining a linear operator that represents both the left-hand side in the differential equation and the boundary condition.

Lemma 5.13-1: Let the assumptions of section 5.13.2 be valid. Suppose that $D(x) \in \mathbb{R}^{d \times d}$ satisfies the bounds (5.13-10), $r(x) \in \mathbb{R}$ satisfies the bounds (5.13-11), and the bilinear form $B : H^1_0(\Omega) \times H^1_0(\Omega) \to \mathbb{R}$ is defined by (5.13-12). Suppose that $\mathcal{V} \subset H^1_0(\Omega)$ is finite-dimensional. Define $A : \mathcal{V} \to \mathcal{V}$ by

\[ \forall V, W \in \mathcal{V} , \quad (AV, W)_{\mathcal{V}} = B(V, W) . \]

Then

1. $A$ is self-adjoint with respect to both the $(\cdot, \cdot)_{\mathcal{V}}$ inner product and the natural inner product $B$.

2. if $\mathcal{V}$ consists of piecewise linear functions, either on triangles satisfying the hypotheses of lemma 5.9-8 or on tetrahedra satisfying the hypotheses of lemma 5.9-13, then the spectral radius of $A$ satisfies $\rho(A) = O(h^{-2})$, where $h$ is the longest length of an edge of an element.


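Proof: Since $(\cdot, \cdot)_{\mathcal{V}}$ and $B(\cdot, \cdot)$ are symmetric in their arguments, $A$ is self-adjoint with respect to the $(\cdot, \cdot)_{\mathcal{V}}$ inner product:

\[ \forall V, W \in \mathcal{V} , \quad (AV, W)_{\mathcal{V}} = B(V, W) = B(W, V) = (AW, V)_{\mathcal{V}} = (V, AW)_{\mathcal{V}} . \]

Symmetry of $A$ with respect to $B$ is even easier to prove:

\[ \forall V, W \in \mathcal{V} , \quad B(AV, W) = (AV, AW)_{\mathcal{V}} = B(V, AW) . \]

We will prove the second claim for piecewise linear functions only, and in two dimensions only, but for both choices of $(\cdot, \cdot)_{\mathcal{V}}$. First, we consider the $L^2$ inner product. In this case,

\[ \rho(A) \equiv \sup_{V \in \mathcal{V}, V \ne 0} \frac{(AV, V)_{\mathcal{V}}}{(V, V)_{\mathcal{V}}} = \sup_{V \in \mathcal{V}, V \ne 0} \frac{\int_\Omega \nabla_x V \cdot D \nabla_x V + r V^2 \, dx}{\int_\Omega V^2 \, dx} . \]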

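Recall that corollary 5.9-8 showed that for triangles, if

\[ \exists C_1 \ \forall 1 \le e \le E , \quad |T_e| \ge C_1 h^2 \]

and

\[ \exists C_2 \ \forall 1 \le e \le E , \quad \sum_{i=1}^3 \ell_{ie}^2 \le C_2 |T_e| \]

then

\[ \frac{\sqrt{3}}{8 C_1 s_{\min}} \le h^2 \sup_{f \in C^0(\Omega) : f \text{ linear on each } T_e} \frac{\int_\Omega \|\nabla_x f\|^2 \, dx}{\int_\Omega f^2 \, dx} \le \frac{s_{\max}}{s_{\min}} \frac{C_2}{2 \beta_1 C_1} . \]

It follows that

\[ \rho(A) \le \sup_{V \in \mathcal{V}, V \ne 0} \frac{\overline{\delta} \int_\Omega \|\nabla_x V\|^2 \, dx + \overline{\rho} \int_\Omega V^2 \, dx}{\int_\Omega V^2 \, dx} \le \overline{\delta} \frac{s_{\max}}{s_{\min}} \frac{C_2}{2 \beta_1 C_1} h^{-2} + \overline{\rho} \]

and

\[ \rho(A) \ge \sup_{V \in \mathcal{V}, V \ne 0} \frac{\underline{\delta} \int_\Omega \|\nabla_x V\|^2 \, dx + \underline{\rho} \int_\Omega V^2 \, dx}{\int_\Omega V^2 \, dx} \ge \underline{\delta} \frac{\sqrt{3}}{8 C_1 s_{\min}} h^{-2} . \]

Similar results hold for tetrahedra, using corollary 5.9-13.

For the nodal inner product (5.13-9), we can use corollary 5.9-3 to bound the ratio $\int_\Omega V^2 \, dx / \|V\|_{\mathcal{V}}^2$ above and below by constants:

\[ \beta_1 d! \, s_{\min} \frac{\min_e |T_e|}{h^d} \le \frac{\int_\Omega V^2 \, dx}{\|V\|_{\mathcal{V}}^2} \le \beta_{d+1} d! \, s_{\max} \frac{\max_e |T_e|}{h^d} . \]

Multiplying the results for the $L^2$ inner product by these bounds, we obtain

\[ \underline{\delta} \frac{\sqrt{3}}{8} \beta_1 d! \le \rho(A) h^2 \le \left[ \overline{\rho} + \overline{\delta} \frac{s_{\max}}{s_{\min}} \frac{C_2}{2 \beta_1 C_1} \right] C_2 \beta_{d+1} s_{\max} d! . \]

□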
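The scaling $\rho(A) = O(h^{-2})$ is easy to observe numerically. The following Python sketch is an illustration in one space dimension rather than the two-dimensional setting of the proof: it assumes the model form $B(V, W) = \int_0^1 V' W' \, dx$ with piecewise linear elements on a uniform mesh, so that the stiffness and mass matrices are the familiar tridiagonal ones. The function names are hypothetical.

```python
import numpy as np

def stiffness_and_mass(n):
    """P1 stiffness and mass matrices on (0, 1) with n interior nodes."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    K = (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h
    M = (4.0 * np.eye(n) + np.diag(off, 1) + np.diag(off, -1)) * h / 6.0
    return h, K, M

def scaled_spectral_radii(n):
    """Return h^2 rho(A) for the L2 and for the nodal inner product."""
    h, K, M = stiffness_and_mass(n)
    # L2 inner product: (AV, W) = v^T A^T M w, so A = M^{-1} K.
    rho_l2 = np.max(np.linalg.eigvals(np.linalg.solve(M, K)).real)
    # Nodal inner product: (AV, W) = h (Av)^T w, so A = K / h.
    rho_nodal = np.max(np.linalg.eigvalsh(K / h))
    return h * h * rho_l2, h * h * rho_nodal

for n in (15, 31, 63):
    print(n, scaled_spectral_radii(n))
```

In this model the two scaled values stay bounded as the mesh is refined (they approach 12 and 4 from below), which is the behaviour the lemma asserts for shape-regular meshes.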

We expect that $\rho(A) = O(h^{-2})$ for other piecewise polynomial finite element spaces as well. The power $h^{-2}$ comes from the derivatives in the weak form $B$ of the differential operator, and the bounds on the spectral radius depend on the uniformity of the grid. Note that we will find matrix representations for the linear transformation $A$ in section 5.13.6. Afterward, it will be easier to compute bounds on the spectral radius via the Gerschgorin circle theorem.

5.13.3.2 $L^2$ Projection

The $L^2$ projection onto the subspace $\mathcal{V}$ will be useful in describing the operator form of the Galerkin equations (5.13-14), defining the restriction operator (see section 5.13.3.3), and relating the linear operators $A$ between successive pairs of subspaces belonging to the chain (5.13-1) (see section 5.13.1).

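Lemma 5.13-2: Suppose that $\Omega \subset \mathbb{R}^d$ is open, and $\mathcal{V} \subset H^1_0(\Omega)$ is finite dimensional. Define the linear operator $Q : H^{-1}_0(\Omega) \to \mathcal{V}$ by

\[ \forall v \in H^{-1}_0(\Omega) \ \forall W \in \mathcal{V} , \quad (Qv, W)_{\mathcal{V}} = (v, W) . \]

If $A$ is the linear operator defined in lemma 5.13-1, then for any $f \in H^{-1}_0(\Omega)$ the solution $U \in \mathcal{V}$ of the Galerkin equations (5.13-14) satisfies

\[ AU = Qf . \]

Furthermore, if $(\cdot, \cdot)_{\mathcal{V}}$ is the $L^2$ inner product, then $Q$ is a projection, and for any $v \in H^0(\Omega)$, $Qv$ is the best approximation to $v$ from the subspace $\mathcal{V}$:

\[ \forall v \in H^0(\Omega) \ \forall W \in \mathcal{V} , \quad \|v - Qv\|_0 \le \|v - W\|_0 . \]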

Proof: The Galerkin equations can be written in the form

\[ \forall v \in \mathcal{V} , \quad (v, AU)_{\mathcal{V}} = B(v, U) = (v, f) = (v, Qf)_{\mathcal{V}} . \]

In other words, $AU - Qf \in \mathcal{V}$ satisfies

\[ \forall V \in \mathcal{V} , \quad (V, AU - Qf)_{\mathcal{V}} = 0 . \]

Choosing $V = AU - Qf$ proves the first claim.

Now suppose that $(\cdot, \cdot)_{\mathcal{V}}$ is the $L^2$ inner product. To show that the linear operator $Q$ is a projection, we need only show that $Q^2 = Q$. For any $v \in H^{-1}_0(\Omega)$, we have $Qv \in \mathcal{V} \subset H^1_0(\Omega) \subset H^{-1}_0(\Omega)$, so $Q(Qv) \in \mathcal{V}$ as well. Since

\[ \forall v \in H^{-1}_0(\Omega) \ \forall W \in \mathcal{V} , \quad (QQv, W) = (Qv, W) \]

it follows that

\[ \forall v \in H^{-1}_0(\Omega) \ \forall W \in \mathcal{V} , \quad (QQv - Qv, W) = 0 . \]

We can choose $W = QQv - Qv$ to prove that $Q^2 = Q$, and thus that $Q$ is a projection.

The orthogonality of $v - Qv$ to $\mathcal{V}$ implies that the Pythagorean theorem holds:

\[ \forall v \in H^0(\Omega) \ \forall W \in \mathcal{V} , \quad \|v - W\|_0^2 = \|v - Qv\|_0^2 + \|Qv - W\|_0^2 . \]

We can throw away the second term on the right to show that $Qv$ is the best approximation to $v$ from the subspace $\mathcal{V}$. □

5.13.3.3 Prolongation and Restriction

Suppose that $(\cdot, \cdot)_{\mathcal{V}}$ is the nodal inner product defined in (5.13-9). Given two nested finite dimensional subspaces $\mathcal{V}_c \subset \mathcal{V}_f$ in the chain (5.13-1), we will find it useful to define prolongation and restriction operators $P_{fc} : \mathcal{V}_c \to \mathcal{V}_f$ and $R_{cf} : \mathcal{V}_f \to \mathcal{V}_c$. If $W_c \in \mathcal{V}_c$, the prolongation $P_{fc} W_c \in \mathcal{V}_f$ is that linear combination of the fine nodal basis functions with coefficients given by the coarse function $W_c$ evaluated at the fine nodes $x_{nf}$, $1 \le n \le N_f$. The restriction operator $R_{cf}$ is the adjoint of $P_{fc}$ in the following sense:

\[ \forall V_f \in \mathcal{V}_f \ \forall W_c \in \mathcal{V}_c , \quad (R_{cf} V_f, W_c)_{\mathcal{V}_c} = (V_f, P_{fc} W_c)_{\mathcal{V}_f} = (V_f, W_c)_{\mathcal{V}_f} . \tag{5.13-16} \]

By taking $V_f$ and $W_c$ to be nodal basis functions in the corresponding spaces, we can easily determine the entries of $R_{cf}$. See lemma 5.13-17 below for more details. For the $L^2$ inner product, we will take $R_{cf} = Q_c$.

5.13.3.4 Elliptic Projection

The elliptic projection will be used in lemma 5.13-8 to relate the linear operators $A$ on two successive subspaces in the chain (5.13-1). Establishing the bound (5.13-23) on the elliptic projection will be essential to our proof that the multigrid iteration converges.

Lemma 5.13-3: Suppose that $\Omega \subset \mathbb{R}^d$ is open, and $\mathcal{V} \subset H^1_0(\Omega)$ is finite dimensional. Let $B$ be the inner product defined in (5.13-14). Then

1. the linear operator $H : H^1_0(\Omega) \to \mathcal{V}$, defined by

\[ \forall v \in H^1_0(\Omega) \ \forall W \in \mathcal{V} , \quad B(Hv, W) = B(v, W) , \tag{5.13-17} \]

is a projection.

2. for any $v \in H^1_0(\Omega)$, $Hv$ is the best approximation to $v$ from the subspace $\mathcal{V}$:

\[ \forall v \in H^1_0(\Omega) \ \forall W \in \mathcal{V} , \quad \|v - Hv\|_B \le \|v - W\|_B \tag{5.13-18} \]

3. $H$ is self-adjoint with respect to the inner product $B$.

Proof: Since

\[ \forall v \in H^1_0(\Omega) \ \forall W \in \mathcal{V} , \quad B(HHv, W) = B(Hv, W) , \]

it follows that

\[ \forall v \in H^1_0(\Omega) \ \forall W \in \mathcal{V} , \quad B(HHv - Hv, W) = 0 . \]

We can choose $W = HHv - Hv$ to prove that $HH = H$, and thus that $H$ is a projection.

Note that the symmetry of $B$ implies that

\[ \forall v, w \in H^1_0(\Omega) , \quad B(Hv, w) = B(w, Hv) = B(Hw, Hv) = B(Hv, Hw) = B(v, Hw) . \]

Thus $H$ is self-adjoint with respect to $B$.

The orthogonality of $v - Hv$ to $\mathcal{V}$ implies the Pythagorean theorem

\[ \forall v \in H^1_0(\Omega) \ \forall W \in \mathcal{V} , \quad \|v - W\|_B^2 = \|v - Hv\|_B^2 + \|Hv - W\|_B^2 . \]

We can throw away the second term on the right to prove (5.13-18). □
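The three claims of this lemma can be checked on matrices. The Python sketch below is a hedged illustration, not part of the original development: it assumes the one-dimensional model form $B(V, W) = \int_0^1 V' W' \, dx$ with piecewise linear elements, represents the fine space by nodal coefficient vectors with stiffness matrix `K`, and represents the coarse subspace as the range of a linear interpolation matrix `P`. All function names are hypothetical.

```python
import numpy as np

def stiffness(n):
    """P1 stiffness matrix on (0, 1) with n interior nodes: B(V, W) = v^T K w."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    return (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h

def prolongation(nc):
    """Linear interpolation from nc coarse interior nodes to 2*nc + 1 fine nodes."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

def elliptic_projection(K, P):
    """Matrix of H on fine coefficients: B(Hv, W) = B(v, W) for all W in range(P)."""
    Kc = P.T @ K @ P
    return P @ np.linalg.solve(Kc, P.T @ K)

K = stiffness(15)
P = prolongation(7)
H = elliptic_projection(K, P)
print(np.allclose(H @ H, H))            # H is a projection
print(np.allclose(K @ H, (K @ H).T))    # H is self-adjoint with respect to B
```

Because $B(Hv, w) = (Hv)^\top K w$ in coordinates, self-adjointness with respect to $B$ is the statement that the matrix $K H$ is symmetric, which is what the second check tests.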



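Lemma 5.13-4: Suppose that

1. the diffusion coefficient $D(x)$ in $B$ satisfies the bounds (5.13-10),

2. the reaction coefficient $r(x)$ in $B$ satisfies the bounds (5.13-11),

3. the differential equation admits second-order regularity

\[ \exists C_B > 0 \ \forall z \in H^0(\Omega) \ \exists y \in H^2(\Omega) \cap H^1_0(\Omega) , \quad \forall v \in H^1_0(\Omega) \ B(v, y) = (v, z) \ \text{ and } \ \|y\|_{H^2(\Omega)} \le C_B \|z\|_{H^0(\Omega)} \tag{5.13-19} \]

4. the chain of finite-dimensional subspaces satisfies the approximation assumption

\[ \exists C_{\mathcal{V}} > 0 \ \forall \mathcal{V}_f \text{ in the chain } \forall y \in H^2(\Omega) , \quad \inf_{V_f \in \mathcal{V}_f} \|y - V_f\|_{H^1(\Omega)} \le C_{\mathcal{V}} h_f \|y\|_{H^2(\Omega)} \tag{5.13-20} \]

where $h_f$ is the mesh width associated with $\mathcal{V}_f$

5. there is a refinement ratio $r$ such that for each pair of successive subspaces $\mathcal{V}_c \subset \mathcal{V}_f$ in the chain

\[ \exists r > 1 \ \forall \mathcal{V}_c \subset \mathcal{V}_f , \quad h_c \le r h_f \]

6. the linear operators $A_f$, defined by lemma 5.13-1, have spectral radii satisfying

\[ \exists C_\rho > 0 \ \forall \mathcal{V}_f \text{ in the chain} , \quad \rho(A_f) \le C_\rho h_f^{-2} \tag{5.13-21} \]

7. the nodal norm is equivalent to the $H^0(\Omega)$ norm:

\[ \exists C_e > 0 \ \forall \mathcal{V}_f \text{ in the chain } \forall V_f \in \mathcal{V}_f , \quad \|V_f\|_{\mathcal{V}_f} \le C_e \|V_f\|_{H^0(\Omega)} \tag{5.13-22} \]

Then the elliptic projection, defined by equation (5.13-17), satisfies

\[ \forall V_f \in \mathcal{V}_f , \quad \|[I - H_c] V_f\|_{\mathcal{V}_f}^2 \le \frac{C_H}{\rho(A_f)} \|V_f\|_B^2 \tag{5.13-23} \]

where

\[ C_H = C_{\mathcal{V}}^2 C_B^2 C_e^2 r^2 C_\rho (\overline{\delta} + \overline{\rho}) . \]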

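Proof: We will first prove that

\[ \forall W_f \in \mathcal{V}_f , \quad \|[I - H_c] W_f\|_{\mathcal{V}_f} \le C_H h_f \|[I - H_c] W_f\|_B \tag{5.13-24} \]

where

\[ C_H \equiv C_{\mathcal{V}} C_B C_e r \sqrt{\overline{\delta} + \overline{\rho}} . \]

Suppose that $w \in H^1_0(\Omega)$. By the definition of $H_c$,

\[ \forall V_c \in \mathcal{V}_c , \quad B([I - H_c] w, V_c) = 0 . \]

Define $y_w \in H^2(\Omega) \cap H^1_0(\Omega)$ by

\[ \forall v \in H^1_0(\Omega) , \quad B(v, y_w) = (v, [I - H_c] w) . \]

Then for all $V_c \in \mathcal{V}_c$ we have

\[ \begin{aligned}
\|[I - H_c] w\|_{H^0(\Omega)}^2 &= ([I - H_c] w, [I - H_c] w) = B([I - H_c] w, y_w) \\
&= B([I - H_c] w, y_w - V_c) \le \|[I - H_c] w\|_B \|y_w - V_c\|_B
\end{aligned} \tag{5.13-25} \]

We can use the bounds (5.13-10) and (5.13-11) on the coefficients in the differential operator, the approximation assumption (5.13-20), and higher-order regularity (5.13-19) to estimate

\[ \begin{aligned}
\frac{1}{\sqrt{\overline{\delta} + \overline{\rho}}} \inf_{V_c \in \mathcal{V}_c} \|y_w - V_c\|_B &\le \inf_{V_c \in \mathcal{V}_c} \|y_w - V_c\|_{H^1(\Omega)} \\
&\le C_{\mathcal{V}} h_c \|y_w\|_{H^2(\Omega)} \le C_{\mathcal{V}} C_B h_c \|[I - H_c] w\|_{H^0(\Omega)} .
\end{aligned} \]

Inserting this result into (5.13-25) and cancelling $\|[I - H_c] w\|_{H^0(\Omega)}$ leads to

\[ \|[I - H_c] w\|_{H^0(\Omega)} \le C_{\mathcal{V}} C_B h_c \sqrt{\overline{\delta} + \overline{\rho}} \, \|[I - H_c] w\|_B . \]

In the case where the inner product on $\mathcal{V}_f$ is the $H^0 = L^2$ inner product, we have proved (5.13-24) with $C_e = 1$. For the nodal inner product, (5.13-24) follows from the assumption (5.13-22).

Let us prove the final claim. The definition of $A_f$ and inequality (5.13-24) imply that for all $W_f \in \mathcal{V}_f$,

\[ \begin{aligned}
\|[I - H_c] W_f\|_B^2 &= B([I - H_c] W_f, [I - H_c] W_f) = B([I - H_c] W_f, W_f) \\
&= ([I - H_c] W_f, A_f W_f)_{\mathcal{V}_f} \le \|[I - H_c] W_f\|_{\mathcal{V}_f} \|A_f W_f\|_{\mathcal{V}_f} \\
&\le C_H h_f \|[I - H_c] W_f\|_B \|A_f W_f\|_{\mathcal{V}_f} .
\end{aligned} \]

We can cancel $\|[I - H_c] W_f\|_B$ and use the bound (5.13-21) on the spectral radius of $A_f$ to get

\[ \begin{aligned}
\|[I - H_c] W_f\|_B &\le C_H h_f \|A_f W_f\|_{\mathcal{V}_f} = C_H h_f \left[ (A_f A_f^{1/2} W_f, A_f^{1/2} W_f)_{\mathcal{V}_f} \right]^{1/2} \\
&\le C_H h_f \left[ \rho(A_f) \|A_f^{1/2} W_f\|_{\mathcal{V}_f}^2 \right]^{1/2} = C_H h_f \sqrt{\rho(A_f)} \, \|W_f\|_B .
\end{aligned} \]

By combining this inequality with inequality (5.13-24) and the bound (5.13-21) on the spectral radius, we obtain

\[ \|[I - H_c] W_f\|_{\mathcal{V}_f} \le C_H h_f \|[I - H_c] W_f\|_B \le C_H^2 h_f^2 \sqrt{\rho(A_f)} \, \|W_f\|_B \le \frac{C_H^2 C_\rho}{\sqrt{\rho(A_f)}} \|W_f\|_B . \]

□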

5.13.4 Smoothers

Now that we have defined the important linear operators $A$ and $Q$, we will develop a simple iteration to solve the Galerkin equations $Au = Qf$ for $u \in \mathcal{V}$.

Lemma 5.13-5: Let $\mathcal{V} \subset H^1_0(\Omega)$ be finite-dimensional. Suppose that the linear operator $A : \mathcal{V} \to \mathcal{V}$ is given by lemma 5.13-1, the $L^2$ projection $Q : H^{-1}_0(\Omega) \to \mathcal{V}$ is given by lemma 5.13-2, and $f \in H^{-1}_0(\Omega)$. Suppose that $u \in \mathcal{V}$ satisfies $Au = Qf$. Given $u^{(0)} \in \mathcal{V}$, define $u^{(1)} \in \mathcal{V}$ by Richardson's iteration:

\[ u^{(1)} = u^{(0)} - (A u^{(0)} - Qf) / \rho(A) . \tag{5.13-26} \]

Then the error in Richardson's iteration satisfies the linear recurrence

\[ u^{(1)} - u = [I - A / \rho(A)] (u^{(0)} - u) \]

and the inequality

\[ \|u^{(1)} - u\|_B^2 \le \|u^{(0)} - u\|_B^2 - \|A (u^{(0)} - u)\|_{\mathcal{V}}^2 / \rho(A) . \]

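Proof: Let $e^{(k)} \equiv u^{(k)} - u$. It is easy to derive the recurrence for the error in Richardson's iteration:

\[ \begin{aligned}
e^{(1)} \equiv u^{(1)} - u &= u^{(0)} - u - (A u^{(0)} - Qf) / \rho(A) \\
&= e^{(0)} - A e^{(0)} / \rho(A) = [I - A / \rho(A)] e^{(0)} .
\end{aligned} \]

Note that by the definition of the spectral radius,

\[ \forall V \in \mathcal{V} , \quad \rho(A) \ge \frac{(AV, V)_{\mathcal{V}}}{(V, V)_{\mathcal{V}}} = \frac{B(V, V)}{(V, V)_{\mathcal{V}}} . \]

Thus

\[ \forall V \in \mathcal{V} , \quad \frac{B(V, V)}{\rho(A)^2} \le \frac{(V, V)_{\mathcal{V}}}{\rho(A)} . \]

Using this inequality with $V = A e^{(0)}$, we can estimate the natural norm of the error as follows:

\[ \begin{aligned}
\|e^{(1)}\|_B^2 &= \|[I - A / \rho(A)] e^{(0)}\|_B^2 \\
&= B([I - A / \rho(A)] e^{(0)}, [I - A / \rho(A)] e^{(0)}) \\
&= B(e^{(0)}, e^{(0)}) - 2 B(A e^{(0)}, e^{(0)}) / \rho(A) + B(A e^{(0)}, A e^{(0)}) / \rho(A)^2 \\
&= B(e^{(0)}, e^{(0)}) - 2 (A e^{(0)}, A e^{(0)})_{\mathcal{V}} / \rho(A) + B(A e^{(0)}, A e^{(0)}) / \rho(A)^2 \\
&\le B(e^{(0)}, e^{(0)}) - 2 (A e^{(0)}, A e^{(0)})_{\mathcal{V}} / \rho(A) + (A e^{(0)}, A e^{(0)})_{\mathcal{V}} / \rho(A) \\
&= \|e^{(0)}\|_B^2 - \|A e^{(0)}\|_{\mathcal{V}}^2 / \rho(A) .
\end{aligned} \]

□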
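The energy inequality of lemma 5.13-5 can be observed on a small example. The Python sketch below is illustrative only. It assumes the one-dimensional model form $B(V, W) = \int_0^1 V' W' \, dx$ with piecewise linear elements and the nodal inner product, so that in nodal coefficients $B(V, W) = v^\top K w$, $(V, W)_{\mathcal{V}} = h \, v^\top w$ and $A = K / h$. The vector `g` stands in for $Qf$, and all names are hypothetical.

```python
import numpy as np

def model_operators(n):
    """Return h, K (matrix of B) and A = K / h for the nodal inner product."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    K = (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h
    return h, K, K / h

def richardson_step(u0, A, g, rho):
    """One step of u1 = u0 - (A u0 - g) / rho(A), as in (5.13-26)."""
    return u0 - (A @ u0 - g) / rho

def energy_bound(n, seed=0):
    """Return both sides of the energy inequality for random data."""
    h, K, A = model_operators(n)
    rho = np.max(np.linalg.eigvalsh(A))
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)          # discrete solution of A u = g
    g = A @ u
    u0 = rng.standard_normal(n)         # initial guess
    u1 = richardson_step(u0, A, g, rho)
    e0, e1 = u0 - u, u1 - u
    lhs = e1 @ K @ e1                                        # ||e1||_B^2
    rhs = e0 @ K @ e0 - h * np.dot(A @ e0, A @ e0) / rho     # ||e0||_B^2 - ||A e0||_V^2 / rho
    return lhs, rhs

print(energy_bound(15))
```

The decrease in the energy norm is largest when $A e^{(0)}$ is large in the discrete norm, which is why this step damps oscillatory error components most strongly.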

Lemma 5.13-6: Suppose that $\mathcal{V} \subset H^1_0(\Omega)$ is finite dimensional. Suppose that $A : \mathcal{V} \to \mathcal{V}$ is the linear operator relating the natural inner product $B$ to the nodal inner product $(\cdot, \cdot)_{\mathcal{V}}$, as determined by lemma 5.13-1. Also suppose that $S : \mathcal{V} \to \mathcal{V}$ is a linear operator. If $S^\top$ is the adjoint of $S$ with respect to the nodal inner product $(\cdot, \cdot)_{\mathcal{V}}$, then the adjoint of $K \equiv I - SA$ with respect to the natural inner product $B$ is

\[ K^* \equiv I - S^\top A . \]

Proof: For any $W, V \in \mathcal{V}$ we can compute

\[ \begin{aligned}
B([I - SA] W, V) &= ([I - SA] W, AV)_{\mathcal{V}} = (W, AV)_{\mathcal{V}} - (AW, S^\top A V)_{\mathcal{V}} \\
&= (AW, [I - S^\top A] V)_{\mathcal{V}} = B(W, [I - S^\top A] V) .
\end{aligned} \]

□

The purpose of the following lemma is to relate the error in the smoother iteration to the error in Richardson's iteration, thereby widening the class of smoothers that we can use in the multigrid iteration.


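Lemma 5.13-7: Suppose that $\mathcal{V} \subset H^1_0(\Omega)$ is finite dimensional. Let the linear transformation $A : \mathcal{V} \to \mathcal{V}$ be described by lemma 5.13-1, and let $\rho(A)$ be the spectral radius of $A$. Also suppose that $S : \mathcal{V} \to \mathcal{V}$ is a linear operator, and let

\[ K \equiv I - SA , \quad K^* \equiv I - S^\top A . \]

Then

\[ \exists \omega > 0 \ \forall V \in \mathcal{V} , \quad 0 \le B(KV, KV) \le B\left( \left[ I - A \frac{\omega}{\rho(A)} \right] V, V \right) \tag{5.13-27} \]

if and only if

\[ \exists \omega > 0 \ \forall W \in \mathcal{V} , \quad (W, W)_{\mathcal{V}} \frac{\omega}{\rho(A)} \le ([S + S^\top - S^\top A S] W, W)_{\mathcal{V}} \le (W, A^{-1} W)_{\mathcal{V}} \tag{5.13-28} \]

Furthermore, if (5.13-28) is true, then

\[ \exists \omega > 0 \ \forall V \in \mathcal{V} , \quad \|A K^* V\|_{\mathcal{V}}^2 \frac{1}{\rho(A)} \le B([I - K K^*] V, V) \frac{1}{\omega} = [B(V, V) - B(K^* V, K^* V)] \frac{1}{\omega} \tag{5.13-29} \]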

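Proof: Given any $V \in \mathcal{V}$, let $W = AV \in \mathcal{V}$. Then

\[ 0 \le B([I - SA] V, [I - SA] V) \le B\left( \left[ I - A \frac{\omega}{\rho(A)} \right] V, V \right) \]

if and only if

\[ 0 \le B(V, V) - B(SAV, V) - B(V, SAV) + B(SAV, SAV) \le B(V, V) - B(AV, V) \frac{\omega}{\rho(A)} \]

if and only if

\[ (AV, AV)_{\mathcal{V}} \frac{\omega}{\rho(A)} \le (SAV, AV)_{\mathcal{V}} + (AV, SAV)_{\mathcal{V}} - (ASAV, SAV)_{\mathcal{V}} \le (AV, V)_{\mathcal{V}} \]

if and only if

\[ (W, W)_{\mathcal{V}} \frac{\omega}{\rho(A)} \le (SW, W)_{\mathcal{V}} + (W, SW)_{\mathcal{V}} - (ASW, SW)_{\mathcal{V}} \le (W, A^{-1} W)_{\mathcal{V}} . \]

This proves that (5.13-27) is equivalent to (5.13-28).

Next, we will prove (5.13-29). Let $V \in \mathcal{V}$ be arbitrary, and let $W = A K^* V$ in inequality (5.13-28). Then there exists $\omega > 0$ such that for all $V \in \mathcal{V}$ we have

\[ \begin{aligned}
\|A K^* V\|_{\mathcal{V}}^2 \frac{1}{\rho(A)} &\le ([S + S^\top - S^\top A S] A K^* V, A K^* V)_{\mathcal{V}} \frac{1}{\omega} \\
&= ([I - K^* K] K^* V, A K^* V)_{\mathcal{V}} \frac{1}{\omega} \\
&= B([I - K^* K] K^* V, K^* V) \frac{1}{\omega} = B(K [I - K^* K] K^* V, V) \frac{1}{\omega} .
\end{aligned} \]

Note that

\[ K [I - K^* K] K^* = [I - K K^*] K K^* \]

and that $K K^*$ is self-adjoint and non-negative with respect to $B$. Thus, for all $V \in \mathcal{V}$ we have

\[ \begin{aligned}
0 &\le B([I - K K^*] V, [I - K K^*] V) \\
&= B([I - K K^*] [I - K K^*] V, V) \\
&= B([I - K K^*] V, V) - B([I - K K^*] K K^* V, V) .
\end{aligned} \]

It follows that

\[ \forall V \in \mathcal{V} , \quad B([I - K K^*] K K^* V, V) \le B([I - K K^*] V, V) = B(V, V) - B(K^* V, K^* V) . \]

Thus there exists $\omega > 0$ such that for all $V \in \mathcal{V}$

\[ \begin{aligned}
\|A K^* V\|_{\mathcal{V}}^2 \frac{1}{\rho(A)} &\le B(K [I - K^* K] K^* V, V) \frac{1}{\omega} \\
&= B([I - K K^*] K K^* V, V) \frac{1}{\omega} \\
&\le [B(V, V) - B(K^* V, K^* V)] \frac{1}{\omega} .
\end{aligned} \]

□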

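Note that $\omega$ cannot be arbitrarily large. The right-hand side of (5.13-27) is nonnegative provided that

\[ \forall v \in \mathcal{V} , \quad B(v, v) \ge B(Av, v) \frac{\omega}{\rho(A)} . \]

Since $A$ is self-adjoint and positive-definite, this implies that

\[ \frac{\rho(A)}{\omega} \ge \sup_{v \in \mathcal{V}, v \ne 0} \frac{(Av, Av)_{\mathcal{V}}}{(Av, v)_{\mathcal{V}}} = \sup_{w \in \mathcal{V}, w \ne 0} \frac{(Aw, w)_{\mathcal{V}}}{(w, w)_{\mathcal{V}}} = \rho(A) . \]

(We took $w = A^{1/2} v$.) Thus we must have $0 < \omega \le 1$.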
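For a given smoother, the best constant $\omega$ in the left-hand inequality of (5.13-28) is $\rho(A)$ times the smallest eigenvalue of $S + S^\top - S^\top A S$. The Python sketch below computes it for two smoothers on a one-dimensional model problem. It is an illustration under assumptions: $A$ is the scaled tridiagonal matrix that arises from piecewise linear elements with the nodal inner product, for which the adjoint $S^\top$ is the matrix transpose, and Gauss-Seidel (the inverse of the lower triangle of $A$) is chosen here only as a familiar example of a non-symmetric smoother. The function name is hypothetical.

```python
import numpy as np

def smoother_omega(A, S):
    """Largest omega with (W, W) omega / rho(A) <= ([S + S^T - S^T A S] W, W)."""
    rho = np.max(np.linalg.eigvalsh(A))
    G = S + S.T - S.T @ A @ S
    G = 0.5 * (G + G.T)                 # symmetrize against rounding errors
    return rho * np.min(np.linalg.eigvalsh(G))

n = 15
h = 1.0 / (n + 1)
off = np.ones(n - 1)
A = (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h**2

S_richardson = np.eye(n) / np.max(np.linalg.eigvalsh(A))
S_gauss_seidel = np.linalg.inv(np.tril(A))

print(smoother_omega(A, S_richardson))    # Richardson's iteration
print(smoother_omega(A, S_gauss_seidel))  # Gauss-Seidel
```

Both values lie in $(0, 1]$, consistent with the bound just derived; Richardson's iteration attains $\omega = 1$ up to rounding.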


5.13.5 Multigrid Error

Lemma 5.13-8: Suppose that $\mathcal{V}_c \subset \mathcal{V}_f$ are two finite dimensional subspaces of $H^1_0(\Omega)$ in the chain (5.13-1). Suppose that the linear operators $A_f$ and $A_c$ are given by lemma 5.13-1 respectively via these spaces. Let $H_c$ be the elliptic projection defined in lemma 5.13-3 with respect to the space $\mathcal{V}_c$. Given $f \in H^{-1}_0(\Omega)$, let $u_f \in \mathcal{V}_f$ solve the Galerkin equations $A_f u_f = Q_f f$. Also suppose that the $L^2$ projections $Q_f$ and $Q_c$ are given by lemma 5.13-2. Then the coarse and fine operators are related by

\[ R_{cf} A_f = A_c H_c \tag{5.13-30} \]

where the restriction $R_{cf}$ was defined in equation (5.13-16) for the nodal inner product, and $R_{cf} = Q_c$ for the $L^2$ inner product.

Proof: Suppose that $(\cdot, \cdot)_{\mathcal{V}}$ is the $L^2$ inner product. The definitions of $A_f$ and $A_c$ in lemma 5.13-1, the definition of $Q_c$ in lemma 5.13-2 and the definition of $H_c$ in lemma 5.13-3 imply that

\[ \forall V \in \mathcal{V}_f \ \forall W \in \mathcal{V}_c , \quad (Q_c A_f V, W) = (A_f V, W) = B(V, W) = B(H_c V, W) = (A_c H_c V, W) . \]

This proves (5.13-30).

On the other hand, suppose that $(\cdot, \cdot)_{\mathcal{V}}$ is the nodal inner product (5.13-9). The definitions of $A_f$ and $A_c$ in lemma 5.13-1, the definition of $R_{cf}$ in equation (5.13-16) and the definition of $H_c$ in lemma 5.13-3 imply that

\[ \forall V \in \mathcal{V}_f \ \forall W \in \mathcal{V}_c , \quad (R_{cf} A_f V, W)_{\mathcal{V}_c} = (A_f V, W)_{\mathcal{V}_f} = B(V, W) = B(H_c V, W) = (A_c H_c V, W)_{\mathcal{V}_c} . \]

This proves (5.13-30). □
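The identity (5.13-30) and the adjoint relation (5.13-16) can both be checked with small matrices. The following Python sketch assumes the one-dimensional model form $B(V, W) = \int_0^1 V' W' \, dx$, piecewise linear elements on nested uniform meshes, and nodal inner products $h \, v^\top w$ on each level. Under these assumptions the prolongation is linear interpolation `P`, the restriction is `(hf / hc) * P.T`, and `Hc` maps fine coefficients to the coarse coefficients of the elliptic projection. All names are hypothetical.

```python
import numpy as np

def stiffness(n):
    """P1 stiffness matrix on (0, 1) with n interior nodes: B(V, W) = v^T K w."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    return (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h

def prolongation(nc):
    """Linear interpolation from nc coarse interior nodes to 2*nc + 1 fine nodes."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

nc = 7
nf = 2 * nc + 1
hc, hf = 1.0 / (nc + 1), 1.0 / (nf + 1)
Kf = stiffness(nf)
P = prolongation(nc)
Kc = P.T @ Kf @ P                      # B restricted to the coarse space
Af, Ac = Kf / hf, Kc / hc              # operators for the nodal inner products
R = (hf / hc) * P.T                    # adjoint of P between the nodal inner products
Hc = np.linalg.solve(Kc, P.T @ Kf)     # coarse coefficients of the elliptic projection

rng = np.random.default_rng(0)
vf, wc = rng.standard_normal(nf), rng.standard_normal(nc)
adjoint_gap = hc * np.dot(R @ vf, wc) - hf * np.dot(vf, P @ wc)
relation_gap = np.max(np.abs(R @ Af - Ac @ Hc))
print(adjoint_gap, relation_gap)
```

Both gaps vanish up to rounding. In this model the Galerkin coarse matrix `Kc` also coincides with the stiffness matrix assembled directly on the coarse mesh.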


Lemma 5.13-9: Suppose that $\mathcal{V}_c \subset \mathcal{V}_f \subset H^1_0(\Omega)$ are finite dimensional. Let the linear transformation $A_f : \mathcal{V}_f \to \mathcal{V}_f$ be described by lemma 5.13-1, and the natural projection $H_c : \mathcal{V}_f \to \mathcal{V}_c$ be described by lemma 5.13-3. Also suppose that $S_f : \mathcal{V}_f \to \mathcal{V}_f$ is a linear operator. Let $u, u_f \in \mathcal{V}_f$, and let $M_f$ be given by the multigrid algorithm (5.13-2). Then the error in the multigrid iteration is given by a linear transformation $E_f$, satisfying

\[ u_f - u - M_f(A_f(u_f - u)) = E_f(u_f - u) . \tag{5.13-31} \]

Furthermore, on the coarsest space we have $E_C = 0$; otherwise $E_f$ is given recursively by

\[ E_f = [I - S_f A_f] [I - H_c + E_c H_c] [I - S_f^\top A_f] . \tag{5.13-32} \]

Finally, for all subspaces $\mathcal{V}_f$ in the chain, $E_f$ is self-adjoint and non-negative with respect to the natural inner product $B$.

Proof: It is obvious from equation (5.13-31) and the linearity of the multigrid operator that $E_f$ is linear. Note that (5.13-31) implies that

$$\forall V_c \in \mathcal{V}_c , \quad M_c(A_c V_c) = V_c - E_c V_c . \qquad (5.13\text{-}33)$$

From the multigrid algorithm (5.13-2) we see that

$$[u_f - d_f^{(1)}] - u = [u_f - S_f^\top A_f (u_f - u)] - u = [I - S_f^\top A_f](u_f - u) .$$

From the multigrid algorithm and (5.13-33) we see that

$$\begin{aligned}
[u_f - d_f^{(2)}] - u &= u_f - d_f^{(1)} - u - M_c(R_{cf} A_f (u_f - d_f^{(1)} - u)) \\
&= u_f - d_f^{(1)} - u - M_c(A_c H_c (u_f - d_f^{(1)} - u)) \\
&= u_f - d_f^{(1)} - u - [H_c (u_f - d_f^{(1)} - u) - E_c H_c (u_f - d_f^{(1)} - u)] \\
&= [I - H_c + E_c H_c](u_f - d_f^{(1)} - u) .
\end{aligned}$$

Finally, from the multigrid algorithm we see that

$$\begin{aligned}
[u_f - d_f^{(3)}] - u &= [u_f - d_f^{(2)} - S_f^\top A_f (u_f - d_f^{(2)} - u)] - u \\
&= [I - S_f^\top A_f](u_f - d_f^{(2)} - u) .
\end{aligned}$$

The multigrid error operator recursion (5.13-32) follows from these three equations.

Since $M_C = A_C^{-1}$, equation (5.13-33) shows that $E_C = 0$.

We will prove that $E_f$ is self-adjoint and non-negative by induction. Note that $E_C = 0$ is self-adjoint and non-negative. Inductively, let us assume that $E_c$ is self-adjoint and non-negative with respect to $B$.

First, we will prove that $E_f$ is self-adjoint. Let us define

$$K_f \equiv I - S_f A_f .$$

Then lemma 5.13-6 shows that the adjoint of $K_f$ with respect to $B$ is $K_f^* = I - S_f^\top A_f$. Since $E_c$ is self-adjoint by the inductive hypothesis, and $H_c$ is self-adjoint with respect to $B$ by lemma 5.13-3, for all $V_f, W_f \in \mathcal{V}_f$ we have

$$\begin{aligned}
B(E_f V_f, W_f) &= B(K_f [I - H_c + E_c H_c] K_f^* V_f, W_f) \\
&= B([I - H_c + E_c H_c] K_f^* V_f, K_f^* W_f) \\
&= B(K_f^* V_f, [I - H_c] K_f^* W_f) + B(H_c K_f^* V_f, E_c K_f^* W_f) \\
&= B(K_f^* V_f, [I - H_c] K_f^* W_f) + B(K_f^* V_f, H_c E_c K_f^* W_f) \\
&= B(K_f^* V_f, [I - H_c + H_c E_c] K_f^* W_f) \\
&= B(V_f, K_f [I - H_c + H_c E_c] K_f^* W_f) .
\end{aligned}$$

Thus $E_f$ is self-adjoint with respect to $B$.

Finally, we will prove that $E_f$ is non-negative. For all $V_f \in \mathcal{V}_f$, lemma 5.13-6 and the definition (5.13-17) of the natural projection $H_c$ show that

$$\begin{aligned}
B(E_f V_f, V_f) &= B([I - S_f A_f][I - H_c + E_c H_c][I - S_f^\top A_f] V_f, V_f) \\
&= B([I - H_c + E_c H_c] K_f^* V_f, K_f^* V_f) \\
&= B(K_f^* V_f, K_f^* V_f) - B(H_c K_f^* V_f, K_f^* V_f) + B(E_c H_c K_f^* V_f, K_f^* V_f) \\
&= B(K_f^* V_f, K_f^* V_f) - B(H_c K_f^* V_f, K_f^* V_f) + B(E_c H_c K_f^* V_f, H_c K_f^* V_f) .
\end{aligned}$$

By the inductive hypothesis,

$$\forall W_c \in \mathcal{V}_c , \quad B(E_c W_c, W_c) \ge 0 ;$$

we will choose $W_c = H_c K_f^* V_f$. Since $H_c$ is a projection with respect to $B$,

$$\forall W_f \in \mathcal{V}_f , \quad B(H_c W_f, W_f) = B(H_c W_f, H_c W_f) = \|H_c W_f\|_B^2 \le \|W_f\|_B^2 = B(W_f, W_f) ;$$

we will choose $W_f = K_f^* V_f$. It follows from the last two inequalities that $E_f$ is non-negative with respect to $B$. □
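The structure of the error operator can be checked numerically on a small problem. The following is a minimal sketch, not the book's program: it assumes a two-level hierarchy (so that $E_c = 0$), the one-dimensional stiffness matrix for $-u''$ with piecewise linear elements, linear interpolation as the prolongation, the Euclidean inner product in place of the nodal or $L^2$ inner product, and a Richardson smoother. The function name `two_grid_error_operator` is hypothetical.

```python
import numpy as np

def two_grid_error_operator(n_coarse=7):
    """Return (A, E) with E = (I - S A)(I - H)(I - S^T A) for a 1D model problem."""
    n_fine = 2 * n_coarse + 1
    h = 1.0 / (n_fine + 1)
    # Fine-grid stiffness matrix for -u'' with piecewise linear elements.
    A = (2.0 * np.eye(n_fine) - np.eye(n_fine, k=1) - np.eye(n_fine, k=-1)) / h
    # Prolongation by linear interpolation; coarse node j sits at fine index 2j+1.
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    A_c = P.T @ A @ P                          # Galerkin coarse operator
    H = P @ np.linalg.solve(A_c, P.T @ A)      # projection, orthogonal in the A inner product
    rho = np.linalg.eigvalsh(A).max()
    S = np.eye(n_fine) / rho                   # Richardson smoother (assumption)
    I = np.eye(n_fine)
    E = (I - S @ A) @ (I - H) @ (I - S.T @ A)  # recursion (5.13-32) with E_c = 0
    return A, E

A, E = two_grid_error_operator()
AE = A @ E
eigenvalues = np.linalg.eigvals(E)
print("A E symmetric:", np.allclose(AE, AE.T))
print("eigenvalue range of E:", eigenvalues.real.min(), eigenvalues.real.max())
```

Symmetry of `A @ E` is the matrix statement that $E$ is self-adjoint in the energy inner product, and eigenvalues in $[0, 1)$ correspond to non-negativity together with convergence of the iteration.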


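Multigrid Convergence Theorem 5.13-10: Suppose that

$$\mathcal{V}_C \subset \ldots \subset \mathcal{V}_c \subset \mathcal{V}_f \subset \ldots \subset \mathcal{V}_F$$

is a chain of finite dimensional subspaces of $H^1_0(\Omega)$. Let $A_f : \mathcal{V}_f \to \mathcal{V}_f$ be the linear operator defined in lemma 5.13-1 for each $\mathcal{V}_f$ in the chain. For each $\mathcal{V}_f$ in the chain (except for the coarsest $\mathcal{V}_C$), suppose that $S_f : \mathcal{V}_f \to \mathcal{V}_f$ is a smoothing operator with error bounded by Richardson's iteration, meaning that it satisfies inequality (5.13-28) for some $\omega > 0$. For each $\mathcal{V}_c$ in the chain (except for the finest $\mathcal{V}_F$), suppose that $H_c$ is the natural projection, defined by (5.13-17), and satisfies the inequality

$$\exists C_H > 0 \ \forall \mathcal{V}_f \text{ in the chain } \forall W_f \in \mathcal{V}_f , \quad \|[I - H_c] W_f\|_{\mathcal{V}_f}^2 \le \frac{C_H}{\rho(A_f)} \|W_f\|_B^2 .$$

Let $E_f$ be the multigrid error operator on any $\mathcal{V}_f$ in the chain, given by $E_C = 0$ on the coarsest subspace and recursively by equation (5.13-32). Then for each $\mathcal{V}_f$ in the chain we have

$$\forall V_f \in \mathcal{V}_f , \quad B(E_f V_f, V_f) \le \frac{C_H}{C_H + \omega} B(V_f, V_f)$$

and

$$\forall V_f \in \mathcal{V}_f , \quad \|E_f V_f\|_B \le \frac{C_H}{C_H + \omega} \|V_f\|_B .$$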

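Proof: We will prove the result by induction. The result is trivially true for the coarsest space $\mathcal{V}_C$, since $E_C = 0$. Suppose that the claim is true for $\mathcal{V}_c \subset \mathcal{V}_f$; we will prove that it is true for $\mathcal{V}_f$.

Let $K_f = I - S_f A_f$ and recall that lemma 5.13-6 shows that the adjoint with respect to $B$ is $K_f^* = I - S_f^\top A_f$. Note that for all $V_f \in \mathcal{V}_f$,

$$\begin{aligned}
B(E_f V_f, V_f) &= B(K_f [I - H_c + E_c H_c] K_f^* V_f, V_f) \\
&= B([I - H_c + E_c H_c] K_f^* V_f, K_f^* V_f) \\
&= B([I - H_c] K_f^* V_f, K_f^* V_f) + B(E_c H_c K_f^* V_f, K_f^* V_f) \\
&= B([I - H_c] K_f^* V_f, K_f^* V_f) + B(E_c H_c K_f^* V_f, H_c K_f^* V_f) .
\end{aligned}$$

By the inductive hypothesis with $V_c = H_c K_f^* V_f$ we have

$$B(E_c H_c K_f^* V_f, H_c K_f^* V_f) \le \frac{C_H}{C_H + \omega} B(H_c K_f^* V_f, H_c K_f^* V_f) .$$

This leads to

$$\begin{aligned}
B(E_f V_f, V_f) &\le B([I - H_c] K_f^* V_f, K_f^* V_f) + \frac{C_H}{C_H + \omega} B(H_c K_f^* V_f, H_c K_f^* V_f) \\
&= B([I - H_c] K_f^* V_f, K_f^* V_f) - \frac{C_H}{C_H + \omega} B([I - H_c] K_f^* V_f, K_f^* V_f) \\
&\quad + \frac{C_H}{C_H + \omega} B([I - H_c] K_f^* V_f, K_f^* V_f) + \frac{C_H}{C_H + \omega} B(H_c K_f^* V_f, K_f^* V_f) \\
&= \frac{\omega}{C_H + \omega} B([I - H_c] K_f^* V_f, K_f^* V_f) + \frac{C_H}{C_H + \omega} B(K_f^* V_f, K_f^* V_f) . \qquad (5.13\text{-}34)
\end{aligned}$$

Since $I - H_c$ is a self-adjoint projection with respect to $B$, the Cauchy-Schwarz inequality implies that

$$\begin{aligned}
\|[I - H_c] K_f^* V_f\|_B^2 &= B([I - H_c] K_f^* V_f, [I - H_c] K_f^* V_f) \\
&= B([I - H_c]^2 K_f^* V_f, K_f^* V_f) \le \|[I - H_c]^2 K_f^* V_f\|_{\mathcal{V}_f} \|A_f K_f^* V_f\|_{\mathcal{V}_f} \\
&\le \sqrt{\frac{C_H}{\rho(A_f)}} \, \|[I - H_c] K_f^* V_f\|_B \|A_f K_f^* V_f\|_{\mathcal{V}_f} .
\end{aligned}$$

We can cancel $\|[I - H_c] K_f^* V_f\|_B$ on both sides of this inequality and square to get

$$\|[I - H_c] K_f^* V_f\|_B^2 \le \frac{C_H}{\rho(A_f)} \|A_f K_f^* V_f\|_{\mathcal{V}_f}^2 .$$

Together with inequality (5.13-34), assumption (5.13-28) on the smoother $S_f$, and lemma 5.13-7, this implies that for all $V_f \in \mathcal{V}_f$ we have

$$\begin{aligned}
B(E_f V_f, V_f) &\le \frac{\omega}{C_H + \omega} \frac{C_H}{\rho(A_f)} \|A_f K_f^* V_f\|_{\mathcal{V}_f}^2 + \frac{C_H}{C_H + \omega} B(K_f^* V_f, K_f^* V_f) \\
&\le \frac{C_H}{C_H + \omega} \left\{ B(V_f, V_f) - B(K_f^* V_f, K_f^* V_f) \right\} + \frac{C_H}{C_H + \omega} B(K_f^* V_f, K_f^* V_f) \\
&= \frac{C_H}{C_H + \omega} B(V_f, V_f) .
\end{aligned}$$

Since lemma 5.13-9 shows that $E_f$ is self-adjoint with respect to $B$, we have that for all $V_f \in \mathcal{V}_f$,

$$\|E_f^{1/2} V_f\|_B^2 \le \frac{C_H}{C_H + \omega} \|V_f\|_B^2 .$$

It follows that

$$\|E_f V_f\|_B^2 \le \frac{C_H}{C_H + \omega} \|E_f^{1/2} V_f\|_B^2 \le \left[ \frac{C_H}{C_H + \omega} \right]^2 \|V_f\|_B^2 .$$

□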


5.13.6 Matrix-Vector Forms

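In order to interpret the finite element form of the multigrid iteration, we will introduce some notation and assumptions. Let $\mathcal{V} \subset H^1_0(\Omega)$ be finite dimensional, and let $V_n(x)$, $1 \le n \le N$, be a basis for $\mathcal{V}$. Let $B$ be the bilinear form defined in (5.13-12) and assume that its coefficient $D(x)$ satisfies the bounds (5.13-10), and that its coefficient $r(x)$ satisfies the bounds (5.13-11). Let $\mathcal{A} : \mathcal{V} \to \mathcal{V}$ be the linear operator defined in lemma 5.13-1, and let $\mathcal{Q} : H^{-1}_0(\Omega) \to \mathcal{V}$ be the $L^2$ projection defined in lemma 5.13-2. Define the Gram matrix $\mathbf{G} \in \mathbb{R}^{N \times N}$ to have entries

$$\mathbf{G}_{mn} = (V_m, V_n)$$

and define the stiffness matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ to have entries

$$\mathbf{A}_{mn} = B(V_m, V_n) .$$

Suppose that $u, w \in \mathcal{V}$ can be written

$$u(x) = \sum_{n=1}^{N} V_n(x) \upsilon_n \quad \text{and} \quad w(x) = \sum_{n=1}^{N} V_n(x) \omega_n ,$$

and define the vectors

$$\mathbf{u} = \begin{bmatrix} \upsilon_1 \\ \vdots \\ \upsilon_N \end{bmatrix} \quad \text{and} \quad \mathbf{w} = \begin{bmatrix} \omega_1 \\ \vdots \\ \omega_N \end{bmatrix} .$$

Finally, suppose that $f \in H^{-1}_0(\Omega)$ and

$$\mathbf{f} = \begin{bmatrix} (V_1, f) \\ \vdots \\ (V_N, f) \end{bmatrix} .$$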

5.13.6.1 Matrix-Vector Forms with L2 Inner Product

We can develop the matrix-vector form of the multigrid equations with (·, ·)V given bythe L2 inner product as follows.

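Lemma 5.13-11: Under the assumptions at the beginning of section 5.13.6,

$$w = \mathcal{A} u \iff \mathbf{G} \mathbf{w} = \mathbf{A} \mathbf{u} ,$$

$$w = \mathcal{Q} f \iff \mathbf{G} \mathbf{w} = \mathbf{f}$$

and

$$\mathcal{A} u = \mathcal{Q} f \iff \mathbf{A} \mathbf{u} = \mathbf{f} .$$

Proof: Note that $w = \mathcal{A} u$ if and only if

$$\forall 1 \le m \le N , \quad \mathbf{e}_m^\top \mathbf{G} \mathbf{w} = \sum_{n=1}^{N} (V_m, V_n) \omega_n = (V_m, w) = (V_m, \mathcal{A} u) = B(V_m, u) = \sum_{n=1}^{N} B(V_m, V_n) \upsilon_n = \mathbf{e}_m^\top \mathbf{A} \mathbf{u} ,$$

where $\upsilon_n$ and $\omega_n$ denote the coefficients of $u$ and $w$ in the basis $V_n$. Next, $w = \mathcal{Q} f$ if and only if

$$\forall 1 \le m \le N , \quad \mathbf{e}_m^\top \mathbf{G} \mathbf{w} = \sum_{n=1}^{N} (V_m, V_n) \omega_n = (V_m, w) = (V_m, \mathcal{Q} f) = (V_m, f) = \mathbf{e}_m^\top \mathbf{f} .$$

The third claim follows from the first two. □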

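It is useful to note that although the operator satisfies $\rho(\mathcal{A}) = O(h^{-2})$, the stiffness matrix satisfies $\rho(\mathbf{A}) = O(h^{d-2})$. In fact, lemma 5.9-3 shows that the spectral radius of $\mathbf{G}$ is $\rho(\mathbf{G}) = O(h^d)$, so

$$\begin{aligned}
\rho(\mathbf{A}) &= \sup_{\mathbf{u} \ne 0} \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}} = \sup_{\mathbf{u} \ne 0} \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{G} \mathbf{u}} \, \frac{\mathbf{u}^\top \mathbf{G} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}} \le \sup_{\mathbf{u} \ne 0} \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{G} \mathbf{u}} \, \sup_{\mathbf{u} \ne 0} \frac{\mathbf{u}^\top \mathbf{G} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}} \\
&= \sup_{u \ne 0} \frac{(\mathcal{A} u, u)_V}{(u, u)_V} \, \rho(\mathbf{G}) = \rho(\mathcal{A}) \rho(\mathbf{G}) .
\end{aligned}$$

Recall that lemma 5.9-3 showed that the spectral radius of the Gram matrix is $O(h^d)$.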
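These scalings are easy to observe in one dimension. The sketch below assumes $d = 1$, the bilinear form $B(u,v) = \int u'v'\,dx$ (that is, $D = 1$ and $r = 0$), piecewise linear elements on a uniform mesh of $(0,1)$, and the $L^2$ inner product; the function name `mass_and_stiffness` is hypothetical. Under these assumptions the scaled quantities $h\,\rho(\mathbf{A})$, $\rho(\mathbf{G})/h$ and $h^2 \rho(\mathbf{G}^{-1}\mathbf{A})$ should each approach a constant as $h \to 0$.

```python
import numpy as np

def mass_and_stiffness(n):
    """Gram and stiffness matrices for n interior hat functions on a uniform mesh of (0,1)."""
    h = 1.0 / (n + 1)
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    G = h * (4.0 * np.eye(n) + off) / 6.0   # (V_m, V_n) in L2
    A = (2.0 * np.eye(n) - off) / h         # B(V_m, V_n) with D = 1, r = 0
    return h, G, A

for n in (15, 31, 63):
    h, G, A = mass_and_stiffness(n)
    rho_A = np.linalg.eigvalsh(A).max()
    rho_G = np.linalg.eigvalsh(G).max()
    # The operator's spectral radius is the largest eigenvalue of G^{-1} A.
    rho_op = np.linalg.eigvals(np.linalg.solve(G, A)).real.max()
    print(n, h * rho_A, rho_G / h, h**2 * rho_op, rho_A <= rho_op * rho_G)
```

The last printed column checks the inequality $\rho(\mathbf{A}) \le \rho(\mathcal{A})\rho(\mathbf{G})$ derived above.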

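Lemma 5.13-12: Under the assumptions at the beginning of section 5.13.6, suppose that $d^{(k)} \in \mathcal{V}$, $k = 0, 1$ can be written

$$d^{(k)}(x) = \sum_{n=1}^{N} V_n(x) \delta_n^{(k)} , \quad k = 0, 1 .$$

Define the vectors

$$\mathbf{d}^{(k)} = \begin{bmatrix} \delta_1^{(k)} \\ \vdots \\ \delta_N^{(k)} \end{bmatrix} , \quad k = 0, 1 .$$

Then Richardson's iteration takes the form

$$d^{(1)} = d^{(0)} + (\mathcal{Q} r - \mathcal{A} d^{(0)}) / \rho(\mathcal{A}) \iff \mathbf{G} \mathbf{d}^{(1)} = \mathbf{G} \mathbf{d}^{(0)} + [\mathbf{r} - \mathbf{A} \mathbf{d}^{(0)}] / \rho(\mathcal{A}) .$$

Proof: For $1 \le m \le N$,

$$\begin{aligned}
&\mathbf{e}_m^\top \left\{ \mathbf{G} \mathbf{d}^{(1)} - \mathbf{G} \mathbf{d}^{(0)} - [\mathbf{r} - \mathbf{A} \mathbf{d}^{(0)}] / \rho(\mathcal{A}) \right\} \\
&= \sum_{n=1}^{N} (V_m, V_n) \delta_n^{(1)} - \sum_{n=1}^{N} (V_m, V_n) \delta_n^{(0)} + \left[ \sum_{n=1}^{N} B(V_m, V_n) \delta_n^{(0)} - (V_m, r) \right] / \rho(\mathcal{A}) \\
&= (V_m, d^{(1)}) - (V_m, d^{(0)}) + [(V_m, \mathcal{A} d^{(0)}) - (V_m, r)] / \rho(\mathcal{A}) \\
&= (V_m, d^{(1)} - d^{(0)} - [r - \mathcal{A} d^{(0)}] / \rho(\mathcal{A})) .
\end{aligned}$$

□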
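To make the role of the Gram matrix concrete, here is a small sketch of Richardson's iteration in the matrix-vector form just stated. It is an illustration under assumptions, not the book's code: one space dimension, $B(u,v) = \int u'v'\,dx$, piecewise linear elements on a uniform mesh of $(0,1)$, right-hand side $r \equiv 1$, and $\rho(\mathcal{A})$ computed as the largest eigenvalue of $\mathbf{G}^{-1}\mathbf{A}$. The helper names `richardson_step` and `energy_error` are hypothetical.

```python
import numpy as np

n = 31
h = 1.0 / (n + 1)
off = np.eye(n, k=1) + np.eye(n, k=-1)
G = h * (4.0 * np.eye(n) + off) / 6.0     # Gram matrix for the L2 inner product
A = (2.0 * np.eye(n) - off) / h           # stiffness matrix
rho = np.linalg.eigvals(np.linalg.solve(G, A)).real.max()  # spectral radius of the operator
r = h * np.ones(n)                        # entries (V_m, r) for r(x) = 1
d_exact = np.linalg.solve(A, r)

def richardson_step(d):
    """One step of G d_new = G d + (r - A d) / rho; note the solve with G."""
    return d + np.linalg.solve(G, r - A @ d) / rho

def energy_error(d):
    e = d - d_exact
    return float(e @ A @ e)

d = np.zeros(n)
errors = [energy_error(d)]
for _ in range(20):
    d = richardson_step(d)
    errors.append(energy_error(d))
print(errors[0], errors[-1])
```

Every step requires a linear solve with $\mathbf{G}$, which is the extra work that the nodal inner product avoids.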

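Lemma 5.13-13: Let $\mathcal{V}_c \subset \mathcal{V}_f$ be two finite dimensional subspaces of $H^1_0(\Omega)$. Suppose that $V_{nc}(x)$, $1 \le n \le N_c$ is a basis for $\mathcal{V}_c$, and that $V_{nf}(x)$, $1 \le n \le N_f$ is a basis for $\mathcal{V}_f$. Define the matrix $\mathbf{G}_{fc} \in \mathbb{R}^{N_f \times N_c}$ to have entries

$$\mathbf{e}_m^\top \mathbf{G}_{fc} \mathbf{e}_n = (V_{mf}, V_{nc}) .$$

Suppose that $u_c \in \mathcal{V}_c$ and $u_f \in \mathcal{V}_f$ can be written

$$u_c(x) = \sum_{n=1}^{N_c} V_{nc}(x) \omega_{nc} \quad \text{and} \quad u_f(x) = \sum_{n=1}^{N_f} V_{nf}(x) \omega_{nf} .$$

Define the vectors

$$\mathbf{u}_c = \begin{bmatrix} \omega_{1c} \\ \vdots \\ \omega_{N_c c} \end{bmatrix} \quad \text{and} \quad \mathbf{u}_f = \begin{bmatrix} \omega_{1f} \\ \vdots \\ \omega_{N_f f} \end{bmatrix} .$$

1. Define the Gram matrix $\mathbf{G}_f \in \mathbb{R}^{N_f \times N_f}$ to have entries $(V_{mf}, V_{nf})$. Then

$$u_c = u_f \iff \mathbf{G}_f \mathbf{u}_f = \mathbf{G}_{fc} \mathbf{u}_c .$$

2. Define the Gram matrix $\mathbf{G}_c \in \mathbb{R}^{N_c \times N_c}$ to have entries $(V_{mc}, V_{nc})$. Then

$$u_c = \mathcal{Q}_c u_f \iff \mathbf{G}_c \mathbf{u}_c = \mathbf{G}_{fc}^\top \mathbf{u}_f .$$

Proof: To prove the first claim, note that if $u_c = u_f$ then for $1 \le m \le N_f$,

$$\mathbf{e}_m^\top \mathbf{G}_f \mathbf{u}_f = \sum_{n=1}^{N_f} (V_{mf}, V_{nf} \omega_{nf}) = (V_{mf}, u_f) = (V_{mf}, u_c) = \sum_{n=1}^{N_c} (V_{mf}, V_{nc} \omega_{nc}) = \mathbf{e}_m^\top \mathbf{G}_{fc} \mathbf{u}_c .$$

Thus $u_c = u_f$ implies that $\mathbf{G}_f \mathbf{u}_f = \mathbf{G}_{fc} \mathbf{u}_c$, and $\mathbf{G}_f \mathbf{u}_f = \mathbf{G}_{fc} \mathbf{u}_c$ implies that $(V_{mf}, u_c - u_f) = 0$ for all basis functions $V_{mf}$ of $\mathcal{V}_f$. This proves the first claim.

To prove the second claim, note that $u_c = \mathcal{Q}_c u_f$ if and only if for $1 \le m \le N_c$,

$$\mathbf{e}_m^\top \mathbf{G}_c \mathbf{u}_c = \sum_{n=1}^{N_c} (V_{mc}, V_{nc} \omega_{nc}) = (V_{mc}, u_c) = (V_{mc}, \mathcal{Q}_c u_f) = (V_{mc}, u_f) = \sum_{n=1}^{N_f} (V_{mc}, V_{nf} \omega_{nf}) = \mathbf{e}_m^\top \mathbf{G}_{fc}^\top \mathbf{u}_f .$$

□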

Putting all of our results in this section together yields the following.


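Corollary 5.13-14: Let $\mathcal{V}_c \subset \mathcal{V}_f$ be two finite dimensional subspaces of $H^1_0(\Omega)$ corresponding to piecewise linear functions on triangles or tetrahedra. Suppose that $V_{nc}(x)$, $1 \le n \le N_c$ is a basis for $\mathcal{V}_c$, and that $V_{nf}(x)$, $1 \le n \le N_f$ is a basis for $\mathcal{V}_f$. Let $\mathcal{A}_f : \mathcal{V}_f \to \mathcal{V}_f$ and $\mathcal{A}_c : \mathcal{V}_c \to \mathcal{V}_c$ be the linear operators defined in lemma 5.13-1, and let $\mathcal{Q}_f : H^{-1}_0(\Omega) \to \mathcal{V}_f$ and $\mathcal{Q}_c : H^{-1}_0(\Omega) \to \mathcal{V}_c$ be the $L^2$ projections defined in lemma 5.13-2. Define the matrix $\mathbf{G}_{fc} \in \mathbb{R}^{N_f \times N_c}$ to have entries $(V_{mf}, V_{nc})$. Define the Gram matrices $\mathbf{G}_f \in \mathbb{R}^{N_f \times N_f}$ and $\mathbf{G}_c \in \mathbb{R}^{N_c \times N_c}$ to have entries $(V_{mf}, V_{nf})$ and $(V_{mc}, V_{nc})$, respectively. Define the stiffness matrices $\mathbf{A}_f \in \mathbb{R}^{N_f \times N_f}$ and $\mathbf{A}_c \in \mathbb{R}^{N_c \times N_c}$ to have entries $B(V_{mf}, V_{nf})$ and $B(V_{mc}, V_{nc})$, respectively. Suppose that $r \in H^{-1}_0(\Omega)$ and define the vector $\mathbf{r}_f \in \mathbb{R}^{N_f}$ by

$$\mathbf{r}_f = \begin{bmatrix} (V_{1f}, r) \\ \vdots \\ (V_{N_f f}, r) \end{bmatrix} .$$

Suppose that $d_f^{(k)} \in \mathcal{V}_f$, $0 \le k \le 3$ can be written

$$d_f^{(k)}(x) = \sum_{n=1}^{N_f} V_{nf}(x) \delta_n^{(k)}$$

and define the vectors $\mathbf{d}_f^{(k)} \in \mathbb{R}^{N_f}$ for $0 \le k \le 3$ by

$$\mathbf{d}_f^{(k)} = \begin{bmatrix} \delta_1^{(k)} \\ \vdots \\ \delta_{N_f}^{(k)} \end{bmatrix} .$$

Then the multigrid iteration

$$\begin{aligned}
d_f^{(1)} &= d_f^{(0)} + (\mathcal{Q}_f r - \mathcal{A}_f d_f^{(0)}) / \rho(\mathcal{A}_f) \\
d_f^{(2)} &= d_f^{(1)} + d_c \quad \text{where } d_c = M_c(\mathcal{Q}_c(\mathcal{Q}_f r - \mathcal{A}_f d_f^{(1)})) \\
d_f^{(3)} &= d_f^{(2)} + (\mathcal{Q}_f r - \mathcal{A}_f d_f^{(2)}) / \rho(\mathcal{A}_f)
\end{aligned}$$

has the equivalent matrix-vector form

$$\begin{aligned}
\mathbf{d}_f^{(1)} &= \mathbf{d}_f^{(0)} + \mathbf{G}_f^{-1} (\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(0)}) / \rho(\mathcal{A}_f) \\
\mathbf{d}_f^{(2)} &= \mathbf{d}_f^{(1)} + \mathbf{G}_f^{-1} \mathbf{G}_{fc} \mathbf{d}_c \quad \text{where } \mathbf{d}_c = M_c(\mathbf{G}_c^{-1} \mathbf{G}_{fc}^\top (\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(1)})) \\
\mathbf{d}_f^{(3)} &= \mathbf{d}_f^{(2)} + \mathbf{G}_f^{-1} (\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(2)}) / \rho(\mathcal{A}_f)
\end{aligned}$$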


Obviously, the appearance of the Gram matrices in the multigrid algorithm significantly increases the work required in each step. We will see how to modify the algorithm to avoid these matrices in the next section.

5.13.6.2 Matrix-Vector Forms with Nodal Inner Product

We can develop the matrix-vector form of the multigrid equations with (·, ·)V given by the nodal inner product as follows.

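Lemma 5.13-15: Under the assumptions at the beginning of section 5.13.6,

$$w = \mathcal{A} u \iff \mathbf{w} = \mathbf{A} \mathbf{u} ,$$

$$w = \mathcal{Q} f \iff \mathbf{w} = \mathbf{f}$$

and

$$\mathcal{A} u = \mathcal{Q} f \iff \mathbf{A} \mathbf{u} = \mathbf{f} .$$

Proof: Note that $w = \mathcal{A} u$ if and only if

$$\forall 1 \le m \le N , \quad h^d \mathbf{e}_m^\top \mathbf{w} = h^d \sum_{n=1}^{N} \delta_{mn} \omega_n = (V_m, w)_V = (V_m, \mathcal{A} u)_V = B(V_m, u) = h^d \sum_{n=1}^{N} B(V_m, V_n) \upsilon_n = h^d \mathbf{e}_m^\top \mathbf{A} \mathbf{u} ,$$

where $\upsilon_n$ and $\omega_n$ denote the coefficients of $u$ and $w$ in the basis $V_n$. Next, $w = \mathcal{Q} f$ if and only if

$$\forall 1 \le m \le N , \quad h^d \mathbf{e}_m^\top \mathbf{w} = h^d \sum_{n=1}^{N} \delta_{mn} \omega_n = (V_m, w)_V = (V_m, \mathcal{Q} f)_V = (V_m, f)_V = h^d \mathbf{e}_m^\top \mathbf{f} .$$

The third claim follows from the first two. □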

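It is useful to note that

$$h^d \rho(\mathcal{A}) \equiv \sup_{v \in \mathcal{V}, v \ne 0} \frac{(\mathcal{A} v, v)_V}{h^{-d} (v, v)_V} = \sup_{\mathbf{v} \in \mathbb{R}^N, \mathbf{v} \ne 0} \frac{\mathbf{v}^\top \mathbf{A} \mathbf{v}}{\mathbf{v}^\top \mathbf{v}} = \rho(\mathbf{A}) ,$$

so $\rho(\mathbf{A}) = O(h^{d-2})$.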

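Lemma 5.13-16: Under the assumptions at the beginning of section 5.13.6, suppose that $d^{(k)} \in \mathcal{V}$, $k = 0, 1$ can be written

$$d^{(k)}(x) = \sum_{n=1}^{N} V_n(x) \delta_n^{(k)} , \quad k = 0, 1 .$$

Define the vectors

$$\mathbf{d}^{(k)} = \begin{bmatrix} \delta_1^{(k)} \\ \vdots \\ \delta_N^{(k)} \end{bmatrix} , \quad k = 0, 1 .$$

Then

$$d^{(1)} = d^{(0)} + (\mathcal{Q} r - \mathcal{A} d^{(0)}) / \rho(\mathcal{A}) \iff \mathbf{d}^{(1)} = \mathbf{d}^{(0)} + [\mathbf{r} - \mathbf{A} \mathbf{d}^{(0)}] / \rho(\mathbf{A}) .$$

Proof: For $1 \le m \le N$,

$$\begin{aligned}
&h^d \mathbf{e}_m^\top \left\{ \mathbf{d}^{(1)} - \mathbf{d}^{(0)} - [\mathbf{r} - \mathbf{A} \mathbf{d}^{(0)}] / \rho(\mathbf{A}) \right\} \\
&= \sum_{n=1}^{N} (V_m, V_n)_V \delta_n^{(1)} - \sum_{n=1}^{N} (V_m, V_n)_V \delta_n^{(0)} - \left[ (V_m, r)_V - h^d \sum_{n=1}^{N} B(V_m, V_n) \delta_n^{(0)} \right] / \rho(\mathbf{A}) \\
&= (V_m, d^{(1)})_V - (V_m, d^{(0)})_V + [(V_m, \mathcal{A} d^{(0)})_V - h^{-d} (V_m, r)_V] / \rho(\mathcal{A}) \\
&= (V_m, d^{(1)} - d^{(0)} - [\mathcal{Q} r - \mathcal{A} d^{(0)}] / \rho(\mathcal{A})) .
\end{aligned}$$

□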


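Lemma 5.13-17: Let $\mathcal{V}_c \subset \mathcal{V}_f$ be two finite dimensional subspaces of $H^1_0(\Omega)$ with mesh widths $h_c$ and $h_f$, respectively. Suppose that $V_{nc}(x)$, $1 \le n \le N_c$ is a basis for $\mathcal{V}_c$, and that $V_{nf}(x)$, $1 \le n \le N_f$ is a basis for $\mathcal{V}_f$. Let $x_{nf}$, $1 \le n \le N_f$ be the fine nodes. Suppose that the prolongation operator $\mathcal{P}_{fc} : \mathcal{V}_c \to \mathcal{V}_f$ and the restriction operator $\mathcal{R}_{cf} : \mathcal{V}_f \to \mathcal{V}_c$ are defined by (5.13-16). Define the matrix $\mathbf{P}_{fc} \in \mathbb{R}^{N_f \times N_c}$ to have entries

$$\forall 1 \le n \le N_f \ \forall 1 \le m \le N_c , \quad \mathbf{e}_n^\top \mathbf{P}_{fc} \mathbf{e}_m = V_{mc}(x_{nf}) ,$$

and the matrix $\mathbf{R}_{cf} \in \mathbb{R}^{N_c \times N_f}$ to have entries

$$\forall 1 \le n \le N_f \ \forall 1 \le m \le N_c , \quad \mathbf{e}_m^\top \mathbf{R}_{cf} \mathbf{e}_n = V_{mc}(x_{nf}) h_c^d / h_f^d .$$

Then

$$u_c = u_f \iff \mathbf{u}_f = \mathbf{P}_{fc} \mathbf{u}_c$$

and

$$u_c = \mathcal{R}_{cf} u_f \iff \mathbf{u}_c = \mathbf{R}_{cf} \mathbf{u}_f .$$

Proof: Recall that we assume that the basis functions satisfy $V_{nf}(x_{mf}) = \delta_{mn}$. Suppose that $u_f(x) = \sum_{n=1}^{N_f} V_{nf}(x) \omega_{nf}$ and $u_c(x) = \sum_{m=1}^{N_c} V_{mc}(x) \omega_{mc}$. To prove the first claim, note that for $1 \le k \le N_f$,

$$h_f^d \mathbf{e}_k^\top \mathbf{u}_f = h_f^d \sum_{n=1}^{N_f} V_{kf}(x_{nf}) u_f(x_{nf}) = h_f^d u_f(x_{kf})$$

and

$$h_f^d \mathbf{e}_k^\top [\mathbf{P}_{fc} \mathbf{u}_c] = h_f^d \sum_{m=1}^{N_c} V_{mc}(x_{kf}) \omega_{mc} = h_f^d u_c(x_{kf}) .$$

Thus $u_c = u_f$ if and only if $\mathbf{u}_f = \mathbf{P}_{fc} \mathbf{u}_c$. This proves the first claim.

To prove the second claim, note that

$$(V_{mc}, u_c)_{\mathcal{V}_c} = h_c^d \sum_{k=1}^{N_c} V_{mc}(x_{kc}) u_c(x_{kc}) = h_c^d u_c(x_{mc}) = h_c^d \mathbf{e}_m^\top \mathbf{u}_c$$

and

$$(V_{mc}, u_f)_{\mathcal{V}_f} = h_f^d \sum_{n=1}^{N_f} V_{mc}(x_{nf}) \omega_{nf} = h_c^d \mathbf{e}_m^\top [\mathbf{R}_{cf} \mathbf{u}_f] .$$

Thus $u_c = \mathcal{R}_{cf} u_f$ if and only if $\mathbf{u}_c = \mathbf{R}_{cf} \mathbf{u}_f$. □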


Note that these prolongation and restriction operators have absolutely nothing to do with the matrices Af and Ac. In other words, these prolongation and restriction operators are independent of the differential equation, although they do depend on the finite element basis functions. Thus these operators are potentially different from the algebraic multigrid prolongation and restriction discussed in section 3.10.5.

Putting all of our results in this section together yields the following.


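Corollary 5.13-18: Let $\mathcal{V}_c \subset \mathcal{V}_f$ be two finite dimensional subspaces of $H^1_0(\Omega)$ corresponding to piecewise linear functions on triangles or tetrahedra. Suppose that $V_{nc}(x)$, $1 \le n \le N_c$ is a basis for $\mathcal{V}_c$, and that $V_{nf}(x)$, $1 \le n \le N_f$ is a basis for $\mathcal{V}_f$. Let $\mathcal{A}_f : \mathcal{V}_f \to \mathcal{V}_f$ and $\mathcal{A}_c : \mathcal{V}_c \to \mathcal{V}_c$ be the linear operators defined in lemma 5.13-1, and let $\mathcal{Q}_f : H^{-1}_0(\Omega) \to \mathcal{V}_f$ be the $L^2$ projection defined in lemma 5.13-2. Define the stiffness matrices $\mathbf{A}_f \in \mathbb{R}^{N_f \times N_f}$ and $\mathbf{A}_c \in \mathbb{R}^{N_c \times N_c}$ to have entries $B(V_{mf}, V_{nf})$ and $B(V_{mc}, V_{nc})$, respectively. Suppose that

$$r_f = \sum_{n=1}^{N_f} V_{nf}(x) \rho_{nf} \in \mathcal{V}_f$$

and define the vector $\mathbf{r}_f \in \mathbb{R}^{N_f}$ by

$$\mathbf{r}_f = \begin{bmatrix} \rho_{1f} \\ \vdots \\ \rho_{N_f f} \end{bmatrix} .$$

Suppose that $d_f^{(k)} \in \mathcal{V}_f$, $0 \le k \le 3$ can be written

$$d_f^{(k)}(x) = \sum_{n=1}^{N_f} V_{nf}(x) \delta_{nf}^{(k)}$$

and define the vectors $\mathbf{d}_f^{(k)} \in \mathbb{R}^{N_f}$ for $0 \le k \le 3$ by

$$\mathbf{d}_f^{(k)} = \begin{bmatrix} \delta_{1f}^{(k)} \\ \vdots \\ \delta_{N_f f}^{(k)} \end{bmatrix} .$$

Then the multigrid iteration

$$\begin{aligned}
d_f^{(0)} &= 0 \\
d_f^{(1)} &= d_f^{(0)} + (r_f - \mathcal{A}_f d_f^{(0)}) / \rho(\mathcal{A}_f) \\
d_f^{(2)} &= d_f^{(1)} + d_c \quad \text{where } d_c = M_c(\mathcal{R}_{cf}(r_f - \mathcal{A}_f d_f^{(1)})) \\
d_f^{(3)} &= d_f^{(2)} + (r_f - \mathcal{A}_f d_f^{(2)}) / \rho(\mathcal{A}_f)
\end{aligned}$$

has the equivalent matrix-vector form

$$\begin{aligned}
\mathbf{d}_f^{(0)} &= 0 \\
\mathbf{d}_f^{(1)} &= \mathbf{d}_f^{(0)} + (\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(0)}) / \rho(\mathbf{A}_f) \\
\mathbf{d}_f^{(2)} &= \mathbf{d}_f^{(1)} + \mathbf{P}_{fc} \mathbf{d}_c \quad \text{where } \mathbf{d}_c = M_c(\mathbf{R}_{cf}(\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(1)})) \\
\mathbf{d}_f^{(3)} &= \mathbf{d}_f^{(2)} + (\mathbf{r}_f - \mathbf{A}_f \mathbf{d}_f^{(2)}) / \rho(\mathbf{A}_f)
\end{aligned}$$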


5.14 Mixed and Hybrid Finite Elements

5.14.1 Review of Quadratic Programming

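Suppose that we are given $A \in \mathbb{R}^{n \times n}$, $f \in \mathbb{R}^n$, $B \in \mathbb{R}^{m \times n}$ with $n > m = \operatorname{rank} B$ and $g \in \mathbb{R}^m$, and suppose that $A$ is symmetric and positive-definite. We want to solve the quadratic programming problem

$$\min_{u \in \mathbb{R}^n} P(u) \equiv \tfrac{1}{2} u^\top A u - f^\top u \quad \text{subject to } B u = g .$$

Suppose that $u^*$ solves this quadratic programming problem. If $u^* + v \varepsilon$ is another candidate for the solution, then

$$P(u^* + v \varepsilon) = P(u^*) + \varepsilon [v^\top A u^* - v^\top f] + \frac{\varepsilon^2}{2} v^\top A v$$

and

$$g = B u^* + B v \varepsilon .$$

Since $\varepsilon$ is arbitrary, we must have

$$B v = 0 \implies v^\top (A u^* - f) = 0 .$$

This says that the gradient of the objective is orthogonal to the plane of the constraint, which is the nullspace of $B$. Since $\mathbb{R}^n = \mathcal{N}(B) \oplus \mathcal{R}(B^\top)$, it follows that there is some vector $\ell^* \in \mathbb{R}^m$ so that $A u^* - f = -B^\top \ell^*$. We call $\ell^*$ the vector of Lagrange multipliers and rewrite the first-order condition for the minimum as

$$A u^* + B^\top \ell^* = f .$$

Together, the gradient of the objective and the constraint give us the linear system

$$\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix} \begin{bmatrix} u^* \\ \ell^* \end{bmatrix} = \begin{bmatrix} f \\ g \end{bmatrix} .$$

This linear system is nonsingular, with a matrix that is symmetric but not positive-definite.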

Let us define the Lagrangian

$$L_P(u, \ell) = \tfrac{1}{2} u^\top A u - f^\top u + \ell^\top (B u - g) .$$

Then the first-order equations for a critical point of $L_P$ are

$$0 = \nabla_u L_P = A u^* - f + B^\top \ell^* ,$$

$$0 = \nabla_\ell L_P = B u^* - g .$$

These are the same as the first-order conditions for the solution of the quadratic programming problem. In other words, the saddle point of the Lagrangian is the same as the minimum of the quadratic programming problem.

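We can also define the dual quadratic programming problem

$$\max_{v \in \mathbb{R}^n, w \in \mathbb{R}^m} D(v, w) \equiv -\tfrac{1}{2} v^\top A v - g^\top w \quad \text{subject to } A v + B^\top w = f .$$

At the optimal solution, the gradient of this objective must also be orthogonal to the plane of the constraint:

$$\exists m^* \in \mathbb{R}^n , \quad -\begin{bmatrix} A v^* \\ g \end{bmatrix} + \begin{bmatrix} A \\ B \end{bmatrix} m^* = 0 .$$

Together with the constraint, we obtain the linear system

$$\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix} \begin{bmatrix} v^* = m^* \\ w^* \end{bmatrix} = \begin{bmatrix} f \\ g \end{bmatrix} .$$

This is the same linear system that we found for the primal, so the optimal solutions of these two problems must be related by $u^* = v^* = m^*$ and $\ell^* = w^*$. The Lagrangian for the dual problem is

$$L_D(v, w, m) = -\tfrac{1}{2} v^\top A v - g^\top w + m^\top (A v + B^\top w - f) .$$

At the optimal values, the primal and dual objectives are equal:

$$\begin{aligned}
P(u^*) &= \tfrac{1}{2} (u^*)^\top A u^* - f^\top u^* \\
&= -\tfrac{1}{2} (u^*)^\top A u^* + (u^*)^\top (A u^* - f) = -\tfrac{1}{2} (u^*)^\top A u^* - (u^*)^\top B^\top \ell^* \\
&= -\tfrac{1}{2} (u^*)^\top A u^* - g^\top \ell^* = -\tfrac{1}{2} (v^*)^\top A v^* - g^\top w^* = D(v^*, w^*) .
\end{aligned}$$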
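The saddle-point system and the equality of the primal and dual objectives can be verified on a small random instance. This is a sketch under assumptions: the data `A`, `B`, `f`, `g` are random (with a fixed seed), `A` is made symmetric positive-definite by construction, and the dense solver `numpy.linalg.solve` is used only because the problem is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)        # symmetric positive-definite
B = rng.standard_normal((m, n))    # full row rank with probability one
f = rng.standard_normal(n)
g = rng.standard_normal(m)

# Saddle-point (KKT) system [A B^T; B 0] [u; ell] = [f; g].
K = np.block([[A, B.T], [B, np.zeros((m, m))]])
solution = np.linalg.solve(K, np.concatenate([f, g]))
u, ell = solution[:n], solution[n:]

primal = 0.5 * u @ A @ u - f @ u       # P(u*)
dual = -0.5 * u @ A @ u - g @ ell      # D(v*, w*) with v* = u*, w* = ell*
eigenvalues = np.linalg.eigvalsh(K)

print("constraint residual:", np.linalg.norm(B @ u - g))
print("P(u*) - D(v*, w*):", primal - dual)
print("negative / positive eigenvalues:", (eigenvalues < 0).sum(), (eigenvalues > 0).sum())
```

The eigenvalue count illustrates why the system is symmetric but indefinite: the matrix has $n$ positive and $m$ negative eigenvalues.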

Since $A$ is positive-definite, the primal quadratic programming problem is always bounded:

$$P(u) = \tfrac{1}{2} [u - A^{-1} f]^\top A [u - A^{-1} f] - \tfrac{1}{2} f^\top A^{-1} f \ge -\tfrac{1}{2} f^\top A^{-1} f .$$

For the same reason, the dual quadratic programming problem is always feasible: for any $w \in \mathbb{R}^m$ we can take $v = A^{-1}(f - B^\top w)$. Since $B$ is assumed to have full rank, the primal quadratic programming problem is always feasible. As a result, the dual problem is bounded above by the objective value at the optimal solution of the primal.


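In our study of differential equations to follow, we will assume that our constraint satisfies a condition of the form

$$\exists \beta > 0 \ \forall v \in \mathbb{R}^m , \quad \sup_{u \in \mathbb{R}^n, u \ne 0} \frac{v^\top B u}{\|u\|} \ge \beta \|v\| .$$

Such a condition is easy to establish for the quadratic programming problem. Let $B = V^\top \Sigma U$ be the singular-value decomposition of $B$, and let $\sigma$ be the minimum singular value on the diagonal of $\Sigma$. Here $U$ and $V$ are unitary matrices. Then

$$\sup_{u \in \mathbb{R}^n, u \ne 0} \frac{v^\top B u}{\|u\|} = \sup_{u \in \mathbb{R}^n, u \ne 0} \frac{(V v)^\top \Sigma U u}{\|U u\|} \ge \sigma \|V v\| = \sigma \|v\| .$$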
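In finite dimensions the constant in this condition can be computed directly. The following sketch assumes a random full-rank `B` (fixed seed) and uses the fact that, for fixed $v$, the supremum over $u$ of $v^\top B u / \|u\|$ equals $\|B^\top v\|$, attained at $u = B^\top v$. The function name `inf_sup_gap` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
B = rng.standard_normal((m, n))
sigma = np.linalg.svd(B, compute_uv=False).min()   # smallest singular value of B

def inf_sup_gap(v):
    """Return sup_u v^T B u / ||u|| minus sigma * ||v||; non-negative if the bound holds."""
    sup_value = np.linalg.norm(B.T @ v)            # supremum attained at u = B^T v
    return sup_value - sigma * np.linalg.norm(v)

gaps = [inf_sup_gap(rng.standard_normal(m)) for _ in range(100)]
print("sigma =", sigma, " smallest gap =", min(gaps))
```

The gap shrinks to zero only when $v$ aligns with the singular vector belonging to $\sigma$, so $\beta = \sigma$ is the best constant for the Euclidean norms.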

In infinite-dimensional quadratic programming problems, such as those that arose in our preceding examples of incompressible flow, it is much harder to prove the existence of the Lagrange multipliers.

5.14.2 Mixed Formulation of Physical Problems

The following theorem will be useful in our discussions.

Hodge Decomposition Theorem 5.14-1: [24, p.51] Any vector field $w$ on $\Omega \subset \mathbb{R}^d$ with smooth boundary can be uniquely decomposed (up to an additive constant in $u$) in the form

$$w = v + \nabla_x u$$

where $\nabla_x \cdot v = 0$ in $\Omega$ and $n \cdot v = 0$ on $\partial\Omega$.

Proof: Note that if $\nabla_x \cdot v = 0$ in $\Omega$ and $n \cdot v = 0$ on $\partial\Omega$, then

$$\int_\Omega v \cdot \nabla_x u \, dx = \int_\Omega \nabla_x \cdot (v u) - u \nabla_x \cdot v \, dx = \int_\Omega \nabla_x \cdot (v u) \, dx = \int_{\partial\Omega} n \cdot v \, u \, ds = 0 .$$

In other words, the space of divergence-free functions with zero normal component on the boundary is orthogonal to the space of gradients of scalar functions.

As a result, the decomposition $w = v + \nabla_x u$ is unique. To see this, suppose that $v_1 + \nabla_x u_1 = v_2 + \nabla_x u_2$. Then $0 = v_1 - v_2 + \nabla_x (u_1 - u_2)$. Taking the inner product of this equation with $v_1 - v_2$ and integrating gives us

$$0 = \int_\Omega \|v_1 - v_2\|^2 + (v_1 - v_2) \cdot \nabla_x (u_1 - u_2) \, dx = \int_\Omega \|v_1 - v_2\|^2 \, dx .$$

Thus $v_1 = v_2$, and $\nabla_x u_1 = \nabla_x u_2$, which in turn implies that $u_1 = u_2$ plus a constant.

To prove existence, given $w$ let $u$ solve the Neumann problem

$$\begin{aligned}
-\nabla_x \cdot \nabla_x u &= -\nabla_x \cdot w \quad \text{in } \Omega \\
n \cdot \nabla_x u &= n \cdot w \quad \text{on } \partial\Omega .
\end{aligned}$$

This problem is known to have a solution $u$, determined up to an additive constant [26]. Then $v \equiv w - \nabla_x u$ is divergence-free in $\Omega$ and $n \cdot v = 0$ on $\partial\Omega$. □

5.15 Mixed and Hybrid Finite Elements

For simplicity, we will consider a specific application in this section. Incompressible single-phase flow in one-dimensional porous media consists of mass balance

$$\frac{dv}{dx} = \omega$$

and Darcy's law

$$v = -K \left( \frac{dp}{dx} + g\rho \right) / \mu .$$

Here $v$ is the velocity vector, $p$ is the fluid pressure, $\mu$ is the fluid viscosity, $\rho$ is the fluid density, $g$ is the acceleration due to gravity, and $K$ is the permeability. These two equations hold in the interior of some region $\Omega$. We shall assume that the flow is confined to this region, which implies that on the boundary of $\Omega$

$$v = 0 .$$

Flow occurs because of the influence of wells, represented by the source term $\omega$. In order to specify a unique solution to this problem, we will also assume that the pressure is specified at some point, usually within a well.

It is well-known that in one dimension, the solution of this problem depends on the harmonic average of the permeability. Away from wells, we have that $v$ is constant. We can solve for the pressure derivative and integrate between wells located at $x_L$ and $x_R$ to get

$$p_R - p_L = -\int_{x_L}^{x_R} \frac{\mu}{K} \, dx \; v .$$

We can solve this equation for $v$ to get

$$v = \frac{p_L - p_R}{\int_{x_L}^{x_R} \frac{\mu}{K} \, dx} .$$

We can also find the pressure at points between the wells:

$$p(x) = p_L + \frac{\int_{x_L}^{x} \frac{\mu}{K} \, dx}{\int_{x_L}^{x_R} \frac{\mu}{K} \, dx} (p_R - p_L) .$$
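For piecewise-constant coefficients these integrals reduce to sums over cells. The sketch below assumes that gravity is neglected (as in the formulas for $v$ and $p(x)$ above), that `dx`, `K` and `mu` are arrays of cell widths, permeabilities and viscosities, and that pressures are prescribed at both ends; the function names are hypothetical.

```python
import numpy as np

def darcy_flux(dx, K, mu, p_left, p_right):
    """Constant flux v = (p_L - p_R) / integral of mu/K for piecewise-constant data."""
    resistance = np.sum(mu * dx / K)
    return (p_left - p_right) / resistance

def pressure_at_cell_sides(dx, K, mu, p_left, p_right):
    """Pressure at the cell sides, from the running integral of mu/K."""
    partial = np.concatenate([[0.0], np.cumsum(mu * dx / K)])
    return p_left + (p_right - p_left) * partial / partial[-1]

dx = np.array([0.5, 0.5])
K = np.array([1.0, 0.01])          # a permeable cell next to a nearly impermeable one
mu = np.array([1.0, 1.0])
v = darcy_flux(dx, K, mu, 1.0, 0.0)
harmonic_K = dx.sum() / np.sum(dx / K)
arithmetic_K = np.sum(dx * K) / dx.sum()
print("flux:", v)
print("flux from harmonic average:", harmonic_K * (1.0 - 0.0) / dx.sum())
print("flux from arithmetic average:", arithmetic_K * (1.0 - 0.0) / dx.sum())
print("side pressures:", pressure_at_cell_sides(dx, K, mu, 1.0, 0.0))
```

With these values the low-permeability cell controls the flux, and nearly all of the pressure drop occurs across it; an arithmetic average of $K$ would overestimate the flux by more than an order of magnitude.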

Note that the permeability $K$ appears in these expressions only through a harmonic average.

Suppose that we approximate the solution of these porous media equations by a standard finite element method, using piecewise linear approximations to $p$. The method would be applied to the equation

$$-\frac{d}{dx} \left( \frac{K}{\mu} \frac{dp}{dx} \right) = \omega + \frac{d}{dx} \left( \frac{g\rho}{\mu} \right) .$$

Then for piecewise-constant coefficients we get a discrete system of the form

$$-\left( \frac{K}{\mu} \right)_i \left[ \frac{p_{i+\frac{1}{2}} - p_{i-\frac{1}{2}}}{\triangle x_i} + (g\rho)_i \right] + \left( \frac{K}{\mu} \right)_{i+1} \left[ \frac{p_{i+\frac{3}{2}} - p_{i+\frac{1}{2}}}{\triangle x_{i+1}} + (g\rho)_{i+1} \right] = \frac{\omega_i + \omega_{i+1}}{2} .$$

This method uses arithmetic averaging of the coefficients; notice the coefficient of $p_{i+\frac{1}{2}}$. In multiple dimensions, this arithmetic averaging will allow flow into impermeable regions from permeable regions.

The standard finite element method minimizes the energy per time

$$E(p) \equiv \frac{1}{2} \int_\Omega \frac{dp}{dx} \frac{K}{\mu} \frac{dp}{dx} \, dx + \int_\Omega \rho g \frac{dp}{dx} \, dx - \int_\Omega p \omega \, dx .$$

In this section, we will consider an alternative approach. We will view a different form of Darcy's law as a constraint:

$$\frac{dp}{dx} = -K^{-1} v \mu - g\rho .$$

(Note that the introduction of this constraint allows us to invert the permeability, moving toward the harmonic average.) Then we will use a penalty method to find a stationary point of the functional

$$H(v, p) \equiv \int_\Omega \frac{1}{2} v \cdot K^{-1} v \mu + p \frac{dv}{dx} + v g\rho - p \omega \, dx .$$

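It is easy to see that $H$ also has units of energy per time, and that

$$\begin{aligned}
H(v + w\varepsilon, p + q\varepsilon) = H(v, p) &+ \varepsilon \left\{ \int_\Omega p \frac{dw}{dx} + w \cdot [K^{-1} v \mu + g\rho] \, dx + \int_\Omega q \left[ \frac{dv}{dx} - \omega \right] dx \right\} \\
&+ \varepsilon^2 \int_\Omega q \frac{dw}{dx} + \frac{1}{2} w \cdot K^{-1} w \mu \, dx .
\end{aligned}$$

Thus the first variation of $H$ vanishes whenever both

$$\int_\Omega p \frac{dw}{dx} + w [K^{-1} v \mu + g\rho] \, dx = 0$$

for all variations $w$ of $v$, and

$$\int_\Omega q \left[ \frac{dv}{dx} - \omega \right] dx = 0$$

for all variations $q$ of $p$. Since the second variation of $H$ does not necessarily have any predetermined sign, we expect that $H$ may have a saddle point where its first variation vanishes.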

This penalty method will give rise to a different finite element approach to approximating the solution of partial differential equations, called mixed finite element methods. The methods will have the advantage that they will generate more accurate approximations to the Darcy velocities than standard finite element methods using the same order polynomials in the approximations. However, the resulting linear systems will be larger, and symmetric but not positive-definite. Furthermore, the analysis of the mixed finite element methods is more complicated.

5.15.1 Lowest Order Mixed Finite Element Method

The lowest-order mixed finite element approximates the pressure by piecewise-constants. The velocity approximations are such that their divergence is piecewise-constant. In one dimension, this means that $M$ consists of all piecewise linear functions. In multiple dimensions, the $i$th component of a velocity in $M$ is linear in the $i$th coordinate direction, and constant in the other directions.

We will develop the details of the mixed finite element for these spaces in one dimension. The basis functions for pressure are

$$P_i(x) = \begin{cases} 1 , & x_{i-\frac{1}{2}} < x < x_{i+\frac{1}{2}} \\ 0 , & \text{otherwise} \end{cases}$$

and the basis functions for velocity are

$$V_{i+\frac{1}{2}}(x) = \begin{cases} \dfrac{x - x_{i-\frac{1}{2}}}{\triangle x_i} , & x_{i-\frac{1}{2}} < x < x_{i+\frac{1}{2}} \\[2ex] \dfrac{x_{i+\frac{3}{2}} - x}{\triangle x_{i+1}} , & x_{i+\frac{1}{2}} < x < x_{i+\frac{3}{2}} \\[2ex] 0 , & \text{otherwise} \end{cases}$$


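Our finite element equations are

$$\begin{aligned}
\int_\Omega P_i \nabla_x \cdot V \, dx &= \int_{x_{i-\frac{1}{2}}}^{x_{i+\frac{1}{2}}} \nabla_x \cdot [V_{i-\frac{1}{2}}(x) v_{i-\frac{1}{2}} + V_{i+\frac{1}{2}}(x) v_{i+\frac{1}{2}}] \, dx = v_{i+\frac{1}{2}} - v_{i-\frac{1}{2}} \\
&= \int_\Omega P_i \omega \, dx = \int_{x_{i-\frac{1}{2}}}^{x_{i+\frac{1}{2}}} \omega(x) \, dx = \omega_i \triangle x_i
\end{aligned}$$

and

$$\begin{aligned}
\int_\Omega V_{i+\frac{1}{2}} \frac{\mu}{K} V \, dx &= \int_{x_{i-\frac{1}{2}}}^{x_{i+\frac{1}{2}}} V_{i+\frac{1}{2}} \frac{\mu}{K} [V_{i-\frac{1}{2}}(x) v_{i-\frac{1}{2}} + V_{i+\frac{1}{2}}(x) v_{i+\frac{1}{2}}] \, dx \\
&\quad + \int_{x_{i+\frac{1}{2}}}^{x_{i+\frac{3}{2}}} V_{i+\frac{1}{2}} \frac{\mu}{K} [V_{i+\frac{1}{2}}(x) v_{i+\frac{1}{2}} + V_{i+\frac{3}{2}}(x) v_{i+\frac{3}{2}}] \, dx \\
&= \left( \frac{\mu}{K} \right)_i \left[ \frac{1}{6} v_{i-\frac{1}{2}} + \frac{1}{3} v_{i+\frac{1}{2}} \right] \triangle x_i + \left( \frac{\mu}{K} \right)_{i+1} \left[ \frac{1}{3} v_{i+\frac{1}{2}} + \frac{1}{6} v_{i+\frac{3}{2}} \right] \triangle x_{i+1} \\
&= \int_\Omega \nabla_x \cdot V_{i+\frac{1}{2}} P + V_{i+\frac{1}{2}} \cdot g\rho \, dx \\
&= \int_{x_{i-\frac{1}{2}}}^{x_{i+\frac{1}{2}}} \nabla_x \cdot V_{i+\frac{1}{2}} p_i + V_{i+\frac{1}{2}} \cdot g\rho \, dx + \int_{x_{i+\frac{1}{2}}}^{x_{i+\frac{3}{2}}} \nabla_x \cdot V_{i+\frac{1}{2}} p_{i+1} + V_{i+\frac{1}{2}} \cdot g\rho \, dx \\
&= p_i - p_{i+1} + \frac{1}{2} [(g\rho \triangle x)_i + (g\rho \triangle x)_{i+1}] .
\end{aligned}$$

In the second finite element equation, we assume that the coefficients $K$, $\mu$, $g$ and $\rho$ are piecewise constant, for simplicity.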

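These finite element equations can be written in the form of a linear system

$$\begin{bmatrix} M & G \\ G^\top & 0 \end{bmatrix} \begin{bmatrix} v \\ p \end{bmatrix} = \begin{bmatrix} g \\ w \end{bmatrix}$$

where

$$M = \begin{bmatrix} \ddots & \ddots & \ddots \\ \frac{1}{6} \left( \frac{\mu}{K} \triangle x \right)_i & \frac{1}{3} \left[ \left( \frac{\mu}{K} \triangle x \right)_i + \left( \frac{\mu}{K} \triangle x \right)_{i+1} \right] & \frac{1}{6} \left( \frac{\mu}{K} \triangle x \right)_{i+1} \\ \ddots & \ddots & \ddots \end{bmatrix}$$

is tridiagonal,

$$G = \begin{bmatrix} \ddots & \ddots \\ -1 & 1 \\ \ddots & \ddots \end{bmatrix}$$

is a bidiagonal matrix representing a discrete gradient,

$$g = \begin{bmatrix} \vdots \\ \frac{1}{2} [(g\rho \triangle x)_i + (g\rho \triangle x)_{i+1}] \\ \vdots \end{bmatrix}$$

represents the gravity terms and

$$w = \begin{bmatrix} \vdots \\ \omega_i \triangle x_i \\ \vdots \end{bmatrix}$$

represents the source terms due to wells. We have left the details of the boundary conditions to the reader.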

Note that the linear system is symmetric, but not positive-definite. One approach to solving this system is to use a Schur decomposition. We use the first block equation to eliminate the velocities; this gives us

$$G^\top M^{-1} G p = G^\top M^{-1} g - w$$

to solve for the pressures $p$. After we find the cell-centered pressures, we compute the side-centered velocities by

$$v = M^{-1} (g - G p) .$$

The difficulty with this approach is that $M$ is tridiagonal, so $M^{-1}$ is a full matrix, and $G^\top M^{-1} G$ is also a full matrix. An alternative approach for solving the linear system is to use a penalty method; see [16] for more details. Another approach is to use a hybrid mixed finite element.
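The fill-in caused by inverting the tridiagonal matrix is easy to observe. The sketch below assumes uniform cells with $(\mu \triangle x / K)_i = 1$ and keeps only the interior sides, since the boundary rows are not specified here; the function name `velocity_mass_matrix` is hypothetical.

```python
import numpy as np

def velocity_mass_matrix(s):
    """Tridiagonal M for the interior sides; s[i] is (mu * dx / K) in cell i."""
    n = len(s) - 1                      # number of interior sides
    M = np.zeros((n, n))
    for k in range(n):                  # side k lies between cells k and k+1
        M[k, k] = (s[k] + s[k + 1]) / 3.0
        if k + 1 < n:
            M[k, k + 1] = M[k + 1, k] = s[k + 1] / 6.0
    return M

M = velocity_mass_matrix(np.ones(7))
M_inverse = np.linalg.inv(M)
nonzeros_M = np.count_nonzero(np.abs(M) > 1e-14)
nonzeros_inverse = np.count_nonzero(np.abs(M_inverse) > 1e-14)
print("nonzeros in M:", nonzeros_M, "of", M.size)
print("nonzeros in inverse:", nonzeros_inverse, "of", M_inverse.size)
```

The entries of the inverse decay away from the diagonal but none of them vanish, so forming $G^\top M^{-1} G$ explicitly costs storage proportional to the square of the number of cells.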

For the lowest-order mixed finite element method, the best we can hope for is that p can be approximated by a piecewise-constant Q to first-order accuracy. This is the same as the best order of accuracy we can hope for in the multi-dimensional approximation to the velocity v.

5.15.2 Hybrid Mixed Finite Element Method

In contrast, the hybrid mixed method will use continuity of fluid flux to connect the equations on different grid scales. The hybrid mixed finite element method uses basis functions that treat the normal velocity as discontinuous at the cell sides. These unknowns are combined with Lagrange multipliers for a continuity constraint to decouple the mixed method equations between cells, and produce a symmetric positive system of linear equations for the Lagrange multipliers. In this application, the Lagrange multipliers can be identified with fluid pressures at the cell sides.


5.15.3 Mathematical Formulation of the Hybrid Mixed Finite Element Method

The weak formulation of the hybrid mixed method equations [5, 16, 17, 27, 70] is similar to, but less commonly used for porous flow than the mixed method [4, 5, 12, 16, 17, 21, 22, 37, 39, 40, 48, 49, 70]. Let the problem domain $\Omega = \cup_{i,j} \Omega_{i,j}$ be a union of intervals, and let $E$ be the set of endpoints of these intervals interior to $\Omega$. We want to find $v \in \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j})$, $p \in H^0(\Omega_{i,j})$ and $\lambda \in H^0(E)$ so that

$$\int_\Omega u T^{-1} v \, dx - \sum_{\Omega_{i,j}} \left( \int_{\Omega_{i,j}} \left( \frac{du}{dx} \right) p \, dx - u \lambda \big|_{\partial\Omega_{i,j}} \right) = \int_\Omega u g\rho \, dx - (u) p \big|_{\partial\Omega} \quad \forall u \in \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j}) , \qquad (5.15\text{-}1a)$$

$$\sum_{\Omega_{i,j}} \int_{\Omega_{i,j}} q \frac{dv}{dx} \, dx = \int_\Omega q \omega \, dx \quad \forall q \in H^0(\Omega) , \qquad (5.15\text{-}1b)$$

$$\sum_{\Omega_{i,j}} \mu v \big|_{\partial\Omega_{i,j}} = 0 \quad \forall \mu \in H^0(E) . \qquad (5.15\text{-}1c)$$

The first equation (5.15-1a) is Darcy's law with possibly discontinuous velocity and pressure on the subdomains, the second equation (5.15-1b) is the divergence-free condition on the velocity in the sub-domains, and the third equation (5.15-1c) requires the normal component of velocity to be continuous across sides of sub-domains. The purpose of the Lagrange multipliers is to enforce the continuity of the normal component of velocity. It is easy to see that the Lagrange multipliers satisfy $\lambda = p|_E$ for strong solutions $p$ of the pressure equation.

In order to construct an approximate solution of problem (5.15-1) we will choose finite dimensional subspaces $V_h \subset \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j})$, $P_h \subset H^0(\Omega)$ and $\Lambda_h \subset H^0(E)$. We want to find $v_h \in V_h$, $p_h \in P_h$ and $\lambda_h \in \Lambda_h$ such that

$$\int_\Omega u T^{-1} v \, dx - \sum_{\Omega_{i,j}} \left( \int_{\Omega_{i,j}} \left( \frac{du}{dx} \right) p \, dx - u \lambda \big|_{\partial\Omega_{i,j}} \right) = \int_\Omega u g\rho \, dx - u \, p \big|_{\partial\Omega} \quad \forall u \in V_h , \qquad (5.15\text{-}2a)$$

$$\sum_{\Omega_{i,j}} \int_{\Omega_{i,j}} q \frac{dv}{dx} \, dx = \int_\Omega q \omega \, dx \quad \forall q \in P_h , \qquad (5.15\text{-}2b)$$

$$\sum_{\Omega_{i,j}} \mu_h v \big|_{\partial\Omega_{i,j}} = 0 \quad \forall \mu_h \in \Lambda_h . \qquad (5.15\text{-}2c)$$

The finite dimensional spaces $V_h$, $P_h$ and $\Lambda_h$ consist of piecewise polynomial approximations chosen to provide good approximations to the solution of the differential equation. We have chosen $P_h$ to consist of functions that are piecewise constant on grid cells, $V_h$ to consist of cell-wise discontinuous functions with $i$th component linear in the $i$th coordinate direction and all other components constant, and $\Lambda_h$ to consist of functions that are piecewise constant on cell sides. Because the basis functions in $V_h$ and $P_h$ are discontinuous, equations (5.15-2a) and (5.15-2b), representing Darcy's law and the divergence-free condition on the velocity, decouple cell by cell; see the finite difference form of these equations below in (5.15-58). The remaining equation (5.15-2c) represents conservation of volume flux at cell sides.

5.15.4 Positive-Definiteness of the Linear System

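By examining the equations (5.15-2), we can see that the system of equations over the entire grid has the form

$$\begin{bmatrix} T & B & C \\ B^\top & 0 & 0 \\ C^\top & 0 & 0 \end{bmatrix} \begin{bmatrix} v \\ p \\ \lambda \end{bmatrix} = \begin{bmatrix} g \\ w \\ 0 \end{bmatrix} . \qquad (5.15\text{-}3)$$

Here $C$ represents the equations that enforce continuity of the normal components of the Darcy velocity; together $[B \ C]$ forms a discrete gradient and $-\begin{bmatrix} B^\top \\ C^\top \end{bmatrix}$ forms a discrete divergence.

We will reformulate equations (5.15-2) in finite difference form in section 5.15.5 below. That discussion will show that the equations

$$\begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix} \begin{bmatrix} v \\ p \end{bmatrix} = \begin{bmatrix} g - C\lambda \\ w \end{bmatrix}$$

decouple for each cell. After solving these equations, we obtain the symmetric system

$$\begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} C \\ 0 \end{bmatrix} \lambda = -\begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} g \\ w \end{bmatrix} \qquad (5.15\text{-}4)$$

for the Lagrange multipliers. Since $T$ is positive definite, we could factor $T = L L^\top$ and see that

$$\begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} C \\ 0 \end{bmatrix} = -C^\top L^{-\top} [I - L^{-1} B (B^\top L^{-\top} L^{-1} B)^{-1} B^\top L^{-\top}] L^{-1} C .$$

The quantity inside the brackets is the orthogonal projection onto the nullspace of $B^\top L^{-\top}$. Thus the matrix in the linear system (5.15-4) is nonnegative. Since the hybrid mixed system of equations has a unique solution, it follows that the linear system for the Lagrange multipliers is positive-definite.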


5.15.5 Numerical Implementation of the Hybrid Mixed Finite Element Method

In this section, we will reinterpret the hybrid mixed finite element equations in finite difference form. We will also describe how the discrete equations representing Darcy's law and the divergence-free condition on the velocity decouple cell by cell.

Assuming piecewise constant transmissibilities, gravity and density, we will use exact integration within each cell. For comparison with the results in multiple dimensions, in each grid cell $\Omega_i$ we will write

$$S = \frac{\triangle x}{T}$$

and

$$\gamma = \triangle x \, g\rho / 2 .$$

Here $T = K/\mu$ is the transmissibility (permeability divided by viscosity), $g$ is the acceleration due to gravity and $\rho$ is the fluid density. Then in each grid cell we can write the discrete Darcy's law (5.15-2a) and the divergence-free condition (5.15-2b) on the velocity field in the form of a $3 \times 3$ linear system

$$\begin{bmatrix} \frac{1}{3} S & \frac{1}{6} S & 1 \\ \frac{1}{6} S & \frac{1}{3} S & -1 \\ 1 & -1 & -\omega \end{bmatrix}_i \begin{bmatrix} v^{(R)}_{i-\frac{1}{2}} \\ v^{(L)}_{i+\frac{1}{2}} \\ p_i \end{bmatrix} = \begin{bmatrix} \gamma_i + p_{i-\frac{1}{2}} \\ \gamma_i - p_{i+\frac{1}{2}} \\ -(\omega p_w)_i \end{bmatrix} , \qquad (5.15\text{-}5a)$$

in any grid cell containing a pressure-specified well. A cell with a rate-specified well leads to the discrete equations

$$\begin{bmatrix} \frac{1}{3} S & \frac{1}{6} S & 1 \\ \frac{1}{6} S & \frac{1}{3} S & -1 \\ 1 & -1 & 0 \end{bmatrix}_i \begin{bmatrix} v^{(R)}_{i-\frac{1}{2}} \\ v^{(L)}_{i+\frac{1}{2}} \\ p_i \end{bmatrix} = \begin{bmatrix} \gamma_i + p_{i-\frac{1}{2}} \\ \gamma_i - p_{i+\frac{1}{2}} \\ -(w \triangle x)_i \end{bmatrix} . \qquad (5.15\text{-}5b)$$

Here $\omega$ is the well productivity index, $v^{(L/R)}_{i+\frac{1}{2}}$ are the volume fluxes (Darcy velocities in 1D), $p_i$ is the cell pressure, and $w_i$ is the well rate. In one dimension, the well model, described in equation (??), implies that the productivity index is $\omega_i = 2 T_i / \triangle x_i$; if there is no well in the grid cell, then $\omega_i = 0$ and $w_i = 0$. The side pressures $p_{i+\frac{1}{2}}$ represent the values of the Lagrange multipliers. Thus in each grid cell, we use the discrete forms of Darcy's law and volume conservation to determine the side volume fluxes and cell pressure in terms of the side pressures and well rate. In addition, we have a discrete form of the volume flux continuity conditions (5.15-2c)

$$-v^{(L)}_{i+\frac{1}{2}} + v^{(R)}_{i+\frac{1}{2}} = 0 , \qquad (5.15\text{-}6)$$

which couple the equations between grid cells and lead to a linear system of equations for the side pressures.

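We solve the linear system (5.15-5) in each cell of the grid to determine the side velocities and cell pressure in terms of the side pressures (i.e., the Lagrange multipliers). Each of these $3 \times 3$ linear systems has the block form
\[
\begin{bmatrix} \mathbf{S} & \mathbf{b} \\ \mathbf{b}^\top & -\omega \end{bmatrix}
\begin{bmatrix} \mathbf{v} \\ p \end{bmatrix}
=
\begin{bmatrix} \mathbf{g} \\ -\zeta \end{bmatrix} .
\]
We can invert
\[
\begin{bmatrix} \mathbf{S} & \mathbf{b} \\ \mathbf{b}^\top & -\omega \end{bmatrix}^{-1}
=
\begin{bmatrix} \mathbf{M} & \mathbf{m} \\ \mathbf{m}^\top & -\nu \end{bmatrix} .
\]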

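Thus in each cell $\Omega_i$ we obtain the equations
\[
\begin{bmatrix} v^{(R)}_{i-\frac12} \\ v^{(L)}_{i+\frac12} \\ p_i \end{bmatrix}
=
\begin{bmatrix} M_1 & M_2 & m_1 \\ M_2 & M_3 & m_2 \\ m_1 & m_2 & -\nu \end{bmatrix}_i
\begin{bmatrix} \gamma_i + p_{i-\frac12} \\ \gamma_i - p_{i+\frac12} \\ -\zeta \end{bmatrix} ,
\tag{5.15-7}
\]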

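where the value of $\zeta$ depends on the well specification (if any). Then continuity of the volume fluxes leads to the equation
\begin{align*}
0 &= -v^{(L)}_{i+\frac12} + v^{(R)}_{i+\frac12} \\
&= \begin{bmatrix} -(M_2)_i , & (M_3)_i + (M_1)_{i+1} , & -(M_2)_{i+1} \end{bmatrix}
\begin{bmatrix} p_{i-\frac12} \\ p_{i+\frac12} \\ p_{i+\frac32} \end{bmatrix}
+ (m_2 \zeta)_i - (m_1 \zeta)_{i+1} \\
&\quad - [(M_2 + M_3)\gamma]_i + [(M_1 + M_2)\gamma]_{i+1} . \tag{5.15-8}
\end{align*}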

These equations simplify in obvious ways at the boundaries. It follows that the matrix in the linear system (5.15-4) for the Lagrange multipliers (i.e., the side pressures) is tridiagonal.

After solving the pressure equation, we can use equation (5.15-7) to obtain values $v^{(L/R)}_{i+\frac12}$ for the volume fluxes (that is, the normal components of the fluid velocities) at the cell sides. If the residual in the linear system is small, then the pressure will have been determined so that the left and right values of these quantities at each cell side are very nearly equal. Then we compute the time integral of the volume flux by
\[ V^{n+\frac12}_{i+\frac12} = \frac{\triangle t^{n+\frac12}}{2} \left( v^{(L)}_{i+\frac12} + v^{(R)}_{i+\frac12} \right) . \tag{5.15-9} \]
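The cell-by-cell elimination described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the text: there are no wells ($\omega = 0$, $\zeta = 0$), gravity is omitted ($\gamma = 0$), the pressure is specified at both ends of the domain, and there are at least two cells. The function name `hybrid_mixed_1d` and its arguments are hypothetical, and a dense solver stands in for a tridiagonal one.

```python
import numpy as np

def hybrid_mixed_1d(dx, T, p_left, p_right):
    """Well-free, gravity-free 1D hybrid mixed solve (illustrative sketch).

    Returns (side_pressures, cell_pressures, fluxes); fluxes[i] holds
    v^(R)_{i-1/2} and v^(L)_{i+1/2} for cell i.
    """
    dx = np.asarray(dx, dtype=float)
    T = np.asarray(T, dtype=float)
    n = dx.size
    S = dx / T

    # Invert the 3x3 cell matrix of (5.15-5) with omega = 0 in every cell.
    inv = np.empty((n, 3, 3))
    for i in range(n):
        cell = np.array([[S[i] / 3, S[i] / 6, 1.0],
                         [S[i] / 6, S[i] / 3, -1.0],
                         [1.0, -1.0, 0.0]])
        inv[i] = np.linalg.inv(cell)

    # Flux continuity (5.15-6) at each interior side gives one row of a
    # tridiagonal system for the interior side pressures.
    m = n - 1
    A = np.zeros((m, m))
    rhs = np.zeros(m)
    for k in range(m):                 # side between cells k and k+1
        a, b = inv[k], inv[k + 1]
        lower = -a[1, 0]               # multiplies p_{k-1/2}
        upper = -b[0, 1]               # multiplies p_{k+3/2}
        A[k, k] = a[1, 1] + b[0, 0]    # multiplies p_{k+1/2}
        if k == 0:
            rhs[k] -= lower * p_left   # known boundary pressure
        else:
            A[k, k - 1] = lower
        if k == m - 1:
            rhs[k] -= upper * p_right  # known boundary pressure
        else:
            A[k, k + 1] = upper

    sides = np.concatenate(([p_left], np.linalg.solve(A, rhs), [p_right]))

    # Back-substitute with (5.15-7) to recover fluxes and cell pressures.
    fluxes = np.empty((n, 2))
    cells = np.empty(n)
    for i in range(n):
        sol = inv[i] @ np.array([sides[i], -sides[i + 1], 0.0])
        fluxes[i] = sol[:2]
        cells[i] = sol[2]
    return sides, cells, fluxes
```

Under these assumptions every entry of `fluxes` should equal the single constant velocity $(p_L - p_R)/\sum_i S_i$, which provides a quick check of the assembly.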

5.15.5.1 Exact Solution for Piecewise-Constant Permeability


It is not hard to compute the exact solution of the pressure equation in one dimension in the absence of wells interior to the domain, assuming piecewise-constant permeability, viscosity and density on some mesh with cell widths $\triangle x_i$. The divergence-free condition on the velocity shows that it is constant in space. If velocity is specified at one of the boundaries, then that must be the velocity everywhere in the domain. In order to make the discussion more interesting, we will assume that we have specified the pressure at the two boundaries.

Assume that the flow domain is $(x_L, x_R)$. We can solve for pressure in terms of velocity to obtain
\[ p(x) - p(x_L) = -\int_{x_L}^{x} \left[ v \mu(s)/K(s) + g \rho(s) \right] ds . \]
The known pressure at the right allows us to determine the velocity:
\[ v = \frac{p(x_L) - p(x_R) - g \int_{x_L}^{x_R} \rho(s) \, ds}{\int_{x_L}^{x_R} \mu(s)/K(s) \, ds} . \]
Since the coefficients are assumed to be piecewise constant with respect to some mesh with cell widths $\triangle x_i$, this can be rewritten
\[ v = \frac{p(x_L) - p(x_R) - g \sum_i \rho_i \triangle x_i}{\sum_i (\mu_i/K_i) \triangle x_i} . \]
Similarly, the pressure at the cell sides in the mesh can be written
\[ p(x_{i+\frac12}) = p(x_L) - v \sum_{j \le i} (\mu_j/K_j) \triangle x_j - g \sum_{j \le i} \rho_j \triangle x_j . \]

Note that these analytical solutions depend on the harmonic average of the transmissibility $K/\mu$. Also recall from section ?? that homogenization produces precisely the harmonic average of the transmissibility.

In the absence of wells, the hybrid mixed method equations imply that velocity is constant throughout the domain, and that within each grid cell
\begin{align*}
\frac{\mu_i \triangle x_i}{2 K_i} v + p_i &= p_{i-\frac12} + \gamma_i \triangle x_i \\
\frac{\mu_i \triangle x_i}{2 K_i} v - p_i &= -p_{i+\frac12} + \gamma_i \triangle x_i .
\end{align*}
We can add these two equations to get
\[ p_{i-\frac12} - p_{i+\frac12} = \frac{\mu_i \triangle x_i}{K_i} v + g \rho_i \triangle x_i . \]


Using the pressure at the left-hand side, we solve this recurrence to get
\[ p_{i+\frac12} = p(x_L) - \sum_{j \le i} (v \mu_j/K_j + g \rho_j) \triangle x_j . \]
Using the pressure at the right-hand side, we can solve for the discrete velocity to see that it is identical to the analytical solution. We can substitute this velocity into the discrete solution for the side pressures and see that they are also identical to their analytical solutions. It is not hard to see that the cell-centered pressures $p_i$ are also exact. Thus the hybrid mixed method produces the exact solution for piecewise-constant coefficients in one dimension, even if the permeability takes on random values in the grid cells.
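The closed-form solution above is easy to evaluate numerically. The following sketch assumes piecewise-constant data given as arrays over the cells; the function name `exact_pressure_1d` and its argument names are hypothetical, chosen only for this illustration.

```python
import numpy as np

def exact_pressure_1d(dx, K, mu, rho, g, p_left, p_right):
    """Constant velocity and side pressures for piecewise-constant data."""
    dx, K, mu, rho = (np.asarray(a, dtype=float) for a in (dx, K, mu, rho))
    resistance = mu / K * dx       # (mu_i / K_i) * dx_i in each cell
    weight = g * rho * dx          # g * rho_i * dx_i in each cell
    v = (p_left - p_right - weight.sum()) / resistance.sum()
    # p(x_{i+1/2}) = p(x_L) - sum_{j<=i} (v mu_j/K_j + g rho_j) dx_j
    sides = p_left - np.cumsum(v * resistance + weight)
    return v, np.concatenate(([p_left], sides))
```

By construction the last side pressure reproduces the specified right-hand pressure, and with $g = 0$ the velocity reduces to the pressure drop divided by $\sum_i (\mu_i/K_i)\triangle x_i$, which is the harmonic averaging mentioned in the text.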

5.15.6 Comments on the Hybrid Mixed Finite Element Method

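In 1D, the result of substituting the volume fluxes as functions of the side pressures, analogous to (5.15-58), into the equation for continuity of the volume fluxes, analogous to (5.15-59), is a tridiagonal system of linear equations:
\begin{align*}
0 &= -v^{(L)}_{i+\frac12} + v^{(R)}_{i+\frac12} \\
&= \begin{bmatrix} -(M_2)_i , & (M_3)_i + (M_1)_{i+1} , & -(M_2)_{i+1} \end{bmatrix}
\begin{bmatrix} p_{i-\frac12} \\ p_{i+\frac12} \\ p_{i+\frac32} \end{bmatrix} \\
&\quad + (m_2 \zeta)_i - (m_1 \zeta)_{i+1} - [(M_2 + M_3)\gamma]_i + [(M_1 + M_2)\gamma]_{i+1} . \tag{5.15-10}
\end{align*}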

Both the mixed method and the hybrid mixed method involve harmonic averaging of the permeability. In particular, if the permeability in some cell is zero, then the hybrid mixed method will produce zero normal velocities associated with the interior sides of that cell, and continuity of the normal velocity will require that there is no flow across these sides. This feature can be very useful in practical situations, but requires some numerical care to avoid division by zero.

One drawback of the hybrid mixed method is that the linear system involves an average of $d$ unknowns per cell, where $d$ is the number of spatial dimensions. Thus this system is larger than the usual system for cell-centered pressure in standard petroleum simulation, such as block-centered finite differences [65]. Because the pressure unknowns are associated with cell sides, the stencil is non-standard; in particular, we cannot use readily available incomplete factorizations for the linear system. Further, because the piecewise polynomial spaces for the Lagrange multipliers are not nested between levels of refinement, we cannot use much of the available literature for developing multigrid iterative techniques for this pressure equation. We will describe our approach below in section ??.

We also note that the hybrid mixed method has some resemblance to finite volume methods [36, 92] that have been suggested for use in flow in porous media.

5.15.6.1 Poisson Equation


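We can formulate the Poisson problem
\begin{align*}
-\nabla_x \cdot \nabla_x u &= f \mbox{ in } \Omega \tag{5.15-11a} \\
u &= \nu \mbox{ on } \Gamma_0 \subset \partial\Omega \tag{5.15-11b} \\
\mathbf{n} \cdot \nabla_x u &= g \mbox{ on } \Gamma_1 = \partial\Omega - \Gamma_0 \tag{5.15-11c}
\end{align*}
as a constrained minimization problem.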

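Lemma 5.15-1: Suppose that $f \in H^{-1}(\Omega)$, $g \in H^{-1/2}(\Gamma_1)$ and $\nu \in H^{1/2}(\Gamma_0)$ are given, and that $\Gamma_0$ has nonzero measure. Then the solution of the Poisson problem (5.15-11) is the same as the solution of either the primal constrained minimization problem
\begin{align*}
& \mbox{find } u \in H^1(\Omega) \tag{5.15-12a} \\
& \mbox{minimize } P(u) \equiv \frac12 \int_\Omega \nabla_x u \cdot \nabla_x u \, dx - \int_\Omega u f \, dx - \int_{\Gamma_1} u g \, ds \tag{5.15-12b} \\
& \mbox{subject to } \forall \lambda \in H^{-1/2}(\Gamma_0) , \ \int_{\Gamma_0} \lambda (u - \nu) \, ds = 0 \tag{5.15-12c}
\end{align*}
or the dual constrained maximization problem
\begin{align*}
& \mbox{find } \mathbf{w} \in H_{div}(\Omega) \tag{5.15-13a} \\
& \mbox{maximize } D(\mathbf{w}) \equiv -\frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w} \, ds \tag{5.15-13b} \\
& \mbox{subject to } \forall \mu \in H^{1/2}(\Gamma_1) , \ \int_{\Gamma_1} \mu (\mathbf{n} \cdot \mathbf{w} - g) \, ds = 0 \mbox{ and } \forall v \in H^0(\Omega) , \ \int_\Omega v (\nabla_x \cdot \mathbf{w} + f) \, dx = 0 . \tag{5.15-13c}
\end{align*}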

Here $H_{div}(\Omega)$ is the Hilbert space formed by taking the completion of $C^\infty(\Omega)$ functions with respect to the norm
\[ \| \mathbf{w} \|_{div}^2 \equiv \| \mathbf{w} \|_0^2 + \| \nabla_x \cdot \mathbf{w} \|_0^2 . \]
If $u \in H^1(\Omega)$ satisfies the first-order condition for a minimum of the primal, then $\nabla_x u$ is feasible for the dual. If $u$ satisfies the primal constraint, then $\nabla_x u$ satisfies the first-order condition for a maximum of the dual. Further, if $u$ solves the primal problem and $\nabla_x u$ solves the dual, then $P(u) = D(\nabla_x u)$.

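Proof: The first-order conditions for the minimum of the primal have the form
\begin{align*}
& \forall \delta u \in N_{\Gamma_0} \equiv \left\{ \delta u \in H^1(\Omega) : \forall \lambda \in H^{-1/2}(\Gamma_0) , \ \int_{\Gamma_0} \lambda \, \delta u \, ds = 0 \right\} , \\
& \int_\Omega \nabla_x u \cdot \nabla_x \delta u \, dx - \int_\Omega f \delta u \, dx - \int_{\Gamma_1} g \delta u \, ds = 0 . \tag{5.15-14}
\end{align*}
This is a weak formulation of the Poisson problem.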

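The first-order conditions for the maximum of the dual have the form
\[ \forall \delta\mathbf{w} \in N_{\Gamma_1} , \ -\int_\Omega \mathbf{w} \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds = 0 \tag{5.15-15} \]
where
\[ N_{\Gamma_1} \equiv \left\{ \delta\mathbf{w} \in H_{div}(\Omega) : \forall \mu \in H^{1/2}(\Gamma_1) , \ \int_{\Gamma_1} \mu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds = 0 \mbox{ and } \forall v \in H^0(\Omega) , \ -\int_\Omega v \nabla_x \cdot \delta\mathbf{w} \, dx = 0 \right\} . \]
Since $\int_\Omega \delta\mathbf{w} \cdot \delta\mathbf{w} \, dx$ is positive-definite on $N_{\Gamma_1}$, the dual has a unique maximum.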

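Using the Hodge decomposition theorem 5.14-1, we can write $\mathbf{w} = \mathbf{v} + \nabla_x u$ where $\mathbf{v}$ is divergence-free in $\Omega$ and $\mathbf{n} \cdot \mathbf{v} = 0$ on $\partial\Omega$. Then for all $\delta\mathbf{w} \in N_{\Gamma_1}$ we have
\begin{align*}
0 &= -\int_\Omega [\mathbf{v} + \nabla_x u] \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= -\int_\Omega \mathbf{v} \cdot \delta\mathbf{w} \, dx - \int_{\partial\Omega} u \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_\Omega u \nabla_x \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= -\int_\Omega \mathbf{v} \cdot \delta\mathbf{w} \, dx .
\end{align*}
We can choose $\delta\mathbf{w} = \mathbf{v}$ and see that we must have $\mathbf{v} = 0$. Thus if $\mathbf{w}$ satisfies the first-order condition for a maximum of the dual, $\mathbf{w} = \nabla_x u$ for some scalar function $u$. The first-order condition and the constraint for the maximum then imply that $u$ solves the Poisson problem.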

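It is easy to see that if $u \in H^1(\Omega)$ satisfies the first-order conditions for a minimum of the primal and $\nabla_x u \in H_{div}(\Omega)$, then $\nabla_x u$ satisfies the constraint for the dual. In fact, for all $\delta u \in N_{\Gamma_0}$
\begin{align*}
0 &= \int_\Omega \nabla_x u \cdot \nabla_x \delta u \, dx - \int_\Omega f \delta u \, dx - \int_{\Gamma_1} g \delta u \, ds \\
&= \int_\Omega \nabla_x \cdot (\nabla_x u \, \delta u) - \delta u (\nabla_x \cdot \nabla_x u + f) \, dx - \int_{\Gamma_1} g \delta u \, ds \\
&= \int_{\Gamma_1} \delta u (\mathbf{n} \cdot \nabla_x u - g) \, ds - \int_\Omega \delta u (\nabla_x \cdot \nabla_x u + f) \, dx .
\end{align*}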


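Furthermore, if $u$ satisfies the constraint for the primal, then $\nabla_x u$ satisfies the first-order conditions for a maximum of the dual: for all $\delta\mathbf{w} \in N_{\Gamma_1}$,
\begin{align*}
& -\int_\Omega \delta\mathbf{w} \cdot \nabla_x u \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= -\int_\Omega \nabla_x \cdot (\delta\mathbf{w} \, u) - u \nabla_x \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= \int_{\Gamma_0} (\nu - u) \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_\Omega u \nabla_x \cdot \delta\mathbf{w} \, dx = 0 .
\end{align*}
Similarly, if $\mathbf{w} = \nabla_x u$ satisfies the first-order conditions for a maximum of the dual, then $u$ satisfies the constraint for the primal; and if $\mathbf{w} = \nabla_x u$ satisfies the constraint for the dual, then $u$ satisfies the first-order conditions for a minimum of the primal.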

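If $\nabla_x u = \mathbf{w} \in H_{div}(\Omega)$ is optimal for the primal and the dual, then
\begin{align*}
P(u) &= \frac12 \int_\Omega \nabla_x u \cdot \nabla_x u \, dx - \int_\Omega u f \, dx - \int_{\Gamma_1} u g \, ds \\
&= \frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx + \int_\Omega u \nabla_x \cdot \mathbf{w} \, dx - \int_{\Gamma_1} u \, \mathbf{n} \cdot \mathbf{w} \, ds \\
&= \frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx + \int_\Omega \nabla_x \cdot (\mathbf{w} u) - \mathbf{w} \cdot \nabla_x u \, dx - \int_{\Gamma_1} u \, \mathbf{n} \cdot \mathbf{w} \, ds \\
&= -\frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w} \, ds \equiv D(\mathbf{w}) .
\end{align*}
□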
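The identity $P(u) = D(\nabla_x u)$ can be checked by hand or by machine on a one-dimensional example. The sketch below is an illustration only, with data chosen for convenience rather than taken from the text: $\Omega = (0,1)$, $\Gamma_0 = \{0\}$, $\Gamma_1 = \{1\}$, $f = 1$, $\nu = 2$ and $g = 1/2$. Polynomials are integrated exactly with NumPy's `Polynomial` class; the helper names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

f, nu, g = 1.0, 2.0, 0.5

# Exact solution of -u'' = f on (0,1) with u(0) = nu and u'(1) = g.
u = Polynomial([nu, g + f, -f / 2])

def integral01(p):
    """Exact integral of the polynomial p over (0, 1)."""
    antiderivative = p.integ()
    return antiderivative(1.0) - antiderivative(0.0)

def primal(u):
    du = u.deriv()
    return 0.5 * integral01(du * du) - integral01(f * u) - g * u(1.0)

def dual(w):
    # Gamma_0 is the point x = 0, where the outward normal is -1.
    return -0.5 * integral01(w * w) + nu * (-1.0) * w(0.0)

P_opt = primal(u)
D_opt = dual(u.deriv())
```

For this data both values equal $-85/24$. Adding a feasible perturbation such as $\delta u = x$, which vanishes on $\Gamma_0$, increases the primal objective, as expected at a minimum.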

We will say that $u \in H^1(\Omega)$ is feasible for the minimization problem (5.15-12) if it satisfies the constraint (5.15-12c), and optimal if it satisfies the first-order condition for a minimum (5.15-14). If $\Gamma_0$ has nonzero measure, then $\int_\Omega \nabla_x \delta u \cdot \nabla_x \delta u \, dx$ is positive-definite on $N_{\Gamma_0}$, and the primal has a unique minimum. This justifies our definition of "optimal."

We can use Lagrange multipliers to remove the constraints in these optimization problems.


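Lemma 5.15-2: If $(u, \gamma) \in H^1(\Omega) \times H^{-1/2}(\Gamma_0)$ is an extremum of the primal Lagrangian
\[ L_P(u, \gamma) \equiv \frac12 \int_\Omega \nabla_x u \cdot \nabla_x u \, dx - \int_\Omega u f \, dx - \int_{\Gamma_1} u g \, ds - \int_{\Gamma_0} \gamma (u - \nu) \, ds , \tag{5.15-16} \]
then $u$ is a minimum of the primal formulation of the Poisson problem (5.15-12). If $(\mathbf{w}, u, \mu) \in H_{div}(\Omega) \times H^0(\Omega) \times H^0(\Gamma_1)^d$ is an extremum of the dual Lagrangian
\[ L_D(\mathbf{w}, u, \mu) \equiv -\frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w} \, ds - \int_\Omega u (\nabla_x \cdot \mathbf{w} + f) \, dx + \int_{\Gamma_1} \mu (\mathbf{n} \cdot \mathbf{w} - g) \, ds , \tag{5.15-17} \]
then $\mathbf{w}$ is a maximum of the dual formulation of the Poisson problem (5.15-13).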

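Proof: The first-order conditions for an extremum of $L_P$ are
\begin{align*}
& \forall \delta u \in H^1(\Omega) , \ \int_\Omega \nabla_x u \cdot \nabla_x \delta u \, dx - \int_\Omega f \delta u \, dx - \int_{\Gamma_1} g \delta u \, ds - \int_{\Gamma_0} \gamma \delta u \, ds = 0 , \tag{5.15-18a} \\
& \forall \delta\gamma \in H^0(\Gamma_0) , \ -\int_{\Gamma_0} (u - \nu) \delta\gamma \, ds = 0 . \tag{5.15-18b}
\end{align*}
The latter of these two equations shows that $u$ is feasible for the primal, and the former shows that $u$ is optimal for the primal. If $u$ is an extremum of $L_P$ and $\nabla_x u \in H_{div}(\Omega)$, then it is not hard to see that $\gamma = \mathbf{n} \cdot \nabla_x u$; for all $\delta u \in H^1(\Omega)$ we have
\begin{align*}
0 &= \int_\Omega \nabla_x \cdot (\nabla_x u \, \delta u) - \delta u \nabla_x \cdot \nabla_x u \, dx - \int_\Omega f \delta u \, dx - \int_{\Gamma_1} g \delta u \, ds - \int_{\Gamma_0} \gamma \delta u \, ds \\
&= -\int_\Omega \delta u (\nabla_x \cdot \nabla_x u + f) \, dx + \int_{\Gamma_1} \delta u (\mathbf{n} \cdot \nabla_x u - g) \, ds + \int_{\Gamma_0} \delta u (\mathbf{n} \cdot \nabla_x u - \gamma) \, ds .
\end{align*}
Also note that if $u$ is an extremum of $L_P$, then $u$ solves the primal and
\[ L_P(u, \mathbf{n} \cdot \nabla_x u) = P(u) - \int_{\Gamma_0} \mathbf{n} \cdot \nabla_x u \, (u - \nu) \, ds = P(u) . \]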

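The first-order conditions for an extremum of $L_D$ are
\begin{align*}
& \forall \delta\mathbf{w} \in H_{div}(\Omega) , \ -\int_\Omega \mathbf{w} \cdot \delta\mathbf{w} \, dx - \int_\Omega u \nabla_x \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_1} \mu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds = 0 \tag{5.15-19a} \\
& \forall \delta u \in H^0(\Omega) , \ -\int_\Omega \delta u (\nabla_x \cdot \mathbf{w} + f) \, dx = 0 \tag{5.15-19b} \\
& \forall \delta\mu \in H^0(\Gamma_1) , \ -\int_{\Gamma_1} (\mathbf{n} \cdot \mathbf{w} - g) \delta\mu \, ds = 0 . \tag{5.15-19c}
\end{align*}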


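If $(\mathbf{w}, u, \mu)$ is an extremum for $L_D$, the last two of these equations show that $\mathbf{w}$ is feasible for the dual, and the first shows that $\mathbf{w}$ is optimal for the dual. If $u \in H^1(\Omega)$, it follows that $\mathbf{w} = \nabla_x u$ in $\Omega$, and $\mu = u$ on $\Gamma_1$; for all $\delta\mathbf{w} \in H_{div}(\Omega)$ we have
\begin{align*}
0 &= -\int_\Omega \mathbf{w} \cdot \delta\mathbf{w} \, dx - \int_\Omega u \nabla_x \cdot \delta\mathbf{w} \, dx + \int_{\Gamma_1} \mu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= -\int_\Omega \mathbf{w} \cdot \delta\mathbf{w} \, dx - \int_\Omega \nabla_x \cdot (\delta\mathbf{w} \, u) - \delta\mathbf{w} \cdot \nabla_x u \, dx + \int_{\Gamma_1} \mu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \delta\mathbf{w} \, ds \\
&= -\int_\Omega \delta\mathbf{w} \cdot (\mathbf{w} - \nabla_x u) \, dx + \int_{\Gamma_1} (\mu - u) \, \mathbf{n} \cdot \delta\mathbf{w} \, ds + \int_{\Gamma_0} (\nu - u) \, \mathbf{n} \cdot \delta\mathbf{w} \, ds .
\end{align*}
Also note that if $\mathbf{w}$ is an extremum of $L_D$, then $\mathbf{w}$ solves the dual and
\[ L_D(\mathbf{w}, u, \mu) = D(\mathbf{w}) + \int_{\Gamma_1} \mu (\mathbf{n} \cdot \mathbf{w} - g) \, ds = D(\mathbf{w}) . \]
□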

While the Lagrangians have the advantage of leading to unconstrained extremum problems, their extrema are saddle points.

Lemma 5.15-3: Suppose that $(u^*, \gamma^*)$ is the extremum of $L_P$, and that $(\mathbf{w}^*, u^*, \mu^*)$ is the extremum of $L_D$, defined in lemma 5.15-2. Then for all $u \in H^1(\Omega)$ and all $\gamma \in H^{-1/2}(\Gamma_0)$
\[ L_P(u^*, \gamma) = L_P(u^*, \gamma^*) \le L_P(u, \gamma^*) \]
and for all $\mathbf{w} \in H_{div}(\Omega)$, all $u \in H^0(\Omega)$ and all $\mu \in H^0(\Gamma_1)^d$
\[ L_D(\mathbf{w}^*, u, \mu) = L_D(\mathbf{w}^*, u^*, \mu^*) \ge L_D(\mathbf{w}, u^*, \mu^*) . \]

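Proof: We will prove the claimed conditions for $L_D$. First,
\begin{align*}
& L_D(\mathbf{w}^*, u^*, \mu^*) - L_D(\mathbf{w}^*, u, \mu) \\
&= -\frac12 \int_\Omega \mathbf{w}^* \cdot \mathbf{w}^* \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w}^* \, ds - \int_\Omega u^* (\nabla_x \cdot \mathbf{w}^* + f) \, dx + \int_{\Gamma_1} \mu^* (\mathbf{n} \cdot \mathbf{w}^* - g) \, ds \\
&\quad + \frac12 \int_\Omega \mathbf{w}^* \cdot \mathbf{w}^* \, dx - \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w}^* \, ds + \int_\Omega u (\nabla_x \cdot \mathbf{w}^* + f) \, dx - \int_{\Gamma_1} \mu (\mathbf{n} \cdot \mathbf{w}^* - g) \, ds \\
&= \int_\Omega (u - u^*) (\nabla_x \cdot \mathbf{w}^* + f) \, dx + \int_{\Gamma_1} (\mu^* - \mu) (\mathbf{n} \cdot \mathbf{w}^* - g) \, ds = 0 ,
\end{align*}
because $\mathbf{w}^*$ is feasible for the dual. Similarly, for all $\mathbf{w} \in H_{div}(\Omega)$, since $\mathbf{w}^* = \nabla_x u^*$ in $\Omega$, $u^* = \nu$ on $\Gamma_0$ and $\mu^* = u^*$ on $\Gamma_1$,
\begin{align*}
& -L_D(\mathbf{w}, u^*, \mu^*) + L_D(\mathbf{w}^*, u^*, \mu^*) \\
&= \frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} \, dx - \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w} \, ds + \int_\Omega u^* (\nabla_x \cdot \mathbf{w} + f) \, dx - \int_{\Gamma_1} \mu^* (\mathbf{n} \cdot \mathbf{w} - g) \, ds \\
&\quad - \frac12 \int_\Omega \mathbf{w}^* \cdot \mathbf{w}^* \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot \mathbf{w}^* \, ds - \int_\Omega u^* (\nabla_x \cdot \mathbf{w}^* + f) \, dx + \int_{\Gamma_1} \mu^* (\mathbf{n} \cdot \mathbf{w}^* - g) \, ds \\
&= \frac12 \int_\Omega \mathbf{w} \cdot \mathbf{w} - \mathbf{w}^* \cdot \mathbf{w}^* \, dx + \int_{\Gamma_0} \nu \, \mathbf{n} \cdot (\mathbf{w}^* - \mathbf{w}) \, ds \\
&\quad + \int_\Omega \mathbf{w}^* \cdot (\mathbf{w}^* - \mathbf{w}) \, dx - \int_{\Gamma_0} \nu \, \mathbf{n} \cdot (\mathbf{w}^* - \mathbf{w}) \, ds \\
&= \frac12 \int_\Omega (\mathbf{w}^* - \mathbf{w}) \cdot (\mathbf{w}^* - \mathbf{w}) \, dx \ge 0 .
\end{align*}
The primal Lagrangian leads to a saddle-point problem in a similar fashion. □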

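It is common to develop Lagrangians in ways that are unrelated to minimization or maximization problems. In the Poisson problem, one might work with the differential equation in the following form
\begin{align*}
-\nabla_x \cdot \mathbf{v} &= f \mbox{ in } \Omega \tag{5.15-20a} \\
\mathbf{v} - \nabla_x u &= 0 \mbox{ in } \Omega \tag{5.15-20b} \\
u &= \omega \mbox{ on } \Gamma_0 \subset \partial\Omega \tag{5.15-20c} \\
\mathbf{n} \cdot \mathbf{v} &= g \mbox{ on } \Gamma_1 = \partial\Omega - \Gamma_0 \tag{5.15-20d}
\end{align*}
We can reformulate this problem in weak form as follows. Let $H$ be the completion with respect to $\| \cdot \|_{div}$ of $C^\infty$ vector functions whose normal component vanishes on $\Gamma_1$. We want to find $u \in H^0(\Omega)$ and $\mathbf{v} \in H_{div}(\Omega)$ with normal component equal to $g$ on $\Gamma_1$ so that
\begin{align*}
& \forall \delta\mathbf{v} \in H , \ \int_\Omega \delta\mathbf{v} \cdot \mathbf{v} \, dx + \int_\Omega u \nabla_x \cdot \delta\mathbf{v} \, dx = \int_{\Gamma_0} \omega \, \mathbf{n} \cdot \delta\mathbf{v} \, ds \tag{5.15-21a} \\
& \forall \delta u \in H^0(\Omega) , \ \int_\Omega \delta u \nabla_x \cdot \mathbf{v} \, dx = -\int_\Omega f \delta u \, dx \tag{5.15-21b}
\end{align*}
This weak formulation corresponds to the Lagrangian
\[ L(\mathbf{v}, u) \equiv \int_\Omega \frac12 \mathbf{v} \cdot \mathbf{v} + u \nabla_x \cdot \mathbf{v} + u f \, dx - \int_{\Gamma_0} \omega \, \mathbf{n} \cdot \mathbf{v} \, ds . \]
We will also be able to analyze the well-posedness of the first-order conditions for the extremum of this Lagrangian. Note that the boundary condition on $u$ is a natural boundary condition, and the boundary condition on $\mathbf{v}$ is an essential boundary condition.

5.15.6.2 Incompressible Single-Phase Flow in Porous Media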


Incompressible single-phase flow in porous media consists of the following equations
\begin{align*}
\nabla_x \cdot \mathbf{v} &= \omega \mbox{ in } \Omega \tag{5.15-22a} \\
\mathbf{v} &= -K (\nabla_x p + \mathbf{g} \rho)/\mu \mbox{ in } \Omega \tag{5.15-22b} \\
p &= \psi \mbox{ on } \Gamma_0 \subset \partial\Omega \tag{5.15-22c} \\
\mathbf{n} \cdot \mathbf{v} &= f \mbox{ on } \Gamma_1 = \partial\Omega - \Gamma_0 \tag{5.15-22d}
\end{align*}
Equation (5.15-22a) represents conservation of fluid volume, and equation (5.15-22b) is Darcy's law, which represents conservation of momentum. Here $\mathbf{v}$ is the velocity vector, $p$ is the fluid pressure, $\mu$ is the fluid viscosity, $\rho$ is the fluid density, $\mathbf{g}$ is the acceleration due to gravity, and $K$ is the permeability. Flow often occurs because of the influence of wells, represented by the source term $\omega$. In order to specify a unique solution to this problem, we will also assume that the pressure is specified at some point; if this is not on a portion $\Gamma_0$ of the boundary, then pressure is usually specified within a well.

The common approach to determining a Lagrangian for this problem involves two steps.

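First, we multiply (5.15-22b) by an arbitrary $\delta\mathbf{v} \in H_{div}(\Omega)$ with zero normal component on $\Gamma_1$ and integrate over $\Omega$; we also multiply (5.15-22a) by an arbitrary $\delta p \in H^0(\Omega)$ and integrate over $\Omega$. These lead to the system of weak equations
\begin{align*}
\int_\Omega \mathbf{v} \cdot K^{-1} \mu \, \delta\mathbf{v} \, dx - \int_\Omega p \nabla_x \cdot \delta\mathbf{v} \, dx &= -\int_{\Gamma_0} \psi \, \mathbf{n} \cdot \delta\mathbf{v} \, ds \tag{5.15-23a} \\
-\int_\Omega \delta p \nabla_x \cdot \mathbf{v} \, dx &= -\int_\Omega \omega \delta p \, dx \tag{5.15-23b}
\end{align*}
These are the equations for zero first variation of the Lagrangian
\[ L(\mathbf{v}, p) \equiv \frac12 \int_\Omega \mathbf{v} \cdot K^{-1} \mu \mathbf{v} \, dx + \int_\Omega p (\omega - \nabla_x \cdot \mathbf{v}) + \mathbf{v} \cdot \mathbf{g} \rho \, dx + \int_{\Gamma_0} \psi \, \mathbf{n} \cdot \mathbf{v} \, ds . \tag{5.15-24} \]
Note that the boundary condition on $p$ is natural, and the boundary condition on $\mathbf{v}$ is essential.

5.15.6.3 Linear Elasticity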

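Let $\mathbf{u} \in R^d$ be the displacement, and define the infinitesimal strain by
\[ E(\mathbf{u}) \equiv \frac12 \left[ \frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \right)^\top \right] . \]
Given a shear modulus $\mu \ge 0$ and a Lamé constant $\lambda > 0$, define the stress by Hooke's law
\[ S(\mathbf{u}) \equiv E(\mathbf{u}) 2\mu + I \, \mbox{tr}(E(\mathbf{u})) \lambda . \]
Here $\mbox{tr}(E)$ is the trace of the matrix $E$.

Suppose that $\Gamma_0 \subset \partial\Omega$ and $\Gamma_1 = \partial\Omega - \Gamma_0$. Given $\mathbf{f} \in H^{-1}(\Omega)$, $\mathbf{g} \in H^{-1/2}(\Gamma_1)$ and $\nu \in H^{1/2}(\Gamma_0)$, we want to solve
\begin{align*}
-\nabla_x \cdot S(\mathbf{u}) &= \mathbf{f}^\top \mbox{ in } \Omega \tag{5.15-25a} \\
\mathbf{u} &= \nu \mbox{ on } \Gamma_0 \tag{5.15-25b} \\
S(\mathbf{u}) \mathbf{n} &= \mathbf{g} \mbox{ on } \Gamma_1 \tag{5.15-25c}
\end{align*}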

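If we multiply by $\delta\mathbf{u} \in H^1(\Omega)^d$ where $\delta\mathbf{u} = 0$ on $\Gamma_0$, and then integrate, we obtain
\begin{align*}
\int_\Omega \mathbf{f} \cdot \delta\mathbf{u} \, dx &= -\int_\Omega [\nabla_x \cdot S(\mathbf{u})] \delta\mathbf{u} \, dx \\
&= -\int_\Omega \nabla_x \cdot [S(\mathbf{u}) \delta\mathbf{u}] - \mbox{tr} \left[ S(\mathbf{u}) \frac{\partial \delta\mathbf{u}}{\partial \mathbf{x}} \right] dx \\
&= -\int_{\partial\Omega} \mathbf{n} \cdot S(\mathbf{u}) \delta\mathbf{u} \, ds + \int_\Omega \frac12 \mbox{tr} \left[ S(\mathbf{u}) \frac{\partial \delta\mathbf{u}}{\partial \mathbf{x}} + \left( \frac{\partial \delta\mathbf{u}}{\partial \mathbf{x}} \right)^\top S(\mathbf{u})^\top \right] dx \\
&= -\int_{\Gamma_1} \mathbf{n} \cdot S(\mathbf{u}) \delta\mathbf{u} \, ds + \frac12 \int_\Omega \mbox{tr} \left[ S(\mathbf{u}) \frac{\partial \delta\mathbf{u}}{\partial \mathbf{x}} + S(\mathbf{u})^\top \left( \frac{\partial \delta\mathbf{u}}{\partial \mathbf{x}} \right)^\top \right] dx \\
&= -\int_{\Gamma_1} \mathbf{g} \cdot \delta\mathbf{u} \, ds + \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\delta\mathbf{u})] \, dx .
\end{align*}
The last step uses the symmetry of $S(\mathbf{u})$.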

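We will define the strain energy density to be
\begin{align*}
\mathcal{E}(\mathbf{u}) &= \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] = \mbox{tr} [E(\mathbf{u})^2 2\mu + E(\mathbf{u}) \, \mbox{tr} E(\mathbf{u}) \lambda] = \| E(\mathbf{u}) \|_F^2 2\mu + [\mbox{tr} E(\mathbf{u})]^2 \lambda \\
&= \| E(\mathbf{u}) \|_F^2 2\mu + [\nabla_x \cdot \mathbf{u}]^2 \lambda .
\end{align*}
Here $\| E(\mathbf{u}) \|_F$ is the Frobenius norm, the square root of the sum of the squares of the entries.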

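The primal optimization problem for linear elasticity is to find $\mathbf{u} \in H^1(\Omega)^d$ to solve
\begin{align*}
& \mbox{minimize } P(\mathbf{u}) \equiv \frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx - \int_\Omega \mathbf{f} \cdot \mathbf{u} \, dx - \int_{\Gamma_1} \mathbf{g} \cdot \mathbf{u} \, ds \tag{5.15-26a} \\
& \mbox{subject to } \forall \lambda \in H^{-1/2}(\Gamma_0)^d , \ \int_{\Gamma_0} \lambda \cdot (\mathbf{u} - \nu) \, ds = 0 \tag{5.15-26b}
\end{align*}
The first-order conditions for a minimum of this problem are the usual weak form of the linear elasticity problem (5.15-25). Korn's inequality [16]
\[ \exists C > 0 \ \forall \mathbf{u} \in H^1(\Omega)^d , \ \int_\Omega \| E(\mathbf{u}) \|_F^2 \, dx \ge C \| \mathbf{u} \|_1^2 \tag{5.15-27} \]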

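shows that the second-order conditions for a minimum are satisfied. The Lagrangian for the primal is
\[ L_P(\mathbf{u}, \eta) = \frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx - \int_\Omega \mathbf{f} \cdot \mathbf{u} \, dx - \int_{\Gamma_1} \mathbf{g} \cdot \mathbf{u} \, ds - \int_{\Gamma_0} \eta \cdot (\mathbf{u} - \nu) \, ds . \]
The first-order conditions for an extremum of $L_P(\mathbf{u}, \eta)$ imply that $\mathbf{u}$ solves the primal, and $\eta = S(\mathbf{u}) \mathbf{n}$ on $\Gamma_0$.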

Note that
\[ \mbox{tr}(S) = \mbox{tr}(E) (2\mu + 3\lambda) \]
so
\[ E(S) = S \frac{1}{2\mu} - I \, \mbox{tr}(S) \frac{\lambda}{2\mu} \frac{1}{2\mu + 3\lambda} . \]
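The stress-strain relations above can be checked numerically in three dimensions, where the factor $2\mu + 3\lambda$ applies. The sketch below uses arbitrary illustrative values for the moduli and a random matrix as a stand-in for the displacement gradient; none of these values come from the text.

```python
import numpy as np

mu, lam = 1.3, 0.7                   # illustrative shear modulus and Lame constant
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 3))      # stand-in for the displacement gradient
E = 0.5 * (G + G.T)                  # infinitesimal strain
I = np.eye(3)

S = 2 * mu * E + lam * np.trace(E) * I                          # Hooke's law
E_from_S = S / (2 * mu) - I * np.trace(S) * lam / (2 * mu) / (2 * mu + 3 * lam)
energy = np.trace(S @ E)                                        # strain energy density
```

Here `E_from_S` should reproduce `E`, and `energy` should agree with $2\mu \| E \|_F^2 + \lambda (\mbox{tr} E)^2$.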

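It follows that
\begin{align*}
P(\mathbf{u}) &= \frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx - \int_\Omega \mathbf{f} \cdot \mathbf{u} \, dx - \int_{\Gamma_1} \mathbf{g} \cdot \mathbf{u} \, ds \\
&= \frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx + \int_\Omega [\nabla_x \cdot S(\mathbf{u})] \mathbf{u} \, dx - \int_{\Gamma_1} \mathbf{n} \cdot S(\mathbf{u}) \mathbf{u} \, ds \\
&= \frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx + \int_\Omega \nabla_x \cdot [S(\mathbf{u}) \mathbf{u}] - \mbox{tr} \left[ S(\mathbf{u}) \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \right] dx - \int_{\Gamma_1} \mathbf{n} \cdot S(\mathbf{u}) \mathbf{u} \, ds \\
&= -\frac12 \int_\Omega \mbox{tr} [S(\mathbf{u}) E(\mathbf{u})] \, dx + \int_{\Gamma_0} \mathbf{n} \cdot [S(\mathbf{u}) \mathbf{u}] \, ds .
\end{align*}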

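The dual optimization problem for linear elasticity is to find $S \in H_{div}(\Omega)$ to solve
\begin{align*}
& \mbox{maximize } D(S) \equiv -\frac12 \int_\Omega \mbox{tr} [S^2] \frac{1}{2\mu} - [\mbox{tr} S]^2 \frac{\lambda}{2\mu} \frac{1}{2\mu + 3\lambda} \, dx + \int_{\Gamma_0} \mathbf{n} \cdot S \nu \, ds \tag{5.15-28a} \\
& \mbox{subject to } \forall \mathbf{v} \in H^0(\Omega) , \ \int_\Omega (\nabla_x \cdot S + \mathbf{f}^\top) \mathbf{v} \, dx = 0 \mbox{ and } \forall \eta \in H^0(\Gamma_1)^d , \ \int_{\Gamma_1} \eta \cdot (S \mathbf{n} - \mathbf{g}) \, ds = 0 \tag{5.15-28b}
\end{align*}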

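The first-order conditions for a maximum of this problem require that for all $\delta S$ such that
\[ \forall \eta \in H^0(\Gamma_1)^d , \ \int_{\Gamma_1} \eta \cdot \delta S \mathbf{n} \, ds = 0 \]
and
\begin{align*}
\forall \mathbf{v} \in H^1(\Omega)^d , \ 0 &= \int_\Omega (\nabla_x \cdot \delta S) \mathbf{v} \, dx \\
&= \int_\Omega \nabla_x \cdot (\delta S \mathbf{v}) - \mbox{tr} (\delta S E(\mathbf{v})) \, dx \\
&= \int_{\Gamma_0} \mathbf{n} \cdot \delta S \mathbf{v} \, ds - \int_\Omega \mbox{tr} (\delta S E(\mathbf{v})) \, dx
\end{align*}
we have that
\begin{align*}
0 &= -\int_\Omega \mbox{tr} (E(S) \delta S) \, dx + \int_{\Gamma_0} \mathbf{n} \cdot \delta S \nu \, ds \\
&= \int_\Omega \mbox{tr} ([E(\mathbf{v}) - E(S)] \delta S) \, dx + \int_{\Gamma_0} \mathbf{n} \cdot \delta S (\nu - \mathbf{v}) \, ds .
\end{align*}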


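Korn's inequality shows that the second-order conditions for a maximum are satisfied. The Lagrangian for the dual is
\begin{align*}
L_D(S, \mathbf{u}, \zeta) &\equiv -\frac12 \int_\Omega \mbox{tr} [S^2] \frac{1}{2\mu} - [\mbox{tr} S]^2 \frac{\lambda}{2\mu} \frac{1}{2\mu + 3\lambda} \, dx + \int_{\Gamma_0} \mathbf{n} \cdot S \nu \, ds \\
&\quad + \int_\Omega (\nabla_x \cdot S + \mathbf{f}^\top) \mathbf{u} \, dx - \int_{\Gamma_1} \zeta \cdot (S \mathbf{n} - \mathbf{g}) \, ds .
\end{align*}
The first-order conditions for an extremum of $L_D(S, \mathbf{u}, \zeta)$ imply that $\mathbf{u}$ solves the primal, and $\eta = S(\mathbf{u}) \mathbf{n}$ on $\Gamma_0$.

5.15.6.4 Stokes Equation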

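The incompressible Navier-Stokes equations are
\begin{align*}
\rho \frac{\partial \mathbf{v}}{\partial t} + \rho \mathbf{v} \cdot \nabla_x \mathbf{v} &= -\nabla_x p + \mu \nabla_x \cdot \nabla_x \mathbf{v} + \mathbf{f} \rho \\
\nabla_x \cdot \mathbf{v} &= 0
\end{align*}
The former of these is conservation of momentum, and the latter is the continuity equation (or conservation of mass). Here $\rho$ is the (constant) density, $\mu$ is the first coefficient of viscosity, $\mathbf{v} \in R^d$ is the velocity vector, $p$ is the pressure and $\mathbf{f} \in R^d$ is the body force applied to the fluid.

It is common to non-dimensionalize the Navier-Stokes equations. We choose some reference velocity magnitude $V$ and reference length $L$ and define dimensionless velocity, position, pressure and body force by
\begin{align*}
\tilde{\mathbf{v}} &= \mathbf{v}/V \\
\tilde{\mathbf{x}} &= \mathbf{x}/L \\
\tilde{p} &= p/(\rho V^2) \\
\tilde{\mathbf{f}} &= \mathbf{f} L/V^2
\end{align*}
Then the dimensionless Navier-Stokes equations can be written
\begin{align*}
\frac{\partial \tilde{\mathbf{v}}}{\partial t} + \tilde{\mathbf{v}} \cdot \nabla_{\tilde{x}} \tilde{\mathbf{v}} &= -\nabla_{\tilde{x}} \tilde{p} + \frac{1}{R} \nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}} \tilde{\mathbf{v}} + \tilde{\mathbf{f}} \\
\nabla_{\tilde{x}} \cdot \tilde{\mathbf{v}} &= 0
\end{align*}
where
\[ R = \frac{\rho V L}{\mu} \]
is the Reynolds number.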


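For strongly viscous flows, the Reynolds number is small. In such flows the convection term $\tilde{\mathbf{v}} \cdot \nabla_{\tilde{x}} \tilde{\mathbf{v}}$ is small relative to the viscous term $\frac{1}{R} \nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}} \tilde{\mathbf{v}}$. In the absence of body forces, we are led to the Stokes equation [24, 47]
\begin{align*}
\frac{\partial \tilde{\mathbf{v}}}{\partial t} &= -\nabla_{\tilde{x}} \tilde{p} + \frac{1}{R} \nabla_{\tilde{x}} \cdot \nabla_{\tilde{x}} \tilde{\mathbf{v}} \\
\nabla_{\tilde{x}} \cdot \tilde{\mathbf{v}} &= 0
\end{align*}
At steady state, these become
\begin{align*}
-\nabla_x p + \frac{1}{R} \nabla_x \cdot \nabla_x \mathbf{v} &= 0 \\
\nabla_x \cdot \mathbf{v} &= 0
\end{align*}
where we have removed the tildes for simplicity. Note that $p$ is determined only up to an additive constant by these equations. We could specify $p$ uniquely by fixing its value at a point in the domain, or by requiring that $\int_\Omega p \, dx = 0$.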

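Suppose that $\mathbf{u} \in C^\infty(\Omega)$ and $\mathbf{u} = 0$ on $\partial\Omega$, and $q \in C^\infty(\Omega)$. Multiplying the momentum equation by $\mathbf{u} \cdot$, the continuity equation by $q$ and applying the divergence theorem leads to
\begin{align*}
0 &= \int_\Omega \frac{1}{R} \mbox{tr} \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \left[ \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \right]^\top \right) dx - \int_\Omega p \nabla_x \cdot \mathbf{u} \, dx \\
0 &= \int_\Omega q \nabla_x \cdot \mathbf{v} \, dx
\end{align*}
These equations suggest that we define
\begin{align*}
a(\mathbf{u}, \mathbf{v}) &= \int_\Omega \frac{1}{R} \mbox{tr} \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \left[ \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \right]^\top \right) dx \\
b(\mathbf{u}, p) &= \int_\Omega p \nabla_x \cdot \mathbf{u} \, dx
\end{align*}
Then the steady-state Stokes equations have the weak form
\begin{align*}
& \forall \mathbf{u} \in H_0^1(\Omega)^d , \ a(\mathbf{u}, \mathbf{v}) - b(\mathbf{u}, p) = 0 \\
& \forall q \in \left\{ p \in H^0(\Omega) : \int_\Omega p \, dx = 0 \right\} , \ -b(\mathbf{v}, q) = 0
\end{align*}
This has the form of the first-order extremum conditions for a saddle-point, which we will relate to a constrained minimization problem. The divergence-free condition on the velocity field will serve as the constraint.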


5.15.7 Constrained Minimization and Lagrangians

We will consider a generalization of the mixed formulations we studied in sections 5.15.6.4 and 5.15.6.2. Let $U$ and $V$ be two Hilbert spaces. Let $a : U \times U \to R$ be bilinear, symmetric and bounded
\[ \exists C_a > 0 \ \forall u, w \in U , \ |a(u, w)| \le C_a \| u \|_U \| w \|_U \tag{5.15-29} \]
and let $b : U \times V \to R$ be bilinear. Suppose that $f : U \to R$ and $g : V \to R$ are linear. We want to find $u \in U$ to solve
\begin{align*}
& \min \frac12 a(u, u) - f(u) \tag{5.15-30a} \\
& \mbox{subject to } b(u, v) = g(v) \ \forall v \in V \tag{5.15-30b}
\end{align*}
If $u + w \varepsilon$ also satisfies the constraint, then
\[ \frac12 a(u + w\varepsilon, u + w\varepsilon) - f(u + w\varepsilon) = \left[ \frac12 a(u, u) - f(u) \right] + \varepsilon [a(u, w) - f(w)] + \frac{\varepsilon^2}{2} a(w, w) \]
and
\[ \forall v \in V , \ 0 = b(u + w\varepsilon, v) - g(v) = [b(u, v) - g(v)] + \varepsilon b(w, v) . \]
This suggests that we define
\[ N_U \equiv \{ u \in U : \forall v \in V , \ b(u, v) = 0 \} \]
which represents the tangent plane of the constraint. Then the first-order condition for the minimum is
\[ \forall w \in N_U , \ a(u, w) - f(w) = 0 . \tag{5.15-31} \]

In other words, the linear functional
\[ f_u(w) \equiv f(w) - a(u, w) \tag{5.15-32} \]
which represents the gradient of the objective, annihilates all functions in the tangent plane of the constraint. This suggests that we define
\[ N_U^0 \equiv \{ f \in U' : \forall w \in N_U , \ f(w) = 0 \} \]
and note that $f_u \in N_U^0$. The second-order condition for a unique minimum is
\[ \exists \alpha > 0 \ \forall w \in N_U , \ a(w, w) \ge \alpha \| w \|_U^2 \tag{5.15-33} \]
Note that we do not need to assume that $a$ is coercive on all of $U$.


We will define the Lagrangian for this problem to be $L : U \times V \to R$ where
\[ L(u, v) = \frac12 a(u, u) - f(u) + b(u, v) - g(v) . \]
Note that the first-order conditions for an extremum $(u, v)$ of $L$ are
\begin{align*}
& \forall w \in U , \ 0 = a(u, w) - f(w) + b(w, v) \tag{5.15-34a} \\
& \forall \nu \in V , \ 0 = b(u, \nu) - g(\nu) \tag{5.15-34b}
\end{align*}
The latter of these conditions is the constraint (5.15-30b) for the solution of the constrained minimization problem (5.15-30). The former condition says that the linear functional $f_u$, defined in (5.15-32), is equal to a linear functional of the form $b(\cdot, v)$ for some $v \in V$. This is similar to requiring that the gradient of the objective is in the span of the functions that are orthogonal to the tangent plane of the constraint.

Note that if $(u, v) \in U \times V$ satisfies (5.15-34), so that this pair is an extremum of the Lagrangian, then $u$ solves the minimization problem (5.15-30). Obviously, $u$ satisfies the constraint (5.15-30b). Furthermore, if we choose $w \in N_U$ in equation (5.15-34a), we see immediately that $u$ satisfies the first-order condition for a minimum (5.15-31). The task before us is to show that the Lagrangian extremum problem (5.15-34) has a solution that depends continuously on the data.
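In finite dimensions the extremum conditions (5.15-34) become a symmetric block linear system. The following sketch is an illustration only: it identifies $a(u, w)$ with $w^\top A u$ and $b(u, v)$ with $v^\top B u$ for randomly generated matrices `A` and `B` (assumptions made for this example, not data from the text), solves the block system, and checks the first-order condition (5.15-31) on a basis `Z` of the nullspace $N_U$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2                          # dim U and dim V (illustrative)
R = rng.standard_normal((n, n))
A = R @ R.T + n * np.eye(n)          # a(u, w) = w^T A u, symmetric positive definite
B = rng.standard_normal((m, n))      # b(u, v) = v^T B u
f = rng.standard_normal(n)
g = rng.standard_normal(m)

# Extremum conditions (5.15-34): A u + B^T v = f and B u = g.
K = np.block([[A, B.T], [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([f, g]))
u, v = sol[:n], sol[n:]

# Orthonormal basis of N_U = null(B), taken from the SVD of B.
_, _, Vt = np.linalg.svd(B)
Z = Vt[m:].T

def objective(x):
    return 0.5 * x @ A @ x - f @ x

residual_on_nullspace = Z.T @ (A @ u - f)   # should vanish by (5.15-31)
```

Any perturbation of `u` within the nullspace keeps the constraint satisfied and should not decrease the objective.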

5.15.8 Well-Posedness of the Weak Mixed Problem

The following lemma generalizes the Lax-Milgram Theorem. In other words, it describes the conditions under which the constraint in the minimization problem (5.15-30) has a solution.


Lemma 5.15-4: [8, p. 112], [11, p. 120] Suppose that $U$ and $V$ are Hilbert spaces, and let $(\cdot, \cdot)$ be some other inner product defined for $v \in V$ as its second argument. Let $V'$ be the set of all continuous linear functionals on $V$:
\[ f \in V' \iff \exists C_f > 0 \ \forall v \in V , \ |f(v)| \le C_f \| v \|_V . \]
Let $b : U \times V \to R$ be bilinear, and define the linear transformation $B : U \to V'$ by
\[ \forall v \in V , \ (Bu, v) = b(u, v) . \]
Define the sets
\[ N_V \equiv \{ v \in V : \forall u \in U , \ b(u, v) = 0 \} \]
and
\[ N_V^0 \equiv \{ f \in V' : v \in N_V \Longrightarrow f(v) = 0 \} . \]
Then the following are true:

1. $B$ is continuous, meaning
\[ \exists C_B > 0 \ \forall u \in U , \ \| Bu \|_{V'} \equiv \sup_{v \in V, v \ne 0} \frac{|(Bu, v)|}{\| v \|_V} \le C_B \| u \|_U , \]
if and only if $b$ is continuous, meaning that
\[ \exists C_b > 0 \ \forall u \in U \ \forall v \in V , \ |b(u, v)| \le C_b \| u \|_U \| v \|_V . \tag{5.15-35} \]

2. If $b$ satisfies the inf-sup condition
\[ \exists \beta > 0 \ \forall u \in U , \ \sup_{v \in V, v \ne 0} \frac{b(u, v)}{\| v \|_V} \ge \beta \| u \|_U \tag{5.15-36} \]
then $N(B) = \{0\}$, meaning that
\[ Bu = 0 \Longrightarrow u = 0 , \]
and $B^{-1}$ is continuous, meaning
\[ \exists C_1 > 0 \ \forall u \in U , \ C_1 \| u \|_U \le \| Bu \|_{V'} . \]

3. If $b$ is continuous and $b$ satisfies the inf-sup condition (5.15-36), then $N(B) = \{0\}$ and $R(B) = N_V^0$, meaning that
\[ f \in N_V^0 \Longrightarrow \exists u \in U , \ Bu = f . \]

4. $N(B) = \{0\}$ and $R(B) = V'$ if and only if $b$ is continuous, $b$ satisfies the inf-sup condition (5.15-36) and $R(B)^\perp = \{0\}$, meaning that
\[ \forall 0 \ne v \in V \ \exists u \in U , \ b(u, v) \ne 0 . \tag{5.15-37} \]


Proof: To prove the first claim, note that
\[ \exists C_B > 0 \ \forall u \in U , \ \| Bu \|_{V'} \le C_B \| u \|_U \]
is equivalent to
\[ \exists C_B > 0 \ \forall u \in U , \ \sup_{v \in V, v \ne 0} \frac{|(Bu, v)|}{\| v \|_V} \le C_B \| u \|_U , \]
which through the definition of $B$ is equivalent to
\[ \exists C_B > 0 \ \forall u \in U \ \forall v \in V , \ |b(u, v)| \le C_B \| u \|_U \| v \|_V . \]
To prove the second claim, suppose that $b$ satisfies the inf-sup condition (5.15-36). If $Bu = 0$, then
\[ \beta \| u \|_U \le \sup_{v \in V, v \ne 0} \frac{b(u, v)}{\| v \|_V} = \sup_{v \in V, v \ne 0} \frac{(Bu, v)}{\| v \|_V} = 0 , \]
so $u = 0$. If $Bu \in V'$, then
\[ \beta \| u \|_U \le \sup_{v \in V, v \ne 0} \frac{b(u, v)}{\| v \|_V} = \sup_{v \in V, v \ne 0} \frac{(Bu, v)}{\| v \|_V} = \| Bu \|_{V'} . \]
To prove the third claim, suppose that $b$ satisfies the inf-sup condition (5.15-36) and the continuity condition (5.15-35). Then the range of $B$, $R(B)$, is closed, and the Closed Range Theorem [93, p. 205] shows that $R(B) = N_V^0$.

To prove the final claim, suppose that $b$ is continuous, $b$ satisfies the inf-sup condition and $R(B)^\perp = \{0\}$. Since the Closed Range Theorem shows that $R(B) = (R(B)^\perp)^\perp$, we must have that $R(B) = V'$. The third claim showed that $N(B) = \{0\}$.

To prove the other direction of the final claim, suppose that $N(B) = \{0\}$, $R(B) = V'$, $B$ is continuous and $B^{-1}$ is continuous. Then the first claim shows that $b$ is continuous. The continuity of $B^{-1}$ implies that
\[ \exists C_1 > 0 \ \forall u \in U , \ C_1 \| u \|_U \le \| Bu \|_{V'} = \sup_{v \in V, v \ne 0} \frac{b(u, v)}{\| v \|_V} , \]
so $b$ satisfies the inf-sup condition (5.15-36). Finally, suppose that (5.15-37) is false. Then there exists a nonzero $v \in V$ such that for all $u \in U$ we have $b(u, v) = 0$; in other words, $v \in N_V$. Since $V' = R(B) = N_V^0$, we have that $f(v) = 0$ for all $f \in V'$. Define $f \in V'$ by $f(\cdot) = (\cdot, v)_V$. Then $0 = f(v) = \| v \|_V^2$, so $v = 0$. This gives a contradiction; (5.15-37) must hold. □
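For matrices the constants in the lemma have a concrete meaning. In the sketch below, which is an illustration with Euclidean norms and a random matrix rather than anything taken from the text, $b(u, v) = v^\top B u$, so that $\sup_v b(u, v)/\| v \|$ equals $\| Bu \|$. The continuity constant in (5.15-35) is then the largest singular value of `B`, and the best inf-sup constant in (5.15-36) is the smallest singular value, which is positive exactly when `B` has a trivial nullspace.

```python
import numpy as np

rng = np.random.default_rng(2)
dim_u, dim_v = 3, 5
B = rng.standard_normal((dim_v, dim_u))   # represents B : U -> V'

_, sigma, Vt = np.linalg.svd(B)
C_b, beta = sigma[0], sigma[-1]           # continuity and inf-sup constants

u = rng.standard_normal(dim_u)
sup_over_v = np.linalg.norm(B @ u)        # the supremum is attained at v = B u

u_worst = Vt[-1]                          # unit vector for which the bound is sharp
sharp_value = np.linalg.norm(B @ u_worst)
```

The right singular vector `u_worst` shows that no larger value of $\beta$ would work for this matrix.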


The next lemma discusses conditions under which the constraint in the minimization problem (5.15-30) depends continuously on its data.

Lemma 5.15-5: [11, p. 126] Suppose that $U$ and $V$ are Hilbert spaces, and let $(\cdot, \cdot)$ be some other inner product defined for $v \in V$ as its second argument. Let $U'$ be the set of all continuous linear functionals on $U$, and let $V'$ be the set of all continuous linear functionals on $V$. Let $b : U \times V \to R$ be bilinear and continuous (i.e., $b$ satisfies (5.15-35)). Define the linear transformation $B : U \to V'$ by
\[ \forall v \in V , \ (Bu, v) = b(u, v) \]
and define the linear transformation $B^* : V \to U'$ by
\[ \forall u \in U , \ (u, B^* v) = b(u, v) . \]
Define the sets
\begin{align*}
N_U &\equiv \{ u \in U : \forall v \in V , \ b(u, v) = 0 \} , \\
N_U^0 &\equiv \{ f \in U' : u \in N_U \Longrightarrow f(u) = 0 \} , \\
N_U^\perp &\equiv \{ u \in U : w \in N_U \Longrightarrow (u, w)_U = 0 \} .
\end{align*}
Then the following are equivalent:

1. $b$ satisfies the inf-sup condition
\[ \exists \beta > 0 \ \forall v \in V , \ \sup_{u \in U, u \ne 0} \frac{b(u, v)}{\| u \|_U} \ge \beta \| v \|_V \tag{5.15-38} \]

2. $B^*$ satisfies
\[ \forall f \in N_U^0 \ \exists v \in V , \ B^* v = f \tag{5.15-39} \]
and
\[ \exists \beta > 0 \ \forall v \in V , \ \| B^* v \|_{U'} \ge \beta \| v \|_V ; \tag{5.15-40} \]

3. $B$ satisfies
\[ \forall g \in V' \ \exists u \in N_U^\perp , \ Bu = g \tag{5.15-41} \]
and
\[ \exists \beta > 0 \ \forall u \in N_U^\perp , \ \| Bu \|_{V'} \ge \beta \| u \|_U . \tag{5.15-42} \]

Proof: The fact that the first claim implies the second is the third claim of lemma 5.15-4.

Let us show that the second claim implies the third. Let $w \in N_U^\perp$, and define $f_w \in U'$ by
\[ \forall u \in U , \ f_w(u) = (u, w)_U . \]
If $u \in N_U$, then since $w \in N_U^\perp$ we have that $0 = (u, w)_U = f_w(u)$. This shows that $f_w \in N_U^0$. Since the second claim is assumed to be true, there exists $v_w \in V$ so that $B^* v_w = f_w$, and there exists $\beta > 0$ so that for all $v \in V$ we have $\| B^* v \|_{U'} \ge \beta \| v \|_V$. Thus
\[ \forall u \in U , \ b(u, v_w) = (u, B^* v_w) = f_w(u) = (u, w)_U . \]
It follows that
\begin{align*}
\| w \|_U &= \sup_{u \in U, u \ne 0} \frac{|(u, w)_U|}{\| u \|_U} = \sup_{u \in U, u \ne 0} \frac{|f_w(u)|}{\| u \|_U} = \sup_{u \in U, u \ne 0} \frac{|b(u, v_w)|}{\| u \|_U} \\
&= \sup_{u \in U, u \ne 0} \frac{|(u, B^* v_w)|}{\| u \|_U} = \| B^* v_w \|_{U'} \ge \beta \| v_w \|_V .
\end{align*}
Now we can let $u = w$ and see that $f_w(w) = (w, w)_U = \| w \|_U^2$, so
\[ \sup_{v \in V, v \ne 0} \frac{b(w, v)}{\| v \|_V} \ge \frac{b(w, v_w)}{\| v_w \|_V} = \frac{\| w \|_U^2}{\| v_w \|_V} = \frac{\| w \|_U}{\| v_w \|_V} \| w \|_U \ge \beta \| w \|_U . \]
The third claim now follows from the fourth claim in lemma 5.15-4.

Finally, we will show that the third claim implies the first. Since the third claim is assumed to be true, for any $g \in V'$ there exists $u_g \in N_U^\perp$ so that $B u_g = g$. Then for all $v \in V$, since $V$ is a Hilbert space, the Riesz representation theorem implies that
\begin{align*}
\beta \| v \|_V &= \beta \sup_{g \in V', g \ne 0} \frac{g(v)}{\| g \|_{V'}} = \beta \sup_{u \in N_U^\perp} \frac{(Bu, v)}{\| Bu \|_{V'}} = \beta \sup_{u \in N_U^\perp} \frac{b(u, v)}{\| Bu \|_{V'}} \\
&\le \sup_{u \in N_U^\perp} \frac{b(u, v)}{\| u \|_U} \le \sup_{u \in U} \frac{b(u, v)}{\| u \|_U} .
\end{align*}
□

The next lemma discusses the well-posedness of the constrained minimization problem (5.15-30).

Lemma 5.15-6: [11, p. 127] Suppose that $U$ and $V$ are Hilbert spaces, and let $(\cdot, \cdot)$ be some other inner product defined for $v \in V$ as its second argument. Let $U'$ be the set of all continuous linear functionals on $U$, and let $V'$ be the set of all continuous linear functionals on $V$. Let $f \in U'$ and $g \in V'$. Suppose that $a : U \times U \to R$ is bilinear, symmetric and continuous (i.e., $a$ satisfies (5.15-29)). Let $b : U \times V \to R$ be bilinear and continuous (i.e., $b$ satisfies (5.15-35)). Define the linear transformation $B : U \to V'$ by
\[ \forall v \in V , \ (Bu, v) = b(u, v) \]
and define the linear transformation $B^* : V \to U'$ by
\[ \forall u \in U , \ (u, B^* v) = b(u, v) . \]
Define the sets
\begin{align*}
N_U &\equiv \{ u \in U : \forall v \in V , \ b(u, v) = 0 \} \\
N_U^0 &\equiv \{ f \in U' : u \in N_U \Longrightarrow f(u) = 0 \} \\
N_U^\perp &\equiv \{ u \in U : w \in N_U \Longrightarrow (u, w)_U = 0 \} .
\end{align*}
Finally, suppose that
\[ \exists \alpha > 0 \ \forall u \in N_U , \ a(u, u) \ge \alpha \| u \|_U^2 \tag{5.15-43} \]
and $b$ satisfies the inf-sup condition (5.15-38). Then for all $f \in U'$ and $g \in V'$ there exist $u_{fg} \in U$ and $v_{fg} \in V$ so that
\begin{align*}
& \forall u \in U , \ a(u_{fg}, u) + b(u, v_{fg}) = f(u) \tag{5.15-44a} \\
& \forall v \in V , \ b(u_{fg}, v) = g(v) \tag{5.15-44b}
\end{align*}
Furthermore,
\begin{align*}
\| u_{fg} \|_U &\le \frac{1}{\alpha} \| f \|_{U'} + \frac{C_a + \alpha}{\alpha \beta} \| g \|_{V'} \tag{5.15-45a} \\
\| v_{fg} \|_V &\le \frac{C_a + \alpha}{\alpha \beta} \| f \|_{U'} + \frac{C_a (C_a + \alpha)}{\alpha \beta^2} \| g \|_{V'} \tag{5.15-45b}
\end{align*}

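Proof: Note that our assumptions verify the first statement of lemma 5.15-5. It follows that the third statement of that lemma is true: given $g \in V'$, there exists $u_g \in N_U^\perp$ so that $B u_g = g$. In other words,
\[ \exists u_g \in N_U^\perp \ \forall v \in V , \ g(v) = (B u_g, v) = b(u_g, v) . \]
That lemma also shows that
\[ \beta \| u_g \|_U \le \| B u_g \|_{V'} = \| g \|_{V'} . \]
By the Lax-Milgram theorem 4.4-15, there exists $w_{fg} \in N_U$ so that
\[ \forall u \in N_U , \ a(w_{fg}, u) = f(u) - a(u_g, u) \]
and
\[ \alpha \| w_{fg} \|_U \le \sup_{u \in U, u \ne 0} \frac{|f(u) - a(u_g, u)|}{\| u \|_U} \le \| f \|_{U'} + C_a \| u_g \|_U . \]
Next, define $h \in U'$ by
\[ \forall u \in U , \ h(u) = f(u) - a(u_g + w_{fg}, u) . \]
Note that if $u \in N_U$, then the definition of $w_{fg}$ implies that
\[ h(u) = f(u) - a(u_g + w_{fg}, u) = 0 . \]
It follows that $h \in N_U^0$. The second statement of lemma 5.15-5 implies that there exists $v_h \in V$ so that $B^* v_h = h$ and $\beta \| v_h \|_V \le \| B^* v_h \|_{U'}$. In other words,
\[ \forall u \in U , \ b(u, v_h) = f(u) - a(u_g + w_{fg}, u) \]
and
\[ \beta \| v_h \|_V \le \sup_{u \in U, u \ne 0} \frac{|h(u)|}{\| u \|_U} \le \| f \|_{U'} + C_a \| u_g + w_{fg} \|_U . \]
In summary, we have shown that for any $f \in U'$ and any $g \in V'$ there exist $u_{fg} = u_g + w_{fg} \in N_U^\perp + N_U \subset U$ and $v_h \in V$ so that
\begin{align*}
& \forall u \in U , \ f(u) = a(u_g + w_{fg}, u) + b(u, v_h) \\
& \forall v \in V , \ g(v) = b(u_g, v) = b(u_g + w_{fg}, v)
\end{align*}
We have also shown that
\begin{align*}
\| u_g + w_{fg} \|_U &\le \frac{1}{\alpha} \| f \|_{U'} + \left( 1 + \frac{C_a}{\alpha} \right) \| u_g \|_U \le \frac{1}{\alpha} \| f \|_{U'} + \frac{1}{\beta} \left( 1 + \frac{C_a}{\alpha} \right) \| g \|_{V'} \\
\beta \| v_h \|_V &\le \| f \|_{U'} + C_a ( \| u_g \|_U + \| w_{fg} \|_U ) \\
&\le \left( 1 + \frac{C_a}{\alpha} \right) \| f \|_{U'} + C_a \left( 1 + \frac{C_a}{\alpha} \right) \| u_g \|_U \\
&\le \left( 1 + \frac{C_a}{\alpha} \right) \| f \|_{U'} + \frac{C_a}{\beta} \left( 1 + \frac{C_a}{\alpha} \right) \| g \|_{V'}
\end{align*}
The continuous dependence on the data shows that the solution of the extremum equations is unique. □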
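The stability estimates can be observed numerically on a small saddle-point system. The sketch below is an illustration under assumptions that are not part of the text: Euclidean norms, a random symmetric positive-definite matrix `A` for $a$, and a random full-rank matrix `B` for $b$. It takes $\alpha$ as the smallest eigenvalue of `A` restricted to the nullspace of `B`, $\beta$ as the smallest singular value of `B`, and $C_a$ as the spectral norm of `A`. The bounds are used in the form $\| u \| \le \| f \|/\alpha + (C_a + \alpha) \| g \| /(\alpha\beta)$ and $\| v \| \le (C_a + \alpha) \| f \| /(\alpha\beta) + C_a (C_a + \alpha) \| g \| /(\alpha\beta^2)$, which is what the chain of inequalities $\alpha \| w_{fg} \| \le \| f \| + C_a \| u_g \|$ and $\beta \| u_g \| \le \| g \|$ yields.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 3
R = rng.standard_normal((n, n))
A = R @ R.T + np.eye(n)              # symmetric positive definite
B = rng.standard_normal((m, n))      # full row rank (generically)
f = rng.standard_normal(n)
g = rng.standard_normal(m)

_, s, Vt = np.linalg.svd(B)
beta = s[-1]                                   # inf-sup constant
Z = Vt[m:].T                                   # orthonormal basis of null(B)
alpha = np.linalg.eigvalsh(Z.T @ A @ Z)[0]     # coercivity on the nullspace
C_a = np.linalg.norm(A, 2)                     # continuity constant

K = np.block([[A, B.T], [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([f, g]))
u, v = sol[:n], sol[n:]

nf, ng = np.linalg.norm(f), np.linalg.norm(g)
bound_u = nf / alpha + (C_a + alpha) / (alpha * beta) * ng
bound_v = (C_a + alpha) / (alpha * beta) * nf + C_a * (C_a + alpha) / (alpha * beta**2) * ng
```

The computed norms of `u` and `v` should not exceed `bound_u` and `bound_v`; the bounds are typically far from sharp.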


5.15.9 Galerkin Approximations for the Mixed Problem

Next, let us develop a Galerkin approximation to our constrained minimization problem. Let $U$ and $V$ be two Hilbert spaces, and let $\tilde U \subset U$ and $\tilde V \subset V$ be finite dimensional. Let $a : U \times U \to \mathbb{R}$ be bilinear, symmetric and continuous (i.e., satisfies (5.15-29)). Let $b : U \times V \to \mathbb{R}$ be bilinear and continuous (i.e., satisfies (5.15-35)). Suppose that $f \in U'$ and $g \in V'$ are continuous linear functionals. We want to find $\tilde u \in \tilde U$ to solve

\begin{align}
& \min \tfrac{1}{2} a(\tilde u, \tilde u) - f(\tilde u) \tag{5.15-46a} \\
& \text{subject to } b(\tilde u, \tilde v) = g(\tilde v) \quad \forall \tilde v \in \tilde V \tag{5.15-46b}
\end{align}

We define

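\[ N_{\tilde U} \equiv \{ \tilde u \in \tilde U : \forall \tilde v \in \tilde V , \ b(\tilde u, \tilde v) = 0 \} \]

and note that the first-order condition for the minimum is

\[ \forall \tilde w \in N_{\tilde U} , \quad a(\tilde u, \tilde w) - f(\tilde w) = 0 . \tag{5.15-47} \]

The second-order condition for a unique minimum is

\[ \exists \alpha_{\tilde U} > 0 \ \forall \tilde w \in N_{\tilde U} , \quad a(\tilde w, \tilde w) \ge \alpha_{\tilde U}\|\tilde w\|_U^2 . \tag{5.15-48} \]

Note that it is not necessarily the case that $N_{\tilde U} \subset N_U$; for the time being, we should view this condition separately from (5.15-33).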

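Similarly, the first-order conditions for an extremum $(\tilde u, \tilde v)$ of $L$ are

\begin{align}
\forall \tilde w \in \tilde U , \quad 0 &= a(\tilde u, \tilde w) - f(\tilde w) + b(\tilde w, \tilde v) \tag{5.15-49a} \\
\forall \tilde\nu \in \tilde V , \quad 0 &= b(\tilde u, \tilde\nu) - g(\tilde\nu) \tag{5.15-49b}
\end{align}

Note that if $(\tilde u, \tilde v) \in \tilde U \times \tilde V$ satisfies (5.15-49), then $\tilde u$ solves the minimization problem (5.15-46). The task before us is to show that the Lagrangian extremum problem (5.15-49) has a solution that depends continuously on the data, and to estimate the error in the Galerkin approximation.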

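Lemma 5.15-7: [11, p. 130] Suppose that $U$ and $V$ are Hilbert spaces, and let $\tilde U \subset U$ and $\tilde V \subset V$ be finite dimensional. Let $U'$ be the set of all continuous linear functionals on $U$, and let $V'$ be the set of all continuous linear functionals on $V$. Let $f \in U'$ and $g \in V'$. Suppose that $a : U \times U \to \mathbb{R}$ is bilinear, symmetric and continuous (i.e., $a$ satisfies (5.15-29)). Let $b : U \times V \to \mathbb{R}$ be bilinear and continuous (i.e., $b$ satisfies (5.15-35)). Finally, suppose that both coercivity conditions on $a$, namely (5.15-33) and (5.15-48), are true. Furthermore, suppose that $b$ satisfies the inf-sup condition (5.15-38), and the subspaces $\tilde U$ and $\tilde V$ satisfy the Galerkin inf-sup condition

\[ \exists \beta_{\tilde U,\tilde V} > 0 \ \forall \tilde v \in \tilde V , \quad \sup_{\tilde u \in \tilde U, \tilde u \ne 0} \frac{b(\tilde u, \tilde v)}{\|\tilde u\|_U} \ge \beta_{\tilde U,\tilde V}\|\tilde v\|_V \tag{5.15-50} \]

Then for all $f \in U'$ and $g \in V'$ there exist $\tilde u_{fg} \in \tilde U$ and $\tilde v_{fg} \in \tilde V$ so that

\begin{align}
\forall \tilde u \in \tilde U , \quad a(\tilde u_{fg}, \tilde u) + b(\tilde u, \tilde v_{fg}) &= f(\tilde u) \tag{5.15-51a} \\
\forall \tilde v \in \tilde V , \quad b(\tilde u_{fg}, \tilde v) &= g(\tilde v) \tag{5.15-51b}
\end{align}

Furthermore, the solution depends continuously on the data:

\begin{align}
\|\tilde u_{fg}\|_U &\le \frac{1}{\alpha_{\tilde U}}\|f\|_{U'} + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}}\|g\|_{V'} \tag{5.15-52a} \\
\|\tilde v_{fg}\|_V &\le \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}}\|f\|_{U'} + \frac{C_a(1 + \alpha_{\tilde U})}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}^2}\|g\|_{V'} \tag{5.15-52b}
\end{align}

Finally, if $(u, v)$ denotes the solution of (5.15-44), then the error in the Galerkin approximation satisfies

\begin{align}
\|u - \tilde u_{fg}\|_U &\le \left(1 + \frac{C_a}{\alpha_{\tilde U}} + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}}\frac{C_b}{\beta_{\tilde U,\tilde V}}\right) \inf_{w \in \tilde U}\|u - w\|_U + \frac{C_b}{\alpha_{\tilde U}} \inf_{\nu \in \tilde V}\|v - \nu\|_V \tag{5.15-53a} \\
\|v - \tilde v_{fg}\|_V &\le \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}}\, C_a\, \frac{C_b + \beta_{\tilde U,\tilde V}}{\beta_{\tilde U,\tilde V}} \inf_{w \in \tilde U}\|u - w\|_U + \left(1 + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}}\frac{C_b}{\beta_{\tilde U,\tilde V}}\right) \inf_{\nu \in \tilde V}\|v - \nu\|_V \tag{5.15-53b}
\end{align}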

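Proof: The existence of $\tilde u_{fg} \in \tilde U$ and $\tilde v_{fg} \in \tilde V$ follows from lemma 5.15-6, as does continuous dependence on the data. All that remains is the error estimate (5.15-53).

Let $\tilde u \in \tilde U$ and $\tilde v \in \tilde V$ be arbitrary. We can subtract (5.15-51) from (5.15-44) to get

\begin{align*}
\forall \tilde u \in \tilde U , \quad a(\tilde u, u_{fg} - \tilde u_{fg}) + b(\tilde u, v_{fg} - \tilde v_{fg}) &= 0 \\
\forall \tilde v \in \tilde V , \quad b(u_{fg} - \tilde u_{fg}, \tilde v) &= 0
\end{align*}

It follows that for all $w \in \tilde U$ and all $\nu \in \tilde V$ we have that

\begin{align*}
f_{w,\nu}(\tilde u) &\equiv a(\tilde u, \tilde u_{fg} - w) + b(\tilde u, \tilde v_{fg} - \nu) = a(\tilde u, u_{fg} - w) + b(\tilde u, v_{fg} - \nu) \\
g_{w,\nu}(\tilde v) &\equiv b(\tilde u_{fg} - w, \tilde v) = b(u_{fg} - w, \tilde v)
\end{align*}

are continuous linear functionals on $\tilde U$ and $\tilde V$, respectively. Continuous dependence on the data in lemma 5.15-6 shows that

\begin{align*}
\|\tilde u_{fg} - w\|_U &\le \frac{1}{\alpha_{\tilde U}}\|f_{w,\nu}\|_{\tilde U'} + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}}\|g_{w,\nu}\|_{\tilde V'} \\
&\le \frac{C_a}{\alpha_{\tilde U}}\|u_{fg} - w\|_U + \frac{C_b}{\alpha_{\tilde U}}\|v_{fg} - \nu\|_V + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}} C_b\|u_{fg} - w\|_U
\end{align*}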

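\begin{align*}
\|\tilde v_{fg} - \nu\|_V &\le \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}}\|f_{w,\nu}\|_{\tilde U'} + \frac{C_a(1 + \alpha_{\tilde U})}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}^2}\|g_{w,\nu}\|_{\tilde V'} \\
&\le \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}} C_a\|u_{fg} - w\|_U + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}} C_b\|v_{fg} - \nu\|_V + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}^2} C_a C_b\|u_{fg} - w\|_U
\end{align*}

The triangle inequality now implies that

\begin{align*}
\|u_{fg} - \tilde u_{fg}\|_U &\le \|u_{fg} - w\|_U + \|w - \tilde u_{fg}\|_U \\
&\le \left(1 + \frac{C_a}{\alpha_{\tilde U}} + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}}\frac{C_b}{\beta_{\tilde U,\tilde V}}\right)\|u_{fg} - w\|_U + \frac{C_b}{\alpha_{\tilde U}}\|v_{fg} - \nu\|_V \\
\|v_{fg} - \tilde v_{fg}\|_V &\le \|v_{fg} - \nu\|_V + \|\nu - \tilde v_{fg}\|_V \\
&\le \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}\beta_{\tilde U,\tilde V}} C_a\left(1 + \frac{C_b}{\beta_{\tilde U,\tilde V}}\right)\|u_{fg} - w\|_U + \left(1 + \frac{C_a + \alpha_{\tilde U}}{\alpha_{\tilde U}}\frac{C_b}{\beta_{\tilde U,\tilde V}}\right)\|v_{fg} - \nu\|_V
\end{align*}

$\Box$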

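Corollary 5.15-8: [11, p. 131] Suppose that the hypotheses of lemma 5.15-7 are true. Let

\begin{align*}
N_U &\equiv \{ u \in U : \forall v \in V , \ b(u, v) = 0 \} \\
N_{\tilde U} &\equiv \{ \tilde u \in \tilde U : \forall \tilde v \in \tilde V , \ b(\tilde u, \tilde v) = 0 \}
\end{align*}

and suppose that $N_{\tilde U} \subset N_U$. Then

\[ \|u - \tilde u_{fg}\|_U \le \left(1 + \frac{C_a}{\alpha_{\tilde U}}\right) \inf_{w \in \tilde U}\|u - w\|_U . \]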


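Proof: Suppose that $w \in \tilde U$ satisfies

\[ \forall \tilde v \in \tilde V , \quad b(w, \tilde v) = g(\tilde v) . \]

Since $N_{\tilde U} \subset N_U$, it follows that for all $\tilde w \in N_{\tilde U}$,

\begin{align*}
a(\tilde u_{fg} - w, \tilde w) &= a(\tilde u_{fg}, \tilde w) - a(u, \tilde w) + a(u - w, \tilde w) \\
&= [f(\tilde w) - b(\tilde w, \tilde v_{fg})] - [f(\tilde w) - b(\tilde w, v)] + a(u - w, \tilde w) \\
&= b(\tilde w, v - \tilde v_{fg}) + a(u - w, \tilde w) = a(u - w, \tilde w) .
\end{align*}

Since $a$ is bounded,

\[ \forall \tilde w \in N_{\tilde U} , \quad a(\tilde u_{fg} - w, \tilde w) \le C_a\|u - w\|_U\|\tilde w\|_U . \]

Since

\[ \forall \tilde v \in \tilde V , \quad b(\tilde u_{fg} - w, \tilde v) = g(\tilde v) - g(\tilde v) = 0 , \]

it follows that $\tilde u_{fg} - w \in N_{\tilde U}$. Thus we can take $\tilde w = \tilde u_{fg} - w$ and obtain

\[ \alpha_{\tilde U}\|\tilde u_{fg} - w\|_U^2 \le a(\tilde u_{fg} - w, \tilde u_{fg} - w) \le C_a\|u - w\|_U\|\tilde u_{fg} - w\|_U . \]

We can cancel $\|\tilde u_{fg} - w\|_U$ on both sides to get

\[ \alpha_{\tilde U}\|\tilde u_{fg} - w\|_U \le C_a\|u - w\|_U . \]

The triangle inequality now implies that

\[ \|u - \tilde u_{fg}\|_U \le \|u - w\|_U + \|w - \tilde u_{fg}\|_U \le \|u - w\|_U + \frac{C_a}{\alpha_{\tilde U}}\|u - w\|_U = \left(1 + \frac{C_a}{\alpha_{\tilde U}}\right)\|u - w\|_U . \]

The claim follows by taking the infimum over $w$. $\Box$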

The next lemma shows how we can bound the error in approximating the true solution, which satisfies a given constraint, by a function that satisfies the Galerkin form of the constraint.


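Lemma 5.15-9: [11, p. 130] Suppose that $U$ and $V$ are Hilbert spaces, and let $\tilde U \subset U$ and $\tilde V \subset V$ be finite dimensional. Let $U'$ be the set of all continuous linear functionals on $U$, and let $V'$ be the set of all continuous linear functionals on $V$. Let $b : U \times V \to \mathbb{R}$ be bilinear and continuous (i.e., $b$ satisfies (5.15-35)). Furthermore, suppose that the subspaces $\tilde U$ and $\tilde V$ satisfy the Galerkin inf-sup condition (5.15-50). Let $g \in V'$, and let

\begin{align*}
N_g &= \{ u \in U : \forall v \in V , \ b(u, v) = g(v) \} \\
\tilde N_g &= \{ \tilde u \in \tilde U : \forall \tilde v \in \tilde V , \ b(\tilde u, \tilde v) = g(\tilde v) \}
\end{align*}

Then

\[ \forall u \in N_g , \quad \inf_{\hat u \in \tilde N_g}\|u - \hat u\|_U \le 2\left(1 + \frac{C_b}{\beta_{\tilde U,\tilde V}}\right) \inf_{\tilde u \in \tilde U}\|u - \tilde u\|_U \]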

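Proof: Given $u \in N_g$ and $\tilde u \in \tilde U$, consider the saddle-point problem for $\hat u \in \tilde U$ and $\tilde v \in \tilde V$:

\begin{align*}
\forall \tilde w \in \tilde U , \quad (\hat u - \tilde u, \tilde w)_U + b(\tilde w, \tilde v) &= (u - \tilde u, \tilde w)_U \equiv f_{u-\tilde u}(\tilde w) \\
\forall \tilde\nu \in \tilde V , \quad b(\hat u - \tilde u, \tilde\nu) &= g(\tilde\nu) - b(\tilde u, \tilde\nu) = b(u - \tilde u, \tilde\nu) \equiv g_{u-\tilde u}(\tilde\nu)
\end{align*}

This is a Galerkin saddle-point problem, and its second equation shows that $\hat u \in \tilde N_g$; lemma 5.15-6 shows that the solution depends continuously on the data:

\[ \|\hat u - \tilde u\|_U \le \|f_{u-\tilde u}\|_{U'} + \frac{2}{\beta_{\tilde U,\tilde V}}\|g_{u-\tilde u}\|_{V'} \le \|u - \tilde u\|_U + \frac{2 C_b}{\beta_{\tilde U,\tilde V}}\|u - \tilde u\|_U \]

The triangle inequality implies that

\[ \|u - \hat u\|_U \le \|u - \tilde u\|_U + \|\tilde u - \hat u\|_U \le 2\left(1 + \frac{C_b}{\beta_{\tilde U,\tilde V}}\right)\|u - \tilde u\|_U \]

The claim follows by taking infima of both sides. $\Box$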

The next lemma provides a condition under which we can demonstrate that the finite element subspaces satisfy the inf-sup condition.


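Lemma 5.15-10: (Fortin's Criterion) [11, p. 131] Suppose that $U$ and $V$ are Hilbert spaces, and that $\tilde U \subset U$ and $\tilde V \subset V$ are finite dimensional subspaces. Assume that $b : U \times V \to \mathbb{R}$ is bilinear and satisfies the inf-sup condition (5.15-38). Suppose that there is a projection $P : U \to \tilde U$ such that

\[ \exists c_P > 0 \ \forall u \in U , \quad \|Pu\|_U \le c_P\|u\|_U \]

and

\[ \forall u \in U \ \forall \tilde v \in \tilde V , \quad b(u - Pu, \tilde v) = 0 . \]

Then the subspaces satisfy the inf-sup condition

\[ \forall \tilde v \in \tilde V , \quad \sup_{\tilde u \in \tilde U, \tilde u \ne 0} \frac{b(\tilde u, \tilde v)}{\|\tilde u\|_U} \ge \frac{\beta}{c_P}\|\tilde v\|_V . \]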

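Proof: Since $P\tilde u = \tilde u$ for all $\tilde u \in \tilde U$, for any $\tilde v \in \tilde V$ we have

\[ \beta\|\tilde v\|_V \le \sup_{u \in U, u \ne 0} \frac{b(u, \tilde v)}{\|u\|_U} = \sup_{u \in U, u \ne 0} \frac{b(Pu, \tilde v)}{\|u\|_U} \le c_P \sup_{u \in U, Pu \ne 0} \frac{b(Pu, \tilde v)}{\|Pu\|_U} = c_P \sup_{\tilde u \in \tilde U, \tilde u \ne 0} \frac{b(\tilde u, \tilde v)}{\|\tilde u\|_U} \]

$\Box$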

5.15.10 Raviart-Thomas Spaces

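The ideas in this section are originally due to Raviart and Thomas [68]. We would like to find basis functions for piecewise-linear functions on triangles in order to interpolate normal derivatives at the midpoints of the triangle sides. Let $\mathbf{x}_j$, $1 \le j \le 3$ be the vertices of a triangle $T$, and assume that the nodes are numbered in counter-clockwise orientation around the triangle. Then the length of the side opposite vertex $j$ is $\ell_j$, where

\[ \ell_1 = \|\mathbf{x}_2 - \mathbf{x}_3\| , \quad \ell_2 = \|\mathbf{x}_3 - \mathbf{x}_1\| , \quad \ell_3 = \|\mathbf{x}_1 - \mathbf{x}_2\| , \]

and the area of the triangle is

\[ |T| = \tfrac{1}{2}\det[\mathbf{x}_1 - \mathbf{x}_3, \mathbf{x}_2 - \mathbf{x}_3] = \tfrac{1}{2}\det[\mathbf{x}_3 - \mathbf{x}_2, \mathbf{x}_1 - \mathbf{x}_2] = \tfrac{1}{2}\det[\mathbf{x}_2 - \mathbf{x}_1, \mathbf{x}_3 - \mathbf{x}_1] . \]

The normals to the sides of $T$, oriented outward, times the lengths of the corresponding sides, are

\[ \mathbf{n}_1\ell_1 = Q(\mathbf{x}_3 - \mathbf{x}_2) , \quad \mathbf{n}_2\ell_2 = Q(\mathbf{x}_1 - \mathbf{x}_3) , \quad \mathbf{n}_3\ell_3 = Q(\mathbf{x}_2 - \mathbf{x}_1) , \]

where

\[ Q \equiv \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \]

is the rotation clockwise by ninety degrees.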

Consider the function

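\[ \mathbf{V}_j(\mathbf{x}) = (\mathbf{x} - \mathbf{x}_j)\frac{\ell_j}{2|T|} . \]

Then the normal value of $\mathbf{V}_1$ at the midpoint of the side opposite $\mathbf{x}_1$ is

\begin{align*}
\mathbf{n}_1^\top \mathbf{V}_1\left(\frac{\mathbf{x}_2 + \mathbf{x}_3}{2}\right) &= \frac{1}{4|T|}(\mathbf{x}_3 - \mathbf{x}_2)^\top Q^\top \{(\mathbf{x}_2 - \mathbf{x}_1) + (\mathbf{x}_3 - \mathbf{x}_1)\} \\
&= \frac{1}{4|T|}\{\det[\mathbf{x}_2 - \mathbf{x}_1, \mathbf{x}_3 - \mathbf{x}_2] + \det[\mathbf{x}_3 - \mathbf{x}_1, \mathbf{x}_3 - \mathbf{x}_2]\} \\
&= \frac{1}{4|T|}\{\det[\mathbf{x}_3 - \mathbf{x}_2, \mathbf{x}_1 - \mathbf{x}_2] + \det[\mathbf{x}_1 - \mathbf{x}_3, \mathbf{x}_2 - \mathbf{x}_3]\} = \frac{1}{4|T|}\{2|T| + 2|T|\} = 1 .
\end{align*}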

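Furthermore, the normal values of $\mathbf{V}_1$ at the midpoints of the other two sides are

\[ \mathbf{n}_2^\top \mathbf{V}_1\left(\frac{\mathbf{x}_3 + \mathbf{x}_1}{2}\right) = \mathbf{n}_2^\top(\mathbf{x}_3 - \mathbf{x}_1)\frac{\ell_1}{4|T|} = 0 \]

and

\[ \mathbf{n}_3^\top \mathbf{V}_1\left(\frac{\mathbf{x}_1 + \mathbf{x}_2}{2}\right) = \mathbf{n}_3^\top(\mathbf{x}_2 - \mathbf{x}_1)\frac{\ell_1}{4|T|} = 0 . \]

Similar calculations apply to $\mathbf{V}_2(\mathbf{x})$ and $\mathbf{V}_3(\mathbf{x})$ by re-ordering the indices while preserving the counter-clockwise orientation of the vertices.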
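The interpolation property derived above can be checked numerically. The following is a minimal sketch in Python with NumPy (the function name `rt0_normal_values` and the sample triangle are illustrative choices, not part of the text); it evaluates $\mathbf{n}_i^\top \mathbf{V}_j$ at the midpoint of the side opposite vertex $i$, which should give the identity matrix for a counter-clockwise triangle.

```python
import numpy as np

# Rotation clockwise by ninety degrees, as defined in the text.
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])

def rt0_normal_values(x1, x2, x3):
    """Return E with E[i, j] = n_i . V_j(midpoint of the side opposite vertex i)."""
    X = [np.asarray(p, dtype=float) for p in (x1, x2, x3)]
    # Area formula from the text; positive for counter-clockwise ordering.
    area = 0.5 * np.linalg.det(np.column_stack([X[0] - X[2], X[1] - X[2]]))
    E = np.zeros((3, 3))
    for i in range(3):
        a, b = X[(i + 1) % 3], X[(i + 2) % 3]   # endpoints of the side opposite vertex i
        n_i = Q @ (b - a) / np.linalg.norm(b - a)  # unit outward normal
        mid = 0.5 * (a + b)
        for j in range(3):
            c, d = X[(j + 1) % 3], X[(j + 2) % 3]
            l_j = np.linalg.norm(d - c)            # length of the side opposite vertex j
            V_j = (mid - X[j]) * l_j / (2.0 * area)
            E[i, j] = n_i @ V_j
    return E

E = rt0_normal_values([0.0, 0.0], [2.0, 0.5], [0.5, 1.5])
print(np.round(E, 12))
```

The off-diagonal entries vanish because the vector from vertex $j$ to the midpoint of any side containing vertex $j$ is tangent to that side.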

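In a 3D tetrahedron, the outer normal to the side opposite vertex $\mathbf{x}_1$ times the area of the side is

\[ \mathbf{n}_1 A_1 = -\tfrac{1}{2}(\mathbf{x}_2 - \mathbf{x}_4) \times (\mathbf{x}_3 - \mathbf{x}_4) = -\tfrac{1}{2}(\mathbf{x}_4 - \mathbf{x}_3) \times (\mathbf{x}_2 - \mathbf{x}_3) = -\tfrac{1}{2}(\mathbf{x}_3 - \mathbf{x}_2) \times (\mathbf{x}_4 - \mathbf{x}_2) \]

and the volume of the tetrahedron is

\[ |T| = \tfrac{1}{6}(\mathbf{x}_1 - \mathbf{x}_4) \cdot (\mathbf{x}_2 - \mathbf{x}_4) \times (\mathbf{x}_3 - \mathbf{x}_4) = \tfrac{1}{6}(\mathbf{x}_4 - \mathbf{x}_3) \cdot (\mathbf{x}_1 - \mathbf{x}_3) \times (\mathbf{x}_2 - \mathbf{x}_3) = \tfrac{1}{6}(\mathbf{x}_3 - \mathbf{x}_2) \cdot (\mathbf{x}_4 - \mathbf{x}_2) \times (\mathbf{x}_1 - \mathbf{x}_2) . \]

We assume that the vertices have been ordered so that this volume is positive. Then the function

\[ \mathbf{V}_j(\mathbf{x}) = (\mathbf{x} - \mathbf{x}_j)\frac{A_j}{3|T|} \]

is such that

\begin{align*}
\mathbf{n}_1 \cdot \mathbf{V}_1\left(\tfrac{1}{3}[\mathbf{x}_2 + \mathbf{x}_3 + \mathbf{x}_4]\right) &= -\frac{1}{18|T|}\{(\mathbf{x}_3 - \mathbf{x}_2) \times (\mathbf{x}_4 - \mathbf{x}_2) \cdot (\mathbf{x}_2 - \mathbf{x}_1) + (\mathbf{x}_4 - \mathbf{x}_3) \times (\mathbf{x}_2 - \mathbf{x}_3) \cdot (\mathbf{x}_3 - \mathbf{x}_1) \\
&\qquad\qquad + (\mathbf{x}_2 - \mathbf{x}_4) \times (\mathbf{x}_3 - \mathbf{x}_4) \cdot (\mathbf{x}_4 - \mathbf{x}_1)\} \\
&= \frac{1}{18|T|}\{\det[\mathbf{x}_1 - \mathbf{x}_2, \mathbf{x}_3 - \mathbf{x}_2, \mathbf{x}_4 - \mathbf{x}_2] + \det[\mathbf{x}_1 - \mathbf{x}_3, \mathbf{x}_4 - \mathbf{x}_3, \mathbf{x}_2 - \mathbf{x}_3] \\
&\qquad\qquad + \det[\mathbf{x}_1 - \mathbf{x}_4, \mathbf{x}_2 - \mathbf{x}_4, \mathbf{x}_3 - \mathbf{x}_4]\} \\
&= \frac{1}{18|T|}\{6|T| + 6|T| + 6|T|\} = 1 .
\end{align*}

Similar calculations apply to $\mathbf{V}_2(\mathbf{x})$, $\mathbf{V}_3(\mathbf{x})$ and $\mathbf{V}_4(\mathbf{x})$ by re-ordering the indices while preserving the sign of the volume.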
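The tetrahedral case can be checked the same way. This is a sketch under stated assumptions: the function name `rt0_tet_normal_values` is hypothetical, and instead of permuting indices as the text suggests, the code orients each face normal outward by testing it against the opposite vertex. The sample vertices are ordered so that the volume formula above is positive.

```python
import numpy as np

def rt0_tet_normal_values(x1, x2, x3, x4):
    """Return (E, scaled_normals) with E[i, j] = n_i . V_j(centroid of face i)."""
    X = [np.asarray(p, dtype=float) for p in (x1, x2, x3, x4)]
    # Volume formula from the text (assumed positive for this vertex ordering).
    volume = np.dot(X[0] - X[3], np.cross(X[1] - X[3], X[2] - X[3])) / 6.0
    scaled_normals, centroids = [], []
    for i in range(4):
        face = [X[k] for k in range(4) if k != i]   # face opposite vertex i
        nA = 0.5 * np.cross(face[1] - face[0], face[2] - face[0])
        if np.dot(nA, face[0] - X[i]) < 0.0:        # make the normal point away from vertex i
            nA = -nA
        scaled_normals.append(nA)                   # n_i * A_i
        centroids.append(sum(face) / 3.0)
    E = np.zeros((4, 4))
    for i in range(4):
        n_i = scaled_normals[i] / np.linalg.norm(scaled_normals[i])
        for j in range(4):
            A_j = np.linalg.norm(scaled_normals[j])
            V_j = (centroids[i] - X[j]) * A_j / (3.0 * volume)
            E[i, j] = n_i @ V_j
    return E, scaled_normals

pts = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.2, 0.3, 1.0], [0.0, 0.0, 0.0])
E_tet, normals_tet = rt0_tet_normal_values(*pts)
print(np.round(E_tet, 12))
```

For this ordering, the first scaled normal agrees with the formula for $\mathbf{n}_1 A_1$ given in the text.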

5.15.11 Lowest Order Mixed Finite Element Method

We will develop the lowest-order mixed finite element method for incompressible flow in porous media (see section 5.15.6.2 above). The lowest-order mixed finite element approximates the pressure by piecewise constants. The velocity approximations are such that their divergence is piecewise constant. In one dimension, this means that M consists of all piecewise linear functions. In multiple dimensions, the ith component of a velocity in M is linear in the ith coordinate direction, and constant in the other directions.

We will develop the details of the mixed finite element for these spaces in one dimension. The basis functions for pressure are

\[ P_i(x) = \begin{cases} 1, & x_{i-\frac12} < x < x_{i+\frac12} \\ 0, & \text{otherwise} \end{cases} \]

and the basis functions for velocity are

\[ V_{i+\frac12}(x) = \begin{cases} \dfrac{x - x_{i-\frac12}}{\triangle x_i}, & x_{i-\frac12} < x < x_{i+\frac12} \\[1ex] \dfrac{x_{i+\frac32} - x}{\triangle x_{i+1}}, & x_{i+\frac12} < x < x_{i+\frac32} \\[1ex] 0, & \text{otherwise} \end{cases} \]

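Our finite element equations are

\begin{align*}
\int_\Omega P_i \nabla_x \cdot V \, dx &= \int_{x_{i-\frac12}}^{x_{i+\frac12}} \nabla_x \cdot [V_{i-\frac12}(x) v_{i-\frac12} + V_{i+\frac12}(x) v_{i+\frac12}] \, dx = v_{i+\frac12} - v_{i-\frac12} \\
&= \int_\Omega P_i \omega \, dx = \int_{x_{i-\frac12}}^{x_{i+\frac12}} \omega(x) \, dx = \omega_i \triangle x_i
\end{align*}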


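and

\begin{align*}
\int_\Omega V_{i+\frac12} \frac{\mu}{K} V \, dx &= \int_{x_{i-\frac12}}^{x_{i+\frac12}} V_{i+\frac12} \frac{\mu}{K} [V_{i-\frac12}(x) v_{i-\frac12} + V_{i+\frac12}(x) v_{i+\frac12}] \, dx \\
&\quad + \int_{x_{i+\frac12}}^{x_{i+\frac32}} V_{i+\frac12} \frac{\mu}{K} [V_{i+\frac12}(x) v_{i+\frac12} + V_{i+\frac32}(x) v_{i+\frac32}] \, dx \\
&= \left(\frac{\mu}{K}\right)_i \left[\tfrac16 v_{i-\frac12} + \tfrac13 v_{i+\frac12}\right] \triangle x_i + \left(\frac{\mu}{K}\right)_{i+1} \left[\tfrac13 v_{i+\frac12} + \tfrac16 v_{i+\frac32}\right] \triangle x_{i+1} \\
&= \int_\Omega \nabla_x \cdot V_{i+\frac12} P + V_{i+\frac12} \cdot g\rho \, dx \\
&= \int_{x_{i-\frac12}}^{x_{i+\frac12}} \nabla_x \cdot V_{i+\frac12} \, p_i + V_{i+\frac12} \cdot g\rho \, dx + \int_{x_{i+\frac12}}^{x_{i+\frac32}} \nabla_x \cdot V_{i+\frac12} \, p_{i+1} + V_{i+\frac12} \cdot g\rho \, dx \\
&= p_i - p_{i+1} + \tfrac12 [(g\rho\triangle x)_i + (g\rho\triangle x)_{i+1}]
\end{align*}

In the second finite element equation, we assume that the coefficients $K$, $\mu$, $g$ and $\rho$ are piecewise constant, for simplicity.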

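These finite element equations can be written in the form of a linear system

\[ \begin{bmatrix} M & G \\ G^\top & 0 \end{bmatrix} \begin{bmatrix} v \\ p \end{bmatrix} = \begin{bmatrix} g \\ w \end{bmatrix} \]

where

\[ M = \begin{bmatrix} \ddots & \ddots & \\ \tfrac16 \left(\frac{\mu}{K}\triangle x\right)_i & \tfrac13 \left[\left(\frac{\mu}{K}\triangle x\right)_i + \left(\frac{\mu}{K}\triangle x\right)_{i+1}\right] & \tfrac16 \left(\frac{\mu}{K}\triangle x\right)_{i+1} \\ & \ddots & \ddots \end{bmatrix} \]

is tridiagonal,

\[ G = \begin{bmatrix} \ddots & & \\ & -1 \quad 1 & \\ & & \ddots \end{bmatrix} \]

is bidiagonal and represents a discrete gradient,

\[ g = \begin{bmatrix} \vdots \\ \tfrac12 [(g\rho\triangle x)_i + (g\rho\triangle x)_{i+1}] \\ \vdots \end{bmatrix} \]

represents the gravity terms and

\[ w = \begin{bmatrix} \vdots \\ \omega_i \triangle x_i \\ \vdots \end{bmatrix} \]

represents the source terms due to wells. We have left the details of the boundary conditions to the reader.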

Note that the linear system is symmetric, but not positive-definite. One approach to solving this system is to form the Schur complement of $M$. We use the first block equation to eliminate the velocities; this gives us

\[ G^\top M^{-1} G p = G^\top M^{-1} g - w \]

to solve for the pressures $p$. After we find the cell-centered pressures, we compute the side-centered velocities by

\[ v = M^{-1}(g - G p) . \]

The difficulty with this approach is that $M$ is tridiagonal, so $M^{-1}$ is a full matrix, and $G^\top M^{-1} G$ is also a full matrix. An alternative approach for solving the linear system is to use a penalty method; see [16] for more details. Another approach is to use a hybrid mixed finite element.
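The block elimination described above can be sketched in a few lines of Python with NumPy. The helper names `assemble_mixed_1d` and `solve_by_schur` are hypothetical, the per-cell weight is taken to be $(\mu/K)\triangle x$, and the boundary sides are simply kept as unknowns, which is only one possible treatment of the boundary conditions that the text leaves open. The sketch solves the system as written, with generic right-hand sides, and compares the result with a direct solve of the full block system.

```python
import numpy as np

def assemble_mixed_1d(dx, mu_over_K):
    """Assemble the tridiagonal M and bidiagonal G for n cells and n + 1 cell sides."""
    dx = np.asarray(dx, dtype=float)
    s = np.asarray(mu_over_K, dtype=float) * dx     # (mu / K) * dx in each cell
    n = len(dx)
    M = np.zeros((n + 1, n + 1))
    G = np.zeros((n + 1, n))
    local = np.array([[1.0 / 3.0, 1.0 / 6.0], [1.0 / 6.0, 1.0 / 3.0]])
    for i in range(n):                              # cell i touches sides i and i + 1
        M[i:i + 2, i:i + 2] += s[i] * local
        G[i, i] = 1.0                               # row of side i: +1 for the cell on its right
        G[i + 1, i] = -1.0                          # row of side i + 1: -1 for the cell on its left
    return M, G

def solve_by_schur(M, G, g, w):
    """Eliminate v from [M G; G^T 0][v; p] = [g; w] and solve for p, then v."""
    Minv_G = np.linalg.solve(M, G)
    Minv_g = np.linalg.solve(M, g)
    S = G.T @ Minv_G                                # dense, as noted in the text
    p = np.linalg.solve(S, G.T @ Minv_g - w)
    v = Minv_g - Minv_G @ p
    return v, p

rng = np.random.default_rng(1)
n = 6
M, G = assemble_mixed_1d(rng.uniform(0.5, 1.5, n), rng.uniform(1.0, 3.0, n))
g = rng.standard_normal(n + 1)
w = rng.standard_normal(n)
v, p = solve_by_schur(M, G, g, w)

K = np.block([[M, G], [G.T, np.zeros((n, n))]])
vp = np.linalg.solve(K, np.concatenate([g, w]))
print(np.max(np.abs(np.concatenate([v, p]) - vp)))
```

In practice one would avoid forming the dense matrix `S` for large grids, which is the motivation for the alternatives mentioned in the text.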

5.15.12 Hybrid Mixed Finite Element Method

In contrast, the hybrid mixed method will use continuity of fluid flux to connect the equations on different grid scales. The hybrid mixed finite element method uses basis functions that treat the normal velocity as discontinuous at the cell sides. These unknowns are combined with Lagrange multipliers for a continuity constraint to decouple the mixed method equations between cells, and produce a symmetric positive-definite system of linear equations for the Lagrange multipliers. In this application, the Lagrange multipliers can be identified with fluid pressures at the cell sides.

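5.15.13 Mathematical Formulation of the Hybrid Mixed Finite Element Method

The weak formulation of the hybrid mixed method equations [5, 16, 17, 27, 70] is similar to, but less commonly used for porous flow than, the mixed method [4, 5, 12, 16, 17, 21, 22, 37, 39, 40, 48, 49, 70]. Let the problem domain $\Omega = \cup_{i,j}\Omega_{i,j}$ be a union of rectangles, and let $E$ be the set of sides of these rectangles interior to $\Omega$. We want to find $\mathbf{v} \in \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j})$, $p \in H^0(\Omega_{i,j})$ and $\lambda \in H^0(E)$ so that

\begin{align}
\int_\Omega \mathbf{u} \cdot \mathbf{T}^{-1}\mathbf{v} \, dx - \sum_{\Omega_{i,j}} \left( \int_{\Omega_{i,j}} (\nabla_x \cdot \mathbf{u}) p \, dx - \int_{\partial\Omega_{i,j}} \mathbf{n} \cdot \mathbf{u} \lambda \, ds \right) &= \int_\Omega \mathbf{u} \cdot \mathbf{g}\rho \, dx - \int_{\partial\Omega} (\mathbf{n} \cdot \mathbf{u}) p \, ds \quad \forall \mathbf{u} \in \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j}) , \tag{5.15-54a} \\
\sum_{\Omega_{i,j}} \int_{\Omega_{i,j}} q \nabla_x \cdot \mathbf{v} \, dx &= \int_\Omega q\omega \, dx \quad \forall q \in H^0(\Omega) , \tag{5.15-54b} \\
\sum_{\Omega_{i,j}} \int_{\partial\Omega_{i,j}} \mu \, \mathbf{n} \cdot \mathbf{v} \, ds &= 0 \quad \forall \mu \in H^0(E) . \tag{5.15-54c}
\end{align}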

The first equation (5.15-54a) is Darcy's law with possibly discontinuous velocity and pressure on the subdomains, the second equation (5.15-54b) is the divergence-free condition on the velocity in the subdomains, and the third equation (5.15-54c) requires the normal component of velocity to be continuous across sides of subdomains. The purpose of the Lagrange multipliers is to enforce the continuity of the normal component of velocity. It is easy to see that the Lagrange multipliers satisfy $\lambda = p|_E$ for strong solutions $p$ of the pressure equation.

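In order to construct an approximate solution of problem (5.15-54) we will choose finite dimensional subspaces $V_h \subset \bigoplus_{\Omega_{i,j}} H(\mathrm{div}, \Omega_{i,j})$, $P_h \subset H^0(\Omega)$ and $\Lambda_h \subset H^0(E)$. We want to find $\mathbf{v}_h \in V_h$, $p_h \in P_h$ and $\lambda_h \in \Lambda_h$ such that

\begin{align}
\int_\Omega \mathbf{u} \cdot \mathbf{T}^{-1}\mathbf{v} \, dx - \sum_{\Omega_{i,j}} \left( \int_{\Omega_{i,j}} (\nabla_x \cdot \mathbf{u}) p \, dx - \int_{\partial\Omega_{i,j}} \mathbf{n} \cdot \mathbf{u} \lambda \, ds \right) &= \int_\Omega \mathbf{u} \cdot \mathbf{g}\rho \, dx - \int_{\partial\Omega} \mathbf{n} \cdot \mathbf{u} \, p \, ds \quad \forall \mathbf{u} \in V_h , \tag{5.15-55a} \\
\sum_{\Omega_{i,j}} \int_{\Omega_{i,j}} q \nabla_x \cdot \mathbf{v} \, dx &= \int_\Omega q\omega \, dx \quad \forall q \in P_h , \tag{5.15-55b} \\
\sum_{\Omega_{i,j}} \int_{\partial\Omega_{i,j}} \mu_h \, \mathbf{n} \cdot \mathbf{v} \, ds &= 0 \quad \forall \mu_h \in \Lambda_h . \tag{5.15-55c}
\end{align}

The finite dimensional spaces $V_h$, $P_h$ and $\Lambda_h$ consist of piecewise polynomial approximations chosen to provide good approximations to the solution of the differential equation. We have chosen $P_h$ to consist of functions that are piecewise constant on grid cells, $V_h$ to consist of cell-wise discontinuous functions with ith component linear in the ith coordinate direction and all other components constant, and $\Lambda_h$ to consist of functions that are piecewise constant on cell sides. Because the basis functions in $V_h$ and $P_h$ are discontinuous, equations (5.15-55a) and (5.15-55b), representing Darcy's law and the divergence-free condition on the velocity, decouple cell by cell; see the finite difference form of these equations below in (5.15-58). The remaining equation (5.15-55c) represents conservation of volume flux at cell sides.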

5.15.14 Positive-Definiteness of the Linear System

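By examining the equations (5.15-55), we can see that the system of equations over the entire grid has the form

\[ \begin{bmatrix} T & B & C \\ B^\top & 0 & 0 \\ C^\top & 0 & 0 \end{bmatrix} \begin{bmatrix} v \\ p \\ \lambda \end{bmatrix} = \begin{bmatrix} g \\ w \\ 0 \end{bmatrix} . \tag{5.15-56} \]

Here $C$ represents the equations that enforce continuity of the normal components of the Darcy velocity; together $[B \ C]$ forms a discrete gradient and $-\begin{bmatrix} B^\top \\ C^\top \end{bmatrix}$ forms a discrete divergence.

We will reformulate equations (5.15-55) in finite difference form in section 5.15.16 below. That discussion will show that the equations

\[ \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix} \begin{bmatrix} v \\ p \end{bmatrix} = \begin{bmatrix} g - C\lambda \\ w \end{bmatrix} \]

decouple for each cell. After solving these equations for $v$ and substituting into the third block equation $C^\top v = 0$, we obtain the symmetric system

\[ \begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} C \\ 0 \end{bmatrix} \lambda = \begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} g \\ w \end{bmatrix} \tag{5.15-57} \]

for the Lagrange multipliers. Since $T$ is positive definite, we could factor $T = L L^\top$ and see that

\[ \begin{bmatrix} C^\top & 0 \end{bmatrix} \begin{bmatrix} T & B \\ B^\top & 0 \end{bmatrix}^{-1} \begin{bmatrix} C \\ 0 \end{bmatrix} = C^\top L^{-\top} [I - L^{-1}B(B^\top L^{-\top}L^{-1}B)^{-1}B^\top L^{-\top}] L^{-1} C . \]

The quantity inside the brackets is the orthogonal projection onto the nullspace of $B^\top L^{-\top}$. Thus the matrix in the linear system (5.15-57) is nonnegative. Since the hybrid mixed system of equations has a unique solution, it follows that the linear system for the Lagrange multipliers is positive-definite.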
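The factored form of the multiplier matrix can be checked numerically. The following sketch uses random matrices as stand-ins for $T$, $B$ and $C$ (an assumption made purely for illustration; the sizes and the random data are not taken from any discretization in the text). It compares the matrix obtained by inverting the block system with the Cholesky-based expression, and inspects its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_p, n_l = 8, 3, 4                      # illustrative sizes for v, p and lambda
A = rng.standard_normal((n_v, n_v))
T = A @ A.T + n_v * np.eye(n_v)              # symmetric positive definite stand-in for T
B = rng.standard_normal((n_v, n_p))          # generically of full column rank
C = rng.standard_normal((n_v, n_l))

# Direct form: [C^T 0] [T B; B^T 0]^{-1} [C; 0]
K = np.block([[T, B], [B.T, np.zeros((n_p, n_p))]])
rhs = np.vstack([C, np.zeros((n_p, n_l))])
lhs = np.hstack([C.T, np.zeros((n_l, n_p))])
S_direct = lhs @ np.linalg.solve(K, rhs)

# Factored form: C^T L^{-T} [I - Y (Y^T Y)^{-1} Y^T] L^{-1} C with Y = L^{-1} B
L = np.linalg.cholesky(T)                    # T = L L^T, L lower triangular
Y = np.linalg.solve(L, B)
Z = np.linalg.solve(L, C)
P = np.eye(n_v) - Y @ np.linalg.solve(Y.T @ Y, Y.T)   # orthogonal projection
S_factored = Z.T @ P @ Z

eigenvalues = np.linalg.eigvalsh(0.5 * (S_direct + S_direct.T))
print(np.max(np.abs(S_direct - S_factored)), eigenvalues.min())
```

Because `P` is an orthogonal projection, `S_factored` has the form $Z^\top P^\top P Z$, which makes its nonnegativity evident.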


5.15.15 Convergence Estimates for the Hybrid Mixed Finite Element Method

It is known [5] that if the hybrid mixed method and the mixed method use the same finite element spaces $V_h$ and $P_h$, then the solutions of the mixed method and the hybrid mixed method are identical. As a result, we can refer to the relatively larger body of literature on convergence estimates for mixed methods.

With our choices of finite element spaces for $V_h$, $P_h$ and $\Lambda_h$, and assuming Lipschitz continuity of the transmissibilities, it is known that the velocities, pressures and Lagrange multipliers produced by the hybrid mixed finite element method will be first-order accurate. With random permeabilities (and therefore random transmissibilities), the convergence estimates are unknown (and would no doubt indicate less accuracy). For details regarding the theory of mixed methods, see [5, 16, 17, 70].

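5.15.16 Numerical Implementation of the Hybrid Mixed Finite Element Method

In this section, we will reinterpret the hybrid mixed finite element equations in finite difference form. We will also describe how the discrete equations representing Darcy's law and the divergence-free condition on the velocity decouple cell by cell. We will describe the algorithm in 3D; the 1D and 2D forms should be obvious.

Assuming piecewise constant transmissibilities, gravity and density, we will use exact integration within each cell. In each grid cell $\Omega_{i,j,k}$ we will write

\[ S = \begin{bmatrix} \triangle x_1 & 0 & 0 \\ 0 & \triangle x_2 & 0 \\ 0 & 0 & \triangle x_3 \end{bmatrix} \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ T_{21} & T_{22} & T_{23} \\ T_{31} & T_{32} & T_{33} \end{bmatrix}^{-1} \begin{bmatrix} \triangle x_1 & 0 & 0 \\ 0 & \triangle x_2 & 0 \\ 0 & 0 & \triangle x_3 \end{bmatrix} \frac{1}{\triangle x_1 \triangle x_2 \triangle x_3} \]

and

\[ \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix} = \begin{bmatrix} \triangle x_1 g_1 \\ \triangle x_2 g_2 \\ \triangle x_3 g_3 \end{bmatrix} \frac{\rho}{2} . \]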

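Here $T = K/\mu$ is the transmissibility (permeability divided by viscosity), $g$ is the acceleration due to gravity and $\rho$ is the fluid density. Within each cell we can write Darcy's law (5.15-55a) and the divergence-free condition (5.15-55b) on the velocity field in the form

\[
\begin{bmatrix}
\tfrac13 S_{11} & \tfrac16 S_{11} & \tfrac14 S_{12} & \tfrac14 S_{12} & \tfrac14 S_{13} & \tfrac14 S_{13} & 1 \\
\tfrac16 S_{11} & \tfrac13 S_{11} & \tfrac14 S_{12} & \tfrac14 S_{12} & \tfrac14 S_{13} & \tfrac14 S_{13} & -1 \\
\tfrac14 S_{21} & \tfrac14 S_{21} & \tfrac13 S_{22} & \tfrac16 S_{22} & \tfrac14 S_{23} & \tfrac14 S_{23} & 1 \\
\tfrac14 S_{21} & \tfrac14 S_{21} & \tfrac16 S_{22} & \tfrac13 S_{22} & \tfrac14 S_{23} & \tfrac14 S_{23} & -1 \\
\tfrac14 S_{31} & \tfrac14 S_{31} & \tfrac14 S_{32} & \tfrac14 S_{32} & \tfrac13 S_{33} & \tfrac16 S_{33} & 1 \\
\tfrac14 S_{31} & \tfrac14 S_{31} & \tfrac14 S_{32} & \tfrac14 S_{32} & \tfrac16 S_{33} & \tfrac13 S_{33} & -1 \\
1 & -1 & 1 & -1 & 1 & -1 & -\omega
\end{bmatrix}_{i,j,k}
\begin{bmatrix}
v^{(R)}_{i-\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(L)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(R)}_{i,j-\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(L)}_{i,j+\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(R)}_{i,j,k-\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
v^{(L)}_{i,j,k+\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
p_{i,j,k}
\end{bmatrix}
=
\begin{bmatrix}
\gamma_{1,i,j,k} + p_{i-\frac12,j,k} \\
\gamma_{1,i,j,k} - p_{i+\frac12,j,k} \\
\gamma_{2,i,j,k} + p_{i,j-\frac12,k} \\
\gamma_{2,i,j,k} - p_{i,j+\frac12,k} \\
\gamma_{3,i,j,k} + p_{i,j,k-\frac12} \\
\gamma_{3,i,j,k} - p_{i,j,k+\frac12} \\
-(\omega p_w)_{i,j,k}
\end{bmatrix}
\tag{5.15-58a}
\]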

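in a grid cell containing a pressure-specified well, or

\[
\begin{bmatrix}
\tfrac13 S_{11} & \tfrac16 S_{11} & \tfrac14 S_{12} & \tfrac14 S_{12} & \tfrac14 S_{13} & \tfrac14 S_{13} & 1 \\
\tfrac16 S_{11} & \tfrac13 S_{11} & \tfrac14 S_{12} & \tfrac14 S_{12} & \tfrac14 S_{13} & \tfrac14 S_{13} & -1 \\
\tfrac14 S_{21} & \tfrac14 S_{21} & \tfrac13 S_{22} & \tfrac16 S_{22} & \tfrac14 S_{23} & \tfrac14 S_{23} & 1 \\
\tfrac14 S_{21} & \tfrac14 S_{21} & \tfrac16 S_{22} & \tfrac13 S_{22} & \tfrac14 S_{23} & \tfrac14 S_{23} & -1 \\
\tfrac14 S_{31} & \tfrac14 S_{31} & \tfrac14 S_{32} & \tfrac14 S_{32} & \tfrac13 S_{33} & \tfrac16 S_{33} & 1 \\
\tfrac14 S_{31} & \tfrac14 S_{31} & \tfrac14 S_{32} & \tfrac14 S_{32} & \tfrac16 S_{33} & \tfrac13 S_{33} & -1 \\
1 & -1 & 1 & -1 & 1 & -1 & 0
\end{bmatrix}_{i,j,k}
\begin{bmatrix}
v^{(R)}_{i-\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(L)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(R)}_{i,j-\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(L)}_{i,j+\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(R)}_{i,j,k-\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
v^{(L)}_{i,j,k+\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
p_{i,j,k}
\end{bmatrix}
=
\begin{bmatrix}
\gamma_{1,i,j,k} + p_{i-\frac12,j,k} \\
\gamma_{1,i,j,k} - p_{i+\frac12,j,k} \\
\gamma_{2,i,j,k} + p_{i,j-\frac12,k} \\
\gamma_{2,i,j,k} - p_{i,j+\frac12,k} \\
\gamma_{3,i,j,k} + p_{i,j,k-\frac12} \\
\gamma_{3,i,j,k} - p_{i,j,k+\frac12} \\
-\triangle x_{1,i} \triangle x_{2,j} \triangle x_{3,k} w_{i,j,k}
\end{bmatrix}
\tag{5.15-58b}
\]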

in a grid cell containing a rate-specified well. Here $v^{(L/R)}_{i+\frac12,j,k}$, $v^{(L/R)}_{i,j+\frac12,k}$ and $v^{(L/R)}_{i,j,k+\frac12}$ are the normal components of the Darcy velocity at the cell sides, $p_{i,j,k}$ is the cell pressure, $p_{w,k}$ is the well pressure, $w_k$ is the well rate, and $\omega_{i,j,k}$ is the well productivity index. (The productivity index was described in section ??.) The side pressures $p_{i+\frac12,j,k}$, $p_{i,j+\frac12,k}$ and $p_{i,j,k+\frac12}$ represent the values of the Lagrange multipliers. In addition, we have the volume flux continuity conditions (5.15-55c)

\begin{align}
-v^{(L)}_{i+\frac12,j,k} + v^{(R)}_{i+\frac12,j,k} &= 0 , \tag{5.15-59a} \\
-v^{(L)}_{i,j+\frac12,k} + v^{(R)}_{i,j+\frac12,k} &= 0 , \tag{5.15-59b} \\
-v^{(L)}_{i,j,k+\frac12} + v^{(R)}_{i,j,k+\frac12} &= 0 . \tag{5.15-59c}
\end{align}

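In each grid cell, equation (5.15-58) gives us a $7 \times 7$ linear system relating the volume fluxes and cell pressure to the side pressures. (The system is $3 \times 3$ in 1D, and $5 \times 5$ in 2D.) We invert this system to obtain

\[
\begin{bmatrix}
v^{(R)}_{i-\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(L)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \\
v^{(R)}_{i,j-\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(L)}_{i,j+\frac12,k} \triangle x_{1,i} \triangle x_{3,k} \\
v^{(R)}_{i,j,k-\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
v^{(L)}_{i,j,k+\frac12} \triangle x_{1,i} \triangle x_{2,j} \\
p_{i,j,k}
\end{bmatrix}
=
\begin{bmatrix}
M_1 & M_2 & M_4 & M_7 & M_{11} & M_{16} & m_1 \\
M_2 & M_3 & M_5 & M_8 & M_{12} & M_{17} & m_2 \\
M_4 & M_5 & M_6 & M_9 & M_{13} & M_{18} & m_3 \\
M_7 & M_8 & M_9 & M_{10} & M_{14} & M_{19} & m_4 \\
M_{11} & M_{12} & M_{13} & M_{14} & M_{15} & M_{20} & m_5 \\
M_{16} & M_{17} & M_{18} & M_{19} & M_{20} & M_{21} & m_6 \\
m_1 & m_2 & m_3 & m_4 & m_5 & m_6 & -\mu
\end{bmatrix}
\begin{bmatrix}
(\gamma_1)_{i,j,k} + p_{i-\frac12,j,k} \\
(\gamma_1)_{i,j,k} - p_{i+\frac12,j,k} \\
(\gamma_2)_{i,j,k} + p_{i,j-\frac12,k} \\
(\gamma_2)_{i,j,k} - p_{i,j+\frac12,k} \\
(\gamma_3)_{i,j,k} + p_{i,j,k-\frac12} \\
(\gamma_3)_{i,j,k} - p_{i,j,k+\frac12} \\
-\zeta
\end{bmatrix} .
\tag{5.15-60}
\]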

Afterward, we can use the volume flux continuity conditions (5.15-59) to obtain a block tridiagonal system of linear equations for the side pressures.

After solving the pressure equation, we use the inverted system (5.15-60) to obtain values $v^{(L/R)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k}$, $v^{(L/R)}_{i,j+\frac12,k} \triangle x_{1,i} \triangle x_{3,k}$ and $v^{(L/R)}_{i,j,k+\frac12} \triangle x_{1,i} \triangle x_{2,j}$ for the volume fluxes at the cell sides. If the residual in the linear system (5.15-59) is small, then the pressure will have been determined so that the left and right values of these quantities at each cell side are very nearly equal. Then we compute the time integral of the volume flux by averaging:

\[ V^{n+\frac12}_{i+\frac12,j,k} = \frac{\triangle t^{n+\frac12}}{2} \left( v^{(L)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} + v^{(R)}_{i+\frac12,j,k} \triangle x_{2,j} \triangle x_{3,k} \right) ; \tag{5.15-61} \]

$V^{n+\frac12}_{i,j+\frac12,k}$ and $V^{n+\frac12}_{i,j,k+\frac12}$ are computed similarly.

5.15.17 Comments on the Hybrid Mixed Finite Element Method

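In 1D, the result of substituting the volume fluxes as functions of the side pressures, analogous to (5.15-58), into the equation for continuity of the volume fluxes, analogous to (5.15-59), is a tridiagonal system of linear equations:

\begin{align}
0 = -v^{(L)}_{i+\frac12} + v^{(R)}_{i+\frac12} &= \begin{bmatrix} -(M_2)_i , & (M_3)_i + (M_1)_{i+1} , & -(M_2)_{i+1} \end{bmatrix} \begin{bmatrix} p_{i-\frac12} \\ p_{i+\frac12} \\ p_{i+\frac32} \end{bmatrix} \nonumber \\
&\quad - (m_2\zeta)_i + (m_1\zeta)_{i+1} - [(M_2 + M_3)\gamma]_i + [(M_1 + M_2)\gamma]_{i+1} . \tag{5.15-62}
\end{align}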


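In 2D, the resulting linear system

\begin{align}
0 = -v^{(L)}_{i+\frac12,j} + v^{(R)}_{i+\frac12,j} &= \begin{bmatrix} -(M_2)_{i,j} , & (M_3)_{i,j} + (M_1)_{i+1,j} , & -(M_2)_{i+1,j} \end{bmatrix} \begin{bmatrix} p_{i-\frac12,j} \\ p_{i+\frac12,j} \\ p_{i+\frac32,j} \end{bmatrix} \nonumber \\
&\quad + \begin{bmatrix} -(M_5)_{i,j} , & (M_4)_{i+1,j} \end{bmatrix} \begin{bmatrix} p_{i,j-\frac12} \\ p_{i+1,j-\frac12} \end{bmatrix} + \begin{bmatrix} (M_8)_{i,j} , & -(M_7)_{i+1,j} \end{bmatrix} \begin{bmatrix} p_{i,j+\frac12} \\ p_{i+1,j+\frac12} \end{bmatrix} \nonumber \\
&\quad - (m_2\zeta)_{i,j} + (m_1\zeta)_{i+1,j} - [(M_2 + M_3)\gamma_1 + (M_5 + M_8)\gamma_2]_{i,j} + [(M_1 + M_2)\gamma_1 + (M_4 + M_7)\gamma_2]_{i+1,j} \tag{5.15-63a}
\end{align}

and

\begin{align}
0 = -v^{(L)}_{i,j+\frac12} + v^{(R)}_{i,j+\frac12} &= \begin{bmatrix} -(M_9)_{i,j} , & (M_{10})_{i,j} + (M_6)_{i,j+1} , & -(M_9)_{i,j+1} \end{bmatrix} \begin{bmatrix} p_{i,j-\frac12} \\ p_{i,j+\frac12} \\ p_{i,j+\frac32} \end{bmatrix} \nonumber \\
&\quad + \begin{bmatrix} -M_7 , & M_8 \end{bmatrix}_{i,j} \begin{bmatrix} p_{i-\frac12,j} \\ p_{i+\frac12,j} \end{bmatrix} + \begin{bmatrix} M_4 , & -M_5 \end{bmatrix}_{i,j+1} \begin{bmatrix} p_{i-\frac12,j+1} \\ p_{i+\frac12,j+1} \end{bmatrix} \nonumber \\
&\quad - (m_4\zeta)_{i,j} + (m_3\zeta)_{i,j+1} - [(M_7 + M_8)\gamma_1 + (M_9 + M_{10})\gamma_2]_{i,j} + [(M_4 + M_5)\gamma_1 + (M_6 + M_9)\gamma_2]_{i,j+1} \tag{5.15-63b}
\end{align}

can be arranged to be block-tridiagonal with blocks roughly of size $2n_1$, where $n_1$ is the number of cells in the first coordinate direction. The block structure of the 3D system is obviously more complicated.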

Both the mixed method and the hybrid mixed method involve harmonic averaging of the permeability. In particular, if the permeability in some cell is zero, then the hybrid mixed method will produce zero normal velocities associated with the interior sides of that cell, and continuity of the normal velocity will require that there is no flow across these sides. This feature can be very useful in practical situations, but requires some numerical care to avoid division by zero.

One drawback of the hybrid mixed method is that the linear system involves an average of $d$ unknowns per cell, where $d$ is the number of spatial dimensions. Thus this system is larger than the usual system for cell-centered pressure in standard petroleum simulation, such as block-centered finite differences [65]. Because the pressure unknowns are associated with cell sides, the stencil is non-standard; in particular, we cannot use readily available incomplete factorizations for the linear system. Further, because the piecewise polynomial spaces for the Lagrange multipliers are not nested between levels of refinement, we cannot use much of the available literature for developing multigrid iterative techniques for this pressure equation.

We also note that the hybrid mixed method has some resemblance to finite volume methods [36, 92] that have been suggested for use in flow in porous media.

Chapter 6

Finite Element Methods for Parabolic Equations

6.1 Well-Posedness of Parabolic Problems

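Let us consider the following problem for the unknown function $u(\mathbf{x}, t)$:

\begin{align}
\frac{\partial u}{\partial t} - \nabla_x \cdot \mathbf{A}\nabla_x u + \mathbf{b} \cdot \nabla_x u + \gamma u &= f , && \forall \mathbf{x} \in \Omega \ \forall 0 < t \le T \tag{6.1-1a} \\
\mathbf{n} \cdot \mathbf{A}\nabla_x u &= g_1 && \forall \mathbf{x} \in S \subset \partial\Omega \ \forall 0 < t \le T \tag{6.1-1b} \\
u &= g_0 && \forall \mathbf{x} \in \partial\Omega \setminus S \ \forall 0 < t \le T \tag{6.1-1c} \\
u &= u_0 && \forall \mathbf{x} \in \Omega , \ t = 0 . \tag{6.1-1d}
\end{align}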

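We will let $W^1_e(\Omega)$ represent the completion, in the usual Sobolev norm $\|\cdot\|_1$, of $C^\infty$ functions $\phi(\mathbf{x})$ with $\phi(\mathbf{x}) = 0$ on $\partial\Omega \setminus S$. Then for any $w(\mathbf{x}) \in W^1_e(\Omega)$,

\begin{align*}
& \int_\Omega w \frac{\partial u}{\partial t} - w \nabla_x \cdot \mathbf{A}\nabla_x u + w \mathbf{b} \cdot \nabla_x u + w\gamma u \, dx \\
&= \int_\Omega w \frac{\partial u}{\partial t} \, dx + \int_\Omega \nabla_x w \cdot \mathbf{A}\nabla_x u + w \mathbf{b} \cdot \nabla_x u + w\gamma u \, dx - \int_{\partial\Omega} w \, \mathbf{n} \cdot \mathbf{A}\nabla_x u \, ds \\
&= \int_\Omega w \frac{\partial u}{\partial t} \, dx + \int_\Omega \nabla_x w \cdot \mathbf{A}\nabla_x u + w \mathbf{b} \cdot \nabla_x u + w\gamma u \, dx - \int_S w g_1 \, ds \\
&= \int_\Omega w f \, dx .
\end{align*}

This suggests that we define the bilinear form

\[ \mathcal{B}(w, u) \equiv \int_\Omega \nabla_x w \cdot \mathbf{A}\nabla_x u + w \mathbf{b} \cdot \nabla_x u + w\gamma u \, dx \tag{6.1-2} \]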


and the linear functional

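\[ \lambda(w) \equiv \int_\Omega w f \, dx + \int_S w g_1 \, ds \tag{6.1-3} \]

We assume that we can find $u_{ess}(\mathbf{x}, t) \in W^1(\Omega)$ for all $0 < t \le T$ so that $u_{ess}(\mathbf{x}, t) = g_0(\mathbf{x}, t)$ for all $\mathbf{x} \in \partial\Omega \setminus S$ and all $0 < t \le T$. The weak form of our problem is to find $u(\mathbf{x}, t)$ so that $u - u_{ess} \in W^1_e(\Omega)$ for all $0 < t \le T$,

\begin{align}
\left(w, \frac{\partial u}{\partial t}\right) + \mathcal{B}(w, u) &= \lambda(w) && \forall w \in W^1_e(\Omega) , \tag{6.1-4a} \\
(w, u) &= (w, u_0) && \forall w \in W^1_e(\Omega) , \ t = 0 . \tag{6.1-4b}
\end{align}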

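For simplicity, we will assume that $g_0 = 0$, and therefore $u_{ess} = 0$, in the discussion to follow.

We assume that $\mathbf{A}(\mathbf{x}, t)$ is symmetric positive definite. We also assume that the coefficients $\mathbf{A}(\mathbf{x}, t)$, $\mathbf{b}(\mathbf{x}, t)$ and $\gamma(\mathbf{x}, t)$ are uniformly bounded in $\mathbf{x}$ and $t$, so that

\[ \exists c > 0 \ \forall 0 < t \le T \ \forall u, w \in W^1(\Omega) , \quad |\mathcal{B}(w, u)| \equiv \left| \int_\Omega \nabla_x w \cdot \mathbf{A}\nabla_x u + w \mathbf{b} \cdot \nabla_x u + w\gamma u \, dx \right| \le c\|w\|_1\|u\|_1 . \tag{6.1-5} \]

Further, we assume that Garding's inequality (4.5-1) is satisfied:

\[ \exists c_1 > 0 \ \exists c_0 \in \mathbb{R} \ \forall w \in W^1_e(\Omega) , \quad \mathcal{B}(w, w) \ge c_1\|w\|_1^2 - c_0\|w\|_0^2 . \tag{6.1-6} \]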

6.1.1 Existence and Uniqueness of Generalized Solutions

In the special case where S = ∅, so that there are no Neumann boundary conditions,we have the following existence result.


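Theorem 6.1-1: ([42], Theorem II.9.1) Suppose that

1. If we rewrite

\[ -\nabla_x \cdot \mathbf{A}\nabla_x u + \mathbf{b} \cdot \nabla_x u + \gamma u = \sum_{|\alpha| \le 2} d_\alpha D^\alpha u , \tag{6.1-7} \]

then the coefficients $d_\alpha$ are continuous in $\Omega \times [0, T]$ and for all $\alpha$ there exist constants $c > 0$ and $0 < \beta \le 1$ such that for all $\mathbf{x} \in \Omega$ and for all $0 \le t, t' \le 1$,

\[ |d_\alpha(\mathbf{x}, t) - d_\alpha(\mathbf{x}, t')| \le c|t - t'|^\beta . \tag{6.1-8} \]

2. $\partial\Omega$ is of class $C^2$.

3. $f$ is continuous, and there exist $c > 0$ and $0 < \beta \le 1$ such that for all $0 \le t, t' \le T$,

\[ \|f(\cdot, t) - f(\cdot, t')\|_0 \le c|t - t'|^\beta . \tag{6.1-9} \]

Then our problem has a unique generalized solution, meaning that

1. $\frac{\partial u}{\partial t}$ exists and is continuous for $\mathbf{x} \in \Omega$ and $0 < t \le T$,

2. $u$ is continuous for $\mathbf{x} \in \Omega$ and $0 < t \le T$, and

3. $u$ solves the weak form of the problem in the following sense:

\begin{align}
\left(w, \frac{\partial u}{\partial t}\right) + \mathcal{B}(w, u) &= (w, f) && \forall w \in W^1_0(\Omega) \ \forall 0 < t \le T \tag{6.1-10} \\
(w, u(\cdot, 0) - u_0) &= 0 && \forall w \in W^1_0(\Omega) \tag{6.1-11}
\end{align}

Theorem 6.1-2: ([42], Theorem II.9.2) Suppose that the hypotheses of the previous theorem hold. In addition, assume that

1. $f$ has $k$ continuous derivatives in $t$ for all $0 \le t \le T$;

2. there exist $c > 0$ and $0 < \beta \le 1$ such that for all $0 \le t, t' \le 1$,

\[ \left\| \frac{\partial^k d_\alpha}{\partial t^k}(\mathbf{x}, t) - \frac{\partial^k d_\alpha}{\partial t^k}(\mathbf{x}, t') \right\|_0 \le c|t - t'|^\beta . \tag{6.1-12} \]

Then for all $\varepsilon > 0$, $u$ has $k + 1$ $t$-derivatives in each interval $\varepsilon < t \le T$.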


6.1.2 Continuous Dependence on the Data

For more details on the results of this section, see [34].

We could study the dependence of the solution of our parabolic equation on its data through the fundamental solution of the partial differential equation. Instead, we will use energy inequalities to describe this dependence.

We will need the following basic result.

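Lemma 6.1-3: (Gronwall's Inequality) Suppose that $\theta(t)$ satisfies the differential inequality

\[ \frac{d\theta}{dt} \le \alpha\theta + \beta(t) , \]

where $\alpha$ is constant. Then

\[ \theta(t) \le e^{\alpha t} \left[ \theta(0) + \int_0^t e^{-\alpha s}\beta(s) \, ds \right] . \tag{6.1-13} \]

Proof: If we multiply the inequality for $\theta$ by $e^{-\alpha t}$, we get

\[ \frac{d}{dt}(e^{-\alpha t}\theta) \le e^{-\alpha t}\beta(t) . \tag{6.1-14} \]

Integrating in $t$, we obtain

\[ e^{-\alpha t}\theta(t) - \theta(0) \le \int_0^t e^{-\alpha s}\beta(s) \, ds . \tag{6.1-15} \]

If we solve this inequality for $\theta(t)$, we obtain the desired result. $\Box$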
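Gronwall's inequality can be illustrated with a small numerical experiment. In the sketch below, the function $\theta(t) = 1 + \tfrac12 \sin t$, the constant $\alpha = 1$ and the slack $0.3$ are arbitrary illustrative choices: $\beta$ is constructed so that $\theta' = \alpha\theta + \beta - 0.3$, and the integral in (6.1-13) is approximated with the trapezoidal rule.

```python
import numpy as np

alpha = 1.0
t = np.linspace(0.0, 2.0, 2001)
theta = 1.0 + 0.5 * np.sin(t)                      # chosen test function
# Choose beta so that theta' = alpha * theta + beta - 0.3 <= alpha * theta + beta.
beta = 0.5 * np.cos(t) - alpha * theta + 0.3

# Cumulative trapezoidal rule for the integral of exp(-alpha s) beta(s) over [0, t].
integrand = np.exp(-alpha * t) * beta
increments = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)
cumulative = np.concatenate([[0.0], np.cumsum(increments)])

bound = np.exp(alpha * t) * (theta[0] + cumulative)   # right-hand side of (6.1-13)
gap = bound - theta
print(gap.min(), gap.max())
```

For this particular choice the gap between the bound and $\theta$ is $0.3(e^{\alpha t} - 1)$, up to quadrature error, so the bound is sharp only at $t = 0$.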

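Next, we will estimate the spatial derivatives of the solution. Let $w(\mathbf{x}) = u(\mathbf{x}, t)$ for each $0 < t \le T$. Then our weak equation (6.1-4a) implies that

\[ \frac{1}{2}\frac{d}{dt}\|u\|_0^2 + \mathcal{B}(u, u) = \lambda(u) . \]

It follows from Garding's inequality (6.1-6) that

\[ \frac{1}{2}\frac{d}{dt}\|u\|_0^2 + c_1\|u\|_1^2 = -\mathcal{B}(u, u) + c_1\|u\|_1^2 + \lambda(u) \le c_0\|u\|_0^2 + \|\lambda\|_{-1}\|u\|_1 . \]

It is useful to recall that

\[ \|\lambda\|_{-1} \equiv \sup_{w \in W^1(\Omega)} \frac{|\lambda(w)|}{\|w\|_1} = \sup_{w \in W^1(\Omega)} \frac{\left| \int_\Omega w f \, dx + \int_S w g_1 \, ds \right|}{\|w\|_1} \le \|f\|_{-1} + \|g_1\|_{-\frac12, S} . \]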


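Since $ab \le \frac12(a^2/\varepsilon^2 + b^2\varepsilon^2)$ for all $a$, $b$ and $\varepsilon \ne 0$, we take $\varepsilon^2 = c_1$ and find that

\[ \frac{d}{dt}\|u\|_0^2 \le 2c_0\|u\|_0^2 + \frac{1}{c_1}\|\lambda\|_{-1}^2 - c_1\|u\|_1^2 . \]

Then Gronwall's inequality (6.1-13) implies that

\[ \|u(\cdot, t)\|_0^2 \le e^{2c_0 t} \left[ \|u_0\|_0^2 + \int_0^t e^{-2c_0 s} \left( \frac{1}{c_1}\|\lambda\|_{-1}^2 - c_1\|u\|_1^2 \right) ds \right] . \]

It follows that for all $0 \le t \le T$,

\begin{align*}
\|u(\cdot, t)\|_0^2 + c_1 \int_0^t \|u(\cdot, s)\|_1^2 \, ds &\le \|u(\cdot, t)\|_0^2 + c_1 \int_0^t e^{2c_0(t - s)}\|u(\cdot, s)\|_1^2 \, ds \\
&\le e^{2c_0 t} \left[ \|u_0\|_0^2 + \frac{1}{c_1} \int_0^t e^{-2c_0 s}\|\lambda(\cdot, s)\|_{-1}^2 \, ds \right] .
\end{align*}

Since $\sqrt{a^2 + b^2} \le |a| + |b|$ for all $a$ and $b$, we have that

\[ \|u\|_{L^\infty(H^0)} \equiv \max_{0 \le t \le T}\|u(\cdot, t)\|_0 \le e^{c_0 T} \left[ \|u_0\|_0 + \frac{1}{\sqrt{c_1}} \sqrt{\int_0^T \|\lambda(\cdot, s)\|_{-1}^2 \, ds} \right] = e^{c_0 T} \left[ \|u_0\|_0 + \frac{1}{\sqrt{c_1}}\|\lambda\|_{L^2(H^{-1})} \right] \]

and

\[ \|u\|_{L^2(H^1)} \equiv \sqrt{\int_0^T \|u(\cdot, s)\|_1^2 \, ds} \le \frac{e^{c_0 T}}{\sqrt{c_1}} \left[ \|u_0\|_0 + \frac{1}{\sqrt{c_1}} \sqrt{\int_0^T \|\lambda(\cdot, s)\|_{-1}^2 \, ds} \right] = \frac{e^{c_0 T}}{\sqrt{c_1}} \left[ \|u_0\|_0 + \frac{1}{\sqrt{c_1}}\|\lambda\|_{L^2(H^{-1})} \right] . \]

Our next task is to estimate the time derivative of the solution. Let $w(\mathbf{x}) \in W^1_e(\Omega)$. Since the weak form of our partial differential equation is

\[ \left(w, \frac{\partial u}{\partial t}\right) = -\mathcal{B}(w, u) + \lambda(w) \]

we have that

\[ \left\| \frac{\partial u}{\partial t} \right\|_{-1} = \sup_{w \in W^1_e(\Omega), w \ne 0} \frac{\left| \left(w, \frac{\partial u}{\partial t}\right) \right|}{\|w\|_1} \le c\|u\|_1 + \|\lambda\|_{-1} . \]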


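If we square both sides, use the inequality $(a + b)^2 \le 2(a^2 + b^2)$ and integrate in time, we obtain

\begin{align*}
\int_0^t \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_{-1}^2 ds &\le 2c^2 \int_0^t \|u(\cdot, s)\|_1^2 \, ds + 2 \int_0^t \|\lambda(\cdot, s)\|_{-1}^2 \, ds \\
&\le \frac{2c^2}{c_1} e^{2c_0 t}\|u_0\|_0^2 + 2\left( \frac{c^2}{c_1^2} e^{2c_0 t} + 1 \right) \int_0^t \|\lambda\|_{-1}^2 \, ds .
\end{align*}

Taking the square root of both sides, and using the inequality $\sqrt{a^2 + b^2} \le |a| + |b|$, we obtain

\[ \left\| \frac{\partial u}{\partial t} \right\|_{L^2(H^{-1})} \le \frac{\sqrt{2}\, c}{\sqrt{c_1}} e^{c_0 T} \left[ \|u_0\|_0 + \sqrt{2\left( \frac{c^2}{c_1^2} e^{2c_0 t} + 1 \right)}\, \|\lambda\|_{L^2(H^{-1})} \right] . \tag{6.1-16} \]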

6.2 Galerkin Methods for Parabolic Problems

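Once again, we consider the parabolic problem

\begin{align}
\frac{\partial u}{\partial t} - \nabla_x \cdot \mathbf{A}\nabla_x u + \mathbf{b} \cdot \nabla_x u + \gamma u &= f , && \forall \mathbf{x} \in \Omega \ \forall 0 < t \le T \tag{6.2-1a} \\
\mathbf{n} \cdot \mathbf{A}\nabla_x u &= g_1 && \forall \mathbf{x} \in S \subset \partial\Omega \ \forall 0 < t \le T \tag{6.2-1b} \\
u &= g_0 && \forall \mathbf{x} \in \partial\Omega \setminus S \ \forall 0 < t \le T \tag{6.2-1c} \\
u &= u_0 && \forall \mathbf{x} \in \Omega , \ t = 0 , \tag{6.2-1d}
\end{align}

with the equivalent weak formulation

\begin{align}
\frac{d}{dt}(w, u) + \mathcal{B}(w, u) &= \lambda(w) && \forall w \in W^1_e(\Omega) \ \forall 0 < t \le T , \tag{6.2-2a} \\
(w, u(\cdot, 0)) &= (w, u_0) && \forall w \in W^0(\Omega) \tag{6.2-2b}
\end{align}

We remark that if $u_0 \in W^1(\Omega)$, we could require that the initial data satisfy

\[ \mathcal{B}(w, u(\cdot, 0)) + c_0(w, u(\cdot, 0)) = \mathcal{B}(w, u_0) + c_0(w, u_0) \tag{6.2-3} \]

instead of using the $L^2$ product.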

6.2.1 Continuous-Time Galerkin Method

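Let $\mathcal{M}$ be a finite-dimensional subspace of $W^1_e(\Omega)$. For each $0 < t \le T$ we will define $U(\mathbf{x}, t) \in \mathcal{M}$ by

\[ \left(W, \frac{dU}{dt}\right) + \mathcal{B}(W, U) = \lambda(W) \quad \forall W \in \mathcal{M} \ \forall 0 < t \le T , \tag{6.2-4} \]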


with initial data given either by the L2 projection

(W,u0 − U(·, 0)) = 0 ∀W ∈M (6.2-5)

or by the elliptic projection

B(W,u0 − U(·, 0)) + c0(W,u0 − U(·, 0)) = 0 ∀W ∈M . (6.2-6)

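Let us rewrite these equations using linear algebra. If $\{V_j\}_{j=1}^m$ is a basis for $\mathcal{M}$, then for some unknown coefficients $\upsilon_j(t)$ we have that

\[ U(\mathbf{x}, t) = \sum_{j=1}^m V_j(\mathbf{x})\upsilon_j(t) . \tag{6.2-7} \]

The parabolic partial differential equation for $u$ leads to the system of ordinary differential equations

\[ \mathbf{M}\frac{d\mathbf{u}}{dt} + \mathbf{B}\mathbf{u} = \mathbf{f} \tag{6.2-8} \]

where

\begin{align}
\mathbf{u} &= [\upsilon_j] \tag{6.2-9} \\
\mathbf{M} &= [(V_i, V_j)] \tag{6.2-10} \\
\mathbf{B} &= [\mathcal{B}(V_i, V_j)] \tag{6.2-11} \\
\mathbf{f} &= \left[ \lambda(V_i) = (V_i, f) + \int_S V_i g_1 \, ds \right] \tag{6.2-12}
\end{align}

For our initial conditions, we require either

\[ \mathbf{M}\mathbf{u} = [(V_i, u_0)] \tag{6.2-13} \]

or

\[ (\mathbf{B} + \mathbf{M}c_0)\mathbf{u} = [\mathcal{B}(V_i, u_0) + c_0(V_i, u_0)] \tag{6.2-14} \]

In either case, the Galerkin equations lead to an initial value problem for a system of ordinary differential equations in $\mathbf{u}$.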

6.2.2 Existence of the Continuous-Time Galerkin Approximation

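It is easy to see that the mass matrix $\mathbf{M}$ is nonsingular, since it is the Gram matrix for the $L^2$ inner product of the basis functions for $\mathcal{M}$. Thus

$$
\frac{d\mathbf{u}}{dt} + \mathbf{M}^{-1} \mathbf{B} \mathbf{u} = \mathbf{M}^{-1} \mathbf{f} .
$$

We can multiply by the integrating factor $\exp\left( \int_0^t \mathbf{M}^{-1} \mathbf{B}\, ds \right)$, integrate in $t$ and solve for $\mathbf{u}$ to get

$$
\mathbf{u}(t) = e^{-\int_0^t \mathbf{M}^{-1} \mathbf{B}\, ds} \left[ \mathbf{u}(0) + \int_0^t e^{\int_0^s \mathbf{M}^{-1} \mathbf{B}\, dr}\, \mathbf{M}^{-1} \mathbf{f}\, ds \right] . \tag{6.2-15}
$$

This formula is valid as written when $\mathbf{M}^{-1} \mathbf{B}$ does not depend on $t$, or more generally when it commutes with its own time integral. Time discretizations of (6.2-8) lead to various approximations to the matrix exponentials in this equation.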
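To make the last remark concrete, here is a short worked case. It assumes, beyond what the text requires, that $\mathbf{M}$ and $\mathbf{B}$ are constant in time and that $\mathbf{f} = 0$; the symbols $t^{n}$, $\Delta t$ and $\mathbf{u}^{n}$ are introduced here only for illustration. The exact solution advances by one matrix exponential per step, and one backward Euler step replaces that exponential by a rational function that agrees with it to first order in $\Delta t$:

```latex
\mathbf{u}(t^{n}) = e^{-\Delta t\, \mathbf{M}^{-1}\mathbf{B}}\, \mathbf{u}(t^{n-1}),
\qquad
\mathbf{u}^{n} = \left( \mathbf{I} + \Delta t\, \mathbf{M}^{-1}\mathbf{B} \right)^{-1} \mathbf{u}^{n-1}
             = \left( \mathbf{I} - \Delta t\, \mathbf{M}^{-1}\mathbf{B} + O(\Delta t^{2}) \right) \mathbf{u}^{n-1} .
```

Both expressions share the expansion $\mathbf{I} - \Delta t\, \mathbf{M}^{-1}\mathbf{B}$ through first order, which is one way to see why backward Euler is first-order accurate in time.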

6.2.3 Finite Element Approximations for Parabolic Problems

Suppose that we want to solve the partial differential equation

$$
\begin{aligned}
\frac{\partial u}{\partial t} - \nabla_x \cdot D \nabla_x u &= 0, && \forall x \in \Omega,\ \forall\, 0 < t \le T, && \text{(6.2-16)} \\
n \cdot D \nabla_x u &= 0, && \forall x \in \partial\Omega,\ \forall\, 0 < t \le T, && \text{(6.2-17)} \\
u &= u_0, && \forall x \in \Omega,\ t = 0 . && \text{(6.2-18)}
\end{aligned}
$$

Suppose that $\Omega$ is a rectangle, and has been subdivided into a union of rectangles. Let $\mathcal{M}$ be the set of continuous functions that are piecewise linear in each coordinate direction. We will describe the Galerkin equations for this situation.

We will use isoparametric transformations to define the basis functions for the continuous piecewise-linear functions in our finite element space $\mathcal{M}$. In one dimension, we define the canonical basis function

$$
V^*(\xi) \equiv \max\{ 1 - |\xi|,\ 0 \} .
$$

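We also define the piecewise linear coordinate mappings

$$
\forall\, 0 \le j \le N \quad \xi_j(x) \equiv
\begin{cases}
(x - x_j)/(x_{j+1} - x_j), & x \ge x_j \\
(x - x_j)/(x_j - x_{j-1}), & x \le x_j .
\end{cases}
$$

We can ignore the unnecessary choices for $j = 0$ and $j = N$. Note that we can invert this relation to get a mapping from $\xi$ to $x \in (x_j, x_{j+1})$, using either $\xi = \xi_j(x) \in (0, 1)$ or $\xi = \xi_{j+1}(x) \in (-1, 0)$:

$$
x_{j+\frac12}(\xi) =
\begin{cases}
x_j + \xi\, \Delta x_{j+\frac12}, & 0 < \xi < 1 \\
x_{j+1} + \xi\, \Delta x_{j+\frac12}, & -1 < \xi < 0 ,
\end{cases}
$$

where $\Delta x_{j+\frac12} \equiv x_{j+1} - x_j$.

Then our continuous piecewise linear basis functions are

$$
V_j(x) \equiv V^*(\xi_j(x)), \quad \forall\, 0 \le j \le N .
$$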
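The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hat` and `basis`, the sample mesh `nodes`, and the treatment of the end nodes (returning zero outside the mesh) are choices made here, not taken from the text.

```python
def hat(xi):
    """Canonical basis function V*(xi) = max(1 - |xi|, 0)."""
    return max(1.0 - abs(xi), 0.0)


def basis(nodes, j, x):
    """Evaluate V_j(x) = V*(xi_j(x)) on a sorted mesh with distinct nodes."""
    xj = nodes[j]
    if x == xj:
        return 1.0
    if x > xj:
        if j == len(nodes) - 1:      # no element to the right of x_N
            return 0.0
        xi = (x - xj) / (nodes[j + 1] - xj)
    else:
        if j == 0:                   # no element to the left of x_0
            return 0.0
        xi = (x - xj) / (xj - nodes[j - 1])
    return hat(xi)


# Hypothetical nonuniform mesh and sample points inside it.
nodes = [0.0, 0.25, 1.0, 1.5]
samples = [0.0, 0.1, 0.25, 0.6, 1.0, 1.3, 1.5]

# At every point of the mesh the basis functions should sum to one.
sums = [sum(basis(nodes, j, x) for j in range(len(nodes))) for x in samples]
```

Because each $V_j$ equals one at its own node and zero at all others, the coefficients $\upsilon_j$ of a function in $\mathcal{M}$ are its nodal values.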


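The finite element equations lead to the linear system for $0 \le i \le N$,

$$
\begin{aligned}
0 &= \sum_{k=0}^{N-1} \sum_{j=0}^{N} \left[ \int_{x_k}^{x_{k+1}} V_i(x) V_j(x)\, dx\, \frac{d\upsilon_j}{dt} + \int_{x_k}^{x_{k+1}} \frac{\partial V_i}{\partial x} D \frac{\partial V_j}{\partial x}\, dx\, \upsilon_j \right] \\
&= \int_{x_{i-1}}^{x_i} \left\{ V_i(x) \left[ V_{i-1} \frac{d\upsilon_{i-1}}{dt} + V_i \frac{d\upsilon_i}{dt} \right] + \frac{\partial V_i}{\partial x} D(x) \left[ \frac{\partial V_{i-1}}{\partial x} \upsilon_{i-1} + \frac{\partial V_i}{\partial x} \upsilon_i \right] \right\} dx \\
&\quad + \int_{x_i}^{x_{i+1}} \left\{ V_i(x) \left[ V_i \frac{d\upsilon_i}{dt} + V_{i+1} \frac{d\upsilon_{i+1}}{dt} \right] + \frac{\partial V_i}{\partial x} D(x) \left[ \frac{\partial V_i}{\partial x} \upsilon_i + \frac{\partial V_{i+1}}{\partial x} \upsilon_{i+1} \right] \right\} dx \\
&= \int_{-1}^{0} \left\{ V^*(\xi) \left[ V^*(\xi + 1) \frac{d\upsilon_{i-1}}{dt} + V^*(\xi) \frac{d\upsilon_i}{dt} \right] \Delta x_{i-\frac12} + \frac{D(x_{i-\frac12}(\xi))}{\Delta x_{i-\frac12}} \left[ -\frac{\upsilon_{i-1}}{\Delta x_{i-\frac12}} + \frac{\upsilon_i}{\Delta x_{i-\frac12}} \right] \Delta x_{i-\frac12} \right\} d\xi \\
&\quad + \int_{0}^{1} \left\{ V^*(\xi) \left[ V^*(\xi) \frac{d\upsilon_i}{dt} + V^*(\xi - 1) \frac{d\upsilon_{i+1}}{dt} \right] \Delta x_{i+\frac12} - \frac{D(x_{i+\frac12}(\xi))}{\Delta x_{i+\frac12}} \left[ -\frac{\upsilon_i}{\Delta x_{i+\frac12}} + \frac{\upsilon_{i+1}}{\Delta x_{i+\frac12}} \right] \Delta x_{i+\frac12} \right\} d\xi \\
&= \left[ \frac16 \frac{d\upsilon_{i-1}}{dt} + \frac13 \frac{d\upsilon_i}{dt} \right] \Delta x_{i-\frac12} + \left[ \frac13 \frac{d\upsilon_i}{dt} + \frac16 \frac{d\upsilon_{i+1}}{dt} \right] \Delta x_{i+\frac12} \\
&\quad + \frac{D_{i-\frac12}}{\Delta x_{i-\frac12}} \left[ -\upsilon_{i-1} + \upsilon_i \right] + \frac{D_{i+\frac12}}{\Delta x_{i+\frac12}} \left[ \upsilon_i - \upsilon_{i+1} \right] .
\end{aligned}
$$

The minus sign in front of the diffusion term in the integral over $(0, 1)$ appears because $\partial V_i / \partial x = -1/\Delta x_{i+\frac12}$ on the element to the right of $x_i$. Here we assumed that the diffusion coefficient $D$ is piecewise constant. This gives us a system of ordinary differential equations

$$
\mathbf{M} \frac{d\mathbf{u}}{dt} + \mathbf{B} \mathbf{u} = 0 . \tag{6.2-19}
$$

Note that in one dimension, the mass matrix $\mathbf{M}$ and the stiffness matrix $\mathbf{B}$ are both tridiagonal.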
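The element-by-element structure of the derivation above can be sketched as follows. This is a minimal illustration, not the book's program: the function name `assemble_1d`, the storage of each matrix as three lists (lower, main and upper diagonal), and the sample mesh and coefficients are all assumptions made here.

```python
def assemble_1d(x, D):
    """Assemble tridiagonal mass and stiffness matrices for continuous
    piecewise-linear elements on nodes x[0..N], with one constant
    diffusion coefficient D[k] per element (x[k], x[k+1]).

    Each matrix is returned as (lower, diag, upper), where lower[i] couples
    node i to node i-1 and upper[i] couples node i to node i+1.
    """
    n = len(x)
    mass = ([0.0] * n, [0.0] * n, [0.0] * n)
    stiff = ([0.0] * n, [0.0] * n, [0.0] * n)
    for k in range(n - 1):
        dx = x[k + 1] - x[k]
        # Element mass matrix: dx * [[1/3, 1/6], [1/6, 1/3]].
        mass[1][k] += dx / 3.0
        mass[1][k + 1] += dx / 3.0
        mass[2][k] += dx / 6.0
        mass[0][k + 1] += dx / 6.0
        # Element stiffness matrix: (D/dx) * [[1, -1], [-1, 1]].
        c = D[k] / dx
        stiff[1][k] += c
        stiff[1][k + 1] += c
        stiff[2][k] -= c
        stiff[0][k + 1] -= c
    return mass, stiff


# Hypothetical nonuniform mesh on (0, 2) with three elements.
x = [0.0, 0.5, 1.5, 2.0]
D = [1.0, 2.0, 0.5]
mass, stiff = assemble_1d(x, D)

# Sum of all mass entries (integral of 1 * 1 over the domain).
total_mass = sum(mass[0]) + sum(mass[1]) + sum(mass[2])
# Row sums of the stiffness matrix (constants lie in its null space).
stiff_row_sums = [stiff[0][i] + stiff[1][i] + stiff[2][i] for i in range(len(x))]
```

Row $i$ of `mass` and `stiff` reproduces the coefficients of the final line of the derivation; the end rows receive a contribution from only one element, which is how the homogeneous natural boundary condition (6.2-17) enters.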

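We still need to specify the initial condition for our system of ordinary differential equations. In general, we do not expect the initial value $u_0$ to be continuous and piecewise linear. One choice for our initial data is to use the $L^2$ projection. We solve the linear system

$$
(V_i, U(\cdot, 0)) = (V_i, u_0) \quad \forall\, 0 \le i \le N . \tag{6.2-20}
$$

Provided that $u_0 \in W^{-1}(\Omega)$, this can be written in terms of the canonical basis functions as follows:

$$
\begin{aligned}
0 &= \sum_{k=0}^{N-1} \int_{x_k}^{x_{k+1}} V_i(x) \left[ \sum_{j=0}^{N} V_j(x) \upsilon_j - u_0 \right] dx \\
&= \int_{x_{i-1}}^{x_i} V_i(x) \left[ V_{i-1} \upsilon_{i-1} + V_i \upsilon_i - u_0 \right] dx + \int_{x_i}^{x_{i+1}} V_i(x) \left[ V_i \upsilon_i + V_{i+1} \upsilon_{i+1} - u_0 \right] dx \\
&= \int_{-1}^{0} V^*(\xi) \left[ V^*(\xi + 1) \upsilon_{i-1} + V^*(\xi) \upsilon_i - u_0(x_{i-\frac12}(\xi)) \right] \Delta x_{i-\frac12}\, d\xi \\
&\quad + \int_{0}^{1} V^*(\xi) \left[ V^*(\xi) \upsilon_i + V^*(\xi - 1) \upsilon_{i+1} - u_0(x_{i+\frac12}(\xi)) \right] \Delta x_{i+\frac12}\, d\xi \\
&= \left[ \frac16 \upsilon_{i-1} + \frac13 \upsilon_i \right] \Delta x_{i-\frac12} + \left[ \frac13 \upsilon_i + \frac16 \upsilon_{i+1} \right] \Delta x_{i+\frac12} \\
&\quad - \int_0^1 \xi\, u_0(x_{i-1} + \xi \Delta x_{i-\frac12})\, d\xi\, \Delta x_{i-\frac12} - \int_0^1 (1 - \xi)\, u_0(x_i + \xi \Delta x_{i+\frac12})\, d\xi\, \Delta x_{i+\frac12} .
\end{aligned}
$$

This linear system can be written in the form

$$
\mathbf{M} \mathbf{u}(0) = \mathbf{u}_0 ,
$$

where here $\mathbf{u}_0 = [(V_i, u_0)]$ is the vector of moments of the initial data, as in (6.2-13), using the mass matrix we determined previously.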
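A minimal sketch of this projection follows. The text leaves open how the integrals of $u_0$ are evaluated; the two-point Gauss rule used here is an assumption, as are the names `thomas` and `l2_projection` and the sample mesh.

```python
import math


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting (fine for the mass matrix,
    which is symmetric positive definite)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    u = [0.0] * n
    u[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def l2_projection(x, u0):
    """Return nodal coefficients of the L2 projection of u0 onto the
    continuous piecewise-linear functions on the mesh x."""
    n = len(x)
    lower, diag, upper, rhs = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    g = 0.5 / math.sqrt(3.0)
    points = (0.5 - g, 0.5 + g)          # two-point Gauss rule on (0, 1)
    for k in range(n - 1):
        dx = x[k + 1] - x[k]
        diag[k] += dx / 3.0
        diag[k + 1] += dx / 3.0
        upper[k] += dx / 6.0
        lower[k + 1] += dx / 6.0
        for xi in points:
            w = 0.5 * dx * u0(x[k] + xi * dx)   # quadrature weight 1/2 per point
            rhs[k] += (1.0 - xi) * w            # V_k     = 1 - xi on this element
            rhs[k + 1] += xi * w                # V_{k+1} = xi     on this element
    return thomas(lower, diag, upper, rhs)


# Hypothetical mesh; a linear u0 already lies in the finite element space.
x = [0.0, 0.3, 1.0, 1.2]
coeffs = l2_projection(x, lambda s: 2.0 * s + 1.0)
```

For initial data that already lies in $\mathcal{M}$, the projection should return its nodal values; for rough data the quadrature rule would need to be chosen with more care.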

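Another choice for the initial data is to use the elliptic projection. In this case, we solve the linear system

$$
\mathcal{B}(V_i, U(\cdot, 0)) = \mathcal{B}(V_i, u_0) \quad \forall\, 0 \le i \le N .
$$

Provided that $u_0 \in W^1(\Omega)$, this can be written in terms of the canonical basis functions as follows:

$$
\begin{aligned}
0 &= \sum_{k=0}^{N-1} \int_{x_k}^{x_{k+1}} \frac{\partial V_i(x)}{\partial x} D(x) \left[ \sum_{j=0}^{N} \frac{\partial V_j(x)}{\partial x} \upsilon_j - \frac{du_0}{dx} \right] dx \\
&= \int_{x_{i-1}}^{x_i} \frac{\partial V_i(x)}{\partial x} D(x) \left[ \frac{\partial V_{i-1}}{\partial x} \upsilon_{i-1} + \frac{\partial V_i}{\partial x} \upsilon_i - \frac{du_0}{dx} \right] dx \\
&\quad + \int_{x_i}^{x_{i+1}} \frac{\partial V_i(x)}{\partial x} D(x) \left[ \frac{\partial V_i}{\partial x} \upsilon_i + \frac{\partial V_{i+1}}{\partial x} \upsilon_{i+1} - \frac{du_0}{dx} \right] dx \\
&= \int_{-1}^{0} \frac{D(x_{i-\frac12}(\xi))}{\Delta x_{i-\frac12}} \left[ -\frac{\upsilon_{i-1}}{\Delta x_{i-\frac12}} + \frac{\upsilon_i}{\Delta x_{i-\frac12}} - \frac{1}{\Delta x_{i-\frac12}} \frac{d}{d\xi} u_0(x_{i-\frac12}(\xi)) \right] \Delta x_{i-\frac12}\, d\xi \\
&\quad - \int_{0}^{1} \frac{D(x_{i+\frac12}(\xi))}{\Delta x_{i+\frac12}} \left[ -\frac{\upsilon_i}{\Delta x_{i+\frac12}} + \frac{\upsilon_{i+1}}{\Delta x_{i+\frac12}} - \frac{1}{\Delta x_{i+\frac12}} \frac{d}{d\xi} u_0(x_{i+\frac12}(\xi)) \right] \Delta x_{i+\frac12}\, d\xi \\
&= \frac{D_{i-\frac12}}{\Delta x_{i-\frac12}} \left[ -\upsilon_{i-1} + \upsilon_i \right] + \frac{D_{i+\frac12}}{\Delta x_{i+\frac12}} \left[ \upsilon_i - \upsilon_{i+1} \right] \\
&\quad - \frac{D_{i-\frac12}}{\Delta x_{i-\frac12}} \left[ u_0(x_i) - u_0(x_{i-1}) \right] + \frac{D_{i+\frac12}}{\Delta x_{i+\frac12}} \left[ u_0(x_{i+1}) - u_0(x_i) \right] .
\end{aligned}
$$

Again, we have assumed that $D$ is piecewise constant. This linear system can be written in the form

$$
\mathbf{B} \mathbf{u}(0) = \mathbf{B} \mathbf{u}_0 , \tag{6.2-21}
$$

where here $\mathbf{u}_0 = [u_0(x_i)]$ is the vector of point values of the initial data, using the stiffness matrix we discovered previously. This system implies that, at least for this choice of $\mathcal{M}$, in one dimension, with piecewise constant diffusion coefficient, and with initial data in $W^1(\Omega)$ satisfying the essential boundary conditions, the initial values for the $\upsilon_i$ can be taken to be the point values of the initial data.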
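The claim that nodal values satisfy the elliptic projection equations can be checked numerically. The sketch below is an illustration under assumptions made here: the helper names `stiffness_apply` and `elliptic_rhs`, the composite midpoint rule, and the test data $u_0 = \sin x$ are not from the text.

```python
import math


def stiffness_apply(x, D, v):
    """Return B v for the 1D stiffness matrix with piecewise-constant D,
    accumulated element by element."""
    out = [0.0] * len(x)
    for k in range(len(x) - 1):
        flux = D[k] * (v[k + 1] - v[k]) / (x[k + 1] - x[k])
        out[k] -= flux
        out[k + 1] += flux
    return out


def elliptic_rhs(x, D, du0, m=200):
    """Approximate [B(V_i, u0)] = [integral of D V_i' u0' dx] using a
    composite midpoint rule with m subintervals per element."""
    out = [0.0] * len(x)
    for k in range(len(x) - 1):
        dx = x[k + 1] - x[k]
        integral = sum(du0(x[k] + (j + 0.5) * dx / m) for j in range(m)) * dx / m
        # On this element V_k' = -1/dx and V_{k+1}' = +1/dx.
        out[k] -= D[k] * integral / dx
        out[k + 1] += D[k] * integral / dx
    return out


# Hypothetical data: nonuniform mesh, piecewise-constant D, u0 = sin.
x = [0.0, 0.4, 1.0, 1.3, 2.0]
D = [1.0, 3.0, 0.5, 2.0]
nodal = [math.sin(xi) for xi in x]

lhs = stiffness_apply(x, D, nodal)      # B applied to the point values of u0
rhs = elliptic_rhs(x, D, math.cos)      # [B(V_i, u0)] by quadrature
mismatch = max(abs(a - b) for a, b in zip(lhs, rhs))
```

The two vectors agree up to quadrature error because the integral of $du_0/dx$ over each element is just the difference of the endpoint values. Since constants lie in the null space of $\mathbf{B}$ for this pure Neumann problem, the nodal values are one solution of (6.2-21) rather than the only one.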

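In 2D, the nonzero contributions to the mass or stiffness matrices from element $(x_k, x_{k+1}) \times (y_\ell, y_{\ell+1})$ involve only nodal basis functions $V_{k,\ell}$, $V_{k+1,\ell}$, $V_{k,\ell+1}$ and $V_{k+1,\ell+1}$. For piecewise constant diffusion coefficient $D$, the integrals can be performed exactly. This suggests that we form the diffusive quadratures

$$
\begin{aligned}
\begin{bmatrix} q_{00} \\ q_{10} \\ q_{01} \\ q_{11} \end{bmatrix}_{k+\frac12, \ell+\frac12}
&\equiv \int_{y_\ell}^{y_{\ell+1}} \int_{x_k}^{x_{k+1}}
\begin{bmatrix} V_{k,\ell} \\ V_{k+1,\ell} \\ V_{k,\ell+1} \\ V_{k+1,\ell+1} \end{bmatrix}
\begin{bmatrix} V_{k,\ell} & V_{k+1,\ell} & V_{k,\ell+1} & V_{k+1,\ell+1} \end{bmatrix} dx\, dy
\begin{bmatrix} d\upsilon_{k,\ell}/dt \\ d\upsilon_{k+1,\ell}/dt \\ d\upsilon_{k,\ell+1}/dt \\ d\upsilon_{k+1,\ell+1}/dt \end{bmatrix} \\
&\quad + \int_{y_\ell}^{y_{\ell+1}} \int_{x_k}^{x_{k+1}}
\begin{bmatrix} (\nabla_x V_{k,\ell})^\top \\ (\nabla_x V_{k+1,\ell})^\top \\ (\nabla_x V_{k,\ell+1})^\top \\ (\nabla_x V_{k+1,\ell+1})^\top \end{bmatrix}
D
\begin{bmatrix} \nabla_x V_{k,\ell} & \nabla_x V_{k+1,\ell} & \nabla_x V_{k,\ell+1} & \nabla_x V_{k+1,\ell+1} \end{bmatrix} dx\, dy
\begin{bmatrix} \upsilon_{k,\ell} \\ \upsilon_{k+1,\ell} \\ \upsilon_{k,\ell+1} \\ \upsilon_{k+1,\ell+1} \end{bmatrix} \\
&= \begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 4 & 1 & 2 \\ 2 & 1 & 4 & 2 \\ 1 & 2 & 2 & 4 \end{bmatrix}
\begin{bmatrix} d\upsilon_{k,\ell}/dt \\ d\upsilon_{k+1,\ell}/dt \\ d\upsilon_{k,\ell+1}/dt \\ d\upsilon_{k+1,\ell+1}/dt \end{bmatrix}
\frac{\Delta x_{k+\frac12} \Delta y_{\ell+\frac12}}{36} \\
&\quad + \left\{
\begin{bmatrix} -1 & -1 \\ 1 & -1 \\ -1 & 1 \\ 1 & 1 \end{bmatrix}
\widetilde{D}
\begin{bmatrix} -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \end{bmatrix}
+ \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix}
\frac{\operatorname{tr} \widetilde{D}}{3}
\begin{bmatrix} 1 & -1 & -1 & 1 \end{bmatrix}
\right\}
\begin{bmatrix} \upsilon_{k,\ell} \\ \upsilon_{k+1,\ell} \\ \upsilon_{k,\ell+1} \\ \upsilon_{k+1,\ell+1} \end{bmatrix}
\frac{\Delta x_{k+\frac12} \Delta y_{\ell+\frac12}}{4} \\
&\equiv \mathbf{M}_{k+\frac12, \ell+\frac12}
\begin{bmatrix} d\upsilon_{k,\ell}/dt \\ d\upsilon_{k+1,\ell}/dt \\ d\upsilon_{k,\ell+1}/dt \\ d\upsilon_{k+1,\ell+1}/dt \end{bmatrix}
+ \mathbf{K}_{k+\frac12, \ell+\frac12}\, \frac{\Delta t^{n+\frac12}}{2}
\begin{bmatrix} \upsilon_{k,\ell} \\ \upsilon_{k+1,\ell} \\ \upsilon_{k,\ell+1} \\ \upsilon_{k+1,\ell+1} \end{bmatrix} ,
\end{aligned}
\tag{6.2-22}
$$

where

$$
\widetilde{D} \equiv
\begin{bmatrix} \Delta x_{k+\frac12} & 0 \\ 0 & \Delta y_{\ell+\frac12} \end{bmatrix}^{-1}
\begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix}
\begin{bmatrix} \Delta x_{k+\frac12} & 0 \\ 0 & \Delta y_{\ell+\frac12} \end{bmatrix}^{-1} .
$$

In the mass term, basis functions at opposite corners of the element, such as $V_{k+1,\ell}$ and $V_{k,\ell+1}$, have the product integral $\frac16 \cdot \frac16\, \Delta x_{k+\frac12} \Delta y_{\ell+\frac12}$, which gives the entries equal to 1.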

The finite element equations at internal nodes then take the form

$$
(q_{00})_{k+\frac12, \ell+\frac12} + (q_{10})_{k-\frac12, \ell+\frac12} + (q_{01})_{k+\frac12, \ell-\frac12} + (q_{11})_{k-\frac12, \ell-\frac12} = 0 . \tag{6.2-23}
$$

In general the sum, over elements within the computational domain, of the quadratures associated with a given node is zero. The global mass matrix $\mathbf{M}$ and stiffness matrix $\mathbf{B}$ are formed by combining the elementwise mass and stiffness matrices from equations (6.2-22) into the finite element equations (6.2-23).
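The elementwise matrices can be sketched as below, for one rectangular element with node ordering $(k,\ell)$, $(k+1,\ell)$, $(k,\ell+1)$, $(k+1,\ell+1)$. The function name `element_matrices` and its argument list are assumptions made for illustration. The mass pattern encodes the exact integrals of products of bilinear basis functions: 4/36 on the diagonal, 2/36 for nodes sharing an edge, and 1/36 for opposite corners, each times the element area. The stiffness matrix returned here is the plain element integral, without any time-step scaling.

```python
def element_matrices(dx, dy, Dxx, Dxy, Dyx, Dyy):
    """Return (mass, stiff), the 4x4 element matrices for a bilinear element
    of size dx by dy with a constant diffusion tensor."""
    pattern = [[4, 2, 2, 1],
               [2, 4, 1, 2],
               [2, 1, 4, 2],
               [1, 2, 2, 4]]
    mass = [[dx * dy * p / 36.0 for p in row] for row in pattern]

    # Scaled tensor diag(dx, dy)^{-1} D diag(dx, dy)^{-1}.
    T = [[Dxx / (dx * dx), Dxy / (dx * dy)],
         [Dyx / (dy * dx), Dyy / (dy * dy)]]
    trace = T[0][0] + T[1][1]

    # Twice the reference gradients at the element center, one row per node.
    G = [[-1, -1], [1, -1], [-1, 1], [1, 1]]
    # Pattern of the linearly varying part of the gradients.
    h = [1, -1, -1, 1]

    stiff = [[0.0] * 4 for _ in range(4)]
    for a in range(4):
        for b in range(4):
            gdg = sum(G[a][p] * T[p][q] * G[b][q]
                      for p in range(2) for q in range(2))
            stiff[a][b] = 0.25 * dx * dy * (gdg + h[a] * h[b] * trace / 3.0)
    return mass, stiff


# Unit square with the identity tensor: the familiar bilinear Laplacian element.
mass, stiff = element_matrices(1.0, 1.0, 1.0, 0.0, 0.0, 1.0)
```

A global assembly loop would add each entry `mass[a][b]` and `stiff[a][b]` into the rows and columns of the four global node numbers of the element, which is the summation expressed by (6.2-23).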

6.3 Error Estimates for Parabolic Galerkin Methods

Let $E(u)$ be the elliptic projection of $u$:

$$
\mathcal{B}(W, u - E(u)) + c_0 (W, u - E(u)) = 0 \quad \forall W \in \mathcal{M},\ \forall\, 0 < t \le T . \tag{6.3-1}
$$

Note that for all $0 < t \le T$, $E(u) \in \mathcal{M}$; further, we can differentiate the equation for $E(u)$ to see that $E(u)$ is continuously differentiable in $t$. The theory of Galerkin methods for elliptic problems in section ?? implies that

$$
\forall\, 0 \le \gamma \le 1\ \ \exists\, C_\gamma > 0\ \ \forall u \in W^k_{e,}(\Omega)\ \ \forall\, 0 < h < h_{\min}: \quad \|u - E(u)\|_\gamma \le C_\gamma h^{k - \gamma} \|u\|_k . \tag{6.3-2}
$$

We can also differentiate the equation for the elliptic projection with respect to $t$ and obtain estimates for time derivatives of $u - E(u)$. Note that we will not necessarily compute the elliptic projection $E(u)$ in our numerical method; we are using it to obtain convenient error estimates.

6.3.0.1 $H^0$ Error Estimate

Let us assume that the bilinear form $\mathcal{B}$ is coercive; in other words, we can choose $c_0 = 0$ in Gårding's inequality. Further, we will assume that the smallest eigenvalue of the elliptic part of the differential equation is positive:

$$
\inf_{w \in W^1(\Omega)} \frac{\mathcal{B}(w, w)}{(w, w)} = \lambda_{\min} > 0 . \tag{6.3-3}
$$

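Since the exact solution satisfies

$$
\frac{d}{dt}(w, u) + \mathcal{B}(w, u) = \lambda(w) \quad \forall w \in W^1_{e,}(\Omega),\ \forall\, 0 < t \le T , \tag{6.3-4}
$$

and the Galerkin approximation satisfies

$$
\frac{d}{dt}(W, U) + \mathcal{B}(W, U) = \lambda(W) \quad \forall W \in \mathcal{M},\ \forall\, 0 < t \le T , \tag{6.3-5}
$$

we can easily see that

$$
\begin{aligned}
\left( W, \frac{d(U - E(u))}{dt} \right) + \mathcal{B}(W, U - E(u))
&= \left( W, \frac{dU}{dt} \right) + \mathcal{B}(W, U) - \left( W, \frac{dE(u)}{dt} \right) - \mathcal{B}(W, E(u)) \\
&= \lambda(W) - \left( W, \frac{dE(u)}{dt} \right) - \mathcal{B}(W, u) = \left( W, \frac{d(u - E(u))}{dt} \right) \\
&\qquad \forall W \in \mathcal{M},\ \forall\, 0 < t \le T .
\end{aligned}
$$

Here we used $\mathcal{B}(W, E(u)) = \mathcal{B}(W, u)$, which is (6.3-1) with $c_0 = 0$, and then (6.3-4) with $w = W$. In particular, for each $0 < t \le T$ we can choose $W = U - E(u)$ and get

$$
\begin{aligned}
\left( U - E(u), \frac{d(U - E(u))}{dt} \right) + \mathcal{B}(U - E(u), U - E(u))
&= \left( U - E(u), \frac{d(u - E(u))}{dt} \right) \\
&\le \|U - E(u)\|_0 \left\| \frac{d(u - E(u))}{dt} \right\|_0 .
\end{aligned}
$$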

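Since $\|U - E(u)\|_0$ might not be differentiable when $U - E(u) = 0$, we note that for all $\varepsilon > 0$

$$
\begin{aligned}
&\frac12 \frac{d}{dt} \left[ \|U - E(u)\|_0^2 + \varepsilon^2 \right] + \lambda_{\min} \left[ \|U - E(u)\|_0^2 + \varepsilon^2 \right] \\
&\quad \le \left( U - E(u), \frac{\partial (U - E(u))}{\partial t} \right) + \mathcal{B}(U - E(u), U - E(u)) + \lambda_{\min} \varepsilon^2 \\
&\quad = \left( U - E(u), \frac{\partial (u - E(u))}{\partial t} \right) + \lambda_{\min} \varepsilon^2 \\
&\quad \le \|U - E(u)\|_0 \left\| \frac{d(u - E(u))}{dt} \right\|_0 + \lambda_{\min} \varepsilon^2 .
\end{aligned}
$$

In other words,

$$
\begin{aligned}
&\frac{d}{dt} \sqrt{\|U - E(u)\|_0^2 + \varepsilon^2} + \lambda_{\min} \sqrt{\|U - E(u)\|_0^2 + \varepsilon^2} \\
&\quad \le \frac{\|U - E(u)\|_0}{\sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}} \left\| \frac{d(u - E(u))}{dt} \right\|_0 + \frac{\lambda_{\min} \varepsilon^2}{\sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}} \\
&\quad \le \left\| \frac{d(u - E(u))}{dt} \right\|_0 + \frac{\lambda_{\min} \varepsilon^2}{\sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}} .
\end{aligned}
$$

We can use Gronwall's inequality with $\theta = \sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}$, $\alpha = -\lambda_{\min}$ and $\beta$ equal to the final right-hand side of the previous inequality to get

$$
\sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}\,(t) \le e^{-\lambda_{\min} t} \left[ \sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}\,(0) + \int_0^t e^{\lambda_{\min} s} \left( \left\| \frac{d(u - E(u))}{dt} \right\|_0 + \frac{\lambda_{\min} \varepsilon^2}{\sqrt{\|U - E(u)\|_0^2 + \varepsilon^2}} \right) ds \right] .
$$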


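Now we can let $\varepsilon \to 0$; since $\lambda_{\min} \varepsilon^2 / \sqrt{\|U - E(u)\|_0^2 + \varepsilon^2} \le \lambda_{\min} \varepsilon$, we get

$$
\|(U - E(u))(\cdot, t)\|_0 \le e^{-\lambda_{\min} t} \left[ \|(U - E(u))(\cdot, 0)\|_0 + \int_0^t e^{\lambda_{\min} s} \left\| \frac{\partial (u - E(u))}{\partial t}(\cdot, s) \right\|_0 ds \right] .
$$

It follows that

$$
\begin{aligned}
\|(U - u)(\cdot, t)\|_0 &\le \|(E(u) - u)(\cdot, t)\|_0 + \|(U - E(u))(\cdot, t)\|_0 \\
&\le \|(E(u) - u)(\cdot, t)\|_0 + e^{-\lambda_{\min} t} \|(U - E(u))(\cdot, 0)\|_0 + \int_0^t e^{-\lambda_{\min}(t - s)} \left\| \frac{\partial (u - E(u))}{\partial t} \right\|_0 ds \\
&\le \|(E(u) - u)(\cdot, t)\|_0 + e^{-\lambda_{\min} t} \left[ \|U(\cdot, 0) - u_0\|_0 + \|u_0 - E(u_0)\|_0 \right] + \int_0^t e^{-\lambda_{\min}(t - s)} \left\| \frac{\partial (u - E(u))}{\partial t} \right\|_0 ds \\
&\le C_0 h^k \|u(\cdot, t)\|_k + e^{-\lambda_{\min} t} \left[ \|U(\cdot, 0) - u_0\|_0 + C_0 h^p \|u_0\|_p \right] + C_0 h^k \int_0^t e^{-\lambda_{\min}(t - s)} \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_k ds .
\end{aligned}
$$

This proves convergence of the scheme, provided that the initial data can be approximated with small error. Note, however, that the influence of the error due to the initial data decreases with time.

6.3.0.2 Error Estimate for Backward Euler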

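The Galerkin equations for parabolic equations lead to systems of ordinary differential equations in time. One approach for the numerical integration of these ordinary differential equations is to use the backward Euler method. The weak form of this method is

$$
\left( W, \frac{U^n - U^{n-1}}{\Delta t^{n-\frac12}} \right) + \mathcal{B}^n(W, U^n) = \lambda^n(W) \quad \forall W \in \mathcal{M} .
$$

In this equation, the superscript on $\mathcal{B}$ and $\lambda$ indicates that if these terms depend explicitly on time, then they are evaluated at $t^n$. This weak form leads to the matrix-vector form

$$
\mathbf{M} \frac{\mathbf{u}^n - \mathbf{u}^{n-1}}{\Delta t^{n-\frac12}} + \mathbf{B}^n \mathbf{u}^n = \mathbf{f}^n ,
$$

which is equivalent to the linear system

$$
\left( \mathbf{M} + \mathbf{B}^n \Delta t^{n-\frac12} \right) \mathbf{u}^n = \mathbf{M} \mathbf{u}^{n-1} + \mathbf{f}^n \Delta t^{n-\frac12} .
$$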
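The linear system above can be sketched for the one-dimensional problem (6.2-16) through (6.2-18). This is an illustration under assumptions made here: $\mathbf{f} = 0$, a stiffness matrix that does not depend on time, a fixed time step, and hypothetical helper names (`assemble`, `tri_matvec`, `thomas`, `backward_euler`).

```python
def assemble(x, D):
    """Tridiagonal mass and stiffness matrices as [lower, diag, upper]."""
    n = len(x)
    M = [[0.0] * n for _ in range(3)]
    B = [[0.0] * n for _ in range(3)]
    for k in range(n - 1):
        dx = x[k + 1] - x[k]
        M[1][k] += dx / 3.0
        M[1][k + 1] += dx / 3.0
        M[2][k] += dx / 6.0
        M[0][k + 1] += dx / 6.0
        c = D[k] / dx
        B[1][k] += c
        B[1][k + 1] += c
        B[2][k] -= c
        B[0][k + 1] -= c
    return M, B


def tri_matvec(T, v):
    """Multiply a tridiagonal matrix [lower, diag, upper] by a vector."""
    lower, diag, upper = T
    n = len(v)
    out = []
    for i in range(n):
        s = diag[i] * v[i]
        if i > 0:
            s += lower[i] * v[i - 1]
        if i < n - 1:
            s += upper[i] * v[i + 1]
        out.append(s)
    return out


def thomas(lower, diag, upper, rhs):
    """Tridiagonal solve without pivoting (M + dt*B is symmetric positive definite)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    u = [0.0] * n
    u[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def backward_euler(x, D, u, dt, steps):
    """Repeatedly solve (M + dt*B) u_new = M u_old."""
    M, B = assemble(x, D)
    A = [[m + dt * b for m, b in zip(M[r], B[r])] for r in range(3)]
    for _ in range(steps):
        u = thomas(A[0], A[1], A[2], tri_matvec(M, u))
    return u


# Hypothetical test problem: step initial data on a uniform mesh of (0, 1).
x = [i / 10.0 for i in range(11)]
D = [1.0] * 10
u_start = [1.0 if i < 5 else 0.0 for i in range(11)]
u_end = backward_euler(x, D, u_start, 0.01, 200)

M, _ = assemble(x, D)
mass_start = sum(tri_matvec(M, u_start))   # integral of U at t = 0
mass_end = sum(tri_matvec(M, u_end))       # integral of U after 200 steps
```

With zero-flux boundary conditions the discrete integral of $U$ is unchanged by each step, because the columns of $\mathbf{B}$ sum to zero, and the solution relaxes toward a constant.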


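The error analysis for the backward-Euler integration of the Galerkin equations is similar to the analysis of the continuous-in-time method. First, note that

$$
\begin{aligned}
&\left( W, \frac{(U^n - E(u(\cdot, t^n))) - (U^{n-1} - E(u(\cdot, t^{n-1})))}{\Delta t^{n-\frac12}} \right) + \mathcal{B}(W, U^n - E(u(\cdot, t^n))) \\
&\quad = \lambda^n(W) - \left( W, \frac{E(u(\cdot, t^n) - u(\cdot, t^{n-1}))}{\Delta t^{n-\frac12}} \right) - \mathcal{B}(W, E(u(\cdot, t^n))) \\
&\quad = \left( W, \frac{\partial u}{\partial t}(\cdot, t^n) - E\left( \frac{u(\cdot, t^n) - u(\cdot, t^{n-1})}{\Delta t^{n-\frac12}} \right) \right) \\
&\quad = \left( W, (I - E)\left( \frac{u(\cdot, t^n) - u(\cdot, t^{n-1})}{\Delta t^{n-\frac12}} \right) \right) + \left( W, \frac{\partial u}{\partial t}(\cdot, t^n) - \frac{u(\cdot, t^n) - u(\cdot, t^{n-1})}{\Delta t^{n-\frac12}} \right) \\
&\quad \equiv (W, \omega^n) .
\end{aligned}
$$

Let $W = U^n - E(u(\cdot, t^n))$. Then

$$
\begin{aligned}
&\left( U^n - E(u(\cdot, t^n)), \frac{(U^n - E(u(\cdot, t^n))) - (U^{n-1} - E(u(\cdot, t^{n-1})))}{\Delta t^{n-\frac12}} \right) + \lambda_{\min} \|U^n - E(u(\cdot, t^n))\|_0^2 \\
&\quad \le \left( U^n - E(u(\cdot, t^n)), \frac{(U^n - E(u(\cdot, t^n))) - (U^{n-1} - E(u(\cdot, t^{n-1})))}{\Delta t^{n-\frac12}} \right) + \mathcal{B}(U^n - E(u(\cdot, t^n)), U^n - E(u(\cdot, t^n))) \\
&\quad \le \|U^n - E(u(\cdot, t^n))\|_0 \|\omega^n\|_0 ,
\end{aligned}
$$

which can be rewritten

$$
\begin{aligned}
&\|U^n - E(u(\cdot, t^n))\|_0^2 \left( 1 + \lambda_{\min} \Delta t^{n-\frac12} \right) \\
&\quad \le \left( U^n - E(u(\cdot, t^n)), U^{n-1} - E(u(\cdot, t^{n-1})) \right) + \Delta t^{n-\frac12} \|U^n - E(u(\cdot, t^n))\|_0 \|\omega^n\|_0 \\
&\quad \le \|U^n - E(u(\cdot, t^n))\|_0 \|U^{n-1} - E(u(\cdot, t^{n-1}))\|_0 + \Delta t^{n-\frac12} \|U^n - E(u(\cdot, t^n))\|_0 \|\omega^n\|_0 .
\end{aligned}
$$

This implies that

$$
\|U^n - E(u(\cdot, t^n))\|_0 \le \frac{1}{1 + \lambda_{\min} \Delta t^{n-\frac12}} \left\{ \|U^{n-1} - E(u(\cdot, t^{n-1}))\|_0 + \Delta t^{n-\frac12} \|\omega^n\|_0 \right\} .
$$

This linear recurrence inequality implies that

$$
\|U^n - E(u(\cdot, t^n))\|_0 \le \left[ \prod_{j=1}^{n} \frac{1}{1 + \lambda_{\min} \Delta t^{j-\frac12}} \right] \|U^0 - E(u_0)\|_0 + \sum_{k=1}^{n} \left[ \prod_{j=k}^{n} \frac{1}{1 + \lambda_{\min} \Delta t^{j-\frac12}} \right] \Delta t^{k-\frac12} \|\omega^k\|_0 .
$$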


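Recall that $\omega^k$ is the sum of two terms. The first term has the bound

$$
\Delta t^{k-\frac12} \left\| (I - E)\left( \frac{u(\cdot, t^k) - u(\cdot, t^{k-1})}{\Delta t^{k-\frac12}} \right) \right\|_0 = \left\| (I - E) \int_{t^{k-1}}^{t^k} \frac{\partial u}{\partial t}\, ds \right\|_0 \le C_0 h^k \int_{t^{k-1}}^{t^k} \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_k ds .
$$

Here the exponent and subscript $k$ on $h^k$ and $\|\cdot\|_k$ denote the order of the Sobolev norm, as in (6.3-2), while $k$ elsewhere is the time step index. Note that $1/(1 + x) \le 2 e^{-x}$ for $0 \le x \le 1$. Thus if $\Delta t^{k-\frac12} \lambda_{\min} \le 1$ for all $k$, then

$$
\begin{aligned}
&\sum_{k=1}^{n} \left[ \prod_{j=k}^{n} \frac{1}{1 + \lambda_{\min} \Delta t^{j-\frac12}} \right] \Delta t^{k-\frac12} \left\| (I - E)\left( \frac{u(\cdot, t^k) - u(\cdot, t^{k-1})}{\Delta t^{k-\frac12}} \right) \right\|_0 \\
&\quad \le C_0 h^k \sum_{k=1}^{n} \left[ \prod_{j=k}^{n} \frac{1}{1 + \lambda_{\min} \Delta t^{j-\frac12}} \right] \int_{t^{k-1}}^{t^k} \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_k ds \\
&\quad \le 2 C_0 h^k \int_0^{t^n} e^{-\lambda_{\min}(t^n - s)} \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_k ds .
\end{aligned}
$$

For the other term in $\omega^k$ we estimate

$$
\begin{aligned}
\Delta t^{k-\frac12} \left\| \frac{\partial u}{\partial t}(\cdot, t^k) - \frac{u(\cdot, t^k) - u(\cdot, t^{k-1})}{\Delta t^{k-\frac12}} \right\|_0
&= \left\| \int_{t^{k-1}}^{t^k} \left[ \frac{\partial u}{\partial t}(\cdot, t^k) - \frac{\partial u}{\partial t}(\cdot, s) \right] ds \right\|_0 \\
&= \left\| \int_{t^{k-1}}^{t^k} \int_s^{t^k} \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr\, ds \right\|_0 = \left\| \int_{t^{k-1}}^{t^k} (r - t^{k-1}) \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr \right\|_0 .
\end{aligned}
$$

It follows that if $\Delta t^{k-\frac12} \lambda_{\min} \le 1$ for all $k$, then

$$
\begin{aligned}
&\sum_{k=1}^{n} \left[ \prod_{j=k}^{n} \frac{1}{1 + \lambda_{\min} \Delta t^{j-\frac12}} \right] \Delta t^{k-\frac12} \left\| \frac{\partial u}{\partial t}(\cdot, t^k) - \frac{u(\cdot, t^k) - u(\cdot, t^{k-1})}{\Delta t^{k-\frac12}} \right\|_0 \\
&\quad \le \sum_{k=1}^{n} e^{-\lambda_{\min}(t^n - t^{k-1})} \left\| \int_{t^{k-1}}^{t^k} (r - t^{k-1}) \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr \right\|_0 \\
&\quad \le \sum_{k=1}^{n} \Delta t^{k-\frac12} e^{-\lambda_{\min}(t^n - t^{k-1})} \left\| \int_{t^{k-1}}^{t^k} \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr \right\|_0 .
\end{aligned}
$$

It follows that

$$
\begin{aligned}
\|U^n - u(\cdot, t^n)\|_0 &\le \|U^n - E(u(\cdot, t^n))\|_0 + \|u(\cdot, t^n) - E(u(\cdot, t^n))\|_0 \\
&\le 2 e^{-\lambda_{\min} t^n} \|U^0 - E(u_0)\|_0 + 2 C_0 h^k \int_0^{t^n} e^{-\lambda_{\min}(t^n - s)} \left\| \frac{\partial u}{\partial t}(\cdot, s) \right\|_k ds \\
&\quad + \sum_{k=1}^{n} \Delta t^{k-\frac12} e^{-\lambda_{\min}(t^n - t^{k-1})} \left\| \int_{t^{k-1}}^{t^k} \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr \right\|_0 + C_0 h^k \|u(\cdot, t^n)\|_k .
\end{aligned}
$$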

This result suggests that we should consider using smaller $\Delta t$ wherever $\int_{t^{k-1}}^{t^k} \frac{\partial^2 u}{\partial t^2}(\cdot, r)\, dr$ is large.

Bibliography

[1] Milton Abramowitz and Irene A. Stegun, editors. Handbook of Mathematical Functions.Dover, 1965. 5.5.1

[2] R.A. Adams, editor. Sobolev Spaces. Academic Press, 1975. 4.4.2, 4.4.3, 4.4.4, 4.4.5

[3] R.A. Agmon. Lectures on Elliptic Boundary Value Problems. van Nostrand, 1965.4.3.2, 4.4.4, 4.4.6, 4.5.1, 4.5.1, 4.5-10, 4.5-11, 4.6.1, 4.7.6

[4] T.J. Arbogast and Z. Chen. On the implementation of mixed methods as noncon-forming methods for second-order elliptic problems. Math. Comp., 64:943–972, 1995.5.15.3, 5.15.13

[5] D.N. Arnold and F. Brezzi. Mixed and nonconforming finite element methods: Imple-mentation, postprocessing and error estimates. RAIRO Model. Math. Anal. Numer.,19:7–32, 1985. 5.15.3, 5.15.13, 5.15.15

[6] O. Axelsson. Incomplete block matrix factorization preconditioning methods. the ul-timate answer? Journal of Computational and Applied Mathematics, 12:3–18, 1985.3.6.2

[7] Owe Axelsson. Iterative Solution Methods. Cambridge, 1994. 3.2-7, 3.4, 3.4-4, 3.4-6, 3.4-9, 3.4-13, 3.4-15, 3.4-16, 3.5-2, 3.5-14, 3.5-22, 3.5.3, 3.5-30, 3.5-31, 3.6.2, 3.6-1,3.6.4, 3.7.1, 3.7.1, 3.8-3, 3.8-4, 3.8-5, 3.8-8, 3.8-9, 3.8-10

[8] A.K. Aziz, editor. The Mathematical Foundations of the Finite Element Method withApplications to Partial Differential Equations. Academic, 1972. 4.4.6, 5.15-4

[9] I. Babuska, B.A. Szabo, and I.N. Katz. The p-version of the finite element method.SIAM J. Numer. Anal., 18:515–545, 1981. 5.3.3

[10] K.-J. Bathe and E. L. Wilson. Numerical Methods in Finite Element Analysis. Prentice-Hall, 1976. 5.11

513

514 BIBLIOGRAPHY

[11] Dietrich Braess. Finite Elements. Cambridge, 1997. 4.4.6, 4.4.6, 4.7.6.1, 5.15-4, 5.15-5,5.15-6, 5.15-7, 5.15-8, 5.15-9, 5.15-10

[12] J. H. Bramble, J. E. Pasciak, and A. Vassilev. Analysis of the inexact uzawa algorithmfor saddle point problems. SIAM J. Numer. Anal., 34:1072–1092, 1997. 5.15.3, 5.15.13

[13] J. H. Bramble and J. Xu. Some estimates for weighted l2 projections. Math. Comp.,56:463–476, 1991. 4.7.2

[14] J. H. Bramble and X. Zhang. Handbook of numerical analysis, volume VII, chap-ter entitled The analysis of multigrid methods, pages 173–415. North-Holland, 2000.P. G. Ciarlet and J. L. Lions, eds. 5.13.1

[15] James H. Bramble and S. Hilbert. Bounds for a class of linear functionals with ap-plications to hermite interpolation. Numerische Mathematik, 16:362–369, 1971. 4.6-2,5.4-5

[16] Susanne C. Brenner and L. Ridgway Scott. The Mathematical Theory of Finite ElementMethods. Springer-Verlag, 1994. 4.4-36, 4.4.6, 4.5.1, 4.7.6.1, 5.10.2, 5.12, 5.12-1, 5.15.1,5.15.3, 5.15.6.3, 5.15.11, 5.15.13, 5.15.15

[17] Franco Brezzi and Michel Fortin. Mixed and Hybrid Finite Element Methods. Springer–Verlag, New York, 1991. 5.15.3, 5.15.13, 5.15.15

[18] P.N. Brown. A local convergence theory for combined inexact-Newton/finite-differenceprojection methods. SIAM J. Numer. Anal., 24:407–434, 1987. 3.9.2, 3.9-1

[19] P.N. Brown and Y. Saad. Hybrid Krylov methods for nonlinear systems of equations.SISSC, 11:450–481, 1990. 3.9.2

[20] P.N. Brown and Y. Saad. Convergence theory of nonlinear Newton-Krylov algorithms.SIAM J. Optimization, 4:297–330, 1994. 3.9.2

[21] Z. Chen. Analysis of mixed methods using conforming and nonconforming finite ele-ment methods. RAIRO Model. Math. Anal. Numer., 27:9–34, 1993. 5.15.3, 5.15.13

[22] Z. Chen. Equivalence between and multigrid algorithms for nonconforming and mixedmethods for second order elliptic problems. East-West Numer. Math., 4:1–33, 1996.5.15.3, 5.15.13

[23] Lindberg Chernuka, Cowper and Olson, 1972. 5.12

[24] A.J. Chorin and J.E. Marsden. A Mathematical Introduction to Fluid Mechanics.Springer-Verlag, 1979. 1.2.6, 5.14-1, 5.15.6.4

BIBLIOGRAPHY 515

[25] P.G. Ciarlet. The Finite Element Method for Elliptic Problems. North-Holland, 1978.4.7.6.1, 5.11

[26] R. Courant and D. Hilbert. Methods of Mathematical Physics, Volume I. Interscience,1953. 5.14.2

[27] Lawrence C. Cowsar. Some Domain Decomposition and Multigrid Preconditioners forHybrid Mixed finite elements. PhD thesis, Rice University, Texas, 1994. 5.15.3, 5.15.13

[28] G. Dahlquist and A. Bjorck. Numerical Methods. Prentice-Hall, 1974. Translated byN. Anderson. 2.2.5

[29] P.J. Davis. Interpolation and Approximation. Blaisdell, 1965. 5.4.2

[30] P.J. Davis and P. Rabinowitz. Numerical Integration. Blaisdell, 1967. 5.12

[31] Ghislain de Marsily. Quantitative Hydrology. Academic, 1986. 1.2.3

[32] C. DeBoor and B. Swartz. Collocation at gaussian points. SIAM J. Numer. Anal.,10:582–606, 1973. 5.3.2.3

[33] D.A. Dunavant. High degree efficient symmetrical gaussian quadrature rules for thetriangle. Int. J. Num. Meth. Engng., 21:1129–1148, 1985. 5.5.7

[34] T. Dupont. Some l2 error estimates for parabolic galerkin methods. In A.K. Aziz,editor, The Mathematical Foundations of the Finite Element Method with Applicationsto Partial Differential Equations, pages 491–504. Academic, 1972. 6.1.2

[35] T. Dupont, R.P. Kendal, and H.H. Rachford, Jr. An approximate factorization pro-cedure for solving self-adjoint elliptic difference equations. SIAM J. Numer. Anal.,5:554–573, 1968. 3.6.2

[36] M.G. Edwards and C.F. Rogers. A flux continuous scheme for the full tensor pressureequation. In M.A. Christie, F.V. Da Silva, C.L. Farmer, O. Guillon, and Z.E. Hein-mann, editors, 4th European Conference on the Mathematics of Oil Recovery, Norway,June 1994. 5.15.6, 5.15.17

[37] H. Elman and G. Golub. Inexact and preconditioned uzawa algorithms for saddle pointproblems. SIAM J. Numer. Anal., 31:1645–1661, 1994. 5.15.3, 5.15.13

[38] H.C. Elman. Iterative Methods for Large Sparse Nonsymmetric Systems of LinearEquations. PhD thesis, Computer Science Department, Yale University, 1982. 4

[39] R.E. Ewing and J. Wang. Analysis of mixed finite element methods in locally refinedgrids. Numer. Math., 63:183–194, 1992. 5.15.3, 5.15.13

516 BIBLIOGRAPHY

[40] R.E. Ewing and M.F. Wheeler. Numerical Methods for Scientific Computing, chapterentitled Computational Aspects of Mixed Finite Element Methods, pages 163–172.North-Holland, 1983. R. Stepleman (editor). 5.15.3, 5.15.13

[41] A. Friedman. Remarks on the maximum principle for parabolic equations and itsapplications. Pacific J. Math., 8:201–211, 1958. 2.1.3

[42] A. Friedman. Partial Differential Equations of Parabolic Type. Prentice-Hall, 1964.2.1.1, 2.1.3, 6.1-1, 6.1-2

[43] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins, 1996. Thirdedition. 3.2-1, 3.2-4, 3.5-1, 3.5-13, 3.5-23, 5, 3.7.1, 3.7.2

[44] Peter Grindrod. Patterns and waves. The Clarendon Press Oxford University Press,New York, 1991. The theory and applications of reaction-diffusion equations. 1.2.2

[45] P. Grisvard. Elliptic Problems in Nonsmooth Domains. Pitman Advanced PublishingProgram, 1985. 4.4.6, 4.5.3

[46] G.H. Hardy, J.E. Littlewood, and G. Polya. Inequalities. Cambridge University Press,1967. 4.6.2

[47] C. Hirsch. Numerical Computation of Internal and External Flows, volume 1: Funda-mentals of Numerical Discretization. Wiley, 1988. 5.15.6.4

[48] R.D. Hornung. Adaptive Mesh Refinement and Multi-Level Iteration Techniques. PhDthesis, Department of Mathematics, Duke University, 1994. 5.15.3, 5.15.13

[49] R.D. Hornung and J.A. Trangenstein. Adaptive mesh refinement and multilevel itera-tion for flow in porous media. J. Comp. Phys., 136:522–545, 1997. 5.15.3, 5.15.13

[50] Cameron Hughes and Tracey Hughes. Mastering the Standard C++ Classes. Wiley,1987. 4.2.2.4

[51] B.M. Irons. Engineering application of numerical integration in stiffness method.A.I.A.A., 4:2035–2037, 1966. 5.11

[52] B.M. Irons. Comment on ‘stiffness matrices for sector element’ by i.r.raju and a.k. rao.J. A.I.A.A., 7:156–157, 1969. 5.11

[53] Claes Johnson. Numerical Solution of Partial Differential Equations by the FiniteElement Method. Cambridge, 1994. 5.10.2

[54] Mark T. Jones and Paul E. Plassmann. An improved incomplete cholesky factorization.ACM Transactions on Mathematical Software, 21:5–17, 1995. 3.6.6

BIBLIOGRAPHY 517

[55] J.P. Keener. Waves in excitable media. SIAM J. Appl. Math., 39:528–548, 1980. 1.2.2

[56] John L. Kelley. General Topology. University Series in Higher Mathematics. VanNostrand, 1955. 4.3

[57] Zdenek Kopal. Numerical Analysis, with Emphasis on the Application of NumericalTechniques to Problems of Infinitesimal Calculus in a Single Variable. Wiley (NewYork), 1961. 5.3.3

[58] S. Krein and Yu. Petunin. Scales of banach spaces. Uspehi Matematicheskii Nauk,21:89–168, 1966. 4.4.5

[59] Erwin Kreyszig. Introductory Functional Analysis with Applications. Wiley (NewYork), 1978. 4.3

[60] J.L. Lions and E. Magenes. Non-homogeneous Boundary Value Problems and Applica-tions. Springer-Verlag, 1972. 4.3.2, 4.4.3, 4.4.6, 4.4.6, 4.7.6

[61] David G. Luenberger, editor. Introduction to Linear and Nonlinear Programming.Addison-Wesley, 1973. 3.7.1, 3.7-3, 3.7-4, 3.7-5

[62] J. Mandel. Etudes algebrique d’une methode multigrille pour quelques problems defrontiere libre. C.R. Acad. Sci., Ser. I. Math., 298:469–472, 1984. 5.13.1

[63] G. I. Marchuk. Splitting and alternating direction methods. In P. G. Ciarlet and J. L.Lions, editors, Handbook of Numerical Analysis, volume 1. North-Holland, Amsterdam,1990. 2.6.2

[64] J.D. Murray. Mathematical Biology, volume 19 of Lecture Notes in Biomathematics.Springer, 1989. 1.2.2

[65] D.W. Peaceman. Fundamentals of Numerical Reservoir Simulation. Elsevier ScientificPublishing Co., 1977. 5.15.6, 5.15.17

[66] Mark A. Pinsky, editor. Partial Differential Equations and Boundary-Value Problemswith Applications. International Series in Pure and Applied Mathematics. McGraw-Hill, 1998. 2.1.1

[67] V.T. Rajan. Optimality of the delaunay triangulation in rd. Discrete and Computa-tional Geometry, 12:189–202, 1994. 5.8

[68] P.A. Raviart and J.M. Thomas. A mixed finite element method for second orderelliptic problems. In I. Galligani and E. Magenes, editors, Mathematical Aspects ofFinite Element Methods, pages 292–315. Springer-Verlag, Berlin, 1967. 5.15.10

518 BIBLIOGRAPHY

[69] Frigyes Riesz and Bela Sz.-Nagy. Functional Analysis. Frederick Ungar Publishing,1965. 4.3

[70] J.E. Roberts and J.M. Thomas. Mixed and hybrid methods. In P.G. Ciarlet and J.L.Lions, editors, Handbook of Numerical Analysis, volume 2, pages 524–639. ElsevierScience Publisher B.V., 1991. 5.15.3, 5.15.13, 5.15.15

[71] W. Rudin. Real and Complex Analysis. McGraw-Hill, 1966. 4.4.2, 4.4.2

[72] W. Rudin. Functional Analysis. McGraw-Hill, 1973. 4.3

[73] W. Rudin. Real and Complex Analysis. McGraw-Hill, 1987. 2.4.1, 4.3, 4.4.1, 4.4.1

[74] Youcef Saad and Martin Schultz. Gmres: A generalized minimal residual algorithm forsolving nonsymmetric linear equations. SIAM J. Sci. Stat. Comput., 7:856–869, 1986.3.8.2

[75] L. Ridgway Scott. Interpolated boundary conditions in the finite element method.SIAM J. Numer. Anal., 12:404–427, 1975. 5.12

[76] Jonathan Richard Shewchuk. What is a good linear finite element? interpolation,conditioning, anisotropy, and quality measures. Department of Electical Engineeringand Computer Sciences, University of California at Berkeley, December 31, 2002. 5.8

[77] K.T. Smith. Inequalities for formally positive integro-differential forms. Bull. A.M.S.,67:368–370, 1961. 4.4.4

[78] G.A. Sod. Numerical Methods in Fluid Dynamics. Cambridge University, 1985. 2.4

[79] T. Steihaug. The conjugate gradient method and trust regions in large scale optimiza-tion. SIAM J. Numer. Anal., 20:626–637, 1983. 3.9.1

[80] G. Strang. Variational crimes in the finite element method. In A.K. Aziz, editor, TheMathematical Foundations of the Finite Element Method with Applications to PartialDifferential Equations, pages 491–504. Academic, 1972. 5.11

[81] G. Strang and G.J. Fix. An Analysis of the Finite Element Method. Prentice-Hall,1973. 5.10.2, 5.11

[82] Walter Strauss. Partial Differential Equations: An Introduction. John Wiley and Sons,1992. 2.1.1, 2.1.2

[83] J.C. Strikwerda. Finite Difference Schemes and Partial Differential Equations.Wadsworth & Borrks/Cole, 1989. 2.4.5

BIBLIOGRAPHY 519

[84] A.H. Stroud and D. Secrest. Gaussian Quadrature Formulas. Prentice-Hall, 1966.5.5.1, 5.12

[85] B. Stroustrup. The C++ Programming Language. Addison-Wesley, 1992. 4.2.2.4

[86] Stuben, K. Appendix a: An introduction to algebraic multigrid. In Trottenberg,Ulrich and Oosterlee, Cornelis and Schuller, Anton, editor, Multigrid, pages 413–532.Academic, 2001. 3.10-2, 3.10-3, 3.10-4, 3.10.5

[87] Barna Szabo and Ivo Babuska. Finite Element Analysis. Wiley (New York), 1991.5.3.3, 5.5.7, 5.6.7

[88] S. Timoshenko. Strength of Materials. Van Nostrand Reinhold, 1956. 4, 4

[89] Richard S. Varga. Matrix Iterative Analysis. Springer, 2000. 3.3-4, 3.3-5, 3.3-10, 3.3-11,3.3-12, 3.3-13, 3.3-14, 3.3-15, 3.3-16, 3.4-7, 3.4-11

[90] V. S. Vladimirov. Equations of Mathematical Physics. Pure and Applied Mathematics.Dekker, 1971. 2.1.1

[91] Wachspress, editor. Iterative Solution of Elliptic Systems and Applications to the Neu-tron Diffusion Equations of Reactor Physics. Prentice-Hall, 1966. 3.5.4, 4, 5, 3.7.1

[92] A. F. Ware, A. K. Parrott, and C. Rogers. A finite volume discretization for porousmedia flows governed by non-diagonal permeability tensors. In P. A. Thibault andD. M. Bergeron, editors, Proc. CFD95, Third Annual Conference of the CFD Societyof Canada, 25–27 June, Banff, Alberta, Canada, 1995. 5.15.6, 5.15.17

[93] Kosaku Yosida. Functional Analysis. Springer-Verlag, 1974. 4.3, 4.4.2, 4.4.2, 4.4.2,5.15.8

[94] O. C. Zienkiewicz. The Finite Element Method in Engineering Science. McGraw-Hill,1971. 5.11

