Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004

Upload: piyush-panigrahi

Post on 30-Sep-2015



  • Num. Meth. Large-Scale Nonlinear Systems 2

    1. Classical Newton Convergence Theorems

    1.1 Classical Newton-Kantorovich Theorem

    Theorem 1.1 Classical Newton-Kantorovich Theorem

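    Let $X$ and $Y$ be Banach spaces, $D \subset X$ a convex subset, and suppose that $F : D \subset X \to Y$ is continuously Fréchet differentiable on $D$ with an invertible Fréchet derivative $F'(x^0)$ for some initial guess $x^0 \in D$. Assume further that the following conditions hold true:

    $$\|F'(x^0)^{-1} F(x^0)\| \le \alpha , \tag{1.1}$$
    $$\|F'(y) - F'(x)\| \le \gamma \, \|y - x\| , \quad x, y \in D , \tag{1.2}$$
    $$h_0 := \alpha \, \gamma \, \|F'(x^0)^{-1}\| < \tfrac{1}{2} , \tag{1.3}$$
    $$\bar B(x^0, \rho_0) \subset D , \quad \rho_0 := \frac{1 - \sqrt{1 - 2 h_0}}{\gamma \, \|F'(x^0)^{-1}\|} . \tag{1.4}$$

    Then, for the sequence $\{x^k\}_{\mathbb{N}_0}$ of Newton iterates

    $$F'(x^k) \, \Delta x^k = - F(x^k) , \quad x^{k+1} = x^k + \Delta x^k$$

    there holds

    (i) $F'(x)$ is invertible for all Newton iterates $x = x^k$, $k \in \mathbb{N}_0$,
    (ii) The sequence $\{x^k\}_{\mathbb{N}}$ of Newton iterates is well defined with $x^k \in B(x^0, \rho_0)$, $k \in \mathbb{N}_0$, and $x^k \to x^* \in \bar B(x^0, \rho_0)$ $(k \to \infty)$, where $F(x^*) = 0$,
    (iii) The convergence $x^k \to x^*$ $(k \to \infty)$ is quadratic,
    (iv) The solution $x^*$ of $F(x) = 0$ is unique in

    $$\bar B(x^0, \rho_0) \cup \big(D \cap B(x^0, \bar\rho_0)\big) , \quad \bar\rho_0 := \frac{1 + \sqrt{1 - 2 h_0}}{\gamma \, \|F'(x^0)^{-1}\|} .$$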
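    The quantities of Theorem 1.1 can be evaluated by hand for a scalar model problem. The following is a minimal sketch, not part of the handout: the test function $F(x) = x^2 - 2$, the initial guess $x^0 = 1.5$, the Lipschitz constant $\gamma = 2$ of $F'(x) = 2x$, and the helper names `kantorovich_quantities` and `newton` are all assumptions chosen for illustration.

```python
import math


def kantorovich_quantities(F, dF, x0, gamma):
    """Return (alpha, h0, rho0, rho0_bar) from (1.1)-(1.4) for a scalar problem.

    gamma is an assumed Lipschitz constant of F' on the domain D.
    """
    inv_norm = 1.0 / abs(dF(x0))        # ||F'(x0)^{-1}||
    alpha = abs(F(x0)) * inv_norm       # ||F'(x0)^{-1} F(x0)||, cf. (1.1)
    h0 = alpha * gamma * inv_norm       # Kantorovich quantity, cf. (1.3)
    if h0 >= 0.5:
        raise ValueError("condition h0 < 1/2 of (1.3) is violated")
    root = math.sqrt(1.0 - 2.0 * h0)
    rho0 = (1.0 - root) / (gamma * inv_norm)      # radius in (1.4)
    rho0_bar = (1.0 + root) / (gamma * inv_norm)  # uniqueness radius in (iv)
    return alpha, h0, rho0, rho0_bar


def newton(F, dF, x0, steps=6):
    """Plain Newton iteration x^{k+1} = x^k + dx^k with F'(x^k) dx^k = -F(x^k)."""
    x = x0
    for _ in range(steps):
        x += -F(x) / dF(x)
    return x


F = lambda x: x * x - 2.0
dF = lambda x: 2.0 * x

alpha, h0, rho0, rho0_bar = kantorovich_quantities(F, dF, 1.5, gamma=2.0)
x_star = newton(F, dF, 1.5)
print(h0, rho0, rho0_bar, x_star)
```

    For this quadratic example the limit lies on the boundary of the closed ball $\bar B(x^0, \rho_0)$, so the radius in (1.4) cannot be reduced here.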

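    Proof. We have

    $$\|F'(x^k) - F'(x^0)\| \le \gamma \, \|x^k - x^0\| \le \gamma \, t_k$$

    for some upper bound $t_k$, $k \in \mathbb{N}$. If we can prove $x^k \in B(x^0, \rho_0)$ and $\bar t_k := \gamma \, \|F'(x^0)^{-1}\| \, t_k < 1$, $k \in \mathbb{N}$, then by the Banach perturbation lemma $F'(x^k)$ is invertible with

    $$\|F'(x^k)^{-1}\| \le \frac{\|F'(x^0)^{-1}\|}{1 - \|F'(x^0)^{-1}\| \, \|F'(x^k) - F'(x^0)\|} \le \frac{\|F'(x^0)^{-1}\|}{1 - \gamma \, \|F'(x^0)^{-1}\| \, \|x^k - x^0\|} \le \frac{\|F'(x^0)^{-1}\|}{1 - \bar t_k} =: \beta_k . \tag{1.5}$$

    We prove $x^k \in B(x^0, \rho_0)$ and $\bar t_k < 1$, $k \in \mathbb{N}$, by induction on $k$: For $k = 1$ we have

    $$\|x^1 - x^0\| = \|F'(x^0)^{-1} F(x^0)\| \le \alpha = \frac{h_0}{\gamma \, \|F'(x^0)^{-1}\|} < \rho_0 ,$$

    since $h_0 < 1 - \sqrt{1 - 2 h_0}$, and

    $$\bar t_1 := \gamma \, \|F'(x^0)^{-1}\| \, t_1 = \gamma \, \|F'(x^0)^{-1}\| \, \|x^1 - x^0\| = \gamma \, \|F'(x^0)^{-1}\| \, \|F'(x^0)^{-1} F(x^0)\| \le \alpha \, \gamma \, \|F'(x^0)^{-1}\| = h_0 .$$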

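    $$\begin{aligned}
    &= \int_{s=0}^{1} \langle F(x^k + s \Delta x^k) - F(x^k), \Delta x^k \rangle \, ds - \tfrac{1}{2} \langle F'(x^k) \Delta x^k, \Delta x^k \rangle \\
    &= \int_{s=0}^{1} s \int_{t=0}^{1} \langle F'(x^k + s t \Delta x^k) \Delta x^k, \Delta x^k \rangle \, dt \, ds - \tfrac{1}{2} \langle F'(x^k) \Delta x^k, \Delta x^k \rangle \\
    &= \int_{s=0}^{1} s \int_{t=0}^{1} \langle \underbrace{(F'(x^k + s t \Delta x^k) - F'(x^k)) \Delta x^k}_{=: \, w_k}, \Delta x^k \rangle \, dt \, ds \\
    &= \int_{s=0}^{1} s \int_{t=0}^{1} \langle F'(x^k)^{-1/2} w_k, F'(x^k)^{1/2} \Delta x^k \rangle \, dt \, ds \\
    &\le \int_{s=0}^{1} s \int_{t=0}^{1} \underbrace{\|F'(x^k)^{-1/2} w_k\|}_{\le \, \omega \, s \, t \, \|F'(x^k)^{1/2} \Delta x^k\|^2} \, \|F'(x^k)^{1/2} \Delta x^k\| \, dt \, ds \\
    &\le \underbrace{\|F'(x^k)^{1/2} \Delta x^k\|^2}_{= \, \epsilon_k} \, \underbrace{\omega \, \|F'(x^k)^{1/2} \Delta x^k\|}_{= \, h_k} \int_{s=0}^{1} s^2 \int_{t=0}^{1} t \, dt \, ds = \tfrac{1}{6} \, h_k \, \epsilon_k ,
    \end{aligned}$$

    which proves (1.34).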

    Using the right-hand side of (1.34) and $h_k < 2$ yields

    $$f(x^k) - f(x^{k+1}) \le \big(\tfrac{1}{2} + \tfrac{1}{6} h_k\big) \, \epsilon_k < \tfrac{5}{6} \, \epsilon_k .$$

    Together, this proves (1.35).

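    In order to prove (iii), we use (1.34) and obtain

    $$0 \le \omega^2 \big(f(x^0) - f(x^*)\big) = \omega^2 \sum_{k=0}^{\infty} \big(f(x^k) - f(x^{k+1})\big) < \tfrac{5}{6} \, \omega^2 \sum_{k=0}^{\infty} \epsilon_k = \tfrac{5}{6} \sum_{k=0}^{\infty} h_k^2 = \tfrac{5}{6} \cdot 4 \sum_{k=0}^{\infty} \big(\tfrac{1}{2} h_k\big)^2 .$$

    Using

    $$\tfrac{1}{2} h_{k+1} \le \big(\tfrac{1}{2} h_k\big)^2 \le \tfrac{1}{2} h_k < 1 ,$$

    we further get

    $$\big(\tfrac{1}{2} h_0\big)^2 + \big(\tfrac{1}{2} h_1\big)^2 + \dots \le \big(\tfrac{1}{2} h_0\big)^2 + \big(\tfrac{1}{2} h_0\big)^4 + \big(\tfrac{1}{2} h_1\big)^4 + \dots \le \tfrac{1}{4} h_0^2 \sum_{k=0}^{\infty} \big(\tfrac{1}{2} h_0\big)^k = \frac{\tfrac{1}{4} h_0^2}{1 - \tfrac{h_0}{2}} ,$$

    which proves (1.36).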


    3. Inexact Newton Methods

    We recall that Newton's method computes iterates successively as the solution of linear algebraic systems

    $$F'(x^k) \, \Delta x^k = - F(x^k) , \quad k \in \mathbb{N}_0 , \tag{1.37}$$
    $$x^{k+1} = x^k + \Delta x^k .$$

    The classical convergence theorems of Newton-Kantorovich and Newton-Mysovskikh and their affine covariant, affine contravariant, and affine conjugate versions assume the exact solution of (1.37).
    In practice, however, in particular if the dimension is large, (1.37) will be solved by an iterative method. In this case, we end up with an outer/inner iteration, where the outer iterations are the Newton steps and the inner iterations result from the application of an iterative scheme to (1.37). It is important to tune the outer and inner iterations and to keep track of the iteration errors.
    With regard to affine covariance, affine contravariance, and affine conjugacy, the iterative scheme for the inner iterations has to be chosen in such a way that it easily provides information about the

    error norm in case of affine covariance,
    residual norm in case of affine contravariance, and
    energy norm in case of affine conjugacy.

    Except for convex optimization, we cannot expect $F'(x)$, $x \in D$, to be symmetric positive definite. Hence, for affine covariance and affine contravariance we have to pick iterative solvers that are designed for nonsymmetric matrices. Appropriate candidates are

    CGNE (Conjugate Gradient for the Normal Equations) in case of affine covariance,

    GMRES (Generalized Minimum RESidual) in case of affine contravariance, and

    PCG (Preconditioned Conjugate Gradient) in case of affine conjugacy.


    3.1 Affine Covariant Inexact Newton Methods

    3.1.1 CGNE (Conjugate Gradient for the Normal Equations)

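    We assume $A \in \mathbb{R}^{n \times n}$ to be a regular, nonsymmetric matrix and $b \in \mathbb{R}^n$ to be given and look for $y \in \mathbb{R}^n$ as the unique solution of the linear algebraic system

    $$A y = b . \tag{1.38}$$

    As the name already suggests, CGNE is the conjugate gradient method applied to the normal equations: It solves the system

    $$A A^T z = b \tag{1.39}$$

    for $z$ and then computes $y$ according to

    $$y = A^T z . \tag{1.40}$$

    The implementation of CGNE is as follows:

    CGNE Initialization:

    Given an initial guess $y_0 \in \mathbb{R}^n$, compute the residual $r_0 = b - A y_0$ and set

    $$p_0 = 0 , \quad \beta_0 = 0 , \quad \sigma_0 = \|r_0\|^2 .$$

    CGNE Iteration Loop: For $1 \le i \le i_{max}$ compute

    $$p_i = A^T r_{i-1} + \beta_{i-1} \, p_{i-1} , \quad \alpha_i = \frac{\sigma_{i-1}}{\|p_i\|^2} ,$$
    $$y_i = y_{i-1} + \alpha_i \, p_i , \quad \gamma_{i-1}^2 = \alpha_i \, \sigma_{i-1} ,$$
    $$r_i = r_{i-1} - \alpha_i \, A p_i , \quad \sigma_i = \|r_i\|^2 , \quad \beta_i = \frac{\sigma_i}{\sigma_{i-1}} .$$

    CGNE has the error minimizing property

    $$\|y - y_i\| = \min_{v \in K_i(A^T r_0, A^T A)} \|y - v\| , \tag{1.41}$$

    where $K_i(A^T r_0, A^T A)$ stands for the Krylov subspace

    $$K_i(A^T r_0, A^T A) := \mathrm{span}\{A^T r_0, (A^T A) A^T r_0, ..., (A^T A)^{i-1} A^T r_0\} . \tag{1.42}$$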
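    The CGNE recursion above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the handout's code: the function name `cgne`, the tridiagonal test matrix, and the right-hand side are hypothetical, and the update is written as $y_i = y_{i-1} + \alpha_i p_i$, which is the sign that matches $r_i = r_{i-1} - \alpha_i A p_i$ for $r = b - Ay$.

```python
import numpy as np


def cgne(A, b, y0, imax):
    """CGNE for A y = b; returns the iterates y_i and the quantities gamma_j^2."""
    y = np.asarray(y0, dtype=float).copy()
    r = b - A @ y                      # r_0 = b - A y_0
    p = np.zeros_like(y)               # p_0 = 0
    beta = 0.0                         # beta_0 = 0
    sigma = r @ r                      # sigma_0 = ||r_0||^2
    iterates, gamma2 = [y.copy()], []
    for _ in range(imax):
        if sigma == 0.0:               # exact solution reached
            break
        p = A.T @ r + beta * p         # p_i = A^T r_{i-1} + beta_{i-1} p_{i-1}
        alpha = sigma / (p @ p)        # alpha_i = sigma_{i-1} / ||p_i||^2
        gamma2.append(alpha * sigma)   # gamma_{i-1}^2 = ||y_i - y_{i-1}||^2
        y = y + alpha * p
        r = r - alpha * (A @ p)
        sigma_new = r @ r
        beta = sigma_new / sigma       # beta_i = sigma_i / sigma_{i-1}
        sigma = sigma_new
        iterates.append(y.copy())
    return iterates, gamma2


# Hypothetical nonsymmetric, diagonally dominant test problem.
n = 6
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) - 2.0 * np.diag(np.ones(n - 1), -1)
b = np.arange(1.0, n + 1.0)

iterates, gamma2 = cgne(A, b, np.zeros(n), imax=n)
y_exact = np.linalg.solve(A, b)
errors = [np.linalg.norm(y_exact - y) for y in iterates]
print(errors)
```

    The printed error norms should decrease monotonically, in line with the error minimizing property (1.41).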


    Lemma 3.1 Representation of the iteration error

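    Let $\epsilon_i := \|y - y_i\|^2$ be the square of the CGNE iteration error with respect to the $i$-th iterate. Then, there holds

    $$\epsilon_i = \sum_{j=i}^{n-1} \gamma_j^2 . \tag{1.43}$$

    Proof. CGNE has the Galerkin orthogonality

    $$(y_i - y_0, y_{i+m} - y_i) = 0 , \quad m \in \mathbb{N} . \tag{1.44}$$

    Setting $m = 1$, this implies the orthogonal decomposition

    $$\|y_{i+1} - y_0\|^2 = \|y_{i+1} - y_i\|^2 + \|y_i - y_0\|^2 , \tag{1.45}$$

    which readily gives

    $$\|y_i - y_0\|^2 = \sum_{j=0}^{i-1} \|y_{j+1} - y_j\|^2 = \sum_{j=0}^{i-1} \gamma_j^2 . \tag{1.46}$$

    On the other hand, observing $y_n = y$, for $m = n - i$ the Galerkin orthogonality yields

    $$\underbrace{\|y - y_0\|^2}_{= \, \sum_{j=0}^{n-1} \gamma_j^2} = \underbrace{\|y - y_i\|^2}_{= \, \epsilon_i} + \underbrace{\|y_i - y_0\|^2}_{= \, \sum_{j=0}^{i-1} \gamma_j^2} . \tag{1.47}$$

    Computable lower bound for the iteration error

    It follows readily from Lemma 3.1 that the computable quantity

    $$[\epsilon_i] := \sum_{j=i}^{i+m} \gamma_j^2 , \quad m \in \mathbb{N} , \tag{1.48}$$

    provides a lower bound for the iteration error. In practice, we will test the relative error norm according to

    $$\delta_i := \frac{\|y - y_i\|}{\|y_i\|} \approx \frac{\sqrt{[\epsilon_i]}}{\|y_i\|} \le \delta , \tag{1.49}$$

    where $\delta$ is a user specified accuracy.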
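    Lemma 3.1 and the look-ahead bound (1.48) can be checked numerically. The sketch below is an illustration only: the helper names `cgne_gamma2` and `lookahead_bound`, the test matrix, and the look-ahead depth `m = 2` are assumptions, and the comparison with the true error uses a direct solve that would not be available in practice.

```python
import numpy as np


def cgne_gamma2(A, b, imax):
    """Run CGNE from y_0 = 0; return the iterates and gamma_j^2 = alpha_{j+1} sigma_j."""
    y = np.zeros_like(b, dtype=float)
    r = b - A @ y
    p = np.zeros_like(y)
    beta, sigma = 0.0, r @ r
    iterates, gamma2 = [y.copy()], []
    for _ in range(imax):
        if sigma == 0.0:
            break
        p = A.T @ r + beta * p
        alpha = sigma / (p @ p)
        gamma2.append(alpha * sigma)
        y = y + alpha * p
        r = r - alpha * (A @ p)
        sigma_new = r @ r
        beta, sigma = sigma_new / sigma, sigma_new
        iterates.append(y.copy())
    return iterates, gamma2


def lookahead_bound(gamma2, i, m):
    """[eps_i] = sum_{j=i}^{i+m} gamma_j^2, truncated at the terms computed so far."""
    return float(sum(gamma2[i:i + m + 1]))


# Hypothetical nonsymmetric test problem.
n = 8
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) - 2.0 * np.diag(np.ones(n - 1), -1)
b = np.arange(1.0, n + 1.0)

iterates, gamma2 = cgne_gamma2(A, b, imax=n)
y_exact = np.linalg.solve(A, b)
true_eps = [float(np.linalg.norm(y_exact - y) ** 2) for y in iterates]
bounds = [lookahead_bound(gamma2, i, m=2) for i in range(len(gamma2))]

for i, (lower, exact) in enumerate(zip(bounds, true_eps)):
    print(i, lower, exact)
```

    The estimate for iterate $i$ only becomes available after $m$ further inner steps, which is the price of a computable lower bound.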


    3.1.2 Convergence of affine covariant inexact Newton methods

    We denote by $\delta x^k \in \mathbb{R}^n$ the result of an inner iteration, e.g., CGNE, for the solution of (1.37). Then, it is easy to see that the iteration error $\delta x^k - \Delta x^k$ satisfies the error equation

    $$F'(x^k)(\delta x^k - \Delta x^k) = F(x^k) + F'(x^k) \, \delta x^k =: r^k . \tag{1.50}$$

    We will measure the impact of the inexact solution of (1.37) by the relative error

    $$\delta_k := \frac{\|\Delta x^k - \delta x^k\|}{\|\delta x^k\|} . \tag{1.51}$$

    Theorem 3.1 Affine covariant convergence theorem for the inexact Newton method. Part I: Linear convergence

    Suppose that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable on $D$ with invertible Fréchet derivatives $F'(x)$, $x \in D$. Assume further that the following affine covariant Lipschitz condition is satisfied

    $$\|F'(z)^{-1} \big(F'(y) - F'(x)\big) v\| \le \omega \, \|y - x\| \, \|v\| , \tag{1.52}$$

    where $x, y, z \in D$, $v \in \mathbb{R}^n$.
    Assume that $x^0 \in D$ is an initial guess for the outer Newton iteration and that $\delta x_0 = 0$ is chosen as the start iterate for the inner iteration. Consider the Kantorovich quantities

    $$h_k := \omega \, \|\Delta x^k\| , \quad h_k^\delta := \omega \, \|\delta x^k\| = \frac{h_k}{\sqrt{1 + \delta_k^2}} \tag{1.53}$$

    associated with the outer and inner iteration.
    Assume that

    $$h_0 < 2 \bar\Theta , \quad 0 \le \bar\Theta < 1 , \tag{1.54}$$

    and control the inner iterations according to

    $$\vartheta(h_k, \delta_k) := \frac{\tfrac{1}{2} h_k^\delta + \delta_k (1 + h_k^\delta)}{\sqrt{1 + \delta_k^2}} \le \bar\Theta < 1 , \tag{1.55}$$

    which implies linear convergence.
    Note that a necessary condition for $\vartheta(h_k, \delta_k) \le \bar\Theta$ is that it holds true for $k = 0$, which is satisfied due to assumption (1.54).


    Then, there holds:

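    (i) The Newton CGNE iterates $x^k$, $k \in \mathbb{N}_0$, stay in

    $$\bar B(x^0, \rho) , \quad \rho := \frac{\|\delta x^0\|}{1 - \bar\Theta} \tag{1.56}$$

    and converge linearly to some $x^* \in \bar B(x^0, \rho)$ with $F(x^*) = 0$.
    (ii) The exact Newton increments decrease monotonically according to

    $$\frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} \le \bar\Theta , \tag{1.57}$$

    whereas for the inexact Newton increments we have

    $$\frac{\|\delta x^{k+1}\|}{\|\delta x^k\|} \le \sqrt{\frac{1 + \delta_k^2}{1 + \delta_{k+1}^2}} \; \bar\Theta . \tag{1.58}$$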

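    Proof. By elementary calculations we find

    $$\begin{aligned}
    \Delta x^{k+1} &= - F'(x^{k+1})^{-1} F(x^{k+1}) \\
    &= - F'(x^{k+1})^{-1} \big[F(x^{k+1}) - F(x^k)\big] - F'(x^{k+1})^{-1} \underbrace{F(x^k)}_{= \, r^k - F'(x^k) \delta x^k} \\
    &= - F'(x^{k+1})^{-1} \big[F(x^{k+1}) - F(x^k) - F'(x^k) \delta x^k\big] - F'(x^{k+1})^{-1} \underbrace{r^k}_{= \, F'(x^k)(\delta x^k - \Delta x^k)} \\
    &= \underbrace{- \int_0^1 F'(x^{k+1})^{-1} \big[F'(x^k + t \delta x^k) - F'(x^k)\big] \delta x^k \, dt}_{=: \, I} + \underbrace{F'(x^{k+1})^{-1} F'(x^k)(\Delta x^k - \delta x^k)}_{=: \, II} .
    \end{aligned} \tag{1.59}$$

    Using the affine covariant Lipschitz condition (1.52), the first term on the right-hand side in (1.59) can be estimated according to

    $$\|I\| \le \omega \, \|\delta x^k\|^2 \int_0^1 t \, dt = \tfrac{1}{2} \, \omega \, \|\delta x^k\|^2 . \tag{1.60}$$

    For the second term we obtain by the same argument

    $$\begin{aligned}
    \|II\| &= \|F'(x^{k+1})^{-1} \big[F'(x^k)(\Delta x^k - \delta x^k) - F'(x^{k+1})(\Delta x^k - \delta x^k) + F'(x^{k+1})(\Delta x^k - \delta x^k)\big]\| \\
    &\le \|F'(x^{k+1})^{-1} \big(F'(x^{k+1}) - F'(x^k)\big)(\Delta x^k - \delta x^k)\| + \|F'(x^{k+1})^{-1} F'(x^{k+1})(\Delta x^k - \delta x^k)\| \\
    &\le \omega \, \|\delta x^k\| \, \|\Delta x^k - \delta x^k\| + \|\Delta x^k - \delta x^k\| .
    \end{aligned} \tag{1.61}$$

    Combining (1.60) and (1.61) yields

    $$\frac{\|\Delta x^{k+1}\|}{\|\delta x^k\|} \le \underbrace{\tfrac{1}{2} \, \omega \, \|\delta x^k\|}_{= \, \frac{1}{2} h_k^\delta} + \underbrace{\omega \, \|\Delta x^k - \delta x^k\|}_{= \, \delta_k h_k^\delta} + \underbrace{\frac{\|\Delta x^k - \delta x^k\|}{\|\delta x^k\|}}_{= \, \delta_k} = \tfrac{1}{2} h_k^\delta + \delta_k (1 + h_k^\delta) .$$

    Observing (1.53), we finally get

    $$\frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} \le \vartheta(h_k, \delta_k) = \frac{\tfrac{1}{2} h_k^\delta + \delta_k (1 + h_k^\delta)}{\sqrt{1 + \delta_k^2}} \le \bar\Theta < 1 , \tag{1.62}$$

    which implies linear convergence.
    Note that a necessary condition for $\vartheta(h_k, \delta_k) \le \bar\Theta$ is that it holds true for $k = 0$, which is satisfied due to assumption (1.54).
    For the contraction of the inexact Newton increments we get

    $$\frac{\|\delta x^{k+1}\|}{\|\delta x^k\|} = \sqrt{\frac{1 + \delta_k^2}{1 + \delta_{k+1}^2}} \; \frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} \le \sqrt{\frac{1 + \delta_k^2}{1 + \delta_{k+1}^2}} \; \bar\Theta . \tag{1.63}$$

    It can be easily shown that $\{x^k\}_{\mathbb{N}_0}$ is a Cauchy sequence in $\bar B(x^0, \rho)$. Consequently, there exists $x^* \in \bar B(x^0, \rho)$ such that $x^k \to x^*$ $(k \to \infty)$. Since

    $$\underbrace{F'(x^k) \, \delta x^k}_{\to \, 0} = \underbrace{- F(x^k) + r^k}_{\to \, - F(x^*)} ,$$

    we conclude $F(x^*) = 0$.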


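    Theorem 3.2 Affine covariant convergence theorem for the inexact Newton method. Part II: Quadratic convergence

    Under the same assumptions on $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ as in Theorem 3.1 suppose that the initial guess $x^0 \in D$ satisfies

    $$h_0 < \frac{2}{1 + \rho} \tag{1.64}$$

    for some $\rho > 0$ and control the inner iterations such that

    $$\delta_k \le \frac{\rho}{2} \, \frac{h_k^\delta}{1 + h_k^\delta} . \tag{1.65}$$

    Then, there holds:

    (i) The Newton CGNE iterates $x^k$, $k \in \mathbb{N}_0$, stay in

    $$\bar B(x^0, \bar\rho) , \quad \bar\rho := \frac{\|\delta x^0\|}{1 - \frac{1 + \rho}{2} \, h_0^\delta} \tag{1.66}$$

    and converge quadratically to some $x^* \in \bar B(x^0, \bar\rho)$ with $F(x^*) = 0$.
    (ii) The exact Newton increments and the inexact Newton increments decrease quadratically according to

    $$\|\Delta x^{k+1}\| \le \omega \, \frac{1 + \rho}{2} \, \|\Delta x^k\|^2 , \tag{1.67}$$

    $$\|\delta x^{k+1}\| \le \omega \, \frac{1 + \rho}{2} \, \|\delta x^k\|^2 . \tag{1.68}$$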

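    Proof. We proceed as in the proof of Theorem 3.1 to obtain

    $$\frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} \le \vartheta(h_k, \delta_k) = \frac{\tfrac{1}{2} h_k^\delta + \delta_k (1 + h_k^\delta)}{\sqrt{1 + \delta_k^2}}$$

    and

    $$\frac{\|\delta x^{k+1}\|}{\|\delta x^k\|} = \sqrt{\frac{1 + \delta_k^2}{1 + \delta_{k+1}^2}} \; \frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} .$$

    In view of (1.65) we get the further estimates

    $$\frac{\|\Delta x^{k+1}\|}{\|\Delta x^k\|} \le \frac{1 + \rho}{2} \, \frac{h_k^\delta}{\sqrt{1 + \delta_k^2}} \le \frac{1 + \rho}{2} \, h_k$$

    and

    $$\frac{\|\delta x^{k+1}\|}{\|\delta x^k\|} \le \frac{1 + \rho}{2} \, \frac{h_k^\delta}{\sqrt{1 + \delta_{k+1}^2}} \le \frac{1 + \rho}{2} \, h_k^\delta ,$$

    from which (1.67) and (1.68) follow by the definition of the Kantorovich quantities.
    In order to deduce quadratic convergence we have to make sure that the initial increments $(k = 0)$ are small enough, i.e.,

    $$\frac{1 + \rho}{2} \, h_0^\delta \le \frac{1 + \rho}{2} \, h_0 < 1 . \tag{1.69}$$

    Furthermore, (1.68) and (1.69) allow us to show that the iterates $x^k$, $k \in \mathbb{N}$, stay in $\bar B(x^0, \bar\rho)$. Indeed, (1.68) implies

    $$\|\delta x^j\| \le \frac{1 + \rho}{2} \, h_{j-1}^\delta \, \|\delta x^{j-1}\| \le \frac{1 + \rho}{2} \, h_0^\delta \, \|\delta x^{j-1}\| , \quad j \in \mathbb{N} ,$$

    and hence,

    $$\|x^k - x^0\| \le \sum_{j=0}^{k} \|\delta x^j\| \le \|\delta x^0\| \sum_{j=0}^{k} \Big(\frac{1 + \rho}{2} \, h_0^\delta\Big)^j \le \frac{\|\delta x^0\|}{1 - \frac{1 + \rho}{2} \, h_0^\delta} .$$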

    3.1.3 Algorithmic aspects of affine covariant inexact Newton methods

    (i) Convergence monitor

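    Let us assume that the quantity $\bar\Theta < 1$ in both the linear convergence mode and the quadratic convergence mode has been specified and let us further assume that we use CGNE with $\delta x^k_0 = 0$ in the inner iteration.
    Then, (1.58) suggests the monotonicity test

    $$\tilde\Theta_k := \sqrt{\frac{1 + \tilde\delta_{k+1}^2}{1 + \tilde\delta_k^2}} \; \frac{\|\delta x^{k+1}\|}{\|\delta x^k\|} \le \bar\Theta , \tag{1.70}$$

    where $\tilde\delta_k$ and $\tilde\delta_{k+1}$ are computationally available estimates of $\delta_k$ and $\delta_{k+1}$.

    (ii) Termination criterion

    We recall that the termination criterion for the exact Newton iteration with respect to a user specified accuracy XTOL is given by

    $$\frac{\|\Delta x^k\|}{1 - \Theta_{k-1}^2} \le \mathrm{XTOL} .$$

    According to (1.53) we have

    $$\|\Delta x^k\| = \sqrt{1 + \delta_k^2} \; \|\delta x^k\| .$$

    Consequently, replacing $\Theta_{k-1}$ and $\delta_k$ by the computable quantities $\tilde\Theta_{k-1}$ and $\tilde\delta_k$, we arrive at the termination criterion

    $$\frac{\sqrt{1 + \tilde\delta_k^2} \; \|\delta x^k\|}{1 - \tilde\Theta_{k-1}^2} \le \mathrm{XTOL} . \tag{1.71}$$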

    (iii) Balancing outer and inner iterations

    According to (1.55) of Theorem 3.1, in the linear convergence mode the adaptive termination criterion for the inner iteration is

    $$\vartheta(h_k, \delta_k) := \frac{\tfrac{1}{2} h_k^\delta + \delta_k (1 + h_k^\delta)}{\sqrt{1 + \delta_k^2}} \le \bar\Theta < 1 .$$

    On the other hand, in view of (1.65) of Theorem 3.2, in the quadratic convergence mode the termination criterion is

    $$\delta_k \le \frac{\rho}{2} \, \frac{h_k^\delta}{1 + h_k^\delta} .$$

    Since the theoretical Kantorovich quantities (cf. (1.53))

    $$h_k^\delta = \omega \, \|\delta x^k\| = \frac{h_k}{\sqrt{1 + \delta_k^2}}$$

    are not directly accessible, we have to replace them by computationally available estimates $[h_k^\delta]$.
    We recall that for $h_k$ we have the a priori estimate

    $$[h_k] = 2 \, \Theta_{k-1}^2 \le h_k .$$

    Consequently, replacing $\delta_k$ by $\tilde\delta_k$, $h_k$ by $[h_k]$, and $\Theta_{k-1}$ by $\tilde\Theta_{k-1}$ (cf. (1.70)), we get the a priori estimates

    $$[h_k^\delta] = \frac{[h_k]}{\sqrt{1 + \tilde\delta_k^2}} , \quad [h_k] = 2 \, \tilde\Theta_{k-1}^2 , \quad k \in \mathbb{N} . \tag{1.72}$$

    For $k = 0$, we choose $\tilde\delta_0 = \delta_0 = \tfrac{1}{4}$.

    In practice, for $k \ge 1$ we begin with the quadratic convergence mode and switch to the linear convergence mode as soon as the approximate contraction factor $\tilde\Theta_k$ is below some prespecified threshold value $\bar\Theta \le \tfrac{1}{2}$.

    (iii)1 Quadratic convergence mode

    The computationally realizable termination criterion for the inner iteration in the quadratic convergence mode is

    $$\tilde\delta_k \le \frac{\rho}{2} \, \frac{[h_k^\delta]}{1 + [h_k^\delta]} . \tag{1.73}$$

    Inserting (1.72) into (1.73), we obtain a simple nonlinear equation in $\tilde\delta_k$.
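    The scalar equation obtained from (1.72) and (1.73) can be solved by any bracketing method. The following sketch uses bisection; it is an illustration under assumptions, not the handout's procedure: the function name `delta_bound_quadratic` and the sample values $\tilde\Theta_{k-1} = 0.4$, $\rho = 1$ are hypothetical.

```python
import math


def delta_bound_quadratic(theta_prev, rho, iters=60):
    """Solve delta = (rho/2) * H(delta) / (1 + H(delta)) by bisection,

    where H(delta) = [h_k] / sqrt(1 + delta^2) and [h_k] = 2 * theta_prev^2,
    i.e. (1.73) with equality after inserting (1.72).
    """
    h = 2.0 * theta_prev ** 2

    def g(delta):
        h_delta = h / math.sqrt(1.0 + delta ** 2)
        return 0.5 * rho * h_delta / (1.0 + h_delta)

    # g is bounded by rho/2, so the root of delta - g(delta) lies in [0, rho/2].
    lo, hi = 0.0, 0.5 * rho
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


delta_k = delta_bound_quadratic(theta_prev=0.4, rho=1.0)
print(delta_k)
```

    The returned value is the largest $\tilde\delta_k$ that still satisfies (1.73); the inner iteration can stop as soon as its error estimate falls below it.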

    Remark 3.1 Validity of the approximate termination criterion

    Observing that the right-hand side in (1.73) is a monotonically increasing function of $[h_k^\delta]$, and taking $[h_k^\delta] \le h_k^\delta$ into account, it follows that for $\delta_k \le \tilde\delta_k$ the approximate termination criterion (1.73) implies the exact termination criterion (1.65).

    Remark 3.2 Computational work in the quadratic convergence mode

    Since $\delta_k \to 0$ $(k \to \infty)$ is enforced, it follows that:
    The more the iterates $x^k$ approach the solution $x^*$, the more computational work is required for the inner iterations to guarantee quadratic convergence of the outer iteration.

    (iii)2 Linear convergence mode

    We switch to the linear convergence mode, once the criterion

    $$\tilde\Theta_k < \bar\Theta \tag{1.74}$$

    is met.
    The computationally realizable termination criterion for the inner iteration in the linear convergence mode is

    $$[\vartheta(h_k, \delta_k)] := \vartheta([h_k], \tilde\delta_k) = \frac{\tfrac{1}{2} [h_k^\delta] + \tilde\delta_k (1 + [h_k^\delta])}{\sqrt{1 + \tilde\delta_k^2}} \le \bar\Theta . \tag{1.75}$$

    Remark 3.3 Validity of the approximate termination criterion

    Since the right-hand side in (1.75) is a monotonically increasing function in $[h_k^\delta]$ and $[h_k^\delta] \le h_k^\delta$, the estimate provided by (1.75) may be too small and thus result in an overestimation of $\tilde\delta_k$. However, since the exact quantities and their a priori estimates both tend to zero as $k$ approaches infinity, asymptotically we may rely on (1.75).

    In practice, we require the monotonicity test (1.70) in CGNE and run the inner iterations until $\tilde\delta_k$ satisfies (1.75) or divergence occurs, i.e.,

    $$\tilde\Theta_k > 2 \bar\Theta .$$

    Remark 3.4 Computational work in the linear convergence mode

    As opposed to the quadratic convergence mode, we observe:

    The more the iterates $x^k$ approach the solution $x^*$, the less computational work is required for the inner iterations to guarantee linear convergence of the outer iteration.


    3.2 Affine Contravariant Inexact Newton Methods

    3.2.1 GMRES (Generalized Minimum RESidual)

    The Generalized Minimum RESidual Method (GMRES) is an iterative solver for nonsymmetric linear algebraic systems which generates an orthogonal basis of the Krylov subspace

    $$K_i(r_0, A) := \mathrm{span}\{r_0, A r_0, ..., A^{i-1} r_0\} \tag{1.76}$$

    by a modified Gram-Schmidt orthogonalization called the Arnoldi method. The inner product coefficients are stored in an upper Hessenberg matrix so that an approximate solution can be obtained by the solution of a least-squares problem in terms of that Hessenberg matrix:

    GMRES Initialization:

    Given an initial guess $y_0 \in \mathbb{R}^n$, compute the residual $r_0 = b - A y_0$ and set

    $$\beta := \|r_0\| , \quad v_1 := \frac{r_0}{\beta} , \quad V_1 := v_1 . \tag{1.77}$$

    GMRES Iteration Loop: For $1 \le i \le i_{max}$:

    I. Orthogonalization:

    $$\hat v_{i+1} = A v_i - V_i h_i , \tag{1.78}$$
    where
    $$h_i = V_i^T A v_i . \tag{1.79}$$

    II. Normalization:

    $$v_{i+1} = \frac{\hat v_{i+1}}{\|\hat v_{i+1}\|} . \tag{1.80}$$

    III. Update:

    $$V_{i+1} = \begin{pmatrix} V_i & v_{i+1} \end{pmatrix} , \tag{1.81}$$

    $$H_i = \begin{pmatrix} h_i \\ \|\hat v_{i+1}\| \end{pmatrix} , \quad i = 1 , \tag{1.82}$$

    $$H_i = \begin{pmatrix} H_{i-1} & h_i \\ 0 & \|\hat v_{i+1}\| \end{pmatrix} , \quad i > 1 . \tag{1.83}$$

    IV. Least squares problem: Compute $z_i$ as the solution of

    $$\|\beta e_1 - H_i z_i\| = \min_{z \in \mathbb{R}^i} \|\beta e_1 - H_i z\| . \tag{1.84}$$

    V. Approximate solution:

    $$y_i = V_i z_i + y_0 . \tag{1.85}$$

    GMRES has the residual norm minimizing property

    $$\|b - A y_i\| = \min_{z \in K_i(r_0, A)} \|b - A z\| . \tag{1.86}$$

    Moreover, the inner residuals decrease monotonically

    $$\|r_{i+1}\| \le \|r_i\| , \quad i \in \mathbb{N}_0 . \tag{1.87}$$

    Termination criterion for the GMRES iteration

    The residuals satisfy the orthogonality relation

    $$(r_i, r_i - r_0) = 0 , \quad i \in \mathbb{N} , \tag{1.88}$$

    from which we readily deduce

    $$\|r_0\|^2 = \|r_i - r_0\|^2 + \|r_i\|^2 , \quad i \in \mathbb{N} . \tag{1.89}$$

    We define the relative residual norm error

    $$\eta_i := \frac{\|r_i\|}{\|r_0\|} . \tag{1.90}$$

    Clearly, $\eta_i < 1$, $i \in \mathbb{N}$, and

    $$\eta_{i+1} < \eta_i \quad \text{if} \quad \eta_i \ne 0 . \tag{1.91}$$

    Consequently, given a user specified accuracy $\bar\eta$, an appropriate adaptive termination criterion is

    $$\eta_i \le \bar\eta . \tag{1.92}$$

    We note that, in terms of $\eta_i$, (1.89) can be written as

    $$\|r_i - r_0\|^2 = (1 - \eta_i^2) \, \|r_0\|^2 . \tag{1.93}$$
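    Steps I to V together with the termination test (1.92) can be sketched as follows. This is an unrestarted textbook variant written for illustration, not the handout's implementation: the function name `gmres`, the test matrix, and the use of `numpy.linalg.lstsq` for the small Hessenberg least-squares problem are assumptions. The orthogonalization is coded as a modified Gram-Schmidt loop, which in exact arithmetic gives the same $h_i$ as (1.79).

```python
import numpy as np


def gmres(A, b, y0, imax, eta_bar):
    """Unrestarted GMRES; returns the last iterate and the history of eta_i."""
    r0 = b - A @ y0
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return y0.copy(), []
    n = b.size
    V = np.zeros((n, imax + 1))          # orthonormal Krylov basis v_1, v_2, ...
    H = np.zeros((imax + 1, imax))       # upper Hessenberg matrix
    V[:, 0] = r0 / beta
    y, etas = y0.copy(), []
    for i in range(imax):
        w = A @ V[:, i]
        for j in range(i + 1):           # I. orthogonalization (modified Gram-Schmidt)
            H[j, i] = V[:, j] @ w
            w = w - H[j, i] * V[:, j]
        H[i + 1, i] = np.linalg.norm(w)
        rhs = np.zeros(i + 2)
        rhs[0] = beta                    # right-hand side beta * e_1
        z, *_ = np.linalg.lstsq(H[:i + 2, :i + 1], rhs, rcond=None)  # IV.
        y = y0 + V[:, :i + 1] @ z        # V. approximate solution
        etas.append(np.linalg.norm(b - A @ y) / beta)
        if etas[-1] <= eta_bar or H[i + 1, i] == 0.0:
            break
        V[:, i + 1] = w / H[i + 1, i]    # II. normalization, III. update
    return y, etas


# Hypothetical nonsymmetric test problem.
n = 8
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) - 2.0 * np.diag(np.ones(n - 1), -1)
b = np.arange(1.0, n + 1.0)

y, etas = gmres(A, b, np.zeros(n), imax=n, eta_bar=1e-10)
print(etas)
```

    Recomputing the true residual in every step keeps the sketch short; production codes usually update the residual norm through Givens rotations of the Hessenberg matrix instead.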


    3.2.2 Convergence of affine contravariant inexact Newton methods

    We denote by $\delta x^k \in \mathbb{R}^n$ the result of the inner GMRES iteration. As initial values for GMRES we choose

    $$\delta x^k_0 = 0 , \quad r^k_0 = F(x^k) . \tag{1.94}$$

    Consequently, during the inner GMRES iteration the relative error $\eta_i$, $i \in \mathbb{N}_0$, in the residuals satisfies

    $$\eta_i = \frac{\|r^k_i\|}{\|F(x^k)\|} \le 1 , \quad \eta_{i+1} < \eta_i \quad \text{if} \quad \eta_i \ne 0 . \tag{1.95}$$

    In the sequel, we drop the subindices $i$ for the inner iterations and refer to $\eta_k$ as the final value of the inner iterations at each outer iteration step $k$.

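    Theorem 3.3 Affine contravariant convergence theorem for the inexact Newton GMRES method. Part I: Linear convergence

    Suppose that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable on $D$ and let $x^0 \in D$ be some initial guess. Let further the following affine contravariant Lipschitz condition be satisfied

    $$\|\big(F'(y) - F'(x)\big)(y - x)\| \le \omega \, \|F'(x)(y - x)\|^2 , \quad x, y \in D , \quad \omega \ge 0 . \tag{1.96}$$

    Assume further that the level set

    $$L_0 := \{x \in \mathbb{R}^n \mid \|F(x)\| \le \|F(x^0)\|\} \tag{1.97}$$

    is a compact subset of $D$.
    In terms of the Kantorovich quantities

    $$h_k := \omega \, \|F(x^k)\| , \quad k \in \mathbb{N}_0 , \tag{1.98}$$

    the outer residual norms can be bounded according to

    $$\|F(x^{k+1})\| \le \Big(\eta_k + \tfrac{1}{2} (1 - \eta_k^2) \, h_k\Big) \|F(x^k)\| . \tag{1.99}$$

    Assume that

    $$h_0 < 2 \tag{1.100}$$

    and control the inner iterations according to

    $$\eta_k \le \bar\Theta - \tfrac{1}{2} h_k \tag{1.101}$$

    for some $\tfrac{h_0}{2} < \bar\Theta < 1$.

    Then, the Newton GMRES iterates $x^k$, $k \in \mathbb{N}_0$, stay in $L_0$ and converge linearly to some $x^* \in L_0$ with $F(x^*) = 0$ at an estimated rate

    $$\|F(x^{k+1})\| \le \bar\Theta \, \|F(x^k)\| . \tag{1.102}$$

    Proof. We recall that the Newton GMRES iterates satisfy

    $$F'(x^k) \, \delta x^k = - F(x^k) + r^k , \tag{1.103}$$
    $$x^{k+1} = x^k + \delta x^k . \tag{1.104}$$

    It follows from the generalized mean value theorem that

    $$F(x^{k+1}) = F(x^k) + \int_0^1 F'(x^k + t \delta x^k) \, \delta x^k \, dt . \tag{1.105}$$

    Consequently, replacing $F(x^k)$ in (1.105) by (1.103), we obtain

    $$\begin{aligned}
    \|F(x^{k+1})\| &= \Big\| \int_0^1 \big(F'(x^k + t \delta x^k) - F'(x^k)\big) \delta x^k \, dt + r^k \Big\| \\
    &\le \Big\| \int_0^1 \big(F'(x^k + t \delta x^k) - F'(x^k)\big) \delta x^k \, dt \Big\| + \|r^k\| \\
    &\le \tfrac{1}{2} \, \omega \, \|F'(x^k) \delta x^k\|^2 + \|r^k\| \\
    &= \tfrac{1}{2} \, \omega \, \|F(x^k) - r^k\|^2 + \|r^k\| .
    \end{aligned}$$

    We recall (1.93)

    $$\|r^k - F(x^k)\|^2 = (1 - \eta_k^2) \, \|F(x^k)\|^2 ,$$

    from which (1.99) can be immediately deduced.
    Now, in view of (1.101), (1.99) yields

    $$\|F(x^{k+1})\| \le \Big(\underbrace{\eta_k}_{\le \, \bar\Theta - \frac{1}{2} h_k} + \tfrac{1}{2} (1 - \eta_k^2) \, h_k\Big) \|F(x^k)\| \le \big(\bar\Theta - \tfrac{1}{2} \eta_k^2 h_k\big) \|F(x^k)\| \le \bar\Theta \, \|F(x^k)\| .$$

    Taking advantage of the previous inequality, by induction on $k$ it follows that

    $$x^k \in L_0 \subset D , \quad k \in \mathbb{N}_0 .$$

    Hence, there exists a subsequence $\mathbb{N}' \subset \mathbb{N}$ and an $x^* \in L_0$ such that $x^k \to x^*$ $(k \in \mathbb{N}')$ and $F(x^*) = 0$. Moreover, since

    $$\|F(x^{k+\ell}) - F(x^k)\| \le \|F(x^{k+\ell})\| + \|F(x^k)\| \le (1 + \bar\Theta^\ell) \, \|F(x^k)\| \le (1 + \bar\Theta^\ell) \, \bar\Theta^k \, \|F(x^0)\| \to 0 \quad (k \in \mathbb{N}, \; k \to \infty) ,$$

    the whole sequence must converge to $x^*$.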

    Theorem 3.4 Affine contravariant convergence theorem for the inexact Newton GMRES method. Part II: Quadratic convergence

    Under the same assumptions on $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ as in Theorem 3.3 suppose that the initial guess $x^0 \in D$ satisfies

    $$h_0 < \frac{2}{1 + \rho} \tag{1.106}$$

    for some $\rho > 0$ and control the inner iterations such that

    $$\frac{\eta_k}{1 - \eta_k^2} \le \frac{\rho}{2} \, h_k . \tag{1.107}$$

    Then, the Newton GMRES iterates $x^k$, $k \in \mathbb{N}_0$, stay in $L_0$ and converge quadratically to some $x^* \in L_0$ with $F(x^*) = 0$ at an estimated rate

    $$\|F(x^{k+1})\| \le \tfrac{1}{2} \, \omega \, (1 + \rho) \, (1 - \eta_k^2) \, \|F(x^k)\|^2 . \tag{1.108}$$

    Proof. Inserting (1.107) into (1.99) and observing $h_k = \omega \, \|F(x^k)\|$ gives the assertion.

    3.2.3 Algorithmic aspects of affine contravariant inexact Newton methods

    (i) Convergence monitor

    Throughout the inexact Newton GMRES iteration we use the residual monotonicity test

    $$\Theta_k := \frac{\|F(x^{k+1})\|}{\|F(x^k)\|} < 1 . \tag{1.109}$$

    The iteration is considered as divergent, if

    $$\Theta_k > \bar\Theta . \tag{1.110}$$

    (ii) Termination criterion

    As in the exact Newton iteration, specifying a residual accuracy FTOL, the termination criterion for the inexact Newton GMRES iteration is

    $$\|F(x^k)\| \le \mathrm{FTOL} . \tag{1.111}$$

    (iii) Balancing outer and inner iterations

    With regard to (1.101) of Theorem 3.3, in the linear convergence mode the adaptive termination criterion for the inner GMRES iteration is

    $$\eta_k \le \bar\Theta - \tfrac{1}{2} h_k ,$$

    whereas, in view of (1.107) of Theorem 3.4, in the quadratic convergence mode the termination criterion is

    $$\frac{\eta_k}{1 - \eta_k^2} \le \frac{\rho}{2} \, h_k .$$

    Again, we replace the theoretical Kantorovich quantities $h_k$ by some computationally easily available a priori estimates. We distinguish between the quadratic and the linear convergence mode:

    (iii)1 Quadratic convergence mode

    We recall the termination criterion (1.107) for the quadratic convergence mode

    $$\frac{\eta_k}{1 - \eta_k^2} \le \frac{\rho}{2} \, h_k .$$

    It suggests the a posteriori estimate

    $$[h_k]_2 := \frac{2 \, \Theta_k}{(1 + \rho) \, (1 - \eta_k^2)} \le h_k .$$

    In view of $h_{k+1} = \Theta_k h_k$, this implies the a priori estimate

    $$[h_{k+1}] := \Theta_k \, [h_k]_2 \le \Theta_k \, h_k = h_{k+1} . \tag{1.112}$$

    Using (1.112) in (1.107) results in the computationally feasible termination criterion

    $$\frac{\eta_k}{1 - \eta_k^2} \le \tfrac{1}{2} \, \rho \, [h_k] , \quad \rho \le 1.0 . \tag{1.113}$$


    (iii)2 Linear convergence mode

    We switch from the quadratic to the linear convergence mode, if the local contraction factor satisfies

    $$\Theta_k < \bar\Theta . \tag{1.114}$$

    The proof of the previous theorems reveals

    $$\|F(x^{k+1}) - r^k\| \le \frac{\omega}{2} \, \|F(x^k) - r^k\|^2 = \tfrac{1}{2} (1 - \eta_k^2) \, h_k \, \|F(x^k)\| . \tag{1.115}$$

    The above inequality (1.115) implies the a posteriori estimate

    $$[h_k]_1 := \frac{2 \, \|F(x^{k+1}) - r^k\|}{(1 - \eta_k^2) \, \|F(x^k)\|} \le h_k \tag{1.116}$$

    and the a priori estimate

    $$[h_{k+1}] := \Theta_k \, [h_k]_1 \le h_{k+1} . \tag{1.117}$$

    Based on (1.117) we define

    $$\bar\eta_{k+1} := \bar\Theta - \tfrac{1}{2} [h_{k+1}] . \tag{1.118}$$

    If we find

    $$\bar\eta_{k+1} < \eta_k \tag{1.119}$$

    with $\eta_k$ from (1.113), we continue the iteration in the quadratic convergence mode.
    Otherwise, we realize the linear convergence mode with some

    $$\eta_{k+1} \le \bar\eta_{k+1} . \tag{1.120}$$
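    The inner tolerances of both modes are simple closed-form functions of the Kantorovich estimate. The sketch below is illustrative only: the function names and the sample values `h = 0.8`, `theta_bar = 0.7`, `rho = 1.0` are assumptions, and `h` stands for whatever estimate $[h_k]$ is available. In the quadratic mode, $\eta / (1 - \eta^2) = c$ with $c = \rho h / 2$ is solved for its root in $[0, 1)$.

```python
import math


def eta_linear(theta_bar, h):
    """Largest eta allowed in the linear mode, cf. (1.101): eta <= theta_bar - h/2."""
    return theta_bar - 0.5 * h


def eta_quadratic(rho, h):
    """Largest eta with eta / (1 - eta^2) <= rho * h / 2, cf. (1.107) and (1.113)."""
    c = 0.5 * rho * h
    if c == 0.0:
        return 0.0
    # Positive root of c * eta^2 + eta - c = 0.
    return (math.sqrt(1.0 + 4.0 * c * c) - 1.0) / (2.0 * c)


def residual_factor(eta, h):
    """Bound on ||F(x^{k+1})|| / ||F(x^k)|| from (1.99)."""
    return eta + 0.5 * (1.0 - eta * eta) * h


h, theta_bar, rho = 0.8, 0.7, 1.0
eta_lin = eta_linear(theta_bar, h)
eta_quad = eta_quadratic(rho, h)
print(eta_lin, residual_factor(eta_lin, h))
print(eta_quad, residual_factor(eta_quad, h))
```

    With the linear-mode tolerance the factor from (1.99) stays below $\bar\Theta$, and with the quadratic-mode tolerance it equals $\tfrac{1}{2}(1 + \rho)(1 - \eta^2) h$, which is the rate in (1.108).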


    3.3 Affine Conjugate Inexact Newton Methods

    3.3.1 PCG (Preconditioned Conjugate Gradient)

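    The Preconditioned Conjugate Gradient Method (PCG) is an iterative solver for linear algebraic systems with a symmetric positive definite coefficient matrix $A \in \mathbb{R}^{n \times n}$. We recall that any symmetric positive definite matrix $C \in \mathbb{R}^{n \times n}$ defines an energy inner product $(\cdot, \cdot)_C$ according to

    $$(u, v)_C := (u, C v) , \quad u, v \in \mathbb{R}^n .$$

    The associated energy norm is denoted by $\|\cdot\|_C$.
    The PCG Method with a symmetric positive definite preconditioner $B \in \mathbb{R}^{n \times n}$ corresponds to the CG Method applied to the transformed linear algebraic system

    $$B^{1/2} A B^{1/2} (B^{-1/2} y) = B^{1/2} b .$$

    The PCG Method is implemented as follows:

    PCG Initialization:

    Given an initial guess $y_0 \in \mathbb{R}^n$, compute the residual $r_0 = b - A y_0$ and the preconditioned residual $\bar r_0 = B r_0$ and set

    $$p_0 := \bar r_0 , \quad \sigma_0 := (r_0, \bar r_0) = \|r_0\|_B^2 .$$

    PCG Iteration Loop: For $0 \le i \le i_{max}$ compute:

    $$\alpha_i = \frac{\|p_i\|_A^2}{\sigma_i} , \quad y_{i+1} = y_i + \frac{1}{\alpha_i} \, p_i , \quad \gamma_i^2 = \frac{\sigma_i}{\alpha_i} \; (= \|y_{i+1} - y_i\|_A^2) ,$$
    $$r_{i+1} = r_i - \frac{1}{\alpha_i} \, A p_i , \quad \bar r_{i+1} = B r_{i+1} , \quad \sigma_{i+1} = \|r_{i+1}\|_B^2 ,$$
    $$p_{i+1} = \bar r_{i+1} + \frac{\sigma_{i+1}}{\sigma_i} \, p_i .$$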
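    The PCG loop can be sketched in NumPy as follows. This is a minimal illustration, not the handout's code: the function name `pcg`, the tridiagonal test matrix, and the Jacobi choice $B = \mathrm{diag}(A)^{-1}$ are assumptions. As in the text, $B$ is applied to the residual, so it plays the role of an approximate inverse of $A$.

```python
import numpy as np


def pcg(A, b, y0, B, imax):
    """PCG for s.p.d. A with preconditioner B; returns iterates and gamma_i^2."""
    y = np.asarray(y0, dtype=float).copy()
    r = b - A @ y                        # r_0 = b - A y_0
    r_bar = B @ r                        # preconditioned residual
    p = r_bar.copy()                     # p_0 = B r_0
    sigma = r @ r_bar                    # sigma_0 = ||r_0||_B^2
    iterates, gamma2 = [y.copy()], []
    for _ in range(imax):
        if sigma <= 0.0:                 # exact solution reached
            break
        Ap = A @ p
        alpha = (p @ Ap) / sigma         # alpha_i = ||p_i||_A^2 / sigma_i
        y = y + p / alpha
        gamma2.append(sigma / alpha)     # gamma_i^2 = ||y_{i+1} - y_i||_A^2
        r = r - Ap / alpha
        r_bar = B @ r
        sigma_new = r @ r_bar
        p = r_bar + (sigma_new / sigma) * p
        sigma = sigma_new
        iterates.append(y.copy())
    return iterates, gamma2


# Hypothetical s.p.d. test problem with a Jacobi preconditioner.
n = 8
A = 4.0 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
B = np.diag(1.0 / np.diag(A))
b = np.arange(1.0, n + 1.0)

iterates, gamma2 = pcg(A, b, np.zeros(n), B, imax=n)
y_exact = np.linalg.solve(A, b)
errors_A = [float(np.sqrt((y_exact - y) @ A @ (y_exact - y))) for y in iterates]
print(errors_A)
```

    The stored values `gamma2` are the energy norms of the individual steps, which is the quantity needed for an error estimate of the kind used for CGNE.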


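    PCG minimizes the energy error norm

    $$\|y - y_i\|_A = \min_{z \in K_i(r_0, A)} \|y - z\|_A , \tag{1.121}$$

    where $K_i(r_0, A)$ denotes the Krylov subspace

    $$K_i(r_0, A) := \mathrm{span}\{r_0, ..., A^{i-1} r_0\} . \tag{1.122}$$

    PCG satisfies the Galerkin orthogonality

    $$(y_i - y_0, y_{i+m} - y_i)_A = 0 , \quad m \in \mathbb{N} . \tag{1.123}$$

    Denoting by $y \in \mathbb{R}^n$ the unique solution of $A y = b$ and by $\epsilon_i := \|y - y_i\|_A^2$ the square of the iteration error in the energy norm, we have the following error representation:

    Lemma 3.2 Representation of the iteration error

    The PCG iteration error satisfies

    $$\epsilon_i = \sum_{j=i}^{n-1} \gamma_j^2 . \tag{1.124}$$

    Proof. For $m = 1$ the Galerkin orthogonality implies the orthogonal decompositions

    $$\|y_{i+1} - y_0\|_A^2 = \underbrace{\|y_{i+1} - y_i\|_A^2}_{= \, \gamma_i^2} + \|y_i - y_0\|_A^2 , \tag{1.125}$$

    $$\|y_i - y_0\|_A^2 = \sum_{j=0}^{i-1} \|y_{j+1} - y_j\|_A^2 = \sum_{j=0}^{i-1} \gamma_j^2 . \tag{1.126}$$

    On the other hand, observing $y_n = y$, for $m = n - i$ the Galerkin orthogonality yields

    $$\underbrace{\|y - y_0\|_A^2}_{= \, \sum_{j=0}^{n-1} \gamma_j^2} = \underbrace{\|y - y_i\|_A^2}_{= \, \epsilon_i} + \underbrace{\|y_i - y_0\|_A^2}_{= \, \sum_{j=0}^{i-1} \gamma_j^2} . \tag{1.127}$$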


    Computable lower bound for the iteration error

    A lower bound for the iteration error in the energy norm is obviously given by

    $$[\epsilon_i] = \sum_{j=i}^{i+m} \gamma_j^2 . \tag{1.128}$$

    In the inexact Newton PCG method we will control the inner PCG iterations by the relative energy error norms

    $$\delta_i = \frac{\|y - y_i\|_A}{\|y_i\|_A} \approx \frac{\sqrt{[\epsilon_i]}}{\|y_i\|_A} \tag{1.129}$$

    and use the termination criterion

    $$\delta_i \le \delta , \tag{1.130}$$

    where $\delta$ is a user specified accuracy.

    3.3.2 Convergence of affine conjugate inexact Newton methods

    We denote by $\delta x^k \in \mathbb{R}^n$ the result of the inner PCG iteration. As initial value for PCG we choose

    $$\delta x^k_0 = 0 . \tag{1.131}$$

    Again, we will drop the subindices $i$ for the inner PCG iterations and refer to $\delta_k$ as the final value of the inner iterations at each outer iteration step $k$. We recall the Galerkin orthogonality (cf. (1.123))

    $$(\delta x^k, F'(x^k)(\delta x^k - \Delta x^k)) = (\delta x^k, r^k) = 0 . \tag{1.132}$$

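    Theorem 3.5 Affine conjugate convergence theorem for the inexact Newton PCG method. Part I: Linear convergence

    Suppose that $f : D \subset \mathbb{R}^n \to \mathbb{R}$ is a twice continuously differentiable strictly convex functional on $D$ with the first derivative $F := f'$ and the Hessian $F' = f''$ which is symmetric and uniformly positive definite. Assume that $x^0 \in D$ is some initial guess such that the level set

    $$L_0 := \{x \in D \mid f(x) \le f(x^0)\}$$

    is compact.
    Let further the following affine conjugate Lipschitz condition be satisfied

    $$\|F'(z)^{-1/2} \big(F'(y) - F'(x)\big) v\| \le \omega \, \|F'(x)^{1/2} (y - x)\| \, \|F'(x)^{1/2} v\| , \quad x, y, z \in D , \quad \omega \ge 0 . \tag{1.133}$$

    For the inner Newton PCG iterations consider the exact error terms

    $$\epsilon_k := \|F'(x^k)^{1/2} \Delta x^k\|^2$$

    and the Kantorovich quantities

    $$h_k := \omega \, \|F'(x^k)^{1/2} \Delta x^k\|$$

    as well as their inexact analogues

    $$\epsilon_k^\delta := \|F'(x^k)^{1/2} \delta x^k\|^2 = \frac{\epsilon_k}{1 + \delta_k^2}$$

    and

    $$h_k^\delta := \omega \, \|F'(x^k)^{1/2} \delta x^k\| = \frac{h_k}{\sqrt{1 + \delta_k^2}} ,$$

    where $\delta_k$ characterizes the inner PCG iteration error

    $$\delta_k := \frac{\|F'(x^k)^{1/2} (\Delta x^k - \delta x^k)\|}{\|F'(x^k)^{1/2} \delta x^k\|} .$$

    Assume that for some $\bar\Theta < 1$

    $$h_0 < 2 \bar\Theta < 2 \tag{1.134}$$

    and that

    $$\delta_{k+1} \ge \delta_k , \quad k \in \mathbb{N}_0 , \tag{1.135}$$

    holds true throughout the outer Newton iterations.
    Control the inner iterations according to

    $$\vartheta(h_k, \delta_k) := \frac{h_k^\delta + \delta_k \Big(h_k^\delta + \sqrt{4 + (h_k^\delta)^2}\Big)}{2 \sqrt{1 + \delta_k^2}} \le \bar\Theta . \tag{1.136}$$

    Then, the inexact Newton PCG iterates $x^k$, $k \in \mathbb{N}_0$, stay in $L_0$ and converge linearly to some $x^* \in L_0$ with $f(x^*) = \min_{x \in D} f(x)$.

    The following estimates hold true

    $$\|F'(x^{k+1})^{1/2} \Delta x^{k+1}\| \le \bar\Theta \, \|F'(x^k)^{1/2} \Delta x^k\| , \quad k \in \mathbb{N}_0 , \tag{1.137}$$

    $$\|F'(x^{k+1})^{1/2} \delta x^{k+1}\| \le \bar\Theta \, \|F'(x^k)^{1/2} \delta x^k\| , \quad k \in \mathbb{N}_0 . \tag{1.138}$$

    Moreover, the objective functional is reduced according to

    $$- \tfrac{1}{6} \, h_k^\delta \, \epsilon_k^\delta \le f(x^k) - f(x^{k+1}) - \tfrac{1}{2} \, \epsilon_k^\delta \le \tfrac{1}{6} \, h_k^\delta \, \epsilon_k^\delta . \tag{1.139}$$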

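    Proof. Observing

    $$r^k = F(x^k) + F'(x^k) \, \delta x^k , \quad k \in \mathbb{N}_0 ,$$

    for $\lambda \in [0, 1]$ we obtain

    $$\begin{aligned}
    f(x^k + \lambda \delta x^k) - f(x^k) &= \int_{s=0}^{\lambda} (\delta x^k, F(x^k + s \delta x^k)) \, ds \\
    &= \int_{s=0}^{\lambda} (\delta x^k, F(x^k + s \delta x^k) - F(x^k)) \, ds + \int_{s=0}^{\lambda} (\delta x^k, F(x^k)) \, ds \\
    &= \int_{s=0}^{\lambda} s \int_{t=0}^{1} (\delta x^k, F'(x^k + s t \delta x^k) \delta x^k) \, dt \, ds + \int_{s=0}^{\lambda} (\delta x^k, F(x^k)) \, ds \\
    &= \int_{s=0}^{\lambda} s \int_{t=0}^{1} \big(\delta x^k, (F'(x^k + s t \delta x^k) - F'(x^k)) \delta x^k\big) \, dt \, ds \\
    &\quad + \int_{s=0}^{\lambda} s \int_{t=0}^{1} (\delta x^k, F'(x^k) \delta x^k) \, dt \, ds + \int_{s=0}^{\lambda} (\delta x^k, r^k - F'(x^k) \delta x^k) \, ds \\
    &\le \int_{s=0}^{\lambda} s \int_{t=0}^{1} \underbrace{\|F'(x^k)^{1/2} \delta x^k\| \; \omega \, s \, t \, \|F'(x^k)^{1/2} \delta x^k\|^2}_{= \, s \, t \, h_k^\delta \, \epsilon_k^\delta} \, dt \, ds \\
    &\quad + \big(\tfrac{1}{2} \lambda^2 - \lambda\big) (\delta x^k, F'(x^k) \delta x^k) + \int_{s=0}^{\lambda} \underbrace{(\delta x^k, r^k)}_{= \, 0 \text{ due to } (1.132)} \, ds \\
    &= \tfrac{1}{6} \lambda^3 \, h_k^\delta \, \epsilon_k^\delta + \big(\tfrac{1}{2} \lambda^2 - \lambda\big) \, \epsilon_k^\delta .
    \end{aligned} \tag{1.140}$$

    It readily follows from (1.140) that

    $$f(x^k + \lambda \delta x^k) \le f(x^k) + \lambda \Big(\tfrac{1}{6} \lambda^2 \, h_k^\delta \, \epsilon_k^\delta + \big(\tfrac{1}{2} \lambda - 1\big) \, \epsilon_k^\delta\Big) . \tag{1.141}$$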

    Denoting by Lk the level setLk := { x D | f(x) f(xk) } ,

    by induction on k we prove

    hk < 2 and hence, xk+1 Lk . (1.142)

    For k = 0, we have h0 < 2 by assumption (1.134). Since h0 h0, (1.141) readily

    shows f(x1) < f(x0), whence x1 L0.Now, assuming (1.142) to hold true for some k lN, again taking advantage ofhk hk < 2, (1.141) yields f(xk+1) < f(xk) and thus xk+1 Lk.Moreover, choosing = 1 in (1.141), we obtain the left-hand side of the func-tional descent property (1.139). We note that we get the right-hand side of(1.139), if in (1.140) we estimate by the other direction of the Cauchy-Schwarzinequality.Finally, in order to prove the contraction properties (1.137),(1.138) and lin-ear convergence, we estimate the local energy norms as follows:

    F (xk+1)1/2xk+1 = F (xk+1)1/2 F (xk+1)xk+1 = F (xk+1)

    =

    = F (xk+1)1/2(F (xk+1) F (xk)

    ) =

  • Num. Meth. Large-Scale Nonlinear Systems 41

    = F (xk+1)1/2(F (xk+1) F (xk)

    )+ F (xk+1)1/2 F (xk) .

    Observing

    F (xk) = F (xk)xk + rk ,and using the affine conjugate Lipschitz condition we obtain

    F (xk+1)1/2xk+1 = (1.143)

    = F (xk+1)1/2( 1

    0

    (F (xk + txk) F (xk)

    )xk dt + rk

    )

    12 F (xk)1/2xk2 + F (xk+1)1/2rk .

    Setting z = xkxk, for the second term on the right-hand side of the previousinequality we get the implicit estimate

    F (xk+1)1/2rk2

    F (xk)1/2z2 + hk F (xk)1/2z F (xk+1)1/2rk ,which gives the explicit bound

    F (xk+1)1/2rk 12

    (hk +

    4 + (hk)

    2)F (xk)z . (1.144)

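    Using (1.144) in (1.143) results in

    $$\|F'(x^{k+1})^{1/2}\Delta x^{k+1}\| \le \frac{1}{2\omega}\,\underbrace{\omega^2\|F'(x^k)^{1/2}\delta x^k\|^2}_{=\,(h_k^\delta)^2} + \frac12\Big(h_k^\delta + \sqrt{4 + (h_k^\delta)^2}\Big)\,\underbrace{\|F'(x^k)^{1/2}z\|}_{=\,\delta_k h_k^\delta/\omega}.$$

    Taking (1.136) into account, we thus get the contraction factor estimate

    $$\Theta_k := \frac{\|F'(x^{k+1})^{1/2}\Delta x^{k+1}\|}{\|F'(x^k)^{1/2}\Delta x^k\|} \le \vartheta(h_k^\delta, \delta_k), \qquad \omega\,\|F'(x^k)^{1/2}\Delta x^k\| = h_k = \sqrt{1+\delta_k^2}\;h_k^\delta, \tag{1.145}$$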


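    which proves (1.137) and linear convergence.

    For the proof of (1.138) we observe

    $$\|F'(x^\ell)^{1/2}\Delta x^\ell\|^2 = (1 + \delta_\ell^2)\,\|F'(x^\ell)^{1/2}\delta x^\ell\|^2, \quad \ell = k, k+1,$$

    as well as $\delta_{k+1} \ge \delta_k$ and obtain

    $$\frac{\|F'(x^{k+1})^{1/2}\delta x^{k+1}\|}{\|F'(x^k)^{1/2}\delta x^k\|} \le \sqrt{\frac{1+\delta_k^2}{1+\delta_{k+1}^2}}\;\Theta_k \le \Theta_k. \tag{1.146}$$

    By standard arguments we further show that the sequence $\{x^k\}_{k \in \mathbb{N}_0}$ of inexact Newton PCG iterates is a Cauchy sequence in $L_0$ and there exists an $x^* \in L_0$ such that $x^k \to x^*$ $(k \to \infty)$ with $F(x^*) = 0$.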

    Theorem 3.6 Affine conjugate convergence theorem for the inexact Newton PCG method. Part II: Quadratic convergence

    Under the same assumptions on $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ as in Theorem 3.5 suppose that the initial guess $x^0 \in D$ satisfies

    h0 0 and control the inner iterations such that

    k 2

    hk

    hk +4 + (hk)

    2. (1.148)

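    Then, there holds:

    (i) The inexact Newton PCG iterates $x^k$, $k \in \mathbb{N}_0$, stay in $L_0$ and converge quadratically to some $x^* \in L_0$ with $F(x^*) = 0$.

    (ii) The exact Newton increments and the inexact Newton increments decrease quadratically according to

    $$\|F'(x^{k+1})^{1/2}\Delta x^{k+1}\| \le \frac{1+\rho}{2}\,\omega\,\|F'(x^k)^{1/2}\Delta x^k\|^2, \tag{1.149}$$

    $$\|F'(x^{k+1})^{1/2}\delta x^{k+1}\| \le \frac{1+\rho}{2}\,\omega\,\|F'(x^k)^{1/2}\delta x^k\|^2. \tag{1.150}$$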


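    Proof. Using (1.148) in (1.145) yields

    $$\frac{\|F'(x^{k+1})^{1/2}\Delta x^{k+1}\|}{\|F'(x^k)^{1/2}\Delta x^k\|} \le \frac{h_k^\delta + \delta_k\big(h_k^\delta + \sqrt{4 + (h_k^\delta)^2}\big)}{2\sqrt{1+\delta_k^2}} \le \frac12\,(1+\rho)\,h_k^\delta,$$

    which proves (1.149) in view of $h_k^\delta \le h_k \le h_0 < 2$.

    The proof of (1.150) follows along the same lines by using (1.148) in (1.146).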

    3.3.3 Algorithmic aspects of the affine conjugate inexact Newton PCG method

    (i) Convergence monitor

    Let us assume that the quantity < 1 in both the linear convergence modeand the quadratic convergence mode has been specified and let us furtherassume that we use the startiterate xk0 = 0 in the inner PCG iteration.Denoting by k an easily computable estimate of the relative energy normiteration error k, we accept a new iterate x

    k+1, if the condition

    f(xk+1) f(xk) 110

    k = 110

    (1 + 2

    k)k (1.151)

    or the monotonicity test

    k :=(k+1

    k

    )1/2=((1 + 2k+1) k+1

    (1 + 2

    k) k

    )1/2 < 1 (1.152)

    is satisfied. We consider the outer iteration as divergent if neither (1.151) nor (1.152) holds true.

    (ii) Termination criterion

    With respect to a user-specified accuracy ETOL, the inexact Newton PCG iteration will be terminated if either

    k = (1 + 2

    k) k ETOL2 . (1.153)

    or

    f(xk) f(xk+1) 12ETOL2 . (1.154)

    (iii) Balancing outer and inner iterations

    For k = 0, we choose 0 = 0 =14.

    As in the case of the inexact Newton CGNE iteration, for $k \ge 1$ we begin with the quadratic convergence mode and switch to the linear convergence mode as soon as the approximate contraction factor $\Theta_k$ is below some prespecified threshold value $\bar\Theta \le \tfrac12$.

    (iii)1 Quadratic convergence mode

    A computationally realizable termination criterion for the inner PCG iteration in the quadratic convergence mode is given by

    k [hk]

    [hk] +4 + [hk]

    2, (1.155)

    where [hk] is an appropriate a priori estimate of the inexact Kantorovichquantity hk. In view of (1.145), we have the a posteriori estimates

    [hk]2 :=10

    k|f(xk+1) f(xk) + 1

    3k| (1.156)

    and

    [hk]2 :=

    1 +

    2

    k |[hk]2 . (1.157)We note that (1.157) yields the a priori estimate

    [hk] := k1 [hk1]2 . (1.158)

    Using (1.158) in (1.157), for the inexact Kantorovich quantity we obtain the following a priori estimate

    [hk] :=[hk]1 +

    2

    k

    . (1.159)

    Inserting (1.159) into (1.155), we obtain a simple nonlinear equation in k.

    Remark 3.5 Computational work in the quadratic convergence mode

    Since $\delta_k \to 0$ $(k \to \infty)$ is enforced, it follows that: The more the iterates $x^k$ approach the solution $x^*$, the more computational work is required for the inner iterations to guarantee quadratic convergence of the outer iteration.

    (iii)2 Linear convergence mode

    We switch to the linear convergence mode, if

    k < (1.160)


    is satisfied. The computationally realizable termination criterion for the inner iteration in the linear convergence mode is

    [(hk, k)] := ([hk], k) . (1.161)

    Since asymptotically there holds

    k 12

    (k ) ,

    we observe:

    Remark 3.6 Computational work in the linear convergence mode

    The more the iterates $x^k$ approach the solution $x^*$, the less computational work is required for the inner iterations to guarantee linear convergence of the outer iteration.


    4. Quasi-Newton Methods

    4.1 Introduction

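    Given $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ as well as $x^k, x^{k+1} \in D$, $x^k \ne x^{k+1}$, the idea is to approximate $F$ locally around $x^{k+1}$ by an affine function

    $$S_{k+1}(x) := F(x^{k+1}) + J_{k+1}(x - x^{k+1}), \quad J_{k+1} \in \mathbb{R}^{n \times n}, \tag{1.162}$$

    such that

    $$S_{k+1}(x^k) = F(x^k). \tag{1.163}$$

    The requirement (1.163) gives rise to the so-called secant condition

    $$J\,\underbrace{(x^{k+1} - x^k)}_{=:\ \delta x^k} = \underbrace{F(x^{k+1}) - F(x^k)}_{=:\ y^k}. \tag{1.164}$$

    The matrix $J$ is not uniquely determined by (1.164), since

    $$\dim \mathcal{S}_{k+1} = (n-1)\,n, \tag{1.165}$$

    where

    $$\mathcal{S}_{k+1} := \{\, J \in \mathbb{R}^{n \times n} \mid J\,\delta x^k = y^k \,\}. \tag{1.166}$$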

    There are different criteria to select an appropriate $J \in \mathcal{S}_{k+1}$.

    4.1.1 The Good Broyden rank 1 update

    Let us consider the change in the affine model as given by

    $$S_{k+1}(x) - S_k(x) = (J_{k+1} - J_k)(x - x^k). \tag{1.167}$$

    An appropriate idea is to choose $J_{k+1} \in \mathcal{S}_{k+1}$ such that there is a least change in the affine model in the sense

    $$\|J_{k+1} - J_k\|_F = \min_{J \in \mathcal{S}_{k+1}} \|J - J_k\|_F, \tag{1.168}$$

    where $\|\cdot\|_F$ stands for the Frobenius norm (observe $J = (J_{ik})_{i,k=1}^n$)

    $$\|J\|_F := \Big(\sum_{i,k=1}^n J_{ik}^2\Big)^{1/2}. \tag{1.169}$$


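    The solution of (1.168) can be heuristically motivated as follows: Choose $t^k \perp \delta x^k$ such that

    $$x - x^k = \alpha\,\delta x^k + t^k.$$

    Then, (1.167) reads

    $$S_{k+1}(x) - S_k(x) = \alpha\,\underbrace{(J_{k+1} - J_k)\,\delta x^k}_{=\,y^k - J_k\delta x^k} + (J_{k+1} - J_k)\,t^k. \tag{1.170}$$

    Now, choose $J_{k+1} \in \mathcal{S}_{k+1}$ such that

    $$(J_{k+1} - J_k)\,t^k = 0.$$

    It follows that

    $$\operatorname{rank}(J_{k+1} - J_k) = 1, \qquad J_{k+1} - J_k = v^k (\delta x^k)^T. \tag{1.171}$$

    Inserting (1.171) into (1.170) yields

    $$v^k\,(\delta x^k)^T \delta x^k = y^k - J_k\,\delta x^k,$$

    which results in

    $$v^k = \frac{y^k - J_k\,\delta x^k}{(\delta x^k)^T \delta x^k}.$$

    Altogether, this gives us Broyden's rank 1 update (Good Broyden)

    $$J_{k+1} = J_k + \frac{\big[F(x^{k+1}) - F(x^k) - J_k\,\delta x^k\big](\delta x^k)^T}{(\delta x^k)^T \delta x^k}. \tag{1.172}$$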

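    For the solution of nonlinear systems, we are more interested in updates of the inverse of $J_k$. Such an update can be provided by the Sherman-Morrison-Woodbury formula

    $$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}. \tag{1.173}$$

    Setting

    $$A := J_k, \qquad u := F(x^{k+1}) - F(x^k) - J_k\,\delta x^k, \qquad v := \frac{\delta x^k}{(\delta x^k)^T \delta x^k},$$

    we obtain

    $$J_{k+1}^{-1} = J_k^{-1} + \frac{\big[\delta x^k - J_k^{-1}\big(F(x^{k+1}) - F(x^k)\big)\big](\delta x^k)^T J_k^{-1}}{(\delta x^k)^T J_k^{-1}\big[F(x^{k+1}) - F(x^k)\big]}. \tag{1.174}$$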
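The Good Broyden update (1.172) and its inverse form (1.174) can be checked numerically on a small example. The following is a minimal sketch in plain Python, not part of the handout: the helper names (`matvec`, `dot`, `good_broyden`, `good_broyden_inverse`) and the 2x2 test data are hypothetical, and the code assumes that the denominators $(\delta x^k)^T\delta x^k$ and $(\delta x^k)^T J_k^{-1} y^k$ are nonzero.

```python
def matvec(A, x):
    """Matrix-vector product A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def good_broyden(J, dx, y):
    """Rank 1 update (1.172): J + (y - J dx) dx^T / (dx^T dx)."""
    n = len(dx)
    Jdx = matvec(J, dx)
    s = dot(dx, dx)  # assumed nonzero
    return [[J[i][j] + (y[i] - Jdx[i]) * dx[j] / s for j in range(n)]
            for i in range(n)]


def good_broyden_inverse(H, dx, y):
    """Inverse update (1.174), where H approximates J_k^{-1}."""
    n = len(dx)
    Hy = matvec(H, y)
    dxH = [sum(dx[i] * H[i][j] for i in range(n)) for j in range(n)]  # dx^T H
    s = dot(dx, Hy)  # dx^T H y, assumed nonzero
    return [[H[i][j] + (dx[i] - Hy[i]) * dxH[j] / s for j in range(n)]
            for i in range(n)]


# Hypothetical data: J_k, its exact inverse, a step dx and a residual difference y.
J = [[2.0, 1.0], [0.0, 3.0]]
H = [[0.5, -1.0 / 6.0], [0.0, 1.0 / 3.0]]
dx = [1.0, 2.0]
y = [3.0, -1.0]

J_new = good_broyden(J, dx, y)          # satisfies the secant condition J_new dx = y
H_new = good_broyden_inverse(H, dx, y)  # equals the inverse of J_new
```

Working with `H_new` directly avoids a linear solve per iteration, which is the reason the text prefers the inverse update.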


    4.1.2 The Bad Broyden rank 1 update

    Instead of (1.168), an alternative is to choose $J_{k+1} \in \mathcal{S}_{k+1}$ such that there is a least change in the solution of the affine model, i.e.,

    $$\|J_{k+1}^{-1} - J_k^{-1}\|_F = \min_{J \in \mathcal{S}_{k+1}} \|J^{-1} - J_k^{-1}\|_F. \tag{1.175}$$

    Similar considerations as before lead us to Broyden's alternative rank 1 update (Bad Broyden)

    $$J_{k+1}^{-1} = J_k^{-1} + \frac{\big[\delta x^k - J_k^{-1}\big(F(x^{k+1}) - F(x^k)\big)\big]\big(F(x^{k+1}) - F(x^k)\big)^T}{\big(F(x^{k+1}) - F(x^k)\big)^T\big(F(x^{k+1}) - F(x^k)\big)}. \tag{1.176}$$
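For comparison, the Bad Broyden update (1.176) can be sketched in the same way. This is an illustrative plain-Python sketch under the assumption that the residual difference is nonzero; the function name `bad_broyden_inverse` and the data are hypothetical. The update maps $y^k$ to $\delta x^k$ and leaves the action of $J_k^{-1}$ on vectors orthogonal to $y^k$ unchanged.

```python
def bad_broyden_inverse(H, dx, y):
    """Inverse update (1.176): H + (dx - H y) y^T / (y^T y), with H ~ J_k^{-1}."""
    n = len(dx)
    Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
    s = sum(yi * yi for yi in y)  # y^T y, assumed nonzero
    return [[H[i][j] + (dx[i] - Hy[i]) * y[j] / s for j in range(n)]
            for i in range(n)]


def apply(H, w):
    """Matrix-vector product H w."""
    return [sum(h * wi for h, wi in zip(row, w)) for row in H]


# Hypothetical data: current inverse approximation, step and residual difference.
H = [[0.5, -1.0 / 6.0], [0.0, 1.0 / 3.0]]
dx = [1.0, 2.0]
y = [3.0, -1.0]
w = [1.0, 3.0]  # orthogonal to y

H_new = bad_broyden_inverse(H, dx, y)
```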

    4.2 Affine covariant Quasi-Newton method

    4.2.1 Affine covariant Quasi-Newton convergence theory

    Affine covariant Quasi-Newton methods require the secant condition (1.164) to be stated by means of affine covariant terms in the domain of definition of the nonlinear mapping $F$.

    Observing that we compute the Quasi-Newton increment $\delta x^k$ as the solution of

    $$J_k\,\delta x^k = -F(x^k), \tag{1.177}$$

    we can rewrite (1.164) according to

    $$(J_k - J)\,\delta x^k = -F(x^{k+1}).$$

    Multiplication by $J_k^{-1}$ yields the affine covariant secant condition

    $$\overline{\delta x}^{k+1} := \underbrace{(I - J_k^{-1}J)}_{=:\ E_k(J)}\,\delta x^k = -J_k^{-1}F(x^{k+1}). \tag{1.178}$$

    We note that any rank 1 update of the form

    $$J_{k+1} = J_k\Big(I - \frac{\overline{\delta x}^{k+1} v^T}{v^T \delta x^k}\Big), \quad v \in \mathbb{R}^n \setminus \{0\}, \tag{1.179}$$

    satisfies the affine covariant secant condition (1.178).

    In particular, for $v = \delta x^k$ we recover the Good Broyden update.


    Theorem 4.1 Properties of the affine covariant Quasi-Newton method

    For Broyden's affine covariant rank 1 update (Good Broyden)

    $$J_{k+1} = J_k\Big(I - \frac{\overline{\delta x}^{k+1}(\delta x^k)^T}{\|\delta x^k\|^2}\Big) \tag{1.180}$$

    assume that the local contraction condition

    k =xk+1xk 0 ,

    which proves (1.251).

    In order to come up with an affine covariant globalization concept, we introduce the level set associated with the level function $T_A$ given by

    $$G_A(z) := \{\, x \in D \mid T_A(x) \le T_A(z) \,\}. \tag{1.252}$$

    We recall that monotonicity with respect to $T_A$ reads as follows:

    $$x^{k+1} \in \operatorname{int} G_A(x^k), \quad\text{if}\quad \operatorname{int} G_A(x^k) \ne \emptyset.$$

    Denoting by $GL(n)$ the set of all regular $n \times n$ matrices, we introduce the affine covariant level set

    $$\overline{G}(x) := \bigcap_{A \in GL(n)} G_A(x). \tag{1.253}$$

    Theorem 5.1 Newton path

    Assume that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable on $D$ with nonsingular Jacobi matrix $F'(x)$, $x \in D$. Further suppose that for some $A \in GL(n)$ the path-connected component of $G_A(x^0)$, $x^0 \in D$, is a compact subset of $D$. Then, the path-connected component of $\overline{G}(x^0)$ is a topological path $\bar{x} : [0, 2] \to \mathbb{R}^n$, called the Newton path. It has the properties

    $$F(\bar{x}(\lambda)) = (1 - \lambda)\,F(x^0), \tag{1.254}$$

    $$T_A(\bar{x}(\lambda)) = (1 - \lambda)^2\,T_A(x^0), \quad A \in GL(n), \tag{1.255}$$

    and satisfies the two-point boundary value problem

    $$\frac{d\bar{x}}{d\lambda} = -F'(\bar{x})^{-1}F(x^0), \tag{1.256}$$

    $$\bar{x}(0) = x^0, \qquad \bar{x}(1) = x^*.$$


    Moreover, we recover the ordinary Newton increment $\Delta x^0$ by means of

    $$\frac{d\bar{x}}{d\lambda}\Big|_{\lambda = 0} = -F'(x^0)^{-1}F(x^0) = \Delta x^0. \tag{1.257}$$
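The properties (1.254) and (1.256) can be illustrated on a scalar example by integrating the Newton path differential equation numerically. The sketch below is an illustration only: the test function $F(x) = e^x - 1$, the starting point $x^0 = 1$, the classical Runge-Kutta integrator and the name `newton_path` are assumptions made here and do not come from the handout.

```python
import math


def newton_path(F, dF, x0, lam_end=1.0, steps=200):
    """Integrate dx/dlam = -F(x0) / F'(x) from lam = 0 to lam_end (scalar case)
    with the classical fourth-order Runge-Kutta method."""
    F0 = F(x0)

    def rhs(x):
        return -F0 / dF(x)

    h = lam_end / steps
    x = x0
    for _ in range(steps):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * h * k1)
        k3 = rhs(x + 0.5 * h * k2)
        k4 = rhs(x + h * k3)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


def F(x):
    return math.exp(x) - 1.0   # hypothetical test function with root x* = 0


def dF(x):
    return math.exp(x)


x_half = newton_path(F, dF, 1.0, lam_end=0.5)  # point on the path at lam = 1/2
x_end = newton_path(F, dF, 1.0, lam_end=1.0)   # end point, approximates x*
```

Along the computed path the residual is scaled by the factor $1 - \lambda$, so `F(x_half)` is close to `0.5 * F(1.0)` and `x_end` is close to the root.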

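    Proof. We introduce the level sets

    $$H_A(x^0) := \{\, y \in \mathbb{R}^n \mid \|Ay\|^2 \le \|AF(x^0)\|^2 \,\}$$

    and define their intersection

    $$\overline{H}(x^0) := \bigcap_{A \in GL(n)} H_A(x^0). \tag{1.258}$$

    The idea of proof is to show that $\overline{H}(x^0) = \overline{G}(x^0)$.

    For that purpose, we refer to $\sigma_i$, $1 \le i \le n$, as the singular values of $A$ and to $q_i$, $1 \le i \le n$, as the associated eigenvectors of $A^TA$ such that

    $$A^TA = \sum_{i=1}^n \sigma_i^2\, q_i q_i^T.$$

    We further denote by $\mathcal{A}$ the following subset of $GL(n)$:

    $$\mathcal{A} := \Big\{\, A \in GL(n) \;\Big|\; A^TA = \sum_{i=1}^n \sigma_i^2\, q_i q_i^T, \quad q_1 = \frac{F(x^0)}{\|F(x^0)\|} \,\Big\}.$$

    Obviously, every $y \in \mathbb{R}^n$ admits the representation

    $$y = \sum_{j=1}^n b_j q_j, \quad b_j \in \mathbb{R}, \quad 1 \le j \le n,$$

    and hence,

    $$\|Ay\|^2 = y^TA^TAy = \sum_{i=1}^n \sigma_i^2 b_i^2,$$

    $$\|AF(x^0)\|^2 = \sigma_1^2\,\|F(x^0)\|^2.$$

    In particular, for $A \in \mathcal{A}$ we find

    $$H_A(x^0) = \Big\{\, y \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^n \sigma_i^2 b_i^2 \le \sigma_1^2\,\|F(x^0)\|^2 \,\Big\}.$$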


    Figure 1: Intersection of ellipsoids $H_A(x^0)$, $A \in \mathcal{A}$.

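    In other words, $H_A(x^0)$ defines the $n$-dimensional ellipsoid

    $$\frac{1}{\|F(x^0)\|^2}\,b_1^2 + \Big(\frac{\sigma_2}{\sigma_1\|F(x^0)\|}\Big)^2 b_2^2 + \dots + \Big(\frac{\sigma_n}{\sigma_1\|F(x^0)\|}\Big)^2 b_n^2 \le 1.$$

    For $A \in \mathcal{A}$, all ellipsoids have a common $b_1$-axis of length $\|F(x^0)\|$, whereas the lengths of the other axes differ (cf. Figure 1).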

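    It follows readily that

    $$\widetilde{H}(x^0) := \bigcap_{A \in \mathcal{A}} H_A(x^0) = \{\, y \in \mathbb{R}^n \mid y = b_1 q_1, \ |b_1| \le \|F(x^0)\| \,\} = \tag{1.259}$$

    $$= \{\, y \in \mathbb{R}^n \mid y = (1-\lambda)F(x^0), \ \lambda \in [0,2] \,\} =$$

    $$= \{\, y \in \mathbb{R}^n \mid Ay = (1-\lambda)AF(x^0), \ \lambda \in [0,2], \ A \in GL(n) \,\}.$$

    Since $\mathcal{A} \subset GL(n)$, we have

    $$\overline{H}(x^0) \subset \widetilde{H}(x^0).$$

    On the other hand, for $y \in \widetilde{H}(x^0)$ and $A \in GL(n)$

    $$\|Ay\|^2 = (1-\lambda)^2\,\|AF(x^0)\|^2 \le \|AF(x^0)\|^2,$$

    which shows

    $$\widetilde{H}(x^0) \subset \overline{H}(x^0).$$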


    The final stage of the proof is done by an appropriate lifting of the path $\overline{H}(x^0)$ to $\overline{G}(x^0)$ using the homotopy

    $$\Phi(x, \lambda) := F(x) - (1-\lambda)\,F(x^0).$$

    In view of

    $$\Phi_x = F'(x), \qquad \Phi_\lambda = F(x^0)$$

    and observing that $\Phi_x$ is nonsingular for $x \in D$ and $G_A(x^0) \subset D$, local continuation from $\bar{x}(0) = x^0$ by the implicit function theorem, applied to $\Phi = 0$, delivers the existence of the path

    $$\bar{x} \subset G_A(x^0) \subset D$$

    with the properties (1.256), (1.257). The assertions (1.254) and (1.255) are now a direct consequence of (1.259).

    Remark 5.2 The implication of the previous theorem is that even far from the solution, the normalized Newton increment $\Delta x^0 / \|\Delta x^0\|$, which is tangent to the Newton path originating from $x^0$, plays a decisive role and should be used in an affine invariant globalization strategy. However, its length may be too large and thus has to be controlled appropriately.

    Remark 5.3 The previous theorem assumes that the Jacobian is regular in $D$. However, sometimes the situation is encountered where the Jacobian is singular at a critical point $x$ even close to the initial guess $x^0$. In this case, the implicit function theorem tells us that the Newton path ends at that critical point.

    5.2 Trust region concepts

    As we have seen, far away from the solution the ordinary Newton method can still be used, provided an appropriate damping of the Newton increment is applied. Of course, we would like to know how to determine the damping factor, or in other words, what is the region around the current iterate where we can rely on the linearization with respect to the tangent to the Newton path. The specification of such regions is known as the trust region concept.

    5.2.1 Trust region based on the Levenberg-Marquardt method

    Given a current iterate $x^k \in \mathbb{R}^n$ and a prespecified parameter $\delta > 0$, the idea of the Levenberg-Marquardt method is to determine an increment $\Delta x^k \in \mathbb{R}^n$ as the solution of the constrained minimization problem

    $$\inf_{\Delta x^k \in K} \|F(x^k) + F'(x^k)\Delta x^k\|,$$

    where $K$ stands for the constraint

    $$K := \{\, \Delta x^k \in \mathbb{R}^n \mid \|\Delta x^k\| \le \delta \,\}.$$

    Coupling the inequality constraint by a Lagrangian multiplier $\mu \in \mathbb{R}_+$ leads to the saddle point problem

    $$\inf_{\Delta x^k \in \mathbb{R}^n}\ \sup_{\mu \in \mathbb{R}_+} L(\Delta x^k, \mu)$$

    in terms of the associated Lagrangian functional

    $$L(\Delta x^k, \mu) := \|F(x^k) + F'(x^k)\Delta x^k\|^2 + \mu\,\big(\|\Delta x^k\|^2 - \delta^2\big).$$

    The KKT conditions read as follows:

    $$\big(F'(x^k)^TF'(x^k) + \mu I\big)\,\Delta x^k = -F'(x^k)^TF(x^k), \tag{1.260}$$

    $$\mu \ge 0, \qquad \|\Delta x^k\|^2 - \delta^2 \le 0, \qquad \mu\,\big(\|\Delta x^k\|^2 - \delta^2\big) = 0. \tag{1.261}$$

    Denoting the solution of the saddle point problem by $(\Delta x^k(\mu), \mu)$, we observe

    $$\mu \to 0^+ \implies \Delta x^k(\mu) \to -F'(x^k)^{-1}F(x^k),$$

    $$\mu \gg 1 \implies \Delta x^k(\mu) \approx -\frac{1}{\mu}\,F'(x^k)^TF(x^k) = -\frac{1}{\mu}\,\operatorname{grad} T(x^k).$$

    This means: Close to the solution, the method coincides with the ordinary Newton method, whereas far from the solution, it corresponds to a steepest descent with the steplength parameter $1/\mu$.

    The Levenberg-Marquardt method looks robust, since the coefficient matrix $F'(x^k)^TF'(x^k) + \mu I$ in (1.260) is regular for $\mu > 0$, even if the Jacobian $F'(x^k)$ is singular. However, the method may terminate for singular $F'(x^k)$, since then the right-hand side in (1.260) also degenerates. Moreover, the Levenberg-Marquardt method lacks affine invariance.
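The two limiting regimes of the Levenberg-Marquardt system (1.260) can be observed on a small example. The following plain-Python sketch is illustrative only: the multiplier name `mu`, the function `lm_step`, the restriction to a 2x2 Jacobian solved by Cramer's rule, and the data are all assumptions made for this example.

```python
def lm_step(J, F, mu):
    """Solve (J^T J + mu I) dx = -J^T F for a 2x2 Jacobian J by Cramer's rule."""
    # Entries of the symmetric matrix J^T J + mu I = [[a, b], [b, d]].
    a = J[0][0] ** 2 + J[1][0] ** 2 + mu
    b = J[0][0] * J[0][1] + J[1][0] * J[1][1]
    d = J[0][1] ** 2 + J[1][1] ** 2 + mu
    # Gradient of T(x) = 0.5 * ||F(x)||^2, i.e. g = J^T F.
    g0 = J[0][0] * F[0] + J[1][0] * F[1]
    g1 = J[0][1] * F[0] + J[1][1] * F[1]
    det = a * d - b * b
    return [(-g0 * d + b * g1) / det, (-a * g1 + b * g0) / det]


# Hypothetical Jacobian and residual at the current iterate.
J = [[2.0, 1.0], [1.0, 3.0]]
F = [1.0, -2.0]

dx_small = lm_step(J, F, 1e-10)  # close to the Newton step -J^{-1} F = [-1, 1]
mu_large = 1e8
dx_large = lm_step(J, F, mu_large)  # close to -(1/mu) J^T F = [0, 5] / mu
```

For small `mu` the step approaches the ordinary Newton correction, while for large `mu` it approaches a short steepest descent step for the residual level function.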

    5.2.2 The Armijo damping strategy

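    An empirical damping strategy is the Armijo strategy: Let $\Lambda_k \subset \{1, \tfrac12, \tfrac14, \dots, \lambda_{\min}\}$ be a sequence of steplengths with the property

    $$T(x^k + \lambda\,\Delta x^k) \le \big(1 - \tfrac12\lambda\big)\,T(x^k), \quad \lambda \in \Lambda_k. \tag{1.262}$$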


    Figure 2: Geometric interpretation of the affine covariant trust region method

    Then, the damping parameter $\lambda_k \in \Lambda_k$ is chosen as the optimal one:

    $$T(x^k + \lambda_k\,\Delta x^k) = \min_{\lambda \in \Lambda_k} T(x^k + \lambda\,\Delta x^k).$$

    Obviously, the choice of the level function $T(x)$ in the Armijo rule does not reflect affine covariance. We will develop an affine covariant damping strategy below.
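The Armijo strategy can be sketched for a scalar equation as follows. This is a minimal illustration, not the handout's algorithm in full generality: it assumes the level function $T(x) = \tfrac12 F(x)^2$, a hypothetical lower bound `lam_min` on the steplength, and the test problem $F(x) = \arctan x$ with $x^0 = 10$, for which the undamped Newton step overshoots badly. The function name `armijo_damped_newton` is likewise hypothetical.

```python
import math


def armijo_damped_newton(F, dF, x0, lam_min=2.0 ** -20, tol=1e-12, max_iter=50):
    """Damped Newton iteration for a scalar equation F(x) = 0 with Armijo damping."""

    def T(x):
        return 0.5 * F(x) ** 2  # residual level function (assumed form)

    x = x0
    for _ in range(max_iter):
        if abs(F(x)) <= tol:
            return x
        dx = -F(x) / dF(x)  # ordinary Newton correction
        # Collect all steplengths 1, 1/2, 1/4, ... satisfying (1.262).
        admissible = []
        lam = 1.0
        while lam >= lam_min:
            if T(x + lam * dx) <= (1.0 - 0.5 * lam) * T(x):
                admissible.append(lam)
            lam *= 0.5
        if not admissible:
            raise RuntimeError("no admissible steplength found")
        # Choose the admissible steplength with the smallest level function value.
        lam_k = min(admissible, key=lambda l: T(x + l * dx))
        x = x + lam_k * dx
    return x


root = armijo_damped_newton(math.atan, lambda x: 1.0 / (1.0 + x * x), 10.0)
```

In this example the first step is strongly damped, after which full Newton steps are accepted and the iteration converges rapidly to the root at zero.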

    5.2.3 Affine covariant trust region method

    The Levenberg-Marquardt method can be easily reformulated to yield an affine covariant version. Since affine covariance means affine invariance with respect to transformations in the domain of definition, we have to modify the objective functional:

    $$\inf_{\Delta x^k \in K} \big\|F'(x^k)^{-1}\big(F(x^k) + F'(x^k)\Delta x^k\big)\big\|, \tag{1.263}$$

    whereas the set of constraints $K$ is given as before.

    The affine covariant trust region method (1.263) admits an easy geometric interpretation as shown in Figure 2. The set $K$ of constraints is represented as a sphere with radius $\delta$ around $x^k$. If $\delta$ exceeds the length of the Newton correction $\Delta x^k$, the constraint is not active, and we are in the regime of the ordinary Newton method. However, if $\delta$ is smaller than the length of the Newton correction $\Delta x^k$, we have to apply an appropriate damping.


    5.2.4 Affine contravariant trust region method

    We can also easily reformulate the Levenberg-Marquardt method to come up with an affine contravariant version. Since affine contravariance means affine invariance with respect to transformations in the range space, the objective functional remains unchanged, but we have to modify the set of constraints:

    $$\inf_{\Delta x^k \in K} \|F(x^k) + F'(x^k)\Delta x^k\|,$$

    whereas the set of constraints $K$ is given as follows:

    $$K := \{\, \Delta x^k \in \mathbb{R}^n \mid \|F'(x^k)\Delta x^k\| \le \delta \,\}. \tag{1.264}$$

    There is basically the same geometric interpretation as before, with the only difference that now the picture has to be drawn in the range space.

    5.3 Globalization of affine contravariant Newton methods

    5.3.1 Convergence of the damped Newton iteration

    We consider the damped Newton iteration

    $$F'(x^k)\,\Delta x^k = -F(x^k), \qquad x^{k+1} = x^k + \lambda_k\,\Delta x^k, \quad \lambda_k \in [0,1], \tag{1.265}$$

    in an affine contravariant setting where the damping factor $\lambda_k$ is chosen to achieve residual contraction.

    Theorem 5.2 Optimal choice of the damping factor

    Assume that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, $D$ convex, is continuously differentiable on $D$ with regular Jacobian $F'(x)$, $x \in D$. We further suppose that the following affine contravariant Lipschitz condition holds true:

    $$\|(F'(y) - F'(x))(y - x)\| \le \omega\,\|F'(x)(y - x)\|^2, \quad x, y \in D. \tag{1.266}$$

    Setting $h_k := \omega\,\|F(x^k)\|$, for $\lambda \in [0, \min(1, 2/h_k)]$ we have

    $$\|F(x^k + \lambda\,\Delta x^k)\| \le t_k(\lambda)\,\|F(x^k)\|, \tag{1.267}$$

    where

    $$t_k(\lambda) := 1 - \lambda + \tfrac12\,h_k\,\lambda^2.$$

    The optimal choice of the damping factor is

    $$\bar\lambda_k := \min\Big(1, \frac{1}{h_k}\Big). \tag{1.268}$$

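    Proof. By straightforward calculation we find

    $$\|F(x^k + \lambda\Delta x^k)\| = \|F(x^k + \lambda\Delta x^k) - F(x^k) - F'(x^k)\Delta x^k\| =$$

    $$= \Big\|\int_0^\lambda \big(F'(x^k + t\Delta x^k) - F'(x^k)\big)\Delta x^k\,dt - (1-\lambda)\,F'(x^k)\Delta x^k\Big\| \le$$

    $$\le \Big\|\int_0^\lambda \big(F'(x^k + t\Delta x^k) - F'(x^k)\big)\Delta x^k\,dt\Big\| + (1-\lambda)\,\|F(x^k)\|.$$

    The first term on the right-hand side measures the deviation from the Newton path. Using the affine contravariant Lipschitz condition, it can be estimated as follows:

    $$\Big\|\int_0^\lambda \big(F'(x^k + t\Delta x^k) - F'(x^k)\big)\Delta x^k\,dt\Big\| \le \tfrac12\,\omega\,\lambda^2\,\|F'(x^k)\Delta x^k\|^2 = \tfrac12\,h_k\,\lambda^2\,\|F(x^k)\|.$$

    Inserting this estimate into the previous one and minimizing $t_k(\lambda)$ proves the theorem.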
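The minimization of $t_k(\lambda)$ in Theorem 5.2 can be checked with a few lines of code. The sketch below is only an illustration; the function names `t` and `optimal_damping` and the sample value of $h_k$ are hypothetical, and $h_k > 0$ is assumed.

```python
def t(lam, h):
    """Reduction factor t_k(lam) = 1 - lam + 0.5 * h * lam**2 from Theorem 5.2."""
    return 1.0 - lam + 0.5 * h * lam * lam


def optimal_damping(h):
    """Optimal damping factor (1.268): min(1, 1/h), assuming h > 0."""
    return min(1.0, 1.0 / h)


h = 4.0                        # hypothetical Kantorovich quantity h_k
lam_opt = optimal_damping(h)   # 0.25 for h = 4
# Compare against a grid search over the admissible interval [0, min(1, 2/h)].
upper = min(1.0, 2.0 / h)
grid_min = min(t(upper * i / 1000.0, h) for i in range(1001))
```

For $h_k \ge 1$ the optimal reduction factor equals $1 - 1/(2h_k)$, which shows how slowly the residual contracts when $h_k$ is large.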

    Theorem 5.3 Global convergence of affine contravariant Newton methods

    Under the same assumptions as in Theorem 5.2 let $D_0$ be the path-connected component of the level set $G(x^0)$ and suppose that $D_0$ is a compact subset of $D$. Then, for all damping factors

    $$\lambda_k \in [\varepsilon,\ 2\bar\lambda_k - \varepsilon] \tag{1.269}$$

    with $\varepsilon > 0$ sufficiently small, the damped Newton iterates $x^k$, $k \in \mathbb{N}_0$, converge to some $x^* \in D_0$ with $F(x^*) = 0$.


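    Proof. The parabola $t_k(\lambda)$ from Theorem 5.2 can be bounded by a polygonal as follows:

    $$t_k(\lambda) \le \begin{cases} 1 - \tfrac12\lambda, & 0 \le \lambda \le \dfrac{1}{h_k}, \\[2mm] 1 + \tfrac12\lambda - \dfrac{1}{h_k}, & \dfrac{1}{h_k} \le \lambda \le \dfrac{2}{h_k}. \end{cases}$$

    For $0 < \varepsilon \le \dfrac{1}{h_k}$ and $\lambda_k \in [\varepsilon, 2\bar\lambda_k - \varepsilon]$ we thus have

    $$t_k(\lambda) \le 1 - \tfrac12\varepsilon, \tag{1.270}$$

    which shows strict reduction of the residual level function $T(x)$.

    The existence of a global $\varepsilon > 0$ follows from the compactness assumption on $D_0$, which implies

    $$\max_{x \in D_0} \omega\,\|F(x)\| < \infty.$$

    Consequently, if $G(x^k) \subset D_0$, then (1.270) yields

    $$G(x^{k+1}(\lambda)) \subset G(x^k).$$

    The rest of the proof is along the same lines as the proof of the affine contravariant Newton-Mysovskikh theorem.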

    5.3.2 Adaptive affine contravariant trust region strategy

    In Theorem 5.2 we derived the theoretical damping factor (1.268). Since the Kantorovich quantity $h_k = \omega\,\|F(x^k)\|$ cannot be accessed directly, we again have to provide appropriate estimates

    $$[h_k] := [\omega]\,\|F(x^k)\|, \tag{1.271}$$

    where $[\omega]$ is a lower bound for the domain dependent Lipschitz constant $\omega$ that can be obtained by pointwise sampling.

    Then, an estimate of the optimal damping factor is given by means of

    $$[\bar\lambda_k] := \min\Big(1, \frac{1}{[h_k]}\Big). \tag{1.272}$$

    It follows readily from (1.271) that

    $$[\bar\lambda_k] \ge \bar\lambda_k,$$


    i.e., we may have a considerable overestimation. As a remedy, repeated reductions must be performed by appropriate prediction and correction strategies.

    The following bit counting lemma gives information about the contraction in the residuals in terms of the accuracy of the estimate for the Kantorovich quantity.

    Lemma 5.3 Bit counting lemma

    Assume that for some $0 \le \sigma < 1$ there holds

    $$0 \le h_k - [h_k] < \sigma\,\max(1, [h_k]). \tag{1.273}$$

    Then, the residual monotonicity test (1.267) yields

    $$\|F(x^{k+1})\| \le \Big(1 - \tfrac12\,(1-\sigma)\,\lambda_k\Big)\,\|F(x^k)\|. \tag{1.274}$$

    Proof. The assumption (1.273) can be rewritten as

    $$[h_k] \le h_k < (1+\sigma)\,\max(1, [h_k]),$$

    which results in the following estimate of the residual contraction

    F (xk+1)F (xk) [1 +

    1

    22hk]|=[k] 0 being sufficiently small, the damped Newton method convergesto some x D0 with F (x) = 0.Proof. As before, we remark that the parabola tAk () can be bounded fromabove by a polygonal bound according to

    tAk () 1 1

    2 , 0 < 1

    hk. (1.288)

    Moreover, there is a global , since with regard to the compactness assumptionon D0 we have

    maxxD0

    F (x)1F (x) cond(AF (x)) < .

    The proof proceeds by induction on k: Assuming GA(xk) D0, (1.288) yields

    GA(xk+1) GA(xk) D0 .

    Consequently, the sequence of Newton iterates lives in a compact set whichallows to conclude.

    Remark 5.5 The flaws of residual monotonicity

    Setting $A = I$ in the previous theorem, we are obviously back in the residual based regime where we have proved global convergence according to Theorem 5.3. However, if the Jacobian $F'(x^k)$ is ill conditioned, we obtain

    $$\bar\lambda_k = \big(h_k\,\operatorname{cond}(F'(x^k))\big)^{-1} \ll 1, \tag{1.289}$$


    Figure 3: Reduction factors and optimal damping factors

    which algorithmically will result in a termination of the iteration.

    5.4.2 Natural level function

    In view of (1.283) and (1.286), the most natural choice of the matrix $A \in GL(n)$ in the level function $T_A$ is

    $$A := A_k = F'(x^k)^{-1}. \tag{1.290}$$

    The associated level function $T_{F'(x^k)^{-1}}$ is called the natural level function, which gives rise to the natural monotonicity test

    $$\|\overline{\Delta x}^{k+1}\| \le \|\Delta x^k\| \tag{1.291}$$

    in terms of the simplified Newton correction

    $$\overline{\Delta x}^{k+1} = -F'(x^k)^{-1}F(x^{k+1}). \tag{1.292}$$

    Several remarks are due with respect to the properties of the natural level function.

    Remark 5.6 Extremal properties

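    As shown in Figure 3, for $A \in GL(n)$ the reduction factors $t_k^A(\lambda)$ and the optimal damping factors $\bar\lambda_k(A)$ satisfy

    $$t_k^{A_k}(\lambda) = 1 - \lambda + \tfrac12\,\lambda^2 h_k \le t_k^A(\lambda), \tag{1.293}$$

    $$\bar\lambda_k(A_k) = \min\Big(1, \frac{1}{h_k}\Big) \ge \bar\lambda_k(A). \tag{1.294}$$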


    Figure 4: Asymptotic distance spheres associated with natural level sets

    Remark 5.7 Steepest descent property

    The damped Newton method in $x^k$ is a method of steepest descent for the natural level function $T_{A_k}$:

    $$\Delta x^k = -\operatorname{grad} T_{A_k}(x^k). \tag{1.295}$$

    Remark 5.8 Asymptotic optimality

    In view of

    $$h_k < 1 \implies \bar\lambda_k(A_k) = 1, \tag{1.296}$$

    the damped Newton method asymptotically achieves quadratic convergence.

    Remark 5.9 Asymptotic distance function

    If $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is twice continuously differentiable, we can show

    $$T_{F'(x^*)^{-1}}(x) = \tfrac12\,\|x - x^*\|^2 + O(\|x - x^*\|^3).$$

    Hence, for $x^k \to x^*$ the natural monotonicity criterion approaches a distance criterion of the form

    $$\|x^{k+1} - x^*\| \le \|x^k - x^*\|.$$

    As shown in Figure 4, close to the solution $x^*$ the natural level surface is close to a sphere, whereas it degenerates to an osculating sphere with increasing distance to $x^*$. Note that for other level functions, the level surface is an ellipsoid close to $x^*$, with the ratio of the largest to the smallest half-axis being related to the condition number of the Jacobian, and an osculating ellipsoid off $x^*$.

    Remark 5.10 Local descent

    If we insert $A = A_k$ into (1.285), (1.286) of Theorem 5.4, we get the local descent property

    $$\|\overline{\Delta x}^{k+1}\| \le \Big(1 - \lambda + \tfrac12\,\lambda^2 h_k\Big)\,\|\Delta x^k\|. \tag{1.297}$$
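A damped Newton iteration controlled by the natural monotonicity test (1.291) can be sketched for a scalar equation as follows. This is an illustration under assumptions that are not in the handout: the damping factor is simply halved until the test is passed, `lam_min` is a hypothetical lower bound, and the test problem is again $F(x) = \arctan x$ with $x^0 = 10$. The function name `natural_damped_newton` is hypothetical.

```python
import math


def natural_damped_newton(F, dF, x0, tol=1e-12, lam_min=2.0 ** -20, max_iter=50):
    """Scalar damped Newton iteration with the natural monotonicity test."""
    x = x0
    for _ in range(max_iter):
        dx = -F(x) / dF(x)  # ordinary Newton correction at x^k
        if abs(dx) <= tol:
            return x
        lam = 1.0
        while lam >= lam_min:
            x_trial = x + lam * dx
            # Simplified Newton correction (1.292): old Jacobian, new residual.
            dx_bar = -F(x_trial) / dF(x)
            if abs(dx_bar) <= abs(dx):  # natural monotonicity test (1.291)
                break
            lam *= 0.5  # assumed reduction strategy: halve the damping factor
        else:
            raise RuntimeError("no admissible damping factor found")
        x = x_trial
    return x


root = natural_damped_newton(math.atan, lambda x: 1.0 / (1.0 + x * x), 10.0)
```

Note that the test needs no additional Jacobian evaluation, since the simplified correction reuses $F'(x^k)$.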

    Remark 5.11 Global convergence

    We note that the results of Theorem 5.5 are not applicable to the situation at hand, since $A = A_k$ changes from one step to the other. Taking the asymptotic distance function property into account, in the subsequent global convergence result we make the fixed choice $A = F'(x^*)^{-1}$.

    Theorem 5.6 Global convergence of the affine covariant damped Newton method with natural level functions; Part I

    Assume that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, $D \subset \mathbb{R}^n$ convex, is continuously differentiable on $D$ with regular Jacobian $F'(x)$, $x \in D$, and suppose that the following affine covariant Lipschitz condition is fulfilled:

    F (x)1(F (y) F (x)

    )(y x) y x2 , x, y D . (1.298)

    Suppose further that $x^* \in D$ is the unique solution in $D$ and let $x^0 \in D$ be an initial guess such that the path-connected component of $G_{F'(x^*)^{-1}}(x^0)$ is a compact subset of $D$.

    Let the damping factors be chosen according to

    k [, 2k ] , 0 < 0 the data

    x(`) , q `


    is available. Then, in terms of the fundamental Lagrange polynomials L`q(),the prediction path is given by the interpolating polynomial

    xq() :=

    `=qx(` L

    `q() . (1.334)

    Standard error estimates give

    x() xq() Cq+1 () , (1.335)where

    () :=

    `=q( `) .

    (iv)2 Hermite extrapolation

    Here, we assume that we are given the data

    x(`) , x(`) , q ` .

    We define the prediction path xq() as the associated Hermite polynomial andobtain

    x() xq() Cq+1 () , (1.336)where

    () :=

    `=q( `)2 .

    6.1.3 Affine covariant correction method

    Once we have computed a prediction path x(), +1, we choose thepredicted value x0 := x(+1) as an initial guess for a correction methodto compute an approximation of x := x(+1). We will study the ordinaryNewton method with a new Jacobian at each iterate. Applying the affine co-variant version of the Newton-Kantorovich theorem, we get the following result.

    Theorem 6.1 Convergence of the corrector

    Assume that F : D I lRn is continuously differentiable with nonsingularJacobian Fx(x, ), (x, ) D I. Further, suppose that there exists a uniquehomotopy path x() and that the affine covariant Lipschitz condition

    Fx(x(), )1(Fx(y, ) Fx(x, )

    ) 0 y x , x, y D , I (1.337)


    is satisfied, where x() is a prediction method of order p (cf. (1.326)). Then,for all step sizes

    max :=(2 10 p

    )1/p, (1.338)

    the ordinary Newton method with initial guess x(+1) converges to the solutionpoint x(+1).

    Proof. For the ease of exposition, we write instead of . The affinecovariant Newton-Kantorovich theorem requires

    x0()0 12. (1.339)

    Applying the Lipschitz condition (1.337), by straightforward computation wefind

    x0() = Fx(x(), )1F (x(), ) = Fx(x, )1(F (x, ) F (x, )

    ) =

    Fx(x, )11

    0

    Fx(x+ t(x x), )(x x) dt x x)(1 +

    1

    20 x x

    ).

    Observing (1.326), we deduce

    x0() pp(1 +

    1

    20 p

    p)

    =: () . (1.340)

    Consequently, this leads to the requirement

    0 pp(1 +

    1

    20 p

    p) 1

    2,

    which is equivalent to0 p

    p 2 1 .

    6.1.4 Adaptive stepsize control

    For the practical application of the theoretical convergence results we have toreplace the theoretical quantities 0 and p by computationally available lowerbounds [0] and [p] thus resulting in the stepsize estimate

    [max] :=(2 1[0] [p]

    )1/p max . (1.341)

    Since there might be a substantial overestimation, we need again a prediction strategy and a correction strategy.


    As far as the correction strategy is concerned, let us assume that for +1wealready know the first contraction factor

    0() :=x1()x0() .

    The convergence analysis of the affine covariant Newton method yields

    0() 120 x0() . (1.342)

    Hence, inserting (1.340) gives us

    0() 120 p

    p ,

    which leads to0 p

    p g(0()) ,where

    $$g(\Theta) := \sqrt{1 + 4\Theta} - 1.$$

    From this, we get the a posteriori estimate

    [0 p] :=g(0())

    p 0 p ,

    and the associated stepsize estimate

    [max] :=( g()[0 p]

    )1/p, =

    1

    4.

    Denoting by the stepsize associated with the computed value of 0 and by corresponding to =

    14, we arrive at the stepsize correction

    :=( g()g(0)

    )1/p . (1.343)
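The stepsize correction (1.343) can be written as a small helper. The sketch below is hedged: it assumes that the function $g$ is $g(\Theta) = \sqrt{1 + 4\Theta} - 1$, that the target contraction factor is $\tfrac14$ as stated above, and that the new stepsize is the old one scaled by $(g(\bar\Theta)/g(\Theta_0))^{1/p}$. The names `g`, `corrected_stepsize`, `step` and `theta_bar` are hypothetical.

```python
import math


def g(theta):
    """g(theta) = sqrt(1 + 4 * theta) - 1 (assumed form of the function g)."""
    return math.sqrt(1.0 + 4.0 * theta) - 1.0


def corrected_stepsize(step, theta0, p, theta_bar=0.25):
    """Scale the continuation stepsize by (g(theta_bar) / g(theta0))**(1/p).

    step      -- stepsize used for the computed contraction factor theta0
    theta0    -- first contraction factor of the corrector (must be > 0)
    p         -- order of the prediction method
    theta_bar -- target contraction factor (1/4 in the text)
    """
    return (g(theta_bar) / g(theta0)) ** (1.0 / p) * step


same = corrected_stepsize(1.0, 0.25, 1)      # theta0 on target: step unchanged
reduced = corrected_stepsize(1.0, 0.75, 1)   # theta0 too large: step is reduced
reduced_p2 = corrected_stepsize(1.0, 0.75, 2)
```

A contraction factor above the target shrinks the stepsize, and a higher prediction order `p` makes the reduction milder.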

    Remark: If the termination criterion detects some k such that k >12, the

    last continuation step has to be repeated with

    :=( g()g(k)

    )1/p , (1.344)

    which gives rise to a reduction, since

    [max]