
Semismooth Newton Methods in Function Space: Theoretical Foundations and Applications

Part II

Michael Ulbrich

Technische Universität München

42nd Woudschoten Conference, October 4–6, 2017

Supported by the DFG and the Munich Centre of Advanced Computing (MAC)

Includes joint work with Christian Böhm, Daniela Bratzke, Michael Hintermüller, Moritz Keuthen, Andre Milzarek, Stefan Ulbrich

Outline of Part II

Sufficient Conditions for Regularity

Moreau-Yosida Regularization for State-Constrained and Related Problems

Application to 3D Elastic Contact Problems (Multigrid Preconditioner)

Globalization

Semismooth Newton for Nonsmooth Minimization using the Proximal Operator

Application to Seismic Tomography

Mesh-Independence of Semismooth Newton

Michael Ulbrich | Semismooth Newton Methods in Function Space: Theory and Applications | 5.10.2017 2

Sufficient Conditions for Regularity


Sufficient Conditions for Regularity

S. Ulbrich, M.U. ’00; M.U. ’01,’11

Sufficient conditions for regularity

$$\|M^{-1}\|_{Z\to W} \le C \quad \forall\, M \in \partial H(w),\ \forall\, w \in B_W(\bar w, \delta)$$

can be derived from second-order type optimality conditions.

For reformulated complementarity problems in $L^2$, the central ingredients of sufficient regularity conditions are:

$$(v, F'(\bar w) v)_{L^2(\Omega)} \ge \nu \|v\|^2_{L^2(\Omega)} \quad \forall\, v \in L^2(\Omega) \text{ with } v\,F(\bar w) = 0 \text{ a.e. on } \Omega.$$

F has the structure

F = γI + G with γ > 0, G : L2(Ω)→ Lp(Ω), p > 2.

Some further technical requirements.


State-Constrained and Related Problems


Nonsmooth Reformulation Beyond the L2 Setting

Consider the obstacle-type problem

$$\min_{w\in W} J(w) \quad \text{s.t.} \quad w \le \beta,$$

with, e.g., $W = H^1_0(\Omega)$ or $C(\bar\Omega)$, or $H^1_0(\Omega) \cap H^2(\Omega)$, and $\beta \in W$.

Then the optimality conditions assume the form

$$\bar w \le \beta, \qquad \langle J'(\bar w), w - \bar w\rangle_{W^*,W} \ge 0 \quad \forall\, w \in W,\ w \le \beta.$$

It is no longer possible to write this in pointwise form

$$\bar w - P_{(-\infty,\beta]}(\bar w - \tau J'(\bar w)) = 0,$$

since $J'(\bar w) \in W^*$ is not a pointwise a.e. defined function.

Hence, nonsmooth Newton methods for such problems are difficult.

Standard approach: Regularize the problem to recover an Lp setting.


Moreau-Yosida Regularization

Ito, Kunisch ’03,’08; Hintermüller, Kunisch ’06; Meyer, Yousept ’09; Neitzel, Tröltzsch ’08; Hintermüller, Schiela, Wollner ’12; M.U. ’02,’11; Keuthen, M.U. ’15; Böhm, M.U. ’15; M.U., S. Ulbrich, Bratzke ’17

Moreau-Yosida regularized problem:

$$\min_w\ J(w) + \frac{1}{2\theta}\,\big\|[\lambda + \theta(w - \beta)]_+\big\|^2_{L^2(\Omega)},$$

where $\theta > 0$ is a penalty parameter and $[t]_+ = \max\{0, t\}$. $\lambda \in L^2_+(\Omega)$ is a shift parameter (often: $\lambda = 0$).

Optimality Conditions: J ′(w) + [λ+ θ(w − β)]+ = 0.

Observation: Semismooth Newton methods are applicable, since

$$w \in W \subset L^p(\Omega) \mapsto \lambda + \theta(w - \beta) \in L^2(\Omega) \subset L^{p'}(\Omega) \subset W^*, \qquad p' = \frac{p}{p-1},$$

is continuous affine linear with suitable $p > 2$ (Sobolev embedding).

Results on convergence rates w.r.t. $\theta$ and continuation methods are available.

Moreau-Yosida Regularization – Semismooth Newton

Ito, Kunisch ’03,’08; Hintermüller, Kunisch ’06; Hintermüller, Schiela, Wollner ’12; M.U. ’02,’11; Böhm, M.U. ’15; Keuthen, M.U. ’15; M.U., S. Ulbrich, Bratzke ’17

As observed, with $p > 2$ such that $W \subset L^p(\Omega)$, the operator

$$w \in W \subset L^p(\Omega) \mapsto [\lambda + \theta(w - \beta)]_+ \in L^{p'}(\Omega) \subset W^*$$

is semismooth. The generalized differential of $[\cdot]_+$ at $\lambda + \theta(w - \beta)$ consists of all operators $M \in L(W, W^*)$, $M : h \mapsto g\,h$, where $g \in L^\infty(\Omega)$ and

$$g(x) \in \begin{cases} \{0\} & \text{if } \lambda(x) + \theta(w(x) - \beta(x)) < 0,\\ \{1\} & \text{if } \lambda(x) + \theta(w(x) - \beta(x)) > 0,\\ [0,1] & \text{if } \lambda(x) + \theta(w(x) - \beta(x)) = 0. \end{cases}$$

The semismooth Newton system thus reads

[J ′′(w) + θ g · I] s = −J ′(w)− [λ+ θ(w − β)]+.
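The Newton system above can be tried out on a small discrete model. The following is a minimal sketch, not the method of the cited papers: it assumes the quadratic objective J(w) = 1/2 w'Aw - f'w with A the 1D finite-difference Dirichlet Laplacian, a constant obstacle β, and shift λ = 0. The function names, the grid size, and all parameter values are illustrative assumptions.

```python
def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a tridiagonal system with constant off-diagonal `off`."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(w, f, beta, lam, theta, h):
    """J'(w) + [lam + theta (w - beta)]_+ with J'(w) = A w - f (assumed model problem)."""
    n = len(w)
    r = []
    for i in range(n):
        left = w[i - 1] if i > 0 else 0.0
        right = w[i + 1] if i < n - 1 else 0.0
        grad_j = (2.0 * w[i] - left - right) / h**2 - f
        r.append(grad_j + max(0.0, lam + theta * (w[i] - beta)))
    return r


def my_semismooth_newton(n=50, f=8.0, beta=0.5, lam=0.0, theta=1e4,
                         tol=1e-8, max_iter=100):
    """Semismooth Newton iteration [J''(w) + theta g I] s = -residual(w)."""
    h = 1.0 / (n + 1)
    w = [0.0] * n
    for k in range(max_iter):
        r = residual(w, f, beta, lam, theta, h)
        if max(abs(ri) for ri in r) <= tol:
            return w, k
        # g = 1 where the argument of [.]_+ is positive, g = 0 otherwise
        g = [1.0 if lam + theta * (wi - beta) > 0.0 else 0.0 for wi in w]
        diag = [2.0 / h**2 + theta * gi for gi in g]
        s = solve_tridiagonal(diag, -1.0 / h**2, [-ri for ri in r])
        w = [wi + si for wi, si in zip(w, s)]
    return w, max_iter
```

Calling `my_semismooth_newton()` returns the discrete state and the number of Newton steps. In this model the computed state exceeds the obstacle only by a small amount that shrinks as θ grows, which is the penalty character of the regularization.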


Moreau-Yosida – Some Extensions

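Constrained problem:

$$\min_w\ J(w) \quad \text{s.t.} \quad c(w) \in \mathcal{C}.$$

$J : W \to \mathbb{R}$ and $c : W \to V$ are $C^1$; $\mathcal{C} \subset V$ is closed and convex.

Choose a Hilbert space $V_0$ (often $V_0 = L^2$) with $V \subset V_0$ densely.

Moreau-Yosida regularized problem:

$$\min_w\ J_\theta(w) := J(w) + \frac{\theta}{2}\,\mathrm{dist}^2_{V_0}\big(c(w) + \theta^{-1}\lambda,\ \mathcal{C}\big),$$

where $\lambda \in V_0$ and $\mathrm{dist}_{V_0}(\cdot, \mathcal{C})$ measures the distance from $\mathcal{C}$ in $V_0$.

Using proximal theory, $J_\theta$ is $C^1$ with

$$J_\theta'(w) = J'(w) + \theta\, c'(w)^* R\big(c(w) + \theta^{-1}\lambda - P^{V_0}_{\mathcal{C}}(c(w) + \theta^{-1}\lambda)\big).$$

Here, $R : V_0 \to V_0^*$, $Rw = (w, \cdot)_{V_0}$, is the Riesz map.

Example: $\mathcal{C} = \{v \in V \,;\ v(x) \in C \text{ a.e. on } \Omega\}$ with $C \subset \mathbb{R}^n$ closed and convex. Then $P^{L^2(\Omega)^n}_{\mathcal{C}}(v)(\cdot) = P_C(v(\cdot))$ is a superposition operator.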


State Constraints

A particularly involved situation arises for state constraints:

$$\min_{y\in Y,\ u\in U} J(y,u) \quad \text{s.t.} \quad e(y,u) = 0,\quad y \le \beta.$$

Then, in general, CQs are more demanding: Usually, $Y \subset C(\bar\Omega)$ is required. As a consequence, the multiplier is a measure: $\lambda \in M(\bar\Omega) = C(\bar\Omega)^*$.

Moreau-Yosida-Regularized Problem:

$$\min_{y\in Y,\ u\in U} J(y,u) + \frac{1}{2\theta}\,\big\|[\lambda + \theta(y - \beta)]_+\big\|^2_{L^2(\Omega)} \quad \text{s.t.} \quad e(y,u) = 0.$$

Moreau-Yosida Optimality System:

Jy (y , u) + [λ+ θ(y − β)]+ + ey (y , u)∗q = 0,

Ju(y , u) + eu(y , u)∗q = 0,

e(y , u) = 0.

Semismooth Newton methods are applicable to the MY optimality system.

State-Constrained Problem – Optimal State

State constraint: y ≤ 0.1 in Ω.

Semismooth Newton requires 20 iterations.

Nested iteration reduces fine grid iterations to 4.

[Figure: surface plot of the optimal state over [0,1] × [0,1]]


State-Constrained Problem – Optimal Multiplier

The Lagrange multiplier for the state constraint is very irregular (a measure). This makes state constraints a challenging problem class.

[Figure: surface plot of the optimal multiplier over [0,1] × [0,1]]


State-Constrained Problem – Optimal Control

[Figure: surface plot of the optimal control over [0,1] × [0,1]]


Application to 3D Elastic Contact Problems

M.U., S. Ulbrich, D. Bratzke ’17


Elastic 3D Contact Problem

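[Figure: elastic body Ω with normal n on the contact boundary Γ_C, Neumann boundary Γ_N, and Dirichlet boundary Γ_D]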


Elastic 3D Contact Problem

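Elastic 3D Contact Problem as Optimization Problem (P):

$$\min_{u\in U}\ J(u) := \int_\Omega \Big(\mu\,\varepsilon(u) : \varepsilon(u) + \frac{\lambda}{2}\,\mathrm{div}(u)^2 - f_V^T u\Big)\,dx - \int_{\Gamma_N} f_S^T u\ dS(x)$$

$$\text{s.t.} \quad u^T n \le \beta \ \text{ on } \Gamma_C$$

$\Omega \subset \mathbb{R}^3$: reference domain of an elastic body,
$\Gamma_D, \Gamma_N \subset \partial\Omega$: Dirichlet boundary, Neumann boundary,
$\Gamma_C \subset \partial\Omega$: possible contact boundary on $\Omega$,
$u \in U$: displacement, $U = \{u \in H^1(\Omega)^3 \,;\ u|_{\Gamma_D} = 0\}$,
$\varepsilon(u) = \frac12(\nabla u + \nabla u^T)$: strain,
$\lambda, \mu$: Lamé material constants,
$u^T n$: normal displacement on $\Gamma_C$,
$\beta \in H^{1/2}(\Gamma_C)$: normal distance of the body to the obstacle,
$f_V \in L^2(\Omega)^3$, $f_S \in L^2(\Gamma_N)^3$: volume / surface forces.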


Possible Generalizations

Our theory and methods also work for other C2-functions J : U → R.

The following structure is required:

$$J(u) = \int_\Omega \Big(\tfrac12\,(C\nabla u) : \nabla u + D(u) : \nabla u + e(u) - f_V^T u\Big)\,dx - \int_{\Gamma_N} f_S^T u\ dS(x),$$

where $C$, $D$, and $e$ have suitable properties.

For error estimates we need that $J : U \to \mathbb{R}$ is strongly convex in a neighborhood of the solution.

For the analysis of the multigrid semismooth Newton method we require that $J : U \to \mathbb{R}$ is strongly convex in a neighborhood of the Moreau-Yosida-regularized solution.


Related Work

Semismooth Newton methods for contact problems: Bratzke, Christensen, Hoppe, Hüeber, Ito, Kunisch, Pang, Stadler, M.U., S. Ulbrich, Wohlmuth, ...

Multilevel methods for contact problems: Dostal, Hüeber, Kornhuber, Krause, Schöberl, Stadler, Oosterlee, Vollebregt, Wohlmuth, Zhao, ...

Abstract multilevel theory (only the references we built on): Bornemann, Yserentant (... and many more)

Multilevel trust region methods: Gratton, von Loesch, Toint, ...

Regularization of obstacle and state-constrained problems: Hintermüller, Ito, Kunisch, Meyer, Prüfert, Rösch, Schiela, Tröltzsch, M.U., Weiser, ...


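KKT-System of the Elastic Contact Problem

We define $a : U \times U \to \mathbb{R}$, $A \in L(U, U^*)$, $N \in L(U, H^{1/2}(\Gamma_C))$, $f \in U^*$ by

$$a(v,w) = \langle v, Aw\rangle_{U,U^*} = \int_\Omega \big(2\mu\,\varepsilon(v) : \varepsilon(w) + \lambda\,\mathrm{div}(v)\,\mathrm{div}(w)\big)\,dx,$$

$$Nu = u^T n|_{\Gamma_C}, \qquad \langle f, u\rangle_{U^*,U} = \int_\Omega f_V^T u\ dx + \int_{\Gamma_N} f_S^T u\ dS(x).$$

$$\text{(P)} \qquad \min_{u\in U}\ \tfrac12\,a(u,u) - \langle f, u\rangle_{U^*,U} \quad \text{s.t.} \quad Nu \le \beta.$$

The problem is uniformly convex and quadratic. Also, $N$ is onto (= CQ).

Optimality Conditions: $\bar u \in U$ solves (P) if and only if there exists $\bar z \in H^{1/2}(\Gamma_C)^*$ such that

$$A\bar u - f + N^*\bar z = 0,$$

$$\bar z \ge 0, \qquad N\bar u - \beta \le 0, \qquad \langle \bar z, N\bar u - \beta\rangle_{(H^{1/2})^*,H^{1/2}} = 0.$$

Here, $z \ge 0$ means $\langle z, v\rangle_{(H^{1/2})^*,H^{1/2}} \ge 0$ for all $v \in H^{1/2}(\Gamma_C)$, $v \ge 0$.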


Moreau-Yosida-Regularized Problem

Moreau-Yosida-Regularized Elastic Contact Problem

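$$\min_{u\in U}\ \tfrac12\,\langle Au, u\rangle_{U^*,U} - \langle f, u\rangle_{U^*,U} + \frac{1}{2\theta}\,\big\|[z + \theta(Nu - \beta)]_+\big\|^2_{L^2(\Gamma_C)}$$

Here, $\theta > 0$ is a penalty parameter and $z \in L^2(\Gamma_C)_+$ is a shift.

Optimality condition is a semismooth system:

$$Au_\theta - f + N^*[z + \theta(Nu_\theta - \beta)]_+ = 0$$

Operator in the semismooth Newton system is boundedly invertible:

$$A + \theta N^* M N, \quad \text{with } Md = 1_{\{z + \theta(Nu - \beta) \ge 0\}}\, d.$$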

Thus, the semismooth Newton method converges locally superlinearly.

We apply a multigrid-preconditioned semismooth Newton CG method.


Error Estimates

M.U., S. Ulbrich, Bratzke ’17

Let $\bar u$ be the solution of (P) with corresponding Lagrange multiplier $\bar z$.

Let $u_\theta$ be the solution of the Moreau-Yosida-regularized problem $(P_\theta)$ and $z_\theta := [z + \theta(Nu_\theta - \beta)]_+$.

Regularity results (e.g., Nečas ’75 or Kinderlehrer ’81) can be used to obtain improved regularity of $\bar u$ and $\bar z$.

For $\bar z \in L^2(\Gamma_C)$, we can show for $\theta \to \infty$:

$$\|u_\theta - \bar u\|_{H^1} = o(\theta^{-1/2}), \qquad \|z_\theta - \bar z\|_{(H^{1/2})^*} = o(\theta^{-1/2}).$$

For $\bar z - z \in H^s(\Gamma_C)$, $0 < s \le \frac12$, we can show for $\theta \to \infty$:

$$\|u_\theta - \bar u\|_{H^1} = O(\theta^{-s-1/2}), \qquad \|z_\theta - \bar z\|_{(H^{1/2})^*} = O(\theta^{-s-1/2}), \qquad \|z_\theta - \bar z\|_{L^2} = O(\theta^{-s}).$$

Multigrid-Preconditioned Semismooth Newton PCG Method

M.U., S. Ulbrich, Bratzke ’17

In a recent paper we propose and analyze a multigrid preconditioner for the MY-regularized semismooth Newton system:

The underlying operator is $A + \theta N^* M N$.

Large $\theta$ generates a strong algebraic (0th order) coupling supported on the approximate contact boundary.

This requires special care in the multigrid method.

A suitable discretization yields a hierarchy of discretized semismooth Newton systems with the same structure.

We developed a multigrid preconditioner and proved a contraction rate that is independent of the number of grid levels and uniform for all sufficiently large regularization parameters $\theta$.


3D Hertzian Contact Problem

[Figure: 3D Hertzian contact geometry with axes x, y, z; coarsest mesh with 3993 elements, finest mesh with 4 120 119 elements]

[Figure: left: maximal contact normal stresses on levels 0, ..., 6; right: normal contact stress distribution in the x-y plane]

[Figure: contact zone; von Mises stress distribution]



Final θ = 10^8

                       ε_pcg = 10^-2          ε_pcg = 10^-4          ε_pcg = 10^-8
l   n_l      n_C,l     it_Newt  avg-it_pcg    it_Newt  avg-it_pcg    it_Newt  avg-it_pcg
0   922      69        3        1.00          3        1.00          3        1.00
1   1793     245       6        2.33          4        4.00          4        7.50
2   4827     929       5        2.40          4        5.00          3        8.667
3   16456    3621      5        3.00          4        6.25          3        10.67
4   61711    14257     5        3.76          4        7.00          4        11.75
5   237300   56612     5        3.80          4        7.50          4        12.75
6   928152   225563    5        4.00          4        7.75          4        13.75

Convergence history of the semismooth Newton method with pcg-multigrid solver:
l: level, n_l: number of grid points, n_C,l: number of contact nodes,
it_Newt: number of semismooth Newton iterations,
avg-it_pcg: average number of pcg iterations per Newton iteration


Globalization


Globalization

Achieving global convergence of Newton-type methods requires additional measures. We address two variants (further options exist).

Globalization using a merit function: Choose a suitable merit function and enforce (nonmonotone) descent to achieve convergence to stationarity for this merit function.

Globalization by path following: Generate a one-parameter family of problems with $(P_{\mu_0})$ easy to solve and $(P_0)$ the original system. Follow the path for $\mu \to 0$. Examples are interior-point and smoothing methods.

If $H(w) = 0$ expresses optimality conditions, then globalization techniques for the underlying optimization problem can be used.

Central for globalizations of Newton-type methods is transition to fast local convergence (“undamped” Newton) under realistic conditions.


Globalization Based on Merit Functions

Globalization based on a merit function $\varphi$ enforces convergence to stationarity of the following auxiliary problem:

$$(P_{glob}) \qquad \min_{w\in W}\ \varphi(w) \quad \text{s.t.} \quad w \in S.$$

$S \subset W$ is a closed convex set containing the relevant roots of $H$.

$\varphi : U \to \mathbb{R}$ is a continuous (preferably $C^1$) function defined on $U \supset S$.

The problem is chosen such that solutions of $H(w) = 0$ are stationary points of $(P_{glob})$, ideally with a 1-to-1 correspondence.

Stationarity is often expressed by a continuous criticality measure $\chi : W \to \mathbb{R}_+$ with $\chi(w) = 0$ iff $w$ is a stationary point of $(P_{glob})$.

Global convergence comes in different flavors, such as:

$$\liminf_{k\to\infty} \chi(w^k) = 0 \quad \text{or, stronger,} \quad \lim_{k\to\infty} \chi(w^k) = 0.$$

Sometimes, ϕ depends on (e.g., penalty) parameters that are adapted.


Globalization Based on Merit Functions (2)

If $Z = Z^*$ is a Hilbert space, a canonical merit function is

$$\varphi_2(w) = \tfrac12\,\|H(w)\|^2_Z.$$

If $H$ is differentiable then $\varphi_2'(w) = H'(w)^* H(w)$ and the Newton step $s^k = -H'(w^k)^{-1} H(w^k)$ is a descent direction:

$$\langle \varphi_2'(w^k), s^k\rangle_{W^*,W} = \langle H'(w^k)^* H(w^k),\ -H'(w^k)^{-1} H(w^k)\rangle_{W^*,W} = -\|H(w^k)\|^2_Z.$$

If $H$ is nonsmooth, then $\varphi_2$ is usually nonsmooth, too.

Thus, there arise the following tasks:

• Finding a (preferably $C^1$) merit function $\varphi$

• Showing that semismooth Newton steps are sufficient descent directions for $\varphi$, at least close to a “nice” solution.



For complementarity problems (bilateral bounds can also be handled)

w ≥ 0, F (w) ≥ 0, (w ,F (w))W = 0

in either W = Rn or W = L2(Ω) one can use a reformulation

$$H_{FB}(w) := \phi_{FB}(w, F(w)) = 0$$

component-wise in $\mathbb{R}^n$, pointwise a.e. in $L^2(\Omega)$.

$\phi_{FB}(a,b) = a + b - \|(a,b)\|_2$ is the Fischer-Burmeister function.

We then have $H_{FB} : W \to Z := W$.

Although $H_{FB}$ is nonsmooth, one can show that $\phi_{FB}^2$ is $C^1$ with

$$\nabla(\phi_{FB}^2)(a,b) = 2\,\phi_{FB}(a,b)\,g \quad \text{for all } g^T \in \partial\phi_{FB}(a,b).$$



Due to the $C^1$-smoothness of $\phi_{FB}^2$, in both cases $W = L^2(\Omega)$ or $W = \mathbb{R}^n$, it can be proved that if $F$ is $C^1$, then $\varphi_2$ is $C^1$ with

$$\varphi_2'(w) = M^* H_{FB}(w) \quad \forall\, M \in \partial H_{FB}(w).$$

For W = Rn, all M ∈ ∂HFB(w) = ∂HFB(w) ⊂ Rn×n have the form

$$M = \mathrm{Diag}(g^a) + \mathrm{Diag}(g^b)\,F'(w), \qquad (g^a_j, g^b_j) \in \partial\phi_{FB}(w_j, F_j(w)) \quad (1 \le j \le n).$$

In the case W = L2(Ω), all M ∈ ∂HFB(w) ⊂ L(W ,W ) have the form

Md = ga d + gb F ′(w)d ∀ d ∈ L2(Ω),

ga, gb ∈ L∞(Ω), (ga(x), gb(x)) ∈ ∂φFB(w(x),F (w)(x)) for a.a. x ∈ Ω.
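For W = R^n, the Fischer-Burmeister reformulation and the merit function ϕ2 can be combined into a damped semismooth Newton iteration. The sketch below is an illustration under assumptions, not an algorithm taken from the cited work: it assumes an affine F(w) = Qw + q, uses plain Armijo backtracking on ϕ2, and picks one particular element of ∂φFB at the kink (0, 0). All function names are hypothetical.

```python
import math


def solve_dense(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fb_residual(w, q_mat, q_vec):
    """H_FB(w) = phi_FB(w, F(w)) componentwise, with the assumed affine F(w) = Q w + q."""
    f = [sum(qij * wj for qij, wj in zip(row, w)) + qi for row, qi in zip(q_mat, q_vec)]
    h = [a + b - math.hypot(a, b) for a, b in zip(w, f)]
    return h, f


def fb_newton(q_mat, q_vec, w0, tol=1e-10, max_iter=50, sigma=1e-4):
    n = len(w0)
    w = list(w0)
    for k in range(max_iter):
        h, f = fb_residual(w, q_mat, q_vec)
        merit = 0.5 * sum(hi * hi for hi in h)
        if math.sqrt(2.0 * merit) <= tol:
            return w, k
        # M = Diag(ga) + Diag(gb) F'(w), with (ga_j, gb_j) from the generalized gradient
        m = []
        for j in range(n):
            r = math.hypot(w[j], f[j])
            if r > 0.0:
                ga, gb = 1.0 - w[j] / r, 1.0 - f[j] / r
            else:
                ga = gb = 1.0 - 1.0 / math.sqrt(2.0)  # one admissible choice at (0, 0)
            m.append([gb * q_mat[j][i] + (ga if i == j else 0.0) for i in range(n)])
        s = solve_dense(m, [-hi for hi in h])
        # Armijo backtracking on phi_2; the slope along s is -||H||^2 = -2 * merit
        t = 1.0
        while t > 1e-12:
            trial = [wi + t * si for wi, si in zip(w, s)]
            h_trial, _ = fb_residual(trial, q_mat, q_vec)
            if 0.5 * sum(x * x for x in h_trial) <= (1.0 - 2.0 * sigma * t) * merit:
                break
            t *= 0.5
        w = trial
    return w, max_iter
```

For example, with Q = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] and q = [-1, 2, -1], the complementarity problem has the solution w = (0.5, 0, 0.5) with F(w) = (0, 1, 0), and `fb_newton(Q, q, [0.0, 0.0, 0.0])` approaches it.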



In the case $\mathbb{R}^n$, one can devise globally, superlinearly convergent algorithms that use $H_{FB}$ and $\varphi_2$.

In the case $L^2(\Omega)$, $\varphi_2$ is a $C^1$-function, but $H_{FB}$ is not semismooth.

In fact, the typical lifting property $F = \gamma I + G$, where $G : L^2 \to L^p$, $p > 2$, cannot be exploited to achieve that $w \mapsto (w, F(w))$ maps to some $L^p(\Omega)^2$, $p > 2$.

In M.U. ’02, ’11, smoothing (or lifting) steps are proposed to close the $L^2$-$L^p$ norm gap; we do not go into these technicalities.

We see that globalizing semismooth Newton methods in function space is delicate.

In practice, semismooth Newton methods combined with nested iteration over a grid hierarchy usually require globalization only on coarser grids.


Globalization Based on Merit Functions (6)

M.U. ’11; Milzarek, M.U. ’14; Milzarek ’16

We used and analyzed the following globalization in several contexts:

Choose an auxiliary problem and a globally convergent method for it.

Use semismooth Newton steps whenever they are admissible for the globalization method or if they satisfy certain other acceptance conditions.

Examples for acceptance conditions:

• Nonmonotone filter (filter globalization goes back to Fletcher ’96)

• Sufficient residual reduction between Newton steps.

The acceptance conditions are such that if infinitely many Newton steps satisfy the condition, then $\|H(w^k)\|_Z \to 0$ on this subsequence.

These conditions are satisfied for full Newton steps in a neighborhood of a “reasonably nice” solution.


Globalization Based on Path Following

Continuation w.r.t. a parameter can be used for globalization in various ways:

Nested iteration over a grid hierarchy often requires globalization only on the coarse grids (→ finite dimensional theory applies)

Moreau-Yosida regularization (state constraints and related problems) forms a basis for path following, which can be used for globalization.

Smoothing methods introduce a smoothed approximation $H_\varepsilon$ of $H$.

• Smoothing Newton system:

$$H_{\varepsilon_k}'(w^k)\,s^k = -H(w^k).$$

• The merit function $\varphi_k(w) = \tfrac12\,\|H_{\varepsilon_k}(w)\|^2_Z$ is used.

• In $\mathbb{R}^n$ global and fast local convergence of smoothing methods is shown, e.g., in Chen, Qi, Sun ’98.

• In a current draft paper, we extend the convergence theory of smoothing methods to an $L^2$-setting.


Semismooth Newton for Nonsmooth Minimization Using the Proximal Operator


A Composite Nonsmooth Optimization Problem

We consider a composite nonsmooth optimization problem:

$$\min_w\ g(w) + h(w).$$

$W$ is a Hilbert space and $g : W \to \mathbb{R}$ is $C^1$.

$h : W \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, lower semicontinuous.

For convenience we choose $W^* = W$, $\langle\cdot,\cdot\rangle_{W^*,W} = (\cdot,\cdot)_W$.

Problems of this form arise in big data, compressed sensing, image analysis, sparse control, ...

Optimality condition of the composite nonsmooth problem:

0 ∈ ∇g(w) + ∂h(w),

where ∂h is the subdifferential of convex analysis.

Goal: Reformulate this generalized equation as a nonsmooth equation.


Composite Nonsmooth Optimization Problems: Examples

Example 1: h can represent constraints:

$$h(w) = \iota_{W_{ad}}(w) := \begin{cases} 0 & (w \in W_{ad}),\\ \infty & (w \notin W_{ad}), \end{cases}$$

where $W_{ad} \subset W$ is closed, convex.

Then the problem is equivalent to $\min_{w\in W_{ad}} g(w)$.

Example 2: Sparse optimization and related problems:

$$h(w) = \|w\|_{\tilde W} \quad \text{with } W \subset \tilde W.$$

Important in compressed sensing and sparse control:

$h(w) = \|w\|_{L^1}$ or $h(w) = \|w\|_{\ell^1}$ are sparsity promoting.


Proximal Operator

We introduce the Proximal Problem:

$$\min_{y\in W}\ f(y) + \tfrac12\,\|y - w\|^2_W.$$

$f : W \to \mathbb{R} \cup \{\infty\}$ is proper, lower semicontinuous, and convex.

$W$ is a Hilbert space; we work with $W^* = W$, $\langle\cdot,\cdot\rangle_{W^*,W} = (\cdot,\cdot)_W$.

The proximal problem is strictly convex. Hence, for every $w \in W$, the proximal problem has a unique solution.

The unique solution $y$ is denoted by $\mathrm{prox}_f(w)$ and defines the proximal operator $\mathrm{prox}_f : W \to W$.

$\mathrm{prox}_f$ is firmly non-expansive, i.e., for all $w_1, w_2 \in W$:

$$\|\mathrm{prox}_f(w_1) - \mathrm{prox}_f(w_2)\|_W \le \big(\mathrm{prox}_f(w_1) - \mathrm{prox}_f(w_2),\ w_1 - w_2\big)_W^{1/2} \le \|w_1 - w_2\|_W.$$


Proximal Operator (2)

Optimality condition of prox problem:

$y = \mathrm{prox}_f(w)$ satisfies:

$$0 \in \partial f(y) + y - w.$$

The optimal value function $e_f : W \to \mathbb{R}$,

$$e_f(w) := \min_{y\in W}\ f(y) + \tfrac12\,\|y - w\|^2_W = f(\mathrm{prox}_f(w)) + \tfrac12\,\|\mathrm{prox}_f(w) - w\|^2_W,$$

is called Moreau envelope or Moreau-Yosida regularization.

$e_f$ is convex and continuously differentiable with

$$\nabla e_f(w) = w - \mathrm{prox}_f(w).$$

Much more could be said, cf., e.g., Bauschke, Combettes ’11.


Proximal Operator – Example

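Example: $f(w) = \mu|w|$, $W = \mathbb{R}$, $\mu > 0$.

Proximal operator (shrinkage/thresholding):

$$\mathrm{prox}_{\mu|\cdot|}(w) = \begin{cases} 0 & (|w| \le \mu),\\ w - \mu\,\mathrm{sgn}(w) & (|w| \ge \mu). \end{cases}$$

Moreau envelope (Huber function):

$$e_{\mu|\cdot|}(w) = \begin{cases} \tfrac12\,w^2 & (|w| \le \mu),\\ \mu\big(|w| - \tfrac{\mu}{2}\big) & (|w| \ge \mu). \end{cases}$$

Gradient of Moreau envelope:

$$\nabla e_{\mu|\cdot|}(w) = \begin{cases} w & (|w| \le \mu),\\ \mu\,\mathrm{sgn}(w) & (|w| \ge \mu) \end{cases} \;=\; w - \mathrm{prox}_{\mu|\cdot|}(w).$$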
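The three formulas of this example translate directly into code. The following minimal sketch evaluates the envelope through its definition f(prox(w)) + 1/2 |prox(w) - w|^2 rather than through the closed-form Huber expression, so the two can be compared; the function names are illustrative.

```python
def prox_abs(w, mu):
    """Soft thresholding: proximal operator of f(y) = mu * |y| on R."""
    if abs(w) <= mu:
        return 0.0
    return w - mu * (1.0 if w > 0 else -1.0)


def moreau_envelope_abs(w, mu):
    """Moreau envelope of mu * |.|, evaluated as f(prox(w)) + 0.5 * (prox(w) - w)^2."""
    y = prox_abs(w, mu)
    return mu * abs(y) + 0.5 * (y - w) ** 2


def grad_moreau_envelope_abs(w, mu):
    """Gradient of the Moreau envelope: w - prox(w)."""
    return w - prox_abs(w, mu)
```

For µ = 1 and w = 3 this gives prox = 2, envelope value µ(|w| - µ/2) = 2.5, and gradient µ sgn(w) = 1, in agreement with the case distinctions above.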


Sketches for the Example f (w) = µ|w |

[Figure: left: $\mathrm{prox}_{\mu|\cdot|}$, middle: $e_{\mu|\cdot|}$, right: $\nabla e_{\mu|\cdot|}$]


Proximal Operator – Example (2)

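Example: $f(w) = \iota_{W_{ad}}(w)$, $W = W^*$ Hilbert space, $W_{ad} \subset W$ closed and convex.

Proximal operator:

$$\mathrm{prox}_{\iota_{W_{ad}}}(w) = P_{W_{ad}}(w) = \text{projection onto } W_{ad}.$$

Moreau envelope:

$$e_{\iota_{W_{ad}}}(w) = \tfrac12\,\|w - P_{W_{ad}}(w)\|^2_W = \tfrac12\,\mathrm{dist}^2_W(w, W_{ad}).$$

Gradient of Moreau envelope:

$$\nabla e_{\iota_{W_{ad}}}(w) = w - \mathrm{prox}_{\iota_{W_{ad}}}(w) = w - P_{W_{ad}}(w).$$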


Proximal Operator – Example (3)

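Example: $f(w) = \mu\|w\|_{L^1} + \iota_{W_{ad}}(w)$, $W = L^2(\Omega)$, $W_{ad} = \{w \,;\ \alpha \le w \le \beta \text{ a.e. in } \Omega\}$.

Proximal problem:

$$\min_y\ \mu\|y\|_{L^1} + \tfrac12\,\|y - w\|^2_{L^2} \quad \text{s.t.} \quad \alpha \le y \le \beta \ \text{ a.e. in } \Omega$$

One can show (we assume $\alpha < 0 < \beta$):

$$\mathrm{prox}_f(w)(x) = P_{[\alpha,\beta]}\big(w(x) - P_{[-\mu,\mu]}(w(x))\big), \quad x \in \Omega.$$

From our results it follows that this superposition operator is semismooth from $L^p(\Omega)$ to $L^r(\Omega)$, $1 \le r < p \le \infty$.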


Proximal Operator of µ| · |+ ι[α,β]


Prox-based Equation Reformulation

Optimization problem: minw g(w) + h(w).

Optimality condition:

(1) 0 ∈ ∇g(w) + ∂h(w).

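Optimality condition of the prox problem for $h$, evaluated at the point $w - \nabla g(w)$ with solution $y$:

$$(2) \qquad 0 \in y - w + \nabla g(w) + \partial h(y).$$

We now show: (1) $\iff$ (3) with

$$(3) \qquad w = \mathrm{prox}_h(w - \nabla g(w)).$$

“$\Longrightarrow$”: If $w$ satisfies (1) then $y = w$ satisfies (2) and thus

$$\mathrm{prox}_h(w - \nabla g(w)) = y = w.$$

“$\Longleftarrow$”: If $w$ satisfies (3) then $y = w$ satisfies (2), and inserting $y = w$ into (2) shows that $w$ satisfies (1).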


Prox-Based Equation Reformulation (2)

We can replace g and h by τg and τh.

This yields that the first order optimality system

(1) 0 ∈ ∇g(w) + ∂h(w).

is equivalent to

$$(3)_\tau \qquad w = \mathrm{prox}_{\tau h}(w - \tau\nabla g(w)).$$

We thus arrive at a nonsmooth system of equations.

If $\nabla g$ and $\mathrm{prox}_{\tau h}$ are semismooth, then we can apply semismooth Newton methods.

An example, discussed on the next slide, is $L^1$-regularization plus pointwise bound constraints.


Prox-Based Equation Reformulation: Example

L1-regularization plus pointwise bound constraints result in

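$$h(w) = \mu\|w\|_{L^1(\Omega)} + \iota_{W_{ad}}(w),$$

where $W_{ad} = \{w \in L^2(\Omega) \,;\ \alpha \le w \le \beta \text{ a.e. in } \Omega\}$.

As shown,

$$\mathrm{prox}_{\tau h}(w)(x) = P_{[\alpha,\beta]}\big(w(x) - P_{[-\tau\mu,\tau\mu]}(w(x))\big), \quad x \in \Omega,$$

is semismooth from $L^p(\Omega)$, $p > 2$, to $L^2(\Omega)$.

If $g : L^2(\Omega) \to \mathbb{R}$ has the structure

$$\nabla g = \gamma I + G \quad \text{with } G : L^2(\Omega) \to L^p(\Omega)$$

we achieve with $\tau = 1/\gamma$:

$$H(w) := w - \mathrm{prox}_{\tau h}(w - \tau\nabla g(w)) = w - \mathrm{prox}_{\gamma^{-1}h}(-\gamma^{-1}G(w))$$

is semismooth from $L^2(\Omega)$ to $L^2(\Omega)$.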
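A finite-dimensional analogue of this reformulation can be sketched as follows. The example assumes a quadratic g(w) = 1/2 w'Qw - c'w in R^n, so that ∇g(w) = Qw - c, and applies an undamped semismooth Newton iteration to H(w) = w - prox_{τh}(w - τ∇g(w)). The generalized derivative used is M = I - D(I - τQ), where D is a 0/1 diagonal marking the components on which the prox acts as a shift. All names and the test data are illustrative assumptions, and no globalization is included.

```python
def clip(t, lo, hi):
    """Projection of the scalar t onto the interval [lo, hi]."""
    return max(lo, min(hi, t))


def solve_dense(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def prox_newton(q_mat, c, mu, alpha, beta, tau, w0, tol=1e-12, max_iter=50):
    """Semismooth Newton on H(w) = w - prox_{tau h}(w - tau * (Q w - c))."""
    n = len(w0)
    w = list(w0)
    for k in range(max_iter):
        grad = [sum(qij * wj for qij, wj in zip(row, w)) - ci for row, ci in zip(q_mat, c)]
        z = [wi - tau * gi for wi, gi in zip(w, grad)]
        shrunk = [zi - clip(zi, -tau * mu, tau * mu) for zi in z]
        prox = [clip(si, alpha, beta) for si in shrunk]
        h = [wi - pi for wi, pi in zip(w, prox)]
        if max(abs(hi) for hi in h) <= tol:
            return w, k
        # d_j = 1 where the prox is a pure shift (free component), 0 where it is constant
        d = [1.0 if (abs(zi) > tau * mu and alpha < si < beta) else 0.0
             for zi, si in zip(z, shrunk)]
        # M = I - D (I - tau Q)
        m = [[(1.0 if i == j else 0.0)
              - d[i] * ((1.0 if i == j else 0.0) - tau * q_mat[i][j])
              for j in range(n)] for i in range(n)]
        s = solve_dense(m, [-hi for hi in h])
        w = [wi + si for wi, si in zip(w, s)]
    return w, max_iter
```

For Q = [[2, 0.5, 0], [0.5, 2, 0.5], [0, 0.5, 2]] (that is, γ = 2 plus an off-diagonal coupling), c = (4, 0.2, 2), µ = 1, bounds [-1, 1], and τ = 1/γ = 0.5, the solution has one component at the upper bound, one at zero, and one free: w = (1, 0, 0.5).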


Application to Seismic Tomography

C. Boehm, M.U. ’15


Introduction

Seismic Inversion: Given a set of seismograms and a description of the seismic sources, determine the material parameters of the Earth.

A better knowledge of the structure of the Earth’s subsurface can help to

explain geodynamic processes,

support the search for natural resources,

identify areas of potential geological hazards, ...


Related Work

Full-Waveform Inversion in the Time Domain: Tromp, Tape, Liu ’05; Epanomeritakis, Akçelik, Ghattas, Bielak ’08; Fichtner, Kennett, Igel, Bunge ’09; Wilcox, Stadler, Burstedde, Ghattas ’10; Fichtner, Trampert ’11 ...

Properties of Parameter-to-State Operator: Stolk ’00; Blazek, Stolk, Symes ’13; Kirsch, Rieder ’13 ...

Randomized Source Sampling, Mini-Batch Hessian: Krebs, Anderson, et al. ’09; Byrd, Chin, Neveitt, Nocedal ’11; Aravkin, Friedlander, Herrmann, Leeuwen ’12; Byrd, Chin, Nocedal, Wu ’12; Haber, Chung, Herrmann ’12; Schiemenz, Igel ’13 ...

Paper: Boehm, M.U., SISC, 2015


Elastic Wave Equation

Bounded domain Ω ⊂ Rd (d = 2, 3), time interval I = (0,T ):

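$$\begin{aligned} \rho u_{tt} - \nabla\cdot(\Psi : \varepsilon(u)) &= f && \text{on } \Omega \times I,\\ u(0) &= 0 && \text{on } \Omega,\\ u_t(0) &= 0 && \text{on } \Omega,\\ (\Psi : \varepsilon(u))\cdot\vec n &= 0 && \text{on } \partial\Omega \times I. \end{aligned}$$

$u$: displacement field
$\varepsilon(u)$: strain tensor ($= \frac12(\nabla u + \nabla u^T)$)
$\rho$: density
$\Psi$: 4th-order material tensor $(\Psi_{ijkl})$

Particularly important is Lamé material:

$$\Psi_{ijkl} = \lambda\,\delta_{ij}\delta_{kl} + \mu(\delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk}) \quad \text{with parameters } \lambda(x),\ \mu(x).$$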


Suitable Parameter Spaces

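Parameterization: $\Psi(m)(x) = \bar\Psi(x) + \Phi(m(x))$

$\bar\Psi \in L^\infty(\Omega)^{d^4}$ = reference model that captures major discontinuities.

Unknown $m$ parameterizes model variations.

$\Phi : \mathbb{R}^r \to \mathbb{R}^{d^4}$ is sufficiently smooth.

Here for simplicity: $\Phi$ linear.

Hilbert space of model variations: $M \subset\subset L^\infty(\Omega)^r$ (compactly).

Admissible set: $M_{ad} = M \cap M^\infty_{ad}$ with

$$M^\infty_{ad} = \{m \in L^\infty(\Omega)^r : \psi_a \le Sm \le \psi_b\},$$

$S \in L(M, Q)$, $Q \subset L^q(\Omega)^n$ for some $q > 2$, $\psi_a, \psi_b \in Q$, $\psi_a \le 0 < \psi_b$.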


Parameter Identification Problem

Input:

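$n_s$ seismic sources $f_i$ (with location and source time function).

Observations $u_i^\delta(t,x)$ for every source in $\Omega_{obs} \times I$, $\Omega_{obs} \subset \Omega$.

Reference model $\bar\Psi \in L^\infty(\Omega)^{d^4}$.

Seismic Inverse Problem (SIP)

$$\min_{m\in M_{ad}}\ j(m) = J(u(m), m) \overset{\text{def}}{=} \sum_{i=1}^{n_s} J_{fit}(u_i, u_i^\delta) + \alpha J_{reg}(m)$$

where the displacements $u(m) = (u_i(m))_{1\le i\le n_s}$ solve the elastic wave PDEs

$$E(u_i, m) = f_i, \quad u_i(0) = 0, \quad (u_i)_t(0) = 0 \quad (1 \le i \le n_s).$$

Here (e.g.): $J_{fit}(u_i, u_i^\delta) = \tfrac12\,\|u_i - u_i^\delta\|^2_{L^2(\Omega_{obs}\times I)}$, $\ J_{reg}(m) = \tfrac12\,\|m\|^2_M$.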


Existence and Differentiability for the Seismic Inverse Problem

Let V = H1(Ω)d . The following can be shown:

Existence and uniqueness of (very) weak solutions $u(m) \in C(I; L^2(\Omega)^d) \cap C^1(I; V^*)$ to the elastic wave equation for fixed $m \in M$, sufficiently regular $\rho$ and $f \in L^2(I; V^*)$.

Fréchet-differentiability of the parameter-to-state operator: Let $f \in H^{k+l}(I; V^*)$, $k \ge 2$, $l \ge 0$, and $f(t) = 0$ near $t = 0$. Then the solution operator

$$m \in L^\infty(\Omega)^r \mapsto u(m) \in C^l(I; V)$$

is $(k-2)$-times Lipschitz continuously Fréchet-differentiable.

Existence of a solution to the regularized seismic inverse problem.

Böhm, M.U. ’15; Lions, Magenes ’72; Lasiecka, Triggiani ’90

Moreau-Yosida Regularization

Moreau-Yosida-Regularized Problem

For $\theta \in (0,\infty)$ define

$$\min_{m\in M}\ j_\theta(m) \overset{\text{def}}{=} j(m) + \theta\,\phi(m),$$

with the penalty function

$$\phi(m) \overset{\text{def}}{=} \tfrac12\Big(\|[Sm - \psi_b]_+\|^2_{L^2(\Omega)^n} + \|[\psi_a - Sm]_+\|^2_{L^2(\Omega)^n}\Big).$$

First order optimality conditions:

$$j'(m) + \theta S^*\big([Sm - \psi_b]_+ - [\psi_a - Sm]_+\big) = 0 \quad \text{in } M^*.$$

If $j$ is twice continuously differentiable, then this is a semismooth operator equation since $[\,\cdot\,]_+$ is semismooth as a map

$$Q \subset L^q(\Omega)^n \to L^{\frac{q}{q-1}}(\Omega)^n \subset Q^*, \quad q > 2.$$

We also can prove error estimates for the regularized solution in terms of $\theta$.

Outline of the Optimization Algorithm

(outer loop: Moreau-Yosida penalty method; inner loop: trust-region method)

Choose θ_0 > 0, an initial parameter m^0_{θ_0} and ε > 0.
for k = 0, 1, 2, ... do
    Obtain an approximate solution m*_{θ_k} of (P_{θ_k}):
    for i = 0, 1, 2, ... do
        Obtain iterates m^{i+1}_{θ_k} by solving a
        trust-region subproblem with a
        matrix-free Newton-PCG method.
    end for
    if (violation of feasibility & optimality at m*_{θ_k}) < ε then
        Stop with m* = m*_{θ_k}.
    else
        Choose θ_{k+1} > θ_k.
        Set m^0_{θ_{k+1}} = m*_{θ_k}.
    end if
end for
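The outer continuation loop of this outline can be illustrated on a toy problem. The sketch below is not the seismic solver: it assumes the separable objective j(m) = 1/2 |m - d|^2 with S = I and only an upper bound ψ_b, and it replaces the trust-region Newton-PCG inner solver by a plain componentwise semismooth Newton iteration. The names `solve_penalized` and `moreau_yosida_continuation`, the factor 10 for increasing θ, and the data are all illustrative assumptions.

```python
def solve_penalized(m0, d, psi_b, theta, max_iter=20):
    """Inner solver (stand-in for trust-region Newton-PCG):
    semismooth Newton on (m - d) + theta * [m - psi_b]_+ = 0, componentwise."""
    m = list(m0)
    for _ in range(max_iter):
        r = [mi - di + theta * max(0.0, mi - psi_b) for mi, di in zip(m, d)]
        if max(abs(ri) for ri in r) <= 1e-10 * (1.0 + theta):
            break
        g = [1.0 if mi > psi_b else 0.0 for mi in m]
        m = [mi - ri / (1.0 + theta * gi) for mi, ri, gi in zip(m, r, g)]
    return m


def moreau_yosida_continuation(d, psi_b, theta0=1.0, eps=1e-4, factor=10.0, max_outer=30):
    """Outer loop: solve (P_theta), test feasibility, increase theta, warm start."""
    theta = theta0
    m = [0.0] * len(d)
    for _ in range(max_outer):
        m = solve_penalized(m, d, psi_b, theta)  # warm start from the previous theta
        violation = max(max(0.0, mi - psi_b) for mi in m)
        if violation < eps:
            break
        theta *= factor
    return m, theta
```

In this toy setting the penalized solution for a component with d_i > ψ_b is (d_i + θ ψ_b) / (1 + θ), so the constraint violation decays like 1/θ and the loop stops once θ is large enough for the chosen ε.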


Simultaneous Sources – Motivation

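$$\min_{m\in M_{ad}}\ \tfrac12 \sum_{i=1}^{n_s} \|u_i(m) - u_i^\delta\|^2_{L^2(\Omega_{obs}\times I)} + \alpha J_{reg}(m)$$

$$\text{s.t.} \quad E(u_i, m) = f_i, \quad u_i(0) = 0, \quad (u_i)_t(0) = 0 \quad (1 \le i \le n_s).$$

Required number of simulations:
objective: $n_s$
gradient: $+\ n_s$
Newton step: $+\ 2\,n_s \cdot \mathrm{it}_{cg}$

Idea: Replace individual seismic events by simultaneous “super-shots”.

$\Longrightarrow$ Source-encoding: For $w \in \mathbb{R}^{n_s}$ define $u(m; w)$ as the solution to:

$$E(u, m) = \sum_{i=1}^{n_s} w_i f_i, \quad u(0) = 0, \quad u_t(0) = 0.$$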


Sample Average Approximation

Define

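$$J_{fit,w}(m) := \tfrac12\,\Big\|u(m; w) - \sum_{i=1}^{n_s} w_i u_i^\delta\Big\|^2_{L^2(\Omega_{obs}\times I)}$$

and for $W_K = (w_1, \ldots, w_K) \in \mathbb{R}^{n_s\times K}$ consider

$$\min_{m\in M_{ad}}\ j(m; W_K) := \frac{1}{K} \sum_{k=1}^{K} J_{fit,w_k}(m) + \frac{\alpha}{2}\,\|m\|^2_M.$$

+ Requires only 2K simulations for objective and gradient.

− Possible loss of information due to interference.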


Marmousi Test Data

Synthetic data set provided by Institut Français du Pétrole.

2D domain, 9.2km × 3.1km,

190 seismic sources at 36m depth,

384 receivers, equidistant at 100m depth,

perfectly elastic, isotropic material with constant Poisson’s ratio.

m: 50k nodes, u: 800k nodes, 4k time steps.

[Figures: geophysical exploration; displacement field (snapshot)]


Sample Average Gradient

$W_K \in \{-1, 1\}^{n_s\times K}$, i.i.d. Rademacher distributed.

[Figure: sample average gradients in four panels (K = 1 super-shot; K = 8 super-shots; 8 single sources; all 190 sources), each plotted as depth (km) from 0 to 3 against length (km) from 0 to 8]


Reconstruction

[Figure: reconstructions with 1 super-shot, 8 super-shots, and 16 super-shots; shown is λ, difference from initial model]


Reconstruction - Computational Effort

K super-shots, $W_K \in \{-1, 1\}^{n_s\times K}$, i.i.d. Rademacher distributed.

        tol = 10^-3                      tol = 10^-6
K       iter.  avg. cg it  # PDEs       iter.  avg. cg it  # PDEs
1       24     14.3        810          41     25.0        2255
2       25     15.3        1784         46     26.6        5354
4       28     18.8        4796         38     24.4        8196
8       24     17.0        7504         30     21.6        11584
16      23     15.6        13376        31     21.9        24256

Can we do better?


Approximation of the Hessian

$$j''(m; W_K) = \frac{1}{K} \sum_{k=1}^{K} j''(m; w_k).$$

Idea: Use mini-batch to generate curvature information.

In every Newton iteration: Choose $S \subset \{w_1, \ldots, w_K\}$ and approximate

$$j''(m; W_K) \approx \frac{1}{|S|} \sum_{w_\kappa\in S} j''(m; w_\kappa).$$

+ Reduces number of PDE solves by a factor of K/|S|.

+ Reduces memory requirements by a factor of K/|S|.

− Only approximation to the true Hessian.


Computational Effort with Mini-Batch Hessian

$W_K \in \{-1, 1\}^{n_s\times K}$, i.i.d. Rademacher distributed,

K = 8 super-shots for objective function and gradient,

Hessian approx. with $S = \{w_k\}$ (k chosen cyclically).

[Figure: left: relative misfit vs. number of PDE simulations (0 to 10000); right: relative optimality vs. number of PDE simulations (logarithmic scale, 10^-6 to 10^0); curves: full Hessian, L-BFGS, mini-batch Hessian]


Comparison: Unconstrained vs. Bounds on Wave Velocities

$W_K \in \{-1, 1\}^{n_s\times K}$, i.i.d. Rademacher distributed,

unconstrained vs. lower bound on P-wave velocity: 1450 m/s,

K super-shots, Hessian approx. with $S = \{w_k\}$ (k chosen cyclically).

        w/o constraints                  with constraints
K       iter.  avg. cg it  # PDEs       iter.  avg. cg it  # PDEs
8       65     23.2        5058         66     24.0        5092
16      62     22.5        6112         65     24.4        6358

Total no. of PDE solves is less than 16 cg iterations without super-shots!


Parallel Scaling Statistics - Multiple Events

#cores            32      64      128     256     512     1024
#events           32      32      32      32      32      32
#cores / event    1       2       4       8       16      32
total time (s)    265.3   130.1   66.6    33.7    18.5    9.4
speed-up          1.0     2.04    3.99    7.88    14.36   28.27
par. efficiency   1.000   1.020   0.996   0.985   0.898   0.883

Strong scaling results: 2d elastic wave equation, 32 events
Discretization: 12,288 elements, 197,633 dofs, 6,000 time steps

All computations carried out on Piz Daint (Cray XC30, Xeon E5), Swiss National Supercomputing Centre


Parallel Scaling Statistics

#cpu cores           8       64       512       4096
#elements            8,000   64,000   512,000   4,096,000
total time (s)       16.7    17.0     17.3      17.9
scaling efficiency   1.000   0.979    0.963     0.935

Weak scaling results: 3d elastic wave equation
Discretization: 68,921 dofs per core, 1,000 time steps

All computations carried out on Piz Daint (Cray XC30, Xeon E5), Swiss National Supercomputing Centre


Mesh-Independence of Semismooth Newton


Challenges due to Nonsmoothness

Parametric stability of the radius of fast convergence, mesh-independence results, implicit function theorems, and other related topics are significantly more challenging in nonsmooth settings than in the smooth case:

In the smooth case, the Jacobian $H'(w)$ at $w$ induces a good linear model $H(w) + H'(w)d$ of $H(w + d)$ in a neighborhood of $w$.

If $H$ is $C^1$, then the model varies continuously with $w$.

In the nonsmooth case, however, the linear operator $M(w + d)$ in the “linear” model has to be chosen depending on $w + d$, not $w$: $H(w) + M(w + d)d$.

In finite dimensions, upper semicontinuity and compact-valuedness of $\partial H$ are helpful and yield, e.g., $\mathrm{dist}(M(w + d), \partial H(w)) \to 0$ as $d \to 0$.

In infinite dimensions, such properties of ∂H are not available.


Structure of a Semismooth Newton Mesh-Independence Result

Accuracy of discretization measured by mesh size $h \in (0, h_0]$, $h_0 > 0$. Discrete spaces: $W_h \subset W$, $Z_h \subset Z$.

Equation: $H(w) = 0$, $H : W \to Z$

Semismooth Newton: $w^{k+1} = w^k + s^k$, $M_k s^k = -H(w^k)$.

Discretized Equation: $H_h(w_h) = 0$, $H_h : W_h \to Z_h$

Discrete Semismooth Newton: $w_h^{k+1} = w_h^k + s_h^k$, $M_{h,k}\, s_h^k = -H_h(w_h^k)$.

Let solutions $\bar w$ and $\bar w_h$, $h \in (0, h_0)$, be given with

$$\|\bar w_h - \bar w\|_W \to 0 \quad \text{as } h \to 0.$$

Our mesh-independence results have the following flavor:

For all $\eta \in (0,1)$, there exist $h_1 \le h_0$, $\delta > 0$, such that for all $h \in (0, h_1]$:

$$\|w^k - \bar w\|_W < \delta,\ \ \|w_h^k - \bar w_h\|_W < \delta \quad \Longrightarrow \quad \|w^{k+1} - \bar w\|_W \le \eta\,\|w^k - \bar w\|_W, \quad \|w_h^{k+1} - \bar w_h\|_W \le \eta\,\|w_h^k - \bar w_h\|_W.$$

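Structure of a Semismooth Newton Mesh-Independence Result (2)

Proving mesh-independence can be split in two parts:

There exist $h_1 \le h_0$, $\delta_0 > 0$, $C > 0$ such that for all $h \in (0, h_1]$ there holds:

Uniform Regularity Condition:

$$\|M^{-1}\|_{Z\to W} \le C \quad \forall\, M \in \partial H(w),\ w \in B_W(\bar w, \delta_0),$$

$$\|M_h^{-1}\|_{Z_h\to W_h} \le C \quad \forall\, M_h \in \partial H_h(w_h),\ w_h \in B_{W_h}(\bar w_h, \delta_0).$$

Mesh-Independent Semismoothness:

For all $\eta \in (0,1)$ there exists $\delta \in (0, \delta_0)$ such that for all $w \in B_W(\bar w, \delta)$, $M \in \partial H(w)$, $w_h \in B_{W_h}(\bar w_h, \delta)$, $M_h \in \partial H_h(w_h)$:

$$\|H(w) - H(\bar w) - M(w - \bar w)\|_Z \le \eta\,\|w - \bar w\|_W,$$

$$\|H_h(w_h) - H_h(\bar w_h) - M_h(w_h - \bar w_h)\|_Z \le \eta\,\|w_h - \bar w_h\|_W.$$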


Mesh-Independence Result for VIs in L2

We consider the following complementarity problem: Find w ∈ W := L2(Ω) such that

w ≥ 0, F (w) := G(w) + γw ≥ 0, w (G(w) + γw) = 0,

where G : W → Lp(Ω), p > 2 is C1.

Semismooth Reformulation:

(P) H(w) := w − [−γ−1G(w)]+ = 0.

Suitable FE discretization (e.g., piecewise constant finite elements for wh) results in the complementarity problem of finding wh ∈ Wh ⊂ W with

wh ≥ 0, Gh(wh) + γwh ≥ 0, wh (Gh(wh) + γwh) = 0.

The corresponding semismooth reformulation is given by

(Ph) Hh(wh) := wh − [−γ−1Gh(wh)]+ = 0.


Assumptions

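For brevity, let $h = 0$ correspond to the continuous problem: $(P_0) = (P)$. Furthermore, we define $B_h(\bar w_h, \delta) := \{w_h \in W_h \,;\ \|w_h - \bar w_h\|_{L^2} < \delta\}$.

Assumptions:

There exist $h_0, \delta_0 > 0$, $p > 2$ and $L > 0$ such that for all $h \in [0, h_0]$:

$\bar w = \bar w_0$ solves (P) and $\bar w_h$ solves $(P_h)$.

$\|\bar w_h - \bar w\|_{L^2} \to 0$ as $h \to 0^+$.

$\|G_h(\bar w_h) - G(\bar w)\|_{L^p} \to 0$ as $h \to 0^+$.

$G_h : W_h \to W_h$ is $C^1$ on $B_h(\bar w_h, \delta_0)$.

$\|G_h(w_h^1) - G_h(w_h^2)\|_{L^p} \le L_G\,\|w_h^1 - w_h^2\|_{L^2}$ for all $w_h^i \in B_h(\bar w_h, \delta_0)$.

$\|G_h'(w_h^1) - G_h'(w_h^2)\|_{W_h\to W_h} \le L_{G'}\,\|w_h^1 - w_h^2\|_{L^2}$ for all $w_h^i \in B_h(\bar w_h, \delta_0)$.

Strict complementarity: $\mathrm{meas}(\{x \in \Omega \,;\ G(\bar w)(x) = 0\}) = 0$.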


A Mesh-Independence Result

cf. Hintermüller, M.U. ’04; M.U. ’11

Under these assumptions, there holds (see M.U. ’11 for refinements):

Mesh-Independent Semismoothness:

For all $\eta \in (0,1)$ there exist $\delta \in (0, \delta_0]$ and $h' \in (0, h_0]$ such that, for all $h \in [0, h']$, all $s_h \in B_h(0, \delta)$, and all $M_h \in \partial H_h(\bar w_h + s_h)$:

$$\|H_h(\bar w_h + s_h) - H_h(\bar w_h) - M_h s_h\|_{L^2} \le \eta\,\|s_h\|_{L^2}.$$

Mesh-Independence Result:

Let $\|M_h^{-1}\|_{W_h\to W_h} \le C_M$ for all $M_h \in \partial H_h(w_h)$, $w_h \in B_h(\bar w_h, \delta_0)$, $h \le h_0$.

Then for all $\eta \in (0,1)$ there exist $\delta \in (0, \delta_1]$ and $h' \in (0, h_1]$ such that, for all $h \in [0, h']$ and all $w_h^0$ with $\|w_h^0 - \bar w_h\|_{L^2} < \delta$, the semismooth Newton method for $(P_h)$ converges to $\bar w_h$ with at least q-linear rate $\eta$:

$$\|w_h^{k+1} - \bar w_h\|_{L^2} \le \eta\,\|w_h^k - \bar w_h\|_{L^2} \quad \forall\, k \ge 0.$$


Mesh-Independent Order of Semismoothness

One can find examples showing that order of semismoothness at a point $w$ is not stable w.r.t. perturbations of $w$.

This is bad news for mesh-independent order of semismoothness.

In M.U. ’11 two solutions are provided:

If $G_h'$ are locally Hölder continuous at $\bar w_h$, $h \in [0, h']$, then we can show a uniform order $\alpha > 0$ of semismoothness for all $h \in [0, h']$ and all $s_h \in B_h(0, \delta)$ under the following

Uniform Growth Condition for Complementarity:

There exist constants $C > 0$, $\kappa > 0$, and $\tau > 0$ with

$$\mathrm{meas}(\{x \,;\ 0 < |G_h(\bar w_h)(x)| < t\}) \le C t^\kappa \quad \text{for all } t \in (0, \tau],\ h \in [0, h'].$$


Mesh-Independent Order of Semismoothness (2)

Alternatively, if the above uniform growth condition for complementarity is replaced by a condition only for $h = 0$,

$$\mathrm{meas}(\{x \,;\ |G(\bar w)(x)| < t\}) \le C t^\kappa \quad \text{for all } t \in (0, \tau],$$

we can show a bit less:

Mesh-Independent Order of Semismoothness Result:

There exist $\delta \in (0, \delta_0]$, $h' \in (0, h_0]$, and $C' > 0$ such that, for all $h \in [0, h']$, all $s_h \in B_h(0, \delta)$, and all $M_h \in \partial H_h(\bar w_h + s_h)$:

$$\|H_h(\bar w_h + s_h) - H_h(\bar w_h) - M_h s_h\|_{L^2} \le C' \max\big\{\|s_h\|^\alpha_{L^2},\ \|G_h(\bar w_h) - G(\bar w)\|^\alpha_{L^2}\big\}\,\|s_h\|_{L^2}$$

with $\alpha = \dfrac{(p-2)\kappa}{2(\kappa + p)}$.

Corresponding mesh-independent q-orders of local convergence for semismooth Newton methods can be shown.


Conclusions and Final Remarks

Semismooth Newton methods are an efficient tool for handling inequality constraints, variational inequalities, and structured nonsmooth optimization problems.

They have been successfully applied in many fields.

A quite comprehensive theory on semismoothness and semismooth Newton methods is available and is still developing.

You should consider making them part of your toolbox!

Also in other fields, e.g., machine learning and big data, second-order methods are about to become increasingly important.


Many thanks for your attention!

