TRANSCRIPT
Introductory Course on Non-smooth Optimisation
Lecture 03 - Krasnosel’skiĭ-Mann iteration
Jingwei Liang
Department of Applied Mathematics and Theoretical Physics
Table of contents
1 Method of alternating projection
2 Monotone and non-expansive mappings
3 Krasnosel’skiĭ-Mann iteration
4 “Accelerated” Krasnosel’skiĭ-Mann iteration
Recap
Recap of descent methods
Descent methods include gradient descent and proximal gradient descent.
Convergence (rate) properties:
– objective function value:
◦ O(1/k) convergence rate;
◦ optimal O(1/k²) convergence rate.
– sequence:
◦ O(1/√k) convergence rate;
◦ optimal O(1/k) convergence rate.
– linear convergence under e.g. strong convexity.
NB: end of happiness: most of the above results, especially those for objective function values, will not hold for non-descent-type methods.
Operator splitting
Consider the problem
min_{x∈R^n} µ₁||x||₁ + µ₂||∇x||₁ + (1/2)||Ax − f||².
In 1D, both prox_{γ||·||₁}(·) and prox_{γ||∇·||₁}(·) have closed-form solutions. However, prox_{γ(||·||₁+||∇·||₁)}(·) does not.
Operator splitting designs a properly structured scheme such that:
– the proximal mappings of the non-smooth functions are evaluated separately;
– gradient descent is applied to the smooth part.
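To make the splitting concrete, here is a minimal Python sketch (assuming NumPy; the names prox_l1 and prox_grad_step are illustrative, not from the lecture): the ℓ₁-prox is a closed-form soft-threshold evaluated on its own, while the smooth data term only contributes a gradient step. The ||∇x||₁ term would be handled by its own prox in a further splitting.

```python
import numpy as np

def prox_l1(x, gamma):
    """prox of gamma*||.||_1: elementwise soft-thresholding (closed form)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_grad_step(x, A, f, mu1, gamma):
    """One proximal-gradient step for mu1*||x||_1 + (1/2)||Ax - f||^2."""
    grad = A.T @ (A @ x - f)                       # gradient of the smooth part
    return prox_l1(x - gamma * grad, gamma * mu1)  # prox of the non-smooth part
```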
Outline
1 Method of alternating projection
2 Monotone and non-expansive mappings
3 Krasnosel’skiĭ-Mann iteration
4 “Accelerated” Krasnosel’skiĭ-Mann iteration
Feasibility problem
Consider finding a common point:
find x ∈ X ∩ Y,
where X, Y ⊆ R^n are two closed and convex sets.
Method of alternating projection
Equivalent formulation
min_{x∈R^n} ι_X(x) + ι_Y(x).
Method of alternating projection (MAP)
initial: x_0 ∈ X;
repeat:
1. Projection onto Y: y_k = P_Y(x_k)
2. Projection onto X: x_{k+1} = P_X(y_k)
until: stopping criterion is satisfied.
The projections onto the two sets are computed separately.
Stopping criterion: ||x_k − x_{k−1}|| ≤ ε.
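As an illustration, here is a minimal MAP sketch in Python, assuming (purely for the example) that X is a Euclidean ball and Y a halfspace, so both projections have closed form; the helper names are hypothetical.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Projection onto X = {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

def project_halfspace(x, a, b):
    """Projection onto Y = {x : <a, x> <= b}."""
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

def alternating_projection(x0, a, b, eps=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        y = project_halfspace(x, a, b)        # 1. projection onto Y
        x_new = project_ball(y)               # 2. projection onto X
        if np.linalg.norm(x_new - x) <= eps:  # stopping criterion
            return x_new
        x = x_new
    return x

x_common = alternating_projection(np.array([3.0, 2.0]), np.array([1.0, 0.0]), 0.5)
```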
Convergence analysis
MAP
x_{k+1} = P_X ◦ P_Y(x_k).
Convergence properties
– Is there a convergence result for the objective function value?
– Do the sequences {x_k}_{k∈N} and {y_k}_{k∈N} converge?
Outline
1 Method of alternating projection
2 Monotone and non-expansive mappings
3 Krasnosel’skiĭ-Mann iteration
4 “Accelerated” Krasnosel’skiĭ-Mann iteration
Notations
Given two non-empty sets X, U ⊆ R^n, A: X ⇒ U is called a set-valued operator if A maps every point in X to a subset of U, i.e.
A: X ⇒ U, x ∈ X ↦ A(x) ⊆ U.
The graph of A is defined by
gra(A) def= {(x, u) ∈ X × U : u ∈ A(x)}.
The domain and range of A are
dom(A) def= {x ∈ X : A(x) ≠ ∅}, ran(A) def= A(X).
The inverse of A is defined through its graph:
gra(A⁻¹) def= {(u, x) ∈ U × X : u ∈ A(x)}.
The set of zeros of A is
zer(A) def= A⁻¹(0) = {x ∈ X : 0 ∈ A(x)}.
Monotone operator
Let X, U ⊆ R^n be two non-empty convex sets. A: X ⇒ U is monotone if
⟨x − y, u − v⟩ ≥ 0, ∀(x, u), (y, v) ∈ gra(A).
It is moreover maximal monotone if gra(A) is not strictly contained in the graph of any other monotone operator.
A is called κ-strongly monotone for some κ > 0 if
⟨x − y, u − v⟩ ≥ κ||x − y||², ∀(x, u), (y, v) ∈ gra(A).
Lemma
Let R ∈ Γ₀, then ∂R is maximal monotone.
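A quick numerical illustration of the lemma (not a proof): for R = |·| on R, the selection sign(·) ∈ ∂R satisfies the monotonicity inequality at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2)
    u, v = np.sign(x), np.sign(y)   # subgradients of |.| at x and y
    assert (x - y) * (u - v) >= 0   # <x - y, u - v> >= 0
```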
Cocoercive operator
An operator B: R^n → R^n is called β-cocoercive if there exists β > 0 such that
β||B(x) − B(y)||² ≤ ⟨B(x) − B(y), x − y⟩, ∀x, y ∈ R^n.
The above inequality implies that B is (1/β)-Lipschitz continuous.
Baillon-Haddad theorem
Let F ∈ C¹_L, then ∇F is (1/L)-cocoercive.
Lemma
Let C: R^n ⇒ R^n be β-strongly monotone, then its inverse C⁻¹ is β-cocoercive.
Resolvent of monotone operator
Let A: R^n ⇒ R^n be a maximal monotone operator and γ > 0. The resolvent of A is defined by
J_A def= (Id + A)⁻¹.
The reflection of J_A is defined by
R_A def= 2J_A − Id.
Given a function R ∈ Γ₀ and its sub-differential ∂R,
prox_R = J_{∂R}.
Set of fixed points, x = prox_R(x):
fix(prox_R) = fix(J_{∂R}) = zer(∂R).
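For R = γ|·| the resolvent relation can be verified directly: p = prox_R(x) solves the inclusion p + γ∂|p| ∋ x. A small sanity check under this one-dimensional assumption:

```python
import numpy as np

def prox_abs(x, gamma):
    """prox of gamma*|.|, i.e. the resolvent of gamma*d|.| (soft-threshold)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

gamma = 0.5
for x in np.linspace(-2.0, 2.0, 401):
    p = prox_abs(x, gamma)
    if p != 0:
        assert np.isclose(p + gamma * np.sign(p), x)  # p + gamma*sign(p) = x
    else:
        assert abs(x) <= gamma + 1e-12                # 0 solves it iff |x| <= gamma
```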
Yosida approximation
Let A: R^n ⇒ R^n be a maximal monotone operator and γ > 0. The Yosida approximation of A with γ is
γA def= (1/γ)(Id − J_{γA}) = (γId + A⁻¹)⁻¹ = J_{A⁻¹/γ}(·/γ).
Moreover,
Id = J_{γA}(·) + γJ_{A⁻¹/γ}(·/γ).
γA is γ-cocoercive.
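For A = ∂|·| the Yosida approximation is explicit (it is the derivative of the Huber function), so the claimed γ-cocoercivity is easy to test numerically; a sketch under that assumption:

```python
import numpy as np

def yosida_abs(x, gamma):
    """(1/gamma)*(Id - J_{gamma*A}) for A = d|.|; equals clip(x/gamma, -1, 1)."""
    prox = np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)  # J_{gamma*A}(x)
    return (x - prox) / gamma

rng = np.random.default_rng(1)
gamma = 0.7
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=2)
    u, v = yosida_abs(x, gamma), yosida_abs(y, gamma)
    assert gamma * (u - v) ** 2 <= (u - v) * (x - y) + 1e-12  # gamma-cocoercivity
```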
Non-expansive operator
An operator T: R^n → R^n is called non-expansive if it is 1-Lipschitz continuous, i.e.
||T(x) − T(y)|| ≤ ||x − y||, ∀x, y ∈ R^n.
For any α ∈ ]0, 1[, T is α-averaged if there exists a non-expansive operator T′ such that
T = αT′ + (1 − α)Id.
A(α) denotes the class of α-averaged operators on R^n.
A(1/2) is the class of firmly non-expansive operators.
Properties: α-averaged operators
Lemma
Let T: R^n → R^n be non-expansive and α ∈ ]0, 1[. The following statements are equivalent:
– T is α-averaged non-expansive.
– The operator (1 − 1/α)Id + (1/α)T is non-expansive.
– For any x, y ∈ R^n,
||T(x) − T(y)||² ≤ ||x − y||² − ((1 − α)/α)||(Id − T)(x) − (Id − T)(y)||².
Properties: α-averaged operators
A(α) is closed under relaxations, convex combinations and compositions.
Lemma
Let m ∈ N₊, {T_i}_{i∈{1,...,m}} be non-expansive operators on R^n, (ω_i)_i ∈ ]0, 1]^m with Σ_i ω_i = 1, and (α_i)_i ∈ ]0, 1]^m such that T_i ∈ A(α_i), i ∈ {1, ..., m}. Then:
– Id + λ_i(T_i − Id) ∈ A(λ_i α_i) for λ_i ∈ ]0, 1/α_i[ and i ∈ {1, ..., m}.
– Σ_i ω_i T_i ∈ A(α) with α = max_i α_i.
– T_1 ◦ · · · ◦ T_m ∈ A(α) with α = m / (m − 1 + 1/max_{i∈{1,...,m}} α_i).
Remark
For the composition of two averaged operators, a sharper bound on α can be obtained:
α = (α₁ + α₂ − 2α₁α₂) / (1 − α₁α₂) ∈ ]0, 1[.
Properties: firmly non-expansive operators
Lemma
Let T: R^n → R^n be non-expansive. The following statements are equivalent:
– T is firmly non-expansive.
– Id − T is firmly non-expansive.
– 2T − Id is non-expansive.
– ||T(x) − T(y)||² ≤ ⟨T(x) − T(y), x − y⟩, ∀x, y ∈ R^n.
– T is the resolvent of a maximal monotone operator A, i.e. T = J_A.
Lemma
Let B: R^n → R^n be β-cocoercive for some β > 0. Then:
– βB ∈ A(1/2), i.e. βB is firmly non-expansive.
– Id − γB ∈ A(γ/(2β)) for γ ∈ ]0, 2β[.
Outline
1 Method of alternating projection
2 Monotone and non-expansive mappings
3 Krasnosel’skiĭ-Mann iteration
4 “Accelerated” Krasnosel’skiĭ-Mann iteration
Fixed point
Let T: R^n → R^n be a non-expansive operator. x ∈ R^n is called a fixed point of T if
x = T(x).
The set of fixed points of T is denoted fix(T).
fix(T) may be empty, e.g. translation by a non-zero vector.
Lemma
Let X be a non-empty bounded closed convex subset of R^n and T: X → X be a non-expansive operator, then fix(T) ≠ ∅.
Lemma
Let X be a non-empty closed convex subset of R^n and T: X → R^n be a non-expansive operator, then fix(T) is closed and convex.
Krasnosel’skiĭ-Mann iteration
Fixed-point iteration
x_{k+1} = T(x_k).
However, it may not converge, e.g. T = −Id.
Krasnosel’skiĭ-Mann iteration
Let T: R^n → R^n be a non-expansive operator such that fix(T) ≠ ∅. Let λ_k ∈ [0, 1] and choose x_0 arbitrarily from R^n, then the Krasnosel’skiĭ-Mann iteration of T reads
x_{k+1} = x_k + λ_k(T(x_k) − x_k).
If T ∈ A(α), then λ_k ∈ [0, 1/α].
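A minimal generic implementation (the function name km is illustrative). For T = −Id, which defeats the plain fixed-point iteration above, any λ ∈ ]0, 1[ restores convergence to fix(T) = {0}.

```python
import numpy as np

def km(T, x0, lam=0.5, eps=1e-10, max_iter=100_000):
    """Krasnosel'skii-Mann iteration x_{k+1} = x_k + lam*(T(x_k) - x_k)."""
    x = x0
    for _ in range(max_iter):
        x_new = x + lam * (T(x) - x)
        if np.linalg.norm(x_new - x) <= eps:
            return x_new
        x = x_new
    return x

x_star = km(lambda x: -x, np.array([1.0, -2.0]))   # converges to 0
```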
Fejér monotonicity
Let S ⊆ R^n be a non-empty set and {x_k}_{k∈N} be a sequence in R^n. Then
– {x_k}_{k∈N} is Fejér monotone with respect to S if
||x_{k+1} − x|| ≤ ||x_k − x||, ∀x ∈ S, ∀k ∈ N.
– {x_k}_{k∈N} is quasi-Fejér monotone with respect to S if there exists a summable sequence {ε_k}_{k∈N} ∈ ℓ¹₊ such that
||x_{k+1} − x|| ≤ ||x_k − x|| + ε_k, ∀x ∈ S, ∀k ∈ N.
Example
Let X ⊆ R^n be a non-empty convex set, and T: X → X be a non-expansive operator such that fix(T) ≠ ∅. The sequence {x_k}_{k∈N} generated by
x_{k+1} = T(x_k)
is Fejér monotone with respect to fix(T).
Convergence
Lemma
Let S ⊆ R^n be a non-empty set and {x_k}_{k∈N} be a sequence in R^n. Assume that {x_k}_{k∈N} is quasi-Fejér monotone with respect to S, then the following hold:
– {x_k}_{k∈N} is bounded.
– ||x_k − x|| is bounded for any x ∈ S.
– {dist(x_k, S)}_{k∈N} is decreasing and convergent.
– If every sequential cluster point of {x_k}_{k∈N} belongs to S, then {x_k}_{k∈N} converges to a point in S.
NB: in a general real Hilbert space, the convergence is weak.
Convergence
Let T: R^n → R^n be a non-expansive operator such that fix(T) ≠ ∅. Consider the Krasnosel’skiĭ-Mann iteration of T, and choose λ_k ∈ [0, 1] such that
Σ_{k∈N} λ_k(1 − λ_k) = +∞,
then the following hold:
– {x_k}_{k∈N} is Fejér monotone with respect to fix(T).
– {x_k − T(x_k)}_{k∈N} converges strongly to 0.
– {x_k}_{k∈N} converges to a point in fix(T).
Remark
When T is α-averaged, choose λ_k ∈ [0, 1/α] such that Σ_{k∈N} λ_k(1/α − λ_k) = +∞.
Preliminary
Krasnosel’skiĭ-Mann iteration with constant relaxation
x_{k+1} = x_k + λ(T(x_k) − x_k) = ((1 − λ)Id + λT)(x_k).
Denote T_λ = (1 − λ)Id + λT, and define the residual
e_k = (Id − T)(x_k) = (1/λ)(x_k − x_{k+1}).
– T_λ ∈ A(λ) if λ ∈ ]0, 1[. If T ∈ A(α), then T_λ ∈ A(λα).
– For any x* ∈ fix(T),
x* ∈ fix(T) ⇔ x* ∈ fix(T_λ) ⇔ x* ∈ zer(Id − T).
– If λ ∈ [ε, 1 − ε] with ε ∈ ]0, 1/2]:
◦ e_k converges to 0;
◦ {x_k}_{k∈N} is quasi-Fejér monotone with respect to fix(T), and converges to a point x* ∈ fix(T).
Pointwise convergence rate
Rate of ||e_k||²:
– Residual: ||e_{k+1}||² ≤ ||e_k||² − ((1 − λ)/λ)||e_k − e_{k+1}||², hence ||e_k|| is non-increasing.
– Since T_λ ∈ A(λ), with τ = λ(1 − λ):
||x_{k+1} − x*||² ≤ ||x_k − x*||² − τ||e_k||².
– Summation:
τ(k + 1)||e_k||² ≤ τ Σ_{i=0}^{k} ||e_i||² ≤ ||x_0 − x*||² − ||x_{k+1} − x*||².
– Rate:
||e_k||² ≤ ||x_0 − x*||² / (τ(k + 1)).
NB: if T ∈ A(α), then the above holds for λ ∈ [ε, 1/α − ε].
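The bound can be checked empirically; below T is a composition of two projections (hence non-expansive, with 0 ∈ fix(T)) and λ = 1/2. The set choices are illustrative only.

```python
import numpy as np

a = np.array([1.0, 1.0])
P_line = lambda x: x - ((a @ x) / (a @ a)) * a                 # onto {x : <a,x> = 0}
P_half = lambda x: x if x[0] <= 1 else np.array([1.0, x[1]])   # onto {x : x_1 <= 1}
T = lambda x: P_line(P_half(x))

lam = 0.5
tau = lam * (1 - lam)
x0 = np.array([5.0, -3.0])
x, x_star = x0.copy(), np.zeros(2)                 # 0 is in fix(T)
for k in range(200):
    e = x - T(x)                                   # residual e_k = (Id - T)(x_k)
    assert e @ e <= (x0 - x_star) @ (x0 - x_star) / (tau * (k + 1)) + 1e-12
    x = x + lam * (T(x) - x)                       # KM step
```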
Ergodic convergence rate
Define ē_k = (1/(k + 1)) Σ_{i=0}^{k} e_i.
– Boundedness:
||x_{k+1} − x*|| = ||T_λ(x_k) − T_λ(x*)|| ≤ ||x_k − x*|| ≤ ||x_0 − x*||.
– Since λe_i = x_i − x_{i+1}, the sum telescopes:
||ē_k|| = (1/(k + 1))||Σ_{i=0}^{k} e_i|| = (1/(λ(k + 1)))||Σ_{i=0}^{k} (x_i − x_{i+1})||
= (1/(λ(k + 1)))||x_0 − x_{k+1}||
≤ (1/(λ(k + 1)))(||x_0 − x*|| + ||x_{k+1} − x*||)
≤ 2||x_0 − x*|| / (λ(k + 1)).
NB: both rates (pointwise and ergodic) can be extended to the inexact case.
Metric sub-regularity
A set-valued mapping A: R^n ⇒ R^n is called metrically sub-regular at x̄ for ū ∈ A(x̄) if there exists κ ≥ 0 along with a neighbourhood X of x̄ such that
dist(x, A⁻¹(ū)) ≤ κ dist(ū, A(x)), ∀x ∈ X.
The infimum of all κ such that the above holds is called the modulus of metric sub-regularity, denoted subreg(A; x̄|ū).
Example
Let F ∈ S¹_{α,L} and A = γ∇F with γ ≤ 1/L. Take x̄ = argmin_{R^n} F and ū = 0. By strong monotonicity of ∇F,
dist(ū, A(x)) = ||γ∇F(x) − γ∇F(x̄)|| ≥ γα||x − x̄|| = γα dist(x, A⁻¹(ū)),
so A is metrically sub-regular at x̄ for 0 with modulus at most 1/(γα).
(Local) linear convergence
Let x* ∈ fix(T), and suppose T′ def= Id − T is metrically sub-regular at x* for 0, with neighbourhood X of x*. Let κ > subreg(T′; x*|0). Then
0 = T′(x*), T′⁻¹(0) = fix(T),
dist(x, fix(T)) ≤ κ dist(0, T′(x)) = κ||x − T(x)||, ∀x ∈ X.
Denote d_k = dist(x_k, fix(T)), and let x̄ ∈ fix(T) be such that d_k = ||x_k − x̄||. Then
d_{k+1}² ≤ ||x_{k+1} − x̄||² ≤ ||x_k − x̄||² − τ||T′(x_k) − T′(x̄)||²
≤ d_k² − (τ/κ²) d_k²
= (1 − τ/κ²) d_k².
NB: as metric sub-regularity is a local property, the linear convergence happens only once x_k is close enough to fix(T).
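An illustration with a gradient step on a strongly convex quadratic, for which T′ = Id − T = γQ is metrically sub-regular everywhere (the choices of Q, γ, λ are illustrative):

```python
import numpy as np

Q = np.diag([1.0, 4.0])            # strong convexity alpha = 1, Lipschitz L = 4
gamma = 1.0 / 4.0
T = lambda x: x - gamma * (Q @ x)  # non-expansive, fix(T) = {0}

lam = 0.5
tau = lam * (1 - lam)
kappa = 1.0 / (gamma * 1.0)        # dist(x, fix(T)) <= kappa * ||x - T(x)||
rho = 1.0 - tau / kappa**2         # predicted contraction factor for d_k^2

x = np.array([3.0, -2.0])
d_prev = np.linalg.norm(x)         # d_k = dist(x_k, fix(T)) = ||x_k||
for _ in range(50):
    x = x + lam * (T(x) - x)       # KM step
    d = np.linalg.norm(x)
    assert d**2 <= rho * d_prev**2 + 1e-12
    d_prev = d
```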
Optimal relaxation parameter?
Consider λ_k ∈ [0, 1] and x_{k+1} = (1 − λ_k)x_k + λ_k T(x_k). Then
||x_{k+1} − x*||² = ||(1 − λ_k)(x_k − x*) + λ_k(T(x_k) − x*)||²
= (1 − λ_k)||x_k − x*||² + λ_k||T(x_k) − x*||² − λ_k(1 − λ_k)||x_k − T(x_k)||²
= λ_k²||x_k − T(x_k)||² − λ_k(||x_k − x*||² − ||T(x_k) − x*||² + ||x_k − T(x_k)||²) + ||x_k − x*||²,
which is a quadratic function of λ_k, minimised at
λ = 1/2 + (||x_k − x*||² − ||T(x_k) − x*||²) / (2||x_k − T(x_k)||²).
Approximation (x* is unknown, so the differences are estimated from the successive iterates T(x_k), T²(x_k)):
λ = 1/2 + (||x_k − T(x_k)||² − ||T(x_k) − T²(x_k)||²) / (2||(x_k − T(x_k)) − (T(x_k) − T²(x_k))||²).
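A sketch of the resulting adaptive rule (the clipping to [0, 1] is an illustrative safeguard, not part of the derivation):

```python
import numpy as np

def adaptive_relaxation(x, T):
    """Approximate optimal lambda with x* replaced via T(x) and T(T(x))."""
    Tx = T(x)
    r1 = x - Tx                    # x_k - T(x_k)
    r2 = Tx - T(Tx)                # T(x_k) - T^2(x_k)
    denom = 2.0 * float((r1 - r2) @ (r1 - r2))
    if denom == 0.0:
        return 1.0                 # residuals agree: any lambda is optimal
    lam = 0.5 + float(r1 @ r1 - r2 @ r2) / denom
    return float(np.clip(lam, 0.0, 1.0))
```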
Outline
1 Method of alternating projection
2 Monotone and non-expansive mappings
3 Krasnosel’skiĭ-Mann iteration
4 “Accelerated” Krasnosel’skiĭ-Mann iteration
Inertial Krasnosel’skiĭ-Mann iteration
An inertial Krasnosel’skiĭ-Mann iteration
Initial: x_0 ∈ R^n, x_{−1} = x_0;
y_k = x_k + a_k(x_k − x_{k−1}), a_k ∈ [0, 1],
z_k = x_k + b_k(x_k − x_{k−1}), b_k ∈ [0, 1],
x_{k+1} = (1 − λ_k)y_k + λ_k T(z_k), λ_k ∈ [0, 1].
– Covers the heavy-ball method, Nesterov's scheme, inertial FB and FISTA (a sketch of the scheme follows the remarks below).
– Convergence analysis is much harder than for the inertial versions of descent methods.
– No convergence rate is available.
– May perform very poorly in practice, slower than the original scheme.
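A sketch with constant parameters a, b and λ (illustrative defaults):

```python
import numpy as np

def inertial_km(T, x0, a=0.3, b=0.3, lam=0.5, n_iter=1000):
    """Inertial KM: extrapolate along the previous step, then relax towards T."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        y = x + a * (x - x_prev)   # inertial point for the identity part
        z = x + b * (x - x_prev)   # inertial point fed into T
        x_prev, x = x, (1 - lam) * y + lam * T(z)
    return x
```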
A multi-step inertial scheme
A multi-step inertial Krasnosel’skiĭ-Mann iteration
Initial: x_0 ∈ R^n, x_{−1} = x_0 and γ ∈ ]0, 2/L[;
y_k = x_k + a_{0,k}(x_k − x_{k−1}) + a_{1,k}(x_{k−1} − x_{k−2}) + · · ·,
z_k = x_k + b_{0,k}(x_k − x_{k−1}) + b_{1,k}(x_{k−1} − x_{k−2}) + · · ·,
x_{k+1} = (1 − λ_k)y_k + λ_k T(z_k), λ_k ∈ [0, 1].
– Even harder to analyse convergence.
– No rate.
– However, it can outperform the original scheme.
Convergence
Conditional convergence: for i = 0, 1, ..., require
Σ_{k∈N} max{max_i |a_{i,k}|, max_i |b_{i,k}|} Σ_i ||x_{k−i} − x_{k−i−1}|| < +∞.
Online updating rule:
a_{i,k} = min{a_i, c_{i,k}}, where
c_{i,k} = c_i / (k^{1+δ} Σ_i ||x_{k−i} − x_{k−i−1}||), δ > 0.
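A sketch of the online rule, reading a_i as a fixed nominal inertial parameter being capped (an assumption; the helper name is illustrative):

```python
import numpy as np

def online_inertia(k, diffs, a_nominal, c, delta=0.1):
    """a_{i,k} = min(a_i, c_i / (k^{1+delta} * sum_i ||x_{k-i} - x_{k-i-1}||)).

    diffs[i] holds x_{k-i} - x_{k-i-1}; a_nominal and c are per-index constants.
    """
    s = sum(np.linalg.norm(d) for d in diffs)
    if s == 0.0:
        return list(a_nominal)     # steps vanished; the condition holds trivially
    return [min(ai, ci / (k ** (1 + delta) * s)) for ai, ci in zip(a_nominal, c)]
```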
References
– H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
– P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
– J. Liang, J. Fadili, and G. Peyré. Convergence rates with inexact non-expansive operators. Mathematical Programming, 159(1-2):403–434, 2016.