Course #4 - A (very) short introduction to proximal algorithms
J. Bobin - [email protected] - Sparse data analysis in astrophysics
CS-Orion meeting - 01/28/2011 Course #4 - Proximal algorithms
Solving inverse problems
More generally, we will focus on linear inverse problems of the form:

b = Ax + n

where b is the data (observations, etc.), A is the observation operator, x is the signal to be retrieved, and n accounts for noise, model imperfections, etc.

This models many inverse problems arising in physics:
- Denoising (A is the identity operator) [this course]
- Deconvolution (A is the convolution kernel) [this course]
- Inpainting/missing data interpolation (A is a binary mask) [course #4]
- Tomographic reconstruction (A is the partial Radon transform) [course #5]
- Radio-interferometric reconstruction (A is the partial Fourier transform) [course #5]
- Compressed sensing [course #5]
- Blind source separation [courses #6-8]
Solving inverse problems
Let's assume x is sparse in some orthogonal basis: α = Φᵀx.

x = Argmin_{x = Φα} P(α) + ‖b − Φα‖²_ℓ2

The term ‖b − Φα‖²_ℓ2 is the data-fidelity term (it measures how well the model fits the data), and P(α) is a sparsity-enforcing penalty.

Examples of penalty terms:
- P(α) = ‖α‖_ℓ1 = Σ_i |α[i]|
- P(α) = ‖α‖_ℓ0, where the ℓ0 norm counts the number of nonzero elements
Solving inverse problems
Computing the solution to an inverse problem also boils down to solving a minimization of the form:

x = Argmin_x g(x)

Example: least-square estimator, maximum-likelihood estimator, etc. For instance, the Poisson maximum-likelihood estimator reads:

x = Argmin_x Σ_i x_i − b_i log(x_i)

Or more generally,

x = Argmin_x f(x) + g(x)

Example: penalized least-square estimator, etc.:

x = Argmin_x λ‖Φᵀx‖_ℓp + (1/2)‖b − Hx‖²_ℓ2
Let’s warm up with a simple case
Let’s consider the following simple case:
x = Argmin_x g(x)

where g verifies the following properties:
- It is convex: ∀x, y ∈ Dom g, ∀α ∈ [0, 1], g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y)
- It is differentiable: ∇g is defined on Dom g
- Its gradient is Lipschitz: ∀x, y ∈ Dom g, ‖∇g(x) − ∇g(y)‖ ≤ L‖x − y‖

Example: g(x) = ‖b − Hx‖²_ℓ2, with ∇g(x) = 2Hᵀ(Hx − b) and L = 2‖HᵀH‖₂
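This example is easy to verify numerically; a minimal sketch (assuming NumPy, with an arbitrary random Gaussian H and b):

```python
import numpy as np

# Sanity check of the example: g(x) = ||b - Hx||_2^2,
# grad g(x) = 2 H^T (Hx - b), and L = 2 ||H^T H||_2.
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

g = lambda x: np.sum((b - H @ x) ** 2)
grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
L = 2.0 * np.linalg.norm(H.T @ H, 2)   # spectral norm of H^T H

# Finite-difference check of the gradient at a random point.
x = rng.standard_normal(10)
eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(10)])
assert np.allclose(fd, grad_g(x), atol=1e-4)

# The gradient is L-Lipschitz: grad g(x) - grad g(y) = 2 H^T H (x - y).
y = rng.standard_normal(10)
assert np.linalg.norm(grad_g(x) - grad_g(y)) <= L * np.linalg.norm(x - y) + 1e-9
```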
Gradient descent
In that case, having access to first-order information about g, the simplest first-order algorithm is gradient descent:
x^(t+1) = x^(t) − γ∇g(x^(t))

[Figure: level sets g(x) = c with the iterates x^(0), x^(1), x^(2) descending toward the minimum.]
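The iteration above can be sketched in a few lines on the least-square example; H, b, the step size, and the iteration count are illustrative choices (assuming NumPy):

```python
import numpy as np

# Gradient descent on g(x) = ||b - Hx||_2^2.
rng = np.random.default_rng(1)
H = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
L = 2.0 * np.linalg.norm(H.T @ H, 2)   # Lipschitz constant of grad g
gamma = 1.0 / L                        # fixed step size

x = np.zeros(10)
for _ in range(500):
    x = x - gamma * grad_g(x)          # x^(t+1) = x^(t) - gamma * grad g(x^(t))

# The iterates approach the least-square solution.
x_star = np.linalg.lstsq(H, b, rcond=None)[0]
assert np.allclose(x, x_star, atol=1e-3)
```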
A more complex problem
Let’s reconsider the following L1-penalized least-square problem:
x = Argmin_x λ‖Φᵀx‖_ℓ1 + (1/2)‖b − Hx‖²_ℓ2

which is of the form

x = Argmin_x f(x) + g(x)

where f(x) = λ‖Φᵀx‖_ℓ1 is convex but not differentiable, and g(x) = (1/2)‖b − Hx‖²_ℓ2 is convex and differentiable with an L-Lipschitz gradient.
Subgradient
A more precise description of f(x) requires defining the subgradient of a convex function:
∂f(x) = {u ; ∀y ∈ Dom f, f(x) + ⟨y − x, u⟩ ≤ f(y)}

For example, let's go back to the ℓ1 norm (scalar case), f(x) = |x|:
- for x > 0, ∂f(x) = {1}
- for x < 0, ∂f(x) = {−1}
- for x = 0, ∂f(x) = [−1, 1]

[Figure: graph of f(x) = |x| with the subgradient slopes at x = 0.]
Proximal operator
This now allows us to define the key element of proximal calculus: the proximal operator of a function f.

prox_f(x) = Argmin_v f(v) + (1/2)‖x − v‖²_ℓ2

[Figure: Dom f and a level set f(x) = c, illustrating the proximal mapping.]
Proximal operator, properties
Some useful properties (among others):

i) translation: h(x) = f(x − z)  ⇒  prox_h(x) = z + prox_f(x − z)
ii) scaling: h(x) = f(x/ρ)  ⇒  prox_h(x) = ρ prox_{(1/ρ²)f}(x/ρ)
iii) reflection: h(x) = f(−x)  ⇒  prox_h(x) = −prox_f(−x)
iv) conjugation: h(x) = f*(x) = max_z ⟨z, x⟩ − f(z)  ⇒  prox_h(x) = x − prox_f(x)
v) change of basis: h(x) = f(Φᵀx)  ⇒  prox_h(x) = Φ prox_f(Φᵀx) (for orthogonal Φ)
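Property iv), Moreau's identity, is easy to check numerically for f = ‖·‖_ℓ1: its conjugate f* is the indicator of the ℓ∞ unit ball, so prox_{f*} is the entry-wise projection onto [−1, 1]. A small sketch (assuming NumPy):

```python
import numpy as np

# Property iv) for f = ||.||_1: prox_{f*} is the projection onto [-1, 1]
# (np.clip), while prox_f is soft-thresholding with threshold 1.
soft = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.linspace(-3.0, 3.0, 101)
lhs = np.clip(x, -1.0, 1.0)   # prox_{f*}(x)
rhs = x - soft(x, 1.0)        # x - prox_f(x)
assert np.allclose(lhs, rhs)  # Moreau: prox_{f*}(x) = x - prox_f(x)
```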
Proximal operator, examples
The indicator function of a convex set K:

ι_K(x) = 0 if x ∈ K, +∞ otherwise

prox_{ι_K}(x) = P_K(x)

It is the orthogonal projector onto K!

Example of the non-negative orthant K = {u ; u ≥ 0}:

prox_{ι_K}(x) = x if x ≥ 0, 0 otherwise (entry-wise)

The squared ℓ2 norm: f(x) = λ‖x‖²_ℓ2

prox_f(x) = x / (1 + 2λ)
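Both operators are one-liners; a sketch (assuming NumPy, with an arbitrary test vector):

```python
import numpy as np

# prox of the indicator of K = {u : u >= 0} is the projection onto K.
prox_indicator = lambda x: np.maximum(x, 0.0)

# prox of f(x) = lam * ||x||_2^2 solves 2*lam*v + (v - x) = 0: a shrinkage.
prox_sq_l2 = lambda x, lam: x / (1.0 + 2.0 * lam)

x = np.array([-1.5, 0.0, 2.0])
assert np.allclose(prox_indicator(x), [0.0, 0.0, 2.0])
assert np.allclose(prox_sq_l2(x, 0.5), x / 2.0)
```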
Proximal operator, examples
The ℓ1 norm: f(x) = λ‖x‖_ℓ1

By definition of the proximal operator:

prox_f(x) = Argmin_u λ‖u‖_ℓ1 + (1/2)‖x − u‖²_ℓ2

This yields the soft-thresholding operator S_λ, which shrinks each entry toward zero by λ: S_λ(x)[i] = sign(x[i]) max(|x[i]| − λ, 0).

[Figure: the soft-thresholding function, flat on [−λ, λ].]

The Poisson log-likelihood: f(x) = −k log(x) + x

prox_f(x) = (1/2)(x − 1 + √((x − 1)² + 4k))
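Both closed forms can be checked against a brute-force minimization of f(v) + (1/2)(x − v)²; a sketch (assuming NumPy; the test values x = 2.3, k = 1.5, λ = 0.7 are arbitrary):

```python
import numpy as np

def soft_threshold(x, lam):
    """prox of lam*||.||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_poisson(x, k):
    """prox of f(v) = -k*log(v) + v: positive root of v^2 + (1 - x)*v - k = 0."""
    return 0.5 * (x - 1.0 + np.sqrt((x - 1.0) ** 2 + 4.0 * k))

# Brute-force check that each prox really minimizes f(v) + 0.5*(x - v)^2.
x, k, lam = 2.3, 1.5, 0.7
v = np.linspace(1e-3, 10.0, 200001)
obj_poisson = -k * np.log(v) + v + 0.5 * (x - v) ** 2
assert abs(v[np.argmin(obj_poisson)] - prox_poisson(x, k)) < 1e-3

u = np.linspace(-10.0, 10.0, 400001)
obj_l1 = lam * np.abs(u) + 0.5 * (x - u) ** 2
assert abs(u[np.argmin(obj_l1)] - soft_threshold(x, lam)) < 1e-3
```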
Forward-backward splitting algorithm
Let’s go back to our minimization problem:
x = Argmin_x f(x) + g(x)

where f is convex but not differentiable, and g is convex and differentiable with an L-Lipschitz gradient. Then it has been shown that the following iterative scheme solves the problem:

x^(t+1) = prox_{γf}(x^(t) − γ∇g(x^(t))),  with γ ∈ ]0, 1/L[

The prox step (backward) is applied to a gradient-descent step on g (forward).
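The generic loop takes a handful of lines. As a sketch (assuming NumPy), here it is applied to a tiny ℓ1-penalized denoising problem with H = Id, whose minimizer is known in closed form (soft-thresholding of b):

```python
import numpy as np

def fbs(prox_f, grad_g, L, x0, n_iter=200):
    """Forward-backward splitting: x <- prox_{gamma f}(x - gamma * grad g(x))."""
    gamma = 0.9 / L                       # any gamma in ]0, 1/L[
    x = x0
    for _ in range(n_iter):
        x = prox_f(x - gamma * grad_g(x), gamma)
    return x

# Denoising: min_x lam*||x||_1 + 0.5*||b - x||_2^2; the solution is S_lam(b).
rng = np.random.default_rng(2)
b = rng.standard_normal(50)
lam = 0.3
soft = lambda x, gamma: np.sign(x) * np.maximum(np.abs(x) - lam * gamma, 0.0)
x = fbs(soft, lambda x: x - b, L=1.0, x0=np.zeros(50))
assert np.allclose(x, np.sign(b) * np.maximum(np.abs(b) - lam, 0.0))
```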
Forward-backward splitting algorithm
Example: constrained minimization over a convex set K, i.e. f = ι_K:

min_{x∈K} g(x)

[Figure: level sets g(x) = c and the iterates x^(1), x^(2), x^(3), each projected back onto the convex set K.]
FBS: denoising with redundant representations
We want to solve a denoising problem by imposing sparsity in a redundant, non-orthogonal transform (undecimated wavelets, curvelets, ridgelets, etc.):

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − Φα‖²_ℓ2

Here λ‖α‖_ℓ1 is convex but not differentiable, and (1/2)‖b − Φα‖²_ℓ2 is convex and differentiable with a 1-Lipschitz gradient (for a transform normalized so that ‖Φ‖₂ = 1). The forward-backward algorithm then reads:

α^(t+1) = S_{λγ}(α^(t) + γΦᵀ(b − Φα^(t)))
FBS: denoising with redundant representations
FBS: deconvolution
We want to solve a deconvolution problem by imposing sparsity in some transform Φ:

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − HΦα‖²_ℓ2

where λ‖α‖_ℓ1 is convex but not differentiable, and (1/2)‖b − HΦα‖²_ℓ2 is convex and differentiable with an L-Lipschitz gradient. The forward-backward algorithm then reads:

α^(t+1) = S_{λγ}(α^(t) + γΦᵀHᵀ(b − HΦα^(t))),  with γ < 1/‖H‖²₂
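A sketch of this iteration for a 1-D circular deconvolution, taking Φ = Id (sparsity in the direct domain) so the transform drops out; the kernel, signal, and parameters are all illustrative choices (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64
h = np.exp(-0.5 * (np.arange(n) - 3.0) ** 2)   # illustrative blur kernel
fft, ifft = np.fft.fft, np.fft.ifft
H  = lambda x: np.real(ifft(fft(h) * fft(x)))            # circular convolution
Ht = lambda x: np.real(ifft(np.conj(fft(h)) * fft(x)))   # its adjoint

x_true = np.zeros(n); x_true[[10, 30, 45]] = [2.0, -1.5, 3.0]  # sparse signal
b = H(x_true) + 0.01 * rng.standard_normal(n)

lam = 0.05
normH2 = np.max(np.abs(fft(h))) ** 2     # ||H||_2^2 for a circular operator
gamma = 0.9 / normH2                     # gamma < 1 / ||H||_2^2
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.zeros(n)
for _ in range(500):
    x = soft(x + gamma * Ht(b - H(x)), lam * gamma)

obj = lambda x: lam * np.sum(np.abs(x)) + 0.5 * np.sum((b - H(x)) ** 2)
assert obj(x) < obj(np.zeros(n))         # the objective has decreased
```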
FBS: deconvolution
[Figure, reproduced from Starck, Pantin & Murtagh 2002, PASP, 114:1051-1069, Fig. 9: (a) β Pictoris raw data; (b) filtered image; (c) deconvolved image. The extracted page also describes the Gerchberg algorithm, which iteratively forces the solution to be zero outside a spatial support D and equal to the observed visibilities inside a Fourier-domain support Q.]
FBS: Inpainting
Inpainting problems arise when one wants to recover an image from incomplete measurements:

b = M ⊙ x + n

where M is a binary mask and ⊙ denotes entry-wise multiplication (Hadamard product).

[Figure: example image where 90% of the pixels are missing.]
FBS: Inpainting
Inpainting has been tackled by solving an ℓ1-penalized least-square problem of the form:

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − MΦα‖²_ℓ2

where the mask is recast as a diagonal matrix M. The data-fidelity term is convex and differentiable with a 1-Lipschitz gradient, while the ℓ1 penalty is convex but not differentiable. The forward-backward algorithm then reads:

α^(t+1) = prox_{γf}(α^(t) + γΦᵀ(b − MΦα^(t)))
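A sketch of this iteration, with a random orthogonal matrix standing in for the transform Φ (in practice, wavelets or curvelets); sizes, the mask density, and parameters are illustrative (assuming NumPy):

```python
import numpy as np

# Inpainting iteration with an orthogonal stand-in for Phi.
rng = np.random.default_rng(4)
n = 40
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))     # orthogonal stand-in

alpha_true = np.zeros(n); alpha_true[:4] = [3.0, -2.0, 1.5, 2.5]
mask = (rng.random(n) < 0.5).astype(float)             # binary mask M
b = mask * (Phi @ alpha_true) + 0.01 * rng.standard_normal(n)

lam, gamma = 0.05, 0.9    # ||M Phi||_2 <= 1, so the gradient is 1-Lipschitz
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

alpha = np.zeros(n)
for _ in range(300):
    alpha = soft(alpha + gamma * Phi.T @ (b - mask * (Phi @ alpha)), lam * gamma)

obj = lambda a: lam * np.sum(np.abs(a)) + 0.5 * np.sum((b - mask * (Phi @ a)) ** 2)
assert obj(alpha) < obj(np.zeros(n))                   # objective decreased
```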
FBS: Inpainting
𝚽 = [Curvelets, Local DCT]
FBS: convergence
The forward-backward algorithm converges to the minimum of f + g at the following rate:

f(x^(t)) − f(x*) ≤ L‖x^(0) − x*‖² / t

which is called a "sublinear" rate of convergence.

The approximate number of iterations needed to reach a given precision ε is:

t_ε = ⌈L‖x^(0) − x*‖² / ε⌉

Remark: it is important to notice that a more precise convergence study would reveal that the speed of convergence also depends on the spectrum of the operator H.
FBS: refinement with multi-step techniques
The FBS can be sped up by using multi-step techniques, which further exploit information from the previous estimates. The plain FBS update

x^(t+1) = prox_{γf}(x^(t) − γ∇g(x^(t)))

only depends on x^(t). From the seminal work of Nesterov, first in the early 80s and later around 2007, the accelerated FBS is defined as follows:

(0) ν₁ = 1, x^(1) = x₀, y^(1) = x₀
(1) x^(t) = prox_{γf}(y^(t) − γ∇g(y^(t)))
(2) ν_{t+1} = (1 + √(1 + 4ν_t²)) / 2
(3) y^(t+1) = x^(t) + ((ν_t − 1)/ν_{t+1})(x^(t) − x^(t−1))

Step (3) is an averaging of the previous iterates.
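The accelerated scheme is only a few lines more than plain FBS; a sketch on an ℓ1-penalized least-square problem, where H, b, and λ are illustrative choices (assuming NumPy):

```python
import numpy as np

def fista(prox_f, grad_g, L, x0, n_iter=300):
    """Accelerated FBS with the Nesterov momentum rule."""
    gamma = 1.0 / L
    x_prev, y, nu = x0, x0, 1.0
    for _ in range(n_iter):
        x = prox_f(y - gamma * grad_g(y), gamma)               # prox-gradient step
        nu_next = (1.0 + np.sqrt(1.0 + 4.0 * nu ** 2)) / 2.0   # momentum update
        y = x + ((nu - 1.0) / nu_next) * (x - x_prev)          # extrapolation
        x_prev, nu = x, nu_next
    return x_prev

rng = np.random.default_rng(5)
H = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
lam = 0.1
L = 2.0 * np.linalg.norm(H.T @ H, 2)
grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
soft = lambda x, gamma: np.sign(x) * np.maximum(np.abs(x) - lam * gamma, 0.0)

x = fista(soft, grad_g, L, np.zeros(60))
obj = lambda v: lam * np.sum(np.abs(v)) + np.sum((b - H @ v) ** 2)
assert obj(x) < obj(np.zeros(60))        # objective well below the start
```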
FBS: refinement with multi-step techniques
In the case of the ℓ1-penalized least-square algorithm, the accelerated scheme achieves:

f(x^(t)) − f(x*) ≤ 2L‖x^(0) − x*‖² / (t + 1)²

t_ε = ⌈√(2L‖x^(0) − x*‖² / ε) − 1⌉

[Figure: convergence curves, courtesy of Beck/Teboulle, 2009.]
Primal-dual algorithms
Things get slightly more complicated when we want to minimize a problem of the form:

x = Argmin_x f(x) + g(x)

where both f and g are convex but not differentiable.

Example 1: sparsity and quadratic constraint

x = Argmin_{x=Φα} ‖α‖_ℓ1  s.t.  ‖b − Φα‖_ℓ2 ≤ ε

Example 2: sparsity and impulsive noise removal

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + ‖b − Φα‖_ℓ1
Primal-dual algorithms
We will more specifically focus on minimization problems of the form:
min_x f(x) + g(Ax)

which includes most linear inverse problems. We further assume that both f and g are "proximable".

The main idea consists in splitting the application of the proximal operators of each of the functions. For that purpose, one has to resort to the Fenchel dual, or convex conjugate:

g(x) = max_y ⟨y, x⟩ − g*(y)

The previous problem can then be recast as:

min_x max_y ⟨y, Ax⟩ − g*(y) + f(x)

which turns out to be a saddle-point problem.
Primal-dual algorithms
Convergence to a saddle point of this problem,

min_x max_y ⟨y, Ax⟩ − g*(y) + f(x)

can be achieved by using the following iterative procedure:

(1) y^(t+1) = Argmax_y ⟨y, Ax̄^(t)⟩ − g*(y) − (τ/2)‖y − y^(t)‖²_ℓ2
(2) x^(t+1) = Argmin_x ⟨y^(t+1), Ax⟩ + f(x) + (σ/2)‖x − x^(t)‖²_ℓ2
(3) x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t))

with θ ∈ [0, 1] and τσ‖A‖²₂ < 1.
Primal-dual algorithms
which eventually reads:

(1) y^(t+1) = prox_{(1/τ)g*}(y^(t) + τAx̄^(t))
(2) x^(t+1) = prox_{(1/σ)f}(x^(t) − σAᵀy^(t+1))
(3) x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t))

with θ ∈ [0, 1] and τσ‖A‖²₂ < 1. The scheme alternates the proximal operators of f and g* (the latter is computable from prox_g through Moreau's identity, property iv).
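As an illustration, here is a sketch of a primal-dual loop for Example 2 above, min_x λ‖x‖₁ + ‖Ax − b‖₁, written in the common Chambolle-Pock step-size convention prox_{τf}, prox_{σg*} with τσ‖A‖²₂ < 1. For g(z) = ‖z − b‖₁ the conjugate is g*(y) = ⟨y, b⟩ plus the indicator of the ℓ∞ unit ball, so prox_{σg*} is a shifted clipping; A, b, and all parameters are illustrative (assuming NumPy):

```python
import numpy as np

# Primal-dual sketch for min_x lam*||x||_1 + ||Ax - b||_1.
# f(x) = lam*||x||_1  -> prox_{tau f} = soft-thresholding by lam*tau
# g(z) = ||z - b||_1  -> prox_{sigma g*}(u) = clip(u - sigma*b, -1, 1)
rng = np.random.default_rng(6)
A = rng.standard_normal((30, 20))
b = rng.standard_normal(30)
lam = 0.1

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
tau = sigma = 0.9 / np.linalg.norm(A, 2)   # tau * sigma * ||A||_2^2 < 1
theta = 1.0

x = np.zeros(20); x_bar = x.copy(); y = np.zeros(30)
for _ in range(2000):
    y = np.clip(y + sigma * (A @ x_bar) - sigma * b, -1.0, 1.0)  # dual step
    x_new = soft(x - tau * (A.T @ y), lam * tau)                 # primal step
    x_bar = x_new + theta * (x_new - x)                          # extrapolation
    x = x_new

obj = lambda v: lam * np.sum(np.abs(v)) + np.sum(np.abs(A @ v - b))
assert obj(x) < obj(np.zeros(20))
```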
Example: point sources removal
Example
In this context, each channel is made of three components:

x = x₁ + x₂ + x₃ + n

(background, extended sources, and point sources). This has been tackled by solving, with a primal-dual proximal algorithm:

min_{x₁∈K, x₂, x₃∈B} λ‖Φᵀx₂‖_ℓ1 + λ‖Fᵀx₃‖_ℓ1  s.t.  ‖b − Hx₁ − Mx₂ − x₃‖_ℓ2 ≤ ε

where Φ is a wavelet transform, F a harmonic basis, H the PSF, M a point-sources mask, B the set of band-limited signals, and K the non-negative orthant.
Example
Here is a challenging example:

[Figure: the observed data b and the estimated component x₁.]
Going a bit further
To derive the FBS, one wants to solve:

x = Argmin_x f(x) + g(x)

with f convex but not differentiable and g convex and differentiable with an L-Lipschitz gradient. The main idea consists in building an approximation functional that gives an upper bound on f + g:

A(x, z) = f(x) + g(z) + ⟨x − z, ∇g(z)⟩ + (L/2)‖x − z‖²_ℓ2

After some basic calculation (completing the square; the identity holds up to an additive term that does not depend on x), this yields:

A(x, z) = f(x) + g(z) + (L/2)‖x − (z − (1/L)∇g(z))‖²_ℓ2
Going a bit further (2)
This approximation functional admits a unique minimizer over x:

m_A(z) = Argmin_x f(x) + (L/2)‖x − (z − (1/L)∇g(z))‖²_ℓ2

which is no more than the proximal operator of f applied as follows:

m_A(z) = prox_{(1/L)f}(z − (1/L)∇g(z))

The FBS then reduces to:

x^(t+1) = m_A(x^(t))