Course #4 - A (very) short introduction to proximal algorithms
J. Bobin - [email protected] - Sparse data analysis in astrophysics
CS-Orion meeting - 01/28/2011 Course #4 - Proximal algorithms
Solving inverse problems
More generally, we will focus on linear inverse problems of the form:

b = Ax + n

where b is the data (observations, etc.), A is the observation operator, x is the signal to be retrieved, and n accounts for noise, model imperfections, etc.

This models many inverse problems arising in physics:
- Denoising (A is the identity operator) [this course]
- Deconvolution (A is the convolution kernel) [this course]
- Inpainting/missing data interpolation (A is a binary mask) [course #4]
- Tomographic reconstruction (A is the partial Radon transform) [course #5]
- Radio-interferometric reconstruction (A is the partial Fourier transform) [course #5]
- Compressed sensing [course #5]
- Blind source separation [courses #6-8]
Solving inverse problems
Let's assume x is sparse in some orthogonal basis: α = Φᵀx.

x = Argmin_{x = Φα} P(α) + ‖b − Φα‖²_ℓ2

The term ‖b − Φα‖²_ℓ2 is the data-fidelity term (it measures how well the model fits the data), and P(α) is a sparsity-enforcing penalty.

Examples of penalty terms:
- P(α) = ‖α‖_ℓ1 = Σ_i |α[i]|
- P(α) = ‖α‖_ℓ0, where the ℓ0 norm counts the number of nonzero elements
Solving inverse problems
Computing the solution to an inverse problem also boils down to solving a minimization of the form:

x = Argmin_x g(x)

Example: least-square estimator, maximum-likelihood estimator, etc. For instance, the Poisson maximum-likelihood estimator reads:

x = Argmin_x Σ_i x_i − b_i log(x_i)

Or more generally,

x = Argmin_x f(x) + g(x)

Example: penalized least-square estimator, etc.:

x = Argmin_x λ‖Φᵀx‖_ℓp + (1/2)‖b − Hx‖²_ℓ2
Let’s warm up with a simple case
Let’s consider the following simple case:
x = Argmin_x g(x)

where g verifies the following properties:
- It is convex: ∀x, y ∈ Dom g, ∀α ∈ [0, 1], g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y)
- It is differentiable: ∇g is defined on Dom g
- Its gradient is Lipschitz: ∀x, y ∈ Dom g, ‖∇g(x) − ∇g(y)‖ ≤ L‖x − y‖

Example: g(x) = ‖b − Hx‖²_ℓ2, with ∇g(x) = 2Hᵀ(Hx − b) and L = 2‖HᵀH‖₂
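This example is easy to verify numerically; a minimal sketch (assuming NumPy, with an arbitrary random Gaussian H and b):

```python
import numpy as np

# Sanity check of the example: g(x) = ||b - Hx||_2^2,
# grad g(x) = 2 H^T (Hx - b), and L = 2 ||H^T H||_2.
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

g = lambda x: np.sum((b - H @ x) ** 2)
grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
L = 2.0 * np.linalg.norm(H.T @ H, 2)   # spectral norm of H^T H

# Finite-difference check of the gradient at a random point.
x = rng.standard_normal(10)
eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(10)])
assert np.allclose(fd, grad_g(x), atol=1e-4)

# The gradient is L-Lipschitz: grad g(x) - grad g(y) = 2 H^T H (x - y).
y = rng.standard_normal(10)
assert np.linalg.norm(grad_g(x) - grad_g(y)) <= L * np.linalg.norm(x - y) + 1e-9
```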
Gradient descent
In that case, having access to first-order information about g, the simplest first-order algorithm is gradient descent:
x^(t+1) = x^(t) − γ∇g(x^(t))

[Figure: level sets g(x) = c with the iterates x^(0), x^(1), x^(2) descending toward the minimum.]
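The iteration above can be sketched in a few lines on the least-square example; H, b, the step size, and the iteration count are illustrative choices (assuming NumPy):

```python
import numpy as np

# Gradient descent on g(x) = ||b - Hx||_2^2.
rng = np.random.default_rng(1)
H = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
L = 2.0 * np.linalg.norm(H.T @ H, 2)   # Lipschitz constant of grad g
gamma = 1.0 / L                        # fixed step size

x = np.zeros(10)
for _ in range(500):
    x = x - gamma * grad_g(x)          # x^(t+1) = x^(t) - gamma * grad g(x^(t))

# The iterates approach the least-square solution.
x_star = np.linalg.lstsq(H, b, rcond=None)[0]
assert np.allclose(x, x_star, atol=1e-3)
```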
A more complex problem
Let’s reconsider the following L1-penalized least-square problem:
x = Argmin_x λ‖Φᵀx‖_ℓ1 + (1/2)‖b − Hx‖²_ℓ2

which is of the form

x = Argmin_x f(x) + g(x)

where f(x) = λ‖Φᵀx‖_ℓ1 is convex but not differentiable, and g(x) = (1/2)‖b − Hx‖²_ℓ2 is convex and differentiable with an L-Lipschitz gradient.
Subgradient
A more precise description of f(x) requires defining the subgradient of a convex function:
∂f(x) = {u ; ∀y ∈ Dom f, f(x) + ⟨y − x, u⟩ ≤ f(y)}

For example, let's go back to the ℓ1 norm (scalar case), f(x) = |x|:
- for x > 0, ∂f(x) = {1}
- for x < 0, ∂f(x) = {−1}
- for x = 0, ∂f(x) = [−1, 1]

[Figure: graph of f(x) = |x| with the subgradient slopes at x = 0.]
Proximal operator
This now allows us to define the key element of proximal calculus: the proximal operator of a function f.

prox_f(x) = Argmin_v f(v) + (1/2)‖x − v‖²_ℓ2

[Figure: Dom f and a level set f(x) = c, illustrating the proximal mapping.]
Proximal operator, properties
Some useful properties (among others):

i) translation: h(x) = f(x − z)  ⇒  prox_h(x) = z + prox_f(x − z)
ii) scaling: h(x) = f(x/ρ)  ⇒  prox_h(x) = ρ prox_{(1/ρ²)f}(x/ρ)
iii) reflection: h(x) = f(−x)  ⇒  prox_h(x) = −prox_f(−x)
iv) conjugation: h(x) = f*(x) = max_z ⟨z, x⟩ − f(z)  ⇒  prox_h(x) = x − prox_f(x)
v) change of basis: h(x) = f(Φᵀx)  ⇒  prox_h(x) = Φ prox_f(Φᵀx) (for orthogonal Φ)
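Property iv), Moreau's identity, is easy to check numerically for f = ‖·‖_ℓ1: its conjugate f* is the indicator of the ℓ∞ unit ball, so prox_{f*} is the entry-wise projection onto [−1, 1]. A small sketch (assuming NumPy):

```python
import numpy as np

# Property iv) for f = ||.||_1: prox_{f*} is the projection onto [-1, 1]
# (np.clip), while prox_f is soft-thresholding with threshold 1.
soft = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.linspace(-3.0, 3.0, 101)
lhs = np.clip(x, -1.0, 1.0)   # prox_{f*}(x)
rhs = x - soft(x, 1.0)        # x - prox_f(x)
assert np.allclose(lhs, rhs)  # Moreau: prox_{f*}(x) = x - prox_f(x)
```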
Proximal operator, examples
The indicator function of a convex set K:

ι_K(x) = 0 if x ∈ K, +∞ otherwise

prox_{ι_K}(x) = P_K(x)

It is the orthogonal projector onto K!

Example of the non-negative orthant K = {u ; u ≥ 0}:

prox_{ι_K}(x) = x if x ≥ 0, 0 otherwise (entry-wise)

The squared ℓ2 norm: f(x) = λ‖x‖²_ℓ2

prox_f(x) = x / (1 + 2λ)
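Both operators are one-liners; a sketch (assuming NumPy, with an arbitrary test vector):

```python
import numpy as np

# prox of the indicator of K = {u : u >= 0} is the projection onto K.
prox_indicator = lambda x: np.maximum(x, 0.0)

# prox of f(x) = lam * ||x||_2^2 solves 2*lam*v + (v - x) = 0: a shrinkage.
prox_sq_l2 = lambda x, lam: x / (1.0 + 2.0 * lam)

x = np.array([-1.5, 0.0, 2.0])
assert np.allclose(prox_indicator(x), [0.0, 0.0, 2.0])
assert np.allclose(prox_sq_l2(x, 0.5), x / 2.0)
```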
Proximal operator, examples
The ℓ1 norm: f(x) = λ‖x‖_ℓ1

By definition of the proximal operator:

prox_f(x) = Argmin_u λ‖u‖_ℓ1 + (1/2)‖x − u‖²_ℓ2

This yields the soft-thresholding operator S_λ, which shrinks each entry toward zero by λ: S_λ(x)[i] = sign(x[i]) max(|x[i]| − λ, 0).

[Figure: the soft-thresholding function, flat on [−λ, λ].]

The Poisson log-likelihood: f(x) = −k log(x) + x

prox_f(x) = (1/2)(x − 1 + √((x − 1)² + 4k))
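Both closed forms can be checked against a brute-force minimization of f(v) + (1/2)(x − v)²; a sketch (assuming NumPy; the test values x = 2.3, k = 1.5, λ = 0.7 are arbitrary):

```python
import numpy as np

def soft_threshold(x, lam):
    """prox of lam*||.||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_poisson(x, k):
    """prox of f(v) = -k*log(v) + v: positive root of v^2 + (1 - x)*v - k = 0."""
    return 0.5 * (x - 1.0 + np.sqrt((x - 1.0) ** 2 + 4.0 * k))

# Brute-force check that each prox really minimizes f(v) + 0.5*(x - v)^2.
x, k, lam = 2.3, 1.5, 0.7
v = np.linspace(1e-3, 10.0, 200001)
obj_poisson = -k * np.log(v) + v + 0.5 * (x - v) ** 2
assert abs(v[np.argmin(obj_poisson)] - prox_poisson(x, k)) < 1e-3

u = np.linspace(-10.0, 10.0, 400001)
obj_l1 = lam * np.abs(u) + 0.5 * (x - u) ** 2
assert abs(u[np.argmin(obj_l1)] - soft_threshold(x, lam)) < 1e-3
```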
Forward-backward splitting algorithm
Let’s go back to our minimization problem:
x = Argmin_x f(x) + g(x)

where f is convex but not differentiable, and g is convex and differentiable with an L-Lipschitz gradient. Then it has been shown that the following iterative scheme solves the problem:

x^(t+1) = prox_{γf}(x^(t) − γ∇g(x^(t))),  with γ ∈ ]0, 1/L[

The prox step (backward) is applied to a gradient-descent step on g (forward).
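The generic loop takes a handful of lines. As a sketch (assuming NumPy), here it is applied to a tiny ℓ1-penalized denoising problem with H = Id, whose minimizer is known in closed form (soft-thresholding of b):

```python
import numpy as np

def fbs(prox_f, grad_g, L, x0, n_iter=200):
    """Forward-backward splitting: x <- prox_{gamma f}(x - gamma * grad g(x))."""
    gamma = 0.9 / L                       # any gamma in ]0, 1/L[
    x = x0
    for _ in range(n_iter):
        x = prox_f(x - gamma * grad_g(x), gamma)
    return x

# Denoising: min_x lam*||x||_1 + 0.5*||b - x||_2^2; the solution is S_lam(b).
rng = np.random.default_rng(2)
b = rng.standard_normal(50)
lam = 0.3
soft = lambda x, gamma: np.sign(x) * np.maximum(np.abs(x) - lam * gamma, 0.0)
x = fbs(soft, lambda x: x - b, L=1.0, x0=np.zeros(50))
assert np.allclose(x, np.sign(b) * np.maximum(np.abs(b) - lam, 0.0))
```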
Forward-backward splitting algorithm
Example: constrained minimization over a convex set K, i.e. f = ι_K:

min_{x∈K} g(x)

[Figure: level sets g(x) = c and the iterates x^(1), x^(2), x^(3), each projected back onto the convex set K.]
FBS: denoising with redundant representations
We want to solve a denoising problem by imposing sparsity in a redundant, non-orthogonal transform (undecimated wavelets, curvelets, ridgelets, etc.):

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − Φα‖²_ℓ2

Here λ‖α‖_ℓ1 is convex but not differentiable, and (1/2)‖b − Φα‖²_ℓ2 is convex and differentiable with a 1-Lipschitz gradient (for a transform normalized so that ‖Φ‖₂ = 1). The forward-backward algorithm then reads:

α^(t+1) = S_{λγ}(α^(t) + γΦᵀ(b − Φα^(t)))
FBS: denoising with redundant representations
FBS: deconvolution
We want to solve a deconvolution problem by imposing sparsity in some transform Φ:

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − HΦα‖²_ℓ2

where λ‖α‖_ℓ1 is convex but not differentiable, and (1/2)‖b − HΦα‖²_ℓ2 is convex and differentiable with an L-Lipschitz gradient. The forward-backward algorithm then reads:

α^(t+1) = S_{λγ}(α^(t) + γΦᵀHᵀ(b − HΦα^(t))),  with γ < 1/‖H‖²₂
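A sketch of this iteration for a 1-D circular deconvolution, taking Φ = Id (sparsity in the direct domain) so the transform drops out; the kernel, signal, and parameters are all illustrative choices (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64
h = np.exp(-0.5 * (np.arange(n) - 3.0) ** 2)   # illustrative blur kernel
fft, ifft = np.fft.fft, np.fft.ifft
H  = lambda x: np.real(ifft(fft(h) * fft(x)))            # circular convolution
Ht = lambda x: np.real(ifft(np.conj(fft(h)) * fft(x)))   # its adjoint

x_true = np.zeros(n); x_true[[10, 30, 45]] = [2.0, -1.5, 3.0]  # sparse signal
b = H(x_true) + 0.01 * rng.standard_normal(n)

lam = 0.05
normH2 = np.max(np.abs(fft(h))) ** 2     # ||H||_2^2 for a circular operator
gamma = 0.9 / normH2                     # gamma < 1 / ||H||_2^2
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.zeros(n)
for _ in range(500):
    x = soft(x + gamma * Ht(b - H(x)), lam * gamma)

obj = lambda x: lam * np.sum(np.abs(x)) + 0.5 * np.sum((b - H(x)) ** 2)
assert obj(x) < obj(np.zeros(n))         # the objective has decreased
```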
FBS: deconvolution
[Figure, reproduced from Starck, Pantin & Murtagh 2002, PASP, 114:1051-1069, Fig. 9: (a) β Pictoris raw data; (b) filtered image; (c) deconvolved image. The extracted page also describes the Gerchberg algorithm, which iteratively forces the solution to be zero outside a spatial support D and equal to the observed visibilities inside a Fourier-domain support Q.]
FBS: Inpainting
Inpainting problems arise when one wants to recover an image from incomplete measurements:

b = M ⊙ x + n

where M is a binary mask and ⊙ denotes entry-wise multiplication (Hadamard product).

[Figure: example image where 90% of the pixels are missing.]
FBS: Inpainting
Inpainting has been tackled by solving an ℓ1-penalized least-square problem of the form:

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + (1/2)‖b − MΦα‖²_ℓ2

where the mask is recast as a diagonal matrix M. The data-fidelity term is convex and differentiable with a 1-Lipschitz gradient, while the ℓ1 penalty is convex but not differentiable. The forward-backward algorithm then reads:

α^(t+1) = prox_{γf}(α^(t) + γΦᵀ(b − MΦα^(t)))
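A sketch of this iteration, with a random orthogonal matrix standing in for the transform Φ (in practice, wavelets or curvelets); sizes, the mask density, and parameters are illustrative (assuming NumPy):

```python
import numpy as np

# Inpainting iteration with an orthogonal stand-in for Phi.
rng = np.random.default_rng(4)
n = 40
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))     # orthogonal stand-in

alpha_true = np.zeros(n); alpha_true[:4] = [3.0, -2.0, 1.5, 2.5]
mask = (rng.random(n) < 0.5).astype(float)             # binary mask M
b = mask * (Phi @ alpha_true) + 0.01 * rng.standard_normal(n)

lam, gamma = 0.05, 0.9    # ||M Phi||_2 <= 1, so the gradient is 1-Lipschitz
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

alpha = np.zeros(n)
for _ in range(300):
    alpha = soft(alpha + gamma * Phi.T @ (b - mask * (Phi @ alpha)), lam * gamma)

obj = lambda a: lam * np.sum(np.abs(a)) + 0.5 * np.sum((b - mask * (Phi @ a)) ** 2)
assert obj(alpha) < obj(np.zeros(n))                   # objective decreased
```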
FBS: Inpainting
𝚽 = [Curvelets, Local DCT]
FBS: convergence
The forward-backward algorithm converges to the minimum of f + g at the following rate:

f(x^(t)) − f(x*) ≤ L‖x^(0) − x*‖² / t

which is called a "sublinear" rate of convergence.

The approximate number of iterations needed to reach a given precision ε is:

t_ε = ⌈L‖x^(0) − x*‖² / ε⌉

Remark: it is important to notice that a more precise convergence study would reveal that the speed of convergence also depends on the spectrum of the operator H.
FBS: refinement with multi-step techniques
The FBS can be sped up by using multi-step techniques, which further exploit information from the previous estimates. The plain FBS update

x^(t+1) = prox_{γf}(x^(t) − γ∇g(x^(t)))

only depends on x^(t). From the seminal work of Nesterov, first in the early 80s and later around 2007, the accelerated FBS is defined as follows:

(0) ν₁ = 1, x^(1) = x₀, y^(1) = x₀
(1) x^(t) = prox_{γf}(y^(t) − γ∇g(y^(t)))
(2) ν_{t+1} = (1 + √(1 + 4ν_t²)) / 2
(3) y^(t+1) = x^(t) + ((ν_t − 1)/ν_{t+1})(x^(t) − x^(t−1))

Step (3) is an averaging of the previous iterates.
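The accelerated scheme is only a few lines more than plain FBS; a sketch on an ℓ1-penalized least-square problem, where H, b, and λ are illustrative choices (assuming NumPy):

```python
import numpy as np

def fista(prox_f, grad_g, L, x0, n_iter=300):
    """Accelerated FBS with the Nesterov momentum rule."""
    gamma = 1.0 / L
    x_prev, y, nu = x0, x0, 1.0
    for _ in range(n_iter):
        x = prox_f(y - gamma * grad_g(y), gamma)               # prox-gradient step
        nu_next = (1.0 + np.sqrt(1.0 + 4.0 * nu ** 2)) / 2.0   # momentum update
        y = x + ((nu - 1.0) / nu_next) * (x - x_prev)          # extrapolation
        x_prev, nu = x, nu_next
    return x_prev

rng = np.random.default_rng(5)
H = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
lam = 0.1
L = 2.0 * np.linalg.norm(H.T @ H, 2)
grad_g = lambda x: 2.0 * H.T @ (H @ x - b)
soft = lambda x, gamma: np.sign(x) * np.maximum(np.abs(x) - lam * gamma, 0.0)

x = fista(soft, grad_g, L, np.zeros(60))
obj = lambda v: lam * np.sum(np.abs(v)) + np.sum((b - H @ v) ** 2)
assert obj(x) < obj(np.zeros(60))        # objective well below the start
```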
FBS: refinement with multi-step techniques
In the case of the ℓ1-penalized least-square algorithm, the accelerated scheme achieves:

f(x^(t)) − f(x*) ≤ 2L‖x^(0) − x*‖² / (t + 1)²

t_ε = ⌈√(2L‖x^(0) − x*‖² / ε) − 1⌉

[Figure: convergence curves, courtesy of Beck/Teboulle, 2009.]
Primal-dual algorithms
Things get slightly more complicated when we want to minimize a problem of the form:

x = Argmin_x f(x) + g(x)

where both f and g are convex but not differentiable.

Example 1: sparsity and quadratic constraint

x = Argmin_{x=Φα} ‖α‖_ℓ1  s.t.  ‖b − Φα‖_ℓ2 ≤ ε

Example 2: sparsity and impulsive noise removal

x = Argmin_{x=Φα} λ‖α‖_ℓ1 + ‖b − Φα‖_ℓ1
Primal-dual algorithms
We will more specifically focus on minimization problems of the form:
min_x f(x) + g(Ax)

which includes most linear inverse problems. We further assume that both f and g are "proximable".

The main idea consists in splitting the application of the proximal operators of each of the functions. For that purpose, one has to resort to the Fenchel dual, or convex conjugate:

g(x) = max_y ⟨y, x⟩ − g*(y)

The previous problem can then be recast as:

min_x max_y ⟨y, Ax⟩ − g*(y) + f(x)

which turns out to be a saddle-point problem.
Primal-dual algorithms
Convergence to a saddle point of this problem,

min_x max_y ⟨y, Ax⟩ − g*(y) + f(x)

can be achieved by using the following iterative procedure:

(1) y^(t+1) = Argmax_y ⟨y, Ax̄^(t)⟩ − g*(y) − (τ/2)‖y − y^(t)‖²_ℓ2
(2) x^(t+1) = Argmin_x ⟨y^(t+1), Ax⟩ + f(x) + (σ/2)‖x − x^(t)‖²_ℓ2
(3) x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t))

with θ ∈ [0, 1] and τσ‖A‖²₂ < 1.
Primal-dual algorithms
which eventually reads:

(1) y^(t+1) = prox_{(1/τ)g*}(y^(t) + τAx̄^(t))
(2) x^(t+1) = prox_{(1/σ)f}(x^(t) − σAᵀy^(t+1))
(3) x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t))

with θ ∈ [0, 1] and τσ‖A‖²₂ < 1. The scheme alternates the proximal operators of f and g* (the latter is computable from prox_g through Moreau's identity, property iv).
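As an illustration, here is a sketch of a primal-dual loop for Example 2 above, min_x λ‖x‖₁ + ‖Ax − b‖₁, written in the common Chambolle-Pock step-size convention prox_{τf}, prox_{σg*} with τσ‖A‖²₂ < 1. For g(z) = ‖z − b‖₁ the conjugate is g*(y) = ⟨y, b⟩ plus the indicator of the ℓ∞ unit ball, so prox_{σg*} is a shifted clipping; A, b, and all parameters are illustrative (assuming NumPy):

```python
import numpy as np

# Primal-dual sketch for min_x lam*||x||_1 + ||Ax - b||_1.
# f(x) = lam*||x||_1  -> prox_{tau f} = soft-thresholding by lam*tau
# g(z) = ||z - b||_1  -> prox_{sigma g*}(u) = clip(u - sigma*b, -1, 1)
rng = np.random.default_rng(6)
A = rng.standard_normal((30, 20))
b = rng.standard_normal(30)
lam = 0.1

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
tau = sigma = 0.9 / np.linalg.norm(A, 2)   # tau * sigma * ||A||_2^2 < 1
theta = 1.0

x = np.zeros(20); x_bar = x.copy(); y = np.zeros(30)
for _ in range(2000):
    y = np.clip(y + sigma * (A @ x_bar) - sigma * b, -1.0, 1.0)  # dual step
    x_new = soft(x - tau * (A.T @ y), lam * tau)                 # primal step
    x_bar = x_new + theta * (x_new - x)                          # extrapolation
    x = x_new

obj = lambda v: lam * np.sum(np.abs(v)) + np.sum(np.abs(A @ v - b))
assert obj(x) < obj(np.zeros(20))
```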
Example: point sources removal
Example
In this context, each channel is made of three components:

x = x₁ + x₂ + x₃ + n

(background, extended sources, and point sources). This has been tackled by solving, with a primal-dual proximal algorithm:

min_{x₁∈K, x₂, x₃∈B} λ‖Φᵀx₂‖_ℓ1 + λ‖Fᵀx₃‖_ℓ1  s.t.  ‖b − Hx₁ − Mx₂ − x₃‖_ℓ2 ≤ ε

where Φ is a wavelet transform, F a harmonic basis, H the PSF, M a point-sources mask, B the set of band-limited signals, and K the non-negative orthant.
Example
Here is a challenging example:

[Figure: the observed data b and the estimated component x₁.]
Going a bit further
To derive the FBS, one wants to solve:

x = Argmin_x f(x) + g(x)

with f convex but not differentiable and g convex and differentiable with an L-Lipschitz gradient. The main idea consists in building an approximation functional that gives an upper bound on f + g:

A(x, z) = f(x) + g(z) + ⟨x − z, ∇g(z)⟩ + (L/2)‖x − z‖²_ℓ2

After some basic calculation (completing the square; the identity holds up to an additive term that does not depend on x), this yields:

A(x, z) = f(x) + g(z) + (L/2)‖x − (z − (1/L)∇g(z))‖²_ℓ2
Going a bit further (2)
This approximation functional admits a unique minimizer over x:

m_A(z) = Argmin_x f(x) + (L/2)‖x − (z − (1/L)∇g(z))‖²_ℓ2

which is no more than the proximal operator of f applied as follows:

m_A(z) = prox_{(1/L)f}(z − (1/L)∇g(z))

The FBS then reduces to:

x^(t+1) = m_A(x^(t))