a first-order primal-dual algorithm for convex problems with applications to...

26
J Math Imaging Vis (2011) 40: 120–145 DOI 10.1007/s10851-010-0251-1 A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging Antonin Chambolle · Thomas Pock Published online: 21 December 2010 © Springer Science+Business Media, LLC 2010 Abstract In this paper we study a first-order primal-dual al- gorithm for non-smooth convex optimization problems with known saddle-point structure. We prove convergence to a saddle-point with rate O(1/N) in finite dimensions for the complete class of problems. We further show accelerations of the proposed algorithm to yield improved rates on prob- lems with some degree of smoothness. In particular we show that we can achieve O(1/N 2 ) convergence on problems, where the primal or the dual objective is uniformly con- vex, and we can show linear convergence, i.e. O(ω N ) for some ω (0, 1), on smooth problems. The wide applica- bility of the proposed algorithm is demonstrated on several imaging problems such as image denoising, image deconvo- lution, image inpainting, motion estimation and multi-label image segmentation. Keywords Convex optimization · Dual approaches · Total variation · Inverse problems · Image · Reconstruction 1 Introduction Variational methods have proven to be particularly useful to solve a number of ill-posed inverse imaging problems. They can be divided into two fundamentally different classes: con- vex and non-convex problems. The advantage of convex A. Chambolle ( ) CMAP, Ecole Polytechnique, CNRS, 91128 Palaiseau, France e-mail: [email protected] T. Pock Institute for Computer Graphics and Vision, Graz University of Technology, 8010 Graz, Austria e-mail: [email protected] problems over non-convex problems is that a global opti- mum can be computed, in many cases with a good preci- sion and in a reasonable time, independent of the initializa- tion. Hence, the quality of the solution solely depends on the accuracy of the model. On the other hand, non-convex problems have often the ability to model more precisely the process behind an image acquisition, but have the drawback that the quality of the solution is more sensitive to the ini- tialization and the optimization algorithm. Total variation minimization plays an important role in convex variational methods for imaging. The major advan- tage of the total variation is that it allows for sharp discon- tinuities in the solution. This is of vital interest for many imaging problems, since edges represent important features, e.g. object boundaries or motion boundaries. However, it is also well known that variational methods incorporating to- tal variation regularization are difficult to minimize due to the non-smoothness of the total variation. The aim of this paper is therefore to provide a flexible algorithm which is particularly suitable for such non-smooth convex optimiza- tion problems in imaging. In Sect. 2 we re-visit a primal-dual algorithm proposed by Pock, Bischof, Cremers and Chambolle in [29] for mini- mizing a convex relaxation of the Mumford-Shah functional. In subsequent work [12], Esser et al. studied the same algo- rithm in a more general framework and established connec- tions to other known algorithms. We show in this paper that this algorithm has a conver- gence rate of O(1/N), more precisely, a partial 1 primal- dual gap decays like one over the total number of iterations. 
This rate equals the rates recently proven by Nesterov in [27] 1 The partial primal-dual gap is a weaker measure than the commonly used primal-dual gap but it can be applied to functions with unbounded domain.

Upload: others

Post on 29-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145DOI 10.1007/s10851-010-0251-1

A First-Order Primal-Dual Algorithm for Convex Problemswith Applications to Imaging

Antonin Chambolle · Thomas Pock

Published online: 21 December 2010© Springer Science+Business Media, LLC 2010

Abstract In this paper we study a first-order primal-dual al-gorithm for non-smooth convex optimization problems withknown saddle-point structure. We prove convergence to asaddle-point with rate O(1/N) in finite dimensions for thecomplete class of problems. We further show accelerationsof the proposed algorithm to yield improved rates on prob-lems with some degree of smoothness. In particular we showthat we can achieve O(1/N2) convergence on problems,where the primal or the dual objective is uniformly con-vex, and we can show linear convergence, i.e. O(ωN) forsome ω ∈ (0,1), on smooth problems. The wide applica-bility of the proposed algorithm is demonstrated on severalimaging problems such as image denoising, image deconvo-lution, image inpainting, motion estimation and multi-labelimage segmentation.

Keywords Convex optimization · Dual approaches · Totalvariation · Inverse problems · Image · Reconstruction

1 Introduction

Variational methods have proven to be particularly useful tosolve a number of ill-posed inverse imaging problems. Theycan be divided into two fundamentally different classes: con-vex and non-convex problems. The advantage of convex

A. Chambolle (�)CMAP, Ecole Polytechnique, CNRS, 91128 Palaiseau, Francee-mail: [email protected]

T. PockInstitute for Computer Graphics and Vision,Graz University of Technology, 8010 Graz, Austriae-mail: [email protected]

problems over non-convex problems is that a global opti-mum can be computed, in many cases with a good preci-sion and in a reasonable time, independent of the initializa-tion. Hence, the quality of the solution solely depends onthe accuracy of the model. On the other hand, non-convexproblems have often the ability to model more precisely theprocess behind an image acquisition, but have the drawbackthat the quality of the solution is more sensitive to the ini-tialization and the optimization algorithm.

Total variation minimization plays an important role inconvex variational methods for imaging. The major advan-tage of the total variation is that it allows for sharp discon-tinuities in the solution. This is of vital interest for manyimaging problems, since edges represent important features,e.g. object boundaries or motion boundaries. However, it isalso well known that variational methods incorporating to-tal variation regularization are difficult to minimize due tothe non-smoothness of the total variation. The aim of thispaper is therefore to provide a flexible algorithm which isparticularly suitable for such non-smooth convex optimiza-tion problems in imaging.

In Sect. 2 we re-visit a primal-dual algorithm proposedby Pock, Bischof, Cremers and Chambolle in [29] for mini-mizing a convex relaxation of the Mumford-Shah functional.In subsequent work [12], Esser et al. studied the same algo-rithm in a more general framework and established connec-tions to other known algorithms.

We show in this paper that this algorithm has a conver-gence rate of O(1/N), more precisely, a partial1 primal-dual gap decays like one over the total number of iterations.This rate equals the rates recently proven by Nesterov in [27]

1The partial primal-dual gap is a weaker measure than the commonlyused primal-dual gap but it can be applied to functions with unboundeddomain.

Page 2: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 121

and more recently by Nemirovski in [23] on (almost) thesame class of problems as in this paper. We further showin Sect. 5 that for certain problems, the theoretical rate ofconvergence can be further improved. In particular we showthat the proposed algorithm can be modified to yield a rate ofconvergence O(1/N2) for problems which have some reg-ularity in the primal or in the dual objective and is linearlyconvergent (O(ωN), ω < 1) for smooth problems.

The primal-dual algorithm proposed in this paper can beeasily adapted to different problems, is easy to implementand can be effectively accelerated on parallel hardware suchas graphics processing units (GPUs). This is particularly ap-pealing for imaging problems, where real-time applicationsplay an important role. This is demonstrated in Sect. 6 onseveral variational problems such as deconvolution, zoom-ing, inpainting, motion estimation and segmentation. Weend the paper with a short discussion.

2 The General Problem

Let X,Y be two finite-dimensional real vector spaces2

equipped with an inner product 〈·, ·〉 and norm ‖ ·‖ = 〈·, ·〉 12 .

The map K : X → Y is a continuous linear operator with in-duced norm

‖K‖ = max {‖Kx‖ : x ∈ X with ‖x‖ ≤ 1} . (1)

The general problem we consider in this paper is the genericsaddle-point problem

minx∈X

maxy∈Y

〈Kx,y〉 + G(x) − F ∗(y) (2)

where G : X → [0,+∞] and F ∗ : Y → [0,+∞] are proper,convex, lower-semicontinuous (l.s.c.) functions, F ∗ being it-self the convex conjugate of a convex l.s.c. function F . Letus observe that this saddle-point problem is a primal-dualformulation of the nonlinear primal problem

minx∈X

F(Kx) + G(x), (3)

or of the corresponding dual problem

maxy∈Y

− (G∗(−K∗y) + F ∗(y)). (4)

We refer to [32] for more details. We assume that these prob-lems have at least one solution (x, y) ∈ X × Y , which there-fore satisfies

Kx ∈ ∂F ∗(y),

−(K∗y) ∈ ∂G(x),(5)

2In fact, in most of the paper X and Y could be general real Hilbertspaces with infinite dimension, however, in that case, the assumptionthat K has finite norm is very restrictive. We will emphasize when theassumption of finite dimensionality is really needed.

where ∂F ∗ and ∂G are the subgradients of the convex func-tions F ∗ and G. See again [32] for details. Throughout thepaper we will assume that F and G are “simple”, in thesense that their resolvent operator defined through

x = (I + τ∂F )−1(y)

= arg minx

{‖x − y‖2

2τ+ F(x)

}

has a closed-form representation (or can be efficientlysolved up to a high precision, e.g. using a Newton method inlow dimension). This is the case in many interesting prob-lems in imaging, see Sect. 6. We recall that it is as easy tocompute (I + τ∂F )−1 as (I + τ∂F ∗)−1, as it is shown bythe celebrated Moreau’s identity:

x = (I + τ∂F )−1(x) + τ

(I + 1

τ∂F ∗

)−1(x

τ

), (6)

see for instance [32].

3 The Algorithm

The primal-dual algorithm we study in this paper is summa-rized in Algorithm 1. Note that the algorithm can also bewritten with yn+1 = yn+1 + θ(yn+1 − yn) instead of xn+1

and by exchanging the updates for yn+1 and xn+1. We willfocus on the special case θ = 1 since in that case, it is rela-tively easy to get estimates on the convergence of the algo-rithm.3 However, other cases are interesting, and in partic-ular the semi-implicit classical Arrow-Hurwicz algorithm,which corresponds to θ = 0, has been presented in the re-cent literature as an efficient approach for solving some typeof imaging problems [37]. We’ll see that in smoother cases,that approach seems indeed to perform very well, even ifwe can actually prove estimates for larger choices of θ . Wewill also consider later on (when F or G have some knownregularity) some variants where the steps σ and τ and theparameter θ can be modified at each iteration, see Sect. 5.

3.1 Convergence Analysis for θ = 1

For practical use, we introduce the partial primal-dual gap

GB1×B2(x, y) = maxy′∈B2

⟨y′,Kx

⟩− F ∗(y′) + G(x)

− minx′∈B1

⟨y,Kx′⟩− F ∗(y) + G(x′).

3Note that the convergence itself is not really an issue, in particular, itis shown in a very recent preprint [16] that for θ = 1, the approach is aninstance of the Proximal Point Algorithm, see [31] and Sect. 4 below.

Page 3: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

122 J Math Imaging Vis (2011) 40: 120–145

Algorithm 1

• Initialization: Choose τ, σ > 0, θ ∈ [0,1], (x0, y0) ∈ X × Y and set x0 = x0.• Iterations (n ≥ 0): Update xn, yn, xn as follows:

⎧⎪⎪⎨

⎪⎪⎩

yn+1 = (I + σ∂F ∗)−1(yn + σKxn)

xn+1 = (I + τ∂G)−1(xn − τK∗yn+1)

xn+1 = xn+1 + θ(xn+1 − xn)

(7)

Then, as soon as B1 × B2 contains a saddle-point (x, y),defined by (2), we have

GB1×B2(x, y) ≥ (⟨y,Kx

⟩− F ∗(y) + G(x))

− (⟨y,Kx

⟩− F ∗(y) + G(x))≥ 0.

However, this is not a good measure of optimality, in partic-ular, it might vanish for (x, y) which is not a saddle-point.What is easy to check is that if GB1×B2(x, y) = 0 and (x, y)

lies in the interior of B1 × B2, then it is a saddle-point. Inpractice, if B1 and B2 are “large enough”, the gap will mea-sure the optimality (thanks to point (a) in the following theo-rem), an important issue being that the constants will dependon these (large) unknown sets. See also Remark 3 below. Inthe general case we have the following result:

Theorem 1 Let L = ‖K‖ and assume problem (2) hasa saddle-point (x, y). Choose θ = 1, τσL2 < 1, and let(xn, xn, yn) be defined by (7). Then:

(a) For any n, (xn, yn) remains bounded, indeed:

‖yn − y‖2σ

2

+ ‖xn − x‖2τ

2

≤ C

(‖y0 − y‖2σ

2

+ ‖x0 − x‖2τ

2)(8)

where the constant C ≤ (1 − τσL2)−1;(b) If we let xN = (

∑Nn=1 xn)/N and yN = (

∑Nn=1 yn)/N ,

for any bounded B1 × B2 ⊂ X × Y the restricted gaphas the following bound:

GB1×B2(xN , yN) ≤ D(B1,B2)

N, (9)

where

D(B1,B2) = sup(x,y)∈B1×B2

‖x − x0‖2τ

2

+ ‖y − y0‖2σ

2

.

Moreover, the weak cluster points of (xN , yN) aresaddle-points of (2);

(c) There exists a saddle-point (x∗, y∗) such that xn → x∗and yn → y∗.

Remark 1 Points (a) and (b) would still hold in an infi-nite dimensional Hilbert setting, as can be checked from theproofs.

Proof Let us first write the iterations (7) in the general form

{yn+1 = (I + σ∂F ∗)−1(yn + σKx)

xn+1 = (I + τ∂G)−1(xn − τK∗y).(10)

We have

∂F ∗(yn+1) yn − yn+1

σ+ Kx

∂G(xn+1) xn − xn+1

τ− K∗y

so that for any (x, y) ∈ X × Y ,

F ∗(y) ≥ F ∗(yn+1) +⟨yn − yn+1

σ,y − yn+1

+ ⟨Kx, y − yn+1

G(x) ≥ G(xn+1) +⟨xn − xn+1

τ, x − xn+1

− ⟨K(x − xn+1), y

⟩.

(11)

Summing both inequalities, it follows:

‖y − yn‖2σ

2

+ ‖x − xn‖2τ

2

≥[⟨

Kxn+1, y⟩− F ∗(y) + G(xn+1)

]

−[⟨

Kx,yn+1⟩− F ∗(yn+1) + G(x)

]

+ ‖y − yn+1‖2σ

2

+ ‖x − xn+1‖2τ

2

+ ‖yn − yn+1‖2σ

2

+ ‖xn − xn+1‖2τ

2

+⟨K(xn+1 − x), yn+1 − y

−⟨K(xn+1 − x), yn+1 − y

⟩. (12)

Page 4: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 123

From this inequality it can be seen that the expression in thelast line of (12) plays an important role in proving conver-gence of the algorithm.

The best choice of course would be to make the schemefully implicit, i.e. x = xn+1 and y = yn+1, which however isnot feasible, since this choice would require to solve prob-lems beforehand which are as difficult as the original prob-lem. It is easy to see that by the natural order of the iter-ates, the scheme can be easily made semi implicit by takingx = xn and y = yn+1. This choice, corresponding to θ = 0in Algorithm 1, yields the classical Arrow-Hurwicz algo-rithm [1] and has been used in [37] for total variation min-imization. A proof of convergence for this choice is givenin [12] but with some additional restrictions on the step-widths; see also Sect. 3.2 for a more detailed analysis ofthis scheme.

Another choice is to compute so-called leading points ob-tained from taking an extragradient step based on the currentiterates [17, 23, 30].

Here, we consider Algorithm 1 with θ = 1. As in thesemi-implicit case, we choose y = yn+1, while we choosex = 2xn − xn−1 which corresponds to a simple linear ex-trapolation based on the current and previous iterates. Thiscan be seen as an approximate extragradient step. With thischoice, the last line of (12) becomes⟨K(xn+1 − x), yn+1 − y

⟩−⟨K(xn+1 − x), yn+1 − y

=⟨K((xn+1 − xn) − (xn − xn−1)), yn+1 − y

=⟨K(xn+1 − xn), yn+1 − y

⟩−⟨K(xn − xn−1), yn − y

−⟨K(xn − xn−1), yn+1 − yn

≥⟨K(xn+1 − xn), yn+1 − y

⟩−⟨K(xn − xn−1), yn − y

− L‖xn − xn−1‖‖yn+1 − yn‖. (13)

For any α > 0, we have that (using 2ab ≤ αa2 +b2/α forany a, b)

L‖xn − xn−1‖‖yn+1 − yn‖ ≤ Lατ

2τ‖xn − xn−1‖2

+ Lσ

2ασ‖yn+1 − yn‖2

and we choose α = √σ/τ , so that Lατ = Lσ/α =√

στL < 1.Summing the last inequality together with (12) and (13),

we get that for any x ∈ X and y ∈ Y ,

‖y − yn‖2σ

2

+ ‖x − xn‖2τ

2

≥[⟨

Kxn+1, y⟩− F ∗(y) + G(xn+1)

]

−[⟨

Kx,yn+1⟩− F ∗(yn+1) + G(x)

]

+ ‖y − yn+1‖2σ

2

+ ‖x − xn+1‖2τ

2

+ (1 − √στL)

‖yn − yn+1‖2σ

2

+ ‖xn − xn+1‖2τ

2

− √στL

‖xn−1 − xn‖2τ

2

+⟨K(xn+1 − xn), yn+1 − y

−⟨K(xn − xn−1), yn − y

⟩. (14)

Let us now sum (14) from n = 0 to N − 1. It follows thatfor any x and y,

N∑

n=1

[⟨Kxn,y

⟩− F ∗(y) + G(xn)]

− [⟨Kx,yn

⟩− F ∗(yn) + G(x)]

+ ‖y − yN‖2σ

2

+ ‖x − xN‖2τ

2

+ (1 − √στL)

N∑

n=1

‖yn − yn−1‖2σ

2

+ (1 − √στL)

N−1∑

n=1

‖xn − xn−1‖2τ

2

+ ‖xN − xN−1‖2τ

2

≤ ‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2

−⟨K(xN − xN−1), yN − y

where we have introduced x−1 = x0, in coherence with theinitial choice of x0 = x0. Now, as before, |〈K(xN − xN−1),

yN −y〉| ≤ ‖xN − xN−1‖2/(2τ)+(τσL2)‖y − yN‖2/(2σ),and it follows that

N∑

n=1

([⟨Kxn,y

⟩− F ∗(y) + G(xn)]

− [⟨Kx,yn

⟩− F ∗(yn) + G(x)])

+ (1 − στL2)‖y − yN‖

2

+ ‖x − xN‖2τ

2

+ (1 − √στL)

N∑

n=1

‖yn − yn−1‖2σ

2

+ (1 − √στL)

N−1∑

n=1

‖xn − xn−1‖2τ

2

≤ ‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2

. (15)

Page 5: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

124 J Math Imaging Vis (2011) 40: 120–145

First we substitute (x, y) = (x, y) in (15) where (x, y)

is a saddle-point of (2). Then, it follows from (2) that thefirst summation in (15) is non-negative, and point (a) inTheorem 1 follows. We then deduce from (15) and the con-vexity of G and F ∗ that, letting xN = (

∑Nn=1 xn)/N and

yN = (∑N

n=1 yn)/N ,

[〈KxN,y〉 − F ∗(y) + G(xN)]

− [〈Kx,yN 〉 − F ∗(yN) + G(x)]

≤ 1

N

(‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2)(16)

for any (x, y) ∈ X × Y , which yields (9). Consider now acluster point (x∗, y∗) of (xN , yN) (which is a bounded se-quence, hence compact). Being G and F ∗ convex and l.s.c.,and it follows from (16) that

[⟨Kx∗, y

⟩− F ∗(y) + G(x∗)]

− [⟨Kx,y∗⟩− F ∗(y∗) + G(x)

]≤ 0

for any (x, y) ∈ X ×Y . This shows that (x∗, y∗) satisfies (2)and therefore is a saddle-point. We have shown point (b) inTheorem 1.

It remains to prove the convergence to a saddle-point ofthe whole sequence. Observe that this is the only part wherewe really need to assume that X and Y have finite dimen-sion. Point (a) establishes that (xn, yn) is a bounded se-quence, so that some subsequence (xnk , ynk ) converges tosome limit (x∗, y∗) (strongly, since the dimension of thespace is finite). Observe that (15) implies that limn(x

n −xn−1) = limn(y

n − yn−1) = 0, in particular also xnk−1 andynk−1 converge respectively to x∗ and y∗. It follows that thelimit (x∗, y∗) is a fixed point of the iterations (7), hence asaddle-point of our problem.

We can then take (x, y) = (x∗, y∗) in (14), which we sumfrom n = nk to N − 1, N > nk . We obtain

‖y∗ − yN‖2σ

2

+ ‖x∗ − xN‖2τ

2

+ (1 − √στL)

N∑

n=nk+1

‖yn − yn−1‖2σ

2

− ‖xnk − xnk−1‖2τ

2

+ (1 − √στL)

N−1∑

n=nk

‖xn − xn−1‖2τ

2

+ ‖xN − xN−1‖2τ

2

+⟨K(xN − xN−1), yN − y∗⟩

−⟨K(xnk − xnk−1), ynk − y∗⟩

≤ ‖y∗ − ynk‖2σ

2

+ ‖x∗ − xnk‖2τ

2

from which we easily deduce that xN → x∗ and yN → y∗as N → ∞. �

Remark 2 Note that when using τσL2 = 1 in (15), the con-trol of the estimate for yN is lost. However, one still has anestimate on xN

‖x − xN‖2τ

2

≤ ‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2

.

An analog estimate can be obtained by writing the algorithmin y.

‖y − yN‖2σ

2

≤ ‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2

.

Remark 3 Let us observe that also the global gap convergeswith the same rate O(1/N), under the additional assumptionthat F and G∗ have full domain. More precisely, we observethat if F ∗(y)/|y| → ∞ as |y| → ∞, then for any R > 0,F ∗(y) ≥ R|y| for y large enough which yields that domF ⊃B(0,R). Hence F has full domain. Conversely, if F has fulldomain, one can check that lim|y|→∞ F ∗(y)/|y| = +∞. Itis classical that in this case, F is locally Lipschitz in Y . Onechecks, then, that

maxy∈Y

〈y,KxN 〉 − F ∗(y) + G(xN) = F(KxN) + G(xN)

is reached at some y ∈ ∂F (KxN), which is globally boundedthanks to (8). It follows from (9) that F(KxN) + G(xN) −(F (Kx) + G(x)) ≤ D/N for some constant depending onthe starting point (x0, y0), F and L. In the same way, iflim|x|→∞ G(x)/|x| → ∞ (G∗ has full domain), we haveF ∗(yN) + G∗(−K∗yN) − (F ∗(y) + G∗(−K∗y)) ≤ D/N .If both F ∗(y)/|y| and G(x)/|x| diverge as |y| and |x| go toinfinity, then the global gap G(xN , yN) ≤ D/N .

3.2 The Arrow-Hurwicz Method (θ = 0)

We have seen that the classical Arrow-Hurwicz method [1]corresponds to the choice θ = 0 in Algorithm 1, that is, theparticular choice y = yn+1 and x = xn in (10). This leads to

{yn+1 = (I + σ∂F ∗)−1(yn + σKxn)

xn+1 = (I + τ∂G)−1(xn − τK∗yn+1).(17)

In [37], Zhu and Chan used this classical Arrow-Hurwiczmethod to solve the Rudin Osher and Fatemi (ROF) imagedenoising problem [33]. See also [12] for a proof of conver-gence of the Arrow-Hurwicz method with very small steps.A characteristic of the ROF problem (and also many others)is that the domain of F ∗ is bounded, i.e. F ∗(y) < +∞ ⇒‖y‖ ≤ D. With this assumption, we can modify the proof ofTheorem 1 to show the convergence of the Arrow-Hurwicz

Page 6: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 125

algorithm within O(1/√

N). A similar result can be foundin [22]. It seems, experimentally, that the convergence is alsoensured without this assumption, at least when G is uni-formly convex (which is the case in [37]), see Sect. 5 andthe experiments in Sect. 6.

Choosing x = xn, y = yn+1 in (12), we find that for anyβ ∈ (0,1]:⟨K(xn+1 − x), yn+1 − y

⟩−⟨K(xn+1 − x), yn+1 − y

=⟨K(xn+1 − xn), yn+1 − y

≥ −β‖xn+1 − xn‖

2

− τL2 ‖yn+1 − y‖2β

2

≥ −β‖xn+1 − xn‖

2

− τL2D2

2β(18)

where D = diam(domF ∗) and provided F ∗(y) < +∞.Then:

N∑

n=1

[⟨Kxn,y

⟩− F ∗(y) + G(xn)]

− [⟨Kx,yn

⟩− F ∗(yn) + G(x)]

+ ‖y − yN‖2σ

2

+ ‖x − xN‖2τ

2

+N∑

n=1

‖yn − yn−1‖2σ

2

+ (1 − β)

N∑

n=1

‖xn − xn−1‖2τ

2

≤ ‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2

+ NτL2D2

2β(19)

so that (16) is transformed in[〈KxN,y〉 − F ∗(y) + G(xN)

]

− [〈Kx,yN 〉 − F ∗(yN) + G(x)]

≤ 1

N

(‖y − y0‖2σ

2

+ ‖x − x0‖2τ

2)+ τ

L2D2

2β. (20)

This estimate differs from our estimate (16) by an addi-tional term, which shows that O(1/N) convergence can onlybe guaranteed within a certain error range. Observe that bychoosing τ = 1/

√N one obtains global O(1/

√N) rate of

convergence of the gap, which equals the worst case rateof black-box subgradient methods [24]. In case of the ROFmodel, Zhu and Chan [12] showed that by a clever adap-tion of the step sizes the Arrow-Hurwicz method achievesmuch faster convergence, although a theoretical justificationof the acceleration is still missing. In Sect. 5, we prove thata similar strategy applied to our algorithm also drasticallyimproves the convergence, in case one function has some

regularity property. We then have checked experimentallythat our same rules, applied to the Arrow-Hurwicz method,apparently yield a similar acceleration, but a proof is stillmissing, see Remarks 5 and 6.

4 Connections to Existing Algorithms

In this section we establish connections to well knownmethods. We first establish similarities with two algorithmswhich are based on extrapolational gradients [17, 30]. Wefurther show that for K being the identity, the proposedalgorithm reduces to the Douglas Rachford splitting algo-rithm [19]. As observed in [12], we finally show that it canalso be understood as a preconditioned version of the alter-nating direction method of multipliers.

4.1 Extrapolational Gradient Methods

We have already mentioned that the proposed algorithmshares similarities with two old methods [17, 30]. Let usbriefly recall these methods to point out some connections.In order to describe these methods, it is convenient to intro-duce the primal-dual pair z = ( x

y

), the convex, l.s.c. function

H(z) = G(x) + F ∗(y), and the linear map K = (−K∗,K) :(Y × X) → (X × Y).

The modified Arrow-Hurwicz method proposed by Popovin [30] can be written as

{zn+1 = (I + τ∂H)−1(zn + τKzn),

zn+1 = (I + τ∂H)−1(zn+1 + τKzn),(21)

where τ > 0 denotes the step size. This algorithm is knownto converge as long as τ < (3L)−1, L = ‖K‖. Observe thatin contrast to the proposed algorithm (7), Popov’s algorithmrequires a sequence {zn} of primal and dual leading points. Ittherefore has a larger memory footprint and it is less efficientin cases, where the evaluation of the resolvent operators iscomplex.

A similar algorithm, the so-called extragradient method,had been proposed (earlier) by Korpelevich in [17]. It con-sists in solving:

{zn+1 = (I + τ∂H)−1(zn + τKzn)

zn+1 = (I + τ∂H)−1(zn+1 + τKzn+1),(22)

where τ < (√

2L)−1, L = ‖K‖. The extragradient methodbears a lot of similarities with (21), although it is not com-pletely equivalent. In contrast to (21), the primal-dual lead-ing point zn+1 is now computed by taking an extragradi-ent step based on the current iterate. In [23], Nemirovskishowed that the extragradient method converges with a rateof O(1/N) for the gap.

Page 7: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

126 J Math Imaging Vis (2011) 40: 120–145

4.2 The Douglas-Rachford Splitting Algorithm

Computing the solution of a convex optimization problemis equivalent to the problem of finding zeros of a maximalmonotone operator T associated with the subgradient of theoptimization problem. The proximal point algorithm [31] isprobably the most fundamental algorithm for finding zeroesof T . It is written as the recursion

wn+1 = (I + τnT )−1(wn), (23)

where τn > 0 are the steps. Unfortunately, in most inter-esting cases (I + τnT )−1 is hard to evaluate and hence thepractical interest in the proximal point algorithm is limited.

If the operator T can be split up into a sum of two max-imal monotone operators A and B such that T = A + B

and (I + τA)−1 and (I + τB)−1 are easier to evaluate than(I +τT )−1, then one can devise algorithms which only needto evaluate the resolvent operators with respect to A and B .A number of different algorithms in this scenario have beenproposed. Let us focus here on the Douglas-Rachford split-ting algorithm (DRS) [19], which is known to be a specialcase of the proximal point algorithm (23), see [10]. The ba-sic DRS algorithm is defined through the iterations

{wn+1 = (I + τA)−1(2xn − wn) + wn − xn

xn+1 = (I + τB)−1(wn+1).(24)

Let us now apply the DRS algorithm to the primal prob-lem (3).4 We let A = K∗∂F (K) and B = ∂G and apply theDRS algorithm to A and B .⎧⎪⎪⎨

⎪⎪⎩

wn+1 = arg minv F (Kv) + 12τ

‖v − (2xn − wn)‖2

+ wn − xn

xn+1 = arg minx G(x) + 12τ

‖x − wn+1‖2.

(25)

By standard Fenchel-Rockafellar duality, we can derive thedual problem of the first minimization problem of (25). It iswritten

yn+1 = arg miny

F ∗(y) + τ

2

∥∥∥∥K

∗y − 2xn − wn

τ

∥∥∥∥

2

, (26)

where the primal and dual solutions are related via

wn+1 = xn − τK∗yn+1. (27)

Similarly, the dual of the minimization problem in the sec-ond line of (25) is

zn+1 = arg minz

G∗(z) + τ

2

∥∥∥∥z − wn+1

τ

∥∥∥∥

2

, (28)

4Clearly, the DRS algorithm can also be applied to the dual problem.

where the primal and the dual solutions are related via

xn+1 = wn+1 − τzn+1. (29)

Finally, by combining (29) with (26), substituting (27)into (28) and substituting (27) into (29) we arrive at⎧⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎩

yn+1 = arg miny F ∗(y) − 〈K∗y, xn〉 + τ2 ‖K∗y + zn‖2,

zn+1 = arg minz G∗(z) − 〈z, xn〉+ τ

2 ‖K∗yn+1 + z‖2,

xn+1 = xn − τ(K∗yn+1 + zn+1).

(30)

This variant of the DRS algorithm is also known as the al-ternating method of multipliers (ADMM). Using Moreau’sidentity (6), we can further simplify (30),⎧⎪⎪⎪⎨

⎪⎪⎪⎩

yn+1 = arg miny F ∗(y) − 〈K∗y, xn〉 + τ2 ‖K∗y + zn‖2

xn+1 = (I + τ∂G)−1(xn − τK∗yn+1)

zn+1 = xn − xn+1

τ − K∗yn+1.

(31)

We can now see that for K = I , the above scheme reducesto (7), meaning that in this case Algorithm 1 is equivalent tothe DRS algorithm (24) as well as to the ADMM (30).

4.3 Preconditioned ADMM

In many practical problems, G(x) and F ∗(y) are relativelyeasy to invert (e.g. total variation methods), but the mini-mization of the first step in (31) is still hard since it amountsto solve a least squares problem including the linear opera-tor K . As recently observed in [12], a clever idea is to addan additional prox term of the form

1

2

⟨M(y − yn), y − yn

⟩,

where M is a positive definite matrix, to the first step in (31).Then, by the particular choice

M = 1

σI − τKK∗, 0 < τσ < 1/L2

the update of the first step in (31) reduces to

yn+1 = arg miny

F ∗(y) − ⟨K∗y, xn

⟩+ τ

2

∥∥K∗y + zn∥∥2

+ 1

2

⟨(1

σ− τKK∗

)(y − yn), y − yn

= arg miny

F ∗(y) − ⟨y,Kxn

⟩+ τ

2

⟨y,KK∗y

+ τ⟨y,Kzn

+ 1

2σ〈y, y〉 − τ

2

⟨y,KK∗y

Page 8: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 127

−⟨y,

(1

σ− τKK∗

)yn

= arg miny

F ∗(y)

+ 1

∥∥y − (yn + σK

(xn − τ(K∗yn + zn)

))∥∥2.

(32)

This can be further simplified to

yn+1 = (I + σ∂F ∗)−1 (yn + σKxn), (33)

where we have defined

xn = xn − τ(K∗yn + zn)

= xn − τ

(K∗yn + xn−1 − xn

τ− K∗yn

)

= 2xn − xn−1. (34)

By the additional prox term the first step becomes explicitand hence, it can be understood as a preconditioner. Notethat the preconditioned version of the ADMM is equivalentto the proposed primal-dual algorithm.

5 Acceleration

Is shown in [2, 25, 27, 28] that if G or F ∗ is uniformlyconvex (such that G∗, or respectively F , has a Lipschitzcontinuous gradient), O(1/N2) convergence can be guar-anteed. Furthermore, in case both G and F ∗ are uniformlyconvex (equivalently, both G∗ and F have Lipschitz con-tinuous gradient), it is shown in [26] that linear convergence(i.e. O(ωN), ω < 1) can be achieved. In this section we showhow we can modify our algorithm in order to accelerate theconvergence in these situations, to the same rate.

5.1 The Case G or F ∗ Uniformly Convex

For simplicity we will only treat the case where G is uni-formly convex, since by symmetry, the case where F ∗ isuniformly convex is completely equivalent. Let us assumethe existence of γ > 0 such that for any x ∈ dom ∂G,

G(x′) ≥ G(x) + ⟨p,x′ − x

⟩+ γ

2‖x − x′‖2,

∀p ∈ ∂G(x), x′ ∈ X. (35)

In that case one can show that ∇G∗ is 1/γ -Lipschitz so thatthe dual problem (4) can be solved in O(1/N2) using anyof the accelerated first order methods of [2, 25, 27], in thesense that the objective (in this case, the dual energy) ap-proaches its optimal value at the rate O(1/N2), where N

is the number of first order iterations. We explain now thata modification of our approach yields essentially the samerate of convergence. An alternative way to reach accelera-tion is to use a reinitialization, such as in [28]. However, wefound this approach less efficient, see the Appendix in [9]for details.

In view of (5), it follows from (35) that for any saddle-point (x, y) and any (x, y) ∈ X × Y .[⟨Kx, y

⟩− F ∗(y) + G(x)]− [⟨

Kx, y⟩− F ∗(y) + G(x)

]

= G(x) − G(x) + ⟨K∗y, x − x

⟩+ F ∗(y)

− F ∗(y) − ⟨Kx, y − y

≥ γ

2‖x − x‖2. (36)

Observe that in case G satisfies (35), we can also replace thesecond equation in (11) with the stronger inequality:

G(x) ≥ G(xn+1) +⟨xn − xn+1

τ, x − xn+1

−⟨K(x − xn+1), y

⟩+ γ

2‖x − xn+1‖2.

Then, modifying (12) accordingly, choosing (x, y) = (x, y)

a saddle-point and using (36), we deduce that

‖y − yn‖2σ

2

+ ‖x − xn‖2τ

2

≥ γ ‖x − xn+1‖2 + ‖y − yn+1‖2σ

2

+ ‖x − xn+1‖2τ

2

+ ‖yn − yn+1‖2σ

2

+ ‖xn − xn+1‖2τ

2

+⟨K(xn+1 − x), yn+1 − y

−⟨K(xn+1 − x), yn+1 − y

⟩. (37)

Now, we will show that we can gain acceleration of thealgorithm, provided we use dynamic steps (τn, σn) and avariable relaxation parameter θn ∈ [0,1]. We therefore con-sider the case where we choose in (37)

x = xn + θn−1(xn − xn−1), y = yn+1.

We obtain, adapting (13), and introducing the dependenceon n also for τ, σ ,

‖y − yn‖2σn

2

+ ‖x − xn‖2τn

2

≥ γ ‖x − xn+1‖2 + ‖y − yn+1‖2σn

2

+ ‖x − xn+1‖2τn

2

+ ‖yn − yn+1‖2σn

2

+ ‖xn − xn+1‖2τn

2

Page 9: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

128 J Math Imaging Vis (2011) 40: 120–145

Algorithm 2

• Initialization: Choose τ0, σ0 > 0 with τ0σ0L2 ≤ 1, (x0, y0) ∈ X × Y , and x0 = x0.

• Iterations (n ≥ 0): Update xn, yn, xn, θn, τn, σn as follows:

⎧⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎩

yn+1 = (I + σn∂F ∗)−1(yn + σnKxn)

xn+1 = (I + τn∂G)−1(xn − τnK∗yn+1)

θn = 1/√

1 + 2γ τn, τn+1 = θnτn, σn+1 = σn/θn

xn+1 = xn+1 + θn(xn+1 − xn)

(38)

+⟨K(xn+1 − xn), yn+1 − y

− θn−1

⟨K(xn − xn−1), yn − y

− θn−1L‖xn − xn−1‖‖yn+1 − yn‖.It follows that

‖y − yn‖σn

2

+ ‖x − xn‖τn

2

≥ (1 + 2γ τn)τn+1

τn

‖x − xn+1‖τn+1

2

+ σn+1

σn

‖y − yn+1‖σn+1

2

+ ‖yn − yn+1‖σn

2

+ ‖xn − xn+1‖τn

2

− ‖yn − yn+1‖σn

2

−θ2n−1L

2σnτn−1‖xn − xn−1‖

τn−1

2

+ 2⟨K(xn+1 − xn), yn+1 − y

− 2θn−1

⟨K(xn − xn−1), yn − y

⟩. (39)

It is clear that we can get something interesting out of (39)provided we can choose the sequences (τn)n,(σn)n in such away that

(1 + 2γ τn)τn+1

τn

= σn+1

σn

> 1.

This motivates the following Algorithm 2, which is a variantof Algorithm 1.

Observe that it means choosing θn−1 = τn/τn−1 in (39).Now, (1 + 2γ τn)τn+1/τn = σn+1/σn = 1/θn = τn/τn+1, sothat (39) becomes, denoting for each n ≥ 0

n = ‖y − yn‖σn

2

+ ‖x − xn‖τn

2

,

dividing the equation by τn, and using L2σnτn =L2σ0τ0 ≤ 1,

n

τn

≥ n+1

τn+1+ ‖xn − xn+1‖

τ 2n

2

− ‖xn − xn−1‖τ 2n−1

2

+ 2

τn

⟨K(xn+1 − xn), yn+1 − y

− 2

τn−1

⟨K(xn − xn−1), yn − y

⟩. (40)

It remains to sum this equation from n = 0 to n = N − 1,N ≥ 1. We obtain (using x−1 = x0)

0

τ0≥ N

τN

+ ‖xN−1 − xN‖τ 2N−1

2

+ 2

τN−1

⟨K(xN − xN−1), yN − y

≥ N

τN

+ ‖xN−1 − xN‖τ 2N−1

2

− ‖xN−1 − xN‖τ 2N−1

2

− L2‖yN − y‖2

which eventually gives:

τ 2N

1 − L2σ0τ0

σ0τ0‖y − yN‖2 + ‖x − xN‖2

≤ τ 2N

(‖x − x0‖τ 2

0

2

+ ‖y − y0‖σ0τ0

2). (41)

In case one chooses exactly σ0τ0L2 = 1, it boils down to:

‖x − xN‖2 ≤ τ 2N

(‖x − x0‖τ 2

0

2

+ L2‖y − y0‖2)

. (42)

Now, let us show that γ τN ∼ N−1 for N (not too large),for any “reasonable” choice of τ0. Here by “reasonable”,we mean any (large) number which can be encoded on astandard computer. It will follow that our scheme shows anO(1/N2) convergence to the optimum for the variable xN ,which is the “best” known rate for this general class of prob-lem [2, 25, 27].

Page 10: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 129

Lemma 1 Let λ ∈ (1/2,1) and assume γ τ0 > λ. Then after

N ≥ 1

ln 2ln

(ln 2γ τ0

ln 2λ

)(43)

iterations, one has γ τN ≤ λ.

Observe that in particular, if we take for instance γ τ0 =1020 and λ = 3/4, we find that γ τN ≤ 3/4 as soon asN ≥ 17 (this estimate is far from optimal, as in this casewe already have γ τ7 ≈ 0.546 < 3/4).

Proof From (38) we see that τN = γ τN follows the updaterule

τN+1 = τN√1 + 2τN

, (44)

in particular, letting for each N ≥ 0 sN = 1/τN , it follows

√2sN ≤ sN+1 = (sN + 1)

1 − 1

(sN + 1)2

≤ sN + 1. (45)

From the left-hand side inequality it follows that sN ≥2(s0/2)1/2N

. Now, γ τN ≤ λ if and only if sN ≥ 1/λ, whichis ensured as soon as (s0/2)1/2N ≥ 1/(2λ), and we de-duce (43). �

Lemma 2 Let λ > 0, and N ≥ 0 with γ τN ≤ λ. Then forany l ≥ 0,

(γ τN)−1 + l

1 + λ≤ (γ τN+l)

−1 ≤ (γ τN)−1 + l. (46)

Proof The right-hand side inequality trivially follows from(45). Using

√1 − t ≥ 1 − t for t ∈ [0,1], we also deduce

that

sN+1 ≥ sN + 1 − 1

sN + 1= sN + sN

sN + 1.

Hence if sN ≥ 1/λ, sN+l ≥ 1/λ for any l ≥ 0 and we deduceeasily the left-hand side of (46). �

Corollary 1 One has limN→∞ Nγ τN = 1.

We have shown the following result:

Theorem 2 Choose τ0 > 0, σ0 = 1/(τ0L2), and let

(xn, yn)n≥1 be defined by Algorithm 2. Then for any ε > 0,there exists N0 (depending on ε and γ τ0) such that for anyN ≥ N0,

‖x − xN‖2 ≤ 1 + ε

N2

(‖x − x0‖γ 2τ 2

0

2

+ L2

γ 2‖y − y0‖2

).

Fig. 1 The figure shows the sequence (γ τn)n≥1, using γ = 1. Observethat it goes very fast to 1/n, in a way which is quite insensitive to theinitial τ0

Of course, the important point is that the convergence inCorollary 1 is relatively fast, for instance, if γ τ0 = 1020, onecan check that γ τ100 ≈ 1.077/100. Hence, the value of N0

in Theorem 2 is never very large, even for large values of τ0,see Fig. 1. If one has some estimate on the initial distance‖x −x0‖, a good choice is to pick γ τ0 � ‖x −x0‖ in whichcase the convergence estimate boils down approximately to

‖x − xN‖2 � 1

N2

(η + L2

γ 2‖y − y0‖2

)

with η � 1, and for N (not too) large enough.

Remark 4 In [2, 25, 27], the O(1/N2) estimate is the-oretically better than ours since it is on the dual energyG∗(−K∗yN)+F ∗(yN)−(G∗(−K∗y)+F ∗(y)) (which caneasily be shown to bound ‖xN − x‖2, see for instance [13]).In practice, however, we did not observe that our approachwas slower than the other accelerated first-order schemes.

Remark 5 We have observed that replacing the last updatingrule in (38) with xn+1 = xn+1, which corresponds to con-sidering the standard Arrow-Hurwicz algorithm with vary-ing steps as in [37], the same rate of convergence is ob-served (with even smaller constants, in practice) than us-ing Algorithm 2. It seems that for an appropriate choice ofthe initial steps, this algorithm has also the O(1/N2) rate,however, some instabilities are observed in the convergence,see Fig. 3. On the other hand, it is relatively easy to showO(1/N) convergence of this approach with the assumptionthat domF ∗ is bounded, just as in Sect. 3.2.

5.2 The Case when G and F ∗ are Uniformly Convex

In case G and F ∗ are both uniformly convex, it is knownthat first order algorithms should converge linearly to the

Page 11: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

130 J Math Imaging Vis (2011) 40: 120–145

(unique) optimal value. We show that it is indeed a featureof our algorithm.

We assume that G satisfies (35), and that F ∗ satisfies asimilar inequality with a parameter δ > 0 instead of γ . Inparticular, (36) becomes in this case

[⟨Kx, y

⟩− F ∗(y) + G(x)]− [〈Kx,y〉 − F ∗(y) + G(x)

]

≥ γ

2‖x − x‖2 + δ

2‖y − y‖2 (47)

where (x, y) is the unique saddle-point of our problem. Fur-thermore, (11) becomes

F ∗(y) ≥ F ∗(yn+1) +⟨yn − yn+1

σ,y − yn+1

+⟨Kx, y − yn+1

⟩+ δ

2‖y − yn+1‖2,

G(x) ≥ G(xn+1) +⟨xn − xn+1

τ, x − xn+1

−⟨K(x − xn+1), y

⟩+ γ

2‖x − xn+1‖2.

In this case, the inequality (12), for x = x and y = y, be-comes

‖y − yn‖2σ

2

+ ‖x − xn‖2τ

2

≥(

2δ + 1

σ

) ‖y − yn+1‖2

2

+(

2γ + 1

τ

) ‖x − xn+1‖2

2

+ ‖yn − yn+1‖2σ

2

+ ‖xn − xn+1‖2τ

2

+⟨K(xn+1 − x), yn+1 − y

−⟨K(xn+1 − x), yn+1 − y

⟩. (48)

Let us define μ = 2√

γ δ/L, and choose σ , τ with

τ = μ

2γ= 1

L

√δ

γ, σ = μ

2δ= 1

L

√γ

δ. (49)

In particular we still have στL2 = 1. Let also

n := δ‖y − yn‖2 + γ ‖x − xn‖2, (50)

and (48) becomes

n ≥ (1 + μ)n+1 + δ‖yn − yn+1‖2

+ γ ‖xn − xn+1‖2 + μ⟨K(xn+1 − x), yn+1 − y

− μ⟨K(xn+1 − x), yn+1 − y

⟩. (51)

Let us now choose

x = xn + θ(xn − xn−1), y = yn+1, (52)

in (51), for (1 + μ)−1 ≤ θ ≤ 1. In case θ = 1 it is thesame rule as in Theorem 1, but as we will see the conver-gence seems theoretically improved if one chooses insteadθ = 1/(1 + μ). It follows from rule (52) that⟨K(xn+1 − x), yn+1 − y

⟩−⟨K(xn+1 − x), yn+1 − y

=⟨K(xn+1 − xn), yn+1 − y

− θ⟨K(xn − xn−1), yn+1 − y

⟩. (53)

We now introduce ω ∈ [(1 + μ)−1, θ ], ω < 1, which wewill choose later on (if θ = 1/(1 + μ) we will obviously letω = θ ). We rewrite (53) as⟨K(xn+1 − xn), yn+1 − y

⟩− ω

⟨K(xn − xn−1), yn − y

− ω⟨K(xn − xn−1), yn+1 − yn

− (θ − ω)⟨K(xn − xn−1), yn+1 − y

≥⟨K(xn+1 − xn), yn+1 − y

− ω⟨K(xn − xn−1), yn − y

− ωL

‖xn − xn−1‖2

2

+ ‖yn+1 − yn‖2α

2)

− (θ − ω)L

‖xn − xn−1‖2

2

+ ‖yn+1 − y‖2α

2), (54)

for any α > 0, and gathering (53) and (54) we obtain

μ⟨K(xn+1 − x), yn+1 − y

− μ⟨K(xn+1 − x), yn+1 − y

≥ μ(⟨

K(xn+1 − xn), yn+1 − y⟩

−ω⟨K(xn − xn−1), yn − y

⟩)

− μθLα‖xn − xn−1‖

2

2

− μωL‖yn+1 − yn‖

2

− μ(θ − ω)L‖yn+1 − y‖

2

. (55)

From (51), (55), and choosing α = ω(√

γ /δ), we find

n ≥ 1

ωn+1 +

(1 + μ − 1

ω

)n+1 + δ‖yn − yn+1‖2

Page 12: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 131

Algorithm 3

• Initialization: Choose μ ≤ 2√

γ δ/L, τ = μ/(2γ ), σ = μ/(2δ), and θ ∈ [1/(1 + μ),1]. Let (x0, y0) ∈ X × Y , andx0 = x0.

• Iterations (n ≥ 0): Update xn, yn, xn as follows:⎧⎪⎨

⎪⎩

yn+1 = (I + σ∂F ∗)−1(yn + σKxn)

xn+1 = (I + τ∂G)−1(xn − τK∗yn+1)

xn+1 = xn+1 + θ(xn+1 − xn)

(56)

+ γ ‖xn − xn+1‖2 + μ(⟨K(xn+1 − xn), yn+1 − y

− ω⟨K(xn − xn−1), yn − y

⟩)− ωθγ ‖xn−1 − xn‖2

− δ‖yn+1 − yn‖2 − θ − ω

ωδ‖yn+1 − y‖2. (57)

We require that (1 + μ − 1ω) ≥ (θ − ω)/ω, which is ensured

by letting

ω = 1 + θ

2 + μ= 1 + θ

2(1 +√

γ δL

). (58)

Then, (57) becomes

n ≥ 1

ωn+1 + γ ‖xn − xn+1‖2 − ωθγ ‖xn−1 − xn‖2

+ μ(⟨K(xn+1 − xn), yn+1 − y

− ω⟨K(xn − xn−1), yn − y

⟩), (59)

which we sum from n = 0 to N − 1 after multiplying byω−n, and assuming that x−1 = x0:

0 ≥ ω−NN + ω−N+1γ ‖xN − xN−1‖2

+ μω−N+1⟨K(xN − xN−1), yN − y

≥ ω−NN + ω−N+1γ ‖xN − xN−1‖2

− μω−N+1L

(√γ

δ

‖xN − xN−1‖2

2

+√

δ

γ

‖yN − y‖2

2)

≥ ω−NN − ω−N+1δ‖yN − y‖2.

We deduce the estimate

γ ‖xN − x‖2 + (1 − ω)δ‖yN − y‖2

≤ ωN(γ ‖x0 − x‖2 + δ‖y0 − y‖2

)(60)

showing linear convergence of the iterates (xN , yN) to the(unique) saddle-point. We have shown the convergence of

the algorithm summarized in Algorithm 3. We concludewith the following theorem.

Theorem 3 Consider the sequence (xn, yn) provided byAlgorithm 3 and let (x, y) be the unique solution of (2).Let ω < 1 be given by (58). Then (xN , yN) → (x, y) inO(ωN/2), more precisely, there holds (60).

Observe that if we choose θ = 1, this is an improvementover Theorem 1 (with a particular choice of the steps, givenby (49)). It would be interesting to understand whether thesteps can be estimated in Algorithm 1 without the a prioriknowledge of γ and δ.

Remark 6 Again, we checked experimentally that takingθ ∈ [0,1/(1 + μ)] in (56) also yields convergence, andsometimes faster, of Algorithm 3. In particular, the standardArrow-Hurwicz method (θ = 0) seems to work very wellwith these choices of τ and σ . On the other hand, it is rela-tively easy to show linear convergence of this method withan appropriate (different) choice of τ and σ , however, thetheoretical rate is then worse than the one which we findin (58).

6 Comparisons and Applications

In this section we first present comparisons of the proposedalgorithms to state-of-the-art methods. Then we illustratethe wide applicability of the proposed algorithm on severaladvanced imaging problems. Let us first introduce the dis-crete setting which we will use in the rest of this section.

6.1 Discrete Setting

We consider a regular Cartesian grid of size M × N :

{(ih, jh) : 1 ≤ i ≤ M,1 ≤ j ≤ N} ,

where h denotes the size of the spacing and (i, j) denotethe indices of the discrete locations (ih, jh) in the image

Page 13: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

132 J Math Imaging Vis (2011) 40: 120–145

domain. Let X = RMN be a finite dimensional vector space

equipped with a standard scalar product

〈u,v〉X =∑

i,j

ui,j vi,j , u, v ∈ X.

The gradient ∇u is a vector in the vector space Y = X ×X. For discretization of ∇ : X → Y , we use standard finitedifferences with Neumann boundary conditions

(∇u)i,j =(

(∇u)1i,j

(∇u)2i,j

)

,

where

(∇u)1i,j =

{ui+1,j −ui,j

hif i < M

0 if i = M,

(∇u)2i,j =

{ui,j+1−ui,j

hif j < N

0 if j = N.

We also define a scalar product in Y

〈p,q〉Y =∑

i,j

p1i,j q

1i,j + p2

i,j q2i,j ,

p = (p1,p2), q = (q1, q2) ∈ Y.

Furthermore we will also need the discrete divergence op-erator divp : Y → X, which is chosen to be adjoint to thediscrete gradient operator. In particular, one has −div = ∇∗which is defined through the identity

〈∇u,p〉Y = −〈u,divp〉X .

We also need to compute a bound on the norm of the linearoperator ∇ . According to (1), one has

L2 = ‖∇‖2 = ‖div‖2 ≤ 8/h2.

See again [6] for a proof.

6.2 Total Variation Based Image Denoising

In order to evaluate and compare the performance of the pro-posed primal-dual algorithm to state-of-the-art methods, wewill consider three different convex image denoising mod-els, each having a different degree of regularity. Throughoutthe experiments, we will make use of the following proce-dure to determine the performance of each algorithm. Wefirst run a well performing method for a very long time(∼100000 iterations) in order to compute a “ground truth”solution. Then, we apply each algorithm until the error(based on the solution or the energy) to the pre-determinedground truth solution is below a certain threshold ε. For eachalgorithm, the parameters are optimized to give an optimal

performance, but stay constant for all experiments. All algo-rithms were implemented in Matlab and executed on a 2.66GHz CPU, running a 64 Bit Linux system.

6.2.1 The ROF Model

As a prototype for total variation methods in imaging werecall the total variation based image denoising model pro-posed by Rudin, Osher and Fatemi in [33]. The ROF modelis defined as the variational problem

minx

|Du| + λ

2‖u − g‖2

2 , (61)

where � ⊂ Rd is the d-dimensional image domain, u ∈

L1(�) is the sought solution and g ∈ L1(�) is the noisyinput image. The parameter λ is used to define the tradeoffbetween regularization and data fitting. The term

∫�

|Du|is the total variation of the function u, where Du denotesthe distributional derivative, which is, in an integral sense,also well-defined for discontinuous functions. For suffi-ciently smooth functions u, e.g. u ∈ W 1,1(�) it reduces to∫�

|∇u|dx. The main advantage of the total variation andhence of the ROF model is its ability to preserve sharp edgesin the image, which is important for many imaging prob-lems. Using the discrete setting introduced above (in dimen-sion d = 2), the discrete ROF model, which we call the pri-mal ROF problem is then given by

h2 minu∈X

‖∇u‖1 + λ

2‖u − g‖2

2, (62)

where u,g ∈ X are the unknown solution and the givennoisy data. The norm ‖u‖2

2 = 〈u,u〉X denotes the standardsquared L2 norm in X and ‖∇u‖1 denotes the discrete ver-sion of the isotropic total variation norm defined as

‖∇u‖1 =∑

i,j

|(∇u)i,j |,

|(∇u)i,j | =√

((∇u)1i,j )

2 + ((∇u)2i,j )

2.

Casting (62) in the form of (3), we see that F(∇u) = ‖∇u‖1

and G(u) = λ2 ‖u − g‖2

2. Note that in what follows, we willalways disregard the multiplicative factor h2 appearing inthe discretized energies such as (62), since it causes only arescaling of the energy and does not change the solution.

According to (2), the primal-dual formulation of the ROFproblem is given by

minu∈X

maxp∈Y

−〈u,divp〉X + λ

2‖u − g‖2

2 − δP (p), (63)

where p ∈ Y is the dual variable. The convex set P is givenby

P = {p ∈ Y : ‖p‖∞ ≤ 1} , (64)

Page 14: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 133

Fig. 2 Image denoising usingthe ROF model. The left columnshows the noisy input image ofsize 256 × 256, with additivezero mean Gaussian noise(σ = 0.05) and the denoisedimage using λ = 16. The rightcolumn shows the noisy inputimage but now with (σ = 0.1)

and the denoised image usingλ = 8

and ‖p‖∞ denotes the discrete maximum norm defined as

‖p‖∞ = maxi,j

|pi,j |, |pi,j | =√

(p1i,j )

2 + (p2i,j )

2.

Note that the set P is the product of pointwise L2 balls.The function δP denotes the indicator function of the set P

which is defined as

δP (p) ={

0 if p ∈ P,

+∞ if p /∈ P.(65)

Furthermore, the primal ROF problem (62) and the primal-dual ROF problem (63) and are equivalent to the dual ROFproblem

maxp∈Y

−(

1

2λ‖divp‖2

2 + 〈g,divp〉X + δP (p)

). (66)

In order to apply the proposed algorithms to (63), it re-mains to detail the resolvent operators (I + σ∂F ∗)−1 and(I + τ∂G)−1. First, casting (63) in the form of the gen-eral saddle-point problem (2) we see that F ∗(p) = δP (p)

and G(u) = λ2 ‖u − g‖2

2. Since F ∗ is the indicator functionof a convex set, the resolvent operator reduces to pointwiseEuclidean projectors onto L2 balls

p = (I + σ∂F ∗)−1(p) ⇐⇒ pi,j = pi,j

max(1, |pi,j |) .

The resolvent operator with respect to G poses simple point-wise quadratic problems. The solution is trivially given by

u = (I + τ∂G)−1(u) ⇐⇒ ui,j = ui,j + τλgi,j

1 + τλ.

Observe that G(u) is uniformly convex with convexity pa-rameter λ and hence we can make use of the acceleratedO(1/N2) algorithm.

Figure 2 shows the denoising capability of the ROFmodel using different noise levels. Note that the ROF modelefficiently removes the noise while preserving the disconti-nuities in the image. For performance evaluation, we use thefollowing algorithms and parameter settings:

• ALG1: O(1/N) primal-dual algorithm as described inAlgorithm 1, with τ = 0.01, τσL2 = 1, taking the lastiterate instead of the average.

• ALG2: O(1/N2) primal-dual algorithm as described inAlgorithm 2, with adaptive steps, τ0 = 1/L, τnσnL

2 = 1,γ = 0.35λ.

• ALG4: O(1/N2) algorithm described in the appendix of[9], based on ALG1 and reinitializations. The parametersare q = 1, N0 = 1, r = 2, γ = λ, and we used the lastiterate in the inner loop instead of the averages.

• AHMOD: Arrow-Hurwicz primal-dual algorithm (17) us-ing the rule described in (38), τ0 = 0.02, τnσnL

2/4 = 1,γ = 0.35λ.

Page 15: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

134 J Math Imaging Vis (2011) 40: 120–145

Table 1 Performance evaluation using the images shown in Fig. 2. The entries in the table refer to the number of iterations, respectively the CPUtimes in seconds, which were needed to drop the root mean squared error of the solution below the error tolerance ε. The “–” entries indicate thatthe algorithm failed to drop the error below ε within a maximum number of 100000 iterations

λ = 16 λ = 8

ε = 10−4 ε = 10−6 ε = 10−4 ε = 10−6

ALG1 214 (3.38 s) 19544 (318.35 s) 309 (5.20 s) 24505 (392.73 s)

ALG2 108 (1.95 s) 937 (14.55 s) 174 (2.76 s) 1479 (23.74 s)

ALG4 124 (2.07 s) 1221 (19.42 s) 200 (3.14 s) 1890 (29.96 s)

AHMOD 64 (0.91 s) 498 (6.99 s) 122 (1.69 s) 805 (10.97 s)

AHZC 65 (0.98 s) 634 (9.19 s) 105 (1.65 s) 1001 (14.48 s)

FISTA 107 (2.11 s) 999 (20.36 s) 173 (3.84 s) 1540 (29.48 s)

NEST 106 (3.32 s) 1213 (38.23 s) 174 (5.54 s) 1963 (58.28 s)

ADMM 284 (4.91 s) 25584 (421.75 s) 414 (7.31 s) 33917 (547.35 s)

PGD 620 (9.14 s) 58804 (919.64 s) 1621 (23.25 s) –

CFP 1396 (20.65 s) – 3658 (54.52 s) –

• AHZC: Arrow-Hurwicz primal-dual algorithm (17) withadaptive steps proposed by Zhu and Chan in [37].

• FISTA: O(1/N2) fast iterative shrinkage thresholding al-gorithm on the dual ROF problem (66) [2, 25]. We alsotested a monotone variant of FISTA, called MFISTA,which however showed exactly the same performance onthe tested problems.

• NEST: O(1/N2) algorithm proposed by Nesterov in [27],on the dual ROF problem (66).

• ADMM: Alternating direction method of multipliers (30),on the dual ROF problem (66), τ = 20. (See also [11,15, 19].) Two Jacobi iterations to approximately solve thelinear sub-problem.

• PGD: O(1/N) projected (sub)gradient descent on thedual ROF problem (66) [2, 7].

• CFP: Chambolle’s fixed-point algorithm proposed in [6],on the dual ROF problem (66).

Table 1 shows the results of the performance evalua-tion for the images showed in Fig. 2. One can see thatthe ROF problem gets harder for stronger regularization.This is explained by the fact that for stronger regulariza-tion the data term becomes less important and hence theTV-regularization term dominates. Furthermore, one can seethat the theoretical efficiency rates of the algorithms are wellreflected by the experiments. For the O(1/N) methods, thenumber of iterations are increased by approximately a fac-tor of 100 when decreasing the error threshold by a factorof 100. In contrast, for the O(1/N2) methods, the numberof iterations is only increased by approximately a factor of10. The Arrow-Hurwicz type methods (AHMOD, AHZC)appear to be the fastest algorithms. This still remains a mys-tery, since we do not have a theoretical explanation yet. In-terestingly, by using our acceleration rule, AHMOD evenoutperforms AHZC. The performance of ALG2 is slightly

Fig. 3 Convergence of AHZC and ALG2 for the experiment in the lastcolumn of Table 1

worse, but still outperforms well established algorithms suchas FISTA and NEST. Figure 3 plots the convergence ofAHZC and ALG together with the theoretical O(1/N2) rate.ALG4 appears to be slightly worse than ALG2, which isalso justified theoretically. ALG1 appears to be the fastestO(1/N) method, but note that the O(1/N) methods quicklybecome infeasible when requiring a higher accuracy. Inter-estingly, ADMM, which is often considered to be a fastmethod for solving L1 related problems, seems to be slow inour experiments. PGD and CFP are competitive only, whenrequiring a low accuracy.

6.2.2 The TV-L1 Model

The TV-L1 model is obtained as a variant of the ROFmodel (61) by replacing the squared L2 norm in the data

Page 16: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imagingwims.math.ecnu.edu.cn/ChunLi/Seminar_20110415/A First... · 2011-05-07 · convergent (O(ωN),

J Math Imaging Vis (2011) 40: 120–145 135

Fig. 4 Image denoising in thecase of impulse noise. (a) showsthe 500 × 375 input image and(b) is a noisy version which hasbeen corrupted by 25% salt andpepper noise. (c) is the result ofthe ROF model. (d) is the resultof the TV−L1 model. Note thatthe T V − L1 model is able toremove the noise while stillpreserving some small details

term by the robust L1 norm.

minu

|Du| + λ‖u − g‖1. (67)

Although only a slight change, the TV-L1 model offers somepotential advantages over the ROF model. First, one cancheck that it is contrast invariant. Second, it turns out thatthe TV-L1 model is much more effective in removing noisecontaining strong outliers (e.g. salt&pepper noise). The dis-crete version of (67) is given by

minu∈X

‖∇u‖1 + λ‖u − g‖1. (68)

In analogy to (63), the saddle-point formulation of (68) isgiven by

minu∈X

maxp∈Y

−〈u,divp〉X + λ‖u − g‖1 − δP (p). (69)

Comparing with the ROF problem (63), we see that theonly difference is that the function G(u) is now G(u) =λ‖u − g‖1, and hence we only have to change the resolventoperator with respect to G. The solution of the resolvent op-erator is given by the pointwise shrinkage operations

u = (I + τ∂G)−1(u) ⇐⇒

ui,j =⎧⎨

ui,j − τλ if ui,j − gi,j > τλ

ui,j + τλ if ui,j − gi,j < −τλ

gi,j if |ui,j − gi,j | ≤ τλ.

Observe that in contrast to the ROF model, the TV-L1 modelposes a non-smooth optimization problem. Hence, we haveto apply the proposed O(1/N) primal-dual algorithm.

Figure 4 shows an example of outlier removal using the TV-L1 model. Note that while the ROF model leads to an over-regularized result, the TV-L1 model efficiently removes the outliers while preserving small details. Next, we compare the performance of different algorithms on the TV-L1 problem. For the performance evaluation, we use the following algorithms and parameters:

• ALG1: O(1/N) primal-dual algorithm as described in Algorithm 1, τ = 0.02, τσL² = 1.

• ADMM: Alternating direction method of multipliers (30), on the dual TV-L1 problem, τ = 10 (see also [11]). Two Jacobi iterations to approximately solve the linear sub-problem.

• EGRAD: O(1/N) extragradient method (22), step size τ = 1/√(2L²) (see also [17, 23]).

• NEST: O(1/N) method proposed by Nesterov in [27], on the primal TV-L1 problem, smoothing parameter μ = ε.

Table 2 presents the results of the performance evaluation for the image shown in Fig. 4. Since the solution of the TV-L1 model is in general not unique, we cannot compare the RMSE of the solution. Instead we use the normalized error of the primal energy, (E_n − E*)/E* > 0, where E_n is the primal energy of the current iterate n and E* is the primal energy of the true solution, as a stopping criterion. ALG1 appears to be the fastest algorithm, followed by ADMM. Figure 5 plots the convergence of ALG1 and ADMM together with the theoretical O(1/N) bound. Note that again, the proposed primal-dual algorithm significantly outperforms the state-of-the-art method ADMM. Paradoxically, it seems that both ALG1 and ADMM converge like O(1/N²) in the end, but we do not have any explanation for this yet.

Table 2 Performance evaluation using the image shown in Fig. 4. The entries in the table refer to the number of iterations, respectively the CPU times in seconds, which were needed to drop the normalized error of the primal energy below the error tolerance ε

λ = 1.5     ε = 10⁻⁴             ε = 10⁻⁵
ALG1        187 (15.81 s)        421 (36.02 s)
ADMM        385 (33.26 s)        916 (79.98 s)
EGRAD       2462 (371.13 s)      8736 (1360.00 s)
NEST        2406 (213.41 s)      15538 (1386.95 s)

Fig. 5 Convergence for the TV-L1 model

6.2.3 The Huber-ROF Model

Total Variation methods applied to image regularization suffer from the so-called staircasing problem. The effect refers to the formation of artificial flat areas in the solution (see Fig. 6 (b)). A remedy for this unwanted effect is to replace the L1 norm in the total variation term by the Huber-norm

|x|_α = |x|²/(2α)   if |x| ≤ α,
        |x| − α/2   if |x| > α,          (70)

where α > 0 is a small parameter defining the tradeoff between quadratic regularization (for small values) and total variation regularization (for larger values).
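As a quick illustration, the Huber function (70) can be evaluated elementwise as in the following numpy sketch; the function name is ours. In the Huber-ROF model it is applied to the pointwise gradient magnitude |∇u_{i,j}|.

import numpy as np

def huber(x, alpha):
    # Huber function |x|_alpha from (70): quadratic near zero, linear in the tails.
    ax = np.abs(x)
    return np.where(ax <= alpha, ax ** 2 / (2.0 * alpha), ax - alpha / 2.0)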

This change can be easily integrated into the primal-dual ROF model (63) by replacing the term F*(p) = δ_P(p) by F*(p) = δ_P(p) + (α/2)‖p‖₂². Hence the primal-dual formulation of the Huber-ROF model is given by

min_{u∈X} max_{p∈Y} −⟨u, div p⟩_X + (λ/2)‖u − g‖₂² − δ_P(p) − (α/2)‖p‖₂².   (71)

Table 3 Performance evaluation using the image shown in Fig. 6. The entries in the table refer to the number of iterations, respectively the CPU times in seconds, which the algorithms needed to drop the root mean squared error below the error tolerance ε

λ = 5, α = 0.05     ε = 10⁻¹⁵
ALG3                187 (3.85 s)
NEST                248 (5.52 s)

Consequently, the resolvent operator with respect to F* is given by the pointwise operations

p = (I + σ∂F*)^{−1}(p̃) ⇐⇒ p_{i,j} = (p̃_{i,j}/(1 + σα)) / max(1, |p̃_{i,j}/(1 + σα)|).

Note that the Huber-ROF model is uniformly convex in G(u) and F*(p), with convexity parameters λ and α. Therefore, we can make use of the linearly convergent algorithm.
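For illustration, the resolvent above and one consistent set of accelerated step sizes can be written as follows. The array layout and names are ours; the step sizes are one choice satisfying τσL² = 1 with μ = 2√(γδ)/L (γ = λ, δ = α, as listed for ALG3 below) and are not necessarily the exact settings used in the experiments.

import numpy as np

def resolvent_Fstar_huber(px, py, sigma, alpha):
    # (I + sigma*dF*)^{-1}: rescale by 1/(1 + sigma*alpha), then project
    # pointwise onto the unit ball (the set P).
    px = px / (1.0 + sigma * alpha)
    py = py / (1.0 + sigma * alpha)
    norm = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))
    return px / norm, py / norm

def accelerated_steps(lam, alpha, L2=8.0):
    # gamma = lam, delta = alpha; mu = 2*sqrt(gamma*delta)/L, theta = 1/(1 + mu).
    L = np.sqrt(L2)
    mu = 2.0 * np.sqrt(lam * alpha) / L
    tau = mu / (2.0 * lam)
    sigma = mu / (2.0 * alpha)      # then tau * sigma * L^2 = 1
    theta = 1.0 / (1.0 + mu)
    return tau, sigma, theta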

Figure 6 shows a comparison between the ROF model and the Huber-ROF model. While the ROF model leads to the development of artificial discontinuities (staircasing effect), the Huber-ROF model yields a piecewise smooth, and hence more natural, result.

For the performance evaluation, we use the following algorithms and parameter settings:

• ALG3: Linearly convergent primal-dual algorithm as described in Algorithm 3, using the convexity parameters γ = λ, δ = α, μ = 2√(γδ)/L, θ = 1/(1 + μ).

• NEST: Restarted version of Nesterov's algorithm [2, 25, 28], on the dual Huber-ROF problem. The algorithm is restarted every k = ⌈√(8L_H/α)⌉ iterations, where L_H = L²/λ + α is the Lipschitz constant of the dual Huber-ROF model (with our choices, k = 17).

Table 3 shows the result of the performance evaluation. Both ALG3 and NEST show linear convergence, whereas ALG3 has a slightly better performance. Figure 7 plots the convergence of ALG3 and NEST together with the theoretical O(ω^{N/2}) bound. Note that ALG3 reaches machine precision after approximately 200 iterations.

Fig. 6 Comparison between the ROF model and the Huber-ROF model for the noisy image shown in Fig. 2 (b). While the ROF model exhibits strong staircasing, the Huber-ROF model leads to piecewise smooth, and hence more natural, images

Fig. 7 Linear convergence of ALG3 and NEST for the Huber-ROF model. Note that after approximately 200 iterations, ALG3 reaches machine precision

6.3 Advanced Imaging Problems

In this section, we illustrate the wide applicability of the proposed primal-dual algorithms to advanced imaging problems such as image deconvolution, image inpainting, motion estimation, and image segmentation. We show that the proposed algorithms can be easily adapted to all these applications and yield state-of-the-art results.

6.3.1 Image Deconvolution and Zooming

The standard ROF model (61) can be easily extended for image deconvolution and digital zooming:

min_u ∫_Ω |Du| + (λ/2)‖Au − g‖₂²,   (72)

where A is a linear operator. In the case of image deconvolution, A is the convolution with the point spread function (PSF). In the case of image zooming, A describes the downsampling procedure, which is often assumed to be a blurring kernel followed by a subsampling operator. In the discrete setting, this problem can be easily rewritten in terms of a saddle-point problem

min_{u∈X} max_{p∈Y} −⟨u, div p⟩_X + (λ/2)‖Au − g‖₂² − δ_P(p).   (73)

Now, the question is how to implement the resolvent operator with respect to G(u) = (λ/2)‖Au − g‖₂². In case Au can be written as a convolution, i.e. Au = k_A ∗ u, where k_A is the convolution kernel, an FFT-based method can be used to compute the resolvent operator:

u = (I + τ∂G)^{−1}(ũ)
  ⇐⇒ u = arg min_u ‖u − ũ‖²/(2τ) + (λ/2)‖k_A ∗ u − g‖₂²
  ⇐⇒ u = F^{−1}( (τλ F(g) F(k_A)* + F(ũ)) / (τλ |F(k_A)|² + 1) ),

where F(·) and F^{−1}(·) denote the FFT and inverse FFT, respectively. According to the well-known convolution theorem, the multiplication and division operations are understood pointwise in the above formula. Note that only one FFT and one inverse FFT are required to evaluate the resolvent operator (all other quantities can be pre-computed).
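A minimal numpy sketch of this FFT-based resolvent is given below. It assumes the blur kernel has been zero-padded and shifted to the image size so that the convolution is periodic, and all names are our own; the transforms of g and k_A could of course be precomputed outside an iteration loop.

import numpy as np

def resolvent_G_deconv(u_tilde, g, k_pad, tau, lam):
    # FFT-based resolvent of G(u) = (lam/2) * ||k_A * u - g||^2 (periodic convolution assumed).
    K = np.fft.fft2(k_pad)
    numerator = tau * lam * np.fft.fft2(g) * np.conj(K) + np.fft.fft2(u_tilde)
    denominator = tau * lam * np.abs(K) ** 2 + 1.0
    return np.real(np.fft.ifft2(numerator / denominator))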

If the linear operator cannot be implemented efficiently in this way, an alternative approach consists of additionally dualizing the functional with respect to G(u), yielding

min_{u∈X} max_{p∈Y, q∈X} −⟨u, div p⟩_X + ⟨Au − g, q⟩_X − δ_P(p) − (1/(2λ))‖q‖²,   (74)

where q ∈ X is the additional dual variable. In this case, we now have F*(p, q) = δ_P(p) + (1/(2λ))‖q‖². Accordingly, the resolvent operator is given by

(p, q) = (I + σ∂F*)^{−1}(p̃, q̃) ⇐⇒ p_{i,j} = p̃_{i,j}/max(1, |p̃_{i,j}|),   q_{i,j} = q̃_{i,j}/(1 + σ/λ).
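This resolvent can be sketched as below (names are ours). In Algorithm 1 the linear operator then becomes Ku = (∇u, Au), so a usable bound on its norm is L² ≤ ‖∇‖² + ‖A‖².

import numpy as np

def resolvent_Fstar_dualized(px, py, q, sigma, lam):
    # (I + sigma*dF*)^{-1} for F*(p, q) = delta_P(p) + ||q||^2/(2*lam):
    # project p pointwise onto unit balls and rescale q.
    norm = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))
    return px / norm, py / norm, q / (1.0 + sigma / lam)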

Figure 8 shows the application of the energy (72) to motion deblurring. While the classical Wiener filter is not able to restore the image, the total variation based approach yields a far better result. Figure 9 shows the application of (72) to zooming. One can observe that total variation based zooming leads to a super-resolved image with sharp boundaries, whereas standard bicubic interpolation leads to a much blurrier result.

Fig. 8 Motion deblurring using total variation regularization. (a) and (b) show the 400 × 470 clean image and a degraded version containing motion blur of approximately 30 pixels and Gaussian noise of standard deviation σ = 0.01. (c) is the result of standard Wiener filtering. (d) is the result of the total variation based deconvolution method. Note that the TV-based method yields visually much more appealing results

Fig. 9 Image zooming using total variation regularization. (a) shows the 384 × 384 original image and a version downsampled by a factor of 4. (b) is the result of zooming by a factor of 4 using bicubic interpolation. (c) is the result of the total variation based zooming model. One can see that total variation based zooming yields much sharper image edges

6.3.2 Image Inpainting

Image inpainting is the process of filling in lost image data. Although the total variation is useful for a number of applications, it is a rather weak prior for image inpainting. In recent years a lot of effort has been put into the development of more powerful image priors. An interesting class of priors is given by linear multi-level transformations such as wavelets, curvelets, etc. (see for example [14]). These transformations come along with the advantage of providing a compact representation of images while being computationally efficient.

A generic model for image restoration can be derived from the classical ROF model (in the discrete setting), by simply replacing the gradient operator by a more general linear transform:

min_{u∈X} ‖Φu‖₁ + (λ/2)‖u − g‖₂²,   (75)

where Φ : X → W denotes the linear transform and W = C^K denotes the space of coefficients, usually some complex- or real-valued finite dimensional vector space. Here K ∈ N, the dimension of W, may depend on different parameters such as the image size, the number of levels, orientations, etc.

Note that for unitary Φ (i.e. Φ^{−1} = Φ^T), the above model is equivalent to the following standard sparsity model:

min_{x∈X} ‖x‖₁ + (λ/2)‖Φ^T x − g‖₂².

The L1 norm ‖x‖₁ is used to explicitly enforce sparsity in the coefficients x. We found, however, that model (75) leads to fewer artifacts in the solution and hence we propose to use this model.

The aim of model (75) is to find a sparse, and hence compact, representation of the image u in the domain of Φ which has a small squared distance to the noisy data g. In particular, sparsity in the coefficients is attained by minimizing their L1 norm. Clearly, minimizing the L0 norm (i.e., the number of non-zero coefficients) would be better, but this problem is known to be NP-hard.

For the task of image inpainting, we consider a simple modification of (75). Let D = {(i, j), 1 ≤ i ≤ M, 1 ≤ j ≤ N} denote the set of indices of the image domain and let I ⊂ D denote the set of indices of the inpainting domain. The inpainting model can then be defined as

min_{u∈X} ‖Φu‖₁ + (λ/2) Σ_{(i,j)∈D\I} (u_{i,j} − g_{i,j})².   (76)

Note that the choice λ ∈ (0, ∞) corresponds to joint inpainting and denoising, and the choice λ = +∞ corresponds to pure inpainting.

The saddle-point formulation of (76) that fits into the general class of problems we are considering in this paper can be derived as

min_{u∈X} max_{c∈W} ⟨Φu, c⟩ + (λ/2) Σ_{(i,j)∈D\I} (u_{i,j} − g_{i,j})² − δ_C(c),   (77)

where C is the convex set defined as

C = {c ∈ W : ‖c‖_∞ ≤ 1},   ‖c‖_∞ = max_k |c_k|.

Let us identify G(u) = (λ/2) Σ_{(i,j)∈D\I} (u_{i,j} − g_{i,j})² and F*(c) = δ_C(c) in (77). Hence, the resolvent operators with respect to these functions can be easily evaluated:

c = (I + σ∂F*)^{−1}(c̃) ⇐⇒ c_k = c̃_k / max(1, |c̃_k|),

and

u = (I + τ∂G)^{−1}(ũ) ⇐⇒ u_{i,j} =
  ũ_{i,j}                              if (i, j) ∈ I,
  (ũ_{i,j} + τλ g_{i,j})/(1 + τλ)      else.
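Both resolvents are a few lines of numpy. The sketch below uses a boolean mask for the known pixels D \ I and our own names, and works for real or complex coefficient vectors (e.g. curvelet coefficients).

import numpy as np

def resolvent_Fstar_inpaint(c_tilde, sigma):
    # Pointwise projection of the coefficients onto C = {|c_k| <= 1}.
    # (sigma does not enter, since F* is an indicator function.)
    return c_tilde / np.maximum(1.0, np.abs(c_tilde))

def resolvent_G_inpaint(u_tilde, g, known, tau, lam):
    # known: boolean array, True on D \ I (observed pixels), False on the inpainting domain I.
    u = u_tilde.copy()
    u[known] = (u_tilde[known] + tau * lam * g[known]) / (1.0 + tau * lam)
    return u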

Since (77) is non-smooth, we have to choose Algorithm 1 for optimization. Figure 10 shows the application of the inpainting model (76) to the recovery of lost lines (80% randomly chosen) of a color image. Figure 10 (c) shows the result when using Φ = ∇, i.e. the usual gradient operator. Figure 10 (d) shows the result when using Φ as the fast discrete curvelet transform [5]. One can see that the curvelet prior is much more successful in recovering long elongated structures and the smooth background structures. This example shows that different linear operators can be easily integrated in the proposed primal-dual algorithm.

6.3.3 Motion Estimation

Motion estimation (optical flow) is one of the central topics in imaging. The goal is to compute the apparent motion in image sequences. A typical variational formulation of total variation based motion estimation is given by (see e.g. [36])⁵

min_v ∫_Ω |Dv| + λ‖ρ(v)‖₁,   (78)

where v = (v₁, v₂)^T : Ω → R² is the motion field, and ρ(v) = I_t + (∇I)^T(v − v₀) is the traditional optical flow constraint (OFC). It is obtained from a linearization of the assumption that the intensities of the pixels stay constant over time. I_t is the time derivative of the image sequence, ∇I is the spatial image gradient, and v₀ is some given motion field. The parameter λ is again used to define the tradeoff between data fitting and regularization.

In any practical situation, however, it is very unlikely (due to illumination changes and shadows) that the image intensities stay constant over time. This motivates the following slightly improved OFC, which explicitly models the varying illumination by means of an additive function u (see e.g. [34]):

ρ(u, v) = I_t + (∇I)^T(v − v₀) + βu.

⁵ Interestingly, total variation regularization appears in [34] in the context of motion estimation several years before it was popularized by Rudin, Osher and Fatemi in [33] for image denoising.


Fig. 10 Recovery of lost image information. (a) shows the 384 × 384 clean image, (b) shows the destroyed image, where 80% of the lines are lost, (c) shows the result of the inpainting model (76) using a total variation prior and (d) shows the result when using a curvelet prior

The function u : Ω → R is expected to be smooth and hence we also regularize u by means of the total variation. The parameter β controls the influence of the illumination term. The improved motion estimation model is then given by

min_{u,v} ∫_Ω |Du| + ∫_Ω |Dv| + λ‖ρ(u, v)‖₁.

Note that the OFC is valid only for small motion (v − v₀). In order to account for large motion, the entire approach has to be integrated into a coarse-to-fine framework in order to re-estimate v₀. See again [36] for more details.

After discretization, we obtain the following primal motion model formulation

min_{u∈X, v∈Y} ‖∇u‖₁ + ‖∇v‖₁ + λ‖ρ(u, v)‖₁,   (79)

where the discrete version of the OFC is now given by

ρ(u_{i,j}, v_{i,j}) = (I_t)_{i,j} + (∇I)^T_{i,j}(v_{i,j} − (v₀)_{i,j}) + βu_{i,j}.

The vectorial gradient ∇v = (∇v₁, ∇v₂) is in the space Z = Y × Y, equipped with a scalar product

⟨q, r⟩_Z = Σ_{i,j} q^1_{i,j} r^1_{i,j} + q^2_{i,j} r^2_{i,j} + q^3_{i,j} r^3_{i,j} + q^4_{i,j} r^4_{i,j},   q = (q^1, q^2, q^3, q^4), r = (r^1, r^2, r^3, r^4) ∈ Z,

and a norm

‖∇v‖₁ = Σ_{i,j} |∇v_{i,j}|,   |∇v_{i,j}| = √( ((∇v₁)^1_{i,j})² + ((∇v₁)^2_{i,j})² + ((∇v₂)^1_{i,j})² + ((∇v₂)^2_{i,j})² ).

The saddle-point formulation of the primal motion estimation model (79) is obtained as

min_{u∈X, v∈Y} max_{p∈Y, q∈Z} ⟨∇u, p⟩_Y + ⟨∇v, q⟩_Z + λ‖ρ(u, v)‖₁ − δ_P(p) − δ_Q(q),   (80)

where the convex set P is defined as in (64) and the convex set Q is defined as

Q = {q ∈ Z : ‖q‖∞ ≤ 1} ,

and ‖q‖∞ is the discrete maximum norm defined in Z as

‖q‖_∞ = max_{i,j} |q_{i,j}|,   |q_{i,j}| = √( (q^1_{i,j})² + (q^2_{i,j})² + (q^3_{i,j})² + (q^4_{i,j})² ).

Let us observe that (80) is a non-smooth convex problem with G(u, v) = λ‖ρ(u, v)‖₁ and F*(p, q) = δ_P(p) + δ_Q(q). Hence we will have to rely on the basic Algorithm 1 to minimize (79).

Fig. 11 Motion estimation using total variation regularization and explicit illumination estimation. (a) shows one of two 584 × 388 input images and (b) shows the color coded ground truth motion field (black pixels indicate unknown motion vectors). (c) and (d) show the estimated illumination and the color coded motion field

Let us now describe the resolvent operators needed for the implementation of the algorithm. The resolvent operator with respect to F*(p, q) is again given by simple pointwise projections onto L2 balls:

(p, q) = (I + σ∂F*)^{−1}(p̃, q̃) ⇐⇒ p_{i,j} = p̃_{i,j}/max(1, |p̃_{i,j}|),   q_{i,j} = q̃_{i,j}/max(1, |q̃_{i,j}|).

Next, we give the resolvent operator with respect to G(u, v). First, it is convenient to define a_{i,j} = (β, (∇I)_{i,j}) and |a|²_{i,j} = β² + |∇I|²_{i,j}. The solution of the resolvent operator is then given by

(u, v) = (I + τ∂G)^{−1}(ũ, ṽ) ⇐⇒ (u_{i,j}, v_{i,j}) = (ũ_{i,j}, ṽ_{i,j}) +
  τλ a_{i,j}                                    if ρ(ũ_{i,j}, ṽ_{i,j}) < −τλ|a|²_{i,j},
  −τλ a_{i,j}                                   if ρ(ũ_{i,j}, ṽ_{i,j}) > τλ|a|²_{i,j},
  −ρ(ũ_{i,j}, ṽ_{i,j}) a_{i,j}/|a|²_{i,j}       if |ρ(ũ_{i,j}, ṽ_{i,j})| ≤ τλ|a|²_{i,j}.
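The case distinction above is a pointwise thresholding along the direction a_{i,j}; a numpy sketch (with our own variable names, where Ix, Iy, It denote the discretized image derivatives) reads:

import numpy as np

def resolvent_G_flow(u_t, vx_t, vy_t, It, Ix, Iy, v0x, v0y, tau, lam, beta):
    # (I + tau*dG)^{-1} for G(u, v) = lam * ||rho(u, v)||_1 with a = (beta, Ix, Iy).
    rho = It + Ix * (vx_t - v0x) + Iy * (vy_t - v0y) + beta * u_t
    a2 = beta ** 2 + Ix ** 2 + Iy ** 2
    # Scalar multiplier of a in the three cases of the resolvent formula.
    step = np.where(rho < -tau * lam * a2, tau * lam,
                    np.where(rho > tau * lam * a2, -tau * lam,
                             -rho / np.maximum(a2, 1e-12)))   # guard against a2 = 0
    return u_t + step * beta, vx_t + step * Ix, vy_t + step * Iy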

Figure 11 shows the results of applying the motion estimation model with explicit illumination estimation to the Army sequence from the Middlebury optical flow benchmark data set (http://vision.middlebury.edu/flow/). We integrated the algorithm into a standard coarse-to-fine framework in order to re-estimate v₀. The parameters of the model were set to λ = 40 and β = 0.01. One can see that the estimated motion field is very close to the ground truth motion field. Furthermore, one can see that illumination changes and shadows are well captured by the model (see for example the shadow on the left side of the shell). We have additionally implemented the algorithm on dedicated graphics hardware. This leads to a real-time performance of 30 frames per second for 640 × 480 images.

6.3.4 Image Segmentation

Finally, we consider the problem of finding a partition of an image into k pairwise disjoint regions, which minimizes the total interface between the sets, as for example in the piecewise constant Mumford-Shah problem [21]

min_{(R_l)_{l=1}^k, (c_l)_{l=1}^k}  (1/2) Σ_{l=1}^k Per(R_l; Ω) + (λ/2) Σ_{l=1}^k ∫_{R_l} |g(x) − c_l|² dx,   (81)

where g : Ω → R is the input image, c_l ∈ R are the optimal mean values and the regions (R_l)_{l=1}^k form a partition of Ω, that is, R_l ∩ R_m = ∅ if l ≠ m and ∪_{l=1}^k R_l = Ω. The parameter λ is again used to balance the data fitting term and the length term. Of course, given the partition (R_l)_{l=1}^k, the optimal constant c_l = ∫_{R_l} g dx / |R_l| is the average value of g on R_l for each l = 1, . . . , k. On the other hand, finding the minimum of (81) with respect to the partition (R_l)_{l=1}^k is a hard task, even for fixed values (c_l)_{l=1}^k. It is known that its discrete counterpart (the Potts model) is NP-hard, so that it is unlikely that (81) has a simple convex representation, at least without drastically increasing the number of variables. In the following, we will assume that the optimal mean values c_l are known and fixed, which leads us to consider the following generic representation of (81).

min_{u=(u_l)_{l=1}^k}  J(u) + Σ_{l=1}^k ∫_Ω u_l f_l dx,   (82)

u_l(x) ≥ 0,   Σ_{l=1}^k u_l(x) = 1,   ∀x ∈ Ω,

where u = (u_l)_{l=1}^k : Ω → R^k is the labeling function and f_l = λ|g(x) − c_l|²/2 as in (81), or any other weighting function obtained from more sophisticated data terms (e.g., based on histograms, or a similarity measure in case of stereo). Different choices have been proposed for relaxations of the length term J(u). The most straightforward relaxation, as used in [35], is

J₁(u) = (1/2) Σ_{l=1}^k ∫_Ω |Du_l|,   (83)

which is simply the sum of the total variation of each labeling function u_l. However, it can be shown that this relaxation is too small [8]. A better choice is (see Fig. 12)

J₂(u) = ∫_Ω Ψ(Du),   (84)
Ψ(p) = sup_q { ⟨p, q⟩ : |q^l − q^m| ≤ 1, 1 ≤ l < m ≤ k },

where p = (p^1, . . . , p^k) and q = (q^1, . . . , q^k). This energy is also a sort of total variation, but now defined on the complete vector-valued function u. This construction is related to the theory of paired calibrations [4, 18].

Let us now turn to the discrete setting. The primal-dual formulation of the partitioning problem (82) is obtained as

min_{u=(u_l)_{l=1}^k} max_{p=(p_l)_{l=1}^k}  ( Σ_{l=1}^k ⟨∇u_l, p_l⟩ + ⟨u_l, f_l⟩ ) + δ_U(u) − δ_P(p),   (85)

where f = (f_l)_{l=1}^k ∈ X^k is the discretized weighting function, u = (u_l)_{l=1}^k ∈ X^k is the primal variable, representing the assignment of each pixel to the labels, and p = (p_l)_{l=1}^k ∈ Y^k is the dual variable, which will be constrained to stay in a set P which we will soon make precise.

In the above formula, we can identify G(u) = δ_U(u), which forces the solution to stay in the unit simplex U, defined as

U = { u ∈ X^k : (u_l)_{i,j} ≥ 0,  Σ_{l=1}^k (u_l)_{i,j} = 1 }.

Furthermore, we can identify F*(p) = δ_P(p), where the convex set P = P₁ or P₂ realizes either the standard relaxation J₁(u) or the stronger relaxation J₂(u) of the total interface surface. In particular, the set P₁ arises from an application of the dual formulation of the total variation to each vector u_l,

P₁ = { p ∈ Y^k : ‖p_l‖_∞ ≤ 1/2 }.

On the other hand, the set P₂ is directly obtained from the definition of relaxation (84),

P₂ = { p ∈ Y^k : ‖p_l − p_m‖_∞ ≤ 1, 1 ≤ l < m ≤ k },

which is essentially an intersection of unit balls.

Next, we detail the resolvent operators. The resolvent operator with respect to G is an orthogonal projector onto the unit simplex defined by the convex set U. It is known that this projection can be performed in a finite number of steps. See for example [20] for an algorithm based on successive projections and corrections.
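For illustration, this projection can also be computed with the standard sorting-based formula; the sketch below (our names, label axis first) returns the same orthogonal projection onto the simplex, although it is not the successive-projection algorithm of [20].

import numpy as np

def project_simplex(u):
    # Project each pixel's label vector onto {u_l >= 0, sum_l u_l = 1}; u has shape (k, M, N).
    k = u.shape[0]
    s = -np.sort(-u, axis=0)                               # sort labels in descending order
    css = np.cumsum(s, axis=0) - 1.0
    idx = np.arange(1, k + 1).reshape(k, 1, 1)
    rho = np.count_nonzero(s - css / idx > 0, axis=0)      # size of the active set per pixel
    theta = np.take_along_axis(css, (rho - 1)[None], axis=0) / rho
    return np.maximum(u - theta, 0.0)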


Fig. 13 Piecewise constant Mumford-Shah segmentation of a natural image with k = 16. (a) shows the 580 × 435 input image and (b) is the minimizer of energy (82)

Fig. 14 Stereo estimation using the Potts model. The first row shows the left and right input images of size 427 × 370, downloaded from (http://vision.middlebury.edu/stereo/). The second row shows the ground truth disparity image (black pixels indicate unknown disparity values) and the estimated disparity values of the segmentation model (red pixels indicate occluded pixels)

The resolvent operator with respect to F* is also an orthogonal projector. In case of P₁ the projection is very easy, since it reduces to pointwise projections onto unit balls. In case of P₂ the projection is more complicated, since the complete vector p_{i,j} is projected onto an intersection of convex sets. This can for example be performed by Dykstra's algorithm [3]. Finally, we note that since (85) is non-smooth, we have to use Algorithm 1 to minimize the segmentation model.
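A Dykstra-type sweep over the pairwise constraints of P₂ can be sketched as follows. The layout (p of shape (k, 2, M, N)), the fixed number of sweeps and all names are our own choices; this is only an illustration of the idea, not the implementation used for the experiments.

import numpy as np

def project_P2(p, n_sweeps=20):
    # Approximate projection onto P2 = {||p_l - p_m|| <= 1 pointwise, l < m} by
    # Dykstra's algorithm [3], cycling over the pairwise constraints.
    k = p.shape[0]
    p = p.copy()
    pairs = [(l, m) for l in range(k) for m in range(l + 1, k)]
    corr = {pair: np.zeros_like(p[0]) for pair in pairs}    # Dykstra correction per constraint
    for _ in range(n_sweeps):
        for (l, m) in pairs:
            c = corr[(l, m)]
            d = (p[l] + c) - (p[m] - c)                     # re-add the previous correction
            norm = np.sqrt(np.sum(d * d, axis=0, keepdims=True))
            excess = np.maximum(norm - 1.0, 0.0)
            shift = 0.5 * excess * d / np.maximum(norm, 1e-12)
            p[l] = p[l] + c - shift                         # move the pair symmetrically
            p[m] = p[m] - c + shift
            corr[(l, m)] = shift                            # store the new correction
    return p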

Fig. 12 Triple junction experiment with k = 3. (a) shows the 200 × 200 input image with given boundary datum. (b) shows the result using the relaxation (83), and (c) shows the result using the relaxation (84)

Figure 12 shows the result of different relaxations for the triple-junction experiment. Here, the task is to complete the segmentation boundary in the gray area, where the weighting function f_l is set to zero. One can see that the simple relaxation (83) leads to a non-integer solution, i.e. one that is not binary (u ∈ {0,1}^k a.e.). On the other hand, the stronger relaxation (84) yields an almost binary solution and hence a globally optimal solution of the segmentation model.

Figure 13 shows the result of the piecewise constant Mumford-Shah segmentation (81) using the relaxation J₂(u). We used k = 16 labels and the mean color values c_l were initialized using k-means clustering. The regularization parameter was set to λ = 5. Note that again, the result is almost binary.

Finally, Fig. 14 shows the application of the Potts model to disparity estimation. Here, the labels correspond to distinct disparity (and hence depth) values. The weighting functions f_l are computed based on a pixel-wise similarity measure between the stereo images for a given disparity value associated with the label l. We use the sum of the absolute differences between the image gradients of the left and the right images to achieve some invariance to intensity changes. Furthermore, we introduce an additional label endowed with a small constant weighting function to account for occluded pixels.


In all, we obtain a Potts model with approximately 200 labels, which in our example amounts to solving an optimization problem with more than 30 million unknowns. Due to memory restrictions, we apply the weaker relaxation (83), which however gives satisfactory results for this example. Our GPU-based implementation of the primal-dual algorithm finds an approximate solution (with reasonable accuracy) in less than 30 seconds. This shows that the algorithm can be applied to very large convex optimization problems.

7 Discussion

In this paper we have proposed a first-order primal-dual algorithm and showed how it can be used to efficiently solve a large family of convex problems arising in imaging. We have shown that it can be adapted to yield an accelerated rate of convergence depending on the regularity of the problem, optimal with respect to what is known in the scientific literature. In particular, the algorithm converges with rate O(1/N) for non-smooth problems, with O(1/N²) for problems where either the primal or the dual objective is uniformly convex, and it converges linearly, i.e. like O(ω^N) for ω < 1, for "smooth" problems (where both the primal and the dual objectives are uniformly convex). Our theoretical results are supported by the numerical experiments.

There are still several interesting questions which need to be addressed in the future: (a) the case where the linear operator K is unbounded or has a large (in general unknown) norm, as is usually the case in infinite dimensions or in finite element discretizations of continuous problems; (b) how to automatically determine the smoothness parameters or to locally adapt to the regularity of the objective; (c) understanding why the Arrow-Hurwicz method performs so well in some situations.

Acknowledgements Antonin Chambolle is partially supported by the Agence Nationale de la Recherche (grant "MICA" ANR-08-BLAN-0082). Thomas Pock is partially supported by the Austrian Science Fund (FWF) under the doctoral program "Confluence of Vision and Graphics" W1209. The authors wish to thank Daniel Cremers and Marc Teboulle for helpful discussions. Also, we are very grateful to Marc Teboulle and the anonymous referees who carefully read a first version of this work and provided us with very helpful comments.

References

1. Arrow, K.J., Hurwicz, L., Uzawa, H.: Studies in linear and non-linear programming. In: Chenery, H.B., Johnson, S.M., Karlin, S., Marschak, T., Solow, R.M. (eds.) Stanford Mathematical Studies in the Social Sciences, vol. II. Stanford University Press, Stanford (1958)

2. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

3. Boyle, J.P., Dykstra, R.L.: A method for finding projections onto the intersection of convex sets in Hilbert spaces. In: Advances in Order Restricted Statistical Inference, Iowa City, Iowa, 1985. Lecture Notes in Statist., vol. 37, pp. 28–47. Springer, Berlin (1986)

4. Brakke, K.A.: Soap films and covering spaces. J. Geom. Anal. 5(4), 445–514 (1995)

5. Candès, E., Demanet, L., Donoho, D., Ying, L.: Fast discrete curvelet transforms. Multiscale Model. Simul. 5(3), 861–899 (2006) (electronic)

6. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20(1–2), 89–97 (2004). Special issue on mathematics and image analysis

7. Chambolle, A.: Total variation minimization and a class of binary MRF models. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 136–152 (2005)

8. Chambolle, A., Cremers, D., Pock, T.: A convex approach for computing minimal partitions. Technical Report 649, CMAP, Ecole Polytechnique, France (2008)

9. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. http://hal.archives-ouvertes.fr/hal-00490826 (2010)

10. Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., Ser. A 55(3), 293–318 (1992)

11. Esser, E.: Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Reports 09-31, UCLA, Center for Applied Math. (2009)

12. Esser, E., Zhang, X., Chan, T.: A general framework for a class of first order primal-dual algorithms for TV minimization. CAM Reports 09-67, UCLA, Center for Applied Math. (2009)

13. Fadili, J., Peyré, G.: Total variation projection with first order schemes. http://hal.archives-ouvertes.fr/hal-00380491/ (2009)

14. Fadili, J., Starck, J.-L., Elad, M., Donoho, D.: MCALab: Reproducible research in signal and image decomposition and inpainting. Comput. Sci. Eng. 12(1), 44–63 (2010)

15. Goldstein, T., Osher, S.: The split Bregman algorithm for L1 regularized problems. CAM Reports 08-29, UCLA, Center for Applied Math. (2008)

16. He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for total variation image restoration. Technical Report 2790, Optimization Online, November 2010 (available at www.optimization-online.org)

17. Korpelevic, G.M.: An extragradient method for finding saddle points and for other problems. Èkon. Mat. Metody 12(4), 747–756 (1976)

18. Lawlor, G., Morgan, F.: Paired calibrations applied to soap films, immiscible fluids, and surfaces or networks minimizing other norms. Pac. J. Math. 166(1), 55–83 (1994)

19. Lions, P.-L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

20. Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Optim. Theory Appl. 50(1), 195–200 (1986)

21. Mumford, D., Shah, J.: Optimal approximation by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 42, 577–685 (1989)

22. Nedic, A., Ozdaglar, A.: Subgradient methods for saddle-point problems. J. Optim. Theory Appl. 142(1), 1 (2009)

23. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004) (electronic)

24. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983). Translated from the Russian and with a preface by E.R. Dawson, Wiley-Interscience Series in Discrete Mathematics

25. Nesterov, Yu.: A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)

26. Nesterov, Yu.: Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Kluwer Academic, Boston (2004). A basic course

27. Nesterov, Yu.: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103(1), 127–152 (2005)

28. Nesterov, Yu.: Gradient methods for minimizing composite objective function. Technical report, CORE Discussion Paper (2007)

29. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: An algorithm for minimizing the Mumford-Shah functional. In: ICCV Proceedings, LNCS. Springer, Berlin (2009)

30. Popov, L.D.: A modification of the Arrow-Hurwicz method of search for saddle points. Mat. Zametki 28(5), 777–784, 803 (1980)

31. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)

32. Rockafellar, R.T.: Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton (1997). Reprint of the 1970 original, Princeton Paperbacks

33. Rudin, L., Osher, S.J., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992) [also in Experimental Mathematics: Computational Issues in Nonlinear Science (Proc. Los Alamos Conf. 1991)]

34. Shulman, D., Hervé, J.-Y.: Regularization of discontinuous flow fields. In: Proceedings Workshop on Visual Motion, pp. 81–86 (1989)

35. Zach, C., Gallup, D., Frahm, J.M., Niethammer, M.: Fast global labeling for real-time stereo using multiple plane sweeps. In: Vision, Modeling, and Visualization 2008, pp. 243–252. IOS Press, Amsterdam (2008)

36. Zach, C., Pock, T., Bischof, H.: A duality based approach for real-time TV-L1 optical flow. In: 29th DAGM Symposium on Pattern Recognition, pp. 214–223. Heidelberg, Germany (2007)

37. Zhu, M., Chan, T.: An efficient primal-dual hybrid gradient algorithm for total variation image restoration. CAM Reports 08-34, UCLA, Center for Applied Math. (2008)