Sparse Optimization
Lecture 1: Review of Convex Optimization
Instructor: Wotao Yin
July 2013
online discussions on piazza.com
Those who complete this lecture will know
• convex optimization background
• various standard concepts and terminology
• reformulating ℓ1 optimization and deriving its optimality conditions
1 / 30
Resources for convex optimization
• Book: Convex Analysis by R. T. Rockafellar
• Book: Convex Optimization by S. Boyd and L. Vandenberghe, along with
online videos and slides
• Book: Introductory Lectures on Convex Optimization: A Basic Course by
Y. Nesterov
• a large number of lecture slides, notes, and videos available online
2 / 30
Review: mathematical optimization
Formulation
minimize_x   f0(x)
subject to   fi(x) ≤ 0, i = 1, . . . ,m,
             hj(x) = 0, j = 1, . . . , p.
• decision variables: x = (x1, . . . , xn)
• objective function: f0 : Rn → R
• functions defining inequality constraints: fi : Rn → R, i = 1, . . . ,m
• functions defining equality constraints: hj : Rn → R, j = 1, . . . , p
3 / 30
Terminology
• feasible solutions: all points x satisfying the constraints
fi(x) ≤ 0 (i = 1, . . . ,m) and hj(x) = 0 (j = 1, . . . , p).
• feasible set: the set of all feasible solutions, often denoted by X .
• (global) (optimal) solution: feasible solution x∗ that achieves the
minimum objective value among all feasible solutions.
• local (optimal) solution: feasible solution x∗ that achieves the minimal
objective value within a neighborhood of x∗, say, the set
{x : ‖x − x∗‖ ≤ δ} ∩ X for some δ > 0
4 / 30
Some examples
• Find two nonnegative numbers whose sum is at most 9 and whose product is
a maximum.
• Find the largest area of a rectangular region whose perimeter is no
greater than 100.
• Given a sequence of nonnegative numbers, find a start point and an end
point so that the partial sum of the sequence between the two points is a
maximum.
5 / 30
Solving optimization problems
In general, everything is optimization, but optimization problems are generally
not solvable, even by the most powerful computers.
Some classes of problems can be solved efficiently and reliably, for example:
• least-squares problems
• linear programming problems
• quadratic programming problems
• convex optimization problems
• a subclass of network-flow problems
• submodular function minimization
(.... more, but not much more...)
• some sparse optimization problems
6 / 30
Least squares
minimize_x   ‖Ax − b‖₂²
• analytic solution x∗ = (ATA)−1ATb if A has independent columns
• reliable and efficient algorithms and software packages
• computation time proportional to n2k (A ∈ Rk×n), less if structured
• a mature technology (unless A is huge and/or distributed)
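The analytic formula above can be checked against a library solver. A minimal sketch, assuming NumPy is available; the matrix A and vector b are made-up illustrative data:

```python
import numpy as np

# Illustrative data: a tall matrix A with independent columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Analytic solution via the normal equations: x* = (A^T A)^{-1} A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library solver (numerically preferable: it avoids forming A^T A).
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For well-conditioned A the two solutions agree to machine precision; lstsq is the safer choice when A is ill-conditioned.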
7 / 30
Linear programming (LP)
minimize_x   cTx
subject to   aiTx ≤ bi, i = 1, . . . ,m
• no analytic formula for solutions
• reliable and efficient algorithms and software packages
• computation time proportional to n2m if m ≥ n, less with structured data
• a mature technology
• a few standard tricks are used to convert problems (with ℓ1 or ℓ∞ norms,
or piecewise-linear functions) into linear programs
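One such trick, sketched here for illustration (assuming SciPy's linprog; the data are made up): minimizing ‖x‖1 subject to Ax = b becomes an LP after splitting x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0, which makes the objective linear.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up underdetermined system with a sparse feasible point.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))
x_true = np.zeros(8)
x_true[2] = 1.5
b = A @ x_true

n = A.shape[1]
# Split x = xp - xm with xp, xm >= 0; then ||x||_1 = 1^T (xp + xm).
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x = res.x[:n] - res.x[n:]
```

At an LP optimum, xp and xm have disjoint supports, so 1ᵀ(xp + xm) equals ‖x‖1 and the reformulation is exact.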
8 / 30
Convex optimization
minimize_x   f0(x)
subject to   fi(x) ≤ 0, i = 1, . . . ,m,
             Ax = b,
where the objective and constraint functions are convex, i.e.,
fi(θx1 + (1 − θ)x2) ≤ θfi(x1) + (1 − θ)fi(x2)
for all i = 0, 1, . . . ,m, θ ∈ (0, 1), and x1, x2 ∈ dom fi.
• no analytic solution
• relatively reliable and efficient algorithms and software packages
• computation time (roughly) proportional to max{n3, n2m,F}, where F is
cost of evaluating fi’s and their first and second derivatives.
• almost a technology
Least-squares problems and linear programs are special cases of convex programs.
9 / 30
Non-convex optimization problems
General optimization problems are non-convex
minimize_x   f0(x)
subject to   fi(x) ≤ 0, i = 1, . . . ,m
Local optimization methods
• find a solution which minimizes f0 among feasible solutions near it
• fast and handle large problems
• require initial guess
• provide no information about the distance to global optima
Global optimization methods
• find the global solution
• worst-case complexity grows exponentially with problem size.
These methods are often based on solving convex subproblems.
10 / 30
Brief history of convex optimization
theory (convex analysis): 1900–1970s
algorithms
• 1947: simplex algorithm for linear programming (Dantzig)
• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )
• 1970s: ellipsoid method and other subgradient methods
• 1980s: polynomial-time interior-point methods for linear programming
(Karmarkar 1984)
• late 1980s-2000s: polynomial-time interior-point methods for nonlinear
convex optimization (Nesterov & Nemirovski 1994)
• recently: revived interests in first-order (gradient-based) algorithms,
solving big-data problems
applications
• before 1990: mostly in operations research; few in engineering
• since 1990: many new applications in engineering (control, signal
processing, communications, circuit design, . . . ); new problem classes
(semidefinite and second-order cone programming, robust optimization,
sparse optimization)
11 / 30
Convex set
A set C is called convex if the segment between any two points in C lies
entirely in C.
Formally, C is convex if for any x1,x2 ∈ C and θ ∈ (0, 1), we have
θx1 + (1− θ)x2 ∈ C.
Examples:
• Euclidean balls: B(xc, r) = {x : ‖x− xc‖2 ≤ r}
• ellipsoid: {x : (x− xc)TP−1(x− xc) ≤ 1} with P being symmetric
positive definite
• polyhedra: {x : Ax ≤ b, Cx = d} with A ∈ Rm×n, C ∈ Rp×n
• several operations preserve convexity: intersection, affine functions,
perspective functions, and linear-fractional functions.
Most of the time, recognizing a convex set is not difficult.
12 / 30
Convex functions
A function f : Rn → R is convex if domf is convex and for any
x1,x2 ∈ domf and θ ∈ (0, 1), we have
f(θx1 + (1− θ)x2) ≤ θf(x1) + (1− θ)f(x2).
f is concave if (−f) is convex.
f is strictly convex if dom f is convex and
f(θx1 + (1 − θ)x2) < θf(x1) + (1 − θ)f(x2)
for all x1 ≠ x2 ∈ dom f and θ ∈ (0, 1).
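As a quick sanity check (not a proof), the convexity inequality can be tested numerically for f = ‖·‖2 at random points; this sketch assumes NumPy is available:

```python
import numpy as np

# Spot-check f(θx1 + (1−θ)x2) ≤ θ f(x1) + (1−θ) f(x2) for f = ||·||_2,
# which follows from the triangle inequality and homogeneity.
rng = np.random.default_rng(0)
f = np.linalg.norm
viol = 0.0
for _ in range(1000):
    x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
    theta = rng.uniform(0.0, 1.0)
    lhs = f(theta * x1 + (1 - theta) * x2)
    rhs = theta * f(x1) + (1 - theta) * f(x2)
    viol = max(viol, lhs - rhs)   # largest observed violation
```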
13 / 30
Examples of convex functions
Examples in Rn
• affine function f(x) = aTx+ b
• norms: ‖x‖p = (∑_{i=1}^n |xi|^p)^{1/p} for p ≥ 1; ‖x‖∞ = maxi |xi|.
Examples in Rm×n
• affine function
f(X) = tr(ATX) + b = ∑_{i=1}^m ∑_{j=1}^n AijXij + b
• spectral norm (maximum singular value)
f(X) = ‖X‖2 = σmax(X) = (λmax(XTX))^{1/2}
• nuclear norm
f(X) = ‖X‖∗ = ∑_{i=1}^{min{m,n}} σi(X)
14 / 30
Terminology
• extended value: f may take the value +∞, reducing the need to track dom f
• proper: there exists x such that f(x) is finite
• lower semi-continuous (LSC): lim inf_{x→x0} f(x) ≥ f(x0)
• closed: f has a closed epigraph
epi f = {(x, µ) : µ ∈ R, µ ≥ f(x)}
• Lemma: a proper convex function is closed if and only if it is LSC
• subdifferential
∂f(x) = {p : f(y) ≥ f(x) + 〈p,y − x〉 ∀y}
- each p ∈ ∂f(x) is called a subgradient
- if f ∈ C1 near x, then ∂f(x) = {∇f(x)}
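For instance, f(x) = |x| has ∂f(x) = {sign(x)} for x ≠ 0 and the interval [−1, 1] at x = 0; the defining inequality can be spot-checked over a grid (a sketch assuming NumPy):

```python
import numpy as np

# Check f(y) >= f(x) + p*(y - x) for f = |·| and sample pairs (x, p)
# with p a subgradient: p = sign(x) for x != 0, any p in [-1, 1] at x = 0.
ys = np.linspace(-2.0, 2.0, 401)
worst = np.inf
for x, p in [(1.0, 1.0), (-0.5, -1.0), (0.0, -1.0), (0.0, 0.3), (0.0, 1.0)]:
    margin = np.min(np.abs(ys) - (abs(x) + p * (ys - x)))
    worst = min(worst, margin)   # should never be (meaningfully) negative
```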
15 / 30
First-order condition
f is differentiable if the gradient
∇f(x) = [∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn]^T
exists at every x ∈ dom f.
first-order condition: differentiable f with convex domain is convex iff
f(y) ≥ f(x) +∇f(x)T (y − x) for all x,y ∈ domf
first-order condition: subdifferentiable f with convex domain is convex iff
f(y) ≥ f(x) + pT (y − x) for all x,y ∈ domf, p ∈ ∂f(x)
first-order optimality condition: x∗ ∈ argmin f(x)⇐⇒ 0 ∈ ∂f(x∗)
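The condition 0 ∈ ∂f(x∗) already yields a formula central to ℓ1 optimization: applied to the scalar function f(x) = λ|x| + ½(x − y)², it gives soft-thresholding. A sketch assuming NumPy; the brute-force grid comparison is only illustrative:

```python
import numpy as np

def soft_threshold(y, lam):
    # Closed-form minimizer of lam*|x| + 0.5*(x - y)^2 from 0 ∈ ∂f(x*):
    #   x* > 0: 0 = lam + (x* - y)   =>  x* = y - lam   (valid when y > lam)
    #   x* < 0: 0 = -lam + (x* - y)  =>  x* = y + lam   (valid when y < -lam)
    #   x* = 0: 0 ∈ [-lam, lam] - y  =>  |y| <= lam
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Compare with brute-force minimization over a fine grid.
lam = 0.7
y = np.array([-2.0, -0.3, 0.0, 0.5, 1.4])
grid = np.linspace(-3.0, 3.0, 60001)
brute = np.array([grid[np.argmin(lam * np.abs(grid) + 0.5 * (grid - yi) ** 2)]
                  for yi in y])
```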
16 / 30
Second-order condition
f is twice differentiable if the Hessian ∇2f(x) ∈ Sn, defined by
∇2f(x)ij = ∂²f(x)/(∂xi∂xj), i, j = 1, . . . , n,
exists at every x ∈ dom f.
second-order condition: twice differentiable f with convex domain is convex iff
∇2f(x) ⪰ 0 for all x ∈ dom f.
Furthermore, if ∇2f(x) ≻ 0 for all x ∈ dom f, then f is strictly convex.
Very useful in general convex optimization but not so in sparse optimization
17 / 30
Convex optimization formulation
Standard-form convex optimization problem
minimizex
f0(x)
subject to fi(x) ≤ 0, i = 1, . . . ,m,
Ax = b.
- the feasible set of a convex optimization problem is convex.
- f0, f1, . . . , fm are convex; equality constraints are affine.
18 / 30
Local and global solutions
Theorem
Any local solution of a convex problem is a global solution.
Proof.
Suppose that x is a local solution and y is a global solution and that
f0(y) < f0(x).
Consider z = θy + (1 − θ)x. By convexity,
f0(z) ≤ θf0(y) + (1 − θ)f0(x) < f0(x)
for any θ ∈ (0, 1), and ‖x − z‖ = θ‖x − y‖ can be made arbitrarily small, so
x cannot be a local solution.
19 / 30
Optimality criterion for differentiable f0
Since the feasible set is convex and
f0(y) ≥ f0(x) +∇f0(x)T (y − x),
x is optimal iff it is feasible and
∇f0(x)T (y − x) ≥ 0 for all feasible y.
20 / 30
• unconstrained problem: x is optimal if and only if
x ∈ domf0, ∇f0(x) = 0
• equality constrained problem:
minimize_x   f0(x)   subject to   Ax = b
x is optimal if and only if there exists a vector ν such that
x ∈ dom f0, Ax = b, ∇f0(x) + ATν = 0
• minimization over the nonnegative orthant
minimize_x   f0(x)   subject to   x ≥ 0
x is optimal if and only if
x ∈ dom f0, x ≥ 0, and, for each i, ∇f0(x)i ≥ 0 if xi = 0 and
∇f0(x)i = 0 if xi > 0
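For example (a numerical sketch assuming NumPy, with made-up data): for f0(x) = ½‖x − c‖², the minimizer over x ≥ 0 is the projection x∗ = max(c, 0), and the componentwise conditions above hold with ∇f0(x) = x − c:

```python
import numpy as np

# f0(x) = 0.5*||x - c||^2 over x >= 0 has minimizer x* = max(c, 0).
c = np.array([1.0, -2.0, 0.5, -0.1])
x_star = np.maximum(c, 0.0)
g = x_star - c   # gradient of f0 at x*

# xi > 0  =>  gradient component is zero;  xi = 0  =>  component is >= 0.
cond = bool(np.all(np.where(x_star > 0, np.abs(g) < 1e-12, g >= -1e-12)))
```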
21 / 30
Unconstrained problem with nondifferentiable f0
g is a subgradient of a convex function f at x ∈ domf if
f(y) ≥ f(x) + gT (y − x), ∀y ∈ domf.
the subdifferential ∂f(x) of f at x is the set of all subgradients:
∂f(x) = {g : gT (y − x) ≤ f(y)− f(x) ∀y ∈ domf}
x∗ minimizes f0(x) if and only if
0 ∈ ∂f0(x∗)
22 / 30
Optimality criteria in the general case
Standard form problem (not necessarily convex)
minimize_x   f0(x)
s.t.         fi(x) ≤ 0, i = 1, . . . ,m
             hj(x) = 0, j = 1, . . . , p
domain D, optimal value p∗
Lagrangian: L : Rn × Rm × Rp → R with dom L = D × Rm × Rp,
L(x, λ, ν) = f0(x) + ∑_{i=1}^m λifi(x) + ∑_{j=1}^p νjhj(x)
• λi is Lagrange multiplier associated with fi(x) ≤ 0
• νj is Lagrange multiplier associated with hj(x) = 0
23 / 30
Lagrange dual function
Lagrange dual function: g : Rm × Rp → R,
g(λ, ν) = inf_{x∈D} L(x, λ, ν)
        = inf_{x∈D} ( f0(x) + ∑_{i=1}^m λifi(x) + ∑_{j=1}^p νjhj(x) )
g is concave and can be −∞ for some (λ, ν).
Lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p∗.
24 / 30
Dual problem
Lagrange dual problem
maximize_{λ,ν}   g(λ, ν)
subject to       λ ⪰ 0
• finds the best lower bound on p∗
• a convex optimization problem; optimal value denoted d∗
• λ, ν are dual feasible if λ ⪰ 0 and (λ, ν) ∈ dom g
Strong duality: d∗ = p∗
• does not hold in general
• (usually) holds for convex problems
• conditions that guarantee strong duality in convex problems are called
constraint qualifications
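A small numeric illustration (assuming SciPy's linprog; the LP data are made up): for the primal min cᵀx s.t. Ax ≤ b, the Lagrange dual works out to max −bᵀλ s.t. Aᵀλ = −c, λ ⪰ 0, and strong duality holds here since the primal is strictly feasible:

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP:  min c^T x  s.t.  Ax <= b  (x free).
# Dual LP:    max -b^T lam  s.t.  A^T lam = -c, lam >= 0.
c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])

primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
# Solve the dual as a minimization: min b^T lam = -(max -b^T lam).
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

p_star = primal.fun
d_star = -dual.fun   # strong duality: p* = d*
```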
25 / 30
Slater’s constraint qualification
Strong duality holds for a convex problem
minimize_x   f0(x)
subject to   fi(x) ≤ 0, i = 1, . . . ,m
             Ax = b
if it is strictly feasible, i.e.,
∃x ∈ intD : fi(x) < 0, i = 1, . . . ,m, Ax = b
• also guarantees that the dual optimum is attained (if p∗ > −∞)
• linear inequalities do not need to hold with strict inequality
• there are many other types of constraint qualifications
• some non-convex optimization problems may have strong duality
26 / 30
Complementary slackness
Assume strong duality holds, x∗ is primal optimal, and (λ∗, ν∗) is dual optimal
f0(x∗) = g(λ∗, ν∗) = inf_x ( f0(x) + ∑_{i=1}^m λ∗i fi(x) + ∑_{j=1}^p ν∗j hj(x) )
       ≤ f0(x∗) + ∑_{i=1}^m λ∗i fi(x∗) + ∑_{j=1}^p ν∗j hj(x∗)
       ≤ f0(x∗)
Hence both inequalities hold with equality:
• x∗ minimizes L(x, λ∗, ν∗)
• λ∗i fi(x∗) = 0 for i = 1, . . . ,m (complementary slackness)
27 / 30
Karush-Kuhn-Tucker (KKT) conditions
KKT conditions for a problem with differentiable fi, hj :
• primal constraints: fi(x) ≤ 0, i = 1, . . . ,m, hj(x) = 0, j = 1, . . . , p
• dual constraints: λ ⪰ 0
• complementary slackness: λifi(x) = 0, i = 1, . . . ,m
• gradient of the Lagrangian with respect to x vanishes:
∇f0(x) + ∑_{i=1}^m λi∇fi(x) + ∑_{j=1}^p νj∇hj(x) = 0
If x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal.
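A tiny worked instance (made up for illustration; assumes NumPy): for the QP min ½‖x‖² s.t. aᵀx + 1 ≤ 0, solving the KKT system by hand gives x∗ = −a/‖a‖² and λ∗ = 1/‖a‖², and all four conditions can be checked:

```python
import numpy as np

# QP: min 0.5*||x||^2  s.t.  a^T x + 1 <= 0.
# Stationarity x + lam*a = 0 and an active constraint a^T x = -1 give
# x* = -a/||a||^2 and lam* = 1/||a||^2.
a = np.array([3.0, 4.0])        # ||a||^2 = 25
x = -a / (a @ a)
lam = 1.0 / (a @ a)
f1 = a @ x + 1.0                # constraint value at x*

assert f1 <= 1e-12                      # primal feasibility
assert lam >= 0                         # dual feasibility
assert abs(lam * f1) <= 1e-12           # complementary slackness
assert np.allclose(x + lam * a, 0.0)    # stationarity: ∇f0 + lam ∇f1 = 0
```

Since the problem is convex, satisfying the KKT conditions certifies that (x∗, λ∗) is optimal.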
28 / 30
Exercise: constrained/unconstrained ℓ1 problems
Consider the two ℓ1 problems
minimize_x   ‖x‖1
subject to   Ax = b,
             l ≤ x ≤ u
and
minimize_x   ‖x‖1 + (λ/2)‖Ax − b‖₂²
subject to   l ≤ x ≤ u
Exercises: derive their
• LP or QP formulations
• Lagrange dual problems
• KKT conditions
29 / 30
Exercise: total variation problem∗
The discrete total variation of a vector x ∈ Rn is
TV(x) = ∑_{i=1}^{n−1} |xi+1 − xi|.
Consider the problem
minimize_x   TV(x) + (λ/2)‖Ax − b‖₂²
subject to   l ≤ x ≤ u
Exercises: derive its
• SOCP formulation (refer to Sec. 4.2.2 of Boyd & Vandenberghe)
• Lagrange dual problem
• KKT conditions
30 / 30