Introduction to Optimization
Marc Toussaint
July 11, 2013
This is a direct concatenation and reformatting of all lecture slides and exercises from the Optimization course (summer term 2013, U Stuttgart), including a topic list to prepare for exams.
Contents
1 Introduction
2 Gradient-based Methods: Plain gradient descent, stepsize adaptation & monotonicity, steepest descent, conjugate gradient, Rprop
3 Constrained Optimization: General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I
4 Second-Order Methods: 2nd order gives better stepsize & direction, Newton methods, adaptive stepsize, Levenberg-Marquardt, Gauss-Newton method, Quasi-Newton methods, BFGS, primal-dual interior point Newton method
5 Convex Optimization: Convex, quasiconvex, unimodal, convex optimization problem, linear program (LP), standard form, simplex algorithm, LP-relaxation of integer linear programs, quadratic programming (QP), sequential quadratic programming
6 Stochastic Search & Heuristics: Blackbox optimization, stochastic search, (µ+λ)-ES, CMA-ES, Evolutionary Algorithms, Simulated Annealing, Hill Climbing, Nelder-Mead downhill simplex
7 Global Optimization: Multi-armed bandits, exploration vs. exploitation, navigation through belief space, upper confidence bound (UCB), global optimization = infinite bandits, Gaussian Processes, probability of improvement, expected improvement, UCB
8 Exercises
9 Topic list
1 Introduction
Why Optimization is interesting!
• Which science does not use optimality principles to describe
nature & artifacts?
• Endless applications
1:1
The content of an optimization course
• Catholic way: Convex Optimization
• Discrete Optimization (Stefan Funke)
• Exotics: Evolutionary Algorithms, Swarm optimization, etc
• Here:
I asked colleagues “What are the optimization methods one should know / you use most?” → Everybody gave very different answers.
1:2
log-barrier, simplex, particle swarm, MCMC (simulated annealing), (L)BFGS, blackbox stochastic search, Newton, Rprop, EM, primal/dual, greedy, (conj.) gradients, KKT, line search, linear/quadratic programming
1:3
This is the first time I give the lecture!
• It’ll be improvised
• You can tell me what to include
1:4
Planned Outline
• Gradient-based optimization (1st order methods)
  – plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
  – adaptive stepsize heuristics
• Constrained Optimization
  – squared penalties, augmented Lagrangian, log barrier
  – Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
  – Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
  – constrained case, primal-dual Newton
• Special convex cases
  – Linear Programming, (sequential) Quadratic Programming
  – Simplex algorithm
  – relation to relaxed discrete optimization
• Blackbox optimization (“0th order methods”)
  – blackbox stochastic search
  – Markov Chain Monte Carlo methods
  – evolutionary algorithms
1:5
Rough Types of Optimization Problems
• Generic optimization problem:
  Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → Rᵐ; find
    min_x f(x)  s.t.  g(x) ≤ 0
• Blackbox: only f(x) can be evaluated
• Gradient: ∇f(x) can be evaluated
• Gauss-Newton type: f(x) = φ(x)ᵀφ(x) and ∇φ(x) can be evaluated
• 2nd order: ∇2f(x) can be evaluated
• “Approximate upgrade”:
– Use samples of f(x) to approximate ∇f(x) locally
– Use samples of ∇f(x) to approximate ∇2f(x) locally
1:6
Books
Boyd and Vandenberghe: Convex Optimization. http://www.stanford.edu/~boyd/cvxbook/
(this course will not go to the full depth in math of Boyd et al.)
1:7
Organisation
• Course webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/13-Optimization/
– Slides, exercises & software (C++)
– Links to books and other resources
• Secretary / organisational questions:
Carola Stahl, [email protected], Room 2.217
• 1 planned tutorial: Tuesday 17:30-19:00, Room 0.453
• Rules for the exercises:
– Working on the exercises is important!
– At the beginning of each tutorial, sign up on the list:
  – attendance
  – which problems you worked on
– Random selection to present the solution
– 50% of the problems worked on are required for active participation
1:8
2 Gradient-based Methods
Plain gradient descent, stepsize adaptation & monotonicity, steepest
descent, conjugate gradient, Rprop
Gradient descent methods – outline
• Plain gradient descent (with adaptive stepsize)
• Steepest descent (w.r.t. a known metric)
• Conjugate gradient (requires line search)
• Rprop (heuristic, but quite efficient)
2:1
Gradient descent
• Notation:
  objective function: f : Rⁿ → R
  gradient vector: ∇f(x) = [∂f(x)/∂x]ᵀ ∈ Rⁿ
• Problem:
    min_x f(x)
  where we can evaluate f(x) and ∇f(x) for any x ∈ Rⁿ
• Gradient descent:
Make iterative steps in the direction −∇f(x).
2:2
Plain Gradient Descent
2:3
Fixed stepsize
BAD! gradient descent with fixed stepsize:

Input: initial x ∈ Rⁿ, function ∇f(x), stepsize α, tolerance θ
Output: x
1: repeat
2:   x ← x − α∇f(x)
3: until |∆x| < θ [perhaps for 10 iterations in sequence]
2:4
Making steps proportional to ∇f(x)??
  large gradient → large step?
  small gradient → small step?
NO! We need methods independent of |∇f(x)|, invariant to scalings of f and x!
2:5
How can we become independent of |∇f(x)|?
• Line search — which we’ll discuss briefly later
• Stepsize adaptation
2:6
Gradient descent with stepsize adaptation
Input: initial x ∈ Rⁿ, functions f(x) and ∇f(x), initial stepsize α, tolerance θ
Output: x
1: repeat
2:   y ← x − α ∇f(x)/|∇f(x)|
3:   if f(y) ≤ f(x) then // step is accepted
4:     x ← y
5:     α ← 1.2α // increase stepsize
6:   else // step is rejected
7:     α ← 0.5α // decrease stepsize
8:   end if
9: until |y − x| < θ [perhaps for 10 iterations in sequence]

(“magic numbers”)
α determines the absolute stepsize (the step is normalized to length α)
the stepsize is automatically adapted
2:7
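The stepsize-adaptation scheme of slide 2:7 can be sketched in a few lines of Python (a sketch only; the course software is C++, and the toy quadratic below is a made-up example):

```python
import numpy as np

def grad_descent_adaptive(f, grad, x, alpha=1.0, theta=1e-8, max_iter=100000):
    # gradient descent with stepsize adaptation (slide 2:7): steps are
    # normalized to length alpha, so only the gradient's direction is used
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm == 0.0:
            break
        y = x - alpha * g / gnorm
        if f(y) <= f(x):       # step accepted
            x = y
            if alpha < theta:  # the accepted step was tiny: converged
                break
            alpha *= 1.2       # increase stepsize
        else:                  # step rejected
            alpha *= 0.5       # decrease stepsize
    return x

# toy objective with very different curvatures per dimension
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = grad_descent_adaptive(f, grad, [3.0, -2.0])
```

Note that the method is monotone by construction: x only ever moves to a point with lower (or equal) f.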
• Guaranteed monotonicity (by construction)
  If f is convex ⇒ convergence
  For typical non-convex bounded f ⇒ convergence to a local optimum
2:8
Steepest Descent
2:9
Steepest Descent
• The gradient ∇f(x) is sometimes called the steepest descent direction
  Is it really?
• Here is a possible definition:
  The steepest descent direction is the one where, when I make a step of length 1, I get the largest decrease of f in its linear approximation:
    argmin_δ ∇f(x)ᵀδ  s.t.  ||δ|| = 1
2:10
Steepest Descent
• But the norm ||δ||² = δᵀAδ depends on the metric A!
  Let A = BᵀB (Cholesky decomposition) and z = Bδ:
    δ* = argmin_δ ∇fᵀδ  s.t.  δᵀAδ = 1
       = B⁻¹ argmin_z (B⁻¹z)ᵀ∇f  s.t.  zᵀz = 1
       = B⁻¹ argmin_z zᵀB⁻ᵀ∇f  s.t.  zᵀz = 1
       = B⁻¹[−B⁻ᵀ∇f] = −A⁻¹∇f  (up to scaling)
  The steepest descent direction is δ = −A⁻¹∇f
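A small numeric check of this result (the metric A and the gradient are made-up example values): among all directions of unit A-norm, −A⁻¹∇f decreases the linear model at least as much as the plain gradient direction.

```python
import numpy as np

# steepest descent under a metric A (slide 2:11): delta = -A^{-1} grad f
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])        # metric defining ||delta||^2 = delta^T A delta
gradf = np.array([1.0, 1.0])
delta = -np.linalg.solve(A, gradf)  # steepest descent direction -A^{-1} grad f

# normalize both candidates to unit A-norm and compare the linear decrease
d_metric = delta / np.sqrt(delta @ A @ delta)
d_plain = -gradf / np.sqrt(gradf @ A @ gradf)
```

With A = I this reduces to the plain (negative) gradient, as expected.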
Behavior under linear coordinate transformations
• Let B be a matrix that describes a linear transformation in coordinates
• A coordinate vector x transforms as z = Bx
• The gradient vector ∇ₓf(x) transforms as ∇_z f(z) = B⁻ᵀ∇ₓf(x)
• The metric A transforms as A_z = B⁻ᵀAₓB⁻¹
• The steepest descent direction transforms as A_z⁻¹∇_z f(z) = B Aₓ⁻¹∇ₓf(x)
The steepest descent direction transforms like a normal coordinate vector (covariant)
2:12
(Nonlinear) Conjugate Gradient
2:13
Conjugate Gradient
• The “Conjugate Gradient Method” is a method for solving large linear equation systems Ax + b = 0.
  We mention its extension for optimizing nonlinear functions f(x).
• A key insight:
  – at x_k we computed ∇f(x_k)
  – we made a (line-search) step to x_{k+1}
  – at x_{k+1} we computed ∇f(x_{k+1})
  What conclusions can we draw about the “local quadratic shape” of f?
2:14
Conjugate Gradient
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f(x)
2: repeat
3:   α ← argmin_α f(x + αd) // line search
4:   x ← x + αd
5:   g′ ← g, g ← −∇f(x) // store and compute gradient
6:   β ← max{ gᵀ(g − g′) / g′ᵀg′ , 0 }
7:   d ← g + βd // conjugate descent direction
8: until |∆x| < θ

• Notes:
  – β > 0: The new descent direction always adds a bit of the old direction!
  – This essentially provides 2nd order information
  – The equation for β is by Polak-Ribière: On a quadratic function f(x) = xᵀAx this leads to conjugate search directions, d′ᵀAd = 0.
  – All this really only works with line search
2:15
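The pseudocode above can be sketched as follows (a sketch: the doubling/halving line search and the descent-direction safeguard are my assumptions, not part of the slide; any reasonable line search works):

```python
import numpy as np

def cg_polak_ribiere(f, grad, x, theta=1e-8, max_iter=500):
    # nonlinear conjugate gradient with the Polak-Ribiere beta (slide 2:15)
    x = np.asarray(x, dtype=float)
    g = -grad(x)
    d = g.copy()
    for _ in range(max_iter):
        if g @ d <= 0:            # safeguard: restart if d is not a descent direction
            d = g.copy()
        alpha = 1.0
        while f(x + 2.0 * alpha * d) < f(x + alpha * d):   # grow alpha while it helps
            alpha *= 2.0
        while f(x + alpha * d) > f(x) and alpha > 1e-14:   # shrink until f decreases
            alpha *= 0.5
        x_new = x + alpha * d
        if np.linalg.norm(x_new - x) < theta:
            return x_new
        x = x_new
        g_old, g = g, -grad(x)
        beta = max(g @ (g - g_old) / (g_old @ g_old), 0.0)  # Polak-Ribiere
        d = g + beta * d          # new direction adds a bit of the old one
    return x

# quadratic test function x^T A x (made-up A with asymmetric curvature)
A = np.array([[10.0, 1.0], [1.0, 1.0]])
fq = lambda x: x @ A @ x
gq = lambda x: 2.0 * A @ x
x_min = cg_polak_ribiere(fq, gq, [3.0, -2.0])
```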
Conjugate Gradient
• For quadratic functions CG converges in n iterations. But each iteration does line search!
2:16
Conjugate Gradient
• Useful tutorial on CG and line search:
J. R. Shewchuk: An Introduction to the Conjugate Gradient Method
Without the Agonizing Pain
2:17
Rprop
2:18
Rprop
“Resilient Back Propagation” (outdated name from NN times...)
Input: initial x ∈ Rⁿ, function ∇f(x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x₀, all αᵢ = α, all g′ᵢ = 0
2: repeat
3:   g ← ∇f(x)
4:   x′ ← x
5:   for i = 1 : n do
6:     if gᵢg′ᵢ > 0 then // same direction as last time
7:       αᵢ ← 1.2αᵢ
8:       xᵢ ← xᵢ − αᵢ sign(gᵢ)
9:       g′ᵢ ← gᵢ
10:    else if gᵢg′ᵢ < 0 then // change of direction
11:      αᵢ ← 0.5αᵢ
12:      xᵢ ← xᵢ − αᵢ sign(gᵢ)
13:      g′ᵢ ← 0 // force last case next time
14:    else
15:      xᵢ ← xᵢ − αᵢ sign(gᵢ)
16:      g′ᵢ ← gᵢ
17:    end if
18:    optionally: cap αᵢ ∈ [αmin xᵢ, αmax xᵢ]
19:  end for
20: until |x′ − x| < θ for 10 iterations in sequence
2:19
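A direct Python transcription of the Rprop pseudocode (a sketch; the badly scaled quadratic below is a made-up test case):

```python
import numpy as np

def rprop(grad, x, alpha0=0.1, theta=1e-8, max_iter=50000):
    # Rprop (slide 2:19): per-dimension stepsizes alpha_i, adapted from the
    # sign agreement of successive gradients; |grad f| is ignored entirely
    x = np.asarray(x, dtype=float)
    n = len(x)
    alpha = np.full(n, alpha0)
    g_old = np.zeros(n)
    for _ in range(max_iter):
        g = grad(x)
        x_prev = x.copy()
        for i in range(n):
            if g[i] * g_old[i] > 0:        # same direction as last time
                alpha[i] *= 1.2
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
            elif g[i] * g_old[i] < 0:      # change of direction: we overshot
                alpha[i] *= 0.5
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = 0.0             # force the plain case next time
            else:                          # plain case
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
        if np.linalg.norm(x - x_prev) < theta:
            break
    return x

# badly scaled quadratic: Rprop handles each dimension's scaling separately
f_grad = lambda x: np.array([2.0 * x[0], 200.0 * x[1]])
x_min = rprop(f_grad, [3.0, -2.0])
```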
Rprop
• Rprop is a bit crazy:
  – stepsize adaptation in each dimension separately
  – it not only ignores |∇f| but also its exact direction; step directions may differ by up to 90° from −∇f
  – often works very robustly
  – guarantees? See work by Ch. Igel
• If you like, have a look at:
  Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient compared to Levenberg-Marquardt optimization. In Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, volume 151, 259-272.
2:20
Backtracking line search
• Line search in general denotes the problem
    min_{α≥0} f(x + α∆)
  for some step direction ∆
• The most common line search on convex functions is backtracking:

Input: start point x, direction ∆, function f(x), parameters a ∈ (0, 1/2), b ∈ (0, 1)
Output: α
1: initialize α = 1
2: while f(x + α∆) > f(x) + a∇f(x)ᵀ(α∆) do
3:   α ← bα
4: end while

b describes the stepsize decrement in case of a rejected step
a describes a minimum desired decrease in f(x)
• In the 2nd order methods we described, we chose a = 0: we did not invest into further line search steps if f(x + α∆) ≤ f(x)
• Boyd et al.: typically a ∈ [0.01, 0.3] and b ∈ [0.1, 0.8]
2:21
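Backtracking is short enough to write out completely (the quadratic and the values a = 0.1, b = 0.5 are example choices within the ranges above):

```python
import numpy as np

def backtracking(f, grad_x, x, delta, a=0.1, b=0.5):
    # backtracking line search (slide 2:21): shrink alpha by factor b until
    # the sufficient-decrease condition with slope fraction a holds
    alpha = 1.0
    fx = f(x)
    while f(x + alpha * delta) > fx + a * grad_x @ (alpha * delta):
        alpha *= b
    return alpha

# example: quadratic objective, stepping along the negative gradient
f = lambda x: x[0]**2 + 4.0 * x[1]**2
x = np.array([1.0, 1.0])
g = np.array([2.0, 8.0])          # gradient of f at x
alpha = backtracking(f, g, x, -g)
```

For this example the condition first holds after three halvings, at α = 0.125.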
Backtracking line search for convex functions
(From Boyd et al.; notation differs from previous slide.)
2:22
Appendix
Two little comments on stopping criteria & costs...
2:23
Appendix: Stopping Criteria
• Standard references (Boyd) define stopping criteria based on
the “change” in f(x), e.g. |∆f(x)| < θ or |∇f(x)| < θ.
• Throughout I will define stopping criteria based on the change
in x, e.g. |∆x| < θ! In my experience this is in many problems
more meaningful, and invariant of the scaling of f .
2:24
Appendix: Evaluating optimization costs
• Standard references (Boyd) assume line search is cheap and measure optimization costs as the number of iterations (counting 1 per line search).
• Throughout I will assume that every evaluation of f(x) or (f(x), ∇f(x)) or (f(x), ∇f(x), ∇²f(x)) is equally expensive!
2:25
3 Constrained Optimization
General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I
Constrained Optimization
• General constrained optimization problem:
  Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → Rᵐ, h : Rⁿ → Rˡ; find
    min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
In this lecture I’ll focus (mostly) on inequality constraints g!
• Applications
– Find an optimal, non-colliding trajectory in robotics
– Optimize the shape of a turbine blade, s.t. it must not break
– Optimize the train schedule, s.t. consistency/possibility
3:1
General approaches
• Try to somehow transform the constrained problem into
  – a series of unconstrained problems
  – a single but larger unconstrained problem
  – another constrained problem, hopefully simpler (dual, convex)
3:2
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:3
Penalties & Barriers
• Convention:
  A barrier is really ∞ for g(x) > 0
  A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0
3:4
Log barrier method or Interior Point method
3:5
Log barrier method
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) − µ Σᵢ log(−gᵢ(x))
3:6
Log barrier
• For µ → 0, −µ log(−g) converges to ∞[g > 0]
  Notation: [boolean expression] ∈ {0, 1}
• The barrier's gradient ∇(−log(−g)) = −∇g/g pushes away from the constraint
• Eventually we want to have a very small µ; but choosing a small µ makes the barrier very non-smooth, which is bad for gradient and 2nd order methods
3:7
Central Path
• Every µ defines a different optimal x*(µ)
    x*(µ) = argmin_x f(x) − µ Σᵢ log(−gᵢ(x))
• Each point on the path can be understood as the optimal compromise between minimizing f(x) and a repelling force of the constraints. (Which corresponds to dual variables λ*(µ).)
3:8
Log barrier method
Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) − µ Σᵢ log(−gᵢ(x)) with tolerance 10θ
4:   decrease µ ← µ/10
5: until |∆x| < θ and ∀i : gᵢ(x) < ε

Note: See Boyd & Vandenberghe for stopping criteria based on f precision (duality gap) and a better choice of the initial µ (which is called t there).
3:9
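A minimal sketch of the log barrier method; the inner centering solver here is the normalized gradient descent with stepsize adaptation from slide 2:7 (an assumption, any unconstrained solver works), and the 1-D example problem is made up:

```python
import numpy as np

def logbarrier_min(f, grad_f, gs, grad_gs, x, theta=1e-6):
    # log barrier method (slide 3:9): repeatedly minimize
    # f(x) - mu * sum_i log(-g_i(x)) while decreasing mu
    x = np.asarray(x, dtype=float)
    mu = 1.0
    def F(x):
        vals = np.array([g(x) for g in gs])
        if np.any(vals >= 0):
            return np.inf                  # barrier: infinite outside the interior
        return f(x) - mu * np.sum(np.log(-vals))
    def gradF(x):
        out = grad_f(x).astype(float)
        for g, dg in zip(gs, grad_gs):
            out = out - mu / g(x) * dg(x)  # barrier gradient -mu * grad g / g
        return out
    for _ in range(8):                     # outer loop: decrease mu
        alpha = 1.0
        for _ in range(20000):             # inner loop: centering for this mu
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        mu /= 10.0
    return x

# example: min (x-2)^2  s.t.  x <= 1  (g(x) = x - 1), feasible start x = 0
f = lambda x: (x[0] - 2.0)**2
grad_f = lambda x: np.array([2.0 * (x[0] - 2.0)])
g = lambda x: x[0] - 1.0
dg = lambda x: np.array([1.0])
x_min = logbarrier_min(f, grad_f, [g], [dg], np.array([0.0]))
```

Note that all iterates stay strictly feasible: any step across the boundary has infinite barrier cost and is rejected.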
We will revisit the log barrier method later, once we have introduced the Lagrangian...
3:10
Squared Penalty Method
3:11
Squared Penalty Method
• This is perhaps the simplest approach
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ [gᵢ(x) > 0] gᵢ(x)²

Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ [gᵢ(x) > 0] gᵢ(x)² with tolerance 10θ
4:   µ ← 10µ
5: until |∆x| < θ and ∀i : gᵢ(x) < ε
3:12
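A sketch of the squared penalty method (inner solver and example problem are my assumptions; the example is chosen so the residual constraint violation is visible):

```python
import numpy as np

def squared_penalty_min(f, grad_f, gs, grad_gs, x, theta=1e-6, eps=1e-3):
    # squared penalty method (slide 3:12): minimize
    # f(x) + mu * sum_i [g_i(x) > 0] g_i(x)^2 for increasing mu;
    # inner solver: normalized gradient descent with stepsize adaptation
    x = np.asarray(x, dtype=float)
    mu = 1.0
    F = lambda x: f(x) + mu * sum(max(g(x), 0.0)**2 for g in gs)
    def gradF(x):
        out = grad_f(x).astype(float)
        for g, dg in zip(gs, grad_gs):
            if g(x) > 0:                   # only violated constraints penalize
                out = out + mu * 2.0 * g(x) * dg(x)
        return out
    for _ in range(10):
        alpha = 1.0
        for _ in range(20000):
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        if all(g(x) < eps for g in gs):    # small residual violation accepted
            return x
        mu *= 10.0
    return x

# example: min x  s.t.  x >= 0, i.e. g(x) = -x <= 0; optimum x* = 0,
# the penalty method stops at x = -1/(2*mu), slightly violating
f = lambda x: x[0]
grad_f = lambda x: np.array([1.0])
g = lambda x: -x[0]
dg = lambda x: np.array([-1.0])
x_min = squared_penalty_min(f, grad_f, [g], [dg], np.array([1.0]))
```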
Squared Penalty Method
• The method is ok, but will always lead to some violation of the constraints
• A better idea would be to add an out-pushing gradient/force −∇gᵢ(x) for every constraint gᵢ(x) > 0 that is violated.
  Ideally, the out-pushing gradient mixes with −∇f(x) exactly such that the result becomes tangential to the constraint!
  This idea leads to the augmented Lagrangian approach.
3:13
Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)
3:14
Augmented Lagrangian (equality constraint)
• We first consider an equality constraint before addressing inequalities
• Instead of
    min_x f(x)  s.t.  h(x) = 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ hᵢ(x)² + Σᵢ₌₁ᵐ λᵢhᵢ(x)    (1)
• Note:
  – The gradient ∇hᵢ(x) is always orthogonal to the constraint
  – By tuning λᵢ we can induce a “virtual gradient” λᵢ∇hᵢ(x)
  – The term µ Σᵢ₌₁ᵐ hᵢ(x)² penalizes as before
• Here is the trick:
  – First minimize (1) for some µ and λᵢ
  – This will in general lead to a (slight) penalty µ Σᵢ₌₁ᵐ hᵢ(x)²
  – For the next iteration, choose λᵢ to generate exactly the gradient that was previously generated by the penalty
3:15
• Optimality condition after an iteration:
    x′ = argmin_x f(x) + µ Σᵢ₌₁ᵐ hᵢ(x)² + Σᵢ₌₁ᵐ λᵢhᵢ(x)
    ⇒ 0 = ∇f(x′) + µ Σᵢ₌₁ᵐ 2hᵢ(x′)∇hᵢ(x′) + Σᵢ₌₁ᵐ λᵢ∇hᵢ(x′)
• Update the λ's for the next iteration:
    Σᵢ₌₁ᵐ λᵢ^new ∇hᵢ(x′) = µ Σᵢ₌₁ᵐ 2hᵢ(x′)∇hᵢ(x′) + Σᵢ₌₁ᵐ λᵢ^old ∇hᵢ(x′)
    λᵢ^new = λᵢ^old + 2µhᵢ(x′)

Input: initial x ∈ Rⁿ, functions f(x), h(x), ∇f(x), ∇h(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λᵢ = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ hᵢ(x)² + Σᵢ λᵢhᵢ(x)
4:   ∀i : λᵢ ← λᵢ + 2µhᵢ(x)
5: until |∆x| < θ and ∀i : |hᵢ(x)| < ε
3:16
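The algorithm above in Python, tested on the equality-constrained example from slide 3:23 (the inner solver, again a simple adaptive gradient descent, is my assumption; µ stays fixed while the λ updates enforce the constraint):

```python
import numpy as np

def auglag_eq(f, grad_f, hs, grad_hs, x, mu=1.0, theta=1e-6, eps=1e-6):
    # augmented Lagrangian for equalities (slide 3:16): minimize
    # f + mu*sum h_i^2 + sum lambda_i h_i, then lambda_i += 2*mu*h_i(x)
    x = np.asarray(x, dtype=float)
    lam = np.zeros(len(hs))
    for _ in range(100):
        F = lambda x: f(x) + sum(mu * h(x)**2 + l * h(x) for h, l in zip(hs, lam))
        def gradF(x):
            out = grad_f(x).astype(float)
            for h, dh, l in zip(hs, grad_hs, lam):
                out = out + (2.0 * mu * h(x) + l) * dh(x)
            return out
        alpha = 1.0
        for _ in range(20000):             # inner unconstrained minimization
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        lam = lam + 2.0 * mu * np.array([h(x) for h in hs])
        if all(abs(h(x)) < eps for h in hs):
            break
    return x, lam

# example from slide 3:23: min x^T x  s.t.  x1 + x2 = 1; solution (1/2, 1/2)
f = lambda x: x @ x
grad_f = lambda x: 2.0 * x
h = lambda x: x[0] + x[1] - 1.0
dh = lambda x: np.array([1.0, 1.0])
x_min, lam = auglag_eq(f, grad_f, [h], [dh], np.array([0.0, 0.0]))
```

The returned multiplier converges to λ* = −1, matching the analytical solution on slide 3:23.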
This adaptation of the λᵢ is really elegant:
– We do not have to take the penalty limit µ → ∞ but still get exact constraints
– If f and h were linear (∇f and ∇hᵢ constant), the updated λᵢ is exactly right: in the next iteration we would exactly hit the constraint (by construction)
– The penalty term is like a measuring device for the necessary “virtual gradient”, which is generated by the augmentation term in the next iteration
– The λᵢ are very meaningful: they give the force/gradient that a constraint exerts on the solution
3:17
Augmented Lagrangian (inequality constraint)
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ [gᵢ(x) ≥ 0 ∨ λᵢ > 0] gᵢ(x)² + Σᵢ₌₁ᵐ λᵢgᵢ(x)
• A constraint is either active or inactive:
  – When active (gᵢ(x) ≥ 0 ∨ λᵢ > 0) we aim for equality gᵢ(x) = 0
  – When inactive (gᵢ(x) < 0 ∧ λᵢ = 0) we don't penalize/augment
  – λᵢ are zero or positive, but never negative

Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λᵢ = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ [gᵢ(x) ≥ 0 ∨ λᵢ > 0] gᵢ(x)² + Σᵢ λᵢgᵢ(x)
4:   ∀i : λᵢ ← max(λᵢ + 2µgᵢ(x), 0)
5: until |∆x| < θ and ∀i : gᵢ(x) < ε
3:18
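The inequality version in Python (a sketch under the same assumptions as before; the example has one active constraint, so the multiplier converges to the constraint force λ* = 2):

```python
import numpy as np

def auglag_ineq(f, grad_f, gs, grad_gs, x, mu=1.0, theta=1e-6, eps=1e-6):
    # augmented Lagrangian for inequalities (slide 3:18): augment only
    # active constraints (g_i >= 0 or lambda_i > 0), clamp lambda at 0
    x = np.asarray(x, dtype=float)
    lam = np.zeros(len(gs))
    for _ in range(50):
        active = lambda i, xx: gs[i](xx) >= 0 or lam[i] > 0
        def F(xx):
            v = f(xx)
            for i, g in enumerate(gs):
                if active(i, xx):
                    v += mu * g(xx)**2
                v += lam[i] * g(xx)
            return v
        def gradF(xx):
            out = grad_f(xx).astype(float)
            for i, (g, dg) in enumerate(zip(gs, grad_gs)):
                c = lam[i] + (2.0 * mu * g(xx) if active(i, xx) else 0.0)
                out = out + c * dg(xx)
            return out
        alpha = 1.0
        for _ in range(20000):             # inner unconstrained minimization
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        lam = np.maximum(lam + 2.0 * mu * np.array([g(x) for g in gs]), 0.0)
        if all(g(x) < eps for g in gs):
            break
    return x, lam

# example: min (x-2)^2  s.t.  x <= 1; solution x* = 1, multiplier lambda* = 2
f = lambda x: (x[0] - 2.0)**2
grad_f = lambda x: np.array([2.0 * (x[0] - 2.0)])
g = lambda x: x[0] - 1.0
dg = lambda x: np.array([1.0])
x_min, lam = auglag_ineq(f, grad_f, [g], [dg], np.array([0.0]))
```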
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:19
The Lagrangian
3:20
The Lagrangian
• Given a constrained problem
    min_x f(x)  s.t.  g(x) ≤ 0
  we define the Lagrangian as
    L(x, λ) = f(x) + Σᵢ₌₁ᵐ λᵢgᵢ(x)
• The λᵢ ≥ 0 are called dual variables or Lagrange multipliers
What’s the point of this definition?
• The Lagrangian is useful to compute optima analytically, on paper; that's why physicists learn it early on
• The Lagrangian implies the KKT conditions of optimality
• Optima are necessarily at saddle points of the Lagrangian
• The Lagrangian implies a dual problem, which is sometimes
easier to solve than the primal
3:22
Example: Some calculus using the Lagrangian
• For x ∈ R², what is
    min_x xᵀx  s.t.  x₁ + x₂ = 1
• Solution:
    L(x, λ) = xᵀx + λ(x₁ + x₂ − 1)
    0 = ∇ₓL(x, λ) = 2x + λ(1 1)ᵀ  ⇒  x₁ = x₂ = −λ/2
    0 = ∇_λL(x, λ) = x₁ + x₂ − 1 = −λ/2 − λ/2 − 1  ⇒  λ = −1
    ⇒  x₁ = x₂ = 1/2
3:23
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇gᵢ(x)
3:24
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇gᵢ(x)
• Formally: for optimal x: ∇f(x) ∈ span{∇gᵢ(x)}
• Or: for optimal x there must exist λᵢ such that −∇f(x) = Σᵢ λᵢ∇gᵢ(x)
• For optimal x it must hold (necessary condition): ∃λ s.t.
    ∇f(x) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x) = 0   (“force balance”)
    ∀i : gᵢ(x) ≤ 0   (primal feasibility)
    ∀i : λᵢ ≥ 0   (dual feasibility)
    ∀i : λᵢgᵢ(x) = 0   (complementarity)
  The last condition says that λᵢ > 0 only for active constraints.
  These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality constraints)
3:25
The “force” & KKT view on the Lagrangian
• The first condition (“force balance”), ∃λ s.t.
    ∇f(x) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x) = 0
  can be equivalently expressed as: ∃λ s.t.
    ∇ₓL(x, λ) = 0
• In that sense, the Lagrangian can be viewed as the “energy function” that generates (for a good choice of λ) the right balance between cost and constraint gradients
• This is exactly as in the augmented Lagrangian approach, where however we have an additional (“augmented”) squared penalty that is used to tune the λᵢ
3:26
Saddle point view on the Lagrangian
• Let's briefly consider the equality case again:
    min_x f(x)  s.t.  h(x) = 0
  with the Lagrangian
    L(x, λ) = f(x) + Σᵢ₌₁ᵐ λᵢhᵢ(x)
• Note:
    min_x L(x, λ)  ⇒  0 = ∇ₓL(x, λ)  ↔  force balance
    max_λ L(x, λ)  ⇒  0 = ∇_λL(x, λ) = h(x)  ↔  constraint
• Optima (x*, λ*) are saddle points where
    ∇ₓL = 0 ensures force balance and
    ∇_λL = 0 ensures the constraint
3:27
Saddle point view on the Lagrangian
• In the inequality case:
    max_{λ≥0} L(x, λ) = { f(x)  if g(x) ≤ 0
                          ∞     otherwise
    max_{λᵢ≥0} L(x, λ)  ⇒  { λᵢ = 0                          if gᵢ(x) < 0
                             0 = ∇_{λᵢ}L(x, λ) = gᵢ(x)       otherwise
  This implies either (λᵢ = 0 ∧ gᵢ(x) < 0) or gᵢ(x) = 0, which is exactly equivalent to the KKT conditions
• Again, optima (x*, λ*) are saddle points where
    min_x L enforces force balance and
    max_λ L enforces the KKT conditions
3:28
The Lagrange dual problem
• We define the Lagrange dual function as
    l(λ) = min_x L(x, λ)
• This implies two problems:
    min_x f(x) s.t. g(x) ≤ 0    (primal problem)
    max_λ l(λ) s.t. λ ≥ 0      (dual problem)
  The dual problem is convex, even if the primal is non-convex!
• Written more symmetrically:
    min_x max_{λ≥0} L(x, λ)    (primal problem)
    max_{λ≥0} min_x L(x, λ)    (dual problem)
  because max_{λ≥0} L(x, λ) ensures the constraints (previous slide).
3:29
The Lagrange dual problem
• The dual function is always a lower bound (for any λᵢ ≥ 0)
    l(λ) = min_x L(x, λ) ≤ [ min_x f(x) s.t. g(x) ≤ 0 ]
  And consequently
    max_{λ≥0} min_x L(x, λ) ≤ min_x max_{λ≥0} L(x, λ)
• We say strong duality holds iff
    max_{λ≥0} min_x L(x, λ) = min_x max_{λ≥0} L(x, λ)
• If the primal is convex, and there exists an interior point
    ∃x : ∀i : gᵢ(x) < 0
  (which is called the Slater condition), then we have strong duality
3:30
And what about algorithms?
• So far we've only introduced a whole lot of formalism, and seen that the Lagrangian sort of represents the constrained problem:
  – min_x L or ∇ₓL = 0 is related to the force balance
  – max_λ L or ∇_λL = 0 is related to the constraints or KKT conditions
  – This implies two dual problems, min_x max_λ L and max_λ min_x L; the second (dual) is a lower bound on the first (primal)
• But what are the algorithms we can get out of this?
3:31
Algorithmic implications of the Lagrangian view
• If min_x L(x, λ) can be solved analytically, we can alternatively solve the (convex) dual problem.
• But more generally:
    Optimization problem → solve the KKT conditions
  → Apply standard algorithms for solving an equation system r(x, λ) = 0:
    Newton method:  ∇r (∆x, ∆λ)ᵀ = −r
  This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, they use the curvature ∇²f to estimate the right λ to push out of the constraint. We will discuss this after we've learnt about 2nd order methods.
3:32
Log barrier method revisited
3:33
Log barrier method revisited
• Log barrier method: Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) − µ Σᵢ log(−gᵢ(x))
• For given µ the optimality condition is
    ∇f(x) − Σᵢ (µ/gᵢ(x)) ∇gᵢ(x) = 0
  or equivalently
    ∇f(x) + Σᵢ λᵢ∇gᵢ(x) = 0 ,  λᵢgᵢ(x) = −µ
  These are called modified (=approximate) KKT conditions.
3:34
Log barrier method revisited
Centering (the unconstrained minimization) in the log barrier
method is equivalent to solving the modified KKT conditions.
Note also: On the central path, the duality gap is mµ:
    l(λ*(µ)) = f(x*(µ)) + Σᵢ λᵢ*gᵢ(x*(µ)) = f(x*(µ)) − mµ
3:35
Phase I: Finding a feasible initialization
3:36
Phase I: Finding a feasible initialization
• An elegant method for finding a feasible point x:
    min_{(x,s)∈Rⁿ⁺¹} s   s.t.  ∀i : gᵢ(x) ≤ s,  s ≥ 0
  or
    min_{(x,s)∈Rⁿ⁺ᵐ} Σᵢ₌₁ᵐ sᵢ   s.t.  ∀i : gᵢ(x) ≤ sᵢ,  sᵢ ≥ 0
3:37
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:38
4 Second-Order Methods
2nd order gives better stepsize & direction, Newton methods, adaptive stepsize, Levenberg-Marquardt, Gauss-Newton method, Quasi-Newton methods, BFGS, primal-dual interior point Newton method
Planned Outline
• Gradient-based optimization (1st order methods)
  – plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
  – adaptive stepsize heuristics
• Constrained Optimization
  – squared penalties, augmented Lagrangian, log barrier
  – Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
  – Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
  – constrained case, primal-dual Newton
• Special convex cases
  – Linear Programming, (sequential) Quadratic Programming
  – Simplex algorithm
  – relation to relaxed discrete optimization
• Blackbox optimization (“0th order methods”)
  – blackbox stochastic search
  – Markov Chain Monte Carlo methods
  – evolutionary algorithms
4:1
• So far we relied on gradient-based methods only, in the unconstrained and constrained case
• Today: 2nd order methods, which approximate f(x) locally
  – using the 2nd order Taylor expansion (Hessian ∇²f(x) given)
  – estimating the Hessian from data
• 2nd order methods only work if the Hessian is everywhere positive definite ↔ f(x) is convex, or if it is approximated/modified to be pos. def. (as in Gauss-Newton)
• Note: Approximating f(x) locally or globally is a core concept also in blackbox optimization
Why can 2nd order optimization be better than gradient descent?
• Better direction: (figure comparing plain gradient, conjugate gradient, and 2nd order directions)
• Better stepsize:
  – a full step jumps directly to the minimum of the local squared approximation
  – often this is already a good heuristic
  – additional stepsize reduction and dampening are straightforward
4:3
Outline: 2nd order method
• Newton
• Gauss-Newton
• Quasi-Newton
• BFGS, (L)BFGS
• Their application on constrained problems
4:4
2nd order optimization
• Notation:
  objective function: f : Rⁿ → R
  gradient vector: ∇f(x) = [∂f(x)/∂x]ᵀ ∈ Rⁿ
  Hessian (symmetric matrix):
    ∇²f(x) =
      [ ∂²f/∂x₁∂x₁  ∂²f/∂x₁∂x₂  · · ·  ∂²f/∂x₁∂xₙ ]
      [ ∂²f/∂x₂∂x₁      ⋱                  ⋮      ]
      [ ∂²f/∂xₙ∂x₁  · · ·  · · ·  ∂²f/∂xₙ∂xₙ ]   ∈ Rⁿˣⁿ
  Taylor expansion:
    f(x′) ≈ f(x) + (x′ − x)ᵀ∇f(x) + ½ (x′ − x)ᵀ∇²f(x)(x′ − x)
• Problem:
    min_x f(x)
  where we can evaluate f(x), ∇f(x) and ∇²f(x) for any x ∈ Rⁿ
4:5
Newton method
• For finding roots (zero points) of f(x):
    x ← x − f(x)/f′(x)
• For finding optima of f(x) in 1D:
    x ← x − f′(x)/f′′(x)
  For x ∈ Rⁿ:
    x ← x − ∇²f(x)⁻¹∇f(x)
4:6
Newton method with adaptive stepsize α
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), ∇²f(x), tolerance θ
Output: x
1: initialize stepsize α = 1 and damping λ = 10⁻¹⁰
2: repeat
3:   compute ∆ to solve (∇²f(x) + λI) ∆ = −∇f(x)
4:   repeat // “line search”
5:     y ← x + α∆
6:     if f(y) ≤ f(x) then // step is accepted
7:       x ← y
8:       α ← α^0.5 // increase stepsize towards α = 1
9:     else // step is rejected
10:      α ← 0.1α // decrease stepsize
11:    end if
12:  until step accepted or (in bad case) α||∆||∞ < θ/1000
13: until ||∆||∞ < θ

• Notes:
  – Line 3 computes the (damped) Newton step ∆ = −∇²f(x)⁻¹∇f(x); use the special Lapack routine dposv to solve Ax = b (using Cholesky decomposition)
  – λ is called damping; it makes the parabola more “steep” around the current x.
    For λ → ∞: ∆ becomes colinear with −∇f(x) but |∆| → 0
4:7
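The algorithm above, transcribed to Python with `numpy.linalg.solve` standing in for the dposv call (a sketch; the convex quartic test function is made up):

```python
import numpy as np

def newton_adaptive(f, grad, hess, x, theta=1e-8, damping=1e-10, max_iter=1000):
    # Newton method with adaptive stepsize and tiny fixed damping (slide 4:7)
    x = np.asarray(x, dtype=float)
    alpha = 1.0
    for _ in range(max_iter):
        H = hess(x) + damping * np.eye(len(x))
        delta = np.linalg.solve(H, -grad(x))   # Newton step (slide: dposv/Cholesky)
        if np.max(np.abs(delta)) < theta:
            break
        while True:                            # stepsize adaptation ("line search")
            y = x + alpha * delta
            if f(y) <= f(x):                   # step accepted
                x = y
                alpha = alpha ** 0.5           # increase stepsize towards 1
                break
            alpha *= 0.1                       # step rejected: decrease stepsize
            if alpha * np.max(np.abs(delta)) < theta / 1000:
                break
    return x

# convex non-quadratic test: f(x) = sum_i (x_i^4 + x_i^2), minimum at 0
f = lambda x: np.sum(x**4 + x**2)
grad = lambda x: 4.0 * x**3 + 2.0 * x
hess = lambda x: np.diag(12.0 * x**2 + 2.0)
x_min = newton_adaptive(f, grad, hess, [2.0, -3.0])
```

Because the test function is convex, the full Newton step is always accepted here and α stays at 1.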
Newton method with adaptive damping λ (Levenberg-Marquardt)
(I usually use stepsize adaptation instead of Levenberg-Marquardt)

Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), ∇²f(x), tolerance θ
Output: x
1: initialize damping λ = 10⁻¹⁰
2: repeat
3:   compute ∆ to solve (∇²f(x) + λI) ∆ = −∇f(x)
4:   if f(x + ∆) ≤ f(x) then // step is accepted
5:     x ← x + ∆
6:     λ ← 0.2λ // decrease damping
7:   else // step is rejected
8:     λ ← 10λ // increase damping
9:   end if
10: until λ < 1 and ||∆||∞ < θ
4:8
Computational issues
• Let
  C_f be the computational cost of evaluating f(x) only
  C_eval be the computational cost of evaluating f(x), ∇f(x), ∇²f(x)
  C_∆ be the computational cost of solving (∇²f(x) + λI) ∆ = −∇f(x)
• If C_eval ≫ C_f → proper line search instead of stepsize adaptation
  If C_∆ ≫ C_f → proper line search instead of stepsize adaptation
• However, in many applications (in robotics at least) C_eval ≈ C_f ≫ C_∆
• Often, ∇²f(x) is banded (non-zero around the diagonal only)
  → Ax = b becomes super fast using dpbsv (dynamic programming)
  (If ∇²f(x) is a “tree”: dynamic programming on the “junction tree”)
4:9
Demo
4:10
Gauss-Newton method
• Problem:
    min_x f(x)  where  f(x) = φ(x)ᵀφ(x)
  and we can evaluate φ(x), ∇φ(x) for any x ∈ Rⁿ
• φ(x) ∈ Rᵈ is a vector; each entry contributes a squared cost term to f(x)
• ∇φ(x) is the Jacobian (d × n matrix)
    ∇φ(x) =
      [ ∂φ₁/∂x₁  ∂φ₁/∂x₂  · · ·  ∂φ₁/∂xₙ ]
      [ ∂φ₂/∂x₁      ⋱              ⋮    ]
      [ ∂φ_d/∂x₁  · · ·  · · ·  ∂φ_d/∂xₙ ]   ∈ Rᵈˣⁿ
  with 1st-order Taylor expansion φ(x′) ≈ φ(x) + ∇φ(x)(x′ − x)
4:11
Gauss-Newton method
• The gradient and Hessian of f(x) become
    f(x) = φ(x)ᵀφ(x)
    ∇f(x) = 2∇φ(x)ᵀφ(x)
    ∇²f(x) = 2∇φ(x)ᵀ∇φ(x) + 2φ(x)ᵀ∇²φ(x)
• The Gauss-Newton method is the Newton method for f(x) = φ(x)ᵀφ(x) with the approximation ∇²φ(x) ≈ 0
  The approximate Hessian 2∇φ(x)ᵀ∇φ(x) is always semi-pos-def!
• In the Newton algorithm, replace line 3 by
    3: compute ∆ to solve (∇φ(x)ᵀ∇φ(x) + λI) ∆ = −∇φ(x)ᵀφ(x)
4:12
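A Gauss-Newton sketch on a small nonlinear least-squares problem (the exponential-fit data below is synthetic, generated from a = 2, b = −1, so the residual φ can reach exactly zero):

```python
import numpy as np

def gauss_newton(phi, jac, x, damping=1e-8, theta=1e-8, max_iter=500):
    # Gauss-Newton (slide 4:12): for f(x) = phi(x)^T phi(x), solve
    # (J^T J + damping*I) delta = -J^T phi  instead of using the exact Hessian
    x = np.asarray(x, dtype=float)
    f = lambda x: phi(x) @ phi(x)
    alpha = 1.0
    for _ in range(max_iter):
        J, p = jac(x), phi(x)
        delta = np.linalg.solve(J.T @ J + damping * np.eye(len(x)), -J.T @ p)
        if np.max(np.abs(delta)) < theta:
            break
        y = x + alpha * delta
        if f(y) <= f(x):                   # step accepted
            x, alpha = y, alpha ** 0.5
        else:                              # step rejected
            alpha *= 0.1
    return x

# example least-squares problem: fit a*exp(b*t) to synthetic samples
t = np.array([0.0, 0.5, 1.0, 1.5])
y_data = 2.0 * np.exp(-t)
phi = lambda x: x[0] * np.exp(x[1] * t) - y_data          # residual vector in R^4
jac = lambda x: np.stack([np.exp(x[1] * t),               # d phi / d a
                          x[0] * t * np.exp(x[1] * t)],   # d phi / d b
                         axis=1)
x_fit = gauss_newton(phi, jac, [1.5, -0.5])
```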
Quasi-Newton methods
4:13
Quasi-Newton methods
• Let's take a step back: Assume we cannot evaluate ∇²f(x). Can we still use 2nd order methods?
• Yes: We can approximate ∇²f(x) from the data {(xᵢ, ∇f(xᵢ))}ᵢ₌₁ᵏ of previous iterations
4:14
Basic example
• We've seen already two data points (x₁, ∇f(x₁)) and (x₂, ∇f(x₂))
  How can we estimate ∇²f(x)?
• In 1D:
    ∇²f(x) ≈ (∇f(x₂) − ∇f(x₁)) / (x₂ − x₁)
• In Rⁿ: let y = ∇f(x₂) − ∇f(x₁), ∆x = x₂ − x₁; we require
    ∇²f(x) ∆x = y    and    ∆x = ∇²f(x)⁻¹ y
  and satisfy these with
    ∇²f(x) = (y yᵀ)/(yᵀ∆x)        ∇²f(x)⁻¹ = (∆x ∆xᵀ)/(∆xᵀy)
  Convince yourself that this solves the desired relations.
  [Left: how to update ∇²f(x). Right: how to update directly ∇²f(x)⁻¹.]
4:15
BFGS
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H⁻¹ = Iₙ
2: repeat
3:   compute ∆ = −H⁻¹∇f(x)
4:   perform a line search min_α f(x + α∆)
5:   ∆ ← α∆
6:   y ← ∇f(x + ∆) − ∇f(x)
7:   x ← x + ∆
8:   update H⁻¹ ← (I − (y∆ᵀ)/(∆ᵀy))ᵀ H⁻¹ (I − (y∆ᵀ)/(∆ᵀy)) + (∆∆ᵀ)/(∆ᵀy)
9: until ||∆||∞ < θ

• Notes:
  – The term (∆∆ᵀ)/(∆ᵀy) is the H⁻¹-update as on the previous slide
  – The factors (I − (y∆ᵀ)/(∆ᵀy)) “delete” previous H⁻¹-components
4:16
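The BFGS pseudocode in Python (the crude halving line search and the curvature-condition skip are my assumptions; the slide leaves the line search abstract):

```python
import numpy as np

def bfgs(f, grad, x, theta=1e-8, max_iter=500):
    # BFGS (slide 4:16): maintain an estimate Hinv of the inverse Hessian,
    # built from (step, gradient-difference) pairs
    x = np.asarray(x, dtype=float)
    n = len(x)
    Hinv = np.eye(n)
    for _ in range(max_iter):
        delta = -Hinv @ grad(x)
        alpha = 1.0
        while f(x + alpha * delta) > f(x) and alpha > 1e-14:  # crude line search
            alpha *= 0.5
        delta = alpha * delta
        y = grad(x + delta) - grad(x)
        x = x + delta
        if np.max(np.abs(delta)) < theta:
            break
        yd = y @ delta
        if yd > 1e-12:                     # curvature condition; else skip update
            E = np.eye(n) - np.outer(y, delta) / yd
            Hinv = E.T @ Hinv @ E + np.outer(delta, delta) / yd  # slide's line 8
    return x

# ill-conditioned quadratic test: f(x) = x^T diag(1, 10) x
D = np.diag([1.0, 10.0])
f = lambda x: x @ D @ x
grad = lambda x: 2.0 * D @ x
x_min = bfgs(f, grad, [3.0, -2.0])
```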
Quasi-Newton methods
• BFGS is the most popular of all Quasi-Newton methods.
  Others exist, which differ in the exact H⁻¹-update
• L-BFGS (limited memory BFGS) is a version which does not require to explicitly store H⁻¹ but instead stores the previous data {(xᵢ, ∇f(xᵢ))}ᵢ₌₁ᵏ and manages to compute ∆ = −H⁻¹∇f(x) directly from this data
• Some thoughts:
  In principle, there are alternative ways to estimate H⁻¹ from the data {(xᵢ, f(xᵢ), ∇f(xᵢ))}ᵢ₌₁ᵏ, e.g. using Gaussian Process regression with derivative observations
  – Not only the derivatives but also the values f(xᵢ) should give information on H(x) for non-quadratic functions
  – Should one weight ‘local’ data stronger than ‘far away’ data? (GP covariance function)
4:17
2nd Order Methods for Constrained Optimization
4:18
2nd Order Methods for Constrained Optimization
• No changes at all for
– log barrier
– augmented Lagrangian
– squared penalties
Directly use (Gauss-)Newton/BFGS → will boost performance
of these constrained optimization methods!
4:19
Primal-Dual interior-point Newton Method
• Reconsider slide 3:32 (Algorithmic implications of the Lagrangian
view)
• A core outcome of the Lagrangian theory was the shift in prob-
lem formulation:
find x to minx f(x) s.t. g(x) ≤ 0
→ find x to solve the KKT conditions
4:20
Primal-Dual interior-point Newton Method
• The first and last of the modified (= approximate) KKT conditions
∇f(x) + ∑mi=1 λi∇gi(x) = 0 (“force balance”)
∀i : gi(x) ≤ 0 (primal feasibility)
∀i : λi ≥ 0 (dual feasibility)
∀i : λigi(x) = −µ (complementarity)
can be written as the (n+m)-dimensional equation system
r(x, λ) = 0 ,   r(x, λ) := ( ∇f(x) + ∇g(x)>λ ,  −diag(λ)g(x) − µ1m )
• Newton method to find the root r(x, λ) = 0:
(x, λ) ← (x, λ) − ∇r(x, λ)-1 r(x, λ)
∇r(x, λ) = ( ∇2f(x) + ∑i λi∇2gi(x)    ∇g(x)>      )
           ( −diag(λ)∇g(x)            −diag(g(x)) )   ∈ R(n+m)×(n+m)
4:21
Primal-Dual interior-point Newton Method
• The method requires the Hessians ∇2f(x) and ∇2gi(x)
– One can approximate the constraint Hessians ∇2gi(x) ≈ 0
– Gauss-Newton case: f(x) = φ(x)>φ(x) only requires ∇φ(x)
• This primal-dual method does a joint update of both
– the solution x
– the Lagrange multipliers (constraint forces) λ
No need for nested iterations, as with penalty/barrier methods!
• The above formulation allows for a duality gap µ; choose µ = 0
or consult Boyd how to update on the fly (sec 11.7.3)
• The feasibility constraints gi(x) ≤ 0 and λi ≥ 0 need to be
handled explicitly by the root finder (the line search needs to
ensure these constraints)
4:22
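As a concrete illustration, here is a small Python sketch of one such primal-dual Newton iteration on a made-up 1D toy problem, min (x−2)² s.t. x−1 ≤ 0 (solution x* = 1, λ* = 2); the feasibility safeguards noted above are omitted for brevity:

```python
import numpy as np

mu = 1e-6                          # duality-gap parameter of the modified KKT conditions

def r(z):                          # the (n+m)-dimensional residual r(x, lambda)
    x, lam = z
    return np.array([2 * (x - 2) + lam,        # force balance
                     -lam * (x - 1) - mu])     # perturbed complementarity

def grad_r(z):                     # the Jacobian of r from the slide
    x, lam = z
    return np.array([[2.0,  1.0],              # [ d2f + lam d2g ,  dg      ]
                     [-lam, -(x - 1)]])        # [ -diag(lam) dg , -diag(g) ]

z = np.array([0.0, 1.0])           # start strictly feasible: g(0) < 0, lambda > 0
for _ in range(20):                # Newton iterations on r(x, lambda) = 0
    z = z - np.linalg.solve(grad_r(z), r(z))
x_opt, lam_opt = z
```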
Planned Outline
• Gradient-based optimization (1st order methods)
– plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
– adaptive stepsize heuristics
• Constrained Optimization
– squared penalties, augmented Lagrangian, log barrier
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
– constrained case, primal-dual Newton
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– relation to relaxed discrete optimization
• Black box optimization (“0th order methods”)
– blackbox stochastic search
– Markov Chain Monte Carlo methods
– evolutionary algorithms
4:23
5 Convex Optimization
Convex, quasiconvex, unimodal, convex optimization problem, lin-
ear program (LP), standard form, simplex algorithm, LP-relaxation
of integer linear programs, quadratic programming (QP), sequential
quadratic programming
Planned Outline
• Gradient-based optimization (1st order methods)
– plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
– adaptive stepsize heuristics
• Constrained Optimization
– squared penalties, augmented Lagrangian, log barrier
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
– constrained case, primal-dual Newton
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– relation to relaxed discrete optimization
• Black box optimization (“0th order methods”)
– blackbox stochastic search
– Markov Chain Monte Carlo methods
– evolutionary algorithms
5:1
Function types
• A function is defined convex iff
f(a x + (1−a) y) ≤ a f(x) + (1−a) f(y)
for all x, y ∈ Rn and a ∈ [0, 1].
• A function is quasiconvex iff
f(a x + (1−a) y) ≤ max{f(x), f(y)}
for any x, y ∈ Rn and a ∈ [0, 1].
..alternatively, iff every sublevel set {x|f(x) ≤ α} is convex.
• [Subjective!] I call a function unimodal iff it has only 1 local
minimum, which is the global minimum
Note: in dimensions n > 1 quasiconvexity is stronger than unimodality
• A general non-linear function is unconstrained and can have
multiple local minima
5:2
convex ⊂ quasiconvex ⊂ unimodal ⊂ general
5:3
Local optimization
• So far I avoided making explicit assumptions about problem con-
vexity: To emphasize that all methods we considered – except
for Newton – are applicable also on non-convex problems.
• The methods we considered are local optimization methods,
which can be defined as
– a method that adapts the solution locally
– a method that is guaranteed to converge to a local minimum
only
• Local methods are efficient
– if the problem is (strictly) unimodal (strictly: no plateaux)
– if time is critical and a local optimum is a sufficiently good
solution
– if the algorithm is restarted very often to hit multiple local op-
tima
5:4
Convex problems
• Convexity is a strong assumption!
• Nevertheless, convex problems are important
– theoretically (convergence proofs!)
– for many real world applications
5:5
Convex problems
• A constrained optimization problem
minx f(x) s.t. g(x) ≤ 0, h(x) = 0
is called convex iff
– f is convex
– each gi, i = 1, ..,m is convex
– h is linear: h(x) = Ax− b, A ∈ Rl×n, b ∈ Rl
• Alternative definition: f convex and the feasible region a convex set
5:6
Linear and Quadratic Programs
• Linear Program (LP)
minx c>x s.t. Gx ≤ h, Ax = b
LP in standard form
minx c>x s.t. x ≥ 0, Ax = b
• Quadratic Program (QP)
minx ½ x>Qx + c>x s.t. Gx ≤ h, Ax = b
where Q is positive definite.
(One also defines Quadratically Constrained Quadratic Programs (QCQP))
5:7
Transforming an LP problem into standard form
• LP problem:
minx c>x s.t. Gx ≤ h, Ax = b
• Define slack variables:
minx,ξ c>x s.t. Gx + ξ = h, Ax = b, ξ ≥ 0
• Express x = x+ − x− with x+, x− ≥ 0:
minx+,x−,ξ c>(x+ − x−)
s.t. G(x+ − x−) + ξ = h, A(x+ − x−) = b, ξ ≥ 0, x+ ≥ 0, x− ≥ 0
where (x+, x−, ξ) ∈ R2n+m
• Now this conforms to the standard form (replacing (x+, x−, ξ) ≡ x, etc.)
minx c>x s.t. x ≥ 0, Ax = b
5:8
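The transformation can be written down mechanically. Below is a small numpy sketch (function and variable names are my own) that builds the standard-form data from (c, G, h, A, b) and lifts a feasible point of the original LP into the new variables:

```python
import numpy as np

def to_standard_form(c, G, h, A, b):
    # lifted variable z = (x+, x-, xi) >= 0; minimize c_std^T z s.t. A_std z = b_std
    n, m = len(c), len(h)
    c_std = np.concatenate([c, -c, np.zeros(m)])
    A_std = np.block([[G, -G, np.eye(m)],                    # G x + xi = h
                      [A, -A, np.zeros((A.shape[0], m))]])   # A x = b
    b_std = np.concatenate([h, b])
    return c_std, A_std, b_std

# toy LP: min x1 + x2  s.t.  -x <= 0  and  x1 + x2 = 1
c = np.array([1.0, 1.0])
G, h = -np.eye(2), np.zeros(2)
A, b = np.array([[1.0, 1.0]]), np.array([1.0])
c_std, A_std, b_std = to_standard_form(c, G, h, A, b)

# lift a feasible x of the original LP: x = x+ - x-, slack xi = h - G x
x = np.array([0.3, 0.7])
z = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0), h - G @ x])
```

One can check that z is feasible for the standard form and has the same objective value as x in the original LP.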
Linear Programming
– Algorithms
– Application: LP relaxation of discrete problems
5:9
Algorithms for Linear Programming
• All of which we know!
– augmented Lagrangian (LANCELOT software), penalty
– log barrier (“interior point method”, “[central] path following”)
– primal-dual Newton
• The simplex algorithm, walking on the constraints
(The emphasis in the notion of interior point methods is to dis-
tinguish from constraint walking methods.)
• Interior point and simplex methods are comparably efficient
Which is better depends on the problem
5:10
Simplex Algorithm
Georg Dantzig (1947)
Note: Not to be confused with the Nelder-Mead method (downhill simplex method)
• We consider an LP in standard form
minx c>x s.t. x ≥ 0, Ax = b
• Note that in a linear program the optimum is always situated at
a corner
5:11
Simplex Algorithm
• The Simplex Algorithm walks along the edges of the polytope,
at every corner choosing the edge that decreases c>x most
• This either terminates at a corner, or leads to an unconstrained
edge (−∞ optimum)
• In practice this procedure is done by “pivoting on the simplex tableau”
5:12
Simplex Algorithm
• The simplex algorithm is often efficient, but in worst case expo-
nential in n and m.
• Interior point methods (log barrier) and, more recently again,
augmented Lagrangian methods have become somewhat more
popular than the simplex algorithm
5:13
LP-relaxations of discrete problems
5:14
Integer linear programming
• An integer linear program (for simplicity binary) is
minx c>x s.t. Ax = b, xi ∈ {0, 1}
• Examples:
– Traveling Salesman: minxij ∑ij cijxij with xij ∈ {0, 1} and several more constraints (e.g. rows and columns of x sum to 1)
– (max)SAT problem: in conjunctive normal form, each clause contributes an additional variable and a term in the objective function; each clause contributes a constraint
Google: “The Power of Semidefinite Programming Relaxations for MAXSAT”
5:15
LP relaxations of integer linear programs
• Instead of solving
minx c>x s.t. Ax = b, xi ∈ {0, 1}
we solve
minx c>x s.t. Ax = b, x ∈ [0, 1]
• Clearly, the relaxed solution is a lower bound on the integer solution (sometimes also called “outer bound” because [0, 1] ⊃ {0, 1})
• Computing the relaxed solution is interesting
– as an “approximation” or initialization to the integer problem
– to be aware of the lower bound (what is achievable)
– in cases where the optimal relaxed solution happens to be
integer
5:16
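As a toy illustration (assuming SciPy is available; the problem data is made up), one can solve the LP-relaxation of a tiny integer program and observe that here the relaxed optimum happens to be integer:

```python
import numpy as np
from scipy.optimize import linprog

# integer program: max x1 + 2 x2  s.t.  x1 + x2 <= 1,  x in {0,1}^2
# LP-relaxation: replace x_i in {0,1} by x_i in [0,1]
c = np.array([-1.0, -2.0])                  # linprog minimizes, so negate
A_ub, b_ub = np.array([[1.0, 1.0]]), np.array([1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, 1)])
x_relaxed = res.x                           # the relaxed optimum is the corner (0, 1)
```

Since the relaxed optimum (0, 1) is already integer, it solves the integer program too.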
Example: MAP inference in MRFs
• Given integer random variables xi, i = 1, .., n, a pairwise Markov Random Field (MRF) is defined as
f(x) = ∑(ij)∈E fij(xi, xj) + ∑i fi(xi)
where E denotes the set of edges.
(Note: any general (non-pairwise) MRF can be converted into a pairwise one, blowing up the number of variables)
• Reformulate with different variables
bi(x) = [xi = x] , bij(x, y) = [xi = x] [xj = y]
These are nm + |E|m2 binary variables
• The indicator variables need to fulfil the constraints
bi(x), bij(x, y) ∈ {0, 1}
∑x bi(x) = 1 (because xi takes exactly one value)
∑y bij(x, y) = bi(x) (consistency between indicators)
5:17
Example: MAP inference in MRFs
• Finding maxx f(x) of an MRF is then equivalent to
maxbi(x),bij(x,y) ∑(ij)∈E ∑x,y bij(x, y) fij(x, y) + ∑i ∑x bi(x) fi(x)
such that
bi(x), bij(x, y) ∈ {0, 1} , ∑x bi(x) = 1 , ∑y bij(x, y) = bi(x)
• The LP-relaxation replaces the constraints by
bi(x), bij(x, y) ∈ [0, 1] , ∑x bi(x) = 1 , ∑y bij(x, y) = bi(x)
This set of feasible b’s is called the marginal polytope (because it describes a space of “probability distributions” that are marginally consistent (but not necessarily globally normalized!))
5:18
Example: MAP inference in MRFs
• Solving the original MAP problem is NP-hard
Solving the LP-relaxation is really efficient
• If the solution of the LP-relaxation turns out to be integer, we’ve
solved the originally NP-hard problem!
If not, the relaxed problem can be discretized to be a good ini-
tialization for discrete optimization
• For binary attractive MRFs (a common case) the solution will
always be integer
5:19
Quadratic Programming
5:20
Quadratic Programming
minx ½ x>Qx + c>x s.t. Gx ≤ h, Ax = b
(The dual of a QP is again a QP)
• Efficient Algorithms:
– Interior point (log barrier)
– Augmented Lagrangian
– Penalty
• Highly relevant applications:
– Support Vector Machines
– Similar types of max-margin modelling methods
5:21
Sequential Quadratic Programming
• We considered general non-linear problems
minx f(x) s.t. g(x) ≤ 0
where we can evaluate f(x), ∇f(x), ∇2f(x) and g(x), ∇g(x), ∇2g(x) for any x ∈ Rn
→ Newton method
• The standard step direction ∆ solves (∇2f(x) + λI) ∆ = −∇f(x)
• Sometimes a better step direction ∆ can be found by solving the local QP-approximation to the problem
min∆ f(x) + ∇f(x)>∆ + ½ ∆>∇2f(x)∆ s.t. g(x) + ∇g(x)>∆ ≤ 0
This is an optimization problem over ∆ and only requires the evaluation of f(x), ∇f(x), ∇2f(x), g(x), ∇g(x) once.
5:22
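SciPy's SLSQP routine implements a variant of this scheme (sequential least-squares QP); a minimal usage sketch on a made-up problem, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

# made-up problem: min x1^2 + x2^2  s.t.  1 - x1 - x2 <= 0
# (SLSQP expects inequality constraints in the form fun(x) >= 0)
res = minimize(fun=lambda x: x @ x,
               x0=np.array([2.0, 0.0]),
               jac=lambda x: 2 * x,
               method='SLSQP',
               constraints=[{'type': 'ineq',
                             'fun': lambda x: x[0] + x[1] - 1.0}])
x_opt = res.x            # analytic solution: (1/2, 1/2)
```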
6 Stochastic Search & Heuristics
Blackbox optimization, stochastic search, (µ+λ)-ES, CMA-ES, Evolutionary Algorithms, Simulated Annealing, Hill Climbing, Nelder-Mead downhill simplex
“Blackbox Optimization”
• The term is not really well defined
– I use it to express that only f(x) can be evaluated
– ∇f(x) or ∇2f(x) are not (directly) accessible
More common terms:
• Global optimization
– This usually emphasizes that methods should not get stuck in local optima
– Very very interesting domain – close analogies to (active) Machine Learning, bandits, POMDPs, optimal decision making/planning, optimal experimental design
– Usually mathematically well founded methods
• Stochastic search or Evolutionary Algorithms or Local Search
– Usually these are local methods (extensions trying to be “more” global)
– Various interesting heuristics
– Some of them (implicitly or explicitly) locally approximate gradients or 2nd order models
6:1
Blackbox Optimization
• Problem: Let x ∈ Rn, f : Rn → R, find
minx f(x)
where we can only evaluate f(x) for any x ∈ Rn
• A constrained version: Let x ∈ Rn, f : Rn → R, g : Rn → {0, 1}, find
minx f(x) s.t. g(x) = 1
where we can only evaluate f(x) and g(x) for any x ∈ Rn
I haven’t seen much work on this. Would be interesting to consider this more rigorously.
6:2
A zoo of approaches
• People with many different backgrounds are drawn into this, ranging from heuristics and Evolutionary Algorithms to heavy mathematics
– Evolutionary Algorithms, esp. Evolution Strategies, Covariance Matrix Adaptation, Estimation of Distribution Algorithms
– Simulated Annealing, Hill Climbing, Downhill Simplex
– local modelling (gradient/Hessian), global modelling
6:3
Optimizing and Learning
• Blackbox optimization is often related to learning:
• When we have a local gradient or Hessian, we can take that local information and run – no need to keep track of the history or learn (exception: BFGS)
• In the Blackbox case we have no local information directly accessible
→ one needs to account for the history in some way or another to have an idea where to continue the search
• “Accounting for the history” very often means learning: Learning
a local or global model of f itself, learning which steps have
been successful recently (gradient estimation), or which step
directions, or other heuristics
6:4
Outline
• Stochastic Search
– A simple framework that many heuristics and local modelling approaches fit in
– Evolutionary Algorithms, Covariance Matrix Adaptation, EDAs as special cases
• Heuristics
– Simulated Annealing
– Hill Climbing
– Downhill Simplex
• Global Optimization
– Framing the big problem: The optimal solution to optimization
– Mentioning very briefly No Free Lunch Theorems
– Greedy approximations, Kriging-type methods
6:5
Stochastic Search
6:6
Stochastic Search
• The general recipe:
– The algorithm maintains a probability distribution pθ(x)
– In each iteration it takes n samples {xi}ni=1 ∼ pθ(x)
– Each xi is evaluated → data {(xi, f(xi))}ni=1
– That data is used to update θ
• Stochastic Search:
Input: initial parameter θ, function f(x), distribution model pθ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:    Sample {xi}ni=1 ∼ pθ(x)
3:    Evaluate samples, D = {(xi, f(xi))}ni=1
4:    Update θ ← h(θ, D)
5: until θ converges
6:7
Stochastic Search
• The parameter θ is the only “knowledge/information” that is be-
ing propagated between iterations
θ encodes what has been learned from the history
θ defines where to search in the future
• Evolutionary Algorithms: θ is a parent population
Evolution Strategies: θ defines a Gaussian with mean & vari-
ance
Estimation of Distribution Algorithms: θ are parameters of
some distribution model, e.g. Bayesian Network
Simulated Annealing: θ is the “current point” and a temperature
6:8
Example: Gaussian search distribution (µ, λ)-ES
From the 1960s/70s: Rechenberg/Schwefel
• Perhaps the simplest type of distribution model
θ = (x) , pθ(x) = N(x|x, σ2)
an n-dimensional isotropic Gaussian with fixed deviation σ
• Update heuristic:
– Given D = {(xi, f(xi))}λi=1, select µ best: D′ = bestOfµ(D)
– Compute the new mean x from D′
• This algorithm is called “Evolution Strategy (µ, λ)-ES”
– The Gaussian is meant to represent a “species”
– λ offspring are generated
– the best µ selected
6:9
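The update heuristic above fits in a few lines of Python; a minimal sketch (the parameter values are arbitrary choices of mine):

```python
import numpy as np

def mu_lambda_es(f, x0, sigma=0.1, mu=5, lam=20, iters=200, rng=None):
    # theta = mean of an isotropic Gaussian with fixed standard deviation sigma
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        offspring = x + sigma * rng.standard_normal((lam, len(x)))  # lambda offspring
        values = np.array([f(s) for s in offspring])
        elite = offspring[np.argsort(values)[:mu]]                  # select the mu best
        x = elite.mean(axis=0)                                      # new mean
    return x

x_opt = mu_lambda_es(lambda x: x @ x, [1.0, 1.0])
```

Note that with fixed σ the mean keeps fluctuating around the optimum at the scale of σ; adapting σ (or the full covariance, as in CMA-ES below) removes this limitation.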
Example: “elitarian” selection (µ+ λ)-ES
• θ also stores the µ best previous points
θ = (x, D′) , pθ(x) = N(x|x, σ2)
• The θ update:
– Select the µ best from D′ ∪D: D′ = bestOfµ(D′ ∪D)
– Compute the new mean x from D′
• Is called “elitarian” because good parents can survive
• Consider the (1 + 1)-ES: a Hill Climber
• There is considerable theory on convergence of, e.g., (1+λ)-ES
6:10
Evolutionary Algorithms (EAs)
• These were two simple examples of EAs
Generally, I think EAs can well be described/understood as very
special kinds of parameterizing pθ(x) and updating θ
– The θ typically is a set of good points found so far (parents)
– Mutation & Crossover define pθ(x)
– The samples D are called offspring
– The θ-update is often a selection of the best,
or “fitness-proportional” or rank-based
• Categories of EAs:
– Evolution Strategies: x ∈ Rn, often Gaussian pθ(x)
– Genetic Algorithms: x ∈ {0, 1}n, crossover & mutation de-
fine pθ(x)
– Genetic Programming: x are programs/trees, crossover &
mutation
– Estimation of Distribution Algorithms: θ directly defines
pθ(x)
6:11
Covariance Matrix Adaptation (CMA-ES)
• An obvious critique of the simple Evolution Strategies:
– The search distribution N(x|x, σ2) is isotropic
(no going forward, no preferred direction)
– The variance σ is fixed!
• Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
6:12
Covariance Matrix Adaptation (CMA-ES)
• In Covariance Matrix Adaptation
θ = (x, σ, C, pσ, pC) , pθ(x) = N(x|x, σ2C)
where C is the covariance matrix of the search distribution
• The θ maintains two more pieces of information: pσ and pC cap-
ture the “path” (motion) of the mean x in recent iterations
• Rough outline of the θ-update:
– Let D′ = bestOfµ(D) be the set of selected points
– Compute the new mean x from D′
– Update pσ and pC proportional to xk+1 − xk
– Update σ depending on |pσ|
– Update C depending on pcp>c (rank-1-update) and Var(D′)
6:13
CMA references
Hansen, N. (2006), ”The CMA evolution strategy: a comparing
review”
Hansen et al.: Evaluating the CMA Evolution Strategy on Multi-
modal Test Functions, PPSN 2004.
• For “large enough” populations local minima are avoided
• A variant:
Igel et al.: A Computational Efficient Covariance Matrix Update
and a (1 + 1)-CMA for Evolution Strategies, GECCO 2006.
6:14
CMA conclusions
• It is a good starting point for an off-the-shelf blackbox algorithm
• It includes components like estimating the local gradient (pσ, pC ),
the local “Hessian” (Var(D′)), smoothing out local minima (large
populations)
6:15
Estimation of Distribution Algorithms (EDAs)
• Generally, EDAs fit the distribution pθ(x) to model the distribution of previously good search points
For instance, if in all previous distributions the 3rd bit equals the 7th bit, then the search distribution pθ(x) should put higher probability on such candidates. pθ(x) is meant to capture the structure in previously good points, i.e. the dependencies/correlations between variables.
• A rather successful class of EDAs on discrete spaces uses graph-
ical models to learn the dependencies between variables, e.g.
Bayesian Optimization Algorithm (BOA)
• In continuous domains, CMA is an example for an EDA
6:16
Further Ideas
• We could learn a distribution over steps
– which steps have decreased f recently→ model
(Related to “differential evolution”)
• We could learn a distributions over directions only
→ sample one→ line search
6:17
Stochastic search conclusions
Input: initial parameter θ, function f(x), distribution model pθ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:    Sample {xi}ni=1 ∼ pθ(x)
3:    Evaluate samples, D = {(xi, f(xi))}ni=1
4:    Update θ ← h(θ, D)
5: until θ converges
• The framework is very general
• The crucial difference between algorithms is their choice of pθ(x)
6:18
Heuristics
– Simulated Annealing
– Hill Climbing
– Simplex
6:19
Simulated Annealing
• Must read!: An Introduction to MCMC for Machine Learning
Input: initial x, function f(x), proposal distribution q(x′|x)
Output: final x
1: initialize T = 1
2: repeat
3:    generate a new sample x′ ∼ q(x′|x)
4:    acceptance probability A = min{ 1, (e−f(x′)/T q(x|x′)) / (e−f(x)/T q(x′|x)) }
5:    With probability A, x ← x′ // ACCEPT
6:    Decrease T
7: until x converges
• Typically: q(x′|x) = N(x′|x, σ2), Gaussian transition probabilities
6:20
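A minimal Python sketch of this loop, with a symmetric Gaussian proposal (so the q-terms in the acceptance ratio cancel) and a simple 1/t cooling schedule (the schedule and all parameter values are my own choices):

```python
import numpy as np

def simulated_annealing(f, x0, sigma=0.5, iters=2000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    best, f_best = x.copy(), f(x)
    for t in range(iters):
        T = max(1.0 / (1 + t), 1e-10)                      # decrease the temperature
        x_new = x + sigma * rng.standard_normal(len(x))    # proposal q(x'|x) = N(x'|x, sigma^2)
        df = f(x_new) - f(x)
        if df <= 0 or rng.random() < np.exp(-df / T):      # Metropolis acceptance
            x = x_new                                      # ACCEPT
            if f(x) < f_best:
                best, f_best = x.copy(), f(x)
    return best

x_opt = simulated_annealing(lambda x: x @ x, [2.0, 2.0])
```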
Simulated Annealing
• Simulated Annealing is a Markov chain Monte Carlo (MCMC)
method.
• These are iterative methods to sample from a distribution, in our
case
p(x) ∝ e−f(x)/T
• For a fixed temperature T , one can show that the set of accepted
points is distributed as p(x) (but non-i.i.d.!)
• The acceptance probability compares the f(x′) and f(x), but
also the reversibility of q(x′|x)
• When cooling the temperature, samples focus at the extrema
• Guaranteed to sample all extrema eventually
6:21
Simulated Annealing
[Wikipedia: simulated annealing animation]
6:22
Hill Climbing
• Same as Simulated Annealing with T = 0
• Same as (1 + 1)-ES
There also exists a CMA version of (1+1)-ES, see the Igel reference above.
• The role of hill climbing should not be underestimated:
Very often it is efficient to repeat hill climbing from many random start points.
• However, no type of learning at all (stepsize, direction)
6:23
Nelder-Mead method – Downhill Simplex Method
6:24
Nelder-Mead method – Downhill Simplex Method
• Let x ∈ Rn
• Maintain n + 1 points x0, .., xn, sorted by f(x0) < ... < f(xn)
• Compute the center c of the points
• Reflect: y = c + α(c − xn)
• If f(y) < f(x0): Expand: y = c + γ(c − xn) with γ > 1
• If f(y) > f(xn-1): Contract: y = c + ρ(xn − c) with 0 < ρ < 1
6:25
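The full iteration, with the common extra shrink step and the standard coefficient choices α = 1, γ = 2, ρ = 1/2 (conventional values, not given on the slide; the center is conventionally taken over all points except the worst), can be sketched as:

```python
import numpy as np

def nelder_mead(f, x0, alpha=1.0, gamma=2.0, rho=0.5, iters=200):
    n = len(x0)
    # initial simplex: x0 plus unit steps along each axis
    pts = [np.array(x0, float)] + [np.array(x0, float) + np.eye(n)[i] for i in range(n)]
    for _ in range(iters):
        pts.sort(key=f)                          # f(x0) <= ... <= f(xn)
        c = np.mean(pts[:-1], axis=0)            # center of all but the worst point
        y = c + alpha * (c - pts[-1])            # reflect
        if f(y) < f(pts[0]):
            ye = c + gamma * (c - pts[-1])       # expand
            pts[-1] = ye if f(ye) < f(y) else y
        elif f(y) < f(pts[-2]):
            pts[-1] = y                          # accept the reflected point
        else:
            yc = c + rho * (pts[-1] - c)         # contract toward the worst point
            if f(yc) < f(pts[-1]):
                pts[-1] = yc
            else:                                # shrink the simplex toward the best point
                pts = [pts[0]] + [pts[0] + 0.5 * (p - pts[0]) for p in pts[1:]]
    pts.sort(key=f)
    return pts[0]

x_opt = nelder_mead(lambda x: (x[0] - 1) ** 2 + 10 * x[1] ** 2, [0.0, 0.0])
```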
7 Global Optimization
Multi-armed bandits, exploration vs. exploitation, navigation through
belief space, upper confidence bound (UCB), global optimization =
infinite bandits, Gaussian Processes, probability of improvement, ex-
pected improvement, UCB
Global Optimization
• Is there an optimal way to optimize (in the Blackbox case)?
• Is there a way to find the global optimum instead of only local?
7:1
Core references
• Jones, D., M. Schonlau, & W. Welch (1998). Efficient global
optimization of expensive black-box functions. Journal of Global
Optimization 13, 455-492.
• Jones, D. R. (2001). A taxonomy of global optimization methods
based on response surfaces. Journal of Global Optimization 21,
345-383.
• Poland, J. (2004). Explicit local models: Towards optimal opti-
mization algorithms. Technical Report No. IDSIA-09-04.
7:2
More up-to-date – very nice GP-UCB introduction
7:3
Outline
• Play a game
• Multi-armed bandits & Upper Confidence Bound (UCB)
• Optimization as infinite bandits; GPs as response surfaces
• Standard criteria:
– Upper Confidence Bound (UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
7:4
Multi-armed bandits
• There are n machines.
Each machine has an average reward fi – but you don’t know
the fi’s.
What do you do?
7:5
Multi-armed bandits
• Let at ∈ {1, .., n} be the choice of machine at time t
Let yt ∈ R be outcome with mean 〈yt〉 = fat
• A policy or strategy maps all the history to a new action:
π : [(a1, y1), (a2, y2), ..., (at-1, yt-1)] 7→ at
• Example objectives: find a policy π that
max 〈∑Tt=1 yt〉   or   max 〈yT〉
or other variants.
7:6
Exploration vs. Exploitation
• Such kinds of problems appear in many contexts
(Global Optimization, AI, Reinforcement Learning, etc)
• In simple domains (standard MDPs), actions influence the (ex-
ternal) world state→ actions navigate through the state space
In learning domains, actions influence your knowledge→ ac-
tions navigate through state and belief space
In multi-armed bandits, the bandits usually do not have an internal statevariable – they are the same every round.
7:7
Exploration vs. Exploitation
• The “knowledge” can be represented as the full history
ht = [(a1, y1), (a2, y2), ..., (at-1, yt-1)]
or, in the Bayesian thinking, as the belief
bt = P(X|ht) = P(ht|X) P(X) / P(ht)
where X is all the (unknown) properties of the world
• In the multi-armed bandit case:
X = (f1, .., fn)
bt = P(X|ht) = ∏i N(fi | yi,t, σi,t) (if bandits are Gaussian)
7:8
Navigating through Belief Space
[Figure: navigation through belief space – a chain b0 →(a1,y1)→ b1 →(a2,y2)→ b2 →(a3,y3)→ b3, where the outcomes yt depend on the unknown world properties X]
– Maximizing for 〈y3〉 requires to have a “good” b2
– Actions a1 and a2 should be planned to achieve the best possible b2
– Action a3 then greedily chooses the machine with highest yi,2
• Exploration: Choose the next action at to min 〈H(bt)〉
• Exploitation: Choose the next action at to max 〈yt〉
• Maximizing for 〈yT〉 (or similar) requires exploration and exploitation
Such policies can in principle be computed → POMDPs (or Lai & Robbins)
But in the following we discuss more efficient 1-step criteria
7:9
Upper Confidence Bound (UCB) selection
1: Initialization: Play each machine once
2: repeat
3:    Play the machine i that maximizes yi + √(2 ln n / ni)
4: until ...
where yi is the average reward of machine i so far, ni is how often machine i has been played so far, and n = ∑i ni is the number of rounds so far
(The ln n makes this work also for non-Gaussian bandits, e.g. heavy-tailed.)
See lane.compbio.cmu.edu/courses/slides_ucb.pdf for a summary of Auer et al.
7:10
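A minimal simulation of this selection rule on Gaussian bandits (the bandit means and all parameter values are invented for illustration):

```python
import numpy as np

def ucb1(means, T=10000, rng=None):
    # simulate Gaussian bandits: playing machine i yields reward ~ N(means[i], 1)
    rng = np.random.default_rng(0) if rng is None else rng
    k = len(means)
    counts = np.zeros(k)            # n_i: how often machine i was played
    sums = np.zeros(k)              # cumulative reward of machine i
    for i in range(k):              # initialization: play each machine once
        sums[i] += means[i] + rng.standard_normal()
        counts[i] += 1
    for t in range(k, T):
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        i = int(np.argmax(ucb))     # play the machine with the highest upper bound
        sums[i] += means[i] + rng.standard_normal()
        counts[i] += 1
    return counts

counts = ucb1([0.0, 0.5, 1.0])
# the best machine (mean 1.0) ends up being played by far the most often
```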
UCB algorithms
• UCB algorithms determine a confidence interval such that
yi − σi < fi < yi + σi
with high probability.
UCB chooses the upper bound of this confidence interval.
There is strong theory on the efficiency of this method in comparison to the optimum.
• UCB methods are also used for planning:
Upper Confidence Bounds for Trees (UCT)
7:11
How exactly is this related to global optimization?
7:12
Global Optimization = infinite bandits
• In global optimization f(x) defines a “reward” for every x ∈ Rn
– Instead of a finite number of actions at we now have xt
• Optimal Optimization could be defined as: find a π that
min 〈∑Tt=1 f(xt)〉   or   min 〈f(xT)〉
• In principle we know what an optimal optimization algorithm would have to do – it is just computationally infeasible (in general)
7:13
Gaussian Processes as belief
• Assume we have a history
ht = [(x1, y1), (x2, y2), ..., (xt-1, yt-1)]
• Gaussian Processes are a Machine Learning method that
– provides a mean estimate f(x) (response surface)
– provides a variance estimate σ2(x) ↔ confidence intervals
• Caveat: One needs to make assumptions about the kernel
(e.g., how smooth the function is)
7:14
1-step criteria based on GPs
• Maximize the Probability of Improvement (MPI)
xt = argmaxx ∫_{−∞}^{y∗} N(y | f(x), σ(x)) dy
• Maximize the Expected Improvement (EI)
xt = argmaxx ∫_{−∞}^{y∗} N(y | f(x), σ(x)) (y∗ − y) dy
• Maximize UCB
xt = argmaxx f(x) + βt σ(x)
[Often, βt = 1 is chosen. UCB theory allows for better choices. See Srinivas et al.]
7:15
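These criteria are easy to compute from a GP posterior. Below is a self-contained numpy sketch using a tiny hand-rolled RBF-kernel GP on made-up 1D data; since we minimize, the "UCB" rule becomes a lower confidence bound (all names and values are my own):

```python
import numpy as np
from math import erf, sqrt, pi

def gp_posterior(X, y, Xq, ell=0.3, noise=1e-6):
    # GP regression with RBF kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    k = lambda A, B: np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(Xq, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2))))   # standard normal CDF
phi = lambda z: np.exp(-0.5 * z ** 2) / sqrt(2 * pi)           # standard normal pdf

X = np.array([0.0, 0.5, 1.0])          # made-up evaluations (we minimize)
y = np.array([1.0, 0.2, 0.6])
y_star = y.min()                       # best value so far

Xq = np.linspace(0.0, 1.0, 101)
mean, std = gp_posterior(X, y, Xq)
z = (y_star - mean) / std
MPI = Phi(z)                                      # probability of improvement
EI = (y_star - mean) * Phi(z) + std * phi(z)      # expected improvement (closed form)
LCB = mean - 1.0 * std                            # confidence-bound rule, beta = 1
x_next = Xq[np.argmax(EI)]                        # next evaluation point
```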
Global Optimization
• Given data, we compute a belief over f(x)
• The belief expresses mean estimate f(x) and confidence σ(x)
– Use Gaussian Processes or other Bayesian ML methods
• Optimal Optimization would imply planning in belief space
• Efficient Global Optimization uses 1-step criteria
– Upper Confidence Bound (UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
• Global Optimization with gradient information
→ Gaussian Processes with derivative observations
7:19
8 Exercises
8.1 Exercise 1
8.1.1 Boyd & Vandenberghe
Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe, “Convex Optimization”. This is for you to get an impression of the book. Learn in particular about their categories of convex and non-linear optimization problems.
8.1.2 First steps
Consider the following functions over x ∈ Rn:

fsq(x) = x>x (2)
fhole(x) = 1 − exp(−x>x) (3)

These would be fairly simple to optimize. We change the conditioning (“skewedness of the Hessian”) of these functions to make them a bit more interesting.

Let c ∈ R be the conditioning parameter; let C be the diagonal matrix with entries C(i, i) = c^((i−1)/(2(n−1))). We define the test functions

f csq(x) = fsq(Cx) (4)
f chole(x) = fhole(Cx) (5)

In the following, use c = 100.
a) Implement these functions and display them over x ∈ [−1, 1]2. You can use any language, Octave/Matlab, Python, C++, R, whatever. Plotting is usually done by evaluating the function on a grid of points, e.g. in Octave

[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
X = [X0(:),X1(:)];
Y = sum(X.*X, 2);
Ygrid = reshape(Y,[20,20]);
hold on;
mesh(X0,X1,Ygrid);
hold off;

Or you can store the grid data in a file and use gnuplot, e.g.

splot [-1:1][-1:1] 'datafile' matrix us ($1/10-1):($2/10-1):3

b) Implement the fixed stepsize gradient descent method to find optima for these functions in n = 2 dimensions. Sample the starting point uniformly, x ∈ U([−3, 3]2), and choose α heuristically. (Ideally, display the optimization path in the plot.)
c) Implement the adaptive stepsize method.
8.2 Exercise 2
8.2.1 Equality Constraint Penalties and aug-
mented Lagrangian
(We don’t need to know what the Lagrangian is (yet) to solve this exercise.)
In the lecture we discussed the squared penalty method for inequality constraints. There is a straight-forward version for equality constraints: Instead of

minx f(x) s.t. h(x) = 0 (6)

we address

minx f(x) + µ ∑mi=1 hi(x)2 (7)

such that the squared penalty pulls the solution onto the constraint h(x) = 0. Assume that if we minimize (7) we end up at a solution x1 for which each hi(x1) is reasonably small, but not exactly zero.

We also mentioned the idea that we could add an additional term which counteracts the violation of the constraint. This can be realized by minimizing

minx f(x) + µ ∑mi=1 hi(x)2 + ∑mi=1 λihi(x) (8)

for a “good choice” of each λi. It turns out we can infer this “good choice” from the solution x1 of (7):

Prove that setting λi = 2µhi(x1) will, if we assume that the gradients ∇f(x) and ∇h(x) are (locally) constant, ensure that the minimum of (8) fulfils exactly the constraints h(x) = 0.

Tip: Think intuitively. Think about how the gradient that arises from the penalty in (7) is now generated via the λi.
8.2.2 Squared Penalties & Log Barriers (worth 2 points)

In the last exercise we defined the “hole function” f chole(x), where we now assume a conditioning c = 4.

Consider the optimization problem

minx f chole(x) s.t. g(x) ≤ 0 (9)
g(x) = ( x>x − 1 , xn + 1/c ) (10)

a) First, assume n = 2 (x ∈ R2 is 2-dimensional), c = 4, and draw on paper what the problem looks like and where you expect the optimum.

b) Implement the Squared Penalty Method. Choose as a start point x = (1/2, 1/2). Plot its optimization path and report on the number of total function/gradient evaluations needed.

c) Test the scaling of the method for n = 10 dimensions.

d) Implement the Log Barrier Method and test as in b) and c). Compare the function/gradient evaluations needed.
8.3 Exercise 3
8.3.1 Lagrangian and dual function
(Taken roughly from ‘Convex Optimization’, Ex. 5.1)
A simple example. Consider the optimization problem

minx x2 + 1 s.t. (x − 2)(x − 4) ≤ 0

with variable x ∈ R.

a) Give the feasible set, the optimal solution x∗, and the optimal value p∗ = f(x∗).

b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗, where p∗ is the optimal value of the primal problem.

c) Derive the dual function l(λ) and plot it (for λ ≥ 0). Derive the dual optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?
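The quantities asked for in a) and c) can be checked numerically on a grid; the following numpy sketch (grid sizes are arbitrary choices) computes p∗ and the dual function l(λ) and lets one verify the lower-bound property:

```python
import numpy as np

xs = np.linspace(-1.0, 6.0, 7001)
f = xs ** 2 + 1.0
feasible = (xs - 2.0) * (xs - 4.0) <= 0.0
p_star = f[feasible].min()                 # primal optimum: x* = 2, p* = 5

# Lagrangian L(x, lambda) = x^2 + 1 + lambda (x-2)(x-4); dual l(lambda) = min_x L
lambdas = np.linspace(0.0, 10.0, 201)
l = np.array([(f + lam * (xs - 2.0) * (xs - 4.0)).min() for lam in lambdas])
lam_star = lambdas[np.argmax(l)]           # analytic answer: lambda* = 2, l(2) = 5
```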
8.3.2 Phase I & Log Barriers
We again consider a constrained optimization problem very similar to the last exercise:

minx ∑ni=1 xi s.t. g(x) ≤ 0 (11)
g(x) = ( x>x − 1 , −x1 ) (12)

a) In the last exercise you’ve implemented basic constrained optimization methods (penalty or log barrier). Use these to find a feasible initialization (Phase I). Do this by solving the (n+1)-dimensional problem

min(x,s)∈Rn+1 s s.t. ∀i : gi(x) ≤ s, s ≥ 0

Initialize this with the infeasible point (1, 1) ∈ R2.

b) Once you’ve found a feasible point, use the standard log barrier method to find the solution to the original problem (11). Start with µ = 1, and decrease it by µ ← µ/10 in each iteration. In each iteration also report λi := −µ/gi(x) at the solution of the unconstrained problem minx f(x) − µ ∑i log(−gi(x)).
8.4 Exercise 4
8.4.1 Gauss-Newton
In x ∈ R2 consider the function
f(x) = φ(x)>φ(x) , φ(x) =
sin(ax1)
sin(acx2)
2x1
2cx2
The function is plotted above for a = 4 (left) and a = 5
(right, having local minima), and conditioning c = 1.The function is non-convex.
a) Implement the Gauss-Newton algorithm to solve the unconstrained minimization problem min_x f(x) for a random start point in x ∈ [−1, 1]^2. Compare the algorithm for a = 4 and a = 5 and conditioning c = 3 with gradient descent.
b) Optimize the function also using the fminunc routine from Octave. (Typically this uses BFGS internally.)
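For reference, a possible Gauss-Newton sketch for this f, assuming a = 4 and c = 3; the crude backtracking safeguard is an addition of mine, not part of the plain Gauss-Newton step:

```python
import numpy as np

a, c = 4.0, 3.0                                    # assumed exercise values

def phi(x):
    return np.array([np.sin(a * x[0]), np.sin(a * c * x[1]),
                     2 * x[0], 2 * c * x[1]])

def J(x):                                          # Jacobian of phi
    return np.array([[a * np.cos(a * x[0]), 0.0],
                     [0.0, a * c * np.cos(a * c * x[1])],
                     [2.0, 0.0],
                     [0.0, 2 * c]])

f = lambda x: phi(x) @ phi(x)

x = np.array([0.8, -0.6])                          # some start in [-1,1]^2
for it in range(100):
    # Gauss-Newton direction: solve (J^T J) d = -J^T phi
    d = np.linalg.solve(J(x).T @ J(x), -J(x).T @ phi(x))
    alpha = 1.0
    while f(x + alpha * d) > f(x):                 # crude backtracking
        alpha *= 0.5
    x = x + alpha * d
    if np.linalg.norm(alpha * d) < 1e-10: break
print(x, f(x))                                     # should converge to x = (0,0), f = 0
```

Note that J^T J is positive definite here (the linear terms 2x_1, 2c x_2 guarantee it), so the Gauss-Newton direction is always a descent direction.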
8.4.2 Newton method on a constrained problem
Use the Newton method to solve the same constrained problem we considered last time,
min_x Σ_{i=1}^n x_i   s.t.   g(x) ≤ 0

g(x) = ( x^T x − 1 ,  −x_1 )^T
You are free to choose the squared penalty, log barrier, or augmented Lagrangian method.
8.5 Exercise 5
Solving real-world problems involves two subproblems:
1) formulating the problem as an optimization problem (conforming to a standard optimization problem category) (→ human)

2) solving the actual optimization problem (→ algorithm)
In the lecture we've seen in some examples (maxSAT, Travelling Salesman, MRFs) that the first step is absolutely non-trivial, especially when trying to formulate problems as a Linear or Quadratic Program. Here is some more training on this. Exercises from Boyd et al: http://www.stanford.edu/˜boyd/cvxbook/bv_cvxbook.pdf
8.5.1 Network flow problem
Solve Exercise 4.12 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.
8.5.2 Minimum fuel optimal control
Solve Exercise 4.16 (pdf page 208) from Boyd & Vandenberghe, Convex Optimization.
8.5.3 Primal-Dual Newton for Quadratic Programming
Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 04:22) in the case of Quadratic Programming. Use the special method for solving block matrix linear equations using the Schur complement (Wikipedia “Schur complement”).
What is the update for a general Linear Program?
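A hedged sketch of the shape of the derivation one might arrive at (my sign conventions; slide 04:22 may use different ones). For the QP min_x ½xᵀQx + cᵀx s.t. Ax ≤ b, with D_λ = diag(λ) and D_g = diag(Ax − b) (negative entries when strictly feasible):

```latex
% KKT residual with barrier parameter \mu:
r_1 = Qx + c + A^\top \lambda, \qquad
r_2 = -D_\lambda (Ax - b) - \mu \mathbf{1}
% Newton step on r(x,\lambda) = 0:
\begin{pmatrix} Q & A^\top \\ -D_\lambda A & -D_g \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix}
= - \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}
% Eliminating \Delta\lambda via the Schur complement:
\big( Q - A^\top D_g^{-1} D_\lambda A \big)\, \Delta x
  = -r_1 - A^\top D_g^{-1} r_2, \qquad
\Delta \lambda = D_g^{-1} \big( r_2 - D_\lambda A\, \Delta x \big)
\end{equation*}
```

Since −D_g⁻¹D_λ has a positive diagonal, the reduced system matrix is positive definite whenever Q is positive semi-definite; for an LP one would set Q = 0 and only the −AᵀD_g⁻¹D_λA term remains.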
8.6 Exercise 6
8.6.1 CMA vs. your own algo
At https://www.lri.fr/˜hansen/cmaes_inmatlab.html there is code for CMA for all languages (I do not recommend the C++ versions).
a) Test CMA with a standard parameter setting on the Rosenbrock function (see Wikipedia). My implementation in C++ is:

double rosenbrock(const arr& x) {
  double f = 0.;
  for(uint i=1; i<x.N; i++)
    f += sqr(x(i) - sqr(x(i-1))) + .01*sqr(1 - x(i-1));
  return f;
}

where sqr computes the square of a double.
CMA should have no problem optimizing this function – but since it always samples a whole population of size λ, the number of evaluations is rather large.
b) Think of any simple alternative stochastic search method (perhaps including line search, or whatever you come up with) to beat CMA on this problem.
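One candidate in the spirit of b) is a (1+1)-ES with 1/5th-rule stepsize adaptation, here on the Rosenbrock variant above; a sketch (the adaptation constants are assumptions, and whether it actually beats CMA is for you to measure):

```python
import numpy as np

def rosenbrock(x):
    # Same function as the C++ snippet above, minimum at x = (1, ..., 1)
    return np.sum((x[1:] - x[:-1]**2)**2 + 0.01 * (1 - x[:-1])**2)

rng = np.random.default_rng(0)                     # fixed seed for reproducibility
x = rng.uniform(-1, 1, size=2)
fx, sigma, evals = rosenbrock(x), 0.1, 1
for t in range(5000):
    y = x + sigma * rng.standard_normal(2)         # sample one offspring
    fy = rosenbrock(y); evals += 1
    if fy <= fx:                                   # (1+1) elitist selection
        x, fx = y, fy
        sigma *= 1.5                               # success: increase stepsize
    else:
        sigma *= 1.5 ** (-0.25)                    # failure: decrease (1/5th rule)
print(evals, x, fx)                                # fx should be close to 0
```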
8.7 Exercise 7
8.7.1 Multi-armed bandits & UCB
Assume there are n = 10 bandits. Each bandit is binary (i.e., y_t ∈ {0, 1}) with P(y_t = 1 | a_t = i) = p_i. The agent has T = 100 rounds to play the machines and aims to maximize Σ_{t=1}^T y_t.
For simplicity, in the following assume that p_i = i/10 for i = 1, .., 10. But the agent does not know this, of course.
a) Implement this bandit scenario using a proper (clock) random seed. (Write a method that receives an a_t and returns a y_t ∈ {0, 1}.) Simulate a random agent that chooses actions a_t ∼ U({1, .., 10}) uniformly. Let the agent play 10 games (each with T = 100 rounds). What is the random agent's average reward?
b) Implement a UCB agent. For this, the agent needs to keep track of how often it has played a machine (n_i) and how often this machine returned y = 1 (let's call this β_i) or y = 0 (let's call this α_i). What is the agent's average reward? (Averaged over 10 games, as above.)
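For orientation, a possible agent of this kind; the bonus term sqrt(2 ln t / n_i) is the standard UCB1 choice and an assumption of mine (the lecture's UCB variant may differ), and a fixed seed replaces the clock seed for reproducibility:

```python
import numpy as np

n, T, rng = 10, 100, np.random.default_rng(0)
p = np.arange(1, n + 1) / 10.0                     # p_i = i/10 (unknown to agent)

def pull(i):                                       # returns y_t in {0, 1}
    return int(rng.random() < p[i])

def ucb_game():
    ni = np.zeros(n)                               # n_i: plays per machine
    beta = np.zeros(n)                             # beta_i: observed y = 1 counts
    total = 0
    for t in range(T):
        if t < n:
            i = t                                  # play each machine once
        else:                                      # UCB1: mean + exploration bonus
            i = np.argmax(beta / ni + np.sqrt(2 * np.log(t) / ni))
        y = pull(i)
        ni[i] += 1; beta[i] += y; total += y
    return total / T

avg = np.mean([ucb_game() for _ in range(10)])
print(avg)   # clearly above the random agent's expected 0.55
```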
c) (Bonus.) Assume the agent knows that the bandits are binary. It can exploit this knowledge: its belief can be

b_t = P((p_1, .., p_n) | h_t) = Π_i Beta(p_i | α_i, β_i)
where Beta is the so-called Beta-distribution over the Bernoulli parameter p_i ∈ [0, 1]. At Wikipedia you can find information on the mean and variance (and also the cumulative distribution function, called the regularized incomplete beta function) of a Beta distribution. How exactly could an agent use this to perhaps become better than the agent in b)?
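One concrete way to use such a Beta belief is Thompson sampling, i.e., sampling one p_i from each posterior and playing the arm with the largest sample; this is an assumption of mine (the lecture may intend a Bayes-UCB style confidence bound instead), sketched here:

```python
import numpy as np

n, T, rng = 10, 100, np.random.default_rng(1)
p = np.arange(1, n + 1) / 10.0

def thompson_game():
    succ = np.zeros(n)                             # counts of y = 1 (beta_i)
    fail = np.zeros(n)                             # counts of y = 0 (alpha_i)
    total = 0
    for t in range(T):
        # sample p_i ~ Beta posterior (uniform Beta(1,1) prior assumed)
        i = np.argmax(rng.beta(succ + 1, fail + 1))
        y = int(rng.random() < p[i])
        succ[i] += y; fail[i] += 1 - y; total += y
    return total / T

avg = np.mean([thompson_game() for _ in range(10)])
print(avg)
```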
8.7.2 Global optimization on the Rosenbrock function
On the webpage you'll find Octave code for GP regression from Carl Rasmussen (gp01pred.m). The script test.m demonstrates how to use it.
Use this code to implement a global optimization method for 2D problems. Test the method

a) on the 2D Rosenbrock function defined in exercise e06, and

b) on the Rastrigin function as defined in exercise e04 with a = 6.
Note that in test.m I've chosen hyperparameters that correspond to assuming: smoothness is given by a kernel width √(1/10); initial value uncertainty (range) is given by √10. How does the performance of the method change with these hyperparameters?
8.7.3 Constrained global optimization?
On slide 6:2 it is speculated that one could consider a constrained blackbox optimization problem as well. How could one approach this in the UCB manner?
9 Topic list
This list summarizes the lecture's content and is intended as a guide for preparation for the exam. (Going through all exercises is equally important!) References to the lecture slides are given in the format (lecture:slide).
9.1 Optimization Problems in General
• Types of optimization problems (1:6)
– General constrained optimization problem definition

– Blackbox, gradient-based, 2nd order

– Understand the differences

– “Upgrades”, e.g. quasi-Newton, global optimization

• There are hardly any coherent texts that cover all three:
– constrained & convex optimization
– stochastic search
– global optimization
• In the lecture we usually only consider inequality constraints (for simplicity of presentation)

– Understand in all cases how equality constraints could also be handled
9.2 Gradient-based Methods
• Plain gradient descent
– Understand the stepsize problem (2:5)
– Stepsize adaptation & monotonicity (2:7)
– Backtracking line search (2:21)
• Steepest descent
– Is the gradient the steepest direction? (2:10,11)

– Covariance (= invariance under linear transformations) of the steepest descent direction (2:12)

• Conjugate gradient (2:15)

– The new direction d′ should be “orthogonal” to the previous d, but relative to the local quadratic shape: d′ᵀAd = 0 (= d′ and d are conjugate)

– On quadratic functions CG converges in n iterations (2:16)
• Rprop (2:19)
– Seems awfully hacky
– Every coordinate is treated separately. No invariance under rotations/transformations. (2:20)
– Change in gradient sign → reduce stepsize;else increase
– Works surprisingly well and robust in practice
• Evaluating optimization costs
– Be aware of differences in convention: sometimes “1 iteration” = many function evaluations (line search)

– Best: always report the number of function evaluations
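The gradient descent with backtracking line search summarized above can be sketched as follows (the constants, initial α = 1, halving factor, and Armijo c = 0.01, are illustrative assumptions, not necessarily the slides' values):

```python
import numpy as np

def gradient_descent(f, df, x, iters=100):
    evals = 0
    for _ in range(iters):
        d = -df(x)                                 # plain gradient direction
        alpha = 1.0
        while True:                                # backtracking line search
            evals += 1
            # Armijo condition: sufficient decrease along d (note grad^T d = -d@d)
            if f(x + alpha * d) <= f(x) + 0.01 * alpha * (-(d @ d)):
                break
            alpha *= 0.5
        x = x + alpha * d
    return x, evals                                # report # function evaluations

# Example: ill-conditioned quadratic f(x) = x1^2 + 10*x2^2
f  = lambda x: x[0]**2 + 10 * x[1]**2
df = lambda x: np.array([2 * x[0], 20 * x[1]])
x, evals = gradient_descent(f, df, np.array([1.0, 1.0]))
print(x, evals)                                    # x close to (0, 0)
```

This also illustrates the evaluation-counting caveat above: one "iteration" here costs several function evaluations inside the line search.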
9.3 Constrained Optimization
• Overview
– General problem definition (3:1)
– Convert to a series of unconstrained problems: penalty and barrier methods

– Convert to a larger unconstrained problem: primal-dual Newton method

– Convert to another constrained problem: the dual problem
• Log barrier method
– Definition (3:6)
– Understand how the barrier gets steeper as µ → 0 (not µ → ∞!)

– Iteratively decreasing µ generates the central path (3:8)

– The gradient of the log barrier generates a Lagrange term with λ_i = −µ/g_i(x)!

→ Each iteration solves the modified (approximate) KKT condition
• Squared penalty method
– Definition (3:12)
– Motivates the Augmented Lagrangian (3:13)
• Augmented Lagrangian
– Definition (equality 3:15, inequality 3:18)
– Role of the squared penalty: “measure” how strongly f pushes into the constraint

– Role of the Lagrangian term: generate the counter force

– Understand that the λ update generates the “desired force” (3:16,17)
• The Lagrangian
– Definition (3:21)
– Using L to solve constrained problems on paper: set both ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0 (3:23)

– Force balance and the first KKT condition (3:24)

– Understand in detail the full KKT conditions (3:25)

– Optima are necessarily saddle points of L (3:27)

– min_x L ↔ first KKT ↔ force balance (3:28)

– max_λ L ↔ complementarity KKT ↔ constraints (3:28)
• Lagrange dual problem
– primal problem: min_x max_{λ≥0} L(x, λ)

– dual problem: max_{λ≥0} min_x L(x, λ)
– Definition of Lagrange dual (3:29)
– Lower bound and strong duality (3:30)
• Primal-dual Newton method to solve the KKT conditions (3:32)

– Definition & description (4:21,22)
• Phase I optimization
– Nice trick to find feasible initialization (3:37)
9.4 Second-Order Methods
• General
– Problem definition (4:5)
– 2nd order information can improve direction & stepsize (4:3)

– The Hessian needs to be pos-def (↔ f(x) is convex) or modified/approximated as pos-def (Gauss-Newton, damping)

– High relevance within the constrained optimization iterations (4:19)
• Newton
– Definition (4:6)
– Adaptive stepsize vs. damping (4:7,8)
• Gauss-Newton
– f(x) is a sum of squared cost terms (4:11)
– The approx. Hessian 2∇φ(x)ᵀ∇φ(x) is always semi-pos-def! (4:12)
• Quasi-Newton
– Accumulate gradient information to approximate a Hessian (4:14,15)
– BFGS (4:16)
9.5 Convex Optimization
• Definitions
– Convex, quasiconvex, unimodal functions (5:2,3)
– Convex optimization problem (5:6)
• Linear Programming
– General and standard form definition (5:7)
– Converting into standard form (5:8)
– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods

– The Simplex Algorithm is the classical alternative; it walks on the constraint edges instead of through the interior (5:12)

– Very important application of LPs: LP-relaxations of integer linear programs (5:16)
• Quadratic Programming
– Definition (5:21)
– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods

– Sequential QP solves general (non-quadratic) problems by defining a local QP for the step direction, followed by a line search in that direction (5:22)
9.6 Stochastic Search & Heuristics
• Generic Stochastic Search Recipe
– Definition (6:7)
– Understand the crucial role of θ (6:8)
– In Optimal Optimization, θ is the belief (7:8,9)
– In EAs, θ may include populations (6:11)
– Categories of EAs: ES, GA, GP, EDA (6:11)
– In ESs, θ are parameters of a Gaussian (6:9,10,13)
– In Simulated Annealing, θ is the current point(6:20)
– In the downhill Simplex method, θ is the set of n + 1 points (6:24)
• with Gaussian distributions
– (µ, λ)-ES and (µ + λ)-ES: sampling, selection,update (6:9,10)
– CMA: adapting C and σ based on the path ofthe mean (6:13,14)
• Simulated Annealing
– Acceptance probability: ratio of exp-values corrected by transition reversibility (6:20,21)
– Typically Gaussian transition probabilities
– Samples from p(x) ∝ e−f(x)/T (6:21,22)
– Cooling scheme decreases temperature
• Hill Climbing
– Same as Simulated Annealing for T = 0
– Same as (1 + 1)-ES
• Nelder-Mead Downhill Simplex
– Reflect, expand, contract (6:25)
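For a quick hands-on check of the reflect/expand/contract scheme, SciPy ships a Nelder-Mead implementation (an off-the-shelf reference, not the lecture's own pseudocode); note that it needs no gradients:

```python
import numpy as np
from scipy.optimize import minimize

# Simple ill-conditioned quadratic with minimum at (1, -2)
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2

res = minimize(f, x0=np.zeros(2), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8})
print(res.x, res.nfev)                             # x near (1, -2); nfev = # f-evals
```

Comparing res.nfev against a gradient-based run on the same function illustrates the price of being derivative-free.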
9.7 Global Optimization
• Multi-armed bandit framework
– Problem definition (7:6)
– Understand the concepts of exploration, exploitation & belief (7:7,8,9)

– Optimal Optimization would imply planning (exactly) through belief space (7:9)

– Upper Confidence Bound (UCB) and confidence interval (7:10,11)
– UCB is optimistic
• Global optimization
– Global optimization = infinite bandits (7:13)
– Locally correlated bandits → Gaussian Process beliefs (7:14)
– Maximum Probability of Improvement (7:15)
– Expected Improvement (7:15)