Introduction to Optimization
Marc Toussaint
July 11, 2013
This is a direct concatenation and reformatting of all lecture slides and exercises from the Optimization course (summer term 2013, U Stuttgart), including a topic list to prepare for exams.
Contents
1 Introduction
2 Gradient-based Methods: Plain gradient descent, stepsize adaptation & monotonicity, steepest descent, conjugate gradient, Rprop
3 Constrained Optimization: General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I
4 Second-Order Methods: 2nd order gives better stepsize & direction, Newton methods, adaptive stepsize, Levenberg-Marquardt, Gauss-Newton method, Quasi-Newton methods, BFGS, primal-dual interior point Newton method
5 Convex Optimization: Convex, quasiconvex, unimodal, convex optimization problem, linear program (LP), standard form, simplex algorithm, LP-relaxation of integer linear programs, quadratic programming (QP), sequential quadratic programming
6 Stochastic Search & Heuristics: Blackbox optimization, stochastic search, (µ+λ)-ES, CMA-ES, Evolutionary Algorithms, Simulated Annealing, Hill Climbing, Nelder-Mead downhill simplex
7 Global Optimization: Multi-armed bandits, exploration vs. exploitation, navigation through belief space, upper confidence bound (UCB), global optimization = infinite bandits, Gaussian Processes, probability of improvement, expected improvement, UCB
8 Exercises
9 Topic list
1 Introduction
Why Optimization is interesting!
• Which science does not use optimality principles to describe
nature & artifacts?
• Endless applications
1:1
The content of an optimization course
• Catholic way: Convex Optimization
• Discrete Optimization (Stefan Funke)
• Exotics: Evolutionary Algorithms, Swarm optimization, etc
• Here:
I asked colleagues “What are the optimization methods one should know / you use most?” → Everybody gave very different answers.
1:2
log-barrier, simplex, particle swarm, MCMC (simulated annealing), (L)BFGS, blackbox stochastic search, Newton, Rprop, EM, primal/dual, greedy, (conj.) gradients, KKT, line search, linear/quadratic programming
1:3
This is the first time I give the lecture!
• It’ll be improvised
• You can tell me what to include
1:4
Planned Outline
• Gradient-based optimization (1st order methods)
  – plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
  – adaptive stepsize heuristics
• Constrained Optimization
  – squared penalties, augmented Lagrangian, log barrier
  – Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
  – Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
  – constrained case, primal-dual Newton
• Special convex cases
  – Linear Programming, (sequential) Quadratic Programming
  – Simplex algorithm
  – relation to relaxed discrete optimization
• Blackbox optimization (“0th order methods”)
  – blackbox stochastic search
  – Markov Chain Monte Carlo methods
  – evolutionary algorithms
1:5
Rough Types of Optimization Problems
• Generic optimization problem:
  Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → Rᵐ; find
    min_x f(x)  s.t.  g(x) ≤ 0
• Blackbox: only f(x) can be evaluated
• Gradient: ∇f(x) can be evaluated
• Gauss-Newton type: f(x) = φ(x)ᵀφ(x) and ∇φ(x) can be evaluated
• 2nd order: ∇2f(x) can be evaluated
• “Approximate upgrade”:
– Use samples of f(x) to approximate ∇f(x) locally
– Use samples of ∇f(x) to approximate ∇2f(x) locally
1:6
Books
Boyd and Vandenberghe: Convex Optimization. http://www.stanford.edu/~boyd/cvxbook/
(this course will not go to the full depth in math of Boyd et al.)
1:7
Organisation
• Course webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/13-Optimization/
– Slides, exercises & software (C++)
– Links to books and other resources
• Secretary / organisational questions:
Carola Stahl, [email protected], Room 2.217
• 1 planned tutorial: Tuesday 17:30-19:00, Room 0.453
• Rules for the exercises:
– Working on the exercises is important!
– At the beginning of each tutorial, sign up on the list:
  – attendance
  – which problems you worked on
– Random selection to present the solution
– 50% of the problems worked on are required for active participation
1:8
2 Gradient-based Methods
Plain gradient descent, stepsize adaptation & monotonicity, steepest
descent, conjugate gradient, Rprop
Gradient descent methods – outline
• Plain gradient descent (with adaptive stepsize)
• Steepest descent (w.r.t. a known metric)
• Conjugate gradient (requires line search)
• Rprop (heuristic, but quite efficient)
2:1
Gradient descent
• Notation:
  objective function: f : Rⁿ → R
  gradient vector: ∇f(x) = [∂f(x)/∂x]ᵀ ∈ Rⁿ
• Problem:
    min_x f(x)
  where we can evaluate f(x) and ∇f(x) for any x ∈ Rⁿ
• Gradient descent:
Make iterative steps in the direction −∇f(x).
2:2
Plain Gradient Descent
2:3
Fixed stepsize
BAD! gradient descent with fixed stepsize:

Input: initial x ∈ Rⁿ, function ∇f(x), stepsize α, tolerance θ
Output: x
1: repeat
2:   x ← x − α∇f(x)
3: until |∆x| < θ [perhaps for 10 iterations in sequence]
2:4
Making steps proportional to ∇f(x)??
  large gradient → large step?
  small gradient → small step?
NO! We need methods independent of |∇f(x)|, invariant to scalings of f and x!
2:5
How can we become independent of |∇f(x)|?
• Line search — which we’ll discuss briefly later
• Stepsize adaptation
2:6
Gradient descent with stepsize adaptation
Input: initial x ∈ Rⁿ, functions f(x) and ∇f(x), initial stepsize α, tolerance θ
Output: x
1: repeat
2:   y ← x − α ∇f(x)/|∇f(x)|
3:   if f(y) ≤ f(x) then // step is accepted
4:     x ← y
5:     α ← 1.2α // increase stepsize
6:   else // step is rejected
7:     α ← 0.5α // decrease stepsize
8:   end if
9: until |y − x| < θ [perhaps for 10 iterations in sequence]

(“magic numbers”)
α determines the absolute stepsize (the step is normalized to length α)
the stepsize is automatically adapted
2:7
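The stepsize-adaptation scheme of slide 2:7 can be sketched in a few lines of Python (a sketch only; the course software is C++, and the toy quadratic below is a made-up example):

```python
import numpy as np

def grad_descent_adaptive(f, grad, x, alpha=1.0, theta=1e-8, max_iter=100000):
    # gradient descent with stepsize adaptation (slide 2:7): steps are
    # normalized to length alpha, so only the gradient's direction is used
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm == 0.0:
            break
        y = x - alpha * g / gnorm
        if f(y) <= f(x):       # step accepted
            x = y
            if alpha < theta:  # the accepted step was tiny: converged
                break
            alpha *= 1.2       # increase stepsize
        else:                  # step rejected
            alpha *= 0.5       # decrease stepsize
    return x

# toy objective with very different curvatures per dimension
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = grad_descent_adaptive(f, grad, [3.0, -2.0])
```

Note that the method is monotone by construction: x only ever moves to a point with lower (or equal) f.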
• Guaranteed monotonicity (by construction)
  If f is convex ⇒ convergence
  For typical non-convex bounded f ⇒ convergence to a local optimum
2:8
Steepest Descent
2:9
Steepest Descent
• The gradient ∇f(x) is sometimes called the steepest descent direction
  Is it really?
• Here is a possible definition:
  The steepest descent direction is the one where, when I make a step of length 1, I get the largest decrease of f in its linear approximation:
    argmin_δ ∇f(x)ᵀδ  s.t.  ||δ|| = 1
2:10
Steepest Descent
• But the norm ||δ||² = δᵀAδ depends on the metric A!
  Let A = BᵀB (Cholesky decomposition) and z = Bδ:
    δ* = argmin_δ ∇fᵀδ  s.t.  δᵀAδ = 1
       = B⁻¹ argmin_z (B⁻¹z)ᵀ∇f  s.t.  zᵀz = 1
       = B⁻¹ argmin_z zᵀB⁻ᵀ∇f  s.t.  zᵀz = 1
       = B⁻¹[−B⁻ᵀ∇f] = −A⁻¹∇f  (up to scaling)
  The steepest descent direction is δ = −A⁻¹∇f
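A small numeric check of this result (the metric A and the gradient are made-up example values): among all directions of unit A-norm, −A⁻¹∇f decreases the linear model at least as much as the plain gradient direction.

```python
import numpy as np

# steepest descent under a metric A (slide 2:11): delta = -A^{-1} grad f
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])        # metric defining ||delta||^2 = delta^T A delta
gradf = np.array([1.0, 1.0])
delta = -np.linalg.solve(A, gradf)  # steepest descent direction -A^{-1} grad f

# normalize both candidates to unit A-norm and compare the linear decrease
d_metric = delta / np.sqrt(delta @ A @ delta)
d_plain = -gradf / np.sqrt(gradf @ A @ gradf)
```

With A = I this reduces to the plain (negative) gradient, as expected.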
Behavior under linear coordinate transformations
• Let B be a matrix that describes a linear transformation in coordinates
• A coordinate vector x transforms as z = Bx
• The gradient vector ∇ₓf(x) transforms as ∇_z f(z) = B⁻ᵀ∇ₓf(x)
• The metric A transforms as A_z = B⁻ᵀAₓB⁻¹
• The steepest descent direction transforms as A_z⁻¹∇_z f(z) = B Aₓ⁻¹∇ₓf(x)
The steepest descent direction transforms like a normal coordinate vector (covariant)
2:12
(Nonlinear) Conjugate Gradient
2:13
Conjugate Gradient
• The “Conjugate Gradient Method” is a method for solving large linear equation systems Ax + b = 0.
  We mention its extension for optimizing nonlinear functions f(x).
• A key insight:
  – at x_k we computed ∇f(x_k)
  – we made a (line-search) step to x_{k+1}
  – at x_{k+1} we computed ∇f(x_{k+1})
  What conclusions can we draw about the “local quadratic shape” of f?
2:14
Conjugate Gradient
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f(x)
2: repeat
3:   α ← argmin_α f(x + αd) // line search
4:   x ← x + αd
5:   g′ ← g, g ← −∇f(x) // store and compute gradient
6:   β ← max{ gᵀ(g − g′) / g′ᵀg′ , 0 }
7:   d ← g + βd // conjugate descent direction
8: until |∆x| < θ

• Notes:
  – β > 0: The new descent direction always adds a bit of the old direction!
  – This essentially provides 2nd order information
  – The equation for β is by Polak-Ribière: On a quadratic function f(x) = xᵀAx this leads to conjugate search directions, d′ᵀAd = 0.
  – All this really only works with line search
2:15
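The pseudocode above can be sketched as follows (a sketch: the doubling/halving line search and the descent-direction safeguard are my assumptions, not part of the slide; any reasonable line search works):

```python
import numpy as np

def cg_polak_ribiere(f, grad, x, theta=1e-8, max_iter=500):
    # nonlinear conjugate gradient with the Polak-Ribiere beta (slide 2:15)
    x = np.asarray(x, dtype=float)
    g = -grad(x)
    d = g.copy()
    for _ in range(max_iter):
        if g @ d <= 0:            # safeguard: restart if d is not a descent direction
            d = g.copy()
        alpha = 1.0
        while f(x + 2.0 * alpha * d) < f(x + alpha * d):   # grow alpha while it helps
            alpha *= 2.0
        while f(x + alpha * d) > f(x) and alpha > 1e-14:   # shrink until f decreases
            alpha *= 0.5
        x_new = x + alpha * d
        if np.linalg.norm(x_new - x) < theta:
            return x_new
        x = x_new
        g_old, g = g, -grad(x)
        beta = max(g @ (g - g_old) / (g_old @ g_old), 0.0)  # Polak-Ribiere
        d = g + beta * d          # new direction adds a bit of the old one
    return x

# quadratic test function x^T A x (made-up A with asymmetric curvature)
A = np.array([[10.0, 1.0], [1.0, 1.0]])
fq = lambda x: x @ A @ x
gq = lambda x: 2.0 * A @ x
x_min = cg_polak_ribiere(fq, gq, [3.0, -2.0])
```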
Conjugate Gradient
• For quadratic functions CG converges in n iterations. But each iteration does line search!
2:16
Conjugate Gradient
• Useful tutorial on CG and line search:
J. R. Shewchuk: An Introduction to the Conjugate Gradient Method
Without the Agonizing Pain
2:17
Rprop
2:18
Rprop
“Resilient Back Propagation” (outdated name from NN times...)
Input: initial x ∈ Rⁿ, function ∇f(x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x₀, all αᵢ = α, all g′ᵢ = 0
2: repeat
3:   g ← ∇f(x)
4:   x′ ← x
5:   for i = 1 : n do
6:     if gᵢg′ᵢ > 0 then // same direction as last time
7:       αᵢ ← 1.2αᵢ
8:       xᵢ ← xᵢ − αᵢ sign(gᵢ)
9:       g′ᵢ ← gᵢ
10:    else if gᵢg′ᵢ < 0 then // change of direction
11:      αᵢ ← 0.5αᵢ
12:      xᵢ ← xᵢ − αᵢ sign(gᵢ)
13:      g′ᵢ ← 0 // force last case next time
14:    else
15:      xᵢ ← xᵢ − αᵢ sign(gᵢ)
16:      g′ᵢ ← gᵢ
17:    end if
18:    optionally: cap αᵢ ∈ [αmin xᵢ, αmax xᵢ]
19:  end for
20: until |x′ − x| < θ for 10 iterations in sequence
2:19
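A direct Python transcription of the Rprop pseudocode (a sketch; the badly scaled quadratic below is a made-up test case):

```python
import numpy as np

def rprop(grad, x, alpha0=0.1, theta=1e-8, max_iter=50000):
    # Rprop (slide 2:19): per-dimension stepsizes alpha_i, adapted from the
    # sign agreement of successive gradients; |grad f| is ignored entirely
    x = np.asarray(x, dtype=float)
    n = len(x)
    alpha = np.full(n, alpha0)
    g_old = np.zeros(n)
    for _ in range(max_iter):
        g = grad(x)
        x_prev = x.copy()
        for i in range(n):
            if g[i] * g_old[i] > 0:        # same direction as last time
                alpha[i] *= 1.2
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
            elif g[i] * g_old[i] < 0:      # change of direction: we overshot
                alpha[i] *= 0.5
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = 0.0             # force the plain case next time
            else:                          # plain case
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
        if np.linalg.norm(x - x_prev) < theta:
            break
    return x

# badly scaled quadratic: Rprop handles each dimension's scaling separately
f_grad = lambda x: np.array([2.0 * x[0], 200.0 * x[1]])
x_min = rprop(f_grad, [3.0, -2.0])
```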
Rprop
• Rprop is a bit crazy:
  – stepsize adaptation in each dimension separately
  – it not only ignores |∇f| but also its exact direction; step directions may differ by up to 90° from −∇f
  – often works very robustly
  – guarantees? See work by Ch. Igel
• If you like, have a look at:
  Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient compared to Levenberg-Marquardt optimization. In Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, volume 151, 259-272.
2:20
Backtracking line search
• Line search in general denotes the problem
    min_{α≥0} f(x + α∆)
  for some step direction ∆
• The most common line search on convex functions is backtracking:

Input: start point x, direction ∆, function f(x), parameters a ∈ (0, 1/2), b ∈ (0, 1)
Output: α
1: initialize α = 1
2: while f(x + α∆) > f(x) + a∇f(x)ᵀ(α∆) do
3:   α ← bα
4: end while

b describes the stepsize decrement in case of a rejected step
a describes a minimum desired decrease in f(x)
• In the 2nd order methods we described, we chose a = 0: we did not invest into further line search steps if f(x + α∆) ≤ f(x)
• Boyd et al.: typically a ∈ [0.01, 0.3] and b ∈ [0.1, 0.8]
2:21
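Backtracking is short enough to write out completely (the quadratic and the values a = 0.1, b = 0.5 are example choices within the ranges above):

```python
import numpy as np

def backtracking(f, grad_x, x, delta, a=0.1, b=0.5):
    # backtracking line search (slide 2:21): shrink alpha by factor b until
    # the sufficient-decrease condition with slope fraction a holds
    alpha = 1.0
    fx = f(x)
    while f(x + alpha * delta) > fx + a * grad_x @ (alpha * delta):
        alpha *= b
    return alpha

# example: quadratic objective, stepping along the negative gradient
f = lambda x: x[0]**2 + 4.0 * x[1]**2
x = np.array([1.0, 1.0])
g = np.array([2.0, 8.0])          # gradient of f at x
alpha = backtracking(f, g, x, -g)
```

For this example the condition first holds after three halvings, at α = 0.125.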
Backtracking line search for convex functions
(From Boyd et al.; notation differs from previous slide.)
2:22
Appendix
Two little comments on stopping criteria & costs...
2:23
Appendix: Stopping Criteria
• Standard references (Boyd) define stopping criteria based on
the “change” in f(x), e.g. |∆f(x)| < θ or |∇f(x)| < θ.
• Throughout I will define stopping criteria based on the change
in x, e.g. |∆x| < θ! In my experience this is in many problems
more meaningful, and invariant of the scaling of f .
2:24
Appendix: Evaluating optimization costs
• Standard references (Boyd) assume line search is cheap and measure optimization costs as the number of iterations (counting 1 per line search).
• Throughout I will assume that every evaluation of f(x) or (f(x), ∇f(x)) or (f(x), ∇f(x), ∇²f(x)) is equally expensive!
2:25
3 Constrained Optimization
General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I
Constrained Optimization
• General constrained optimization problem:
  Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → Rᵐ, h : Rⁿ → Rˡ; find
    min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
In this lecture I’ll focus (mostly) on inequality constraints g!
• Applications
– Find an optimal, non-colliding trajectory in robotics
– Optimize the shape of a turbine blade, s.t. it must not break
– Optimize the train schedule, s.t. consistency/possibility
3:1
General approaches
• Try to somehow transform the constrained problem into
  – a series of unconstrained problems
  – a single but larger unconstrained problem
  – another constrained problem, hopefully simpler (dual, convex)
3:2
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:3
Penalties & Barriers
• Convention:
  A barrier is really ∞ for g(x) > 0
  A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0
3:4
Log barrier method or Interior Point method
3:5
Log barrier method
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) − µ Σᵢ log(−gᵢ(x))
3:6
Log barrier
• For µ → 0, −µ log(−g) converges to ∞[g > 0]
  Notation: [boolean expression] ∈ {0, 1}
• The barrier's gradient ∇(−log(−g)) = −∇g/g pushes away from the constraint
• Eventually we want to have a very small µ; but choosing a small µ makes the barrier very non-smooth, which is bad for gradient and 2nd order methods
3:7
Central Path
• Every µ defines a different optimal x*(µ)
    x*(µ) = argmin_x f(x) − µ Σᵢ log(−gᵢ(x))
• Each point on the path can be understood as the optimal compromise between minimizing f(x) and a repelling force of the constraints. (Which corresponds to dual variables λ*(µ).)
3:8
Log barrier method
Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) − µ Σᵢ log(−gᵢ(x)) with tolerance 10θ
4:   decrease µ ← µ/10
5: until |∆x| < θ and ∀i : gᵢ(x) < ε

Note: See Boyd & Vandenberghe for stopping criteria based on f precision (duality gap) and a better choice of the initial µ (which is called t there).
3:9
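A minimal sketch of the log barrier method; the inner centering solver here is the normalized gradient descent with stepsize adaptation from slide 2:7 (an assumption, any unconstrained solver works), and the 1-D example problem is made up:

```python
import numpy as np

def logbarrier_min(f, grad_f, gs, grad_gs, x, theta=1e-6):
    # log barrier method (slide 3:9): repeatedly minimize
    # f(x) - mu * sum_i log(-g_i(x)) while decreasing mu
    x = np.asarray(x, dtype=float)
    mu = 1.0
    def F(x):
        vals = np.array([g(x) for g in gs])
        if np.any(vals >= 0):
            return np.inf                  # barrier: infinite outside the interior
        return f(x) - mu * np.sum(np.log(-vals))
    def gradF(x):
        out = grad_f(x).astype(float)
        for g, dg in zip(gs, grad_gs):
            out = out - mu / g(x) * dg(x)  # barrier gradient -mu * grad g / g
        return out
    for _ in range(8):                     # outer loop: decrease mu
        alpha = 1.0
        for _ in range(20000):             # inner loop: centering for this mu
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        mu /= 10.0
    return x

# example: min (x-2)^2  s.t.  x <= 1  (g(x) = x - 1), feasible start x = 0
f = lambda x: (x[0] - 2.0)**2
grad_f = lambda x: np.array([2.0 * (x[0] - 2.0)])
g = lambda x: x[0] - 1.0
dg = lambda x: np.array([1.0])
x_min = logbarrier_min(f, grad_f, [g], [dg], np.array([0.0]))
```

Note that all iterates stay strictly feasible: any step across the boundary has infinite barrier cost and is rejected.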
We will revisit the log barrier method later, once we have introduced the Lagrangian...
3:10
Squared Penalty Method
3:11
Squared Penalty Method
• This is perhaps the simplest approach
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ [gᵢ(x) > 0] gᵢ(x)²

Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ [gᵢ(x) > 0] gᵢ(x)² with tolerance 10θ
4:   µ ← 10µ
5: until |∆x| < θ and ∀i : gᵢ(x) < ε
3:12
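A sketch of the squared penalty method (inner solver and example problem are my assumptions; the example is chosen so the residual constraint violation is visible):

```python
import numpy as np

def squared_penalty_min(f, grad_f, gs, grad_gs, x, theta=1e-6, eps=1e-3):
    # squared penalty method (slide 3:12): minimize
    # f(x) + mu * sum_i [g_i(x) > 0] g_i(x)^2 for increasing mu;
    # inner solver: normalized gradient descent with stepsize adaptation
    x = np.asarray(x, dtype=float)
    mu = 1.0
    F = lambda x: f(x) + mu * sum(max(g(x), 0.0)**2 for g in gs)
    def gradF(x):
        out = grad_f(x).astype(float)
        for g, dg in zip(gs, grad_gs):
            if g(x) > 0:                   # only violated constraints penalize
                out = out + mu * 2.0 * g(x) * dg(x)
        return out
    for _ in range(10):
        alpha = 1.0
        for _ in range(20000):
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        if all(g(x) < eps for g in gs):    # small residual violation accepted
            return x
        mu *= 10.0
    return x

# example: min x  s.t.  x >= 0, i.e. g(x) = -x <= 0; optimum x* = 0,
# the penalty method stops at x = -1/(2*mu), slightly violating
f = lambda x: x[0]
grad_f = lambda x: np.array([1.0])
g = lambda x: -x[0]
dg = lambda x: np.array([-1.0])
x_min = squared_penalty_min(f, grad_f, [g], [dg], np.array([1.0]))
```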
Squared Penalty Method
• The method is ok, but will always lead to some violation of the constraints
• A better idea would be to add an out-pushing gradient/force −∇gᵢ(x) for every constraint gᵢ(x) > 0 that is violated.
  Ideally, the out-pushing gradient mixes with −∇f(x) exactly such that the result becomes tangential to the constraint!
  This idea leads to the augmented Lagrangian approach.
3:13
Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)
3:14
Augmented Lagrangian (equality constraint)
• We first consider an equality constraint before addressing inequalities
• Instead of
    min_x f(x)  s.t.  h(x) = 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ hᵢ(x)² + Σᵢ₌₁ᵐ λᵢhᵢ(x)    (1)
• Note:
  – The gradient ∇hᵢ(x) is always orthogonal to the constraint
  – By tuning λᵢ we can induce a “virtual gradient” λᵢ∇hᵢ(x)
  – The term µ Σᵢ₌₁ᵐ hᵢ(x)² penalizes as before
• Here is the trick:
  – First minimize (1) for some µ and λᵢ
  – This will in general lead to a (slight) penalty µ Σᵢ₌₁ᵐ hᵢ(x)²
  – For the next iteration, choose λᵢ to generate exactly the gradient that was previously generated by the penalty
3:15
• Optimality condition after an iteration:
    x′ = argmin_x f(x) + µ Σᵢ₌₁ᵐ hᵢ(x)² + Σᵢ₌₁ᵐ λᵢhᵢ(x)
    ⇒ 0 = ∇f(x′) + µ Σᵢ₌₁ᵐ 2hᵢ(x′)∇hᵢ(x′) + Σᵢ₌₁ᵐ λᵢ∇hᵢ(x′)
• Update the λ's for the next iteration:
    Σᵢ₌₁ᵐ λᵢ^new ∇hᵢ(x′) = µ Σᵢ₌₁ᵐ 2hᵢ(x′)∇hᵢ(x′) + Σᵢ₌₁ᵐ λᵢ^old ∇hᵢ(x′)
    λᵢ^new = λᵢ^old + 2µhᵢ(x′)

Input: initial x ∈ Rⁿ, functions f(x), h(x), ∇f(x), ∇h(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λᵢ = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ hᵢ(x)² + Σᵢ λᵢhᵢ(x)
4:   ∀i : λᵢ ← λᵢ + 2µhᵢ(x)
5: until |∆x| < θ and ∀i : |hᵢ(x)| < ε
3:16
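The algorithm above in Python, tested on the equality-constrained example from slide 3:23 (the inner solver, again a simple adaptive gradient descent, is my assumption; µ stays fixed while the λ updates enforce the constraint):

```python
import numpy as np

def auglag_eq(f, grad_f, hs, grad_hs, x, mu=1.0, theta=1e-6, eps=1e-6):
    # augmented Lagrangian for equalities (slide 3:16): minimize
    # f + mu*sum h_i^2 + sum lambda_i h_i, then lambda_i += 2*mu*h_i(x)
    x = np.asarray(x, dtype=float)
    lam = np.zeros(len(hs))
    for _ in range(100):
        F = lambda x: f(x) + sum(mu * h(x)**2 + l * h(x) for h, l in zip(hs, lam))
        def gradF(x):
            out = grad_f(x).astype(float)
            for h, dh, l in zip(hs, grad_hs, lam):
                out = out + (2.0 * mu * h(x) + l) * dh(x)
            return out
        alpha = 1.0
        for _ in range(20000):             # inner unconstrained minimization
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        lam = lam + 2.0 * mu * np.array([h(x) for h in hs])
        if all(abs(h(x)) < eps for h in hs):
            break
    return x, lam

# example from slide 3:23: min x^T x  s.t.  x1 + x2 = 1; solution (1/2, 1/2)
f = lambda x: x @ x
grad_f = lambda x: 2.0 * x
h = lambda x: x[0] + x[1] - 1.0
dh = lambda x: np.array([1.0, 1.0])
x_min, lam = auglag_eq(f, grad_f, [h], [dh], np.array([0.0, 0.0]))
```

The returned multiplier converges to λ* = −1, matching the analytical solution on slide 3:23.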
This adaptation of the λᵢ is really elegant:
– We do not have to take the penalty limit µ → ∞ but still get exact constraints
– If f and h were linear (∇f and ∇hᵢ constant), the updated λᵢ is exactly right: in the next iteration we would exactly hit the constraint (by construction)
– The penalty term is like a measuring device for the necessary “virtual gradient”, which is generated by the augmentation term in the next iteration
– The λᵢ are very meaningful: they give the force/gradient that a constraint exerts on the solution
3:17
Augmented Lagrangian (inequality constraint)
• Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) + µ Σᵢ₌₁ᵐ [gᵢ(x) ≥ 0 ∨ λᵢ > 0] gᵢ(x)² + Σᵢ₌₁ᵐ λᵢgᵢ(x)
• A constraint is either active or inactive:
  – When active (gᵢ(x) ≥ 0 ∨ λᵢ > 0) we aim for equality gᵢ(x) = 0
  – When inactive (gᵢ(x) < 0 ∧ λᵢ = 0) we don't penalize/augment
  – λᵢ are zero or positive, but never negative

Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λᵢ = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σᵢ [gᵢ(x) ≥ 0 ∨ λᵢ > 0] gᵢ(x)² + Σᵢ λᵢgᵢ(x)
4:   ∀i : λᵢ ← max(λᵢ + 2µgᵢ(x), 0)
5: until |∆x| < θ and ∀i : gᵢ(x) < ε
3:18
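The inequality version in Python (a sketch under the same assumptions as before; the example has one active constraint, so the multiplier converges to the constraint force λ* = 2):

```python
import numpy as np

def auglag_ineq(f, grad_f, gs, grad_gs, x, mu=1.0, theta=1e-6, eps=1e-6):
    # augmented Lagrangian for inequalities (slide 3:18): augment only
    # active constraints (g_i >= 0 or lambda_i > 0), clamp lambda at 0
    x = np.asarray(x, dtype=float)
    lam = np.zeros(len(gs))
    for _ in range(50):
        active = lambda i, xx: gs[i](xx) >= 0 or lam[i] > 0
        def F(xx):
            v = f(xx)
            for i, g in enumerate(gs):
                if active(i, xx):
                    v += mu * g(xx)**2
                v += lam[i] * g(xx)
            return v
        def gradF(xx):
            out = grad_f(xx).astype(float)
            for i, (g, dg) in enumerate(zip(gs, grad_gs)):
                c = lam[i] + (2.0 * mu * g(xx) if active(i, xx) else 0.0)
                out = out + c * dg(xx)
            return out
        alpha = 1.0
        for _ in range(20000):             # inner unconstrained minimization
            d = gradF(x)
            nd = np.linalg.norm(d)
            if nd < 1e-12 or alpha < theta:
                break
            y = x - alpha * d / nd
            if F(y) <= F(x):
                x, alpha = y, 1.2 * alpha
            else:
                alpha *= 0.5
        lam = np.maximum(lam + 2.0 * mu * np.array([g(x) for g in gs]), 0.0)
        if all(g(x) < eps for g in gs):
            break
    return x, lam

# example: min (x-2)^2  s.t.  x <= 1; solution x* = 1, multiplier lambda* = 2
f = lambda x: (x[0] - 2.0)**2
grad_f = lambda x: np.array([2.0 * (x[0] - 2.0)])
g = lambda x: x[0] - 1.0
dg = lambda x: np.array([1.0])
x_min, lam = auglag_ineq(f, grad_f, [g], [dg], np.array([0.0]))
```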
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:19
The Lagrangian
3:20
The Lagrangian
• Given a constrained problem
    min_x f(x)  s.t.  g(x) ≤ 0
  we define the Lagrangian as
    L(x, λ) = f(x) + Σᵢ₌₁ᵐ λᵢgᵢ(x)
• The λᵢ ≥ 0 are called dual variables or Lagrange multipliers
What’s the point of this definition?
• The Lagrangian is useful to compute optima analytically, on paper; that's why physicists learn it early on
• The Lagrangian implies the KKT conditions of optimality
• Optima are necessarily at saddle points of the Lagrangian
• The Lagrangian implies a dual problem, which is sometimes
easier to solve than the primal
3:22
Example: Some calculus using the Lagrangian
• For x ∈ R², what is
    min_x xᵀx  s.t.  x₁ + x₂ = 1
• Solution:
    L(x, λ) = xᵀx + λ(x₁ + x₂ − 1)
    0 = ∇ₓL(x, λ) = 2x + λ(1 1)ᵀ  ⇒  x₁ = x₂ = −λ/2
    0 = ∇_λL(x, λ) = x₁ + x₂ − 1 = −λ/2 − λ/2 − 1  ⇒  λ = −1
    ⇒  x₁ = x₂ = 1/2
3:23
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇gᵢ(x)
3:24
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇gᵢ(x)
• Formally: for optimal x: ∇f(x) ∈ span{∇gᵢ(x)}
• Or: for optimal x there must exist λᵢ such that −∇f(x) = Σᵢ λᵢ∇gᵢ(x)
• For optimal x it must hold (necessary condition): ∃λ s.t.
    ∇f(x) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x) = 0   (“force balance”)
    ∀i : gᵢ(x) ≤ 0   (primal feasibility)
    ∀i : λᵢ ≥ 0   (dual feasibility)
    ∀i : λᵢgᵢ(x) = 0   (complementarity)
  The last condition says that λᵢ > 0 only for active constraints.
  These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality constraints)
3:25
The “force” & KKT view on the Lagrangian
• The first condition (“force balance”), ∃λ s.t.
    ∇f(x) + Σᵢ₌₁ᵐ λᵢ∇gᵢ(x) = 0
  can be equivalently expressed as: ∃λ s.t.
    ∇ₓL(x, λ) = 0
• In that sense, the Lagrangian can be viewed as the “energy function” that generates (for a good choice of λ) the right balance between cost and constraint gradients
• This is exactly as in the augmented Lagrangian approach, where however we have an additional (“augmented”) squared penalty that is used to tune the λᵢ
3:26
Saddle point view on the Lagrangian
• Let's briefly consider the equality case again:
    min_x f(x)  s.t.  h(x) = 0
  with the Lagrangian
    L(x, λ) = f(x) + Σᵢ₌₁ᵐ λᵢhᵢ(x)
• Note:
    min_x L(x, λ)  ⇒  0 = ∇ₓL(x, λ)  ↔  force balance
    max_λ L(x, λ)  ⇒  0 = ∇_λL(x, λ) = h(x)  ↔  constraint
• Optima (x*, λ*) are saddle points where
    ∇ₓL = 0 ensures force balance and
    ∇_λL = 0 ensures the constraint
3:27
Saddle point view on the Lagrangian
• In the inequality case:
    max_{λ≥0} L(x, λ) = { f(x)  if g(x) ≤ 0
                          ∞     otherwise
    max_{λᵢ≥0} L(x, λ)  ⇒  { λᵢ = 0                          if gᵢ(x) < 0
                             0 = ∇_{λᵢ}L(x, λ) = gᵢ(x)       otherwise
  This implies either (λᵢ = 0 ∧ gᵢ(x) < 0) or gᵢ(x) = 0, which is exactly equivalent to the KKT conditions
• Again, optima (x*, λ*) are saddle points where
    min_x L enforces force balance and
    max_λ L enforces the KKT conditions
3:28
The Lagrange dual problem
• We define the Lagrange dual function as
    l(λ) = min_x L(x, λ)
• This implies two problems:
    min_x f(x) s.t. g(x) ≤ 0    (primal problem)
    max_λ l(λ) s.t. λ ≥ 0      (dual problem)
  The dual problem is convex, even if the primal is non-convex!
• Written more symmetrically:
    min_x max_{λ≥0} L(x, λ)    (primal problem)
    max_{λ≥0} min_x L(x, λ)    (dual problem)
  because max_{λ≥0} L(x, λ) ensures the constraints (previous slide).
3:29
The Lagrange dual problem
• The dual function is always a lower bound (for any λᵢ ≥ 0)
    l(λ) = min_x L(x, λ) ≤ [ min_x f(x) s.t. g(x) ≤ 0 ]
  And consequently
    max_{λ≥0} min_x L(x, λ) ≤ min_x max_{λ≥0} L(x, λ)
• We say strong duality holds iff
    max_{λ≥0} min_x L(x, λ) = min_x max_{λ≥0} L(x, λ)
• If the primal is convex, and there exists an interior point
    ∃x : ∀i : gᵢ(x) < 0
  (which is called the Slater condition), then we have strong duality
3:30
And what about algorithms?
• So far we've only introduced a whole lot of formalism, and seen that the Lagrangian sort of represents the constrained problem:
  – min_x L or ∇ₓL = 0 is related to the force balance
  – max_λ L or ∇_λL = 0 is related to the constraints or KKT conditions
  – This implies two dual problems, min_x max_λ L and max_λ min_x L; the second (dual) is a lower bound on the first (primal)
• But what are the algorithms we can get out of this?
3:31
Algorithmic implications of the Lagrangian view
• If min_x L(x, λ) can be solved analytically, we can alternatively solve the (convex) dual problem.
• But more generally:
    Optimization problem → solve the KKT conditions
  → Apply standard algorithms for solving an equation system r(x, λ) = 0:
    Newton method:  ∇r (∆x, ∆λ)ᵀ = −r
  This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, they use the curvature ∇²f to estimate the right λ to push out of the constraint. We will discuss this after we've learnt about 2nd order methods.
3:32
Log barrier method revisited
3:33
Log barrier method revisited
• Log barrier method: Instead of
    min_x f(x)  s.t.  g(x) ≤ 0
  we address
    min_x f(x) − µ Σᵢ log(−gᵢ(x))
• For given µ the optimality condition is
    ∇f(x) − Σᵢ (µ/gᵢ(x)) ∇gᵢ(x) = 0
  or equivalently
    ∇f(x) + Σᵢ λᵢ∇gᵢ(x) = 0 ,  λᵢgᵢ(x) = −µ
  These are called modified (=approximate) KKT conditions.
3:34
Log barrier method revisited
Centering (the unconstrained minimization) in the log barrier
method is equivalent to solving the modified KKT conditions.
Note also: On the central path, the duality gap is mµ:
    l(λ*(µ)) = f(x*(µ)) + Σᵢ λᵢ*gᵢ(x*(µ)) = f(x*(µ)) − mµ
3:35
Phase I: Finding a feasible initialization
3:36
Phase I: Finding a feasible initialization
• An elegant method for finding a feasible point x:
    min_{(x,s)∈Rⁿ⁺¹} s   s.t.  ∀i : gᵢ(x) ≤ s,  s ≥ 0
  or
    min_{(x,s)∈Rⁿ⁺ᵐ} Σᵢ₌₁ᵐ sᵢ   s.t.  ∀i : gᵢ(x) ≤ sᵢ,  sᵢ ≥ 0
3:37
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
3:38
4 Second-Order Methods
2nd order gives better stepsize & direction, Newton methods, adaptive stepsize, Levenberg-Marquardt, Gauss-Newton method, Quasi-Newton methods, BFGS, primal-dual interior point Newton method
Planned Outline
• Gradient-based optimization (1st order methods)
  – plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
  – adaptive stepsize heuristics
• Constrained Optimization
  – squared penalties, augmented Lagrangian, log barrier
  – Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
  – Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
  – constrained case, primal-dual Newton
• Special convex cases
  – Linear Programming, (sequential) Quadratic Programming
  – Simplex algorithm
  – relation to relaxed discrete optimization
• Blackbox optimization (“0th order methods”)
  – blackbox stochastic search
  – Markov Chain Monte Carlo methods
  – evolutionary algorithms
4:1
• So far we relied on gradient-based methods only, in the unconstrained and constrained case
• Today: 2nd order methods, which approximate f(x) locally
  – using the 2nd order Taylor expansion (Hessian ∇²f(x) given)
  – estimating the Hessian from data
• 2nd order methods only work if the Hessian is everywhere positive definite ↔ f(x) is convex, or if it is approximated/modified to be pos. def. (as in Gauss-Newton)
• Note: Approximating f(x) locally or globally is a core concept also in blackbox optimization
Why can 2nd order optimization be better than gradient descent?
• Better direction: (figure comparing plain gradient, conjugate gradient, and 2nd order directions)
• Better stepsize:
  – a full step jumps directly to the minimum of the local squared approximation
  – often this is already a good heuristic
  – additional stepsize reduction and dampening are straightforward
4:3
Outline: 2nd order method
• Newton
• Gauss-Newton
• Quasi-Newton
• BFGS, (L)BFGS
• Their application on constrained problems
4:4
2nd order optimization
• Notation:
  objective function: f : Rⁿ → R
  gradient vector: ∇f(x) = [∂f(x)/∂x]ᵀ ∈ Rⁿ
  Hessian (symmetric matrix):
    ∇²f(x) =
      [ ∂²f/∂x₁∂x₁  ∂²f/∂x₁∂x₂  · · ·  ∂²f/∂x₁∂xₙ ]
      [ ∂²f/∂x₂∂x₁      ⋱                  ⋮      ]
      [ ∂²f/∂xₙ∂x₁  · · ·  · · ·  ∂²f/∂xₙ∂xₙ ]   ∈ Rⁿˣⁿ
  Taylor expansion:
    f(x′) ≈ f(x) + (x′ − x)ᵀ∇f(x) + ½ (x′ − x)ᵀ∇²f(x)(x′ − x)
• Problem:
    min_x f(x)
  where we can evaluate f(x), ∇f(x) and ∇²f(x) for any x ∈ Rⁿ
4:5
Newton method
• For finding roots (zero points) of f(x):
    x ← x − f(x)/f′(x)
• For finding optima of f(x) in 1D:
    x ← x − f′(x)/f′′(x)
  For x ∈ Rⁿ:
    x ← x − ∇²f(x)⁻¹∇f(x)
4:6
Newton method with adaptive stepsize α
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), ∇²f(x), tolerance θ
Output: x
1: initialize stepsize α = 1 and damping λ = 10⁻¹⁰
2: repeat
3:   compute ∆ to solve (∇²f(x) + λI) ∆ = −∇f(x)
4:   repeat // “line search”
5:     y ← x + α∆
6:     if f(y) ≤ f(x) then // step is accepted
7:       x ← y
8:       α ← α^0.5 // increase stepsize towards α = 1
9:     else // step is rejected
10:      α ← 0.1α // decrease stepsize
11:    end if
12:  until step accepted or (in bad case) α||∆||∞ < θ/1000
13: until ||∆||∞ < θ

• Notes:
  – Line 3 computes the (damped) Newton step ∆ = −∇²f(x)⁻¹∇f(x); use the special Lapack routine dposv to solve Ax = b (using Cholesky decomposition)
  – λ is called damping; it makes the parabola more “steep” around the current x.
    For λ → ∞: ∆ becomes colinear with −∇f(x) but |∆| → 0
4:7
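The algorithm above, transcribed to Python with `numpy.linalg.solve` standing in for the dposv call (a sketch; the convex quartic test function is made up):

```python
import numpy as np

def newton_adaptive(f, grad, hess, x, theta=1e-8, damping=1e-10, max_iter=1000):
    # Newton method with adaptive stepsize and tiny fixed damping (slide 4:7)
    x = np.asarray(x, dtype=float)
    alpha = 1.0
    for _ in range(max_iter):
        H = hess(x) + damping * np.eye(len(x))
        delta = np.linalg.solve(H, -grad(x))   # Newton step (slide: dposv/Cholesky)
        if np.max(np.abs(delta)) < theta:
            break
        while True:                            # stepsize adaptation ("line search")
            y = x + alpha * delta
            if f(y) <= f(x):                   # step accepted
                x = y
                alpha = alpha ** 0.5           # increase stepsize towards 1
                break
            alpha *= 0.1                       # step rejected: decrease stepsize
            if alpha * np.max(np.abs(delta)) < theta / 1000:
                break
    return x

# convex non-quadratic test: f(x) = sum_i (x_i^4 + x_i^2), minimum at 0
f = lambda x: np.sum(x**4 + x**2)
grad = lambda x: 4.0 * x**3 + 2.0 * x
hess = lambda x: np.diag(12.0 * x**2 + 2.0)
x_min = newton_adaptive(f, grad, hess, [2.0, -3.0])
```

Because the test function is convex, the full Newton step is always accepted here and α stays at 1.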
Newton method with adaptive damping λ (Levenberg-Marquardt)
(I usually use stepsize adaptation instead of Levenberg-Marquardt)

Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), ∇²f(x), tolerance θ
Output: x
1: initialize damping λ = 10⁻¹⁰
2: repeat
3:   compute ∆ to solve (∇²f(x) + λI) ∆ = −∇f(x)
4:   if f(x + ∆) ≤ f(x) then // step is accepted
5:     x ← x + ∆
6:     λ ← 0.2λ // decrease damping
7:   else // step is rejected
8:     λ ← 10λ // increase damping
9:   end if
10: until λ < 1 and ||∆||∞ < θ
4:8
Computational issues
• Let
  C_f be the computational cost of evaluating f(x) only
  C_eval be the computational cost of evaluating f(x), ∇f(x), ∇²f(x)
  C_∆ be the computational cost of solving (∇²f(x) + λI) ∆ = −∇f(x)
• If C_eval ≫ C_f → proper line search instead of stepsize adaptation
  If C_∆ ≫ C_f → proper line search instead of stepsize adaptation
• However, in many applications (in robotics at least) C_eval ≈ C_f ≫ C_∆
• Often, ∇²f(x) is banded (non-zero around the diagonal only)
  → Ax = b becomes super fast using dpbsv (dynamic programming)
  (If ∇²f(x) is a “tree”: dynamic programming on the “junction tree”)
4:9
Demo
4:10
Gauss-Newton method
• Problem:
    min_x f(x)  where  f(x) = φ(x)ᵀφ(x)
  and we can evaluate φ(x), ∇φ(x) for any x ∈ Rⁿ
• φ(x) ∈ Rᵈ is a vector; each entry contributes a squared cost term to f(x)
• ∇φ(x) is the Jacobian (d × n matrix)
    ∇φ(x) =
      [ ∂φ₁/∂x₁  ∂φ₁/∂x₂  · · ·  ∂φ₁/∂xₙ ]
      [ ∂φ₂/∂x₁      ⋱              ⋮    ]
      [ ∂φ_d/∂x₁  · · ·  · · ·  ∂φ_d/∂xₙ ]   ∈ Rᵈˣⁿ
  with 1st-order Taylor expansion φ(x′) ≈ φ(x) + ∇φ(x)(x′ − x)
4:11
Gauss-Newton method
• The gradient and Hessian of f(x) become
    f(x) = φ(x)ᵀφ(x)
    ∇f(x) = 2∇φ(x)ᵀφ(x)
    ∇²f(x) = 2∇φ(x)ᵀ∇φ(x) + 2φ(x)ᵀ∇²φ(x)
• The Gauss-Newton method is the Newton method for f(x) = φ(x)ᵀφ(x) with the approximation ∇²φ(x) ≈ 0
  The approximate Hessian 2∇φ(x)ᵀ∇φ(x) is always semi-pos-def!
• In the Newton algorithm, replace line 3 by
    3: compute ∆ to solve (∇φ(x)ᵀ∇φ(x) + λI) ∆ = −∇φ(x)ᵀφ(x)
4:12
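A Gauss-Newton sketch on a small nonlinear least-squares problem (the exponential-fit data below is synthetic, generated from a = 2, b = −1, so the residual φ can reach exactly zero):

```python
import numpy as np

def gauss_newton(phi, jac, x, damping=1e-8, theta=1e-8, max_iter=500):
    # Gauss-Newton (slide 4:12): for f(x) = phi(x)^T phi(x), solve
    # (J^T J + damping*I) delta = -J^T phi  instead of using the exact Hessian
    x = np.asarray(x, dtype=float)
    f = lambda x: phi(x) @ phi(x)
    alpha = 1.0
    for _ in range(max_iter):
        J, p = jac(x), phi(x)
        delta = np.linalg.solve(J.T @ J + damping * np.eye(len(x)), -J.T @ p)
        if np.max(np.abs(delta)) < theta:
            break
        y = x + alpha * delta
        if f(y) <= f(x):                   # step accepted
            x, alpha = y, alpha ** 0.5
        else:                              # step rejected
            alpha *= 0.1
    return x

# example least-squares problem: fit a*exp(b*t) to synthetic samples
t = np.array([0.0, 0.5, 1.0, 1.5])
y_data = 2.0 * np.exp(-t)
phi = lambda x: x[0] * np.exp(x[1] * t) - y_data          # residual vector in R^4
jac = lambda x: np.stack([np.exp(x[1] * t),               # d phi / d a
                          x[0] * t * np.exp(x[1] * t)],   # d phi / d b
                         axis=1)
x_fit = gauss_newton(phi, jac, [1.5, -0.5])
```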
Quasi-Newton methods
4:13
Quasi-Newton methods
• Let's take a step back: Assume we cannot evaluate ∇²f(x). Can we still use 2nd order methods?
• Yes: We can approximate ∇²f(x) from the data {(xᵢ, ∇f(xᵢ))}ᵢ₌₁ᵏ of previous iterations
4:14
Basic example
• We've seen already two data points (x₁, ∇f(x₁)) and (x₂, ∇f(x₂))
  How can we estimate ∇²f(x)?
• In 1D:
    ∇²f(x) ≈ (∇f(x₂) − ∇f(x₁)) / (x₂ − x₁)
• In Rⁿ: let y = ∇f(x₂) − ∇f(x₁), ∆x = x₂ − x₁; we require
    ∇²f(x) ∆x = y    and    ∆x = ∇²f(x)⁻¹ y
  and satisfy these with
    ∇²f(x) = (y yᵀ)/(yᵀ∆x)        ∇²f(x)⁻¹ = (∆x ∆xᵀ)/(∆xᵀy)
  Convince yourself that this solves the desired relations.
  [Left: how to update ∇²f(x). Right: how to update directly ∇²f(x)⁻¹.]
4:15
BFGS
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H⁻¹ = Iₙ
2: repeat
3:   compute ∆ = −H⁻¹∇f(x)
4:   perform a line search min_α f(x + α∆)
5:   ∆ ← α∆
6:   y ← ∇f(x + ∆) − ∇f(x)
7:   x ← x + ∆
8:   update H⁻¹ ← (I − (y∆ᵀ)/(∆ᵀy))ᵀ H⁻¹ (I − (y∆ᵀ)/(∆ᵀy)) + (∆∆ᵀ)/(∆ᵀy)
9: until ||∆||∞ < θ

• Notes:
  – The term (∆∆ᵀ)/(∆ᵀy) is the H⁻¹-update as on the previous slide
  – The factors (I − (y∆ᵀ)/(∆ᵀy)) “delete” previous H⁻¹-components
4:16
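The BFGS pseudocode in Python (the crude halving line search and the curvature-condition skip are my assumptions; the slide leaves the line search abstract):

```python
import numpy as np

def bfgs(f, grad, x, theta=1e-8, max_iter=500):
    # BFGS (slide 4:16): maintain an estimate Hinv of the inverse Hessian,
    # built from (step, gradient-difference) pairs
    x = np.asarray(x, dtype=float)
    n = len(x)
    Hinv = np.eye(n)
    for _ in range(max_iter):
        delta = -Hinv @ grad(x)
        alpha = 1.0
        while f(x + alpha * delta) > f(x) and alpha > 1e-14:  # crude line search
            alpha *= 0.5
        delta = alpha * delta
        y = grad(x + delta) - grad(x)
        x = x + delta
        if np.max(np.abs(delta)) < theta:
            break
        yd = y @ delta
        if yd > 1e-12:                     # curvature condition; else skip update
            E = np.eye(n) - np.outer(y, delta) / yd
            Hinv = E.T @ Hinv @ E + np.outer(delta, delta) / yd  # slide's line 8
    return x

# ill-conditioned quadratic test: f(x) = x^T diag(1, 10) x
D = np.diag([1.0, 10.0])
f = lambda x: x @ D @ x
grad = lambda x: 2.0 * D @ x
x_min = bfgs(f, grad, [3.0, -2.0])
```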
Quasi-Newton methods
• BFGS is the most popular of all Quasi-Newton methods.
  Others exist, which differ in the exact H⁻¹-update
• L-BFGS (limited memory BFGS) is a version which does not require to explicitly store H⁻¹ but instead stores the previous data {(xᵢ, ∇f(xᵢ))}ᵢ₌₁ᵏ and manages to compute ∆ = −H⁻¹∇f(x) directly from this data
• Some thoughts:
  In principle, there are alternative ways to estimate H⁻¹ from the data {(xᵢ, f(xᵢ), ∇f(xᵢ))}ᵢ₌₁ᵏ, e.g. using Gaussian Process regression with derivative observations
  – Not only the derivatives but also the values f(xᵢ) should give information on H(x) for non-quadratic functions
  – Should one weight ‘local’ data stronger than ‘far away’ data? (GP covariance function)
4:17
2nd Order Methods for Constrained Optimization
4:18
2nd Order Methods for Constrained Optimization
• No changes at all for
– log barrier
– augmented Lagrangian
– squared penalties
Directly use (Gauss-)Newton/BFGS → will boost performance
of these constrained optimization methods!
4:19
Primal-Dual interior-point Newton Method
• Reconsider slide 3:32 (Algorithmic implications of the Lagrangian
view)
• A core outcome of the Lagrangian theory was the shift in prob-
lem formulation:
find x to minx f(x) s.t. g(x) ≤ 0
→ find x to solve the KKT conditions
4:20
Primal-Dual interior-point Newton Method
• The first and last of the modified (= approximate) KKT conditions
∇f(x) + ∑mi=1 λi∇gi(x) = 0 (“force balance”)
∀i : gi(x) ≤ 0 (primal feasibility)
∀i : λi ≥ 0 (dual feasibility)
∀i : λigi(x) = −µ (complementarity)
can be written as the (n+m)-dimensional equation system
r(x, λ) = 0 ,   r(x, λ) := ( ∇f(x) + ∇g(x)>λ ,  −diag(λ)g(x) − µ1m )
• Newton method to find the root r(x, λ) = 0:
(x, λ) ← (x, λ) − ∇r(x, λ)-1 r(x, λ)
∇r(x, λ) = ( ∇2f(x) + ∑i λi∇2gi(x)    ∇g(x)>      )
           ( −diag(λ)∇g(x)            −diag(g(x)) )   ∈ R(n+m)×(n+m)
4:21
Primal-Dual interior-point Newton Method
• The method requires the Hessians ∇2f(x) and ∇2gi(x)
– One can approximate the constraint Hessians ∇2gi(x) ≈ 0
– Gauss-Newton case: f(x) = φ(x)>φ(x) only requires ∇φ(x)
• This primal-dual method does a joint update of both
– the solution x
– the Lagrange multipliers (constraint forces) λ
No need for nested iterations, as with penalty/barrier methods!
• The above formulation allows for a duality gap µ; choose µ = 0
or consult Boyd how to update on the fly (sec 11.7.3)
• The feasibility constraints gi(x) ≤ 0 and λi ≥ 0 need to be
handled explicitly by the root finder (the line search needs to
ensure these constraints)
4:22
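As a concrete illustration, here is a small Python sketch of one such primal-dual Newton iteration on a made-up 1D toy problem, min (x−2)² s.t. x−1 ≤ 0 (solution x* = 1, λ* = 2); the feasibility safeguards noted above are omitted for brevity:

```python
import numpy as np

mu = 1e-6                          # duality-gap parameter of the modified KKT conditions

def r(z):                          # the (n+m)-dimensional residual r(x, lambda)
    x, lam = z
    return np.array([2 * (x - 2) + lam,        # force balance
                     -lam * (x - 1) - mu])     # perturbed complementarity

def grad_r(z):                     # the Jacobian of r from the slide
    x, lam = z
    return np.array([[2.0,  1.0],              # [ d2f + lam d2g ,  dg      ]
                     [-lam, -(x - 1)]])        # [ -diag(lam) dg , -diag(g) ]

z = np.array([0.0, 1.0])           # start strictly feasible: g(0) < 0, lambda > 0
for _ in range(20):                # Newton iterations on r(x, lambda) = 0
    z = z - np.linalg.solve(grad_r(z), r(z))
x_opt, lam_opt = z
```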
Planned Outline
• Gradient-based optimization (1st order methods)
– plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
– adaptive stepsize heuristics
• Constrained Optimization
– squared penalties, augmented Lagrangian, log barrier
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
– constrained case, primal-dual Newton
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– relation to relaxed discrete optimization
• Black box optimization (“0th order methods”)
– blackbox stochastic search
– Markov Chain Monte Carlo methods
– evolutionary algorithms
4:23
5 Convex Optimization
Convex, quasiconvex, unimodal, convex optimization problem, lin-
ear program (LP), standard form, simplex algorithm, LP-relaxation
of integer linear programs, quadratic programming (QP), sequential
quadratic programming
Planned Outline
• Gradient-based optimization (1st order methods)
– plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
– adaptive stepsize heuristics
• Constrained Optimization
– squared penalties, augmented Lagrangian, log barrier
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• 2nd order methods
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
– constrained case, primal-dual Newton
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– relation to relaxed discrete optimization
• Black box optimization (“0th order methods”)
– blackbox stochastic search
– Markov Chain Monte Carlo methods
– evolutionary algorithms
5:1
Function types
• A function is defined convex iff
f(a x + (1−a) y) ≤ a f(x) + (1−a) f(y)
for all x, y ∈ Rn and a ∈ [0, 1].
• A function is quasiconvex iff
f(a x + (1−a) y) ≤ max{f(x), f(y)}
for any x, y ∈ Rn and a ∈ [0, 1].
..alternatively, iff every sublevel set {x|f(x) ≤ α} is convex.
• [Subjective!] I call a function unimodal iff it has only 1 local
minimum, which is the global minimum
Note: in dimensions n > 1 quasiconvexity is stronger than unimodality
• A general non-linear function is unconstrained and can have
multiple local minima
5:2
convex ⊂ quasiconvex ⊂ unimodal ⊂ general
5:3
Local optimization
• So far I avoided making explicit assumptions about problem con-
vexity: To emphasize that all methods we considered – except
for Newton – are applicable also on non-convex problems.
• The methods we considered are local optimization methods,
which can be defined as
– a method that adapts the solution locally
– a method that is guaranteed to converge to a local minimum
only
• Local methods are efficient
– if the problem is (strictly) unimodal (strictly: no plateaux)
– if time is critical and a local optimum is a sufficiently good
solution
– if the algorithm is restarted very often to hit multiple local op-
tima
5:4
Convex problems
• Convexity is a strong assumption!
• Nevertheless, convex problems are important
– theoretically (convergence proofs!)
– for many real world applications
5:5
Convex problems
• A constrained optimization problem
minx f(x) s.t. g(x) ≤ 0, h(x) = 0
is called convex iff
– f is convex
– each gi, i = 1, ..,m is convex
– h is linear: h(x) = Ax− b, A ∈ Rl×n, b ∈ Rl
• Alternative definition: f convex and the feasible region a convex set
5:6
Linear and Quadratic Programs
• Linear Program (LP)
minx c>x s.t. Gx ≤ h, Ax = b
LP in standard form
minx c>x s.t. x ≥ 0, Ax = b
• Quadratic Program (QP)
minx ½ x>Qx + c>x s.t. Gx ≤ h, Ax = b
where Q is positive definite.
(One also defines Quadratically Constrained Quadratic Programs (QCQP))
5:7
Transforming an LP problem into standard form
• LP problem:
minx c>x s.t. Gx ≤ h, Ax = b
• Define slack variables:
minx,ξ c>x s.t. Gx + ξ = h, Ax = b, ξ ≥ 0
• Express x = x+ − x− with x+, x− ≥ 0:
minx+,x−,ξ c>(x+ − x−)
s.t. G(x+ − x−) + ξ = h, A(x+ − x−) = b, ξ ≥ 0, x+ ≥ 0, x− ≥ 0
where (x+, x−, ξ) ∈ R2n+m
• Now this conforms to the standard form (replacing (x+, x−, ξ) ≡ x, etc.)
minx c>x s.t. x ≥ 0, Ax = b
5:8
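The transformation can be written down mechanically. Below is a small numpy sketch (function and variable names are my own) that builds the standard-form data from (c, G, h, A, b) and lifts a feasible point of the original LP into the new variables:

```python
import numpy as np

def to_standard_form(c, G, h, A, b):
    # lifted variable z = (x+, x-, xi) >= 0; minimize c_std^T z s.t. A_std z = b_std
    n, m = len(c), len(h)
    c_std = np.concatenate([c, -c, np.zeros(m)])
    A_std = np.block([[G, -G, np.eye(m)],                    # G x + xi = h
                      [A, -A, np.zeros((A.shape[0], m))]])   # A x = b
    b_std = np.concatenate([h, b])
    return c_std, A_std, b_std

# toy LP: min x1 + x2  s.t.  -x <= 0  and  x1 + x2 = 1
c = np.array([1.0, 1.0])
G, h = -np.eye(2), np.zeros(2)
A, b = np.array([[1.0, 1.0]]), np.array([1.0])
c_std, A_std, b_std = to_standard_form(c, G, h, A, b)

# lift a feasible x of the original LP: x = x+ - x-, slack xi = h - G x
x = np.array([0.3, 0.7])
z = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0), h - G @ x])
```

One can check that z is feasible for the standard form and has the same objective value as x in the original LP.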
Linear Programming
– Algorithms
– Application: LP relaxation of discrete problems
5:9
Algorithms for Linear Programming
• All of which we know!
– augmented Lagrangian (LANCELOT software), penalty
– log barrier (“interior point method”, “[central] path following”)
– primal-dual Newton
• The simplex algorithm, walking on the constraints
(The emphasis in the notion of interior point methods is to dis-
tinguish from constraint walking methods.)
• Interior point and simplex methods are comparably efficient
Which is better depends on the problem
5:10
Simplex Algorithm
Georg Dantzig (1947)
Note: Not to be confused with the Nelder-Mead method (downhill simplex method)
• We consider an LP in standard form
minx c>x s.t. x ≥ 0, Ax = b
• Note that in a linear program the optimum is always situated at
a corner
5:11
Simplex Algorithm
• The Simplex Algorithm walks along the edges of the polytope,
at every corner choosing the edge that decreases c>x most
• This either terminates at a corner, or leads to an unconstrained
edge (−∞ optimum)
• In practice this procedure is done by “pivoting on the simplex tableau”
5:12
Simplex Algorithm
• The simplex algorithm is often efficient, but in worst case expo-
nential in n and m.
• Interior point methods (log barrier) and, more recently again,
augmented Lagrangian methods have become somewhat more
popular than the simplex algorithm
5:13
LP-relaxations of discrete problems
5:14
Integer linear programming
• An integer linear program (for simplicity binary) is
minx c>x s.t. Ax = b, xi ∈ {0, 1}
• Examples:
– Traveling Salesman: minxij ∑ij cijxij with xij ∈ {0, 1} and several more constraints (e.g. rows and columns of x sum to 1)
– (max)SAT problem: in conjunctive normal form, each clause contributes an additional variable and a term in the objective function; each clause contributes a constraint
Google: “The Power of Semidefinite Programming Relaxations for MAXSAT”
5:15
LP relaxations of integer linear programs
• Instead of solving
minx c>x s.t. Ax = b, xi ∈ {0, 1}
we solve
minx c>x s.t. Ax = b, x ∈ [0, 1]
• Clearly, the relaxed solution is a lower bound on the integer solution (sometimes also called “outer bound” because [0, 1] ⊃ {0, 1})
• Computing the relaxed solution is interesting
– as an “approximation” or initialization to the integer problem
– to be aware of the lower bound (what is achievable)
– in cases where the optimal relaxed solution happens to be
integer
5:16
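As a toy illustration (assuming SciPy is available; the problem data is made up), one can solve the LP-relaxation of a tiny integer program and observe that here the relaxed optimum happens to be integer:

```python
import numpy as np
from scipy.optimize import linprog

# integer program: max x1 + 2 x2  s.t.  x1 + x2 <= 1,  x in {0,1}^2
# LP-relaxation: replace x_i in {0,1} by x_i in [0,1]
c = np.array([-1.0, -2.0])                  # linprog minimizes, so negate
A_ub, b_ub = np.array([[1.0, 1.0]]), np.array([1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, 1)])
x_relaxed = res.x                           # the relaxed optimum is the corner (0, 1)
```

Since the relaxed optimum (0, 1) is already integer, it solves the integer program too.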
Example: MAP inference in MRFs
• Given integer random variables xi, i = 1, .., n, a pairwise Markov Random Field (MRF) is defined as
f(x) = ∑(ij)∈E fij(xi, xj) + ∑i fi(xi)
where E denotes the set of edges.
(Note: any general (non-pairwise) MRF can be converted into a pairwise one, blowing up the number of variables)
• Reformulate with different variables
bi(x) = [xi = x] , bij(x, y) = [xi = x] [xj = y]
These are nm + |E|m2 binary variables
• The indicator variables need to fulfil the constraints
bi(x), bij(x, y) ∈ {0, 1}
∑x bi(x) = 1 (because xi takes exactly one value)
∑y bij(x, y) = bi(x) (consistency between indicators)
5:17
Example: MAP inference in MRFs
• Finding maxx f(x) of an MRF is then equivalent to
maxbi(x),bij(x,y) ∑(ij)∈E ∑x,y bij(x, y) fij(x, y) + ∑i ∑x bi(x) fi(x)
such that
bi(x), bij(x, y) ∈ {0, 1} , ∑x bi(x) = 1 , ∑y bij(x, y) = bi(x)
• The LP-relaxation replaces the constraints by
bi(x), bij(x, y) ∈ [0, 1] , ∑x bi(x) = 1 , ∑y bij(x, y) = bi(x)
This set of feasible b’s is called the marginal polytope (because it describes a space of “probability distributions” that are marginally consistent (but not necessarily globally normalized!))
5:18
Example: MAP inference in MRFs
• Solving the original MAP problem is NP-hard
Solving the LP-relaxation is really efficient
• If the solution of the LP-relaxation turns out to be integer, we’ve
solved the originally NP-hard problem!
If not, the relaxed problem can be discretized to be a good ini-
tialization for discrete optimization
• For binary attractive MRFs (a common case) the solution will
always be integer
5:19
Quadratic Programming
5:20
Quadratic Programming
minx ½ x>Qx + c>x s.t. Gx ≤ h, Ax = b
(The dual of a QP is again a QP)
• Efficient Algorithms:
– Interior point (log barrier)
– Augmented Lagrangian
– Penalty
• Highly relevant applications:
– Support Vector Machines
– Similar types of max-margin modelling methods
5:21
Sequential Quadratic Programming
• We considered general non-linear problems
minx f(x) s.t. g(x) ≤ 0
where we can evaluate f(x), ∇f(x), ∇2f(x) and g(x), ∇g(x), ∇2g(x) for any x ∈ Rn
→ Newton method
• The standard step direction ∆ solves (∇2f(x) + λI) ∆ = −∇f(x)
• Sometimes a better step direction ∆ can be found by solving the local QP-approximation to the problem
min∆ f(x) + ∇f(x)>∆ + ½ ∆>∇2f(x)∆ s.t. g(x) + ∇g(x)>∆ ≤ 0
This is an optimization problem over ∆ and only requires the evaluation of f(x), ∇f(x), ∇2f(x), g(x), ∇g(x) once.
5:22
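SciPy's SLSQP routine implements a variant of this scheme (sequential least-squares QP); a minimal usage sketch on a made-up problem, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

# made-up problem: min x1^2 + x2^2  s.t.  1 - x1 - x2 <= 0
# (SLSQP expects inequality constraints in the form fun(x) >= 0)
res = minimize(fun=lambda x: x @ x,
               x0=np.array([2.0, 0.0]),
               jac=lambda x: 2 * x,
               method='SLSQP',
               constraints=[{'type': 'ineq',
                             'fun': lambda x: x[0] + x[1] - 1.0}])
x_opt = res.x            # analytic solution: (1/2, 1/2)
```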
6 Stochastic Search & Heuristics
Blackbox optimization, stochastic search, (µ+λ)-ES, CMA-ES, Evolutionary Algorithms, Simulated Annealing, Hill Climbing, Nelder-Mead downhill simplex
“Blackbox Optimization”
• The term is not really well defined
– I use it to express that only f(x) can be evaluated
– ∇f(x) or ∇2f(x) are not (directly) accessible
More common terms:
• Global optimization
– This usually emphasizes that methods should not get stuck in local optima
– Very very interesting domain – close analogies to (active) Machine Learning, bandits, POMDPs, optimal decision making/planning, optimal experimental design
– Usually mathematically well founded methods
• Stochastic search or Evolutionary Algorithms or Local Search
– Usually these are local methods (extensions trying to be “more” global)
– Various interesting heuristics
– Some of them (implicitly or explicitly) locally approximate gradients or 2nd order models
6:1
Blackbox Optimization
• Problem: Let x ∈ Rn, f : Rn → R, find
minx f(x)
where we can only evaluate f(x) for any x ∈ Rn
• A constrained version: Let x ∈ Rn, f : Rn → R, g : Rn → {0, 1}, find
minx f(x) s.t. g(x) = 1
where we can only evaluate f(x) and g(x) for any x ∈ Rn
I haven’t seen much work on this. Would be interesting to consider this more rigorously.
6:2
A zoo of approaches
• People with many different backgrounds are drawn into this, ranging from heuristics and Evolutionary Algorithms to heavy mathematics
– Evolutionary Algorithms, esp. Evolution Strategies, Covariance Matrix Adaptation, Estimation of Distribution Algorithms
– Simulated Annealing, Hill Climbing, Downhill Simplex
– local modelling (gradient/Hessian), global modelling
6:3
Optimizing and Learning
• Blackbox optimization is often related to learning:
• When we have a local gradient or Hessian, we can take that local information and run – no need to keep track of the history or learn (exception: BFGS)
• In the Blackbox case we have no local information directly accessible
→ one needs to account for the history in some way or another to have an idea where to continue the search
• “Accounting for the history” very often means learning: Learning
a local or global model of f itself, learning which steps have
been successful recently (gradient estimation), or which step
directions, or other heuristics
6:4
Outline
• Stochastic Search
– A simple framework that many heuristics and local modelling approaches fit in
– Evolutionary Algorithms, Covariance Matrix Adaptation, EDAs as special cases
• Heuristics
– Simulated Annealing
– Hill Climbing
– Downhill Simplex
• Global Optimization
– Framing the big problem: The optimal solution to optimization
– Mentioning very briefly No Free Lunch Theorems
– Greedy approximations, Kriging-type methods
6:5
Stochastic Search
6:6
Stochastic Search
• The general recipe:
– The algorithm maintains a probability distribution pθ(x)
– In each iteration it takes n samples {xi}ni=1 ∼ pθ(x)
– Each xi is evaluated → data {(xi, f(xi))}ni=1
– That data is used to update θ
• Stochastic Search:
Input: initial parameter θ, function f(x), distribution model pθ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:    Sample {xi}ni=1 ∼ pθ(x)
3:    Evaluate samples, D = {(xi, f(xi))}ni=1
4:    Update θ ← h(θ, D)
5: until θ converges
6:7
Stochastic Search
• The parameter θ is the only “knowledge/information” that is be-
ing propagated between iterations
θ encodes what has been learned from the history
θ defines where to search in the future
• Evolutionary Algorithms: θ is a parent population
Evolution Strategies: θ defines a Gaussian with mean & vari-
ance
Estimation of Distribution Algorithms: θ are parameters of
some distribution model, e.g. Bayesian Network
Simulated Annealing: θ is the “current point” and a temperature
6:8
Example: Gaussian search distribution (µ, λ)-ES
From the 1960s/70s: Rechenberg/Schwefel
• Perhaps the simplest type of distribution model
θ = (x) , pθ(x) = N(x|x, σ2)
an n-dimensional isotropic Gaussian with fixed deviation σ
• Update heuristic:
– Given D = {(xi, f(xi))}λi=1, select µ best: D′ = bestOfµ(D)
– Compute the new mean x from D′
• This algorithm is called “Evolution Strategy (µ, λ)-ES”
– The Gaussian is meant to represent a “species”
– λ offspring are generated
– the best µ selected
6:9
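The update heuristic above fits in a few lines of Python; a minimal sketch (the parameter values are arbitrary choices of mine):

```python
import numpy as np

def mu_lambda_es(f, x0, sigma=0.1, mu=5, lam=20, iters=200, rng=None):
    # theta = mean of an isotropic Gaussian with fixed standard deviation sigma
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        offspring = x + sigma * rng.standard_normal((lam, len(x)))  # lambda offspring
        values = np.array([f(s) for s in offspring])
        elite = offspring[np.argsort(values)[:mu]]                  # select the mu best
        x = elite.mean(axis=0)                                      # new mean
    return x

x_opt = mu_lambda_es(lambda x: x @ x, [1.0, 1.0])
```

Note that with fixed σ the mean keeps fluctuating around the optimum at the scale of σ; adapting σ (or the full covariance, as in CMA-ES below) removes this limitation.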
Example: “elitarian” selection (µ+ λ)-ES
• θ also stores the µ best previous points
θ = (x, D′) , pθ(x) = N(x|x, σ2)
• The θ update:
– Select the µ best from D′ ∪D: D′ = bestOfµ(D′ ∪D)
– Compute the new mean x from D′
• Is called “elitarian” because good parents can survive
• Consider the (1 + 1)-ES: a Hill Climber
• There is considerable theory on convergence of, e.g., (1+λ)-ES
6:10
Evolutionary Algorithms (EAs)
• These were two simple examples of EAs
Generally, I think EAs can well be described/understood as very
special kinds of parameterizing pθ(x) and updating θ
– The θ typically is a set of good points found so far (parents)
– Mutation & Crossover define pθ(x)
– The samples D are called offspring
– The θ-update is often a selection of the best,
or “fitness-proportional” or rank-based
• Categories of EAs:
– Evolution Strategies: x ∈ Rn, often Gaussian pθ(x)
– Genetic Algorithms: x ∈ {0, 1}n, crossover & mutation de-
fine pθ(x)
– Genetic Programming: x are programs/trees, crossover &
mutation
– Estimation of Distribution Algorithms: θ directly defines
pθ(x)
6:11
Covariance Matrix Adaptation (CMA-ES)
• An obvious critique of the simple Evolution Strategies:
– The search distribution N(x|x, σ2) is isotropic
(no going forward, no preferred direction)
– The variance σ is fixed!
• Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
6:12
Covariance Matrix Adaptation (CMA-ES)
• In Covariance Matrix Adaptation
θ = (x, σ, C, pσ, pC) , pθ(x) = N(x|x, σ2C)
where C is the covariance matrix of the search distribution
• The θ maintains two more pieces of information: pσ and pC cap-
ture the “path” (motion) of the mean x in recent iterations
• Rough outline of the θ-update:
– Let D′ = bestOfµ(D) be the set of selected points
– Compute the new mean x from D′
– Update pσ and pC proportional to xk+1 − xk
– Update σ depending on |pσ|
– Update C depending on pcp>c (rank-1-update) and Var(D′)
6:13
CMA references
Hansen, N. (2006), ”The CMA evolution strategy: a comparing
review”
Hansen et al.: Evaluating the CMA Evolution Strategy on Multi-
modal Test Functions, PPSN 2004.
• For “large enough” populations local minima are avoided
• A variant:
Igel et al.: A Computational Efficient Covariance Matrix Update
and a (1 + 1)-CMA for Evolution Strategies, GECCO 2006.
6:14
CMA conclusions
• It is a good starting point for an off-the-shelf blackbox algorithm
• It includes components like estimating the local gradient (pσ, pC ),
the local “Hessian” (Var(D′)), smoothing out local minima (large
populations)
6:15
Estimation of Distribution Algorithms (EDAs)
• Generally, EDAs fit the distribution pθ(x) to model the distribution of previously good search points
For instance, if in all previous distributions the 3rd bit equals the 7th bit, then the search distribution pθ(x) should put higher probability on such candidates. pθ(x) is meant to capture the structure in previously good points, i.e. the dependencies/correlations between variables.
• A rather successful class of EDAs on discrete spaces uses graph-
ical models to learn the dependencies between variables, e.g.
Bayesian Optimization Algorithm (BOA)
• In continuous domains, CMA is an example for an EDA
6:16
Further Ideas
• We could learn a distribution over steps
– which steps have decreased f recently→ model
(Related to “differential evolution”)
• We could learn a distributions over directions only
→ sample one→ line search
6:17
Stochastic search conclusions
Input: initial parameter θ, function f(x), distribution model pθ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:    Sample {xi}ni=1 ∼ pθ(x)
3:    Evaluate samples, D = {(xi, f(xi))}ni=1
4:    Update θ ← h(θ, D)
5: until θ converges
• The framework is very general
• The crucial difference between algorithms is their choice of pθ(x)
6:18
Heuristics
– Simulated Annealing
– Hill Climbing
– Simplex
6:19
Simulated Annealing
• Must read!: An Introduction to MCMC for Machine Learning
Input: initial x, function f(x), proposal distribution q(x′|x)
Output: final x
1: initialize T = 1
2: repeat
3:    generate a new sample x′ ∼ q(x′|x)
4:    acceptance probability A = min{ 1, (e−f(x′)/T q(x|x′)) / (e−f(x)/T q(x′|x)) }
5:    With probability A, x ← x′ // ACCEPT
6:    Decrease T
7: until x converges
• Typically: q(x′|x) = N(x′|x, σ2), Gaussian transition probabilities
6:20
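A minimal Python sketch of this loop, with a symmetric Gaussian proposal (so the q-terms in the acceptance ratio cancel) and a simple 1/t cooling schedule (the schedule and all parameter values are my own choices):

```python
import numpy as np

def simulated_annealing(f, x0, sigma=0.5, iters=2000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    best, f_best = x.copy(), f(x)
    for t in range(iters):
        T = max(1.0 / (1 + t), 1e-10)                      # decrease the temperature
        x_new = x + sigma * rng.standard_normal(len(x))    # proposal q(x'|x) = N(x'|x, sigma^2)
        df = f(x_new) - f(x)
        if df <= 0 or rng.random() < np.exp(-df / T):      # Metropolis acceptance
            x = x_new                                      # ACCEPT
            if f(x) < f_best:
                best, f_best = x.copy(), f(x)
    return best

x_opt = simulated_annealing(lambda x: x @ x, [2.0, 2.0])
```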
Simulated Annealing
• Simulated Annealing is a Markov chain Monte Carlo (MCMC)
method.
• These are iterative methods to sample from a distribution, in our
case
p(x) ∝ e−f(x)/T
• For a fixed temperature T , one can show that the set of accepted
points is distributed as p(x) (but non-i.i.d.!)
• The acceptance probability compares the f(x′) and f(x), but
also the reversibility of q(x′|x)
• When cooling the temperature, samples focus at the extrema
• Guaranteed to sample all extrema eventually
6:21
Simulated Annealing
[Wikipedia: simulated annealing animation]
6:22
Hill Climbing
• Same as Simulated Annealing with T = 0
• Same as (1 + 1)-ES
There also exists a CMA version of (1+1)-ES, see the Igel reference above.
• The role of hill climbing should not be underestimated:
Very often it is efficient to repeat hill climbing from many random start points.
• However, no type of learning at all (stepsize, direction)
6:23
Nelder-Mead method – Downhill Simplex Method
6:24
Nelder-Mead method – Downhill Simplex Method
• Let x ∈ Rn
• Maintain n + 1 points x0, .., xn, sorted by f(x0) < ... < f(xn)
• Compute the center c of the points
• Reflect: y = c + α(c − xn)
• If f(y) < f(x0): Expand: y = c + γ(c − xn) with γ > 1
• If f(y) > f(xn-1): Contract: y = c + ρ(xn − c) with 0 < ρ < 1
6:25
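The full iteration, with the common extra shrink step and the standard coefficient choices α = 1, γ = 2, ρ = 1/2 (conventional values, not given on the slide; the center is conventionally taken over all points except the worst), can be sketched as:

```python
import numpy as np

def nelder_mead(f, x0, alpha=1.0, gamma=2.0, rho=0.5, iters=200):
    n = len(x0)
    # initial simplex: x0 plus unit steps along each axis
    pts = [np.array(x0, float)] + [np.array(x0, float) + np.eye(n)[i] for i in range(n)]
    for _ in range(iters):
        pts.sort(key=f)                          # f(x0) <= ... <= f(xn)
        c = np.mean(pts[:-1], axis=0)            # center of all but the worst point
        y = c + alpha * (c - pts[-1])            # reflect
        if f(y) < f(pts[0]):
            ye = c + gamma * (c - pts[-1])       # expand
            pts[-1] = ye if f(ye) < f(y) else y
        elif f(y) < f(pts[-2]):
            pts[-1] = y                          # accept the reflected point
        else:
            yc = c + rho * (pts[-1] - c)         # contract toward the worst point
            if f(yc) < f(pts[-1]):
                pts[-1] = yc
            else:                                # shrink the simplex toward the best point
                pts = [pts[0]] + [pts[0] + 0.5 * (p - pts[0]) for p in pts[1:]]
    pts.sort(key=f)
    return pts[0]

x_opt = nelder_mead(lambda x: (x[0] - 1) ** 2 + 10 * x[1] ** 2, [0.0, 0.0])
```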
7 Global Optimization
Multi-armed bandits, exploration vs. exploitation, navigation through
belief space, upper confidence bound (UCB), global optimization =
infinite bandits, Gaussian Processes, probability of improvement, ex-
pected improvement, UCB
Global Optimization
• Is there an optimal way to optimize (in the Blackbox case)?
• Is there a way to find the global optimum instead of only local?
7:1
Core references
• Jones, D., M. Schonlau, & W. Welch (1998). Efficient global
optimization of expensive black-box functions. Journal of Global
Optimization 13, 455-492.
• Jones, D. R. (2001). A taxonomy of global optimization methods
based on response surfaces. Journal of Global Optimization 21,
345-383.
• Poland, J. (2004). Explicit local models: Towards optimal opti-
mization algorithms. Technical Report No. IDSIA-09-04.
7:2
More up-to-date – very nice GP-UCB introduction
7:3
Outline
• Play a game
• Multi-armed bandits & Upper Confidence Bound (UCB)
• Optimization as infinite bandits; GPs as response surfaces
• Standard criteria:
– Upper Confidence Bound (UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
7:4
Multi-armed bandits
• There are n machines.
Each machine has an average reward fi – but you don’t know
the fi’s.
What do you do?
7:5
Multi-armed bandits
• Let at ∈ {1, .., n} be the choice of machine at time t
Let yt ∈ R be outcome with mean 〈yt〉 = fat
• A policy or strategy maps all the history to a new action:
π : [(a1, y1), (a2, y2), ..., (at-1, yt-1)] 7→ at
• Example objectives: find a policy π that
max 〈∑Tt=1 yt〉   or   max 〈yT〉
or other variants.
7:6
Exploration vs. Exploitation
• Such kinds of problems appear in many contexts
(Global Optimization, AI, Reinforcement Learning, etc)
• In simple domains (standard MDPs), actions influence the (ex-
ternal) world state→ actions navigate through the state space
In learning domains, actions influence your knowledge→ ac-
tions navigate through state and belief space
In multi-armed bandits, the bandits usually do not have an internal statevariable – they are the same every round.
7:7
Exploration vs. Exploitation
• The “knowledge” can be represented as the full history
ht = [(a1, y1), (a2, y2), ..., (at-1, yt-1)]
or, in the Bayesian thinking, as the belief
bt = P(X|ht) = P(ht|X) P(X) / P(ht)
where X is all the (unknown) properties of the world
• In the multi-armed bandit case:
X = (f1, .., fn)
bt = P(X|ht) = ∏i N(fi | yi,t, σi,t) (if bandits are Gaussian)
7:8
Navigating through Belief Space
[Figure: navigation through belief space – a chain b0 →(a1,y1)→ b1 →(a2,y2)→ b2 →(a3,y3)→ b3, where the outcomes yt depend on the unknown world properties X]
– Maximizing for 〈y3〉 requires to have a “good” b2
– Actions a1 and a2 should be planned to achieve the best possible b2
– Action a3 then greedily chooses the machine with highest yi,2
• Exploration: Choose the next action at to min 〈H(bt)〉
• Exploitation: Choose the next action at to max 〈yt〉
• Maximizing for 〈yT〉 (or similar) requires exploration and exploitation
Such policies can in principle be computed → POMDPs (or Lai & Robbins)
But in the following we discuss more efficient 1-step criteria
7:9
Upper Confidence Bound (UCB) selection
1: Initialization: Play each machine once
2: repeat
3:    Play the machine i that maximizes yi + √(2 ln n / ni)
4: until ...
where yi is the average reward of machine i so far, ni is how often machine i has been played so far, and n = ∑i ni is the number of rounds so far
(The ln n makes this work also for non-Gaussian bandits, e.g. heavy-tailed.)
See lane.compbio.cmu.edu/courses/slides_ucb.pdf for a summary of Auer et al.
7:10
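A minimal simulation of this selection rule on Gaussian bandits (the bandit means and all parameter values are invented for illustration):

```python
import numpy as np

def ucb1(means, T=10000, rng=None):
    # simulate Gaussian bandits: playing machine i yields reward ~ N(means[i], 1)
    rng = np.random.default_rng(0) if rng is None else rng
    k = len(means)
    counts = np.zeros(k)            # n_i: how often machine i was played
    sums = np.zeros(k)              # cumulative reward of machine i
    for i in range(k):              # initialization: play each machine once
        sums[i] += means[i] + rng.standard_normal()
        counts[i] += 1
    for t in range(k, T):
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        i = int(np.argmax(ucb))     # play the machine with the highest upper bound
        sums[i] += means[i] + rng.standard_normal()
        counts[i] += 1
    return counts

counts = ucb1([0.0, 0.5, 1.0])
# the best machine (mean 1.0) ends up being played by far the most often
```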
UCB algorithms
• UCB algorithms determine a confidence interval such that
yi − σi < fi < yi + σi
with high probability.
UCB chooses the upper bound of this confidence interval.
There is strong theory on the efficiency of this method in comparison to the optimum.
• UCB methods are also used for planning:
Upper Confidence Bounds for Trees (UCT)
7:11
How exactly is this related to global optimization?
7:12
Global Optimization = infinite bandits
• In global optimization f(x) defines a “reward” for every x ∈ Rn
– Instead of a finite number of actions at we now have xt
• Optimal Optimization could be defined as: find a π that
min 〈∑Tt=1 f(xt)〉   or   min 〈f(xT)〉
• In principle we know what an optimal optimization algorithm would have to do – it is just computationally infeasible (in general)
7:13
Gaussian Processes as belief
• Assume we have a history
ht = [(x1, y1), (x2, y2), ..., (xt-1, yt-1)]
• Gaussian Processes are a Machine Learning method that
– provides a mean estimate f(x) (response surface)
– provides a variance estimate σ2(x) ↔ confidence intervals
• Caveat: One needs to make assumptions about the kernel
(e.g., how smooth the function is)
7:14
1-step criteria based on GPs
• Maximize the Probability of Improvement (MPI)
xt = argmaxx ∫_{−∞}^{y∗} N(y | f(x), σ(x)) dy
• Maximize the Expected Improvement (EI)
xt = argmaxx ∫_{−∞}^{y∗} N(y | f(x), σ(x)) (y∗ − y) dy
• Maximize UCB
xt = argmaxx f(x) + βt σ(x)
[Often, βt = 1 is chosen. UCB theory allows for better choices. See Srinivas et al.]
7:15
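These criteria are easy to compute from a GP posterior. Below is a self-contained numpy sketch using a tiny hand-rolled RBF-kernel GP on made-up 1D data; since we minimize, the "UCB" rule becomes a lower confidence bound (all names and values are my own):

```python
import numpy as np
from math import erf, sqrt, pi

def gp_posterior(X, y, Xq, ell=0.3, noise=1e-6):
    # GP regression with RBF kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    k = lambda A, B: np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(Xq, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2))))   # standard normal CDF
phi = lambda z: np.exp(-0.5 * z ** 2) / sqrt(2 * pi)           # standard normal pdf

X = np.array([0.0, 0.5, 1.0])          # made-up evaluations (we minimize)
y = np.array([1.0, 0.2, 0.6])
y_star = y.min()                       # best value so far

Xq = np.linspace(0.0, 1.0, 101)
mean, std = gp_posterior(X, y, Xq)
z = (y_star - mean) / std
MPI = Phi(z)                                      # probability of improvement
EI = (y_star - mean) * Phi(z) + std * phi(z)      # expected improvement (closed form)
LCB = mean - 1.0 * std                            # confidence-bound rule, beta = 1
x_next = Xq[np.argmax(EI)]                        # next evaluation point
```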
Global Optimization
• Given data, we compute a belief over f(x)
• The belief expresses mean estimate f(x) and confidence σ(x)
– Use Gaussian Processes or other Bayesian ML methods
• Optimal Optimization would imply planning in belief space
• Efficient Global Optimization uses 1-step criteria
– Upper Confidence Bound (UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
• Global Optimization with gradient information
→ Gaussian Processes with derivative observations
7:19
8 Exercises
8.1 Exercise 1
8.1.1 Boyd & Vandenberghe
Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe, “Convex Optimization”. This is for you to get an impression of the book. Learn in particular about their categories of convex and non-linear optimization problems.
8.1.2 First steps
Consider the following functions over x ∈ Rn:

fsq(x) = x>x (2)
fhole(x) = 1 − exp(−x>x) (3)

These would be fairly simple to optimize. We change the conditioning (“skewedness of the Hessian”) of these functions to make them a bit more interesting.

Let c ∈ R be the conditioning parameter; let C be the diagonal matrix with entries C(i, i) = c^((i−1)/(2(n−1))). We define the test functions

f csq(x) = fsq(Cx) (4)
f chole(x) = fhole(Cx) (5)

In the following, use c = 100.
a) Implement these functions and display them over x ∈ [−1, 1]2. You can use any language, Octave/Matlab, Python, C++, R, whatever. Plotting is usually done by evaluating the function on a grid of points, e.g. in Octave

[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
X = [X0(:),X1(:)];
Y = sum(X.*X, 2);
Ygrid = reshape(Y,[20,20]);
hold on;
mesh(X0,X1,Ygrid);
hold off;

Or you can store the grid data in a file and use gnuplot, e.g.

splot [-1:1][-1:1] 'datafile' matrix us ($1/10-1):($2/10-1):3

b) Implement the fixed stepsize gradient descent method to find optima for these functions in n = 2 dimensions. Sample the starting point uniformly, x ∈ U([−3, 3]2), and choose α heuristically. (Ideally, display the optimization path in the plot.)
c) Implement the adaptive stepsize method.
8.2 Exercise 2
8.2.1 Equality Constraint Penalties and aug-
mented Lagrangian
(We don’t need to know what the Lagrangian is (yet) to solve this exercise.)
In the lecture we discussed the squared penalty method for inequality constraints. There is a straight-forward version for equality constraints: Instead of

minx f(x) s.t. h(x) = 0 (6)

we address

minx f(x) + µ ∑mi=1 hi(x)2 (7)

such that the squared penalty pulls the solution onto the constraint h(x) = 0. Assume that if we minimize (7) we end up at a solution x1 for which each hi(x1) is reasonably small, but not exactly zero.

We also mentioned the idea that we could add an additional term which counteracts the violation of the constraint. This can be realized by minimizing

minx f(x) + µ ∑mi=1 hi(x)2 + ∑mi=1 λihi(x) (8)

for a “good choice” of each λi. It turns out we can infer this “good choice” from the solution x1 of (7):

Prove that setting λi = 2µhi(x1) will, if we assume that the gradients ∇f(x) and ∇h(x) are (locally) constant, ensure that the minimum of (8) fulfils exactly the constraints h(x) = 0.

Tip: Think intuitively. Think about how the gradient that arises from the penalty in (7) is now generated via the λi.
8.2.2 Squared Penalties & Log Barriers (worth 2 points)

In the last exercise we defined the “hole function” f chole(x), where we now assume a conditioning c = 4.

Consider the optimization problem

minx f chole(x) s.t. g(x) ≤ 0 (9)
g(x) = ( x>x − 1 , xn + 1/c ) (10)

a) First, assume n = 2 (x ∈ R2 is 2-dimensional), c = 4, and draw on paper what the problem looks like and where you expect the optimum.

b) Implement the Squared Penalty Method. Choose as a start point x = (1/2, 1/2). Plot its optimization path and report on the number of total function/gradient evaluations needed.

c) Test the scaling of the method for n = 10 dimensions.

d) Implement the Log Barrier Method and test as in b) and c). Compare the function/gradient evaluations needed.
8.3 Exercise 3
8.3.1 Lagrangian and dual function
(Taken roughly from ‘Convex Optimization’, Ex. 5.1)
A simple example. Consider the optimization problem

minx x2 + 1 s.t. (x − 2)(x − 4) ≤ 0

with variable x ∈ R.

a) Give the feasible set, the optimal solution x∗, and the optimal value p∗ = f(x∗).

b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗, where p∗ is the optimal value of the primal problem.

c) Derive the dual function l(λ) and plot it (for λ ≥ 0). Derive the dual optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?
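The quantities asked for in a) and c) can be checked numerically on a grid; the following numpy sketch (grid sizes are arbitrary choices) computes p∗ and the dual function l(λ) and lets one verify the lower-bound property:

```python
import numpy as np

xs = np.linspace(-1.0, 6.0, 7001)
f = xs ** 2 + 1.0
feasible = (xs - 2.0) * (xs - 4.0) <= 0.0
p_star = f[feasible].min()                 # primal optimum: x* = 2, p* = 5

# Lagrangian L(x, lambda) = x^2 + 1 + lambda (x-2)(x-4); dual l(lambda) = min_x L
lambdas = np.linspace(0.0, 10.0, 201)
l = np.array([(f + lam * (xs - 2.0) * (xs - 4.0)).min() for lam in lambdas])
lam_star = lambdas[np.argmax(l)]           # analytic answer: lambda* = 2, l(2) = 5
```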
8.3.2 Phase I & Log Barriers
We again consider a constrained optimization problem very similar to the last exercise:

minx ∑ni=1 xi s.t. g(x) ≤ 0 (11)
g(x) = ( x>x − 1 , −x1 ) (12)

a) In the last exercise you’ve implemented basic constrained optimization methods (penalty or log barrier). Use these to find a feasible initialization (Phase I). Do this by solving the (n+1)-dimensional problem

min(x,s)∈Rn+1 s s.t. ∀i : gi(x) ≤ s, s ≥ 0

Initialize this with the infeasible point (1, 1) ∈ R2.

b) Once you’ve found a feasible point, use the standard log barrier method to find the solution to the original problem (11). Start with µ = 1, and decrease it by µ ← µ/10 in each iteration. In each iteration also report λi := −µ/gi(x) at the solution of the unconstrained problem minx f(x) − µ ∑i log(−gi(x)).
8.4 Exercise 4
8.4.1 Gauss-Newton
In x ∈ R2 consider the function
f(x) = φ(x)>φ(x) , φ(x) =
sin(ax1)
sin(acx2)
2x1
2cx2
The function is plotted above for a = 4 (left) and a = 5
(right, having local minima), and conditioning c = 1.The function is non-convex.
a) Implement the Gauss-Newton algorithm to solve the unconstrained minimization problem min_x f(x) for a random start point in x ∈ [−1, 1]^2. Compare the algorithm for a = 4 and a = 5 and conditioning c = 3 with gradient descent.
b) Optimize the function also using the fminunc routine from Octave. (Typically this uses BFGS internally.)
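For reference, a possible Gauss-Newton sketch for this f, assuming a = 4 and c = 3; the crude backtracking safeguard is an addition of mine, not part of the plain Gauss-Newton step:

```python
import numpy as np

a, c = 4.0, 3.0                                    # assumed exercise values

def phi(x):
    return np.array([np.sin(a * x[0]), np.sin(a * c * x[1]),
                     2 * x[0], 2 * c * x[1]])

def J(x):                                          # Jacobian of phi
    return np.array([[a * np.cos(a * x[0]), 0.0],
                     [0.0, a * c * np.cos(a * c * x[1])],
                     [2.0, 0.0],
                     [0.0, 2 * c]])

f = lambda x: phi(x) @ phi(x)

x = np.array([0.8, -0.6])                          # some start in [-1,1]^2
for it in range(100):
    # Gauss-Newton direction: solve (J^T J) d = -J^T phi
    d = np.linalg.solve(J(x).T @ J(x), -J(x).T @ phi(x))
    alpha = 1.0
    while f(x + alpha * d) > f(x):                 # crude backtracking
        alpha *= 0.5
    x = x + alpha * d
    if np.linalg.norm(alpha * d) < 1e-10: break
print(x, f(x))                                     # should converge to x = (0,0), f = 0
```

Note that J^T J is positive definite here (the linear terms 2x_1, 2c x_2 guarantee it), so the Gauss-Newton direction is always a descent direction.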
8.4.2 Newton method on a constrained problem
Use the Newton method to solve the same constrained problem we considered last time,
min_x Σ_{i=1}^n x_i   s.t.   g(x) ≤ 0

g(x) = ( x^T x − 1 ,  −x_1 )^T
You are free to choose the squared penalty, log barrier, or augmented Lagrangian method.
8.5 Exercise 5
Solving real-world problems involves two subproblems:
1) formulating the problem as an optimization problem (conforming to a standard optimization problem category) (→ human)

2) solving the actual optimization problem (→ algorithm)
In the lecture we've seen in some examples (maxSAT, Travelling Salesman, MRFs) that the first step is absolutely non-trivial, especially when trying to formulate problems as a Linear or Quadratic Program. Here is some more training on this. Exercises from Boyd et al: http://www.stanford.edu/˜boyd/cvxbook/bv_cvxbook.pdf
8.5.1 Network flow problem
Solve Exercise 4.12 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.
8.5.2 Minimum fuel optimal control
Solve Exercise 4.16 (pdf page 208) from Boyd & Vandenberghe, Convex Optimization.
8.5.3 Primal-Dual Newton for Quadratic Programming
Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 04:22) in the case of Quadratic Programming. Use the special method for solving block matrix linear equations using the Schur complement (Wikipedia “Schur complement”).
What is the update for a general Linear Program?
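A hedged sketch of the shape of the derivation one might arrive at (my sign conventions; slide 04:22 may use different ones). For the QP min_x ½xᵀQx + cᵀx s.t. Ax ≤ b, with D_λ = diag(λ) and D_g = diag(Ax − b) (negative entries when strictly feasible):

```latex
% KKT residual with barrier parameter \mu:
r_1 = Qx + c + A^\top \lambda, \qquad
r_2 = -D_\lambda (Ax - b) - \mu \mathbf{1}
% Newton step on r(x,\lambda) = 0:
\begin{pmatrix} Q & A^\top \\ -D_\lambda A & -D_g \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix}
= - \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}
% Eliminating \Delta\lambda via the Schur complement:
\big( Q - A^\top D_g^{-1} D_\lambda A \big)\, \Delta x
  = -r_1 - A^\top D_g^{-1} r_2, \qquad
\Delta \lambda = D_g^{-1} \big( r_2 - D_\lambda A\, \Delta x \big)
\end{equation*}
```

Since −D_g⁻¹D_λ has a positive diagonal, the reduced system matrix is positive definite whenever Q is positive semi-definite; for an LP one would set Q = 0 and only the −AᵀD_g⁻¹D_λA term remains.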
8.6 Exercise 6
8.6.1 CMA vs. your own algo
At https://www.lri.fr/˜hansen/cmaes_inmatlab.html there is code for CMA for all languages (I do not recommend the C++ versions).
a) Test CMA with a standard parameter setting on the Rosenbrock function (see Wikipedia). My implementation in C++ is:

double rosenbrock(const arr& x) {
  double f = 0.;
  for(uint i=1; i<x.N; i++)
    f += sqr(x(i) - sqr(x(i-1))) + .01*sqr(1 - x(i-1));
  return f;
}

where sqr computes the square of a double.
CMA should have no problem optimizing this function – but since it always samples a whole population of size λ, the number of evaluations is rather large.
b) Think of any simple alternative stochastic search method (perhaps including line search, or whatever you come up with) to beat CMA on this problem.
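One candidate in the spirit of b) is a (1+1)-ES with 1/5th-rule stepsize adaptation, here on the Rosenbrock variant above; a sketch (the adaptation constants are assumptions, and whether it actually beats CMA is for you to measure):

```python
import numpy as np

def rosenbrock(x):
    # Same function as the C++ snippet above, minimum at x = (1, ..., 1)
    return np.sum((x[1:] - x[:-1]**2)**2 + 0.01 * (1 - x[:-1])**2)

rng = np.random.default_rng(0)                     # fixed seed for reproducibility
x = rng.uniform(-1, 1, size=2)
fx, sigma, evals = rosenbrock(x), 0.1, 1
for t in range(5000):
    y = x + sigma * rng.standard_normal(2)         # sample one offspring
    fy = rosenbrock(y); evals += 1
    if fy <= fx:                                   # (1+1) elitist selection
        x, fx = y, fy
        sigma *= 1.5                               # success: increase stepsize
    else:
        sigma *= 1.5 ** (-0.25)                    # failure: decrease (1/5th rule)
print(evals, x, fx)                                # fx should be close to 0
```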
8.7 Exercise 7
8.7.1 Multi-armed bandits & UCB
Assume there are n = 10 bandits. Each bandit is binary (i.e., y_t ∈ {0, 1}) with P(y_t = 1 | a_t = i) = p_i. The agent has T = 100 rounds to play the machines and aims to maximize Σ_{t=1}^T y_t.
For simplicity, in the following assume that p_i = i/10 for i = 1, .., 10. But the agent does not know this, of course.
a) Implement this bandit scenario using a proper (clock) random seed. (Write a method that receives an a_t and returns a y_t ∈ {0, 1}.) Simulate a random agent that chooses actions a_t ∼ U({1, .., 10}) uniformly. Let the agent play 10 games (each with T = 100 rounds). What is the random agent's average reward?
b) Implement a UCB agent. For this, the agent needs to keep track of how often it has played a machine (n_i) and how often this machine returned y = 1 (let's call this β_i) or y = 0 (let's call this α_i). What is the agent's average reward? (Averaged over 10 games, as above.)
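For orientation, a possible agent of this kind; the bonus term sqrt(2 ln t / n_i) is the standard UCB1 choice and an assumption of mine (the lecture's UCB variant may differ), and a fixed seed replaces the clock seed for reproducibility:

```python
import numpy as np

n, T, rng = 10, 100, np.random.default_rng(0)
p = np.arange(1, n + 1) / 10.0                     # p_i = i/10 (unknown to agent)

def pull(i):                                       # returns y_t in {0, 1}
    return int(rng.random() < p[i])

def ucb_game():
    ni = np.zeros(n)                               # n_i: plays per machine
    beta = np.zeros(n)                             # beta_i: observed y = 1 counts
    total = 0
    for t in range(T):
        if t < n:
            i = t                                  # play each machine once
        else:                                      # UCB1: mean + exploration bonus
            i = np.argmax(beta / ni + np.sqrt(2 * np.log(t) / ni))
        y = pull(i)
        ni[i] += 1; beta[i] += y; total += y
    return total / T

avg = np.mean([ucb_game() for _ in range(10)])
print(avg)   # clearly above the random agent's expected 0.55
```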
c) (Bonus.) Assume the agent knows that the bandits are binary. It can exploit this knowledge: its belief can be

b_t = P((p_1, .., p_n) | h_t) = Π_i Beta(p_i | α_i, β_i)
where Beta is the so-called Beta-distribution over the Bernoulli parameter p_i ∈ [0, 1]. At Wikipedia you can find information on the mean and variance (and also the cumulative distribution function, called the regularized incomplete beta function) of a Beta distribution. How exactly could an agent use this to perhaps become better than the agent in b)?
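One concrete way to use such a Beta belief is Thompson sampling, i.e., sampling one p_i from each posterior and playing the arm with the largest sample; this is an assumption of mine (the lecture may intend a Bayes-UCB style confidence bound instead), sketched here:

```python
import numpy as np

n, T, rng = 10, 100, np.random.default_rng(1)
p = np.arange(1, n + 1) / 10.0

def thompson_game():
    succ = np.zeros(n)                             # counts of y = 1 (beta_i)
    fail = np.zeros(n)                             # counts of y = 0 (alpha_i)
    total = 0
    for t in range(T):
        # sample p_i ~ Beta posterior (uniform Beta(1,1) prior assumed)
        i = np.argmax(rng.beta(succ + 1, fail + 1))
        y = int(rng.random() < p[i])
        succ[i] += y; fail[i] += 1 - y; total += y
    return total / T

avg = np.mean([thompson_game() for _ in range(10)])
print(avg)
```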
8.7.2 Global optimization on the Rosenbrock function
On the webpage you'll find Octave code for GP regression from Carl Rasmussen (gp01pred.m). The script test.m demonstrates how to use it.
Use this code to implement a global optimization method for 2D problems. Test the method

a) on the 2D Rosenbrock function defined in exercise e06, and

b) on the Rastrigin function as defined in exercise e04 with a = 6.
Note that in test.m I've chosen hyperparameters that correspond to assuming: smoothness is given by a kernel width √(1/10); initial value uncertainty (range) is given by √10. How does the performance of the method change with these hyperparameters?
8.7.3 Constrained global optimization?
On slide 6:2 it is speculated that one could consider a constrained blackbox optimization problem as well. How could one approach this in the UCB manner?
9 Topic list
This list summarizes the lecture's content and is intended as a guide for preparation for the exam. (Going through all exercises is equally important!) References to the lecture slides are given in the format (lecture:slide).
9.1 Optimization Problems in General
• Types of optimization problems (1:6)
– General constrained optimization problem definition

– Blackbox, gradient-based, 2nd order

– Understand the differences

– “Upgrades”, e.g. quasi-Newton, global optimization

• There are hardly any coherent texts that cover all three:
– constrained & convex optimization
– stochastic search
– global optimization
• In the lecture we usually only consider inequality constraints (for simplicity of presentation)

– Understand in all cases how equality constraints could also be handled
9.2 Gradient-based Methods
• Plain gradient descent
– Understand the stepsize problem (2:5)
– Stepsize adaptation & monotonicity (2:7)
– Backtracking line search (2:21)
• Steepest descent
– Is the gradient the steepest direction? (2:10,11)

– Covariance (= invariance under linear transformations) of the steepest descent direction (2:12)

• Conjugate gradient (2:15)

– The new direction d′ should be “orthogonal” to the previous d, but relative to the local quadratic shape: d′ᵀAd = 0 (= d′ and d are conjugate)

– On quadratic functions CG converges in n iterations (2:16)
• Rprop (2:19)
– Seems awfully hacky
– Every coordinate is treated separately. No invariance under rotations/transformations. (2:20)
– Change in gradient sign → reduce stepsize;else increase
– Works surprisingly well and robust in practice
• Evaluating optimization costs
– Be aware of differences in convention: sometimes “1 iteration” = many function evaluations (line search)

– Best: always report the number of function evaluations
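The gradient descent with backtracking line search summarized above can be sketched as follows (the constants, initial α = 1, halving factor, and Armijo c = 0.01, are illustrative assumptions, not necessarily the slides' values):

```python
import numpy as np

def gradient_descent(f, df, x, iters=100):
    evals = 0
    for _ in range(iters):
        d = -df(x)                                 # plain gradient direction
        alpha = 1.0
        while True:                                # backtracking line search
            evals += 1
            # Armijo condition: sufficient decrease along d (note grad^T d = -d@d)
            if f(x + alpha * d) <= f(x) + 0.01 * alpha * (-(d @ d)):
                break
            alpha *= 0.5
        x = x + alpha * d
    return x, evals                                # report # function evaluations

# Example: ill-conditioned quadratic f(x) = x1^2 + 10*x2^2
f  = lambda x: x[0]**2 + 10 * x[1]**2
df = lambda x: np.array([2 * x[0], 20 * x[1]])
x, evals = gradient_descent(f, df, np.array([1.0, 1.0]))
print(x, evals)                                    # x close to (0, 0)
```

This also illustrates the evaluation-counting caveat above: one "iteration" here costs several function evaluations inside the line search.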
9.3 Constrained Optimization
• Overview
– General problem definition (3:1)
– Convert to a series of unconstrained problems: penalty and barrier methods

– Convert to a larger unconstrained problem: primal-dual Newton method

– Convert to another constrained problem: the dual problem
• Log barrier method
– Definition (3:6)
– Understand how the barrier gets steeper as µ → 0 (not µ → ∞!)

– Iteratively decreasing µ generates the central path (3:8)

– The gradient of the log barrier generates a Lagrange term with λ_i = −µ/g_i(x)!

→ Each iteration solves the modified (approximate) KKT condition
• Squared penalty method
– Definition (3:12)
– Motivates the Augmented Lagrangian (3:13)
• Augmented Lagrangian
– Definition (equality 3:15, inequality 3:18)
– Role of the squared penalty: “measure” how strongly f pushes into the constraint

– Role of the Lagrangian term: generate the counter force

– Understand that the λ update generates the “desired force” (3:16,17)
• The Lagrangian
– Definition (3:21)
– Using L to solve constrained problems on paper: set both ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0 (3:23)

– Force balance and the first KKT condition (3:24)

– Understand in detail the full KKT conditions (3:25)

– Optima are necessarily saddle points of L (3:27)

– min_x L ↔ first KKT ↔ force balance (3:28)

– max_λ L ↔ complementarity KKT ↔ constraints (3:28)
• Lagrange dual problem
– primal problem: min_x max_{λ≥0} L(x, λ)

– dual problem: max_{λ≥0} min_x L(x, λ)
– Definition of Lagrange dual (3:29)
– Lower bound and strong duality (3:30)
• Primal-dual Newton method to solve the KKT conditions (3:32)

– Definition & description (4:21,22)
• Phase I optimization
– Nice trick to find feasible initialization (3:37)
9.4 Second-Order Methods
• General
– Problem definition (4:5)
– 2nd order information can improve direction & stepsize (4:3)

– The Hessian needs to be pos-def (↔ f(x) is convex) or modified/approximated as pos-def (Gauss-Newton, damping)

– High relevance within the constrained optimization iterations (4:19)
• Newton
– Definition (4:6)
– Adaptive stepsize vs. damping (4:7,8)
• Gauss-Newton
– f(x) is a sum of squared cost terms (4:11)
– The approx. Hessian 2∇φ(x)ᵀ∇φ(x) is always semi-pos-def! (4:12)
• Quasi-Newton
– Accumulate gradient information to approximate a Hessian (4:14,15)
– BFGS (4:16)
9.5 Convex Optimization
• Definitions
– Convex, quasiconvex, unimodal functions (5:2,3)
– Convex optimization problem (5:6)
• Linear Programming
– General and standard form definition (5:7)
– Converting into standard form (5:8)
– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods

– The Simplex Algorithm is the classical alternative; it walks on the constraint edges instead of through the interior (5:12)

– Very important application of LPs: LP-relaxations of integer linear programs (5:16)
• Quadratic Programming
– Definition (5:21)
– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods

– Sequential QP solves general (non-quadratic) problems by defining a local QP for the step direction, followed by a line search in that direction (5:22)
9.6 Stochastic Search & Heuristics
• Generic Stochastic Search Recipe
– Definition (6:7)
– Understand the crucial role of θ (6:8)
– In Optimal Optimization, θ is the belief (7:8,9)
– In EAs, θ may include populations (6:11)
– Categories of EAs: ES, GA, GP, EDA (6:11)
– In ESs, θ are parameters of a Gaussian (6:9,10,13)
– In Simulated Annealing, θ is the current point(6:20)
– In the downhill Simplex method, θ is the set of n + 1 points (6:24)
• with Gaussian distributions
– (µ, λ)-ES and (µ + λ)-ES: sampling, selection,update (6:9,10)
– CMA: adapting C and σ based on the path ofthe mean (6:13,14)
• Simulated Annealing
– Acceptance probability: ratio of exp-values corrected by transition reversibility (6:20,21)
– Typically Gaussian transition probabilities
– Samples from p(x) ∝ e−f(x)/T (6:21,22)
– Cooling scheme decreases temperature
• Hill Climbing
– Same as Simulated Annealing for T = 0
– Same as (1 + 1)-ES
• Nelder-Mead Downhill Simplex
– Reflect, expand, contract (6:25)
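For a quick hands-on check of the reflect/expand/contract scheme, SciPy ships a Nelder-Mead implementation (an off-the-shelf reference, not the lecture's own pseudocode); note that it needs no gradients:

```python
import numpy as np
from scipy.optimize import minimize

# Simple ill-conditioned quadratic with minimum at (1, -2)
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2

res = minimize(f, x0=np.zeros(2), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8})
print(res.x, res.nfev)                             # x near (1, -2); nfev = # f-evals
```

Comparing res.nfev against a gradient-based run on the same function illustrates the price of being derivative-free.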
9.7 Global Optimization
• Multi-armed bandit framework
– Problem definition (7:6)
– Understand the concepts of exploration, exploitation & belief (7:7,8,9)

– Optimal Optimization would imply planning (exactly) through belief space (7:9)

– Upper Confidence Bound (UCB) and confidence interval (7:10,11)
– UCB is optimistic
• Global optimization
– Global optimization = infinite bandits (7:13)
– Locally correlated bandits → Gaussian Process beliefs (7:14)
– Maximum Probability of Improvement (7:15)
– Expected Improvement (7:15)