Introduction to Global Optimization
Fabio Schoen, 2008
http://gol.dsi.unifi.it/users/schoen
Global Optimization Problems
min_{x ∈ S ⊆ R^n} f(x)

What is meant by global optimization? Of course we would like to find

f* = min_{x ∈ S ⊆ R^n} f(x)

and

x* = arg min f(x), i.e. f(x*) ≤ f(x) ∀ x ∈ S
This definition is unsatisfactory:

the problem is “ill posed” in x (two objective functions which differ only slightly might have global optima which are arbitrarily far apart)

it is however well posed in the optimal values: ‖f − g‖ ≤ δ ⇒ |f* − g*| ≤ ε
Quite often we are satisfied with looking for f* and searching for one or more feasible solutions such that

f(x) ≤ f(x*) + ε

Frequently, however, even this is too ambitious a task!
Research in Global Optimization
the problem is highly relevant, especially in applications

the problem is very hard (perhaps too hard) to solve

there are plenty of publications on global optimization algorithms for specific problem classes

there are only relatively few papers with relevant theoretical content

often elegant theories have produced weak algorithms and, vice versa, the best computational methods often lack a sound theoretical support
many global optimization papers get published in applied research journals

Bazaraa, Sherali, Shetty, “Nonlinear Programming: Theory and Algorithms”, 1993: the words “global optimum” appear for the first time on page 99, the second time on page 132, then on page 247:

“A desirable property of an algorithm for solving [an optimization] problem is that it generates a sequence of points converging to a global optimal solution. In many cases, however, we may have to be satisfied with less favorable outcomes.”

After this (in 638 pages) it never appears again. “Global optimization” is never cited.
A similar situation in Bertsekas, “Nonlinear Programming” (1999): 777 pages, but only the definition of global minima and maxima is given!

Nocedal & Wright, “Numerical Optimization”, 2nd edition, 2006: “Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate. . . many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied.”
Complexity
Global optimization is “hopeless”: without “global” information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of “global” information – some examples:

the number of local optima

the global optimum value

for global optimization problems over a box, (an upper bound on) the Lipschitz constant:

|f(y) − f(x)| ≤ L ‖x − y‖  ∀ x, y

concavity of the objective function + convexity of the feasible region

an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible set)
Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: quadratic programming,

min_{l ≤ Ax ≤ u} (1/2) x^T Q x + c^T x

is NP-hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990].
Many special cases are still NP-hard:

norm maximization on a parallelotope:

max ‖x‖  s.t.  b ≤ Ax ≤ c

quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative

quadratic minimization over a simplex:

min_{x ≥ 0} (1/2) x^T Q x + c^T x  s.t.  ∑_j x_j = 1

Even checking that a point is a local optimum is NP-hard.
Applications of global optimization
concave minimization – quantity discounts, scale economies

fixed charge problems

combinatorial optimization – binary linear programming:

min c^T x + K x^T (1 − x)
Ax = b
x ∈ [0, 1]

or:

min c^T x
Ax = b
x ∈ [0, 1]
x^T (1 − x) = 0
Minimization of cost functions which are neither convex nor concave. E.g.: finding the minimum energy conformation of complex molecules – Lennard-Jones micro-clusters, protein folding, protein–ligand docking.

Example: Lennard-Jones. The pair potential due to two atoms at X1, X2 ∈ R^3 is

v(r) = 1/r^12 − 2/r^6

where r = ‖X1 − X2‖. The total energy of a cluster of N atoms located at X1, ..., XN ∈ R^3 is defined as

∑_{i=1,...,N} ∑_{j<i} v(‖Xi − Xj‖)

This function has a number of local (non-global) minima which grows like exp(N).
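As a concrete illustration (a Python sketch, not from the slides), the cluster energy in this reduced form is only a few lines; the pair potential attains its minimum value −1 at r = 1:

```python
import numpy as np

def lj_energy(X):
    """Total Lennard-Jones energy of a cluster.

    X: (N, 3) array of atom positions. Uses the reduced pair
    potential v(r) = r**-12 - 2*r**-6, minimized (value -1) at r = 1.
    """
    N = X.shape[0]
    E = 0.0
    for i in range(N):
        for j in range(i):
            r = np.linalg.norm(X[i] - X[j])
            E += r**-12 - 2.0 * r**-6
    return E

# Two atoms at unit distance attain the pair minimum:
print(lj_energy(np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0]])))   # -1.0
```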
Lennard-Jones potential

[figure: plot of the attractive term, the repulsive term and the total Lennard-Jones pair potential v(r) for r ∈ [0.5, 5]]
Protein folding and docking

Potential energy model: E = El + Ea + Ed + Ev + Ee, where:

El = ∑_{i∈L} (1/2) K^b_i (r_i − r_i^0)²   (contribution of pairs of bonded atoms)

Ea = ∑_{i∈A} (1/2) K^θ_i (θ_i − θ_i^0)²   (angle between 3 bonded atoms)

Ed = ∑_{i∈T} (1/2) K^φ_i [1 + cos(n φ_i − γ)]   (dihedrals)
Ev = ∑_{(i,j)∈C} ( A_ij / R_ij^12 − B_ij / R_ij^6 )   (van der Waals)

Ee = (1/2) ∑_{(i,j)∈C} q_i q_j / (ε R_ij)   (Coulomb interaction)
Docking
Given two macro-molecules M1, M2, find their minimal energy coupling. If no bonds are changed, then to find the optimal docking it is sufficient to minimize:

Ev + Ee = ∑_{i∈M1, j∈M2} ( A_ij / R_ij^12 − B_ij / R_ij^6 ) + (1/2) ∑_{i∈M1, j∈M2} q_i q_j / (ε R_ij)
Main algorithmic strategies
Two main families:

1. with global information (“structured problems”)
2. without global information (“unstructured problems”)

Structured problems ⇒ stochastic and deterministic methods
Unstructured problems ⇒ typically stochastic algorithms

Every global optimization method should try to find a balance between

exploration of the feasible region
approximation of the optimum
Example: Lennard Jones
LJ_N = min LJ(X) = min ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} ( 1/‖Xi − Xj‖^12 − 2/‖Xi − Xj‖^6 )

This is a highly structured problem. But is it easy/convenient to use its structure? And how?
LJ
The map

F1 : R^{3N} → R^{N(N−1)/2}_+
F1(X1, ..., XN) = ( ‖X1 − X2‖², ..., ‖X_{N−1} − X_N‖² )

is convex, and the function

F2 : R^{N(N−1)/2}_+ → R
F2(r_12, ..., r_{N−1,N}) = ∑ 1/r_ij^6 − 2 ∑ 1/r_ij^3

is the difference between two convex functions. Thus LJ(X) can be seen as the difference between two convex functions (a d.c. programming problem).
NB: every C² function is d.c., but often its d.c. decomposition is not known. D.C. optimization is very elegant and there exists a nice duality theory, but algorithms are typically very inefficient.
A primal method for d.c. optimization
A “cutting plane” method (just an example, not particularly efficient, useless for high dimensional problems). Any unconstrained d.c. problem can be represented as an equivalent problem with a linear objective, a convex constraint and a reverse convex constraint. If g, h are convex, then min g(x) − h(x) is equivalent to:

min z
g(x) − h(x) ≤ z

which is equivalent to

min z
g(x) ≤ w
h(x) + z ≥ w
D.C. canonical form
min c^T x
g(x) ≤ 0
h(x) ≥ 0

where h, g are convex. Let

Ω = {x : g(x) ≤ 0},  C = {x : h(x) ≤ 0}

Assumptions: 0 ∈ int Ω ∩ int C and c^T x > 0 ∀ x ∈ Ω \ int C.

Fundamental property: if a D.C. problem admits an optimum, at least one optimum belongs to ∂Ω ∩ ∂C.
Discussion of the assumptions
g(0) < 0, h(0) < 0, c^T x > 0 for all feasible x. Let x̄ be a solution to the convex problem

min c^T x  s.t.  g(x) ≤ 0

If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise c^T x > c^T x̄ for all feasible x. Coordinate transformation y = x − x̄:

min c^T y
ḡ(y) ≤ 0
h̄(y) ≥ 0

where ḡ(y) = g(y + x̄). Then c^T y > 0 for all feasible solutions and h̄(0) > 0; by continuity it is possible to choose x̄ so that ḡ(0) < 0.
[figure: the convex set Ω, the reverse convex set C containing the origin, and the level line c^T x = 0]
Let x̄ be the best known solution. Let

D(x̄) = {x ∈ Ω : c^T x ≤ c^T x̄}

If D(x̄) ⊆ C then x̄ is optimal. Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒ optimal solution. Otherwise let v be the best feasible vertex; the intersection of the segment [0, v] with ∂C (if feasible) is an improving point x̄. Otherwise a cut is introduced in P which is tangent to Ω in x̄.
[figure: the set D(x̄) = {x ∈ Ω : c^T x ≤ c^T x̄} inside Ω, together with C and the level line c^T x = 0]
Initialization
Given a feasible solution x̄, take a polytope P such that

P ⊇ D(x̄)

i.e.

y : c^T y ≤ c^T x̄, y feasible ⇒ y ∈ P

If P ⊂ C, i.e. if y ∈ P ⇒ h(y) ≤ 0, then x̄ is optimal. Checking this is easy if we know the vertices of P.
[figure: a polytope P ⊇ D(x̄) with vertices V1, ..., Vk; V⋆ := arg max_j h(Vj)]
Step 1
Let V⋆ be the vertex with largest h(·) value. Surely h(V⋆) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C). Thus the segment from V⋆ to 0 must intersect the boundary of C. Let x_k be the intersection point. It might be feasible (⇒ improving) or not.
[figure: the intersection point x_k = ∂C ∩ [V⋆, 0]]
[figure: if x_k ∈ Ω, set x̄ := x_k]
[figure: otherwise, if x_k ∉ Ω, the polytope is divided]
Duality for d.c. problems
min_{x∈S} g(x) − h(x)

where g, h are convex. Let

h⋆(u) := sup {u^T x − h(x) : x ∈ R^n}
g⋆(u) := sup {u^T x − g(x) : x ∈ R^n}

be the conjugate functions of h and g. The problem

inf {h⋆(u) − g⋆(u) : h⋆(u) < +∞}

is the Fenchel–Rockafellar dual. If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual.
If x⋆ ∈ arg min g(x) − h(x), then

u⋆ ∈ ∂h(x⋆)

(∂ denotes the subdifferential) is dual optimal, and if u⋆ ∈ arg min h⋆(u) − g⋆(u), then

x⋆ ∈ ∂g⋆(u⋆)

is an optimal primal solution.
A primal/dual algorithm
P_k : min g(x) − (h(x_k) + (x − x_k)^T y_k)

and

D_k : min h⋆(y) − (g⋆(y_{k−1}) + x_k^T (y − y_{k−1}))
Exact Global Optimization
GlobOpt - relaxations
Consider the global optimization problem (P):

min f(x)
x ∈ X

and assume the min exists and is finite, and that we can use a relaxation (R):

min g(y)
y ∈ Y

Usually both X and Y are subsets of the same space R^n.

Recall: (R) is a relaxation of (P) iff:

X ⊆ Y
g(x) ≤ f(x) for all x ∈ X
Branch and Bound
1. Solve the relaxation (R) and let L be its (global) optimum value (assume (R) is feasible)

2. (Heuristically) solve the original problem (P) (or, more generally, find a “good” feasible solution to (P) in X). Let U be the best feasible function value known

3. If U − L ≤ ε then stop: U is a certified ε-optimum for (P)

4. Otherwise split X and Y into two parts and apply the same method to each of them (a sketch of the whole scheme follows)
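A minimal Python sketch of this scheme. The helpers `lower_bound(box)` (optimal value of a relaxation on the box) and `heuristic(box)` (a feasible point in the box) are assumptions supplied by the user, not something from the slides:

```python
import heapq

def branch_and_bound(f, lower_bound, heuristic, box, eps=1e-6, max_nodes=100000):
    """Generic box-based branch and bound.

    f: objective; lower_bound(box) -> relaxation value L on the box;
    heuristic(box) -> a feasible point in the box; box: list of (lo, hi).
    """
    U = f(heuristic(box))                      # best known upper bound
    heap = [(lower_bound(box), box)]           # nodes ordered by lower bound
    nodes = 0
    while heap and nodes < max_nodes:
        L, b = heapq.heappop(heap)
        if U - L <= eps:                       # certified eps-optimum
            break
        nodes += 1
        # split the box along its longest edge
        i = max(range(len(b)), key=lambda k: b[k][1] - b[k][0])
        lo, hi = b[i]
        mid = 0.5 * (lo + hi)
        for part in (b[:i] + [(lo, mid)] + b[i+1:],
                     b[:i] + [(mid, hi)] + b[i+1:]):
            U = min(U, f(heuristic(part)))     # update the incumbent
            Lp = lower_bound(part)
            if Lp < U - eps:                   # keep only promising nodes
                heapq.heappush(heap, (Lp, part))
    return U
```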
Tools
“good relaxations”: easy yet accurate
good upper bounding, i.e., good heuristics for (P)
Good relaxations can be obtained, e.g., through:
convex relaxations
domain reduction
Convex relaxations
Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P).

g(x) is a convex under-estimator of f on X if:

g(x) is convex
g(x) ≤ f(x) ∀ x ∈ X

g is the convex envelope of f on X if:

g is a convex under-estimator of f
g(x) ≥ h(x) ∀ x ∈ X, for every convex under-estimator h of f
A 1-D example

[figure]

Convex under-estimator

[figure]

Branching

[figure]

Bounding

[figure: branch-and-bound tree with the upper bound, the lower bounds and the fathomed nodes]
Relaxation of the feasible domain
Let

min_{x∈S} f(x)

be a GlobOpt problem where f is convex, while S is non-convex. A relaxation (outer approximation) is obtained by replacing S with a larger set Q. If Q is convex ⇒ convex optimization problem. If the optimal solution of

min_{x∈Q} f(x)

belongs to S ⇒ it is an optimal solution of the original problem.
Example
min_{x∈[0,5], y∈[0,3]} −x − 2y
s.t. xy ≤ 3

[figure: the box [0, 5] × [0, 3] and the non-convex feasible region xy ≤ 3]
Relaxation
min_{x∈[0,5], y∈[0,3]} −x − 2y
s.t. xy ≤ 3

We know that

(x + y)² = x² + y² + 2xy

thus

xy = ((x + y)² − x² − y²)/2

and, as x and y are non-negative, x² ≤ 5x and y² ≤ 3y; thus a (convex) relaxation of xy ≤ 3 is

(x + y)² − 5x − 3y ≤ 6

(a convex constraint)
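A quick numerical check of this relaxed problem (a sketch using scipy, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Convex relaxation: min -x - 2y
#   s.t. (x+y)^2 - 5x - 3y <= 6, x in [0,5], y in [0,3]
res = minimize(
    lambda v: -v[0] - 2 * v[1],
    x0=np.array([1.0, 1.0]),
    bounds=[(0, 5), (0, 3)],
    constraints=[{"type": "ineq",
                  "fun": lambda v: 6 - (v[0] + v[1])**2 + 5*v[0] + 3*v[1]}],
)
print(res.x, res.fun)   # approximately (2, 3), value -8
```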
Relaxation
[figure: the convex relaxed feasible region]

Optimal solution of the relaxed convex problem: (2, 3) (value: −8)
Stronger Relaxation
min_{x∈[0,5], y∈[0,3]} −x − 2y
s.t. xy ≤ 3

Since x ≤ 5 and y ≤ 3:

(5 − x)(3 − y) ≥ 0 ⇒ 15 − 3x − 5y + xy ≥ 0 ⇒ xy ≥ 3x + 5y − 15

Thus a (convex) relaxation of xy ≤ 3 is

3x + 5y − 15 ≤ 3,  i.e.  3x + 5y ≤ 18
Relaxation
[figure: the linear relaxation 3x + 5y ≤ 18 over the box]

The optimal solution of the convex (linear) relaxation is (1, 3), which is feasible ⇒ optimal for the original problem.
Convex (concave) envelopes
How to build convex envelopes of a function, or how to relax a non-convex constraint?

Convex envelopes ⇒ lower bounds
Convex envelopes of −f(x) ⇒ upper bounds

Constraint g(x) ≤ 0 ⇒ if h(x) is a convex underestimator of g, then h(x) ≤ 0 is a convex relaxation.
Constraint g(x) ≥ 0 ⇒ if h(x) is concave and h(x) ≥ g(x), then h(x) ≥ 0 is a “convex” constraint.
Convex envelopes
Definition: a function is polyhedral if it is the pointwise maximum of a finite number of linear functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.)

The generating set X of a function f over a convex set P is the set

X = {x ∈ R^n : (x, f(x)) is a vertex of epi(conv_P(f))}

I.e., given f we first build its convex envelope on P and then take the epigraph of the envelope, {(x, y) : x ∈ P, y ≥ conv_P f(x)}. This is a convex set whose extreme points can be denoted by V; X consists of the x coordinates of V.
Generating sets

[figures: one-dimensional examples of generating sets of convex envelopes]
Characterization
Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if

X(f) = Vert(P)

(the generating set is the vertex set of P).

Corollary: let f1, ..., fm ∈ C¹(P) and let ∑_i fi(x) possess a polyhedral convex envelope on P. Then

Conv(∑_i fi(x)) = ∑_i Conv fi(x)

iff the generating set of ∑_i Conv(fi(x)) is Vert(P).
Characterization
If f(x) is such that Conv f(x) is polyhedral, then an affine function h(x) such that

1. h(x) ≤ f(x) for all x ∈ Vert(P)

2. there exist n + 1 affinely independent vertices of P, V1, ..., V_{n+1}, such that f(Vi) = h(Vi), i = 1, ..., n + 1

belongs to the polyhedral description of Conv f(x), and

h(x) = conv f(x)

for any x ∈ Conv(V1, ..., V_{n+1}).
Characterization
The condition may be reversed: given m affine functions h1, ..., hm such that, for each of them,

1. hj(x) ≤ f(x) for all x ∈ Vert(P)

2. there exist n + 1 affinely independent vertices of P, V1, ..., V_{n+1}, such that f(Vi) = hj(Vi), i = 1, ..., n + 1

then the function ψ(x) = max_j hj(x) is the (polyhedral) convex envelope of f iff

the generating set of ψ is Vert(P)
for every vertex Vi we have ψ(Vi) = f(Vi)
Sufficient condition
If f(x) is lower semi-continuous on P and for all x ∉ Vert(P) there exists a line ℓx such that x is in the interior of P ∩ ℓx and f is concave in a neighborhood of x on ℓx, then Conv f(x) is polyhedral.

Application: let

f(x) = ∑_{i,j} αij xi xj

The sufficient condition holds for f on [0, 1]^n ⇒ bilinear forms are polyhedral on a hypercube.
Application: a bilinear term
(Al-Khayyal, Falk (1983)): let x ∈ [ℓx, ux], y ∈ [ℓy, uy]. Then the convex envelope of xy on [ℓx, ux] × [ℓy, uy] is

φ(x, y) = max{ℓy x + ℓx y − ℓx ℓy; uy x + ux y − ux uy}

In fact, φ(x, y) is an under-estimate of xy:

(x − ℓx)(y − ℓy) ≥ 0 ⇒ xy ≥ ℓy x + ℓx y − ℓx ℓy

and analogously for xy ≥ uy x + ux y − ux uy.
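A small Python sketch of this envelope, with illustrative bounds (not from the slides):

```python
def bilinear_convex_envelope(x, y, lx, ux, ly, uy):
    """Convex envelope of x*y on the box [lx, ux] x [ly, uy]
    (Al-Khayyal & Falk): max of the two supporting planes."""
    return max(ly * x + lx * y - lx * ly,
               uy * x + ux * y - ux * uy)

# On [0, 1] x [0, 1] the envelope matches xy at the vertices:
for (x, y) in [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]:
    print((x, y), x * y, bilinear_convex_envelope(x, y, 0, 1, 0, 1))
# At (0.5, 0.5): xy = 0.25 but the envelope gives 0.0 (strict underestimate)
```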
Bilinear terms
xy ≥ φ(x, y) = max{ℓy x + ℓx y − ℓx ℓy; uy x + ux y − ux uy}

No other (polyhedral) function underestimating xy is tighter. In fact, ℓy x + ℓx y − ℓx ℓy belongs to the convex envelope: it underestimates xy and coincides with it at 3 vertices, (ℓx, ℓy), (ℓx, uy), (ux, ℓy). Analogously for the other affine function. All vertices are interpolated by these 2 underestimating hyperplanes ⇒ they form the convex envelope of xy.
All easy then?
Of course not! Many things can go wrong . . .

It is true that, on the hypercube, a bilinear form ∑_{i<j} αij xi xj is polyhedral (easy to see), but we cannot guarantee in general that the generating set of the envelope is the vertex set of the hypercube (in particular, when the α's have opposite signs).

If the set is not a hypercube, even a single bilinear term might be non-polyhedral: e.g. xy on the triangle 0 ≤ x ≤ y ≤ 1.

Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope P is NP-hard!
Fractional terms
A convex underestimate of a fractional term x/y over a box can be obtained through

w ≥ ℓx/y + x/uy − ℓx/uy              if ℓx ≥ 0
w ≥ x/uy − ℓx y/(ℓy uy) + ℓx/ℓy      if ℓx < 0
w ≥ ux/y + x/ℓy − ux/ℓy              if ℓx ≥ 0
w ≥ x/ℓy − ux y/(ℓy uy) + ux/uy      if ℓx < 0

(a better underestimate exists)
Univariate concave terms
If f(x), x ∈ [ℓx, ux], is concave, then the convex envelope is simply its linear interpolation at the extremes of the interval:

f(ℓx) + ((f(ux) − f(ℓx))/(ux − ℓx)) (x − ℓx)
Underestimating a general nonconvex function
Let f(x) ∈ C² be a general non-convex function. Then a convex underestimate on a box can be defined as

φ(x) = f(x) − ∑_{i=1}^n αi (xi − ℓi)(ui − xi)

where the αi > 0 are parameters. The Hessian of φ is

∇²φ(x) = ∇²f(x) + 2 diag(α)

φ is convex iff ∇²φ(x) is positive semi-definite on the box.
How to choose the αi's? One possibility is the uniform choice αi = α. In this case convexity of φ is obtained iff

α ≥ max{0, −(1/2) min_{x∈[ℓ,u]} λmin(x)}

where λmin(x) is the minimum eigenvalue of ∇²f(x). (A sketch of the resulting underestimator follows.)
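A Python sketch of the resulting underestimator, assuming α has already been computed (e.g. from the eigenvalue bound above); the test function is an assumption chosen only for illustration:

```python
import numpy as np

def box_quadratic_underestimator(f, alpha, lo, hi):
    """Return phi(x) = f(x) - sum_i alpha_i (x_i - l_i)(u_i - x_i),
    a convex underestimator of f on the box [lo, hi] when each
    alpha_i >= -lambda_min(Hessian of f)/2 over the box."""
    lo, hi, alpha = map(np.asarray, (lo, hi, alpha))
    def phi(x):
        x = np.asarray(x, dtype=float)
        return f(x) - np.sum(alpha * (x - lo) * (hi - x))
    return phi

# Hypothetical example: f(x) = sin(x) on [0, 2*pi]; f'' = -sin(x) >= -1,
# so alpha = 0.5 suffices for convexity of phi.
phi = box_quadratic_underestimator(lambda x: np.sin(x[0]),
                                   [0.5], [0.0], [2 * np.pi])
print(phi([np.pi]))   # well below f(pi) = 0
```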
Key properties
φ(x) ≤ f(x)
φ interpolates f at all vertices of [ℓ, u]
φ is convex
Maximum separation:

max (f(x) − φ(x)) = (1/4) α ∑_i (ui − ℓi)²

Thus the error in underestimation decreases when the box is split.
Estimation of α
Compute an interval Hessian [H] on [ℓ, u], with [H(x)]ij = [h^L_ij(x), h^U_ij(x)]. Find α such that [H] + 2 diag(α) is positive semi-definite.

Gerschgorin's theorem for real matrices:

λmin ≥ min_i ( hii − ∑_{j≠i} |hij| )

Extension to interval matrices:

λmin ≥ min_i ( h^L_ii − ∑_{j≠i} max{|h^L_ij|, |h^U_ij|} (uj − ℓj)/(ui − ℓi) )
Improvements
New relaxation functions (other than quadratic). Example:

Φ(x; γ) = −∑_{i=1}^n (1 − e^{γi(xi−ℓi)})(1 − e^{γi(ui−xi)})

gives a tighter underestimate than the quadratic function.

Partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex underestimator in each region; join the underestimators to form a single convex function on the whole domain.
Domain (range) reduction
Techniques for cutting the feasible region without cutting the global optimum solution. Simplest approaches: feasibility-based and optimality-based range reduction (RR). Let the problem be:

min_{x∈S} f(x)

Feasibility-based RR asks for solving

ℓi = min{xi : x ∈ S},  ui = max{xi : x ∈ S}

for all i ∈ {1, ..., n} and then adding the constraints x ∈ [ℓ, u] to the problem (or to the sub-problems generated during Branch & Bound).
Feasibility Based RR
If S is a polyhedron, RR requires the solution of LPs:

[ℓi, ui] = min / max xi
s.t. Ax ≤ b, x ∈ [L, U]

“Poor man's” LP-based RR: from every constraint ∑_j aij xj ≤ bi in which the coefficient aik of some variable xk is positive,

xk ≤ (1/aik) ( bi − ∑_{j≠k} aij xj )
⇒
xk ≤ (1/aik) ( bi − ∑_{j≠k} min{aij Lj, aij Uj} )

(see the sketch below)
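A Python sketch of this “poor man's” bound tightening (function and variable names are illustrative):

```python
import numpy as np

def poor_mans_rr(A, b, L, U):
    """Tighten upper bounds using each row of Ax <= b.

    For a variable k with A[i, k] > 0:
      x_k <= (b_i - sum_{j != k} min(A[i,j]*L_j, A[i,j]*U_j)) / A[i,k]
    """
    L, U = np.array(L, float), np.array(U, float)
    m, n = A.shape
    for i in range(m):
        # interval lower bound of each term A[i, j] * x_j
        t = np.minimum(A[i] * L, A[i] * U)
        for k in range(n):
            if A[i, k] > 0:
                new_u = (b[i] - (t.sum() - t[k])) / A[i, k]
                U[k] = min(U[k], new_u)
    return L, U

# Example: x1 + x2 <= 3 with x in [0, 5]^2 tightens U to (3, 3)
print(poor_mans_rr(np.array([[1.0, 1.0]]), np.array([3.0]), [0, 0], [5, 5]))
```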
Optimality Based RR
Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:

ℓi = min{xi : f̃(x) ≤ f(x̄), x ∈ S},  ui = max{xi : f̃(x) ≤ f(x̄), x ∈ S}

where f̃(x) is a convex underestimate of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds).
Generalization

Let

min_{x∈X} f(x)   (P)
s.t. g(x) ≤ 0

be a (non-convex) problem and let

min_{x∈X} f̃(x)   (R)
s.t. g̃(x) ≤ 0

be a convex relaxation of (P):

{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X : g̃(x) ≤ 0}  and
x ∈ X, g(x) ≤ 0 ⇒ f̃(x) ≤ f(x)
R.H.S. perturbation
Let

φ(y) = min_{x∈X} f̃(x)   (Ry)
s.t. g̃(x) ≤ y

be a perturbation of (R). (R) convex ⇒ (Ry) convex for any y. Let x̄ be an optimal solution of (R) and assume that the i-th constraint is active:

g̃i(x̄) = 0

Then, if x̄y is an optimal solution of (Ry), the constraint g̃i(x) ≤ yi is active at x̄y if yi ≤ 0.
Duality
Assume (R) has a finite optimum at x̄ with value φ(0) and Lagrange multipliers μ. Then the hyperplane

H(y) = φ(0) − μ^T y

is a supporting hyperplane of the graph of φ(y) at y = 0, i.e.

φ(y) ≥ φ(0) − μ^T y  ∀ y ∈ R^m
Main result
If (R) is convex with optimum value φ(0), constraint i is active at the optimum and its Lagrange multiplier is μi > 0, then, if U is an upper bound for the original problem (P), the constraint

g̃i(x) ≥ −(U − L)/μi

(where L = φ(0)) is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.
Proof

Problem (Ry) can be seen as a convex relaxation of the perturbed non-convex problem

Φ(y) = min_{x∈X} f(x)
s.t. g(x) ≤ y

and thus φ(y) ≤ Φ(y); underestimating (Ry) produces an underestimate of Φ(y). Let y := ei yi. From duality:

L − μ^T ei yi ≤ φ(ei yi) ≤ Φ(ei yi)

If yi < 0 then U is an upper bound also for Φ(ei yi), thus L − μi yi ≤ U. But if yi < 0 then constraint i is active. For any feasible x there exists a yi < 0 such that g̃i(x) ≤ yi is active ⇒ we may substitute yi with g̃i(x) and deduce L − μi g̃i(x) ≤ U.
Applications
Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable xi is at its upper bound in the optimal solution, then we can deduce

xi ≥ max{ℓi, ui − (U − L)/λi}

where λi is the optimal multiplier associated with the i-th upper bound. Analogously for active lower bounds:

xi ≤ min{ui, ℓi + (U − L)/λi}

(a sketch follows)
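A Python sketch of this multiplier-based tightening; all names are illustrative, and it simply applies the two formulas above to variables whose bound constraints are active in the relaxation:

```python
def tighten_from_multipliers(l, u, x_opt, lam_upper, lam_lower, U, L, tol=1e-9):
    """Tighten bounds of variables at an active bound in the relaxed
    optimum, using the bound multipliers lam and the gap U - L."""
    gap = U - L
    l, u = list(l), list(u)
    for i in range(len(l)):
        if abs(x_opt[i] - u[i]) < tol and lam_upper[i] > 0:
            l[i] = max(l[i], u[i] - gap / lam_upper[i])   # x_i >= u_i - gap/lam
        elif abs(x_opt[i] - l[i]) < tol and lam_lower[i] > 0:
            u[i] = min(u[i], l[i] + gap / lam_lower[i])   # x_i <= l_i + gap/lam
    return l, u
```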
Let the constraint

a_i^T x ≤ bi

be active at an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality

a_i^T x ≥ bi − (U − L)/μi
Methods based on “merit functions”
Bayesian algorithm: the objective function is considered as a realization of a stochastic process

f(x) = F(x; ω)

A loss function is defined, e.g.:

L(x1, ..., xn; ω) = min_{i=1,...,n} F(xi; ω) − min_x F(x; ω)

and the next point to sample is placed in order to minimize the expected loss (or risk):

x_{n+1} = arg min E( L(x1, ..., xn, x_{n+1}) | x1, ..., xn )
        = arg min E( min(F(x_{n+1}; ω) − F(x; ω)) | x1, ..., xn )
Radial basis method
Given k observations (x1, f1), ..., (xk, fk), an interpolant is built:

s(x) = ∑_{i=1}^k λi Φ(‖x − xi‖) + p(x)

where p is a polynomial of a (prefixed) small degree m and Φ is a radial function like, e.g.:

Φ(r) = r          linear
Φ(r) = r³         cubic
Φ(r) = r² log r   thin plate spline
Φ(r) = e^{−γr²}   Gaussian

The polynomial p is necessary to guarantee the existence of a unique interpolant (i.e. when the matrix Φij = Φ(‖xi − xj‖) is singular).
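A minimal sketch of fitting such an interpolant, assuming one-dimensional data, the cubic radial function and a linear polynomial tail (numpy only; an illustration, not a reference implementation):

```python
import numpy as np

def rbf_fit(x, f):
    """Fit s(t) = sum_i lam_i |t - x_i|^3 + a + b*t to data (x_i, f_i).
    The linear tail plus the orthogonality conditions sum lam_i = 0 and
    sum lam_i x_i = 0 make the augmented linear system nonsingular."""
    x, f = np.asarray(x, float), np.asarray(f, float)
    k = len(x)
    Phi = np.abs(x[:, None] - x[None, :]) ** 3
    P = np.column_stack([np.ones(k), x])          # linear polynomial basis
    A = np.block([[Phi, P], [P.T, np.zeros((2, 2))]])
    rhs = np.concatenate([f, np.zeros(2)])
    sol = np.linalg.solve(A, rhs)
    lam, coef = sol[:k], sol[k:]
    return lambda t: np.abs(t - x) ** 3 @ lam + coef[0] + coef[1] * t

s = rbf_fit([0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 0.5, 2.0])
print(s(1.0))   # reproduces the data value 0.0 (up to rounding)
```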
“Bumpiness”
Let f⋆_k be an estimate of the value of the global optimum after k observations. Let s^y_k be the (unique) interpolant of the data points

(xi, fi), i = 1, ..., k, plus (y, f⋆_k)

Idea: the most likely location of y is such that the resulting interpolant has minimum “bumpiness”. Bumpiness measure:

σ(s_k) = (−1)^{m+1} ∑ λi s^y_k(xi)
TO BE DONE
Stochastic methods
Pure Random Search: random uniform sampling over the feasible region

Best start: like Pure Random Search, but a local search is started from the best observation

Multistart: local searches started from randomly generated starting points (a sketch follows)
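A compact Python sketch of Multistart using scipy's local optimizer; the multimodal test function is an arbitrary illustration:

```python
import numpy as np
from scipy.optimize import minimize

def multistart(f, bounds, n_starts=50, seed=0):
    """Multistart: run a local search from each uniform random point
    and keep the best local optimum found."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)
        res = minimize(f, x0, bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best

# A 1-D multimodal example: many local minima in [0, 5]
f = lambda x: np.sin(5 * x[0]) + 0.1 * (x[0] - 2.5) ** 2
print(multistart(f, [(0.0, 5.0)]).x)
```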
[figures: sample points on a multimodal one-dimensional function and the local optima reached from them]
Clustering methods
Given a uniform sample, evaluate the objective function

Sample transformation (or concentration): either a fraction of the “worst” points is discarded, or a few steps of a gradient method are performed

The remaining points are clustered

From the best point in each cluster a single local search is started
Uniform sample

[figure: a uniform sample of points over the contour lines of the objective in [0, 5]²]
Sample concentration

[figure: the sample after concentration, with points moved towards local minima]
Clustering

[figure: the concentrated sample grouped into clusters, with the best point of each cluster highlighted]
Local optimization

[figure: local searches started from the best point of each cluster]
Clustering: MLSL
Sampling proceeds in batches of N points. Given sample points X1, ..., Xk ∈ [0, 1]^n, label Xj as “clustered” iff ∃ Y ∈ {X1, ..., Xk}:

‖Xj − Y‖ ≤ Δk := (1/√(2π)) ( σ Γ(1 + n/2) (log k)/k )^{1/n}

and

f(Y) ≤ f(Xj)
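A Python sketch of this labeling rule, following the Δk formula above (the sample, σ and the toy objective are illustrative assumptions):

```python
import numpy as np
from scipy.special import gamma

def mlsl_clustered(X, fvals, sigma=4.0):
    """Label sample points as 'clustered' when a better point lies
    within the critical distance Delta_k."""
    k, n = X.shape
    dk = (1.0 / np.sqrt(2.0 * np.pi)) * \
         (sigma * gamma(1.0 + n / 2.0) * np.log(k) / k) ** (1.0 / n)
    labels = np.zeros(k, dtype=bool)
    for j in range(k):
        d = np.linalg.norm(X - X[j], axis=1)
        near_better = (d <= dk) & (fvals <= fvals[j])
        near_better[j] = False          # a point does not cluster to itself
        labels[j] = near_better.any()
    return labels

rng = np.random.default_rng(0)
X = rng.random((100, 2))                  # uniform sample in [0, 1]^2
fv = np.sum((X - 0.5) ** 2, axis=1)       # toy objective
print(mlsl_clustered(X, fv).sum(), "points labeled as clustered")
```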
Simple Linkage
A sequential sample is generated (batches consist of a single observation). A local search is started only from the last sampled point (i.e. there is no “recall”), unless there exists a sufficiently near sampled point with a better function value.
Smoothing methods
Given f : R^n → R, the Gaussian transform is defined as:

⟨f⟩_λ(x) = (1/(π^{n/2} λ^n)) ∫_{R^n} f(y) exp(−‖y − x‖²/λ²) dy

When λ is sufficiently large ⇒ ⟨f⟩_λ is convex. Idea: starting with a large enough λ, minimize the smoothed function and slowly decrease λ towards 0.
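A Monte Carlo sketch of this transform: with the normalization above, substituting y = x + λu with u ~ N(0, I/2) gives ⟨f⟩_λ(x) = E[f(x + λZ/√2)] for standard normal Z. (The test function is illustrative.)

```python
import numpy as np

def gaussian_transform(f, x, lam, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Gaussian transform <f>_lam(x)
    via <f>_lam(x) = E[ f(x + lam * Z / sqrt(2)) ], Z standard normal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = rng.standard_normal((n_samples, x.size))
    return float(np.mean([f(x + lam * z / np.sqrt(2.0)) for z in Z]))

f = lambda v: np.sin(10.0 * v[0]) + v[0] ** 2
print(gaussian_transform(f, [0.5], 0.01))   # close to f(0.5)
print(gaussian_transform(f, [0.5], 2.0))    # oscillations averaged away
```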
Smoothing methods

[figures: a multimodal function on [−10, 10]² and its Gaussian transforms for increasing λ, becoming progressively smoother]
Transformed function landscape
Elementary idea: local optimization smooths out many “high frequency” oscillations.
[figures: a multimodal one-dimensional function on [0, 10] and its landscape as transformed by local optimization]
Monotonic Basin-Hopping
k := 0; f⋆ := +∞
while k < MaxIter do
    Xk := random initial solution
    X⋆k := arg min f(x; Xk)   (local minimization started at Xk)
    fk := f(X⋆k)
    if fk < f⋆ then f⋆ := fk
    NoImprove := 0
    while NoImprove < MaxNoImprove do
        X := random perturbation of Xk
        Y := arg min f(x; X)   (local minimization started at X)
        if f(Y) < f⋆ then Xk := Y; NoImprove := 0; f⋆ := f(Y)
        else NoImprove++
    end while
    k := k + 1
end while
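A runnable Python rendering of this scheme using scipy for the local searches; a close variant in which acceptance is tested against the current restart's best value, with illustrative parameters:

```python
import numpy as np
from scipy.optimize import minimize

def monotonic_basin_hopping(f, bounds, max_iter=10, max_no_improve=20,
                            step=0.3, seed=0):
    """Monotonic Basin-Hopping: repeatedly perturb the current local
    minimum and accept a move only if the new local minimum improves."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    f_best, x_best = np.inf, None
    for _ in range(max_iter):
        res = minimize(f, rng.uniform(lo, hi), bounds=bounds)  # random restart
        x, fx = res.x, res.fun
        no_improve = 0
        while no_improve < max_no_improve:
            x_pert = np.clip(x + step * rng.standard_normal(lo.size), lo, hi)
            res = minimize(f, x_pert, bounds=bounds)
            if res.fun < fx:                    # monotonic acceptance
                x, fx, no_improve = res.x, res.fun, 0
            else:
                no_improve += 1
        if fx < f_best:
            f_best, x_best = fx, x
    return x_best, f_best

f = lambda v: np.sin(10 * v[0]) + 0.3 * (v[0] - 5) ** 2   # multimodal test
print(monotonic_basin_hopping(f, [(0.0, 10.0)]))
```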
[figures: successive iterations of Monotonic Basin-Hopping on the one-dimensional test function]