Advanced Algorithms in Nonlinear Optimization


TRANSCRIPT

  • Advanced Algorithms in Nonlinear Optimization

    Philippe Toint

    Department of Mathematics, University of Namur, Belgium

    ( [email protected] )

    Belgian Francqui Chair, Leuven, April 2009

  • Outline

    1 Nonlinear optimization: motivation, past and perspectives

    2 Trust region methods for unconstrained problems

    3 Derivative free optimization, filters and other topics

    4 Convex constraints and interior-point methods

    5 The use of problem structure for large-scale applications

    6 Regularization methods and nonlinear step control

    7 Conclusions


  • What is optimization?

    The best choice subject to constraints

    best: a criterion, the objective function

    choice: variables whose values may be chosen

    constraints: restrictions on the allowed values of the variables

  • More formally

    variables: $x = (x_1, x_2, \ldots, x_n)$
    objective function: minimize/maximize $f(x)$
    constraints: $c(x) \geq 0$

    Note: maximizing $f(x)$ is equivalent to minimizing $-f(x)$.

    $$\min_x f(x) \quad \text{such that} \quad c(x) \geq 0$$

    (the general nonlinear optimization problem)

    (+ conditions on $x$, $f$ and $c$)
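    As a concrete (hypothetical) illustration of how such a problem is posed numerically, here is a minimal sketch using SciPy; the objective and constraint below are toy choices, not from the lecture:

    # Toy instance of  min_x f(x)  such that  c(x) >= 0  (illustrative only)
    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        # a simple nonlinear objective (Rosenbrock-like)
        return (x[0] - 1.0) ** 2 + 10.0 * (x[1] - x[0] ** 2) ** 2

    def c(x):
        # feasible set: the unit disc, expressed as c(x) >= 0
        return 1.0 - x[0] ** 2 - x[1] ** 2

    res = minimize(f, np.zeros(2), constraints=[{"type": "ineq", "fun": c}])
    print(res.x, res.fun)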

  • Nature optimizes


  • Applications: PAL design (1)

    Design of modern Progressive Adaptive Lenses:

    vary optical power of lenses while minimizing astigmatism

    Loos, Greiner, Seidel (1997)

  • Applications: PAL design (2)

    Achievements: Loos, Greiner, Seidel (1997)

    [Figure panels: uncorrected, short distance, long distance, PAL]
  • Applications: PAL design (3)

    Is this nonlinear (i.e., difficult)?

    Assume the lens surface is z=z(x, y). Theoptical poweris

    p(x, y) = N3

    2

    1 +

    z

    x2

    2z

    y2+

    1 +

    z

    y2

    2z

    x2 2 z

    x

    z

    y

    2z

    xy

    where

    N=N(x, y) = 1

    1 +

    zx

    2+

    zy

    2.

    Thesurface astigmatismis then

    a(x, y) =2

    p(x, y) N4

    z

    x

    z

    y

    2z

    xy

    2

  • Applications: Food sterilization (1)

    A common problem in the food processing industry:

    keep a max of vitamins while killing a prescribed fraction of the bacteria

    heating in steam/hot water autoclaves

    Sachs (2003)

  • Applications: Food sterilization (2)

    Model: coupled PDEs

    Concentration $C$ of micro-organisms and other nutrients:
    $$\frac{\partial C}{\partial t}(x,t) = -K[\theta(x,t)]\, C(x,t),$$
    where $\theta(x,t)$ is the temperature, and where
    $$K[\theta] = K_1\, e^{-K_2\left(\frac{1}{\theta} - \frac{1}{\theta_r}\right)} \quad \text{(Arrhenius equation)}$$

    Evolution of temperature:
    $$c(\theta)\, \frac{\partial \theta}{\partial t} = \nabla \cdot \left[k(\theta)\, \nabla \theta\right],$$
    (with suitable boundary conditions: coolant, initial temperature, ...)

  • Applications: biological parameter estimation (1)

    K-channel in the model of a neuron membrane:

    Sansom (2001)

    Doyle et al. (1998)

  • Applications: biological parameter estimation (2)

    Where are these neurons?

    in a Pacific spiny lobster!

    Simmers, Meyrand and Moulin (1995)

  • Applications: biological parameter estimation (3)

    After gathering experimental data (applying a current to the cell):

    estimate the biological model parameters that best fit experiments

    Model:

    Activation: $p$ independent gates
    Deactivation: $n_h$ gates with different dynamics
    ⇒ $n_h + 2$ coupled ODEs for the voltage, the activation level, and the partial inactivation levels

    5-point BDF for 50 000 time steps ⇒ very nonlinear!

  • Applications: data assimilation for weather forecasting (1)

    (Attempt to) predict...

    tomorrow's weather
    the ocean's average temperature next month
    the future gravity field
    future currents in the ionosphere
    ...

  • Applications: data assimilation for weather forecasting (2)

    Data: temperature, wind, pressure, . . . everywhere and at all times!

    May involve up to 250,000,000 variables!

  • Applications: data assimilation for weather forecasting (3)

    The principle: minimize the deviation between the model and past observations.

    [Figure: temperature vs. days]

    Known situation 2.5 days ago and background prediction
    Record temperature for the past 2.5 days
    Run the model to minimize the difference between model and observations:
    $$\min_{x_0}\ \tfrac12\, \|x_0 - x_b\|^2_{B^{-1}} + \tfrac12 \sum_{i=0}^{N} \left\|H\, M(t_i, x_0) - b_i\right\|^2_{R_i^{-1}}$$
    Predict the temperature for the next day
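    To make the structure of this objective concrete, here is a small sketch of the cost function in code. It assumes a linear observation operator H given as a matrix, a caller-supplied model M, and inverse covariance matrices Binv and Rinvs; all of these names are illustrative, not part of the original material:

    import numpy as np

    def fourdvar_cost(x0, xb, Binv, model, H, obs, Rinvs):
        # J(x0) = 1/2 ||x0 - xb||^2_{B^{-1}}
        #       + 1/2 sum_i ||H M(t_i, x0) - b_i||^2_{R_i^{-1}}
        d = x0 - xb
        J = 0.5 * d @ (Binv @ d)                 # background (prior) term
        for i, (b, Rinv) in enumerate(zip(obs, Rinvs)):
            r = H @ model(i, x0) - b             # misfit at observation time t_i
            J += 0.5 * r @ (Rinv @ r)
        return J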
  • Applications: data assimilation for weather forecasting (4)

    Analysis of the ocean's heat content: CERFACS (2009)

    Much better fit!

  • Applications: aeronautical structure design

    minimize weight while maintaining structural integrity

    mass reduction during optimization

    SAMTECH (2009)

  • Applications: asteroid trajectory matching

    find today's asteroid whose orbital parameters best match one observed 50 years ago

    Milani, Sansaturio et al. (2005)

  • Applications: discrete choice modelling (1)

    Context: simulation of individual choices in transportation (or other domains)
    (mode, route, time of departure, ...)

    Random utility theory
    An individual $i$ assigns to alternative $j$ the utility
    $$U_{ij} = [\text{parameters} \times \text{explaining factors}] + [\text{random error}]$$

    Illustration:
    $$U_{\text{bus}} = \beta \cdot \text{distance} - 1.2 \cdot \text{price of ticket} - 2.1 \cdot \text{delay wrt car travel} + \varepsilon$$

  • Applications: discrete choice modelling (2)

    Probability that individual $i$ chooses alternative $j$ rather than alternative $k$, given by
    $$\text{prob}(U_{ij} \geq U_{ik} \text{ for all } k)$$

    Data: mobility surveys (MOBEL)

    find the parameters in the utility function to

    maximize likelihood of observed behaviours

  • Applications: discrete choice modelling (3)

    [Figure: estimated densities (Normal, Lognormal, Spline), using advanced optimization vs. using standard statistics]

    Estimation of the value of time lost in congested traffic (with and without advanced optimization)

  • Applications: Poisson image denoising (1)

    Consider a two-dimensional image with noise proportional to the signal:
    $$z_{ij} = u_{ij} + n\, f(u_{ij}),$$
    where $n$ is random Gaussian noise. How to recover the original $u_{ij}$?

    Use the pixel values as much as possible while minimizing sharp transitions (gradients).

    This leads to the optimization problem
    $$\min_u\ \sum_{ij}\left(u_{ij} - z_{ij}\,\log(u_{ij})\right) + \lambda\, \|\nabla u\|.$$
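    A sketch of this objective in code might look as follows; the forward-difference gradient and the smoothing parameter eps are standard implementation choices that the slide does not specify:

    import numpy as np

    def poisson_tv_objective(u, z, lam, eps=1e-8):
        # sum_ij (u_ij - z_ij log u_ij)  +  lam * sum_ij |(grad u)_ij|  (smoothed)
        fidelity = np.sum(u - z * np.log(u))        # Poisson data-fidelity term
        dx = np.diff(u, axis=0, append=u[-1:, :])   # forward differences in x
        dy = np.diff(u, axis=1, append=u[:, -1:])   # forward differences in y
        penalty = np.sum(np.sqrt(dx ** 2 + dy ** 2 + eps))
        return fidelity + lam * penalty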

  • Applications: Poisson image denoising (2)

    Some spectacular results: a 512×512 picture with 95% noise

    Chan and Chen (2007)
  • Applications: shock simulation in video games

    Optimize the realism of the motion of multiple rigid bodies in space

    ⇒ complementarity problem
    $$0 \leq \nabla_q \phi[q(t)]\, v(t) \ \perp\ \phi(q(t)) \geq 0$$
    ($q(t)$ = positions, $v(t) = \frac{dq}{dt}(t)$ = velocities)

    ⇒ a system of inequalities and equalities, used in real time for video animation

    Anitescu and Potra (1996)

  • Applications: finance

    1 risk management

    2 portfolio analysis

    3 FX markets

    [Figure: investment distribution for the BoJ 1991-2004 (Normal vs. Spline; Optimized vs. Standard)]

    4 ...

    Everybody loves an optimizer!

  • Where does optimization come from?

    "We are like dwarfs standing on the shoulders of giants, such that we can see more things and further away than they could. And this, not because our sight would be more powerful or our height more advantageous, but because we are carried and heightened by the high stature of the giants."

    Bernard de Chartres (1130-1160)


  • Euclid (300 BC) and Al-Khwarizmi (783-850) [portraits]

  • Isaac Newton (1642-1727) and Leonhard Euler (1707-1783) [portraits]

  • J. de Lagrange (1735-1813) and Friedrich Gauss (1777-1855) [portraits]

  • Augustin Cauchy (1789-1857) and George Dantzig (1914-2005) [portraits]

  • Michael Powell and Roger Fletcher [portraits]

  • Return to the mathematical problem

    $$\min_x f(x) \quad \text{such that} \quad c(x) \geq 0$$

    Difficulties:

    the objective function $f(x)$ is typically complicated (nonlinear)
    it is also often costly to compute
    there may be many variables
    the constraints $c(x)$ may define a complicated geometry

  • An example unconstrained problem

    minimize: $f(x,y) = -10x^2 + 10y^2 + 4\sin(xy) - 2x + x^4$

    [Contour plot of $f$]

    Two local minima: $(-2.20,\, 0.32)$ and $(2.30,\, -0.34)$

    How to find them?

  • Trust-region methods

    iterative algorithms

    find local solutions only

    Algorithm 1.1: The trust-region framework

    Until an (approximate) solution is found:
    Step 1: use a model of the nonlinear function(s) within a region where it can be trusted
    Step 2: compute a step giving sufficient decrease of the model
    Step 3: measure achieved and predicted reductions
    Step 4: decrease the region radius if unsuccessful

  • Illustration

    $x_0 = (0.71, 3.27)$ and $f(x_0) = 97.630$
    Contours of $f$ | Contours of $m_0$ around $x_0$ (quadratic model)

    [Contour plots]

    The iterations ($\rho_k = \Delta f / \Delta m_k$):

    k | Δk | sk           | f(xk+sk) | Δf/Δmk | xk+1
    0 | 1  | (0.05, 0.93) | 43.742   | 0.998  | x0+s0
    1 | 2  | (0.62, 1.78) | 2.306    | 1.354  | x1+s1
    2 | 4  | (3.21, 0.00) | 6.295    | 0.004  | x2
    3 | 2  | (1.90, 0.08) | 29.392   | 0.649  | x2+s2
    4 | 2  | (0.32, 0.15) | 31.131   | 0.857  | x3+s3
    5 | 4  | (0.03, 0.02) | 31.176   | 1.009  | x4+s4

    [Contour plots showing the successive trust regions and iterates]


  • Path of iterates, and from another x0

    [Two contour plots: the paths of iterates from the two starting points]

  • And then...

    Does it (always) work?

    The answer tomorrow! (and on subsequent days, for a (biased) survey of new optimization methods)

    Thank you for your attention

  • Lesson 2: Trust-region methods for unconstrained problems

  • The basic text for this course

    A. R. Conn, N. I. M. Gould and Ph. L. Toint, Trust-Region Methods, No. 1 in the MPS-SIAM Series on Optimization, SIAM, Philadelphia, USA, 2000.

  • 2.1: Background material

  • Scalar mean-value theorems

    Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is continuously differentiable throughout $S$. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0,1]$,
    $$f(x+s) = f(x) + \langle \nabla_x f(x + \xi s),\, s\rangle$$
    for some $\xi \in [0,1]$.

    Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is twice continuously differentiable throughout $S$. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0,1]$,
    $$f(x+s) = f(x) + \langle \nabla_x f(x),\, s\rangle + \tfrac12 \langle s,\, \nabla_{xx} f(x + \xi s)\, s\rangle$$
    for some $\xi \in [0,1]$.

  • Vector mean-value theorem

    Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $F : S \to \mathbb{R}^m$ is continuously differentiable throughout $S$. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0,1]$,
    $$F(x+s) = F(x) + \int_0^1 \nabla_x F(x + \alpha s)\, s \, d\alpha.$$

  • Taylor's scalar approximation theorems (1)

    Let $S$ be an open subset of $\mathbb{R}^n$, and suppose $f : S \to \mathbb{R}$ is continuously differentiable throughout $S$. Suppose further that $\nabla_x f(x)$ is Lipschitz continuous at $x$, with Lipschitz constant $\gamma(x)$ in some appropriate vector norm. Then, if the segment $x + \alpha s \in S$ for all $\alpha \in [0,1]$,
    $$|f(x+s) - m(x+s)| \leq \tfrac12\, \gamma(x)\, \|s\|^2,$$
    where $m(x+s) = f(x) + \langle \nabla_x f(x),\, s\rangle$.

  • Newton's method

    Solve $F(x) = 0$.

    Idea: solve the linear approximation
    $$F(x) + J(x)\, s = 0$$

    quadratic local convergence
    ... but not globally convergent

    Yet it is the basis of everything that follows.
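    A minimal sketch of this iteration, assuming the Jacobian J is available as a function (the stopping rule is an obvious addition, not part of the slide):

    import numpy as np

    def newton(F, J, x, tol=1e-10, max_iter=50):
        # Solve F(x) = 0 by repeatedly solving the linear model F(x) + J(x) s = 0.
        for _ in range(max_iter):
            Fx = F(x)
            if np.linalg.norm(Fx) <= tol:
                break
            s = np.linalg.solve(J(x), -Fx)   # Newton step
            x = x + s
        return x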

  • Unconstrained optimality conditions

    Suppose that $f \in C^1$, and that $x_*$ is a local minimizer of $f(x)$. Then $\nabla_x f(x_*) = 0$.

    Suppose that $f \in C^2$, and that $x_*$ is a local minimizer of $f(x)$. Then the above holds and the objective function's Hessian at $x_*$ is positive semi-definite, that is
    $$\langle s,\, \nabla_{xx} f(x_*)\, s\rangle \geq 0 \ \text{ for all } s \in \mathbb{R}^n.$$

    $$\langle s,\, \nabla_{xx} f(x_*)\, s\rangle > 0 \ \text{ for all } s \neq 0 \in \mathbb{R}^n \ \Rightarrow\ \text{strict local solution}$$

  • Constrained optimality conditions (1)

    $$\min f(x) \quad \text{subject to} \quad c_i(x) = 0 \text{ for } i \in E, \quad c_i(x) \geq 0 \text{ for } i \in I$$

    Active set:

    [Figure: feasible region $C$ with boundary faces $F_{\{1\}}, F_{\{2\}}, F_{\{3\}}, F_{\{1,2\}}, F_{\{1,3\}}, F_{\{2,3\}}$ defined by $c_1(x) = 0$, $c_2(x) = 0$, $c_3(x) = 0$, and points $x_1, x_2, x_3$]

    $$A(x_1) = \{1, 2\}, \qquad A(x_2) = A(x_3) = \{3\}$$

  • Constrained optimality conditions (2): first order

    Ignore constraint qualification!

    Suppose that $f, c \in C^1$, and that $x_*$ is a local solution. Then there exists a vector of Lagrange multipliers $y_*$ such that
    $$\nabla_x f(x_*) = \sum_{i \in E \cup I} [y_*]_i\, \nabla_x c_i(x_*)$$
    $$c_i(x_*) = 0 \ \text{ for all } i \in E$$
    $$c_i(x_*) \geq 0 \ \text{ and } \ [y_*]_i \geq 0 \ \text{ for all } i \in I$$
    $$\text{and } c_i(x_*)\, [y_*]_i = 0 \ \text{ for all } i \in I.$$

    Lagrangian: $\mathcal{L}(x, y) = f(x) - \sum_{i \in E \cup I} y_i\, c_i(x)$

  • Constrained optimality conditions (3): second order

    Suppose that $f, c \in C^2$, and that $x_*$ is a local minimizer of $f(x)$. Then there exists a vector of Lagrange multipliers $y_*$ such that the first-order conditions hold and
    $$\langle s,\, \nabla_{xx}\mathcal{L}(x_*, y_*)\, s\rangle \geq 0 \ \text{ for all } s \in N_+,$$
    where $N_+$ is the set of vectors $s$ such that
    $$\langle s,\, \nabla_x c_i(x_*)\rangle = 0 \ \text{ for all } i \in E \cup \{j \in A(x_*) \cap I \mid [y_*]_j > 0\}$$
    and
    $$\langle s,\, \nabla_x c_i(x_*)\rangle \geq 0 \ \text{ for all } i \in \{j \in A(x_*) \cap I \mid [y_*]_j = 0\}.$$

    Strict complementarity: $\langle s,\, \nabla_{xx}\mathcal{L}(x_*, y_*)\, s\rangle > 0$ for all $s \in N_+$ ($s \neq 0$) ⇒ strict local solution

  • Optimality conditions (convex 1)

    Assume now that $C$ is convex.

    Normal cone of $C$ at $x \in C$:
    $$N(x) \stackrel{\text{def}}{=} \{\, y \in \mathbb{R}^n \mid \langle y,\, u - x\rangle \leq 0,\ \forall u \in C \,\}$$

    Tangent cone of $C$ at $x \in C$:
    $$T(x) \stackrel{\text{def}}{=} N(x)^0 = \text{cl}\{\, \lambda(u - x) \mid \lambda \geq 0 \text{ and } u \in C \,\}$$

  • Optimality conditions (convex 2)

    [Figure: the Moreau decomposition]


    Suppose that $C \neq \emptyset$ is convex and closed, that $f$ is continuously differentiable in $C$, and that $x_*$ is a first-order critical point for the minimization of $f$ over $C$. Then, provided that constraint qualification holds,
    $$-\nabla_x f(x_*) \in N(x_*).$$

  • Conjugate gradients

    Idea: minimize a convex quadratic on successive nested Krylov subspaces.

    Algorithm 2.1: Conjugate gradients (CG)

    Given $x_0$, set $g_0 = Hx_0 + c$ and let $p_0 = -g_0$.
    For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $$\alpha_k = \|g_k\|_2^2 \,/\, \langle p_k,\, Hp_k\rangle$$
    $$x_{k+1} = x_k + \alpha_k p_k$$
    $$g_{k+1} = g_k + \alpha_k Hp_k$$
    $$\beta_k = \|g_{k+1}\|_2^2 \,/\, \|g_k\|_2^2$$
    $$p_{k+1} = -g_{k+1} + \beta_k p_k$$
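    A direct transcription of Algorithm 2.1 in code (dense H for simplicity; the convergence test on ||g_k||, left open on the slide, is taken as a simple tolerance):

    import numpy as np

    def conjugate_gradients(H, c, x0, tol=1e-8):
        # Minimize (1/2)<x,Hx> + <c,x> for symmetric positive definite H.
        x = x0.copy()
        g = H @ x + c
        p = -g
        for _ in range(len(c)):
            if np.linalg.norm(g) <= tol:
                break
            Hp = H @ p
            alpha = (g @ g) / (p @ Hp)
            x = x + alpha * p
            g_new = g + alpha * Hp
            beta = (g_new @ g_new) / (g @ g)
            p = -g_new + beta * p
            g = g_new
        return x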

  • Preconditioning

    Idea: change the variables $\bar{x} = Rx$ and define $M = R^T R$.

    Algorithm 2.2: Preconditioned CG

    Given $x_0$, set $g_0 = Hx_0 + c$, and let $v_0 = M^{-1} g_0$ and $p_0 = -v_0$.
    For $k = 0, 1, \ldots$, until convergence, perform the iteration
    $$\alpha_k = \langle g_k,\, v_k\rangle \,/\, \langle p_k,\, Hp_k\rangle$$
    $$x_{k+1} = x_k + \alpha_k p_k$$
    $$g_{k+1} = g_k + \alpha_k Hp_k$$
    $$v_{k+1} = M^{-1} g_{k+1}$$
    $$\beta_k = \langle g_{k+1},\, v_{k+1}\rangle \,/\, \langle g_k,\, v_k\rangle$$
    $$p_{k+1} = -v_{k+1} + \beta_k p_k$$

  • Lanczos method

    Idea: compute an orthonormal basis of the successive nested Krylov subspaces
    ⇒ makes $Q_k^T H Q_k$ tridiagonal

    Algorithm 2.3: Lanczos

    Given $r_0$, set $y_0 = r_0$, $q_{-1} = 0$.
    For $k = 0, 1, \ldots$, perform the iteration
    $$\gamma_k = \|y_k\|_2$$
    $$q_k = y_k / \gamma_k$$
    $$\delta_k = \langle q_k,\, Hq_k\rangle$$
    $$y_{k+1} = Hq_k - \delta_k q_k - \gamma_k q_{k-1}$$
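    A sketch of this three-term recurrence (dense H, and no reorthogonalization, which a practical code would add):

    import numpy as np

    def lanczos(H, r0, m):
        # m steps of Lanczos: orthonormal Q and tridiagonal T = Q^T H Q.
        n = len(r0)
        Q = np.zeros((n, m))
        delta, gamma = np.zeros(m), np.zeros(m)
        y, q_prev = r0.astype(float).copy(), np.zeros(n)
        for k in range(m):
            gamma[k] = np.linalg.norm(y)
            if gamma[k] == 0.0:                 # invariant subspace found
                Q, delta, gamma = Q[:, :k], delta[:k], gamma[:k]
                break
            q = y / gamma[k]
            Q[:, k] = q
            delta[k] = q @ (H @ q)
            y = H @ q - delta[k] * q - gamma[k] * q_prev
            q_prev = q
        T = np.diag(delta) + np.diag(gamma[1:], 1) + np.diag(gamma[1:], -1)
        return Q, T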

  • Another view on the conjugate-gradients method

    Conjugate gradients = Lanczos + $LDL^T$ (Cholesky):
    $$\underbrace{\text{Lanczos} + \text{Cholesky}}_{\text{conjugate gradients}}$$
    in one of the Krylov subspaces.

  • 2.2: The trust-region algorithm

  • The trust-region idea

    use a model of the objective function
    define a trust region where it is thought adequate:
    $$B_k = \{\, x \in \mathbb{R}^n \mid \|x - x_k\|_k \leq \Delta_k \,\}$$
    find a trial point by sufficiently decreasing the model in $B_k$
    compute the objective function at the trial point
    compare achieved vs. predicted reductions
    reduce $\Delta_k$ if unsatisfactory

  • The basic trust-region algorithm

    Algorithm 2.4: Basic trust-region algorithm (BTR)

    Step 0: Initialization. $x_0$ and $\Delta_0$ given, compute $f(x_0)$ and set $k = 0$.
    Step 1: Model definition. Choose $\|\cdot\|_k$ and define a model $m_k$ in $B_k$.
    Step 2: Step calculation. Compute $s_k$ that sufficiently reduces the model $m_k$ with $x_k + s_k \in B_k$.
    Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define
    $$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$$
    If $\rho_k \geq \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.
    Step 4: Trust-region radius update. Set
    $$\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \geq \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$$
    Increment $k$ by one and go to Step 1.
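    A compact sketch of Algorithm 2.4 with a quadratic model and a Cauchy-point step; the constants eta1 = 0.01, eta2 = 0.9 and the factors 0.5 and 2 are typical choices within the intervals the algorithm allows:

    import numpy as np

    def btr(f, grad, hess, x, delta=1.0, eta1=0.01, eta2=0.9,
            tol=1e-6, max_iter=500):
        # Basic trust-region method (BTR) with a Cauchy step.
        for _ in range(max_iter):
            g, H = grad(x), hess(x)
            gnorm = np.linalg.norm(g)
            if gnorm <= tol:
                break
            # Cauchy step: minimize the model along -g within the region
            gHg = g @ (H @ g)
            t = delta / gnorm if gHg <= 0 else min(gnorm ** 2 / gHg, delta / gnorm)
            s = -t * g
            pred = -(g @ s + 0.5 * s @ (H @ s))       # predicted reduction (> 0)
            rho = (f(x) - f(x + s)) / pred            # achieved vs. predicted
            if rho >= eta1:
                x = x + s                             # accept the trial point
            if rho >= eta2:
                delta *= 2.0                          # very successful: expand
            elif rho < eta1:
                delta *= 0.5                          # unsuccessful: shrink
        return x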

  • 2.3: Basic convergence theory

  • Assumptions

    $f \in C^2$
    $f(x) \geq \kappa_{\rm lbf}$ and $\|\nabla_{xx} f(x)\| \leq \kappa_{\rm ufh}$
    $m_k \in C^2(B_k)$
    $m_k(x_k) = f(x_k)$
    $g_k \stackrel{\text{def}}{=} \nabla_x m_k(x_k) = \nabla_x f(x_k)$
    $\|\nabla_{xx} m_k(x)\| \leq \kappa_{\rm umh} - 1$ for all $x \in B_k$
    uniformly equivalent norms: $\tfrac{1}{\kappa_{\rm une}}\, \|x\| \leq \|x\|_k \leq \kappa_{\rm une}\, \|x\|$

    ... but use $\|\cdot\|_k = \|\cdot\|_2$ in what follows!

  • The Cauchy step

    Idea: minimize $m_k$ on the Cauchy arc
    $$x_k^C(t) \stackrel{\text{def}}{=} \{\, x \mid x = x_k - t\, g_k,\ t \geq 0 \text{ and } x \in B_k \,\}.$$

    [Figure: the Cauchy point]

  • The Cauchy point for quadratic models

    Three cases when minimizing the quadratic $m_k$ along the Cauchy arc:

    [Figure: the three cases]

    $$m_k(x_k) - m_k(x_k^C) \geq \tfrac12\, \|g_k\| \min\left[\frac{\|g_k\|}{\beta_k},\, \Delta_k\right]$$

  • The Cauchy point for general models

    Three cases when minimizing the general $m_k$ along the Cauchy arc:

    [Figure: the three cases]

    $$m_k(x_k) - m_k(x_k^{AC}) \geq \kappa_{\rm dcp}\, \|g_k\| \min\left[\frac{\|g_k\|}{\beta_k},\, \Delta_k\right]$$

  • The meaning of sufficient decrease

    In both cases, we get the

    Sufficient decrease condition:
    $$m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mdc}\, \|g_k\| \min\left[\frac{\|g_k\|}{\beta_k},\, \Delta_k\right]$$

    Immediate consequence:
    Suppose that $\nabla_x f(x_k) \neq 0$. Then $m_k(x_k + s_k) < m_k(x_k)$ and $s_k \neq 0$.

    ⇒ $\rho_k$ is well defined!

  • The exact minimizer is OK

    Suppose that, for all $k$, $s_k$ ensures that
    $$m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm amm}\, \left[\, m_k(x_k) - m_k(x_k^M)\, \right],$$
    where $x_k^M$ is the model minimizer. Then sufficient decrease is obtained.

    [Figure: model contours and the model minimizer]

  • Taylor and minimum radius

    For all $k$,
    $$|f(x_k + s_k) - m_k(x_k + s_k)| \leq \kappa_{\rm ubh}\, \Delta_k^2.$$

    Suppose that $g_k \neq 0$ and that
    $$\Delta_k \leq \frac{\kappa_{\rm mdc}\, \|g_k\|\, (1 - \eta_2)}{\kappa_{\rm ubh}}.$$
    Then iteration $k$ is very successful and $\Delta_{k+1} \geq \Delta_k$.

    Suppose that $\|g_k\| \geq \kappa_{\rm lbg} > 0$ for all $k$. Then there is a constant $\kappa_{\rm lbd} > 0$ such that, for all $k$,
    $$\Delta_k \geq \kappa_{\rm lbd}.$$

  • First-order convergence (1)

    Suppose that there are only finitely many successful iterations. Then $x_k = x_*$ for all sufficiently large $k$ and $x_*$ is first-order critical.

    Suppose that there are infinitely many successful iterations. Then
    $$\liminf_{k \to \infty}\, \|\nabla_x f(x_k)\| = 0.$$

    Idea: infinite descent if not critical.

  • First-order convergence (2)

    Suppose that there are infinitely many successful iterations.

    Then
    $$\lim_{k \to \infty}\, \|\nabla_x f(x_k)\| = 0.$$

    [Illustration of the proof argument, on subsequences of iterations with $\|g_k\| \geq \epsilon_1 > 0$]

  • Convex models (1)

    Suppose that $\lambda_{\min}[\nabla_{xx} m_k(x)] \geq \epsilon$ for all $x \in [x_k, x_k + s_k]$ and for some $\epsilon > 0$. Then
    $$\|s_k\| \leq \frac{2}{\epsilon}\, \|g_k\|.$$

    Idea: $m_k$ curves upwards!

    Suppose that $\{x_{k_i}\} \to x_*$ where $x_*$ is first-order critical, and that there is a constant $\kappa_{\rm smh} > 0$ such that
    $$\min_{x \in B_k}\, \lambda_{\min}[\nabla_{xx} m_k(x)] \geq \kappa_{\rm smh}$$
    whenever $x_k$ is sufficiently close to $x_*$. Suppose finally that $\nabla_{xx} f(x_*)$ is nonsingular. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$.

    Idea: steps too short to escape the local basin.

  • Convex models (2)

    But...

    [Two contour plots]

  • Asymptotically exact Hessians

    Assume also that
    $$\lim_{k \to \infty}\, \|\nabla_{xx} f(x_k) - \nabla_{xx} m_k(x_k)\| = 0 \quad \text{whenever} \quad \lim_{k \to \infty}\, \|g_k\| = 0.$$

    Suppose that $\{x_{k_i}\} \to x_*$ where $x_*$ is first-order critical, that $s_k \neq 0$ for all $k$ sufficiently large, and that $\nabla_{xx} f(x_*)$ is positive definite. Then the complete sequence of iterates $\{x_k\}$ converges to $x_*$, all iterations are eventually very successful, and the trust-region radius $\Delta_k$ is bounded away from zero.

    Idea: sufficient decrease implies that
    $$m_k(x_k) - m_k(x_k + s_k) \geq \kappa_{\rm mqd}\, \|s_k\|^2 > 0.$$
    Then $\rho_k \to 1$.

  • Second order: the eigen point

    Assume $0 > \lambda_k = \lambda_{\min}(H_k)$. Then define the eigen direction $u_k$ such that
    $$\langle u_k,\, g_k\rangle \leq 0, \qquad \|u_k\|_k = \Delta_k, \qquad \langle u_k,\, H_k u_k\rangle \leq \kappa_{\rm snc}\, \lambda_k\, \Delta_k^2.$$

    Minimize the model along $u_k$ to compute the eigen point:
    $$m_k(x_k^E) = m_k(x_k + t_k^E u_k) = \min_{t \in (0,1]}\, m_k(x_k + t\, u_k)$$

    [Figure: contour plot with the eigen direction]

  • Model decrease at the eigen point

    Suppose $0 > \lambda_k = \lambda_{\min}(H_k)$, $u_k$ is an eigen direction, and
    $$\|\nabla_{xx} m_k(x) - \nabla_{xx} m_k(y)\| \leq \kappa_{\rm lch}\, \|x - y\|$$
    for all $x, y \in B_k$. Then
    $$m_k(x_k) - m_k(x_k^E) \geq \kappa_{\rm sod}\, |\lambda_k| \min[\lambda_k^2,\, \Delta_k^2].$$
    (quadratic or general model)

    [Figure: model decrease along the eigen direction, quadratic and general cases]

  • Second order: convergence theorems

    $$\limsup_{k \to \infty}\, \lambda_{\min}[\nabla_{xx} f(x_k)] \geq 0.$$

    Suppose that $x_*$ is an isolated limit point of the sequence of iterates $\{x_k\}$. Then $x_*$ is a second-order critical point.

    Assume also that, for $\gamma_3 > 1$,
    $$\rho_k \geq \eta_2 \ \text{ and } \ \Delta_k \leq \Delta_{\max} \ \Rightarrow\ \Delta_{k+1} \in [\gamma_3 \Delta_k,\, \gamma_4 \Delta_k].$$
    Let $x_*$ be any limit point of the sequence of iterates. Then $x_*$ is a second-order critical point.

  • Different trust-region norms

    [Figure: trust-region shapes for different norms]

  • Using norms for scaling

    Idea: change the variables $S_k w = s$.

    Then
    $$m_k^S(x_k + w) \approx f(x_k + S_k w) \stackrel{\text{def}}{=} f^S(w), \qquad B_k^S = \{\, x_k + w \mid \|w\| \leq \Delta_k \,\},$$
    $$m_k^S(x_k) = f(x_k), \qquad g_k^S = \nabla_w f^S(0) = S_k^T \nabla_x f(x_k), \qquad H_k^S \approx \nabla_{ww} f^S(0) = S_k^T \nabla_{xx} f(x_k)\, S_k.$$
    Thus
    $$\begin{aligned} m_k^S(x_k + w) &= f(x_k) + \langle g_k^S,\, w\rangle + \tfrac12 \langle w,\, H_k^S w\rangle \\ &= f(x_k) + \langle S_k^T \nabla_x f(x_k),\, w\rangle + \tfrac12 \langle w,\, S_k^T H_k S_k w\rangle \\ &= f(x_k) + \langle \nabla_x f(x_k),\, S_k w\rangle + \tfrac12 \langle S_k w,\, H_k S_k w\rangle \\ &= f(x_k) + \langle \nabla_x f(x_k),\, s\rangle + \tfrac12 \langle s,\, H_k s\rangle = m_k(x_k + s). \end{aligned}$$

  • Scaling: the geometry

    [Figure: the geometry of scaling]

  • 2.4: Solving the subproblem

  • The subproblem

    Assume the Euclidean norm and a quadratic model (possibly non-convex); drop the index $k$:
    $$\min_{s \in \mathbb{R}^n}\ q(s) \equiv \langle g,\, s\rangle + \tfrac12 \langle s,\, Hs\rangle \quad \text{subject to} \quad \|s\|_2 \leq \Delta$$

  • Possible approaches

    exact minimization
    truncated conjugate gradients
    CG + Lanczos (GLTR)
    doglegs
    eigenvalue-based methods
    (projection methods)

  • The exact minimizer

    Any global minimizer of $q(s)$ subject to $\|s\|_2 \leq \Delta$ satisfies the equation
    $$H(\lambda^M)\, s^M = -g,$$
    where $H(\lambda^M) \stackrel{\text{def}}{=} H + \lambda^M I$ is positive semi-definite, $\lambda^M \geq 0$ and $\lambda^M (\|s^M\|_2 - \Delta) = 0$.

    If $H(\lambda^M)$ is positive definite, $s^M$ is unique.

    Note: $\lambda^M$ is the Lagrange multiplier.

  • The exact minimizer: a geometrical view

    [Figure: geometrical view of the exact minimizer]

  • The convex case

    [Figure: $\|s(\lambda)\|_2$ against $\lambda$ in the convex case, with the solution curve]

  • A nonconvex case

    [Figure: $\|s(\lambda)\|_2$ against $\lambda$ in a nonconvex case, with the pole at $-\lambda_1$]

  • The hard case: $\langle u_1,\, g\rangle = 0$

    [Figure: $\|s(\lambda)\|_2$ in the hard case: no root of $\|s(\lambda)\|_2 = \Delta$ larger than $-\lambda_1 = 2$; root near 2.2]

  • Near the hard case: $\langle u_1,\, g\rangle \to 0$

    [Figure: $\|s(\lambda)\|_2$ for $\langle u_1,\, g\rangle = 1,\ 0.1,\ 0.01,\ 0.00001$]

  • The secular equation

    Idea: consider the secular equation
    $$\phi(\lambda) \stackrel{\text{def}}{=} \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = 0.$$

    [Figure: $1/\|s(\lambda)\|_2$ for $\langle u_1,\, g\rangle = 1,\ 0.1,\ 0.01,\ 0.00001$]

    Then apply Newton's method to $\phi(\lambda) = 0$:
    $$\lambda^+ = \lambda - \phi(\lambda)/\phi'(\lambda).$$

  • The derivatives of $\phi(\lambda)$

    Suppose $g \neq 0$. Then, for $\lambda > -\lambda_1$:

    $\phi(\lambda)$ is strictly increasing and concave, and
    $$\phi'(\lambda) = -\frac{\langle s(\lambda),\, \nabla_\lambda s(\lambda)\rangle}{\|s(\lambda)\|_2^3}, \quad \text{where} \quad \nabla_\lambda s(\lambda) = -H(\lambda)^{-1} s(\lambda).$$

    Note: if $H(\lambda) = LL^T$ and $Lw = s(\lambda)$, then
    $$\langle s(\lambda),\, H(\lambda)^{-1} s(\lambda)\rangle = \langle s(\lambda),\, L^{-T} L^{-1} s(\lambda)\rangle = \|w\|^2.$$

  • Newton's method on the secular equation

    Algorithm 2.5: Exact trust-region solver

    Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given.
    1. Factorize $H(\lambda) = LL^T$.
    2. Solve $LL^T s = -g$.
    3. Solve $Lw = s$.
    4. Replace $\lambda$ by
    $$\lambda + \left(\frac{\|s\|_2 - \Delta}{\Delta}\right) \left(\frac{\|s\|_2^2}{\|w\|_2^2}\right).$$

    But ... more complications due to:
    bracketing the root (initialization + update)
    termination rules
    may be preconditioned

    Moré (1978), Moré-Sorensen (1983), Dollar-Gould-Robinson (2009)
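    The core of steps 1-4 in code, assuming H + lam*I is positive definite at the current lam (a full solver adds the bracketing, termination and preconditioning safeguards mentioned above):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, solve_triangular

    def secular_newton_step(H, g, lam, delta):
        # One Newton step on phi(lam) = 1/||s(lam)||_2 - 1/delta.
        n = H.shape[0]
        L, low = cho_factor(H + lam * np.eye(n), lower=True)  # H(lam) = L L^T
        s = cho_solve((L, low), -g)                           # L L^T s = -g
        w = solve_triangular(L, s, lower=True)                # L w = s
        ns, nw = np.linalg.norm(s), np.linalg.norm(w)
        return lam + ((ns - delta) / delta) * (ns ** 2 / nw ** 2)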

  • Approximate solution by truncated CG

    Fact: CG never re-enters the $\ell_2$ trust region.

    [Figure: truncated-CG iterates and the trust region]

    May be preconditioned. Steihaug (1983), T. (1981)
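    A sketch of the truncated-CG (Steihaug-Toint) idea: run CG on the model and stop at the boundary or on negative curvature; the quadratic boundary-intersection helper is the standard construction:

    import numpy as np

    def to_boundary(s, p, delta):
        # positive tau with ||s + tau p||_2 = delta
        a, b, c = p @ p, 2.0 * (s @ p), s @ s - delta ** 2
        return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    def truncated_cg(H, g, delta, tol=1e-8):
        # Approximately minimize <g,s> + (1/2)<s,Hs> s.t. ||s||_2 <= delta.
        n = len(g)
        s, r, p = np.zeros(n), g.astype(float).copy(), -g.astype(float)
        for _ in range(n):
            if np.linalg.norm(r) <= tol:
                break
            Hp = H @ p
            curv = p @ Hp
            if curv <= 0.0:                              # negative curvature
                return s + to_boundary(s, p, delta) * p
            alpha = (r @ r) / curv
            if np.linalg.norm(s + alpha * p) >= delta:   # would leave the region
                return s + to_boundary(s, p, delta) * p
            s = s + alpha * p
            r_new = r + alpha * Hp
            p = -r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
        return s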

  • Approximate solution by the GLTR

    ST might hit the boundary for a steepest-descent step ⇒ sometimes slow.

    Idea: solve the subproblem on the nested Krylov subspaces.

    Algorithm 2.6: Two-phase GLTR algorithm

    as long as interior: conjugate gradients
    on the boundary: Lanczos method + subproblem solution in Krylov space
    (smooth transition)

    Gould-Lucidi-Roma-T. (1999)

  • Doglegs

    Idea: use steepest descent and the full Newton step (requires convexity?).

    [Figure: the dogleg and double-dogleg curves from 0 through $s^C$, $s^D$ and $s^{DD}$ to the Newton step $s^N$, with the trust-region boundary]

    Powell (1970), Dennis-Mei (1979)
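    A sketch of the single-dogleg step for positive definite H (the double-dogleg variant inserts an extra knot between the Cauchy and Newton points):

    import numpy as np

    def dogleg_step(H, g, delta):
        # Dogleg step for min <g,s> + (1/2)<s,Hs> subject to ||s||_2 <= delta.
        sN = np.linalg.solve(H, -g)                  # full Newton step
        if np.linalg.norm(sN) <= delta:
            return sN
        sC = -((g @ g) / (g @ (H @ g))) * g          # unconstrained Cauchy step
        if np.linalg.norm(sC) >= delta:
            return -(delta / np.linalg.norm(g)) * g  # steepest descent to boundary
        d = sN - sC                                  # second leg of the dogleg
        a, b, c = d @ d, 2.0 * (sC @ d), sC @ sC - delta ** 2
        tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
        return sC + tau * d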

  • An eigenvalue approach

    Rewrite $(H + \lambda M)\, s = -g$ as
    $$\begin{pmatrix} H & g \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda\, M s,$$
    or (introducing the parameter $\mu$)
    $$\begin{pmatrix} H & g \\ g^T & \mu \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix} = -\lambda(\mu) \begin{pmatrix} M & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} s \\ 1 \end{pmatrix};$$
    choose $\mu$ such that $\lambda \geq 0$, $H + \lambda M$ is positive semi-definite, and $\lambda\, (\|s^M\| - \Delta) = 0$.

    Rendl-Wolkowicz (1997), Rojas-Santos-Sorensen (1999)

  • Bibliography for lesson 2 (1)

    J. E. Dennis and H. H. W. Mei, Two New Unconstrained Optimization Algorithms Which Use Function and Gradient Values, Journal of Optimization Theory and Applications, 28(4):453-482, 1979.

    H. S. Dollar, N. I. M. Gould and D. P. Robinson, On solving trust-region and other regularised subproblems in optimization, Report RAL-TR-2009-003, Rutherford Appleton Laboratory, Chilton, UK, 2009.

    S. M. Goldfeldt, R. E. Quandt and H. F. Trotter, Maximization by quadratic hill-climbing, Econometrica, 34:541-551, 1966.

    N. I. M. Gould, S. Lucidi, M. Roma and Ph. L. Toint, Solving the trust-region subproblem using the Lanczos method, SIAM Journal on Optimization, 9(2):504-525, 1999.

    K. Levenberg, A Method for the Solution of Certain Problems in Least Squares, Quarterly Journal on Applied Mathematics, 2:164-168, 1944.

    D. Marquardt, An Algorithm for Least-Squares Estimation of Nonlinear Parameters, SIAM Journal on Applied Mathematics, 11:431-441, 1963.

    J. J. Moré, The Levenberg-Marquardt algorithm: implementation and theory, in Numerical Analysis, Dundee 1977 (A. Watson, ed.), Springer Verlag, Heidelberg, 1978.

    J. J. Moré, Recent developments in algorithms and software for trust region methods, in Mathematical Programming: The State of the Art (A. Bachem, M. Grötschel and B. Korte, eds.), Springer Verlag, Heidelberg, pp. 258-287, 1983.

    J. J. Moré and D. C. Sorensen, Computing a Trust Region Step, SIAM Journal on Scientific and Statistical Computing, 4(3):553-572, 1983.

  • Bibliography for lesson 2 (2)

    Methods for nonlinear least squares problems and convergence proofs,Proceedings of the Seminar on Tracking Programs and Orbit Determination, (J. Lorell and F. Yagi, eds.), Jet PropulsionLaboratory, Pasadena, USA, pp. 1-9, 1960.

    M. J. D. Powell,A New Algorithm for Unconstrained Optimization,in Nonlinear Programming (J. B. Rosen, O. L. Mangasarian and K. Ritter, eds.), Academic Press, London, pp. 31-65,1970.

    F. Rendl and H. Wolkowicz,

    A Semidefinite Framework for Trust Region Subproblems with Applications to Large Scale Minimization,Mathematical Programming, 77(2):273-299, 1997.

    M. Rojas, S. A. Santos and D. C. Sorensen,A new matrix-free algorithm for the large-scale trust-region subproblem,CAAM, Rice University, TR99-19, 1999.

    D. Winfield,Function and functional optimization by interpolation in data tables,Ph.D. Thesis, Harvard University, Cambridge, USA, 1969.

    Philippe Toint (Namur) April 2009 103 / 323

    Derivative free optimization, filters and other topics


    Lesson 3:

Derivative-free optimization, infinite dimensions and filters

    Philippe Toint (Namur) April 2009 104 / 323

    Derivative free optimization, filters and other topics Derivative free optimization


    3.1: Derivative-free optimization

    Philippe Toint (Namur) April 2009 105 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    An application of trust-regions: unconstrained DFO

    Consider the unconstrained problem


min_x f(x)

Gradient (and Hessian) of f(x) unavailable:

physical measurement

object code

typically small-scale (but not always. . . )

Derivative-free optimization (DFO): f(x) typically very costly

⇒ exploit each evaluation of f(x) to the utmost possible

⇒ considerable interest from practitioners

    Philippe Toint (Namur) April 2009 106 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Interpolation methods for DFO

    Idea: Winfield (1973), Powell (1994)


    Until convergence:

Use the available function values to build a polynomial interpolation model m_k:

m_k(y_i) = f(y_i) for all y_i ∈ Y;

minimize the model in a trust region, yielding a new potentially good point;

compute a new function value.

Y = interpolation set = {points y_i at which f(y_i) is known}
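As a concrete sketch of the model-building step (illustrative, not the lecture's code): the interpolation conditions form a square linear system, and poisedness of Y is precisely the nonsingularity of the matrix M built below.

```python
import numpy as np
from itertools import combinations_with_replacement

def quad_basis(x):
    """Monomial basis of quadratics in R^n: 1, x_i, x_i * x_j."""
    n = len(x)
    quads = [x[i] * x[j] for i, j in combinations_with_replacement(range(n), 2)]
    return np.array([1.0, *x, *quads])

def build_model(Y, fY):
    """Coefficients c of the quadratic with m(y_i) = f(y_i) for all y_i in Y.

    Needs |Y| = (n+1)(n+2)/2 points and a nonsingular (poised) system."""
    M = np.array([quad_basis(y) for y in Y])
    return np.linalg.solve(M, np.asarray(fY, dtype=float))

def model_value(c, x):
    return float(c @ quad_basis(x))
```

For n = 2 this requires |Y| = 6 points, the case illustrated in the poisedness slides below.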

    Philippe Toint (Namur) April 2009 107 / 323


Derivative free optimization, filters and other topics Derivative free optimization

A naive trust-region method for DFO: illustration

[Figure: three successive frames of the naive interpolation-based trust-region iteration on a test surface plotted over roughly [-2, 3] x [-2, 3], with function values between 0 and 50.]

Philippe Toint (Namur) April 2009 108 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Interpolation methods for DFO (2)

    http://find/
  • 8/10/2019 Advanced Algorithms in Nonlinear Optimization

    123/429

    To be considered:

poisedness of the interpolation set Y

choice of models (linear, quadratic, in between, beyond)

convergence theory

numerical performance

    Philippe Toint (Namur) April 2009 109 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Poisedness

    Assume a quadratic model


m_k(x_k + s) = f_k + ⟨g_k, s⟩ + ½ ⟨s, H_k s⟩

Thus

p = 1 + n + ½ n(n+1) = ½ (n+1)(n+2)

parameters to determine ⇒ need p function values (|Y| = p)

Not sufficient!

⇒ need geometric conditions for the points in Y . . .

    Philippe Toint (Namur) April 2009 110 / 323

Derivative free optimization, filters and other topics Derivative free optimization

Poisedness: geometry with n = 2, p = 6

[Figure: six interpolation points in the plane with their function values plotted as a surface in R^3.]

With these 6 data points in R^3 . . .

[Figure: a quadratic interpolant through the six points.]

. . . is this the correct interpolation?

[Figure: a second, different quadratic through the same six points.]

. . . or this?

[Figure: a third quadratic through the same six points.]

. . . or this?

[Figure: the difference of two such interpolants.]

The difference . . . is zero on a quadratic curve containing Y!
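This non-uniqueness is easy to reproduce numerically: six points on a conic (here the unit circle) all satisfy the quadratic equation x² + y² − 1 = 0, so the interpolation matrix is singular. A small illustrative check:

```python
import numpy as np

def quad_basis2(x):                    # quadratic monomials in R^2
    return np.array([1.0, x[0], x[1], x[0]**2, x[0]*x[1], x[1]**2])

thetas = np.linspace(0.0, 2.0 * np.pi, 7)[:-1]   # 6 points on the unit circle
M = np.array([quad_basis2((np.cos(t), np.sin(t))) for t in thetas])
print(np.linalg.matrix_rank(M))        # 5, not 6: Y is not poised for quadratics
```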

    Philippe Toint (Namur) April 2009 111 / 323


    Derivative free optimization, filters and other topics Derivative free optimization

    Lagrange polynomials

Remarkable: replace y by y⁺ in Y:

δ(Y⁺) / δ(Y) = L(y⁺, y) is independent of the basis {φ_i(·)}_{i=1}^p

(here δ(Y) denotes the determinant of the interpolation matrix in the chosen basis), where

for all ŷ ∈ Y, L(ŷ, y) = 1 if ŷ = y and 0 if ŷ ≠ y

is the Lagrange fundamental polynomial.

Note: for quadratic interpolation, L(·, y) is a quadratic polynomial! Powell (1994)
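Taking δ(Y) to be the determinant of the interpolation matrix, the identity is easy to verify numerically; the point set and y⁺ below are hypothetical, chosen only so that Y is poised:

```python
import numpy as np

def quad_basis2(x):
    return np.array([1.0, x[0], x[1], x[0]**2, x[0]*x[1], x[1]**2])

Y = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]   # poised for quadratics
M = np.array([quad_basis2(y) for y in Y])

i, yplus = 2, (0.3, 0.7)
c = np.linalg.solve(M, np.eye(6)[i])     # coefficients of L(., y_i)
Mplus = M.copy()
Mplus[i] = quad_basis2(yplus)

# det(Y+)/det(Y) equals the Lagrange polynomial value L(y+, y_i)
print(np.linalg.det(Mplus) / np.linalg.det(M), c @ quad_basis2(yplus))
```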

    Philippe Toint (Namur) April 2009 113 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Interpolation using Lagrange polynomials

Idea: use the Lagrange polynomials to define the (quadratic) interpolant by

m_k(x_k + s) = Σ_{y ∈ Y_k} f(y) L_k(x_k + s, y)

And then . . .

|f(x_k + s) − m_k(x_k + s)| ≤ κ Σ_{y ∈ Y_k} ‖x_k + s − y‖² |L_k(x_k + s, y)|

(for a constant κ depending only on the smoothness of f, not on s)

    Philippe Toint (Namur) April 2009 114 / 323

Derivative free optimization, filters and other topics Derivative free optimization

Interpolation using Lagrange polynomials: construction

[Figure sequence, one frame per slide:]

The original function . . .

. . . and the interpolation set

The first Lagrange polynomial

The second Lagrange polynomial

The third Lagrange polynomial

The fourth Lagrange polynomial

The fifth Lagrange polynomial

The sixth Lagrange polynomial

The final interpolating quadratic

Philippe Toint (Namur) April 2009 115 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Other algorithmic ingredients

include a new point in the interpolation set

⇒ need to drop an existing interpolation point?

⇒ select which one to drop: make Y as poised as possible

Note: the model/function minimizer may produce bad geometry!! ⇒ geometry improvement procedure . . .

trust-region radius management

trust region = B_k = {x_k + s | ‖s‖ ≤ Δ_k}

standard: reduce Δ_k when no progress

DFO: more complicated! (could reduce Δ_k too fast and prevent convergence. . . )

⇒ verify that Y is poised before reducing Δ_k

    Philippe Toint (Namur) April 2009 116 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Improving the geometry in a ball


[Figure: the ball of radius Δ_k around x_k]

attempt to reuse past points that are close to x_k

attempt to replace a distant point of Y

attempt to replace a close point of Y

good geometry for the current Δ_k ⇔ improvement impossible

    Philippe Toint (Namur) April 2009 117 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Self-correction at unsuccessful iterations (1)

At iteration k, define the set of exchangeable far points:

F_k = { y ∈ Y_k | ‖y − x_k‖ > Δ_k and L_k(x_k + s_k, y) ≠ 0 }

and the set of exchangeable close points (for some Λ > 1):

C_k = { y ∈ Y_k \ {x_k} | ‖y − x_k‖ ≤ Δ_k and |L_k(x_k + s_k, y)| > Λ }
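In code, these sets could drive the choice of which point the trial x_k + s_k replaces, preferring far points over close ones; a hypothetical helper, where Lvals[i] stands for L_k(x_k + s_k, y_i) and Lambda for the threshold Λ above:

```python
import numpy as np

def choose_replacement(Y, Lvals, xk, delta, Lambda=1.5):
    """Index of the interpolation point to swap out for x_k + s_k, or None."""
    far = [i for i, y in enumerate(Y)
           if np.linalg.norm(np.asarray(y) - xk) > delta and Lvals[i] != 0.0]
    if far:                                   # F_k: exchangeable far points first
        return max(far, key=lambda i: abs(Lvals[i]))
    close = [i for i, y in enumerate(Y)
             if not np.allclose(np.asarray(y), xk)        # exclude x_k itself
             and np.linalg.norm(np.asarray(y) - xk) <= delta
             and abs(Lvals[i]) > Lambda]      # C_k: exchangeable close points
    if close:
        return max(close, key=lambda i: abs(Lvals[i]))
    return None
```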

    Philippe Toint (Namur) April 2009 118 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Self-correction at unsuccessful iterations (2)

    Remarkably,

Whenever

iteration k is unsuccessful,

F_k = ∅, and

Δ_k is small w.r.t. ‖g_k‖,

then C_k ≠ ∅.

(an improvement of the geometry by a factor Λ is always possible at unsuccessful iterations when Δ_k is small and all exchangeable far points have been considered)

⇒ no need to reduce Δ_k forever!

    Philippe Toint (Namur) April 2009 119 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    Trust-region algorithm for DFO (1)

    Algorithm 3.1: TR for DFO

Step 0: Initialization. Given: x_0, Δ_0, Y_0 (⇒ L_0(·, y)). Set k = 0.

Step 1: Criticality test. [complicated and not discussed here]

Step 2: Solve the subproblem. Compute s_k that sufficiently reduces m_k(x_k + s) within the trust region.

Step 3: Evaluation. Compute f(x_k + s_k) and

ρ_k = [ f(x_k) − f(x_k + s_k) ] / [ m_k(x_k) − m_k(x_k + s_k) ].

Step 4: Define the next iterate and interpolation set. ⇐ the big question

Step 5: Update the Lagrange polynomials.
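To fix ideas, a deliberately naive instantiation of this loop with linear models: the set Y_k is rebuilt around x_k at every iteration, so the criticality test, geometry management and Lagrange updates all trivialize. Constants and names are illustrative, not the lecture's.

```python
import numpy as np

def dfo_tr(f, x0, delta=0.5, tol=1e-8, max_iter=500):
    """Naive DFO trust-region loop with linear interpolation models."""
    xk = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # interpolation set: x_k and x_k + delta * e_i  (n + 1 points)
        fk = f(xk)
        g = np.array([(f(xk + delta * e) - fk) / delta for e in np.eye(len(xk))])
        if np.linalg.norm(g) <= tol:
            break
        sk = -delta * g / np.linalg.norm(g)          # model minimizer on boundary
        rho = (fk - f(xk + sk)) / (delta * np.linalg.norm(g))
        if rho >= 0.1:                               # successful: accept, enlarge
            xk = xk + sk
            delta = min(2.0 * delta, 1.0)
        else:                                        # unsuccessful: shrink region
            delta = 0.5 * delta
    return xk
```

Calling dfo_tr on Rosenbrock's function, e.g. dfo_tr(lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2, [-1.2, 1.0]), gives the flavour of the banana-function animation below, at the cost of n + 2 evaluations per iteration.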

    Philippe Toint (Namur) April 2009 120 / 323


    Derivative free optimization, filters and other topics Derivative free optimization

    Global convergence results

If the model is at least fully linear, then

lim inf_{k→∞} ‖∇_x f(x_k)‖ = lim inf_{k→∞} ‖g_k‖ = 0

Scheinberg and T. (2009)

With a more costly algorithm:

If the model is at least fully linear, then

lim_{k→∞} ‖∇_x f(x_k)‖ = lim_{k→∞} ‖g_k‖ = 0

If the model is at least fully quadratic, then the iterates converge to 2nd-order critical points

    Philippe Toint (Namur) April 2009 122 / 323

    Derivative free optimization, filters and other topics Derivative free optimization

    For an efficient numerical method. . .

Many more issues:

which Hessian approximation? (full vs diagonal or structured)

details of criticality tests difficult

details of numerically handling interpolation polynomials (Lagrange, Newton), reference shifts, . . .

good codes around: NEWUOA, DFO ⇒ efficient solvers

Powell (2008 and previously), Conn, Scheinberg and T. (1998), Conn, Scheinberg and Vicente (2008)

    Philippe Toint (Namur) April 2009 123 / 323

Derivative free optimization, filters and other topics Derivative free optimization

On the ever famous banana function. . .

[Figure: an animated sequence of frames (one per slide in the original) of the interpolation-based trust-region DFO method on Rosenbrock's banana function, plotted over roughly [0, 1.2] x [0, 1.5].]

Philippe Toint (Namur) April 2009 124 / 323