Introduction Mathematical Foundations Algorithms
Overview on Optimization Algorithms
Markus A. Köbis
NTNU, TMA4180, Optimering I, Spring 2021
2021-02-26
A Review Session

Idea
• Option A: a live session with a presentation highlighting the 'big picture' of what we have learned so far

What we don't do today
• Learn new material ;-)
• Go through proofs

Questions are welcome!
Talking about the 'Big Picture' (I)
Topics of Today:
Talking about the 'Big Picture' (II)
Topics of Today:
Recap of theory and methods for unconstrained optimization problems

    min_{x ∈ R^n} f(x)

• (Practical) problems and challenges
• Theoretical considerations
• Algorithms
Talking about the 'Big Picture' (III)

Organization
• Problem: convexity, continuity, optimality conditions
• (Descent) methods: gradient / line search, Newton's method, linear CG, Gauss-Newton
Problems and Challenges (I)

Optimization Problems in Science and Engineering
• Physics: Hamilton's principle
• Engineering: shape/trajectory optimization
• Economics: portfolio optimization, return maximization, cost minimization
• Social sciences: experiment design
• Data fitting
• Chemistry/Biology: reactor optimization
Taxonomy of Optimization Problems (not exhaustive)

Optimization
• Uncertainty: stochastic programming, robust optimization
• Deterministic
  – Continuous
    · Unconstrained: nonlinear least squares, nonlinear equations, nondifferentiable optimization, global optimization, derivative-free optimization
    · Constrained: nonlinear programming (bound constrained, linearly constrained), linear programming, quadratic programming, quadratically-constrained quadratic programming, second-order cone programming, semidefinite programming, semiinfinite programming, network optimization, mathematical programs with equilibrium constraints, complementarity problems
  – Discrete: integer programming, combinatorial optimization, mixed integer nonlinear programming
• Multiobjective optimization

Source: https://neos-guide.org/content/optimization-taxonomy
Quality of Algorithms
Properties of 'good' algorithms (Why isn't there a best one?)
• Accuracy
• Efficiency (also: scalability)
  – Speed
    · Total execution time (avoid NP if possible, parallel vs. sequential)
    · Alternative: approximation quality after fixed time
    · Number of computer operations
      · Core function / linear algebra / GPU-routine calls
      · Floating point operations (plus, times, roots, exp./trig., ...)
      · Bit operations
    · Number of f/∇f/∇²f calls
  – Memory
• Robustness
  – Reliably find all solutions for a variety of problems
  – Feedback if the algorithm is unsuccessful, reproducibility
  – Provide additional information on problem structure
• User-friendliness, ability to include expert knowledge, interactivity
• Easily extendable and maintainable
• Cheap, trusted, supported, ...
Building the 'Big Picture'

Mathematical Foundations ↔ Algorithms
When Do Minimizers Exist? Weierstrass' Theorem

Weierstrass Theorem
A continuous function on a compact domain attains its minimum.

• In practice: very often enough
• Can one weaken the 'continuity' requirement?
  (e.g. complicated functions in technical applications)
• Can one weaken the 'compactness' requirement?
  (e.g. optimal control / problems in mathematical physics)

Karl Weierstrass, 1815–1897
Source: Public Domain, https://commons.wikimedia.org/w/index.php?curid=324146
Convexity

Convex Set
A set S ⊆ R^n is convex if for all x1, x2 ∈ S and all λ ∈ [0, 1],

    λ·x1 + (1 − λ)·x2 ∈ S    (a convex combination).

Convex Function
A function f : S → R is convex if its domain S is a convex set and if for any two points x1, x2 ∈ S and λ ∈ [0, 1] the following property is satisfied:

    f(λ·x1 + (1 − λ)·x2) ≤ λ·f(x1) + (1 − λ)·f(x2).

[Figure: graph of f with the chord between (x1, f(x1)) and (x2, f(x2)) lying above the graph]
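As a quick numerical sanity check (not a proof!), the defining inequality can be sampled along a segment. The helper `appears_convex` and the two test functions below are illustrative choices, not part of the lecture:

```python
import numpy as np

def appears_convex(f, x1, x2, n_lambda=101):
    """Sample f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
    along the segment [x1, x2]; return False on any violation."""
    for lam in np.linspace(0.0, 1.0, n_lambda):
        lhs = f(lam * x1 + (1.0 - lam) * x2)
        rhs = lam * f(x1) + (1.0 - lam) * f(x2)
        if lhs > rhs + 1e-12:   # small tolerance for rounding
            return False
    return True

x1, x2 = np.array([-2.0, 1.0]), np.array([3.0, -1.0])
quadratic = lambda x: float(x @ x)             # convex
bump      = lambda x: float(-np.exp(-x @ x))   # not convex on all of R^2

print(appears_convex(quadratic, x1, x2))   # True
print(appears_convex(bump, x1, x2))        # False: inequality fails near the endpoints
```

A `True` result only means no violation was found at the sampled points; convexity itself must be shown analytically.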
(Lower Semi-)Continuity

Definition
A function f : R^n → R is called lower semi-continuous if for every x ∈ R^n and every sequence (xk)_{k∈N} converging to x we have

    f(x) ≤ lim inf_{k→∞} f(xk).

[Figure: a function with a downward jump discontinuity; its lower level set Ωα is closed]

Remark
An alternative (equivalent) definition of lower semi-continuity is the following: a function f : R^n → R is lower semi-continuous if the lower level set

    Ωα(f) := {x ∈ R^n : f(x) ≤ α}

is closed for every α ∈ R.
Coercivity

Definition
A function f : R^n → R is called coercive if for every sequence (xk)_{k∈N} with ‖xk‖ → ∞ we have f(xk) → ∞.

[Figure: graph of a coercive function growing unboundedly in both directions]
Weierstrass Revisited

Proposition
Assume that f : R^n → R is lower semi-continuous and that Ω ⊂ R^n is compact. Then the optimization problem

    min_{x ∈ Ω} f(x)

has at least one global minimizer.

Theorem
Assume that the function f : R^n → R is lower semi-continuous and coercive. Then the optimization problem min_{x ∈ R^n} f(x) admits at least one global minimizer.
The Role of Convexity

Theorem
If f : R^n → R is convex, any local minimizer is a global minimizer. (This implies that optimality conditions (see below) can be used for global optimization.)
Designing Algorithms

Optimization problem:

    min_{x ∈ R^n} f(x)

How to proceed?
Design a computer method to find good points in R^n until you are finished.
(a) Can the search be based on existing mathematical tools? (e.g. (non-)linear system solvers, feasibility problem solvers, simulation methods, etc.)
(b) How can you know that the algorithm is 'finished'?
    → Optimality Conditions
Optimality Conditions (I)

First Order Necessary Condition
Let x* ∈ R^n be a local minimizer, and let f : R^n → R be continuously differentiable in an open neighborhood of x*. Then

    ∇f(x*) = 0.

This condition classifies x* as a stationary point.
Optimality Conditions (II)

Second Order Necessary Condition
Let x* ∈ R^n be a local minimizer of f : R^n → R. If ∇²f exists and is continuous in an open neighborhood of x*, then ∇f(x*) = 0 and ∇²f(x*) is positive semi-definite.

Second Order Sufficient Condition
Let ∇²f be continuous in an open neighborhood of x* ∈ R^n, ∇f(x*) = 0, and let ∇²f(x*) be positive definite. Then x* is a strict local minimizer.

Geometry/Practice
• Properties of a quadratic approximation of f
• In (large-scale) algorithms these conditions are often not used, or only used approximately.
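Given the gradient and Hessian, these conditions can be checked numerically at a candidate point; the classifier below and its test function f(x, y) = x² + 2y² are made-up illustrations:

```python
import numpy as np

def classify_stationary(grad, hess, x, tol=1e-8):
    """First- and second-order condition check at x, using the
    gradient norm and the eigenvalues of the (symmetric) Hessian."""
    g = grad(x)
    if np.linalg.norm(g) > tol:
        return "not stationary"
    eig = np.linalg.eigvalsh(hess(x))
    if np.all(eig > tol):
        return "strict local minimizer (sufficient condition)"
    if np.all(eig >= -tol):
        return "necessary condition holds (inconclusive)"
    return "not a local minimizer"

# f(x, y) = x^2 + 2*y^2 has a strict (here even global) minimum at the origin
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
hess = lambda x: np.diag([2.0, 4.0])   # constant Hessian for this f

print(classify_stationary(grad, hess, np.zeros(2)))           # minimizer
print(classify_stationary(grad, hess, np.array([1.0, 0.0])))  # not stationary
```

Note the "inconclusive" branch: a semi-definite Hessian only satisfies the necessary condition, mirroring the gap between the two theorems above.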
Designing Algorithms (II)

    min_{x ∈ R^n} f(x)

Algorithm: a computerized function (input → output scheme) with limited internal memory, calculation power, and precision.

[Picture: point cloud]
Designing Algorithms (III)
Versatile methods for 'any' optimization problem

1. First attempt: grid search. Evaluate f along a predefined grid and choose some good grid points around which to refine further.
   'Curse of dimension': the number of grid points grows exponentially with the dimension n of x.
2. Improvement (a): solve a series of 1d problems using 'classical methods'. How to define these?
3. Improvement (b): algorithms based on heuristics/physics/biology, e.g. the Nelder-Mead simplex method, simulated annealing, evolutionary algorithms.
4. (Often) better: iterative procedure. Choose a starting point x0 ∈ R^n and search along a sequence

       x0, x1, x2, ..., xk, xk+1, ... → x*

   with improved properties of xk+1 over xk.

Given the scarce information (f is just input → output), the only way to measure 'improved properties' is through function values → descent methods.

(Scaling invariance also matters, as do easy extension and maintenance of code.)
Descent Methods (I)

Basic Idea of Descent Methods
1. Start with an initial x0 ∈ R^n.
2. For k = 0, 1, 2, ...:
   Find xk+1 such that f(xk+1) < f(xk).
   If some stopping criterion is fulfilled: STOP.
Output: x* = xk+1

How to find a better x? Define the search direction:

    λ·pk := xk+1 − xk  ⇔  xk+1 = xk + λ·pk

When is pk good?

Definition
For a given xk ∈ R^n, a direction (nonzero vector of undefined length) pk ∈ R^n is called a descent direction if there exists ᾱ > 0 such that

    f(xk + α·pk) < f(xk)  for all α ∈ (0, ᾱ).

[Figure: graph of α ↦ f(xk + α·pk), decreasing for small α]
Descent Methods (II)

Lemma: Characterizing descent directions for differentiable f
Let f : R^n → R be continuously differentiable.
1. Any direction p satisfying

       ∇f(x)^T · p < 0

   is a descent direction.
2. The negative gradient −∇f(x) is a descent direction (if not zero).
3. For any positive definite matrix B ∈ R^{n×n}, the direction

       −B · ∇f(x)

   is a descent direction (if x is not stationary).
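The characterization ∇f(x)^T · p < 0 is easy to test numerically. In the sketch below, the objective f(x) = x^T x and the positive definite matrix B are illustrative choices:

```python
import numpy as np

def is_descent_direction(grad_x, p):
    """Directional derivative test from the lemma: grad(x)^T p < 0."""
    return float(grad_x @ p) < 0.0

x = np.array([1.0, -2.0])
grad_x = np.array([2.0 * x[0], 2.0 * x[1]])   # gradient of f(x) = x^T x

B = np.array([[2.0, 0.0], [0.0, 5.0]])        # positive definite

print(is_descent_direction(grad_x, -grad_x))      # case 2: negative gradient
print(is_descent_direction(grad_x, -B @ grad_x))  # case 3: -B * gradient
print(is_descent_direction(grad_x, grad_x))       # ascent direction
```

The first two prints are `True`, the last is `False`, matching cases 2 and 3 of the lemma.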
Line Search (I)

General form of the descent algorithm:

    xk+1 := xk + αk · pk

What is a good αk?
1. Constant α:
   • Good for some special algorithms and when the problem structure is well understood
   • α too large: 'overshooting', and the method may diverge
   • α too small: potentially many steps and slow convergence
2. Determining a perfect value, exact line search:

       αk := argmin_α Φk(α),  where Φk(α) := f(xk + α·pk)

   • Good if the Φk-minimization is easy.
3. Inexact line searches: use heuristics to obtain adequate step lengths at justifiable cost.
Line Search (I): Backtracking

(Sub-)Algorithm: Backtracking
Problem: find a good/quick approximation to the solution of min_α Φk(α).
Parameters: initial guess ᾱ ∈ R, reduction constant ρ ∈ (0, 1)

Set α = ᾱ.
While some stopping criterion for Φk(α) is not fulfilled:
    set α = ρ · α.

In Practice
• Also set a maximum number of iterations
• Sometimes: re-use α from the previous run of the algorithm
• Improvements based on 1d equation solvers (regula falsi, interpolation, etc.)
• ρ ∈ (0.4, 0.7)
Line Search (II)

Armijo (Sufficient Decrease) Condition
Given c1 ∈ (0, 1), we require

    Φk(αk) = f(xk + αk·pk) ≤ f(xk) + c1 · αk · ∇f(xk)^T · pk =: l(αk),

where the right-hand side is Φk(0) + αk · c1 · Φk'(0).

Wolfe (Curvature) Condition(s)
Given, furthermore, a constant c2 ∈ (c1, 1), we additionally require

    Φk'(αk) = ∇f(xk + αk·pk)^T · pk ≥ c2 · ∇f(xk)^T · pk = c2 · Φk'(0).

Strong Wolfe condition: also restrict the positive slope.

[Figures: graph of α ↦ f(xk + α·pk) with the Armijo line and the desired minimal slope; acceptable step lengths marked]

Existence of such αk is guaranteed if f is smooth and bounded from below along the search line.
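Combining the backtracking loop of the previous slide with the Armijo condition as its stopping criterion gives a minimal sketch like the following; the parameter values (ρ = 0.5, c1 = 10⁻⁴) and the test function are common illustrative choices, not prescribed here:

```python
import numpy as np

def backtracking_armijo(f, grad, x, p, alpha0=1.0, rho=0.5, c1=1e-4, max_iter=50):
    """Shrink alpha until f(x + alpha*p) <= f(x) + c1*alpha*grad(x)^T p."""
    alpha = alpha0
    fx, slope = f(x), float(grad(x) @ p)
    assert slope < 0, "p must be a descent direction"
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:   # Armijo condition
            return alpha
        alpha *= rho
    return alpha

f = lambda x: float(x @ x)    # simple convex test function
grad = lambda x: 2.0 * x
x = np.array([2.0, -1.0])
p = -grad(x)                  # steepest descent direction

alpha = backtracking_armijo(f, grad, x, p)
print(alpha, f(x + alpha * p) < f(x))   # 0.5 True
```

Here α = 1 overshoots (f is unchanged), so one halving to α = 0.5 already satisfies the sufficient decrease condition.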
Line Search (III): Fundamental Result for Line Search Methods

Theorem (Zoutendijk)
Consider any iteration of the form

    xk+1 = xk + αk · pk,

where pk is a descent direction and αk satisfies the Wolfe conditions for 0 < c1 < c2 < 1. Suppose that f is bounded below and differentiable with Lipschitz-continuous gradient on an open set containing the lower level set Ω_{f(x0)}(f). Then

    Σ_{k≥0} (cos θk)² · ‖∇f(xk)‖² < ∞,

where θk := ∠(−∇f(xk), pk).

Consequence

    cos θk ≥ δ > 0  ⇒  lim_{k→∞} ‖∇f(xk)‖ = 0
Gradient Descent

Given a (continuously) differentiable objective f : R^n → R, we have information concerning the slope(s) of f.

Algorithm: Gradient Descent
Use a descent method (with or without line search) with the search direction

    pk := −∇f(xk)  for all k.

Locally (i.e. as α → 0), this is the best possible choice.
Gradient Descent: Convergence

Gradient descent for the quadratic test problem

    f(x) := (1/2) · x^T · Q · x − b^T · x

with positive definite matrix Q.

Gradient Descent with Exact Line Search
Update formula:

    xk+1 = xk − [ (∇f(xk)^T · ∇f(xk)) / (∇f(xk)^T · Q · ∇f(xk)) ] · ∇f(xk)

Theorem
When gradient descent with exact line search is applied to f as above, then

    ‖xk+1 − x*‖²_Q ≤ ( (λn − λ1) / (λn + λ1) )² · ‖xk − x*‖²_Q,

where 0 < λ1 ≤ λ2 ≤ ... ≤ λn are the eigenvalues of Q (linear convergence).

A similar result holds in the nonlinear case (Q ↔ ∇²f(x*), with the squared factor replaced by r² for any r with (λn − λ1)/(λn + λ1) < r < 1).
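The exact-line-search update can be verified directly on a small quadratic; the matrix Q, vector b, and starting point below are arbitrary illustrative choices:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T Q x - b^T x with s.p.d. Q
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)   # exact minimizer, where grad f = Qx - b = 0

def gd_exact(x, n_steps):
    """Gradient descent with the closed-form exact step for a quadratic."""
    for _ in range(n_steps):
        g = Q @ x - b                  # gradient
        if np.allclose(g, 0.0):
            break
        alpha = (g @ g) / (g @ Q @ g)  # exact line search step length
        x = x - alpha * g
    return x

x = gd_exact(np.array([2.0, 2.0]), 50)
print(np.allclose(x, x_star))   # True
```

With eigenvalues λ1 = 1, λn = 3 the predicted contraction factor per step is ((3 − 1)/(3 + 1))² = 0.25, so 50 iterations are more than enough here.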
Gradient Descent: Convergence

[Figure: zig-zagging gradient descent iterates on elliptical level curves in the (x1, x2)-plane]

Perfect (convergence in one step) if all level curves are equally spaced along all axes.
Newton's Method

Motivation
• 'Instant' method (one step) for f(x) = x^T · Q · x − b^T · x with s.p.d. matrix Q
• Solve ∇f = 0 with the 'classical' Newton's method
• (Sequential) approximation of f by quadratic functions

Global Convergence Properties
• Oscillations
• Divergence
• Stuck in local minima
• Often not even a descent direction
• Not defined if ∇²f(xk) is singular

Sir Isaac Newton, 1643–1727
Source: Godfrey Kneller, Public Domain, https://commons.wikimedia.org/w/index.php?curid=146431
Newton's Method

Newton's Method
Apply the descent method algorithm with the update

    xk+1 := xk − αk · (∇²f(xk))^{−1} · ∇f(xk),

where αk = 1 (or αk → 1); note that ∇²f(xk) is not always s.p.d.

(Additional) Costs:
• Approximation/calculation of the Hessian ∇²f(xk), and storage thereof
• Inversion of ∇²f(xk) (never invert explicitly(!); solve the linear system instead)

Why all this trouble?

Theorem
If f is twice differentiable with a Lipschitz-continuous Hessian, ∇²f(x*) is s.p.d. at a local minimum x*, and x0 is sufficiently close to x*, then Newton's method converges quadratically towards x*, both in x and in ∇f.
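A minimal sketch of the undamped (αk = 1) Newton iteration, solving the Newton system with `np.linalg.solve` rather than inverting the Hessian; the test function f(x) = x1² + x1⁴ + x2² is a made-up example with an s.p.d. Hessian at its minimizer, the origin:

```python
import numpy as np

def newton(grad, hess, x, tol=1e-10, max_iter=50):
    """Undamped Newton: solve hess(x) p = -grad(x), then step x <- x + p."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)   # never form the inverse explicitly
        x = x + p
    return x

# f(x) = x1^2 + x1^4 + x2^2, strict local (and global) minimum at the origin
grad = lambda x: np.array([2.0 * x[0] + 4.0 * x[0]**3, 2.0 * x[1]])
hess = lambda x: np.array([[2.0 + 12.0 * x[0]**2, 0.0], [0.0, 2.0]])

x = newton(grad, hess, np.array([1.0, 1.0]))
print(np.allclose(x, [0.0, 0.0], atol=1e-6))   # True
```

Starting reasonably close to x*, the gradient norm drops roughly quadratically per iteration, as the theorem predicts.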
Modifications of Newton's Method (*different names in the literature)

Modified Newton's Method
Replace

    ∇²f(xk) ⇝ ∇²f(xm)  (m < k, e.g. m = 0)

Lose quadratic convergence; fewer evaluations/inversions.

Simplified Newton
Replace the Hessian by a (good and easy to invert) approximation

    ∇²f(xk) ⇝ Qk ≈ ∇²f(xk)

Damped Newton
Use a reduced step length αk < 1 in the Newton updates.
Lose quadratic convergence; (often:) a larger domain of convergence.

In Practice
Use Newton's method only when the Hessian is cheap and/or to finalize an iteration.
(Linear) Conjugate Gradient Method (I)

Back to the quadratic problem

    f(x) = (1/2) · x^T · A · x − b^T · x

with an s.p.d. matrix A ∈ R^{n×n}.
Alternatively: solve the (large) linear system A·x = b, with the residual r(x) := A·x − b = ∇f(x).

Conjugate Direction Methods
Use a set of conjugate directions p0, p1, ..., p_{n−1} (always linearly independent) as search directions in the descent algorithm, i.e. pi^T · A · pj = 0 for i ≠ j.
• Exact line search as for the gradient method is possible
• Convergence after n steps is guaranteed
• After k steps: minimization of f over a k-dimensional subspace (a so-called Krylov space (→ Cayley-Hamilton))
(Linear) Conjugate Gradient Method (II)

CG method: Basic idea
• Use only the previous vector p_{k−1} and the residual r(xk) to construct the next conjugate direction pk:

    pk = −r(xk) + βk · p_{k−1}   (βk is uniquely determined by the conjugacy requirement)

Linear CG Method
    r0 = A·x0 − b,  p0 = −r0
    While rk ≠ 0:
        αk = − (rk^T · pk) / (pk^T · A · pk)
        xk+1 = xk + αk · pk
        rk+1 = A · xk+1 − b
        βk+1 = (rk+1^T · A · pk) / (pk^T · A · pk)
        pk+1 = −rk+1 + βk+1 · pk
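The update formulas above translate almost line by line into code; the small 2×2 test system below is a made-up example:

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iter=None):
    """Linear CG for A x = b with s.p.d. A, following the slide's updates."""
    x = x0.astype(float)
    r = A @ x - b          # residual r(x) = A x - b
    p = -r
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) < tol:
            break
        alpha = -(r @ p) / (p @ A @ p)   # exact line search step
        x = x + alpha * p
        r = A @ x - b
        beta = (r @ A @ p) / (p @ A @ p)  # enforce conjugacy with previous p
        p = -r + beta * p
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = linear_cg(A, b, np.zeros(2))
print(np.allclose(A @ x, b))   # True
```

Since n = 2, the method terminates at the exact solution (up to rounding) in at most two steps, consistent with the finite-termination property.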
Convergence and Preconditioned CG

Theorem
If A has only r distinct eigenvalues, then the (linear) conjugate gradient method terminates at the solution after r steps.

Theorem
The (theoretical) convergence rate is similar to that of the gradient method:

    ‖xk+1 − x*‖²_A ≤ ( (λ_{n−k} − λ1) / (λ_{n−k} + λ1) )² · ‖x0 − x*‖²_A

PCG
Improve the convergence behavior by linearly transforming the search space, x̂ := C·x:

    f̂(x̂) = (1/2) · x̂^T · (C^{−T} · A · C^{−1}) · x̂ − (C^{−T} · b)^T · x̂

NB: The extension (and convergence analysis) to the nonlinear case is (relatively) easy.
Least Squares Problems

Form of the objective function:

    f(x) = (1/2) · Σ_{j=1}^m rj²(x),

where each rj : R^n → R is a real-valued smooth function called a residual. We assume that m ≥ n.

Remark
• Least-squares problems arise in many areas of application, and may in fact be the largest source of unconstrained optimization problems.
• The residuals measure the discrepancy between the model and the observed behavior of the system.
• By minimizing f, we select values for the parameters that best match the model to the data.
• We show how to devise efficient, robust minimization algorithms by exploiting the special structure of the function f and its derivatives.
Least Squares Problems (cont.)

Residual vector r : R^n → R^m,

    r(x) := (r1(x), ..., rm(x))^T.

Then f(x) = (1/2) · ‖r(x)‖²₂. The Jacobian (an m × n matrix) is defined as

    J(x) := [ ∂rj/∂xi ]_{j=1,...,m; i=1,...,n} = ( ∇r1(x)^T ; ... ; ∇rm(x)^T ).

Hence

    ∇f(x) = Σ_{j=1}^m rj(x) · ∇rj(x) = J(x)^T · r(x),

    ∇²f(x) = Σ_{j=1}^m ∇rj(x) · ∇rj(x)^T + Σ_{j=1}^m rj(x) · ∇²rj(x)
           = J(x)^T · J(x) + Σ_{j=1}^m rj(x) · ∇²rj(x).        (1)

(The first term, J(x)^T · J(x), comes 'for free' once J is available.)
Gauss-Newton Methods

The Gauss-Newton Method
The standard Newton equations ∇²f(xk) · p = −∇f(xk) are replaced by the following system to obtain the search direction pk^GN:

    Jk^T · Jk · pk^GN = −Jk^T · rk.

Advantages
Using the approximation ∇²f(xk) ≈ Jk^T · Jk saves the trouble of computing the individual residual Hessians ∇²rj from (1). Often, the first term in (1) dominates the second, so that Jk^T · Jk is a good approximation of ∇²f(xk). In practice, many least-squares problems have small residuals at the solution, leading to rapid local convergence of Gauss-Newton.

Convergence
In theory: (linear) convergence under mild regularity and full-rank conditions on the system matrices.
In practice: good global and fast local convergence.
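A minimal sketch of a plain Gauss-Newton loop with unit step length; the exponential-fitting problem below (fit y = c1·e^{c2·t} to data generated from c = (2, −1), so the residual at the solution is zero) is a made-up example:

```python
import numpy as np

def gauss_newton(residual, jacobian, x, n_iter=20):
    """Solve (J^T J) p = -J^T r at each step and update x <- x + p."""
    for _ in range(n_iter):
        r, J = residual(x), jacobian(x)
        p = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton system
        x = x + p
    return x

# Synthetic data from the 'true' parameters c = (2, -1)
t = np.linspace(0.0, 2.0, 10)
y = 2.0 * np.exp(-1.0 * t)

residual = lambda c: c[0] * np.exp(c[1] * t) - y
jacobian = lambda c: np.column_stack([np.exp(c[1] * t),            # d r / d c1
                                      c[0] * t * np.exp(c[1] * t)])  # d r / d c2

c = gauss_newton(residual, jacobian, np.array([1.0, 0.0]))
print(np.allclose(c, [2.0, -1.0]))   # True
```

Because the residual vanishes at the solution, the neglected second term in (1) vanishes too, and the iteration converges rapidly, matching the small-residual argument above. A production code would add a line search or a Levenberg-Marquardt regularization.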
What Next?