06 optimization notes - eurecom.frzuluaga/files/06_optimization_notes.pdf
TRANSCRIPT
MALIS: Optimization
Maria A. Zuluaga
Data Science Department
Recap: estimating w for logistic regression
• Iterative reweighted least squares (IRLS): an iterative algorithm derived from the Newton-Raphson method
• IRLS is an optimization algorithm that iteratively solves a weighted least squares problem.
In this lecture we will have a formal introduction to optimization and review some optimization algorithms
Slides inspired by the optimization lecture from Sanjay Lall and Stephen Boyd - Stanford University
Optimization problem
• Given θ ∈ ℝ^d and a function f: ℝ^d → ℝ
• Goal: Choose θ to minimize f
• Let us call θ* optimal if ∀ θ, f(θ*) ≤ f(θ)
• f* = f(θ*) is the optimal value

minimize f(θ)

θ is the decision variable; f is the objective/cost function (also called energy)
Optimization problem: an equivalence
• Equivalently, the optimization problem can be framed as:
• Goal: Choose θ to maximize −f
• θ* is optimal if ∀ θ, −f(θ*) ≥ −f(θ)
• −f(θ*) is the optimal value of the maximization problem
• We will use minimization, but remember this is an equivalent formulation

maximize −f(θ)

θ is the decision variable; −f is the utility/reward function
Conditions for optimality
• If θ* is optimal, it is a stationary point
• ∇f(θ*) = 0 : optimality condition
• Not all stationary points are optimal
• Important: We are assuming that f is differentiable
Two examples: Is there any optimal point?
Solving an optimization problem
• So far, we have encountered some optimization problems that we solved analytically
• First optimization problem of this course:
Least squares for linear regression
argmin_θ ½ (y − Xθ)ᵀ(y − Xθ)

θ* = θ̂ = (XᵀX)⁻¹ Xᵀ y
Closed-form solution
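As an illustration, a minimal NumPy sketch of this closed-form solution; the design matrix X and targets y below are synthetic, purely hypothetical data.

import numpy as np

# Hypothetical data: n = 100 samples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Closed-form least squares: theta* = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over an explicit matrix inverse for numerical stability.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)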
Solving an optimization problem
• In other cases it is not possible to obtain a closed-form solution, and an iterative algorithm is needed
• Iterative algorithms compute a sequence of values θ^(1), θ^(2), …, θ^(k), hoping that f(θ^(k)) → f* as k → ∞
• First iterative algorithm of this course:
w_new ← argmin_w (z − Xw)ᵀ S (z − Xw)
Iterative reweighted least squares
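For reference, a minimal sketch of one such reweighted update for logistic regression in its standard Newton-Raphson/IRLS form. This is an illustrative sketch, not the lecture's exact notation; it assumes a design matrix X, binary labels y, and current weights w.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(X, y, w):
    # One reweighted least-squares update for logistic regression:
    # solve argmin_w (z - Xw)^T S (z - Xw) with S = diag(p(1-p)).
    p = sigmoid(X @ w)            # current predicted probabilities
    s = p * (1.0 - p)             # diagonal of the weight matrix S
    z = X @ w - (p - y) / s       # working response (assumes 0 < p < 1)
    XtS = X.T * s                 # equals X^T diag(s)
    return np.linalg.solve(XtS @ X, XtS @ z)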
Iterative algorithms: Some formalization
• Compute a sequence of values θ^(1), θ^(2), …, θ^(k)
• θ^(k): the kth iterate
• θ^(1): starting point or initialization
• Many iterative algorithms are descent methods, i.e. f(θ^(k+1)) < f(θ^(k)), k = 1, 2, …
• Important: f(θ^(k)) converges but not necessarily to f*
Iterative algorithms: How to stop?
• Rule 1: Reach maximum number K of iterations
• Rule 2: ‖∇f(θ^(k))‖ ≤ ε
• ε is called the stopping tolerance; it is a small positive number
• The goal is to have f(θ^(k)) not too far from f*
• Important: f(θ^(k)) converges but not necessarily to f*
Non-heuristic vs. heuristic algorithms
Non-heuristic
• We know that f(θ^(k)) → f* for any θ^(1)
Heuristic
• There is no guarantee that f(θ^(k)) → f*
• Examples:
• Hill climbing
• Simulated Annealing
• Genetic Algorithms
• Tabu search
• … (and many more)
Convexity
• A function is convex if, for every pair of points x, y ∈ ℝ^d, the line segment connecting (x, f(x)) to (y, f(y)) does not go below f(⋅).
• More formally, a function f: ℝ^d → ℝ is convex if:
• For any θ, θ′, and λ with 0 ≤ λ ≤ 1
f(λθ + (1 − λ)θ′) ≤ λ f(θ) + (1 − λ) f(θ′)   (checked numerically in the sketch below)
• For the case where d = 1:
• Equivalent to f″(θ) ≥ 0 ∀θ
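As a quick illustration, the defining inequality can be checked numerically on a grid of λ values. The helper below is hypothetical, and passing the check on one pair of points is only a necessary condition, not a proof of convexity.

import numpy as np

def convexity_violated(f, theta, theta_p, n_lambdas=101):
    # True if f(l*theta + (1-l)*theta_p) <= l*f(theta) + (1-l)*f(theta_p)
    # fails for some lambda in [0, 1].
    lambdas = np.linspace(0.0, 1.0, n_lambdas)
    lhs = f(lambdas * theta + (1 - lambdas) * theta_p)
    rhs = lambdas * f(theta) + (1 - lambdas) * f(theta_p)
    return np.any(lhs > rhs + 1e-12)

print(convexity_violated(lambda t: t**2, -2.0, 3.0))      # False: theta^2 is convex
print(convexity_violated(lambda t: np.sin(t), 0.0, 4.0))  # True: sin is not convex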
Convex vs Concave
Examples: Convex or not?
Fig 1 from: Q. Nguyen & M. Hein. The Loss Surface of Deep and Wide Neural Networks. ICML 2017
Some convex empirical risk functions
Convex vs. Non-convex functions
Optimizing convex functions is easy. Optimizing non-convex functions can be very hard.
Convex optimization
• If f is convex, the optimization problem is called a convex optimization problem
• ∇f(θ) = 0 only for θ optimal.
• All stationary points are optimal.
• This implies that convex optimization is non-heuristic
• In principle one can find the exact solution
Constrained optimization
• Optimizing the objective function with respect to some variables in the presence of constraints on those variables
minimize f(x)
subject to g_i(x) = c_i,  i = 1, …, m
and h_j(x) ≥ d_j,  j = 1, …, n
Examples of algorithms: substitution or Lagrange multipliers for equality constraints (see the worked example below); linear and quadratic programming for inequality constraints
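As a small worked illustration of the Lagrange-multiplier approach (not from the slides): to minimize f(x₁, x₂) = x₁² + x₂² subject to x₁ + x₂ = 1, form the Lagrangian L(x, λ) = x₁² + x₂² − λ(x₁ + x₂ − 1). Setting ∂L/∂x₁ = 2x₁ − λ = 0 and ∂L/∂x₂ = 2x₂ − λ = 0 gives x₁ = x₂ = λ/2; the constraint then yields λ = 1, so the minimizer is x₁ = x₂ = 1/2 with optimal value 1/2.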
Smoothness
• So far we have assumed that the gradient of f is defined everywhere

Optimizing smooth functions is "easier". Linear programming techniques are an example of methods for dealing with piecewise-linear functions.
Linear programming
• Technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.
minimize f(x) = cᵀx
subject to Ax ≤ b
and x ≥ 0
Source: Concise Machine Learning - Jonathan Richard Shewchuk
Examples of LP algorithms: Simplex, interior point methods
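A minimal sketch of such an LP solved with SciPy's linprog; the numbers are arbitrary illustrative data. Note that linprog uses x ≥ 0 as its default bounds, matching the formulation above.

import numpy as np
from scipy.optimize import linprog

# Hypothetical LP: minimize c^T x subject to A x <= b and x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[-1.0, -1.0]])   # -x1 - x2 <= -1, i.e. x1 + x2 >= 1
b = np.array([-1.0])

res = linprog(c, A_ub=A, b_ub=b)   # default bounds are x >= 0
print(res.x, res.fun)              # expected: x = [1, 0], optimal value 1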
Quadratic programming
• Technique for the optimization of a quadratic objective function, subject to linear constraints.
minimize f(x) = xᵀQx + cᵀx
subject to Ax ≤ b
and x ≥ 0
Examples of QP algorithms: sequential minimal optimization (SMO), coordinate descent
where Q is a symmetric, positive definite matrix
We will revisit this in three lectures
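A minimal sketch of a small QP of this form. For simplicity it uses scipy.optimize.minimize with the general-purpose SLSQP solver rather than a dedicated QP algorithm such as SMO or coordinate descent; Q, c, A, b below are arbitrary illustrative data.

import numpy as np
from scipy.optimize import minimize

# Hypothetical QP: minimize x^T Q x + c^T x s.t. A x <= b and x >= 0,
# with Q symmetric positive definite.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

obj = lambda x: x @ Q @ x + c @ x
constraints = [{"type": "ineq", "fun": lambda x: b - A @ x}]  # A x <= b
bounds = [(0, None)] * 2                                      # x >= 0

res = minimize(obj, x0=np.zeros(2), method="SLSQP",
               bounds=bounds, constraints=constraints)
print(res.x, res.fun)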
Credits: Peter Richtarik – Data Science Summer School DS3, École Polytechnique, 2017
Gradient-based methods
Intuition
• Assumptions:
• Unconstrained optimization
• f is continuous and convex
• Warm-up: consider f: ℝ → ℝ. What would you do to minimize it?
Intuition
x* = random initialization
while f'(x*) ≠ 0:
    if f'(x*) > 0: move x* in the opposite direction (decrease x*)
    if f'(x*) < 0: move x* in the same direction (increase x*)
[Figure: tangent lines to f at different points; the derivative df/dθ at θ* gives the slope, i.e. how much f changes for a change in θ]
Formalization: Gradient Descent
Initialize θ^(1) ∈ ℝ^d
While ‖∇f(θ)‖ > ε do
    θ ≔ θ − α ⋅ ∇f(θ)

Update rule
α > 0 is called the step size or learning rate
At each step, the parameter vector θ is moved in the direction of the greatest rate of decrease of the error function. This is also called steepest descent.
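A minimal NumPy sketch of this update rule, using the gradient-norm stopping rule introduced earlier; grad_f, theta0 and the toy objective are illustrative assumptions, not the lecture's code.

import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, eps=1e-6, max_iter=10_000):
    # Vanilla gradient descent: theta := theta - alpha * grad_f(theta),
    # stopped when the gradient norm falls below eps.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) <= eps:
            break
        theta = theta - alpha * g
    return theta

# Toy example: f(theta) = ||theta - 1||^2, gradient 2 * (theta - 1).
print(gradient_descent(lambda t: 2 * (t - 1.0), np.zeros(3)))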
Back to the warm-up example
θ ≔ θ − α ⋅ ∇f(θ)
α = 1
d-dimensional functions
θ ≔ θ − α ⋅ ∇f(θ)

θ_i ≔ θ_i − α ⋅ ∂f(θ)/∂θ_i

Zoom-in
The update rule can be thought of as d updates of the “zoomed-in” form being done in parallel (one per dimension)
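A tiny numerical check of this equivalence; the toy gradient below is purely illustrative.

import numpy as np

# Toy gradient for f(theta) = sum(theta**2): grad = 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
alpha = 0.1
grad = 2 * theta

vectorized = theta - alpha * grad                      # one d-dimensional update
per_dim = np.array([theta[i] - alpha * grad[i]         # d "zoomed-in" updates
                    for i in range(theta.size)])
print(np.allclose(vectorized, per_dim))                # True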
Step size
• If α is too large we can get f(θ^(k+1)) > f(θ^(k)).
• If it is too small, progress is slow.
• Simple solution: a varying learning rate α_k.
• Example
  • If f(θ^(k+1)) > f(θ^(k)), set θ^(k+1) = θ^(k) and α_(k+1) = α_k / 2
  • Else α_(k+1) = 1.3 α_k
Convergence: Do we find θ*?
• Under some assumptions, the gradient method finds a stationary point: ∇f(θ^(k)) → 0 as k → ∞
• Convex problems
  • Non-heuristic
  • No matter the initialization θ^(1), f(θ^(k)) → f* as k → ∞
• Non-convex problems
  • Heuristic
  • It can happen that f(θ^(k)) ↛ f*
Gradient method: Algorithm
initialize θ^(1) ∈ ℝ^d, α_1 > 0 and K_max
θ* = θ^(1)
for k = 1, 2, …, K_max:
    do
        compute ∇f(θ^(k))
        compute tentative update θ_tent = θ^(k) − α_k ∇f(θ^(k))
        if f(θ_tent) ≤ f(θ^(k))
            θ^(k+1) = θ_tent
            θ* = θ_tent
            α_(k+1) = 1.3 α_k
        else
            α_(k+1) = 0.5 α_k
    while ‖∇f(θ^(k))‖ ≥ ε
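A minimal Python sketch of this algorithm, including the accept/reject step-size rule; the function names and the toy quadratic are illustrative. In the rejected-step case it simply keeps θ^(k) unchanged, which the pseudocode leaves implicit.

import numpy as np

def gradient_method(f, grad_f, theta0, alpha=1.0, eps=1e-6, k_max=1000):
    # Gradient method with the adaptive step-size rule from the slide:
    # accept a step only if it does not increase f; grow alpha by 1.3 on
    # success and halve it otherwise.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(k_max):
        g = grad_f(theta)
        if np.linalg.norm(g) < eps:
            break
        tentative = theta - alpha * g
        if f(tentative) <= f(theta):
            theta = tentative
            alpha *= 1.3
        else:
            alpha *= 0.5
    return theta

# Toy quadratic: f(theta) = ||theta||^2.
print(gradient_method(lambda t: t @ t, lambda t: 2 * t, np.array([5.0, -3.0])))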
Examples: Changing the learning rate
K = 100, α=0.1 K = 100, α=0.01
Credits: Simone Rossi
Examples: Faster learning rate
K = 12, α=1
Credits: Simone Rossi
Examples: Role of initialization
K = 100, α=0.1 K = 10, α=1
Credits: Simone Rossi
Gradient descent (vanilla version)
Pros
• Scales well to large datasets
• No need to multiply matrices
• Solves many convex problems
• Unreasonably good heuristic for the approximate solution of non-convex problems
Cons
• Depends on good initialization
• Need to re-run several times to find a sufficiently good minimum
• Cannot cope with non-differentiable functions
• Slow on functions with elongated, elliptic level sets (ill-conditioned problems)
Other gradient-based methods
• Conjugate gradient
• Stochastic gradient descent (*)
• Mini-batch gradient descent (*) (a minimal sketch follows below)
• Quasi-Newton methods
• …
• There are also formulations to deal with non-smooth functions
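For the stochastic/mini-batch variants marked (*), a minimal illustrative sketch on a least-squares objective; all names below are hypothetical and this is not the exact algorithm covered later.

import numpy as np

def sgd_least_squares(X, y, alpha=0.01, epochs=50, batch_size=10, seed=0):
    # Minimal mini-batch SGD for f(theta) = (1/n) * ||y - X theta||^2.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ theta - yb)
            theta -= alpha * grad
    return theta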
Recap
• We formalized the problem of optimization
• We reviewed the different types of algorithms used to solve the optimization problem
• We introduced gradient-based optimization methods
• We introduced the gradient descent algorithm and investigated its behavior in different settings
Why optimization?
Interested in going deeper? Optimization Theory with applications [optim]
Further reading
Source | Chapters
Pattern Recognition and Machine Learning | Sec 5.2.4
Convex Optimization – S. Boyd | All (great book)
Northwestern University Open Text Book on Process Optimization | All
Peter Richtarik's slides – DS3 2017 | All
Scipy tutorials (contains code) | Mathematical optimization