MALIS: Optimization

Maria A. Zuluaga

Data Science Department

Recap: estimating θ for logistic regression

• Iterative reweighted least squares (IRLS): an iterative algorithm derived from the Newton-Raphson method

• IRLS is an optimization algorithm that iteratively solves a weighted least squares problem.

In this lecture we will have a formal introduction to optimization and review some optimization algorithms

Slides inspired by the optimization lecture from Sanjay Lall and Stephen Boyd - Stanford University


Optimization problem

• Given θ ∈ ℝᵈ and a function f: ℝᵈ → ℝ

• Goal: Choose θ to minimize f

• θ* is called optimal if ∀ θ, f(θ*) ≤ f(θ)

• f* = f(θ*) is the optimal value

minimize f(θ)

θ: decision variable; f: objective/cost function or energy
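To make the setup concrete, here is a minimal sketch that hands an objective f to a generic solver; the quadratic objective, the starting point, and the use of scipy.optimize.minimize are illustrative choices, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Objective f: R^2 -> R (a smooth convex example)
f = lambda theta: (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

result = minimize(f, x0=np.zeros(2))   # choose theta to minimize f
print(result.x)      # theta* ~ [1, -2]
print(result.fun)    # f* = f(theta*) ~ 0
```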

Optimization problem: an equivalence

• Equivalently, the optimization problem can be framed as:

• Goal: Choose θ to maximize −f

• θ* is optimal if ∀ θ, −f(θ*) ≥ −f(θ)

• −f(θ*) is the optimal value

• We will use the minimization form, but remember this is an equivalent formulation

maximize −f(θ)

θ: decision variable; −f: utility/reward function


Conditions for optimality

• If �∗ is optimal, it is a stationary point

• ∇f(θ*) = 0 : optimality condition

• Not all stationary points are optimal

• Important: We are assuming that � is differentiable


Two examples: Is there any optimal point?

[Figure: two example functions f]


Solving an optimization problem

• So far, we have encountered some optimization problems that we solved analytically

• First optimization problem of this course:


Least squares for linear regression

argmin_θ ½ (y − Xθ)ᵀ(y − Xθ)

θ* = θ̂ = (XᵀX)⁻¹Xᵀy

Closed-form solution
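As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the toy data and variable names are invented for illustration):

```python
import numpy as np

# Toy data: 100 samples, intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed-form solution theta* = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)
```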

Solving an optimization problem

• In other cases it is not possible to obtain a closed-form solution, and an iterative algorithm is needed

• Iterative algorithms compute a sequence of values θ_1, θ_2, …, θ_k

hoping that f(θ_k) → f* as k → ∞

• First iterative algorithm of this course:

θ_new ← argmin_θ (z − Xθ)ᵀ W (z − Xθ)

Iterative reweighted least squares
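For concreteness, here is a hedged NumPy sketch of IRLS for logistic regression; the weight matrix W = diag(p(1 − p)) and the working response z follow the standard Newton-Raphson derivation, and the function name, tolerances, and iteration cap are illustrative choices, not from the slides:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))              # predicted probabilities
        W = p * (1.0 - p)                                  # diagonal of the weight matrix
        z = X @ theta + (y - p) / np.maximum(W, 1e-12)     # working response
        # Weighted least squares step: theta <- argmin (z - X theta)^T W (z - X theta)
        theta_new = np.linalg.solve(X.T * W @ X, X.T @ (W * z))
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta
```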


Iterative algorithms: Some formalization

• Compute a sequence of values θ_1, θ_2, …, θ_k

• θ_k: k-th iterate

• θ_1: starting point or initialization

• Many iterative algorithms are descent methods, i.e. f(θ_{k+1}) < f(θ_k), k = 1, 2, …

• Important: f(θ_k) converges but not necessarily to f*

Iterative algorithms: How to stop?

• Rule 1: Reach maximum number K of iterations

• Rule 2: ‖∇f(θ_k)‖ ≤ ϵ

• ϵ is called the stopping tolerance and is a small positive number

• The goal is to have f(θ_k) not too far from f*

• Important: f(θ_k) converges but not necessarily to f*


Non-heuristic vs. heuristic algorithms

Non-heuristic

• We know that f(θ_k) → f* for any starting point θ_1

Heuristic

• There is no guarantee that f(θ_k) → f*

• Examples:
• Hill climbing

• Simulated Annealing

• Genetic Algorithms

• Tabu search

• … (and many more)


Convexity

• A function is convex if for every pair of points x, y ∈ ℝᵈ, the line segment connecting (x, f(x)) to (y, f(y)) does not go below f(⋅).

• More formally, a function f: ℝᵈ → ℝ is convex if:

• For any θ, θ′, and α with 0 ≤ α ≤ 1

f(αθ + (1 − α)θ′) ≤ α f(θ) + (1 − α) f(θ′)

• For the case where d = 1:

• Equivalent to f″(θ) ≥ 0 ∀ θ

[Figure: a convex function with the chord from (x, f(x)) to (y, f(y)) lying above it]
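As a small illustration of the definition, here is a hedged sketch that probes the convexity inequality at random pairs of points and random α; it can only refute convexity, never prove it, and the test functions are invented:

```python
import numpy as np

def looks_convex(f, dim=1, n_trials=10_000, seed=0):
    """Probabilistic check of f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.normal(size=(2, dim))
        a = rng.uniform()
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-12:
            return False          # found a violated chord: not convex
    return True                   # no violation found (not a proof!)

print(looks_convex(lambda t: float(np.sum(t ** 2))))        # quadratic: True
print(looks_convex(lambda t: float(np.sin(np.sum(t)))))     # sine: False
```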


Convex vs Concave

[Figure: a convex function (left) and a concave function (right), with points x and y marked]

Examples: Convex or not?

Fig 1 from: Q. Nguyen & M. Hein. The Loss Surface of Deep and Wide Neural Networks. ICML 2017


Some convex empirical risk functions


Convex vs. Non-convex functions


Optimizing convex functions is easy. Optimizing non-convex functions can be very hard.


Convex optimization

• If f is convex, the optimization problem is called a convex optimization problem

• ∇f(θ) = 0 only for θ optimal.

• All stationary points are optimal.

• This implies that convex optimization is non-heuristic

• In principle one can find the exact solution


Constrained optimization

• Optimizing the objective function with respect to some variables in the presence of constraints on those variables

minimize f(θ)

subject to g_i(θ) = c_i, i = 1, …, m

and h_j(θ) ≥ d_j, j = 1, …, n

Examples of algorithms: substitution or Lagrange multipliers (equality constraints); linear and quadratic programming (inequalities)
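As a concrete sketch (the objective and constraints below are invented for illustration), SciPy's general-purpose SLSQP method handles both equality and inequality constraints of this kind:

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: minimize f(theta) = theta_1^2 + theta_2^2
# subject to  theta_1 + theta_2 = 1   (equality)
# and         theta_1 >= 0.2          (inequality)
objective = lambda t: t[0] ** 2 + t[1] ** 2
constraints = [
    {"type": "eq",   "fun": lambda t: t[0] + t[1] - 1.0},   # g(theta) = c
    {"type": "ineq", "fun": lambda t: t[0] - 0.2},          # h(theta) >= d
]
result = minimize(objective, x0=np.array([0.0, 0.0]), method="SLSQP",
                  constraints=constraints)
print(result.x)   # approximately [0.5, 0.5]
```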


Smoothness

• So far we assumed that the gradient of f is defined everywhere

Optimizing smooth functions is “easier”. Linear programming techniques are an example of methods that deal with piecewise-linear functions

Linear programming

• Technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.

minimize f(θ) = cᵀθ

subject to Aθ ≤ b

and θ ≥ 0

Source: Concise Machine Learning - Jonathan Richard Shewchuk

Examples of LP algorithms: Simplex, interior point methods
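A minimal sketch with SciPy's linprog, which solves exactly this form (the cost vector c and the constraint data A, b are invented; linprog's default variable bounds already enforce θ ≥ 0):

```python
import numpy as np
from scipy.optimize import linprog

# minimize c^T theta  subject to  A theta <= b,  theta >= 0
c = np.array([-1.0, -2.0])            # maximize x1 + 2*x2 <=> minimize -(x1 + 2*x2)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0])

res = linprog(c, A_ub=A, b_ub=b)      # default bounds (0, None) give theta >= 0
print(res.x, res.fun)                 # optimum at [1, 3], objective -7
```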


Quadratic programming

• Technique for the optimization of a quadratic objective function, subject to linear constraints.

minimize f(θ) = θᵀQθ + cᵀθ

subject to Aθ ≤ b

and θ ≥ 0

where Q is a symmetric, positive definite matrix

Examples of QP algorithms: sequential minimal optimization (SMO), coordinate descent

We will revisit this in three lectures
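For illustration, here is a hedged sketch that solves a tiny QP of this form with SciPy's general-purpose SLSQP solver (the matrices Q, A and vectors c, b are invented toy data; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Toy QP: minimize theta^T Q theta + c^T theta  s.t.  A theta <= b, theta >= 0
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

objective = lambda t: t @ Q @ t + c @ t
constraints = [{"type": "ineq", "fun": lambda t: b - A @ t}]   # A theta <= b
bounds = [(0, None), (0, None)]                                 # theta >= 0

res = minimize(objective, x0=np.zeros(2), method="SLSQP",
               constraints=constraints, bounds=bounds)
print(res.x)
```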

[Figure credit: Peter Richtarik – Data Science Summer School DS3, École Polytechnique, 2017]


Gradient-based methods


Intuition

• Assumptions:
• Unconstrained optimization

• f is continuous and convex

• Warm up: consider f: ℝ → ℝ. What would you do to minimize it?

[Figure: plot of a 1-D function f(x)]


Intuition

[Figure: plot of f(x) with tangent lines at two points]

x* = random(x)
while f′(x*) ≠ 0:
    if f′(x*) > 0: move x* in the opposite direction (decrease x*)
    if f′(x*) < 0: move x* in the same direction (increase x*)

df/dθ evaluated at θ*: the slope, i.e. how much f changes for a small change in θ

Formalization: Gradient Descent

Initialize θ_1 ∈ ℝᵈ

While ‖∇f(θ)‖₂ > ϵ do

θ := θ − α · ∇f(θ)

Update rule

α > 0 is called the step size or learning rate

At each step the weight vector θ is moved in the direction of the greatest rate of decrease of the error function. This is also called steepest descent.
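A minimal sketch of this update rule with a fixed step size (the stopping tolerance, the quadratic test function, and the names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Plain gradient descent: theta := theta - alpha * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) <= eps:      # stop when the gradient is small
            break
        theta = theta - alpha * g
    return theta

# Example: f(theta) = ||theta||^2 has gradient 2*theta and minimizer theta* = 0
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))
```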


Back to the warm-up example

θ := θ − α · ∇f(θ)

[Figure: gradient descent iterations on f(x) with α = 1]

d-dimensional functions

θ := θ − α · ∇f(θ)

Zoom-in (per dimension):

θ_j := θ_j − α · ∂f/∂θ_j (θ)

The update rule can be thought of as d updates of the “zoomed-in” form being done in parallel (one per dimension)
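A small sketch illustrating that the vectorized update and the d per-coordinate updates coincide (the example gradient and step size are invented):

```python
import numpy as np

def grad_f(theta):                       # example: f(theta) = sum(theta**2)
    return 2 * theta

theta = np.array([3.0, -1.0, 2.0])
alpha = 0.1

vectorized = theta - alpha * grad_f(theta)

per_coord = theta.copy()
g = grad_f(theta)                        # all partials evaluated at the old theta
for j in range(theta.size):
    per_coord[j] = theta[j] - alpha * g[j]

print(np.allclose(vectorized, per_coord))   # True: the two views coincide
```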


Step size

• If α is too large we can get f(θ_{k+1}) > f(θ_k).

• If α is too small, progress is slow.

• Simple solution: a varying learning rate α_k.

• Example

• If f(θ_{k+1}) > f(θ_k), set θ_{k+1} = θ_k and α_{k+1} = α_k / 2

• Else α_{k+1} = 1.3 α_k

Convergence: Do we find θ*?

• Under some assumptions, the gradient method finds a stationary point:

∇f(θ_k) → 0 as k → ∞

• Convex problems
• Non-heuristic

• No matter the initialization θ_1, f(θ_k) → f* as k → ∞

• Non-convex problems
• Heuristic

• It can happen that f(θ_k) ↛ f*


Gradient method: Algorithm

initialize θ_1 ∈ ℝᵈ, α_1 > 0 and K_max
θ* = θ_1
for k = 1, 2, …, K_max:
    compute ∇f(θ_k)
    compute the tentative update θ_tent = θ_k − α_k ∇f(θ_k)
    if f(θ_tent) ≤ f(θ_k):
        θ_{k+1} = θ_tent
        θ* = θ_tent
        α_{k+1} = 1.3 α_k
    else:
        θ_{k+1} = θ_k
        α_{k+1} = 0.5 α_k
    stop if ‖∇f(θ_k)‖ < ϵ
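A sketch of this pseudocode in Python; the function names, the elliptic quadratic test problem, and the default values are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_method(f, grad_f, theta1, alpha1=1.0, K_max=1000, eps=1e-6):
    """Gradient descent with the adaptive step-size rule from the pseudocode above."""
    theta, alpha = np.asarray(theta1, dtype=float), alpha1
    theta_best = theta.copy()
    for _ in range(K_max):
        g = grad_f(theta)
        if np.linalg.norm(g) < eps:           # stopping tolerance reached
            break
        theta_tent = theta - alpha * g        # tentative update
        if f(theta_tent) <= f(theta):         # accept: objective did not increase
            theta = theta_tent
            theta_best = theta_tent
            alpha *= 1.3                      # be more aggressive
        else:                                  # reject: keep theta, shrink the step
            alpha *= 0.5
    return theta_best

# Example on an elliptic quadratic f(theta) = theta_1^2 + 10 * theta_2^2
f = lambda t: t[0] ** 2 + 10 * t[1] ** 2
grad_f = lambda t: np.array([2 * t[0], 20 * t[1]])
print(gradient_method(f, grad_f, theta1=[5.0, 5.0]))
```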

Examples: Changing the learning rate

[Figure: gradient descent runs with K = 100, α = 0.1 (left) and K = 100, α = 0.01 (right)]

Credits: Simone Rossi


Examples: Faster learning rate

[Figure: gradient descent run with K = 12, α = 1]

Credits: Simone Rossi

Examples: Role of initialization

[Figure: two initializations, with K = 100, α = 0.1 (left) and K = 10, α = 1 (right)]

Credits: Simone Rossi


Gradient descent (vanilla version)

Pros

• Scales well to large datasets

• No need to multiply matrices

• Solves many convex problems

• Unreasonably good heuristic for the approximate solution of non-convex problems

Cons

• Depends on good initialization

• Need to re-run several times to find a sufficiently good minimum

• Cannot cope with non-differentiable functions

• Slow on functions of elliptic form (ill-conditioned problems)

Other gradient-based methods

• Conjugate gradient

• Stochastic gradient descent (*)

• Mini-batch gradient descent (*)

• Quasi-Newton methods

• …

• There are also formulations to deal with non-smooth functions



Recap

• We formalized the problem of optimization

• We reviewed the different types of algorithms used to solve the optimization problem

• We introduced gradient-based optimization methods

• We introduced the gradient descent algorithm and investigated its behavior in different settings


Why optimization?


Interested in going deeper? Optimization Theory with applications [optim]


Further reading

Source and chapters:

• Pattern Recognition and Machine Learning (C. Bishop): Sec 5.2.4
• Convex Optimization – S. Boyd: all chapters (great book)
• Northwestern University open textbook on process optimization: all chapters
• Peter Richtarik's slides – DS3 2017: all
• SciPy tutorials (contain code): Mathematical optimization