MALIS: Optimization

Maria A. Zuluaga

Data Science Department

Recap: estimating θ for logistic regression

• Iterative reweighted least squares (IRLS): an iterative algorithm derived from the Newton-Raphson method

• IRLS is an optimization algorithm that iteratively solves a weighted least squares problem.

In this lecture we will have a formal introduction to optimization and review some optimization algorithms

Slides inspired by the optimization lecture from Sanjay Lall and Stephen Boyd - Stanford University


Optimization problem

• Given θ ∈ ℝᵈ and a function f: ℝᵈ → ℝ

• Goal: Choose θ to minimize f

• θ* is called optimal if ∀ θ, f(θ*) ≤ f(θ)

• f* = f(θ*) is the optimal value

minimize f(θ)

θ: decision variable; f: objective/cost function or energy
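To make the setup concrete, here is a minimal sketch that hands an objective f to a generic solver; the quadratic objective, the starting point, and the use of scipy.optimize.minimize are illustrative choices, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Objective f: R^2 -> R (a smooth convex example)
f = lambda theta: (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

result = minimize(f, x0=np.zeros(2))   # choose theta to minimize f
print(result.x)      # theta* ~ [1, -2]
print(result.fun)    # f* = f(theta*) ~ 0
```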

Optimization problem: an equivalence

• Equivalently, the optimization problem can be framed as:

• Goal: Choose θ to maximize −f

• θ* is optimal if ∀ θ, −f(θ*) ≥ −f(θ)

• −f(θ*) is the optimal value

• We will use the minimization form, but remember this is an equivalent formulation

maximize −f(θ)

θ: decision variable; −f: utility/reward function


Conditions for optimality

• If �∗ is optimal, it is a stationary point

• ∇f(θ*) = 0 : optimality condition

• Not all stationary points are optimal

• Important: We are assuming that � is differentiable


Two examples: Is there any optimal point?

[Figure: two example functions f]


Solving an optimization problem

• So far, we have encountered some optimization problems that we solved analytically

• First optimization problem of this course:


Least squares for linear regression

argmin_θ ½ (y − Xθ)ᵀ(y − Xθ)

θ* = θ̂ = (XᵀX)⁻¹Xᵀy

Closed-form solution
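As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the toy data and variable names are invented for illustration):

```python
import numpy as np

# Toy data: 100 samples, intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed-form solution theta* = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)
```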

Solving an optimization problem

• In other cases it is not possible to obtain a closed-form solution, and an iterative algorithm is needed

• Iterative algorithms compute a sequence of values θ_1, θ_2, …, θ_k

hoping that f(θ_k) → f* as k → ∞

• First iterative algorithm of this course:

θ_new ← argmin_θ (z − Xθ)ᵀ W (z − Xθ)

Iterative reweighted least squares
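For concreteness, here is a hedged NumPy sketch of IRLS for logistic regression; the weight matrix W = diag(p(1 − p)) and the working response z follow the standard Newton-Raphson derivation, and the function name, tolerances, and iteration cap are illustrative choices, not from the slides:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))              # predicted probabilities
        W = p * (1.0 - p)                                  # diagonal of the weight matrix
        z = X @ theta + (y - p) / np.maximum(W, 1e-12)     # working response
        # Weighted least squares step: theta <- argmin (z - X theta)^T W (z - X theta)
        theta_new = np.linalg.solve(X.T * W @ X, X.T @ (W * z))
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta
```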


Iterative algorithms: Some formalization

• Compute a sequence of values θ_1, θ_2, …, θ_k

• θ_k: k-th iterate

• θ_1: starting point or initialization

• Many iterative algorithms are descent methods, i.e. f(θ_{k+1}) < f(θ_k), k = 1, 2, …

• Important: f(θ_k) converges but not necessarily to f*

Iterative algorithms: How to stop?

• Rule 1: Reach maximum number K of iterations

• Rule 2: ‖∇f(θ_k)‖ ≤ ϵ

• ϵ is called the stopping tolerance and is a small positive number

• The goal is to have f(θ_k) not too far from f*

• Important: f(θ_k) converges but not necessarily to f*


Non-heuristic vs. heuristic algorithms

Non-heuristic

• We know that f(θ_k) → f* for any starting point θ_1

Heuristic

• There is no guarantee that f(θ_k) → f*

• Examples:
• Hill climbing

• Simulated Annealing

• Genetic Algorithms

• Tabu search

• … (and many more)


Convexity

• A function is convex if for every pair of points x, y ∈ ℝᵈ, the line segment connecting (x, f(x)) to (y, f(y)) does not go below f(⋅).

• More formally, a function f: ℝᵈ → ℝ is convex if:

• For any θ, θ′, and α with 0 ≤ α ≤ 1

f(αθ + (1 − α)θ′) ≤ α f(θ) + (1 − α) f(θ′)

• For the case where d = 1:

• Equivalent to f″(θ) ≥ 0 ∀ θ

[Figure: a convex function with the chord from (x, f(x)) to (y, f(y)) lying above it]
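As a small illustration of the definition, here is a hedged sketch that probes the convexity inequality at random pairs of points and random α; it can only refute convexity, never prove it, and the test functions are invented:

```python
import numpy as np

def looks_convex(f, dim=1, n_trials=10_000, seed=0):
    """Probabilistic check of f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.normal(size=(2, dim))
        a = rng.uniform()
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-12:
            return False          # found a violated chord: not convex
    return True                   # no violation found (not a proof!)

print(looks_convex(lambda t: float(np.sum(t ** 2))))        # quadratic: True
print(looks_convex(lambda t: float(np.sin(np.sum(t)))))     # sine: False
```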


Convex vs Concave

[Figure: a convex function (left) and a concave function (right), with points x and y marked]

Examples: Convex or not?

Fig 1 from: Q. Nguyen & M. Hein. The Loss Surface of Deep and Wide Neural Networks. ICML 2017


Some convex empirical risk functions


Convex vs. Non-convex functions


Optimizing convex functions is easy. Optimizing non-convex functions can be very hard.


Convex optimization

• If f is convex, the optimization problem is called a convex optimization problem

• ∇f(θ) = 0 only for θ optimal.

• All stationary points are optimal.

• This implies that convex optimization is non-heuristic

• In principle one can find the exact solution


Constrained optimization

• Optimizing the objective function with respect to some variables in the presence of constraints on those variables

minimize f(θ)

subject to g_i(θ) = c_i, i = 1, …, m

and h_j(θ) ≥ d_j, j = 1, …, n

Examples of algorithms: substitution or Lagrange multipliers (equality constraints); linear and quadratic programming (inequalities)
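As a concrete sketch (the objective and constraints below are invented for illustration), SciPy's general-purpose SLSQP method handles both equality and inequality constraints of this kind:

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: minimize f(theta) = theta_1^2 + theta_2^2
# subject to  theta_1 + theta_2 = 1   (equality)
# and         theta_1 >= 0.2          (inequality)
objective = lambda t: t[0] ** 2 + t[1] ** 2
constraints = [
    {"type": "eq",   "fun": lambda t: t[0] + t[1] - 1.0},   # g(theta) = c
    {"type": "ineq", "fun": lambda t: t[0] - 0.2},          # h(theta) >= d
]
result = minimize(objective, x0=np.array([0.0, 0.0]), method="SLSQP",
                  constraints=constraints)
print(result.x)   # approximately [0.5, 0.5]
```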


Smoothness

• So far we assumed that the gradient of f is defined everywhere

Optimizing smooth functions is “easier”. Linear programming techniques are an example of methods that deal with piecewise-linear functions

Linear programming

• Technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.

minimize f(θ) = cᵀθ

subject to Aθ ≤ b

and θ ≥ 0

Source: Concise Machine Learning - Jonathan Richard Shewchuk

Examples of LP algorithms: Simplex, interior point methods
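A minimal sketch with SciPy's linprog, which solves exactly this form (the cost vector c and the constraint data A, b are invented; linprog's default variable bounds already enforce θ ≥ 0):

```python
import numpy as np
from scipy.optimize import linprog

# minimize c^T theta  subject to  A theta <= b,  theta >= 0
c = np.array([-1.0, -2.0])            # maximize x1 + 2*x2 <=> minimize -(x1 + 2*x2)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0])

res = linprog(c, A_ub=A, b_ub=b)      # default bounds (0, None) give theta >= 0
print(res.x, res.fun)                 # optimum at [1, 3], objective -7
```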


Quadratic programming

• Technique for the optimization of a quadratic objective function, subject to linear constraints.

minimize f(θ) = θᵀQθ + cᵀθ

subject to Aθ ≤ b

and θ ≥ 0

where Q is a symmetric, positive definite matrix

Examples of QP algorithms: sequential minimal optimization (SMO), coordinate descent

We will revisit this in three lectures
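For illustration, here is a hedged sketch that solves a tiny QP of this form with SciPy's general-purpose SLSQP solver (the matrices Q, A and vectors c, b are invented toy data; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Toy QP: minimize theta^T Q theta + c^T theta  s.t.  A theta <= b, theta >= 0
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

objective = lambda t: t @ Q @ t + c @ t
constraints = [{"type": "ineq", "fun": lambda t: b - A @ t}]   # A theta <= b
bounds = [(0, None), (0, None)]                                 # theta >= 0

res = minimize(objective, x0=np.zeros(2), method="SLSQP",
               constraints=constraints, bounds=bounds)
print(res.x)
```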

[Figure credit: Peter Richtarik – Data Science Summer School DS3, École Polytechnique, 2017]


Gradient-based methods


Intuition

• Assumptions:
• Unconstrained optimization

• f is continuous and convex

• Warm up: consider f: ℝ → ℝ. What would you do to minimize it?

[Figure: plot of a 1-D function f(x)]


Intuition

[Figure: plot of f(x) with tangent lines at two points]

x* = random(x)
while f′(x*) ≠ 0:
    if f′(x*) > 0: move x* in the opposite direction (decrease x*)
    if f′(x*) < 0: move x* in the same direction (increase x*)

df/dθ evaluated at θ*: the slope, i.e. how much f changes for a small change in θ

Formalization: Gradient Descent

Initialize θ_1 ∈ ℝᵈ

While ‖∇f(θ)‖₂ > ϵ do

θ := θ − α · ∇f(θ)

Update rule

α > 0 is called the step size or learning rate

At each step the weight vector θ is moved in the direction of the greatest rate of decrease of the error function. This is also called steepest descent.
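A minimal sketch of this update rule with a fixed step size (the stopping tolerance, the quadratic test function, and the names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Plain gradient descent: theta := theta - alpha * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) <= eps:      # stop when the gradient is small
            break
        theta = theta - alpha * g
    return theta

# Example: f(theta) = ||theta||^2 has gradient 2*theta and minimizer theta* = 0
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))
```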


Back to the warm-up example

θ := θ − α · ∇f(θ)

[Figure: gradient descent iterations on f(x) with α = 1]

d-dimensional functions

θ := θ − α · ∇f(θ)

Zoom-in (per dimension):

θ_j := θ_j − α · ∂f/∂θ_j (θ)

The update rule can be thought of as d updates of the “zoomed-in” form being done in parallel (one per dimension)
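A small sketch illustrating that the vectorized update and the d per-coordinate updates coincide (the example gradient and step size are invented):

```python
import numpy as np

def grad_f(theta):                       # example: f(theta) = sum(theta**2)
    return 2 * theta

theta = np.array([3.0, -1.0, 2.0])
alpha = 0.1

vectorized = theta - alpha * grad_f(theta)

per_coord = theta.copy()
g = grad_f(theta)                        # all partials evaluated at the old theta
for j in range(theta.size):
    per_coord[j] = theta[j] - alpha * g[j]

print(np.allclose(vectorized, per_coord))   # True: the two views coincide
```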


Step size

• If α is too large we can get f(θ_{k+1}) > f(θ_k).

• If α is too small, progress is slow.

• Simple solution: a varying learning rate α_k.

• Example

• If f(θ_{k+1}) > f(θ_k), set θ_{k+1} = θ_k and α_{k+1} = α_k / 2

• Else α_{k+1} = 1.3 α_k

Convergence: Do we find θ*?

• Under some assumptions, the gradient method finds a stationary point:

∇f(θ_k) → 0 as k → ∞

• Convex problems
• Non-heuristic

• No matter the initialization θ_1, f(θ_k) → f* as k → ∞

• Non-convex problems
• Heuristic

• It can happen that f(θ_k) ↛ f*


Gradient method: Algorithm

initialize θ_1 ∈ ℝᵈ, α_1 > 0 and K_max
θ* = θ_1
for k = 1, 2, …, K_max:
    compute ∇f(θ_k)
    compute the tentative update θ_tent = θ_k − α_k ∇f(θ_k)
    if f(θ_tent) ≤ f(θ_k):
        θ_{k+1} = θ_tent
        θ* = θ_tent
        α_{k+1} = 1.3 α_k
    else:
        θ_{k+1} = θ_k
        α_{k+1} = 0.5 α_k
    stop if ‖∇f(θ_k)‖ < ϵ
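A sketch of this pseudocode in Python; the function names, the elliptic quadratic test problem, and the default values are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_method(f, grad_f, theta1, alpha1=1.0, K_max=1000, eps=1e-6):
    """Gradient descent with the adaptive step-size rule from the pseudocode above."""
    theta, alpha = np.asarray(theta1, dtype=float), alpha1
    theta_best = theta.copy()
    for _ in range(K_max):
        g = grad_f(theta)
        if np.linalg.norm(g) < eps:           # stopping tolerance reached
            break
        theta_tent = theta - alpha * g        # tentative update
        if f(theta_tent) <= f(theta):         # accept: objective did not increase
            theta = theta_tent
            theta_best = theta_tent
            alpha *= 1.3                      # be more aggressive
        else:                                  # reject: keep theta, shrink the step
            alpha *= 0.5
    return theta_best

# Example on an elliptic quadratic f(theta) = theta_1^2 + 10 * theta_2^2
f = lambda t: t[0] ** 2 + 10 * t[1] ** 2
grad_f = lambda t: np.array([2 * t[0], 20 * t[1]])
print(gradient_method(f, grad_f, theta1=[5.0, 5.0]))
```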

Examples: Changing the learning rate

[Figure: gradient descent runs with K = 100, α = 0.1 (left) and K = 100, α = 0.01 (right)]

Credits: Simone Rossi


Examples: Faster learning rate

[Figure: gradient descent run with K = 12, α = 1]

Credits: Simone Rossi

Examples: Role of initialization

[Figure: two initializations, with K = 100, α = 0.1 (left) and K = 10, α = 1 (right)]

Credits: Simone Rossi


Gradient descent (vanilla version)

Pros

• Scales well to large datasets

• No need to multiply matrices

• Solves many convex problems

• Unreasonably good heuristic for the approximate solution of non-convex problems

Cons

• Depends on good initialization

• Need to re-run several times to find a sufficiently good minimum

• Cannot cope with non-differentiable functions

• Slow on functions of elliptic form (ill-conditioned problems)

Other gradient-based methods

• Conjugate gradient

• Stochastic gradient descent (*)

• Mini-batch gradient descent (*)

• Quasi-Newton methods

• …

• There are also formulations to deal with non-smooth functions



Recap

• We formalized the problem of optimization

• We reviewed the different types of algorithms used to solve the optimization problem

• We introduced gradient-based optimization methods

• We introduced the gradient descent algorithm and investigated its behavior in different settings


Why optimization?


Interested in going deeper? Optimization Theory with applications [optim]


Further reading

Source and chapters:

• Pattern Recognition and Machine Learning (C. Bishop): Sec 5.2.4
• Convex Optimization – S. Boyd: all chapters (great book)
• Northwestern University open textbook on process optimization: all chapters
• Peter Richtarik's slides – DS3 2017: all
• SciPy tutorials (contain code): Mathematical optimization