
Page 1: Chapter 2-OPTIMIZATION

Chapter 2-OPTIMIZATION

G.Anuradha

Page 2: Chapter 2-OPTIMIZATION

Contents

• Derivative-based Optimization
  – Descent Methods
  – The Method of Steepest Descent
  – Classical Newton’s Method
  – Step Size Determination

• Derivative-free Optimization
  – Genetic Algorithms
  – Simulated Annealing
  – Random Search
  – Downhill Simplex Search

Page 3: Chapter 2-OPTIMIZATION

What is Optimization?

• Choosing the best element from some set of available alternatives

• Solving problems in which one seeks to minimize or maximize a real function

Page 4: Chapter 2-OPTIMIZATION

Notation of Optimization

Optimize y = f(x1, x2, …, xn)                                   …(1)
subject to gj(x1, x2, …, xn) ≤ / ≥ / = bj,   j = 1, 2, …, n      …(2)

Eqn (1) is the objective function.
Eqn (2) is the set of constraints imposed on the solution.
x1, x2, …, xn are the decision variables.
Note: the problem is either to maximize or to minimize the value of the objective function.
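
As a small concrete instance of this notation (the numbers are chosen only for illustration):

    minimize   y = f(x1, x2) = x1² + x2²
    subject to g1(x1, x2) = x1 + x2 ≥ 1

Here f is the objective function (Eqn 1), g1 ≥ 1 is the single constraint (Eqn 2), and x1, x2 are the decision variables. The unconstrained minimum (0, 0) violates the constraint, so the constrained minimum lies on the boundary x1 + x2 = 1, at x1 = x2 = 1/2.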

Page 5: Chapter 2-OPTIMIZATION
Page 6: Chapter 2-OPTIMIZATION
Page 7: Chapter 2-OPTIMIZATION

Complicating factors in optimization

1. Existence of multiple decision variables
2. Complex nature of the relationships between the decision variables and the associated income
3. Existence of one or more complex constraints on the decision variables

Page 8: Chapter 2-OPTIMIZATION

Types of optimization

• Constrained:- the objective function is maximized or minimized subject to constraints on the decision variables

• Unconstrained:- no constraints are imposed on the decision variables, and differential calculus can be used to analyse them

Page 9: Chapter 2-OPTIMIZATION

Least Square Methods for System Identification

• System Identification:- Determining a mathematical model for an unknown system by observing the input-output data pairs

• System identification is required
  – To predict a system’s behaviour
  – To explain the interactions and relationships between inputs and outputs
  – To design a controller

• System identification involves
  – Structure identification
  – Parameter identification

Page 10: Chapter 2-OPTIMIZATION

Structure identification

• Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is conducted

• y = f(u; θ), where y is the model’s output, u is the input vector, and θ is the parameter vector

Page 11: Chapter 2-OPTIMIZATION

Parameter Identification

• The structure of the model is known, and optimization techniques are applied to determine the parameter vector θ = θ̂ that best fits the observed data

Page 12: Chapter 2-OPTIMIZATION

Block diagram of parameter identification

Page 13: Chapter 2-OPTIMIZATION

Parameter identification

• An input ui is applied to both the system and the model

• The difference between the target system’s output yi and the model’s output ŷi is used to update the parameter vector θ so as to minimize that difference

• System identification is not a one-pass process; structure and parameter identification need to be repeated until a satisfactory model is found

Page 14: Chapter 2-OPTIMIZATION

Classification of Optimization algorithms

• Derivative-based algorithms
• Derivative-free algorithms

Page 15: Chapter 2-OPTIMIZATION

Characteristics of derivative free algorithm

1. Derivative freeness:- rely on repeated evaluations of the objective function; no derivative information is needed
2. Intuitive guidelines:- concepts are based on nature’s wisdom, such as evolution and thermodynamics
3. Slower than derivative-based methods
4. Flexibility
5. Randomness:- they are global optimizers
6. Analytic opacity:- knowledge about them is based on empirical studies
7. Iterative nature

Page 16: Chapter 2-OPTIMIZATION

Characteristics of derivative free algorithm

• Stopping condition of the iteration:- let k denote the iteration count and fk denote the best objective function value obtained at count k. The stopping condition may depend on
  – Computation time
  – Optimization goal
  – Minimal improvement
  – Minimal relative improvement
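
A minimal Octave sketch of such a loop, combining the four stopping conditions above; the objective, the crude random-search proposal, and all thresholds are illustrative assumptions, not from the slides:

f = @(x) sum(x.^2);                    % illustrative objective to minimize
x_best = randn(2, 1);  f_best = f(x_best);
goal = 1e-6;  min_improve = 1e-9;  max_time = 10;   % illustrative thresholds
tic;
for k = 1:100000
  x_new = x_best + 0.1*randn(2, 1);    % derivative-free proposal (random search)
  f_new = f(x_new);
  improvement = f_best - f_new;        % > 0 means the new point is better
  if f_new < f_best
    x_best = x_new;  f_best = f_new;
  end
  if toc > max_time                                               % 1. computation time
    break;
  elseif f_best <= goal                                           % 2. optimization goal
    break;
  elseif improvement > 0 && improvement < min_improve             % 3. minimal improvement
    break;
  elseif improvement > 0 && improvement/abs(f_best) < min_improve % 4. minimal relative improvement
    break;
  end
end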

Page 17: Chapter 2-OPTIMIZATION

Basics of Matrix Manipulation and Calculus

Page 18: Chapter 2-OPTIMIZATION

Basics of Matrix Manipulation and Calculus

Page 19: Chapter 2-OPTIMIZATION

Gradient of a Scalar Function

Page 20: Chapter 2-OPTIMIZATION

Jacobian of a Vector Function

Page 21: Chapter 2-OPTIMIZATION

Least Square Estimator

• The method of least squares is a standard approach to the approximate solution of overdetermined systems (systems with more equations than unknowns)

• Least squares:- the overall solution minimizes the sum of the squares of the errors made in solving every single equation

• Application:- data fitting

Page 22: Chapter 2-OPTIMIZATION

Types of Least Squares

• Linear:- the model is a linear combination of the parameters. The model may represent a straight line, a parabola, or any other linear combination of basis functions.

• Non-Linear:- the parameters appear inside non-linear functions, such as β² or e^(βx).

If the derivatives of the model with respect to the parameters are either constant or depend only on the values of the independent variable, the model is linear; otherwise it is non-linear.
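
For example (an illustration, not from the slides):

    y = β0 + β1·x + β2·x²    – linear in the parameters: ∂y/∂βj is 1, x or x², which depends only on x
    y = β1·e^(β2·x)          – non-linear: ∂y/∂β2 = β1·x·e^(β2·x) depends on the parameters themselves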

Page 23: Chapter 2-OPTIMIZATION

Differences between Linear and Non-Linear Least Squares

• Initial values: linear least-squares algorithms do not require initial values; non-linear algorithms require initial values.
• Convergence: the linear least-squares objective is globally convex, so non-convergence is not an issue; non-convergence is a common issue for non-linear fits.
• Solution method: linear least squares is normally solved using direct methods; non-linear least squares is usually an iterative process.
• Solution: the linear solution is unique; the non-linear sum of squares may have multiple minima.
• Bias: linear least squares yields unbiased estimates provided the errors have zero mean and are uncorrelated with the predictor values; non-linear estimates may be biased.

Page 24: Chapter 2-OPTIMIZATION

Linear regression with one variable

Model representation

Machine Learning

Page 25: Chapter 2-OPTIMIZATION

Housing Prices (Portland, OR)

[Scatter plot: Price (in 1000s of dollars) vs. Size (feet²).]

Supervised Learning

Given the “right answer” for each example in the data.

Regression Problem

Predict real-valued output

Page 26: Chapter 2-OPTIMIZATION

Notation:

m = number of training examples
x = “input” variable / features
y = “output” variable / “target” variable

Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Page 27: Chapter 2-OPTIMIZATION

Training Set → Learning Algorithm → h

h maps the size of a house to an estimated price.

How do we represent h?   hθ(x) = θ0 + θ1x

Linear regression with one variable (univariate linear regression).

Page 28: Chapter 2-OPTIMIZATION
Page 29: Chapter 2-OPTIMIZATION

Cost function

Machine Learning

Linear regression with one variable

Page 30: Chapter 2-OPTIMIZATION

How to choose the θi's?

Training set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: parameters

Page 31: Chapter 2-OPTIMIZATION
Page 32: Chapter 2-OPTIMIZATION

[Scatter plot of the training examples: y vs. x, with a candidate straight-line fit hθ(x).]

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y)

Page 33: Chapter 2-OPTIMIZATION
Page 34: Chapter 2-OPTIMIZATION

Cost functionintuition I

Machine Learning

Linear regression with one variable

Page 35: Chapter 2-OPTIMIZATION

Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost function: J(θ0, θ1) = (1/2m) Σ ( hθ(x(i)) - y(i) )²   (sum over the m training examples (x(i), y(i)))

Goal: minimize J(θ0, θ1) over θ0, θ1

Simplified (set θ0 = 0): hθ(x) = θ1x; minimize J(θ1) over θ1
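
A minimal Octave sketch of this cost function for the simplified one-parameter case (the three-point data set is an illustrative assumption):

% J(theta1) for the simplified hypothesis h(x) = theta1*x
x = [1; 2; 3];  y = [1; 2; 3];       % tiny illustrative training set
m = length(y);
compute_cost = @(theta1) (1/(2*m)) * sum((theta1*x - y).^2);
compute_cost(1)      % = 0 (perfect fit)
compute_cost(0.5)    % = (1/6)*((0.5-1)^2 + (1-2)^2 + (1.5-3)^2) ≈ 0.583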

Page 36: Chapter 2-OPTIMIZATION

[Pages 36–38: plots of the simplified hypothesis hθ(x) = θ1x as a function of x (for fixed θ1), alongside the cost J(θ1) as a function of the parameter θ1, for several values of θ1.]

Page 39: Chapter 2-OPTIMIZATION
Page 40: Chapter 2-OPTIMIZATION

Cost functionintuition II

Machine Learning

Linear regression with one variable

Page 41: Chapter 2-OPTIMIZATION

Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost function: J(θ0, θ1) = (1/2m) Σ ( hθ(x(i)) - y(i) )²

Goal: minimize J(θ0, θ1) over θ0, θ1

Page 42: Chapter 2-OPTIMIZATION

[Plot: hθ(x) against Size in feet² (x), with Price ($) in 1000's on the vertical axis (for fixed θ0, θ1, a function of x), alongside a contour plot of J(θ0, θ1) as a function of the parameters.]

Page 43: Chapter 2-OPTIMIZATION
Page 44: Chapter 2-OPTIMIZATION

[Pages 44–47: pairs of plots showing hθ(x) as a function of x (for fixed θ0, θ1) next to the contour plot of J(θ0, θ1) as a function of the parameters, for several choices of the parameters.]

Page 48: Chapter 2-OPTIMIZATION
Page 49: Chapter 2-OPTIMIZATION

Gradient descent

Machine Learning

Linear regression with one variable

Page 50: Chapter 2-OPTIMIZATION

Have some function J(θ0, θ1).
Want: min over θ0, θ1 of J(θ0, θ1).

Outline:
• Start with some θ0, θ1 (say θ0 = 0, θ1 = 0)

• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum

Page 51: Chapter 2-OPTIMIZATION

[Surface plot of J(θ0, θ1) over the parameter space; from the starting point, gradient descent moves downhill toward a local minimum.]

Page 52: Chapter 2-OPTIMIZATION

[Surface plot of J(θ0, θ1) with a different starting point, which can end up at a different local minimum.]

Page 53: Chapter 2-OPTIMIZATION

Gradient descent algorithm:

repeat until convergence {
    θj := θj - α · ∂J(θ0, θ1)/∂θj      (simultaneously for j = 0 and j = 1)
}

Correct: simultaneous update, i.e. compute both new values from the old (θ0, θ1) before overwriting either.
Incorrect: updating θ0 first and then using the already-updated θ0 when computing the update for θ1.
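
A minimal Octave sketch of one correctly ordered (simultaneous) update; the tiny data set and all values are illustrative assumptions, and the partial derivatives used are the linear-regression ones worked out on a later slide:

% Tiny illustrative data and starting point
x = [1; 2; 3];  y = [2; 4; 6];  m = length(y);
theta0 = 0;  theta1 = 0;  alpha = 0.1;
% Compute both partial derivatives from the OLD parameter values...
grad0 = (1/m) * sum(theta0 + theta1*x - y);          % dJ/dtheta0
grad1 = (1/m) * sum((theta0 + theta1*x - y) .* x);   % dJ/dtheta1
temp0 = theta0 - alpha * grad0;
temp1 = theta1 - alpha * grad1;
% ...then overwrite both parameters together (simultaneous update)
theta0 = temp0;   theta1 = temp1;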

Page 54: Chapter 2-OPTIMIZATION
Page 55: Chapter 2-OPTIMIZATION

Gradient descentintuition

Machine Learning

Linear regression with one variable

Page 56: Chapter 2-OPTIMIZATION

Gradient descent algorithm (one parameter): θ1 := θ1 - α · dJ(θ1)/dθ1. The sign of the derivative decides whether θ1 moves left or right, always towards lower J.

Page 57: Chapter 2-OPTIMIZATION
Page 58: Chapter 2-OPTIMIZATION

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

Page 59: Chapter 2-OPTIMIZATION

At a local optimum the derivative is zero, so the update θ1 := θ1 - α · 0 leaves the current value of θ1 unchanged.

Page 60: Chapter 2-OPTIMIZATION

Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.

Page 61: Chapter 2-OPTIMIZATION
Page 62: Chapter 2-OPTIMIZATION

Gradient descent for linear regression

Machine Learning

Linear regression with one variable

Page 63: Chapter 2-OPTIMIZATION

Gradient descent algorithm:
    repeat { θj := θj - α · ∂J(θ0, θ1)/∂θj }   (for j = 0, 1; simultaneous update)

Linear regression model:
    hθ(x) = θ0 + θ1x
    J(θ0, θ1) = (1/2m) Σ ( hθ(x(i)) - y(i) )²

Page 64: Chapter 2-OPTIMIZATION
Page 65: Chapter 2-OPTIMIZATION

Gradient descent algorithm (derivatives worked out for linear regression):

repeat until convergence {
    θ0 := θ0 - α · (1/m) Σ ( hθ(x(i)) - y(i) )
    θ1 := θ1 - α · (1/m) Σ ( hθ(x(i)) - y(i) ) · x(i)
}

Update θ0 and θ1 simultaneously.
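
A minimal Octave sketch of these two updates run on the four-house training set from the earlier slides (the learning rate and iteration count are illustrative choices):

% Batch gradient descent for h(x) = theta0 + theta1*x
x = [2104; 1416; 1534; 852];          % size in feet^2
y = [460; 232; 315; 178];             % price in $1000s
m = length(y);
theta0 = 0;  theta1 = 0;
alpha = 1e-7;                         % small because the feature is unscaled
for k = 1:1000
  h = theta0 + theta1*x;              % current predictions
  temp0 = theta0 - alpha * (1/m) * sum(h - y);
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);
  theta0 = temp0;  theta1 = temp1;    % simultaneous update
end
J = (1/(2*m)) * sum((theta0 + theta1*x - y).^2)   % final cost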

Page 66: Chapter 2-OPTIMIZATION

[Surface and contour plots of J(θ0, θ1) for linear regression: a convex, bowl-shaped function with a single global minimum.]

Page 67: Chapter 2-OPTIMIZATION
Page 68: Chapter 2-OPTIMIZATION

[Pages 68–76: successive pairs of plots showing the current fit hθ(x) as a function of x (for fixed θ0, θ1) next to the contour plot of J(θ0, θ1) as a function of the parameters, as gradient descent steps toward the minimum and the fitted line improves.]

Page 77: Chapter 2-OPTIMIZATION

“Batch” Gradient Descent

“Batch”: Each step of gradient descent uses all the training examples.

Page 78: Chapter 2-OPTIMIZATION
Page 79: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Multiple features

Machine Learning

Page 80: Chapter 2-OPTIMIZATION

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …

Multiple features (variables).

Page 81: Chapter 2-OPTIMIZATION

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …

Multiple features (variables).

Notation:
n = number of features
x(i) = input (features) of the i-th training example
xj(i) = value of feature j in the i-th training example

Page 82: Chapter 2-OPTIMIZATION

Hypothesis: hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

Previously (one variable): hθ(x) = θ0 + θ1x

Page 83: Chapter 2-OPTIMIZATION

For convenience of notation, define x0 = 1. Then x = [x0, x1, …, xn]ᵀ, θ = [θ0, θ1, …, θn]ᵀ, and the hypothesis can be written compactly as hθ(x) = θᵀx.

Multivariate linear regression.
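
A minimal Octave sketch of this convention applied to the table above (column order follows the table):

% Build the design matrix with x0 = 1 as the first column
data = [2104 5 1 45 460;
        1416 3 2 40 232;
        1534 3 2 30 315;
         852 2 1 36 178];
X = [ones(size(data, 1), 1), data(:, 1:4)];   % m x (n+1); first column is x0 = 1
y = data(:, 5);                               % price in $1000s
theta = zeros(size(X, 2), 1);                 % one parameter per column, incl. theta0
h = X * theta;                                % h(x) = theta'*x for every example at once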

Page 84: Chapter 2-OPTIMIZATION
Page 85: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Gradient descent for multiple variables

Machine Learning

Page 86: Chapter 2-OPTIMIZATION

Hypothesis: hθ(x) = θᵀx = θ0x0 + θ1x1 + … + θnxn

Parameters: θ = [θ0, θ1, …, θn]

Cost function: J(θ) = (1/2m) Σ ( hθ(x(i)) - y(i) )²

Gradient descent:
Repeat {
    θj := θj - α · ∂J(θ)/∂θj
} (simultaneously update θj for every j = 0, …, n)

Page 87: Chapter 2-OPTIMIZATION

Gradient Descent

Previously (n = 1):
Repeat {
    θ0 := θ0 - α · (1/m) Σ ( hθ(x(i)) - y(i) )
    θ1 := θ1 - α · (1/m) Σ ( hθ(x(i)) - y(i) ) · x(i)
} (simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
Repeat {
    θj := θj - α · (1/m) Σ ( hθ(x(i)) - y(i) ) · xj(i)
} (simultaneously update θj for j = 0, …, n)
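
The same update can be written for all parameters at once; a minimal Octave sketch with made-up, already-scaled features (X, y, alpha and the iteration count are illustrative assumptions):

m = 4;  n = 2;
X = [ones(m, 1), randn(m, n)];      % illustrative scaled features with x0 = 1
y = randn(m, 1);                    % illustrative targets
theta = zeros(n + 1, 1);
alpha = 0.1;
for k = 1:400
  theta = theta - alpha * (1/m) * X' * (X*theta - y);  % simultaneous update of every theta_j
end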

Page 88: Chapter 2-OPTIMIZATION
Page 89: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Gradient descent in practice I: Feature Scaling

Machine Learning

Page 90: Chapter 2-OPTIMIZATION

Feature Scaling
Idea: make sure features are on a similar scale.

E.g. x1 = size (0-2000 feet²), x2 = number of bedrooms (1-5); dividing each by its range, x1 = size/2000 and x2 = (number of bedrooms)/5, puts both roughly in [0, 1].

[Contour plots of J(θ): elongated when the features are on very different scales, more nearly circular after scaling, so gradient descent converges faster.]

Page 91: Chapter 2-OPTIMIZATION

Feature Scaling

Get every feature into approximately the same range, e.g. roughly -1 ≤ xi ≤ 1.

Page 92: Chapter 2-OPTIMIZATION

Mean normalization

Replace xi with xi - μi to make the features have approximately zero mean (do not apply this to x0 = 1). The result is often also divided by the feature’s range or standard deviation, xi := (xi - μi)/si.

E.g. x1 := (size - average size) / (range of sizes).
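
A minimal Octave sketch of mean normalization, here dividing by the standard deviation (the slides leave the exact divisor open, so this is one common choice):

% Mean-normalize and scale every feature column of X_raw (one row per example)
X_raw = [2104 5; 1416 3; 1534 3; 852 2];   % size and #bedrooms from the earlier table
mu    = mean(X_raw);                       % per-feature means
sigma = std(X_raw);                        % per-feature standard deviations
X_norm = (X_raw - mu) ./ sigma;            % each feature now has ~zero mean, unit spread
X = [ones(size(X_norm, 1), 1), X_norm];    % add x0 = 1 AFTER normalizing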

Page 93: Chapter 2-OPTIMIZATION
Page 94: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Gradient descent in practice II: Learning rate

Machine Learning

Page 95: Chapter 2-OPTIMIZATION

Gradient descent

- “Debugging”: how to make sure gradient descent is working correctly.

- How to choose the learning rate α.

Page 96: Chapter 2-OPTIMIZATION

Making sure gradient descent is working correctly:

[Plot of J(θ) against the number of iterations; J(θ) should decrease after every iteration and level off as gradient descent converges.]

Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold ε (e.g. 0.001) in one iteration.
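
A minimal Octave sketch of this test inside the gradient-descent loop; X, y, theta and alpha are assumed to be set up as in the earlier sketches, and epsilon is an illustrative threshold:

max_iters = 1000;  epsilon = 1e-3;  m = size(X, 1);
J_history = zeros(max_iters, 1);
for k = 1:max_iters
  theta = theta - alpha * (1/m) * X' * (X*theta - y);
  J_history(k) = (1/(2*m)) * sum((X*theta - y).^2);
  if k > 1 && (J_history(k-1) - J_history(k)) < epsilon
    break;    % cost decreased by less than epsilon in one iteration: declare convergence
  end
end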

Page 97: Chapter 2-OPTIMIZATION

Making sure gradient descent is working correctly.

[Plots of J(θ) against the number of iterations: if J(θ) is increasing, or repeatedly going down and up, gradient descent is not working; use a smaller α.]

- For sufficiently small α, J(θ) should decrease on every iteration.

- But if α is too small, gradient descent can be slow to converge.

Page 98: Chapter 2-OPTIMIZATION

Summary:

- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try a range of values spaced by roughly a factor of 3 to 10 (e.g. …, 0.001, 0.01, 0.1, 1, …) and plot J(θ) against the number of iterations for each.

Page 99: Chapter 2-OPTIMIZATION
Page 100: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Features and polynomial regression

Machine Learning

Page 101: Chapter 2-OPTIMIZATION

Housing prices prediction

Page 102: Chapter 2-OPTIMIZATION

Polynomial regression

[Plot: Price (y) against Size (x) with polynomial fits.]

With features x, x², x³ the model hθ(x) = θ0 + θ1x + θ2x² + θ3x³ is still linear in the parameters, so it can be fitted with the same machinery; feature scaling becomes important because x, x² and x³ have very different ranges.
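
A minimal Octave sketch of building polynomial features from the size variable (the cubic model and the scaling choice are illustrative):

% Cubic polynomial regression on house size: h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3
size_ft2 = [2104; 1416; 1534; 852];
X_poly = [size_ft2, size_ft2.^2, size_ft2.^3];            % features x, x^2, x^3
mu = mean(X_poly);  sigma = std(X_poly);
X = [ones(length(size_ft2), 1), (X_poly - mu) ./ sigma];  % scale, then add x0 = 1
% X can now be used with the same gradient descent (or normal equation) as before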

Page 103: Chapter 2-OPTIMIZATION

Choice of features

[Plot: Price (y) against Size (x), comparing fits obtained with different choices of features.]

Page 104: Chapter 2-OPTIMIZATION
Page 105: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Normal equation

Machine Learning

Page 106: Chapter 2-OPTIMIZATION

Gradient Descent

Normal equation: a method to solve for θ analytically, in one step rather than iteratively.

Page 107: Chapter 2-OPTIMIZATION

Intuition: if θ is a scalar (1D), set dJ(θ)/dθ = 0 and solve for θ.

For the vector case, set ∂J(θ)/∂θj = 0 for every j, and solve for θ0, θ1, …, θn.

Page 108: Chapter 2-OPTIMIZATION

Examples: m = 4.

Training data:

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178

Adding x0 = 1 gives the design matrix X and the target vector y:

X = [ 1  2104  5  1  45 ;
      1  1416  3  2  40 ;
      1  1534  3  2  30 ;
      1   852  2  1  36 ]          y = [ 460 ; 232 ; 315 ; 178 ]

θ = (XᵀX)⁻¹ Xᵀ y

Page 109: Chapter 2-OPTIMIZATION

m training examples; n features. Stack the training inputs (each extended with x0 = 1) as the rows of the m × (n+1) matrix X, and the targets as the m-vector y; then θ = (XᵀX)⁻¹ Xᵀ y.

Page 110: Chapter 2-OPTIMIZATION

θ = (XᵀX)⁻¹ Xᵀ y, where (XᵀX)⁻¹ is the inverse of the matrix XᵀX.

Octave: pinv(X’*X)*X’*y
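
A minimal Octave sketch applying this to the four-example housing table shown earlier:

% Normal equation on the m = 4 housing examples (x0 = 1 prepended)
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];
y = [460; 232; 315; 178];
theta = pinv(X'*X) * X'*y      % analytic solution; no alpha, no iterations
% Note: here m = 4 examples but 5 parameters, so X'*X is singular;
% pinv still returns a (minimum-norm) solution (see the non-invertibility notes below).
% For full-rank problems, theta = X \ y is a numerically preferable alternative.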

Page 111: Chapter 2-OPTIMIZATION

m training examples, n features.

Gradient Descent:
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Need to compute (XᵀX)⁻¹.
• Slow if n is very large.

Page 112: Chapter 2-OPTIMIZATION
Page 113: Chapter 2-OPTIMIZATION

Linear Regression with multiple variables

Normal equation and non-invertibility (optional)

Machine Learning

Page 114: Chapter 2-OPTIMIZATION

Normal equation: θ = (XᵀX)⁻¹ Xᵀ y

- What if XᵀX is non-invertible (singular / degenerate)?

- Octave: pinv(X’*X)*X’*y still works, because pinv computes the pseudo-inverse.

Page 115: Chapter 2-OPTIMIZATION

What if XᵀX is non-invertible?

• Redundant features (linearly dependent), e.g. x1 = size in feet² and x2 = size in m², which differ only by a constant factor.

• Too many features (e.g. m ≤ n, fewer training examples than features).

- Delete some features, or use regularization.
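
A minimal Octave illustration of the redundant-feature case (the numbers are made up):

% Two linearly dependent columns: x2 = 2 * x1 (e.g. the same size in two different units)
X = [1 1 2;
     1 2 4;
     1 3 6];
y = [1; 2; 3];
rank(X' * X)                 % 2, not 3: X'X is singular (non-invertible)
theta = pinv(X'*X) * X'*y    % pinv still returns a minimum-norm least-squares solution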

Page 116: Chapter 2-OPTIMIZATION
Page 117: Chapter 2-OPTIMIZATION

Linear model

Regression function: y = θ1 f1(u) + θ2 f2(u) + … + θn fn(u), where u is the model’s input vector, f1, …, fn are known functions of u, and θ1, …, θn are the unknown parameters to be estimated.

Page 118: Chapter 2-OPTIMIZATION

Linear model contd…

Collecting the m training pairs gives, in matrix notation, Aθ = y, where A is an m×n matrix whose i-th row is [f1(u(i)), …, fn(u(i))], θ is the n×1 parameter vector, and y is the m×1 output vector.

Page 119: Chapter 2-OPTIMIZATION

Due to noise, a small error term is added to each equation: Aθ + e = y, where e is the m×1 error vector.

Page 120: Chapter 2-OPTIMIZATION

Least Square Estimator

The least-squares estimator θ̂ minimizes the sum of squared errors E(θ) = (y - Aθ)ᵀ(y - Aθ). Setting the gradient to zero gives the normal equation AᵀAθ̂ = Aᵀy, so θ̂ = (AᵀA)⁻¹ Aᵀ y when AᵀA is non-singular.

Page 121: Chapter 2-OPTIMIZATION

Problem on Least Square Estimator

Page 122: Chapter 2-OPTIMIZATION

Derivative Based Optimization

• Deals with gradient-based optimization techniques, which are capable of determining search directions from an objective function’s derivative information

• Used in optimizing non-linear neuro-fuzzy models:
  – Steepest descent
  – Conjugate gradient

Page 123: Chapter 2-OPTIMIZATION

First-Order Optimality Condition

Expand F(x) about a candidate optimum x*, with Δx = x - x*:

F(x* + Δx) = F(x*) + ∇F(x*)ᵀ Δx + ½ Δxᵀ ∇²F(x*) Δx + …

For small Δx the first-order term dominates:

F(x* + Δx) ≈ F(x*) + ∇F(x*)ᵀ Δx

If x* is a minimum, this implies ∇F(x*)ᵀ Δx ≥ 0 for every small Δx. But if ∇F(x*)ᵀ Δx > 0 for some Δx, then

F(x* - Δx) ≈ F(x*) - ∇F(x*)ᵀ Δx < F(x*),

which would imply that x* is not a minimum. Therefore ∇F(x*)ᵀ Δx = 0. Since this must be true for every Δx, the gradient must vanish at the minimum:

∇F(x*) = 0

Page 124: Chapter 2-OPTIMIZATION

Second-Order Condition

If the first-order condition is satisfied (zero gradient), the expansion reduces to

F(x* + Δx) = F(x*) + ½ Δxᵀ ∇²F(x*) Δx + …

A strong minimum will exist at x* if Δxᵀ ∇²F(x*) Δx > 0 for any Δx ≠ 0. Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if

zᵀAz > 0 for any z ≠ 0.

Together with the zero-gradient condition, this is a sufficient condition for optimality.

A necessary condition is that the Hessian matrix be positive semidefinite. A matrix A is positive semidefinite if

zᵀAz ≥ 0 for any z.
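
A minimal Octave check of both conditions on a concrete quadratic (the function F(x) = x1² + 2x1x2 + 2x2² + x1 is an illustrative choice, not from the slides); its gradient is [2x1 + 2x2 + 1; 2x1 + 4x2] and its Hessian is [2 2; 2 4]:

% F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1, written as F(x) = 0.5*x'*A*x + b'*x
A = [2 2; 2 4];          % Hessian (constant for a quadratic)
b = [1; 0];
gradF = @(x) A*x + b;    % gradient of F
x_star = -A \ b;         % first-order condition grad F(x*) = 0  gives x* = [-1; 0.5]
gradF(x_star)            % = [0; 0]
eig(A)                   % both eigenvalues > 0, so the Hessian is positive definite: strong minimum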

Page 125: Chapter 2-OPTIMIZATION

Basic Optimization Algorithm

x_{k+1} = x_k + α_k p_k    (equivalently  Δx_k = x_{k+1} - x_k = α_k p_k)

p_k : search direction
α_k : learning rate (step size)
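
A minimal Octave sketch of this iteration with the steepest-descent choice of search direction, p_k = -∇F(x_k), applied to the quadratic from the previous example (the fixed learning rate is an illustrative choice):

% Steepest descent: x_{k+1} = x_k + alpha * p_k  with  p_k = -grad F(x_k)
A = [2 2; 2 4];  b = [1; 0];
gradF = @(x) A*x + b;
x = [0; 0];                  % starting point
alpha = 0.2;                 % fixed learning rate; must be < 2/max(eig(A)) to converge here
for k = 1:100
  p = -gradF(x);             % search direction: downhill along the negative gradient
  x = x + alpha * p;         % basic optimization step
end
x                            % approaches the minimum [-1; 0.5]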