Chapter 2-OPTIMIZATION
G. Anuradha
Contents
• Derivative-based Optimization
  – Descent Methods
  – The Method of Steepest Descent
  – Classical Newton's Method
  – Step Size Determination
• Derivative-free Optimization
  – Genetic Algorithms
  – Simulated Annealing
  – Random Search
  – Downhill Simplex Search
What is Optimization?
• Choosing the best element from some set of available alternatives
• Solving problems in which one seeks to minimize or maximize a real function
Notation of Optimization

Optimize y = f(x1, x2, …, xn)                                (1)
subject to gj(x1, x2, …, xn) ≤ / = / ≥ bj,  j = 1, 2, …, n    (2)

• Eqn (1) is the objective function.
• Eqn (2) is a set of constraints imposed on the solution.
• x1, x2, …, xn are the decision variables.
Note:- the problem is either to maximize or minimize the value of the objective function.
Complicating factors in optimization
1. Existence of multiple decision variables
2. Complex nature of the relationships between the decision variables and the associated income
3. Existence of one or more complex constraints on the decision variables
Types of optimization
• Constrained:- the objective function is maximized or minimized subject to constraints on the decision variables
• Unconstrained:- no constraints are imposed on the decision variables and differential calculus can be used to analyse them
Least Square Methods for System Identification
• System Identification:- Determining a mathematical model for an unknown system by observing the input-output data pairs
• System identification is required
  – To predict a system's behavior
  – To explain the interactions and relationships between inputs and outputs
  – To design a controller
• System identification involves
  – Structure identification
  – Parameter identification
Structure identification
• Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is conducted
• y = f(u; θ), where y is the model's output, u is the input vector and θ is the parameter vector
Parameter Identification
• The structure of the model is known, and optimization techniques are applied to determine the parameter vector θ = θ̂ that best fits the observed data
Block diagram of parameter identification
Parameter identification
• An input ui is applied to both the system and the model
• The difference between the target system's output yi and the model's output ŷi is used to update the parameter vector θ so as to minimize the difference
• System identification is not a one-pass process; structure and parameter identification need to be performed repeatedly
Classification of Optimization algorithms
• Derivative-based algorithms
• Derivative-free algorithms
Characteristics of derivative-free algorithms
1. Derivative freeness:- rely only on repeated evaluations of the objective function
2. Intuitive guidelines:- concepts are based on nature's wisdom, such as evolution and thermodynamics
3. Slower:- usually require more function evaluations than derivative-based methods
4. Flexibility
5. Randomness:- stochastic in nature, which makes them global optimizers
6. Analytic opacity:- knowledge about them is based on empirical studies
7. Iterative nature:- they approach the optimum over successive iterations
Characteristics of derivative-free algorithms (contd.)
• Stopping condition of iteration:- let k denote the iteration count and fk denote the best objective function value obtained up to count k. The stopping condition may depend on (see the sketch below)
  – Computation time
  – Optimization goal
  – Minimal improvement
  – Minimal relative improvement
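A minimal Python sketch (not from the slides; the optimizer, thresholds and test function are illustrative assumptions) showing how these four stopping conditions can be combined in a simple derivative-free random search:

```python
import time
import numpy as np

def random_search(f, x0, max_time=10.0, goal=1e-6,
                  min_improvement=1e-8, min_rel_improvement=1e-6,
                  step=0.1, max_iter=10000):
    """Simple random search illustrating the four stopping conditions.

    All thresholds are illustrative defaults and should be tuned per problem.
    """
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    start = time.time()
    for k in range(max_iter):
        candidate = x + step * np.random.randn(*x.shape)   # random move
        fc = f(candidate)
        improvement = fx - fc
        if fc < fx:                                        # keep only improvements
            x, fx = candidate, fc
        # Stopping conditions
        if time.time() - start > max_time:                 # computation time
            break
        if fx <= goal:                                     # optimization goal reached
            break
        if 0 < improvement < min_improvement:              # minimal improvement
            break
        if fx != 0 and 0 < improvement / abs(fx) < min_rel_improvement:  # minimal relative improvement
            break
    return x, fx

# Illustrative example: minimize a simple quadratic
x_best, f_best = random_search(lambda x: float(np.sum(x**2)), np.array([3.0, -2.0]))
```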
Basics of Matrix Manipulation and Calculus
Gradient of a Scalar Function
Jacobian of a Vector Function
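The equations on these two slides are not reproduced in the transcript; the standard definitions they refer to are:

```latex
% Gradient of a scalar function f(x), x in R^n
\nabla f(\mathbf{x}) =
\left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2},
       \ldots, \frac{\partial f}{\partial x_n} \right]^{T}

% Jacobian of a vector function g(x) = [g_1(x), ..., g_m(x)]^T
\mathbf{J}(\mathbf{x}) =
\begin{bmatrix}
\frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_m}{\partial x_1} & \cdots & \frac{\partial g_m}{\partial x_n}
\end{bmatrix}
```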
Least Square Estimator
• The method of least squares is a standard approach to the approximate solution of overdetermined systems.
• Least squares:- the overall solution minimizes the sum of the squares of the errors made in solving every single equation.
• Application:- data fitting.
Types of Least Squares
• Least Squares
  – Linear:- the model is a linear combination of the parameters. It may represent a straight line, a parabola or any other linear combination of functions.
  – Non-Linear:- the parameters appear inside non-linear functions, such as β² or e^(βx).
• If the derivatives of the model with respect to the parameters are either constant or depend only on the values of the independent variable, the model is linear; otherwise it is non-linear.
Differences between Linear and Non-Linear Least Squares
• Initial values: linear algorithms do not require initial values; non-linear algorithms require initial values.
• Convergence: the linear least-squares problem is globally convex, so non-convergence is not an issue; non-convergence is a common issue for non-linear least squares.
• Solution method: linear problems are normally solved using direct methods; non-linear problems usually require an iterative process.
• Solution: the linear solution is unique; the non-linear sum of squares can have multiple minima.
• Estimates: linear least squares yields unbiased estimates when the errors are uncorrelated with the predictor values; non-linear least squares generally yields biased estimates.
Linear regression with one variable
Model representation
[Scatter plot: Housing Prices (Portland, OR), price (in 1000s of dollars) vs. size (feet²)]
Supervised Learning: given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
Training set of housing prices (Portland, OR):

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  …                    …
Training Set → Learning Algorithm → h
Size of house → h → Estimated price
How do we represent h ?
Linear regression with one variable. Univariate linear regression. The hypothesis is hθ(x) = θ0 + θ1x.
Cost function
How to choose the θi's ?
Training Set
Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: Parameters
  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  …                    …
[Plot of training examples: y vs. x]
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
Cost function intuition I
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
Goal: minimize J(θ0, θ1) with respect to θ0, θ1
Simplified version: set θ0 = 0, so hθ(x) = θ1x and J depends on θ1 only.
[Paired plots for several values of θ1: hθ(x) for fixed θ1 (a function of x, plotted as y vs. x) alongside J(θ1) (a function of the parameter θ1)]
Cost function intuition II
Hypothesis, parameters, cost function and goal as defined above.
[Paired plots for several choices of θ0, θ1: hθ(x) for fixed θ0, θ1 (a function of x, price in $1000's vs. size in feet²) alongside the contour plot of J(θ0, θ1) (a function of the parameters θ0, θ1)]
Gradient descent
Have some function J(θ0, θ1)
Want: min J(θ0, θ1) over θ0, θ1

Outline:
• Start with some θ0, θ1 (e.g. θ0 = 0, θ1 = 0)
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
[3-D surface plots of the cost function J(θ0, θ1)]
Gradient descent algorithm:
  repeat until convergence {
    θj := θj − α ∂J(θ0, θ1)/∂θj    (simultaneously for j = 0 and j = 1)
  }

Correct (simultaneous update):
  temp0 := θ0 − α ∂J(θ0, θ1)/∂θ0
  temp1 := θ1 − α ∂J(θ0, θ1)/∂θ1
  θ0 := temp0
  θ1 := temp1

Incorrect:
  temp0 := θ0 − α ∂J(θ0, θ1)/∂θ0
  θ0 := temp0
  temp1 := θ1 − α ∂J(θ0, θ1)/∂θ1    (this now uses the already-updated θ0)
  θ1 := temp1
Gradient descent intuition
Gradient descent algorithm
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
At a local optimum the slope (derivative) is zero, so the update θ1 := θ1 − α·0 leaves the current value of θ1 unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
Gradient descent for linear regression
Gradient descent algorithm applied to the linear regression model:
  repeat until convergence {
    θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i))
    θ1 := θ1 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) · x(i)
  }
  (update θ0 and θ1 simultaneously)
[Sequence of paired plots: hθ(x) for the current θ0, θ1 (a function of x) alongside the contour plot of J(θ0, θ1) (a function of the parameters), after successive gradient descent steps]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
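A minimal NumPy sketch (not from the slides; data and learning rate are illustrative) of batch gradient descent for the univariate model hθ(x) = θ0 + θ1x, applying the simultaneous update above to all m training examples at each step:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                  # predictions on ALL m examples
        # Simultaneous update: compute both gradients before updating
        grad0 = (1.0 / m) * np.sum(h - y)
        grad1 = (1.0 / m) * np.sum((h - y) * x)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Illustrative data (sizes scaled to thousands of feet², prices in $1000's)
x = np.array([2.104, 1.416, 1.534, 0.852])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta0, theta1 = batch_gradient_descent(x, y, alpha=0.1, num_iters=5000)
```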
Linear Regression with multiple variables
Multiple features
  Size (feet²)    Price ($1000)
  2104            460
  1416            232
  1534            315
  852             178
  …               …
Multiple features (variables).
  Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
  2104           5                    1                  45                    460
  1416           3                    2                  40                    232
  1534           3                    2                  30                    315
  852            2                    1                  36                    178
  …              …                    …                  …                     …
Notation:
  n = number of features
  x(i) = input (features) of the i-th training example
  xj(i) = value of feature j in the i-th training example
Hypothesis:
  Previously (n = 1): hθ(x) = θ0 + θ1x
  Now: hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn
  For convenience of notation, define x0 = 1, so that hθ(x) = θᵀx.
Multivariate linear regression.
Linear Regression with multiple variables
Gradient descent for multiple variables
Hypothesis: hθ(x) = θᵀx = θ0x0 + θ1x1 + … + θnxn   (with x0 = 1)
Parameters: θ0, θ1, …, θn
Cost function: J(θ) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²

Gradient descent:
  Repeat {
    θj := θj − α ∂J(θ)/∂θj
  }   (simultaneously update θj for every j = 0, …, n)
Gradient Descent

Previously (n = 1):
  Repeat {
    θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i))
    θ1 := θ1 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) x(i)
  }

New algorithm (n ≥ 1):
  Repeat {
    θj := θj − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) xj(i)
  }   (simultaneously update θj for j = 0, …, n; see the sketch below)
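A vectorized sketch of the same update for n features (illustrative; assumes the design matrix X already contains the x0 = 1 column):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Vectorized gradient descent for multivariate linear regression.

    X is an (m, n+1) design matrix whose first column is all ones (x0 = 1),
    y is an (m,) target vector; theta is updated simultaneously as a vector.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = (1.0 / m) * X.T @ (X @ theta - y)   # all partial derivatives at once
        theta -= alpha * gradient                       # simultaneous update of every theta_j
    return theta
```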
Linear Regression with multiple variables
Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: make sure features are on a similar scale.
E.g. x1 = size (0–2000 feet²), x2 = number of bedrooms (1–5)
[Contour plots of the cost function for the features size (feet²) and number of bedrooms, before and after scaling]
Feature Scaling
Get every feature into approximately a −1 ≤ xi ≤ 1 range.
Replace xi with xi − μi to make features have approximately zero mean (do not apply to x0 = 1).
Mean normalization
E.g. xi := (xi − μi) / si, where μi is the mean of feature i in the training set and si is its range (max − min) or standard deviation.
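A small sketch of mean normalization (illustrative; uses the range as the scaling factor, the standard deviation works equally well):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-1, 1] with zero mean.

    X is an (m, n) matrix of raw features (without the x0 = 1 column,
    which must not be normalized).
    """
    mu = X.mean(axis=0)                    # per-feature mean
    s = X.max(axis=0) - X.min(axis=0)      # per-feature range (std() also works)
    return (X - mu) / s, mu, s

# Illustrative data: size (feet^2) and number of bedrooms
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
X_norm, mu, s = mean_normalize(X)
```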
Linear Regression with multiple variables
Gradient descent in practice II: Learning rate
Gradient descent
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Example automatic convergence test:
Declare convergence if J(θ) decreases by less than some small threshold (e.g. 10⁻³) in one iteration.
[Plot: J(θ) vs. no. of iterations]
Making sure gradient descent is working correctly:
[Plots of J(θ) vs. no. of iterations]
If J(θ) is increasing or oscillating, gradient descent is not working: use a smaller α.
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try a sequence of values such as …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …, plot J(θ) against the number of iterations for each, and pick the largest α for which J(θ) still decreases steadily (see the sketch below).
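A sketch of this tuning procedure (illustrative helper names; assumes X already contains the x0 = 1 column and records J(θ) per iteration for each trial α so the curves can be compared):

```python
import numpy as np

def cost(X, y, theta):
    """Mean squared error cost J(theta) = (1/2m) * sum((X theta - y)^2)."""
    m = len(y)
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))

def try_learning_rates(X, y, alphas=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0),
                       num_iters=100):
    """Run a few iterations for each alpha and record the cost history,
    so J can be plotted against the number of iterations for each choice."""
    histories = {}
    for alpha in alphas:
        theta = np.zeros(X.shape[1])
        history = []
        for _ in range(num_iters):
            theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
            history.append(cost(X, y, theta))
        histories[alpha] = history   # J should decrease steadily for a good alpha
    return histories
```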
Linear Regression with multiple variables
Features and polynomial regression
Housing prices prediction
Polynomial regression
[Plot: price (y) vs. size (x) with fitted polynomial curves]
Choice of features
[Plot: price (y) vs. size (x) illustrating alternative feature choices]
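A sketch of building polynomial features from the single size feature so the same linear-regression machinery can fit a curve (illustrative; note that feature scaling becomes important because the powers have very different ranges):

```python
import numpy as np

def polynomial_features(size, degree=3):
    """Build [1, x, x^2, ..., x^degree] features from a single 'size' feature.

    size^2 and size^3 have very different ranges, so normalize each column
    before running gradient descent.
    """
    size = np.asarray(size, dtype=float)
    X = np.column_stack([size ** d for d in range(degree + 1)])  # column 0 is x0 = 1
    return X

# Illustrative sizes from the housing table
sizes = np.array([2104, 1416, 1534, 852], dtype=float)
X_poly = polynomial_features(sizes, degree=3)
```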
Linear Regression with multiple variables
Normal equation
Gradient descent solves for θ iteratively. The normal equation is a method to solve for θ analytically.

Intuition: if θ were a single scalar (1-D), J(θ) would be a quadratic function; set dJ(θ)/dθ = 0 and solve for θ.
For the vector case, set ∂J(θ)/∂θj = 0 for every j and solve for θ0, θ1, …, θn.
  x0   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
  1    2104           5                    1                  45                    460
  1    1416           3                    2                  40                    232
  1    1534           3                    2                  30                    315
  1    852            2                    1                  36                    178
  Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
  2104           5                    1                  45                    460
  1416           3                    2                  40                    232
  1534           3                    2                  30                    315
  852            2                    1                  36                    178
Examples: m examples (x(1), y(1)), …, (x(m), y(m)); n features.
Normal equation: θ = (XᵀX)⁻¹ Xᵀ y, where (XᵀX)⁻¹ is the inverse of the matrix XᵀX.
E.g. for the table above, X is the matrix of inputs (including the x0 = 1 column) and y is the vector of prices.
Octave: pinv(X'*X)*X'*y
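An equivalent NumPy sketch of the normal equation (illustrative; assumes X already contains the x0 = 1 column):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y.

    pinv (the pseudo-inverse) is used instead of inv so the computation
    still works when X^T X is singular (non-invertible).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Design matrix with the x0 = 1 column, from the housing table above
# (with only 4 examples and 5 parameters this system is underdetermined;
#  pinv then returns the minimum-norm fit)
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
theta = normal_equation(X, y)
```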
With m training examples and n features:

Gradient Descent
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation
• No need to choose α.
• Don't need to iterate.
• Need to compute (XᵀX)⁻¹, an (n+1)×(n+1) matrix inverse.
• Slow if n is very large.
Linear Regression with multiple variables
Normal equation and non-invertibility (optional)
Normal equation: θ = (XᵀX)⁻¹ Xᵀ y
- What if XᵀX is non-invertible (singular / degenerate)?
- Octave: pinv(X'*X)*X'*y still works, since pinv computes the pseudo-inverse.
What if XᵀX is non-invertible?
• Redundant features (linearly dependent), e.g. x1 = size in feet² and x2 = size in m².
• Too many features (e.g. m ≤ n).
Fixes: delete some features, or use regularization.
Linear model
Regression Function
Linear model contd…
Using matrix notation, y = Aθ, where A is an m×n matrix of known regressors, θ is the n×1 parameter vector and y is the m×1 output vector.
Due to noise, a small amount of error is added: y = Aθ + e.
Least Square Estimator: θ̂ = (AᵀA)⁻¹ Aᵀ y, the value of θ that minimizes the sum of squared errors ‖y − Aθ‖².
Problem on Least Square Estimator
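The worked problem itself is not included in the transcript; a small illustrative example of computing the least squares estimator with NumPy:

```python
import numpy as np

# Least squares estimator theta_hat = (A^T A)^(-1) A^T y for y = A theta + e.
# The data below are illustrative, not taken from the slides.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])                            # m = 4 observations, n = 2 parameters
true_theta = np.array([0.5, 2.0])
rng = np.random.default_rng(0)
y = A @ true_theta + 0.1 * rng.standard_normal(4)     # outputs corrupted by noise e

theta_hat = np.linalg.inv(A.T @ A) @ A.T @ y          # least squares estimate
residual = y - A @ theta_hat                          # estimated modeling error
```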
Derivative Based Optimization
• Deals with gradient-based optimization techniques, capable of determining search directions according to an objective function’s derivative information
• Used in optimizing non-linear neuro-fuzzy models
  – Steepest descent
  – Conjugate gradient
First-Order Optimality Condition

Expand F(x) about a candidate minimum point x* in a Taylor series, with Δx = x − x*:

  F(x) = F(x* + Δx) = F(x*) + ∇F(x*)ᵀ Δx + ½ Δxᵀ ∇²F(x*) Δx + …

For small Δx the higher-order terms can be neglected:

  F(x* + Δx) ≈ F(x*) + ∇F(x*)ᵀ Δx

If x* is a minimum, F(x* + Δx) ≥ F(x*), which implies ∇F(x*)ᵀ Δx ≥ 0 for every Δx.

Suppose instead that ∇F(x*)ᵀ Δx > 0 for some Δx. Then, stepping in the opposite direction,

  F(x* − Δx) ≈ F(x*) − ∇F(x*)ᵀ Δx < F(x*)

But this would imply that x* is not a minimum. Therefore ∇F(x*)ᵀ Δx = 0, and since this must be true for every Δx,

  ∇F(x*) = 0
Second-Order Condition

If the first-order condition is satisfied (zero gradient), the expansion becomes

  F(x* + Δx) ≈ F(x*) + ½ Δxᵀ ∇²F(x*) Δx + …

A strong minimum will exist at x* if Δxᵀ ∇²F(x*) Δx > 0 for any Δx ≠ 0.
Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if

  zᵀ A z > 0   for any z ≠ 0.

Together with the zero-gradient condition, this is a sufficient condition for optimality (a strong minimum).

A necessary condition is that the Hessian matrix be positive semidefinite. A matrix A is positive semidefinite if

  zᵀ A z ≥ 0   for any z.
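An illustrative NumPy check of these conditions at a candidate point (the function names, tolerances and test example are assumptions, not from the slides):

```python
import numpy as np

def classify_stationary_point(grad, hessian, tol=1e-8):
    """Check the first- and second-order conditions at a candidate point x*.

    grad is the gradient vector and hessian the Hessian matrix evaluated at x*.
    """
    if np.linalg.norm(grad) > tol:
        return "not a stationary point (gradient is non-zero)"
    eigvals = np.linalg.eigvalsh(hessian)        # Hessian is symmetric
    if np.all(eigvals > tol):
        return "strong minimum (Hessian positive definite)"
    if np.all(eigvals >= -tol):
        return "possible minimum (Hessian positive semidefinite)"
    return "not a minimum (Hessian has a negative eigenvalue)"

# Illustrative example: F(x) = x1^2 + 2*x2^2 at x* = (0, 0)
print(classify_stationary_point(np.zeros(2), np.array([[2.0, 0.0], [0.0, 4.0]])))
```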
Basic Optimization Algorithm

  x_{k+1} = x_k + α_k p_k        or equivalently        Δx_k = x_{k+1} − x_k = α_k p_k

  p_k  – search direction
  α_k  – learning rate (step size)
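An illustrative sketch of this basic iteration with the steepest-descent choice p_k = −∇F(x_k) and a fixed learning rate (the test function and step size are assumptions):

```python
import numpy as np

def basic_optimization(grad_f, x0, alpha=0.1, num_iters=100):
    """Generic iteration x_{k+1} = x_k + alpha_k * p_k.

    Here the search direction p_k is chosen as the steepest-descent direction
    p_k = -grad F(x_k), and the learning rate alpha_k is kept fixed.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        p = -grad_f(x)          # search direction
        x = x + alpha * p       # take a step of size alpha along p
    return x

# Illustrative example: minimize F(x) = x1^2 + 2*x2^2, whose gradient is (2*x1, 4*x2)
x_min = basic_optimization(lambda x: np.array([2 * x[0], 4 * x[1]]), [3.0, -2.0])
```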