
Page 1: CMSC 422: Machine Learning, Linear Regression & Kalman Filters (2018. 9. 18.)

CMSC 422: Machine Learning

Linear Regression & Kalman Filters

William Regli, Professor of Computer Science

Slide Credits: Andrew W. Moore, School of Computer Science, Carnegie Mellon University; Hal Daumé, Furong Huang, Marine Carpuat, Computer Science Department, U of Maryland. Other slides and images attributed as best one can; apologies for any errors or omissions; used either with permission or under Fair Use (https://www.copyright.gov/fls/fl102.html)

Page 2

Regression Models

• Learning a functional relationship about a real-valued number, i.e., when y is tomorrow's temperature.
• Technically, solving a regression problem is finding a conditional expectation or average value of y, since the probability that we have found exactly the right real-valued number for y is 0.
• Regression models capture the relationship between one dependent variable and explanatory variable(s)
• Use an equation to set up the relationship:
  • Numerical dependent (response) variable
  • 1 or more numerical or categorical independent (explanatory) variables
• Used mainly for prediction & estimation

Page 3

Copyright © 2001, 2003, Andrew W. Moore

Linear Regression

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.

Simplest case: Out(x) = wx for some unknown w.

Given the data, we can estimate w.

DATASET
inputs     outputs
x1 = 1     y1 = 1
x2 = 3     y2 = 2.2
x3 = 2     y3 = 2
x4 = 1.5   y4 = 1.9
x5 = 4     y5 = 3.1
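For the no-intercept model Out(x) = wx, minimizing the squared error gives the closed form w = Σᵢ xᵢyᵢ / Σᵢ xᵢ², which we can check on the dataset above:

```python
# Least-squares estimate of w for the no-intercept model Out(x) = w*x.
# Minimizing sum_i (y_i - w*x_i)^2 gives the closed form w = sum(x*y) / sum(x^2).
xs = [1, 3, 2, 1.5, 4]
ys = [1, 2.2, 2, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w, 4))  # -> 0.8326, the fitted slope for the 5-point dataset above
```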


Page 4


1-parameter linear regression

Assume that the data is formed by

  yi = w xi + noisei

where:
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

Then p(y|w,x) has a normal distribution with
• mean wx
• variance σ²
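This Gaussian noise assumption is why least squares is the natural loss: maximizing the log-likelihood of the data under this model is the same as minimizing the squared error. A short derivation:

```latex
\log p(y_1,\dots,y_n \mid w, x_1,\dots,x_n)
  = \sum_{i} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y_i - w x_i)^2}{2\sigma^2}\right) \right]
  = \mathrm{const} - \frac{1}{2\sigma^2}\sum_{i} (y_i - w x_i)^2
```

so the maximum-likelihood w is exactly the least-squares w.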

Page 5

Regression examples

Page 6

Prediction of menu prices
Chahuneau, Gimpel, … and Smith, EMNLP 2012

Page 7

Types of Regression Models

Regression models split by the number of explanatory variables and by the form of the relationship:
• Simple (1 explanatory variable): linear or non-linear
• Multiple (2+ explanatory variables): linear or non-linear

Page 8

Regression Modeling Steps

• Hypothesize deterministic component
• Estimate unknown parameters
• Specify probability distribution of random error term
• Estimate standard deviation of error
• Evaluate the fitted model
• Use model for prediction & estimation

Page 9

Linear regression

• Given an input x we would like to compute an output y
• For example:
  • Predict height from age
  • Predict Google's price from Apple's price
  • Predict distance from wall from sensors
  • Predict BMI from height and weight
  • Predict papers published from age

Page 10

Linear regression

• Given an input x we would like to compute an output y
• In linear regression we assume that y and x are related by the equation

  y = wx + e

  where w is a parameter and e represents measurement or other noise

(Figure: x-y scatter; the points are observed values, y is what we are trying to predict.)

Page 11

Linear regression

• Our goal is to estimate w from training data of (xi, yi) pairs
• Optimization goal: minimize squared error (least squares):

  argmin_w Σi (yi - w xi)²

• Why least squares?
  • minimizes squared distance between measurements and the predicted line y = wx + ε
  • has a nice probabilistic interpretation

Page 12

Regression example

• Generated: w=2
• Recovered: w=2.03
• Noise: std=1

Page 13

Regression example

• Generated: w=2
• Recovered: w=2.05
• Noise: std=2

Page 14

Regression example

• Generated: w=2
• Recovered: w=2.08
• Noise: std=4
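The three experiments above can be reproduced in a few lines; the sample size and x-range here are our own choices, not from the slides:

```python
import random

# Sketch of the slides' experiment: generate y = 2x + Gaussian noise,
# then recover w with the closed-form least-squares estimate w = sum(xy)/sum(x^2).
random.seed(0)
w_true, noise_std, n = 2.0, 1.0, 1000

xs = [random.uniform(0, 10) for _ in range(n)]
ys = [w_true * x + random.gauss(0, noise_std) for x in xs]

w_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w_hat, 2))  # close to the generating value w = 2
```

Rerunning with noise_std = 2 or 4 reproduces the slides' pattern: the recovered slope drifts slightly further from 2 as the noise grows.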

Page 15

Bias term

• What if the line does not pass through the origin?
• No problem, simply change the model to

  y = w0 + w1 x + e

• Can use least squares to determine w0, w1:

  w0 = Σi (yi - w1 xi) / n
  w1 = Σi xi (yi - w0) / Σi xi²

(Figure: fitted line crossing the y-axis at the intercept w0.)
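The two coupled equations above can be solved directly; an equivalent standard closed form uses the sample means. A minimal sketch (the helper name fit_line is ours, not from the slides):

```python
# Least squares with a bias term: y = w0 + w1*x + e.
# Standard closed form (equivalent to solving the slide's two coupled equations):
#   w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  w0 = y_bar - w1*x_bar
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # exact line y = 1 + 2x -> (1.0, 2.0)
```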

Page 16

Data points of price versus floor space of houses for sale in Berkeley, CA, in July ‘09:

9/17/18 18

Regression function hypothesis that minimizes squared error loss

Page 17

Plot of the loss function


Note: the shape is convex, so there is one global minimum

Page 18

Multivariate regression

• What if we have several inputs? E.g., stock prices for Apple, Microsoft and Amazon for the Google prediction task
• Model:

  y = w0 + w1 x1 + … + wk xk + e

  where y is Google's stock price and x1, x2, … are Apple's, Microsoft's, … stock prices

Page 19

Non-linear basis functions

• So far we only used the observed values x1, x2, … directly
• However, linear regression can be applied in the same way to functions of these values
• E.g., to add a term w x1x2, add a new variable z = x1x2 so each example becomes: x1, x2, …, z
• As long as these functions can be computed directly from the observed values, the model is still linear in the parameters and the problem remains a multivariate linear regression problem:

  y = w0 + w1 x1 + w2 x2 + … + wk xk + e

Page 20

Non-linear basis functions

• What type of functions can we use? Any function of the input values can be used; the solution for the parameters of the regression remains the same.
• A few common examples:
  • Polynomial: φj(x) = x^j for j = 0 … n
  • Gaussian: φj(x) = exp(-(x - μj)² / (2σj²))
  • Sigmoid: φj(x) = 1 / (1 + exp(-sj x))
  • Log: φj(x) = log(x + 1)

Page 21

General linear regression problem

• Using our new notation for the basis functions, linear regression can be written as

  y = Σj=0..n wj φj(x)

• where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
• and φ0(x) = 1 for the intercept term
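As a sketch of how the basis-function formulation reduces to ordinary linear least squares, the following fits the weights wj by solving the normal equations (ΦᵀΦ)w = Φᵀy over a polynomial basis. All helper names here are ours, and solve() is a deliberately minimal Gaussian-elimination routine for the example, not a production solver:

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_basis(xs, ys, basis):
    """Least-squares weights for y = sum_j w_j * phi_j(x), via normal equations."""
    Phi = [[phi(x) for phi in basis] for x in xs]
    n = len(basis)
    AtA = [[sum(row[i] * row[j] for row in Phi) for j in range(n)] for i in range(n)]
    Aty = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(n)]
    return solve(AtA, Aty)

basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]  # phi_0, phi_1, phi_2
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]                               # exact quadratic y = x^2
print(fit_basis(xs, ys, basis))                        # weights approx [0, 0, 1]
```

The fit is still "linear regression" even though the fitted curve is a parabola: the model is linear in the weights, exactly as the slide states.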

Page 22


Introduction to Kalman Filters

Page 23

The Problem

• Why do we need Kalman filters?
• What is a Kalman filter?
• Conceptual overview
• The theory of Kalman filters
• Simple example

Page 24

Page 25

• System state cannot be measured directly
• Need to estimate it "optimally" from measurements

(Block diagram: external controls and system error sources feed the System, whose state is desired but not known; measuring devices, subject to measurement error sources, produce the observed measurements; an Estimator combines these into an optimal estimate of the system state. The System and measuring devices together form a black box.)

Page 26

What is a Kalman Filter?

• A recursive data-processing algorithm
• Generates an optimal estimate of desired quantities given the set of measurements
• Optimal?
  • For a linear system with Gaussian errors, the Kalman filter is the "best" estimate based on all previous measurements
  • For a non-linear system, optimality is 'qualified'
• Recursive?
  • Doesn't need to store all previous measurements and reprocess all data each time step

Page 27

Conceptual Overview

• Lost on a 1-dimensional line
• Position: y(t)
• Assume Gaussian-distributed measurements

Page 28

Conceptual Overview

• Sextant measurement at t1: Mean = z1 and Variance = σ²z1
• Optimal estimate of position: ŷ(t1) = z1
• Variance of error in estimate: σ²x(t1) = σ²z1
• Boat in same position at time t2; predicted position is z1

(Figure: Gaussian density of the measurement, centered at z1.)

Page 29

Conceptual Overview

• So we have the prediction ŷ⁻(t2)
• GPS measurement at t2: Mean = z2 and Variance = σ²z2
• Need to correct the prediction using the measurement to get ŷ(t2)
• Closer to the more trusted measurement; linear interpolation?

(Figure: densities of the prediction ŷ⁻(t2) and the measurement z(t2).)

Page 30

Conceptual Overview

• Corrected mean is the new optimal estimate of position
• New variance is smaller than either of the previous two variances

(Figure: prediction ŷ⁻(t2), measurement z(t2), and the narrower corrected optimal estimate ŷ(t2).)

Page 31

Conceptual Overview

Basic ideas:
• Make a prediction based on previous data: ŷ⁻, σ⁻
• Take a measurement: zk, σz
• Optimal estimate: ŷ = prediction + (Kalman gain) * (measurement - prediction)
• Variance of estimate = variance of prediction * (1 - Kalman gain)
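The two update rules above, in code, for the 1-D case (the function name correct and the example numbers are ours, chosen for illustration):

```python
# One 1-D correction step, as on the slide: blend a prediction (y_pred, var_pred)
# with a measurement (z, var_z). The Kalman gain weights them by relative variance.
def correct(y_pred, var_pred, z, var_z):
    K = var_pred / (var_pred + var_z)  # Kalman gain in [0, 1]
    y = y_pred + K * (z - y_pred)      # optimal estimate
    var = (1 - K) * var_pred           # variance of estimate shrinks
    return y, var

# Uncertain prediction (variance 4) corrected by a trusted measurement (variance 1):
y, var = correct(y_pred=10.0, var_pred=4.0, z=12.0, var_z=1.0)
print(y, var)  # estimate pulled toward the lower-variance measurement
```

Note that the resulting variance (0.8) is smaller than both input variances, exactly the "merging two Gaussians" behavior the slides describe.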

Page 32

Conceptual Overview

• At time t3, the boat moves with velocity dy/dt = u
• Naïve approach: shift the probability density to the right to predict
• This would work if we knew the velocity exactly (a perfect model)

(Figure: ŷ(t2) shifted right to the naïve prediction ŷ⁻(t3).)

Page 33

Conceptual Overview

• Better to assume an imperfect model by adding Gaussian noise: dy/dt = u + w
• The distribution for the prediction moves and spreads out

(Figure: ŷ(t2), the naïve prediction ŷ⁻(t3), and the wider prediction ŷ⁻(t3).)

Page 34

Conceptual Overview

• Now we take a measurement at t3
• Need to once again correct the prediction
• Same as before

(Figure: prediction ŷ⁻(t3), measurement z(t3), and corrected optimal estimate ŷ(t3).)

Page 35

Conceptual Overview

Summary:
• Initial conditions (ŷk-1 and σk-1)
• Prediction (ŷ⁻k, σ⁻k)
  • Use initial conditions and model (e.g., constant velocity) to make prediction
• Measurement (zk)
  • Take measurement
• Correction (ŷk, σk)
  • Use measurement to correct prediction by 'blending' prediction and residual; always a case of merging only two Gaussians
  • Optimal estimate with smaller variance

Page 36

Theoretical Basis

• Process to be estimated:

  yk = A yk-1 + B uk + wk-1    (process noise w, covariance Q)
  zk = H yk + vk               (measurement noise v, covariance R)

• Kalman filter:

  Predicted: ŷ⁻k is the estimate based on measurements at previous time steps

    ŷ⁻k = A ŷk-1 + B uk
    P⁻k = A Pk-1 Aᵀ + Q

  Corrected: ŷk has additional information, the measurement at time k

    K = P⁻k Hᵀ (H P⁻k Hᵀ + R)⁻¹
    ŷk = ŷ⁻k + K (zk - H ŷ⁻k)
    Pk = (I - K H) P⁻k
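For the scalar boat example (A = H = 1, B·u = u·dt with dt = 1), the five equations above collapse to a few lines. A minimal sketch; the noise covariances Q and R and the velocity u are our own illustrative choices:

```python
import random

# Scalar Kalman filter: A = H = 1, so y_k = y_{k-1} + u + process noise,
# observed directly with measurement noise.
def kalman_step(y_est, P, u, z, Q, R):
    # Prediction (time update)
    y_pred = y_est + u           # y_pred = A*y_est + B*u  (A = 1, B*u = u)
    P_pred = P + Q               # P_pred = A*P*A' + Q
    # Correction (measurement update)
    K = P_pred / (P_pred + R)    # K = P_pred*H' / (H*P_pred*H' + R), H = 1
    y_new = y_pred + K * (z - y_pred)
    P_new = (1 - K) * P_pred
    return y_new, P_new

random.seed(1)
Q, R, u = 0.01, 1.0, 1.0         # process noise, measurement noise, velocity
y_true, y_est, P = 0.0, 0.0, 1.0
for _ in range(50):
    y_true += u + random.gauss(0, Q ** 0.5)   # boat drifts forward
    z = y_true + random.gauss(0, R ** 0.5)    # noisy position measurement
    y_est, P = kalman_step(y_est, P, u, z, Q, R)
print(round(abs(y_est - y_true), 2), round(P, 3))  # small error, small variance
```

The variance P converges to a steady state (about 0.095 here) regardless of the measurements, which is the recursive, fixed-memory behavior the earlier slide advertises.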

Page 37

Kalman Filter Algorithm (notation abuse)

Algorithm Kalman_filter(μt-1, Σt-1, ut, zt):

Prediction:
  μ̄t = At μt-1 + Bt ut
  Σ̄t = At Σt-1 Atᵀ + Rt

Correction:
  Kt = Σ̄t Ctᵀ (Ct Σ̄t Ctᵀ + Qt)⁻¹
  μt = μ̄t + Kt (zt - Ct μ̄t)
  Σt = (I - Kt Ct) Σ̄t

Return μt, Σt

(The notation abuse: here C is the measurement matrix, R the process-noise covariance, and Q the measurement-noise covariance, swapping the roles Q and R play on the neighboring slides.)

Page 38

Theoretical Basis

Prediction (Time Update):
(1) Project the state ahead: ŷ⁻k = A ŷk-1 + B uk
(2) Project the error covariance ahead: P⁻k = A Pk-1 Aᵀ + Q

Correction (Measurement Update):
(1) Compute the Kalman gain: K = P⁻k Hᵀ (H P⁻k Hᵀ + R)⁻¹
(2) Update estimate with measurement zk: ŷk = ŷ⁻k + K (zk - H ŷ⁻k)
(3) Update error covariance: Pk = (I - K H) P⁻k

Page 39

The Prediction-Correction Cycle: Prediction

Page 40

The Prediction-Correction Cycle: Correction


Page 42

Kalman Filter Summary

• Highly efficient: polynomial in measurement dimensionality k and state dimensionality n: O(k^2.376 + n^2)
• Optimal for linear Gaussian systems!
• Most robotics systems are nonlinear!

Page 43

Page 44

Relating Regression to Kalman Filters

Kalman filter:
• estimates the state
• discrete-time process
• linear stochastic difference equation

Linear regression:
• estimates a value
• finite set of data/values
• linear equations with Gaussian noise

Page 45

Relating Regression to Kalman Filters

• The Kalman filter produces "real-time" estimates of the coefficients of a linear regression
• The Kalman filter is a linear optimal estimator: it infers parameters from indirect, inaccurate and uncertain observations
• For Gaussian noise, the Kalman filter minimizes the mean squared error of the estimated parameters
• Faulty intuition: the Kalman filter is used for prediction of future events based on past data, whereas regression (least squares) does smoothing between two end points
• This is not really true: both estimators (and almost all estimators you can think of) can do either job.

9/17/18 49

Page 46

END
