
Page 1: IAML: Linear Regression

Charles Sutton and Victor Lavrenko
School of Informatics

Semester 1

Page 2: Overview

- The linear model
- Fitting the linear model to data
- Probabilistic interpretation of the error function
- Examples of regression problems
- Dealing with multiple outputs
- Generalized linear regression
- Radial basis function (RBF) models
- Curse of dimensionality

Page 3: The Regression Problem

- Classification and regression problems:
  - Classification: the target of prediction is discrete
  - Regression: the target of prediction is continuous
- Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where x_i ∈ ℝ^D and y_i ∈ ℝ
- Today: linear regression, i.e., the relationship between x and y is linear, for example, ...

Page 4: One dimensional example

[Figure: a one-dimensional example dataset for linear regression; x ranges roughly over (−2, 3) and y over (0, 2.5).]

What would a linear model with a two-dimensional input space (x_1, x_2) look like geometrically?

Page 5: Two dimensional example

[Figure 3.1 from Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 3: Linear least squares fitting with X ∈ ℝ². We seek the linear function of X that minimizes the sum of squared residuals from Y.]

Figure: Hastie, Tibshirani, and Friedman

Page 6: The Linear Model

- Linear model (a sketch in code follows)

\[
f(x; w) = w_0 + w_1 x_1 + \cdots + w_D x_D = w^T \phi(x)
\]

  where φ(x) = (1, x_1, ..., x_D)^T = (1, x^T)^T

- The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.
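To make this concrete, here is a minimal NumPy sketch (our illustration, not code from the course) of the model f(x; w) = w^T φ(x):

```python
import numpy as np

def phi(x):
    """Basis map for the plain linear model: prepend a constant 1
    so that phi(x) = (1, x_1, ..., x_D)^T and w_0 acts as the bias."""
    return np.concatenate(([1.0], x))

def f(x, w):
    """Linear model f(x; w) = w^T phi(x)."""
    return w @ phi(x)

# Example with D = 2 inputs and weights (w_0, w_1, w_2):
w = np.array([0.5, 2.0, -1.0])
x = np.array([1.0, 3.0])
print(f(x, w))  # 0.5 + 2.0*1.0 - 1.0*3.0 = -0.5
```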


Page 8: Linear model (part 2)

In matrix notation:

- Design matrix is n × (D + 1)

\[
\Phi = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1D} \\
1 & x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nD}
\end{pmatrix}
\]

- x_ij is the j-th component of the training input x_i
- Let y = (y_1, ..., y_n)^T
- Then ŷ = Φw is the vector of the model's predicted values on the training inputs (see the sketch below).
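As a sketch (the helper name design_matrix is ours, not the course's), Φ and the predictions ŷ = Φw can be computed in NumPy like this:

```python
import numpy as np

def design_matrix(X):
    """Build the n x (D+1) design matrix by prepending a column of ones."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])      # n = 3 training inputs, D = 2
w = np.array([1.0, 0.5, -0.5])  # (w_0, w_1, w_2)

Phi = design_matrix(X)
y_hat = Phi @ w                 # predicted values on the training inputs
print(y_hat)                    # [0.5 0.5 0.5]
```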

Page 9: Solving for Model Parameters

This looks like what we've seen in linear algebra:

\[ y = \Phi w \]

We know y and Φ but not w.

So why not take w = Φ⁻¹y? (You can't, but why?)

Three reasons:
- Φ is not square: it is n × (D + 1).
- The system is overconstrained: n equations for only D + 1 parameters.
- The data has noise, so the equations are not exactly consistent.


Page 11: Loss function

Want a loss function O(w) that:
- we minimize with respect to w
- at the minimum, ŷ looks like y
- (recall: ŷ depends on w)

\[ \hat{y} = \Phi w \]

Page 12: Fitting a linear model to data

- Squared error makes the maths easy

\[
O(w) = \sum_{i=1}^{n} \bigl(y_i - f(x_i; w)\bigr)^2
     = \sum_{i=1}^{n} \bigl(y_i - w^T \phi(x_i)\bigr)^2
     = (y - \Phi w)^T (y - \Phi w)
\]

Page 13: Fitting procedure

- The error surface is a parabolic bowl

[Figure: the squared error E(w) plotted over the weights (w_0, w_1) forms a parabolic bowl. Figure: Tom Mitchell]

- Minimize O(w) = |y − Φw|²; there is an analytical solution (a sketch follows)

\[
w = (\Phi^T \Phi)^{-1} \Phi^T y
\]

- (Φ^T Φ)⁻¹ Φ^T is the pseudo-inverse of Φ
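An illustrative sketch of the analytical solution on synthetic data (our code, with an assumed noise level of 0.1). Forming the normal equations mirrors the formula above; np.linalg.lstsq solves the same least-squares problem more stably and is what one would normally call:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 2
X = rng.normal(size=(n, D))
Phi = np.hstack([np.ones((n, 1)), X])
w_true = np.array([1.0, 2.0, -3.0])          # (w_0, w_1, w_2)
y = Phi @ w_true + 0.1 * rng.normal(size=n)  # linear signal plus noise

# Normal-equations form of w = (Phi^T Phi)^{-1} Phi^T y:
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Equivalent least-squares solve, numerically preferable:
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_normal)   # both are close to w_true
print(w_lstsq)
```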

Page 14: Another Geometric Viewpoint

(Warning: slightly mind blowing)

[Figure 3.2 from Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 3: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.]

- This figure depicts ℝ^N
- x1 and x2 are columns of the design matrix
- The dashed line is the residual vector (checked numerically below).

Figure: Hastie, Tibshirani, and Friedman
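The projection can be verified numerically (an illustrative check of our own): the residual vector of a least-squares fit is orthogonal to every column of Φ, which is exactly what the figure depicts.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.normal(size=20)

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = y - Phi @ w   # the dashed line in the figure

# Orthogonality of the residual to the column space of Phi:
print(Phi.T @ residual)  # ~ [0, 0, 0] up to rounding error
```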

Page 15: Probabilistic interpretation of O(w)

- Assume that y = w^T x + ε, where ε ∼ N(0, σ²)
- (This is an exact linear relationship plus Gaussian noise.)
- This implies that y_i | x_i ∼ N(f(x_i; w), σ²), i.e.

\[
-\log p(y_i \mid x_i) = \log\sqrt{2\pi} + \log\sigma + \frac{(y_i - f(x_i; w))^2}{2\sigma^2}
\]

- So minimising O(w) is equivalent to maximising the likelihood!
- Can view w^T x as E[y | x].
- Squared residuals allow estimation of σ² (see the sketch below):

\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2
\]
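A quick numerical illustration of the last point (our sketch, with an assumed true σ of 0.3): averaging the squared residuals of the least-squares fit recovers the noise variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)  # true sigma = 0.3

Phi = np.column_stack([np.ones(n), x])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residuals = y - Phi @ w
sigma2_hat = np.mean(residuals ** 2)  # MLE of the noise variance
print(sigma2_hat)                     # close to 0.3**2 = 0.09
```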

Page 16: Fitting this into the general structure for learning algorithms

- Define the task: regression
- Decide on the model structure: linear regression model
- Decide on the score function: squared error (likelihood)
- Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

Page 17: Examples of regression problems

- Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
- Electricity load forecasting: generating hourly forecasts two days in advance (see W & F, §1.3)
- Predicting staffing requirements at help desks based on historical data and product and sales information
- Predicting the time to failure of equipment based on utilization and environmental conditions

Page 18: Sensitivity to Outliers

- Linear regression is sensitive to outliers
- Example: suppose y = 0.5x + ε, where ε ∼ N(0, √0.25), and then add a point (2.5, 3) (a sketch of the experiment follows):

[Figure: scatter plot of the generated data with the added point at (2.5, 3); x roughly in (0, 5), y in (0, 3).]
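A sketch of this experiment (our code; we take √0.25 = 0.5 as the noise standard deviation and assume 30 data points, which the slide does not specify): refitting after adding the single point (2.5, 3) visibly shifts the coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=30)
y = 0.5 * x + rng.normal(scale=0.5, size=30)

def fit_line(x, y):
    """Least-squares fit of (intercept, slope)."""
    Phi = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

print(fit_line(x, y))  # slope near 0.5, intercept near 0

# Add the outlier and refit: the line is pulled toward (2.5, 3)
x_out = np.append(x, 2.5)
y_out = np.append(y, 3.0)
print(fit_line(x_out, y_out))
```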

Page 19: Diagnostics

Graphical diagnostics can be useful for checking:
- Is the relationship obviously nonlinear? Look for structure in the residuals.
- Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.

Example: plot residuals against fitted values. Stats packages will do this for you (a hand-rolled version follows).
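For example, a residuals-versus-fitted plot can be built with matplotlib (our sketch; here the true relationship is quadratic, so the residuals of a straight-line fit show an obvious U-shape):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = x ** 2 + rng.normal(scale=0.2, size=100)  # truly nonlinear relationship

Phi = np.column_stack([np.ones_like(x), x])   # fit a straight line anyway
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ w

plt.scatter(fitted, y - fitted)
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")  # the U-shape reveals the missed nonlinearity
plt.show()
```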

Page 20: Dealing with multiple outputs

- Suppose there are q different targets for each input x
- We introduce a different w_i for each target dimension and do regression separately for each one (see the sketch below)
- This is called multiple regression
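A sketch in NumPy (our illustration): stacking the q targets as the columns of a matrix Y, a single call to np.linalg.lstsq fits all q regressions at once, producing one weight column per target.

```python
import numpy as np

rng = np.random.default_rng(5)
n, D, q = 100, 3, 2                   # q different targets per input
X = rng.normal(size=(n, D))
Phi = np.hstack([np.ones((n, 1)), X])
W_true = rng.normal(size=(D + 1, q))  # one weight column per target
Y = Phi @ W_true + 0.05 * rng.normal(size=(n, q))

# Each column of W is fit independently against the matching column of Y:
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.max(np.abs(W - W_true)))     # small: all q regressions recovered
```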

Page 21: Basis expansion

- We can easily transform the original attributes x non-linearly into φ(x) and do linear regression on them

[Figure: three families of basis functions plotted on (−1, 1): polynomial, Gaussians, sigmoids.]

Figure credit: Chris Bishop, PRML

Page 22: Basis expansion (continued)

- Design matrix is n × m

\[
\Phi = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_m(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_m(x_2) \\
\vdots & \vdots & & \vdots \\
\phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_m(x_n)
\end{pmatrix}
\]

- Let y = (y_1, ..., y_n)^T
- Minimize E(w) = |y − Φw|². As before we have an analytical solution (sketched below)

\[
w = (\Phi^T \Phi)^{-1} \Phi^T y
\]

- (Φ^T Φ)⁻¹ Φ^T is the pseudo-inverse of Φ
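An illustrative sketch (function names are ours) of fitting with an arbitrary list of basis functions via the pseudo-inverse solution:

```python
import numpy as np

def design_matrix(X, basis_funcs):
    """n x m design matrix: column j holds basis_funcs[j] applied to each input."""
    return np.column_stack([[f(x) for x in X] for f in basis_funcs])

def fit(X, y, basis_funcs):
    Phi = design_matrix(X, basis_funcs)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # pseudo-inverse solution
    return w

# Example basis with m = 3: constant, identity, square
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
X = np.linspace(-1, 1, 50)
y = 2.0 - X + 3.0 * X ** 2
print(fit(X, y, basis))  # ~ [2.0, -1.0, 3.0]
```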

Page 23: Example: polynomial regression

\[
\phi(x) = (1, x, x^2, \ldots, x^M)^T
\]

[Figure: polynomial fits of order M = 0, 1, 3, 9 to the same data, t against x on (0, 1); a sketch of the experiment follows.]

Figure credit: Chris Bishop, PRML
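A minimal sketch of polynomial regression (our code; following Bishop's example we generate 10 points from a noisy sinusoid): training error falls as M grows, and M = 9 can interpolate the 10 points exactly.

```python
import numpy as np

def poly_design(x, M):
    """phi(x) = (1, x, x^2, ..., x^M)^T for each scalar input."""
    return np.column_stack([x ** m for m in range(M + 1)])

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

for M in (0, 1, 3, 9):
    Phi = poly_design(x, M)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    train_err = np.mean((y - Phi @ w) ** 2)
    print(M, train_err)  # error shrinks with M; near zero at M = 9
```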

Page 24: Radial basis function (RBF) models

- Set φ_i(x) = exp(−|x − c_i|² / (2α²)).
- Need to position these "basis functions" at some prior chosen centres c_i and with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective.
- Finding the weights is the same as ever: the pseudo-inverse solution (sketched below).
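A sketch of an RBF fit (our code; centres chosen as a random subset of the datapoints, one of the methods the slide mentions, with an assumed width α = 1):

```python
import numpy as np

def rbf_design(X, centres, alpha):
    """Phi[i, j] = exp(-|x_i - c_j|^2 / (2 * alpha**2))."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / alpha ** 2)

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

centres = X[rng.choice(100, size=10, replace=False)]  # subset of the data
Phi = rbf_design(X, centres, alpha=1.0)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # pseudo-inverse solution
print(np.mean((y - Phi @ w) ** 2))                    # small training error
```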

Page 25: How big should M be?

- How many radial basis bumps do we need?
- Suppose we only needed 3 for a 1D regression problem. Then we might need 3^D for a D-dimensional problem.
- Danger of the curse of dimensionality (Bellman): the number of basis functions may need to grow exponentially with the dimensionality of x if all of the input space is explored.
- If M ≥ n then we can fit the training data exactly: danger of overfitting.
- Being able to adapt the basis functions in light of the data can help → neural networks (multilayer perceptrons).

Page 26: Summary

- There are many linear-in-the-parameters models, e.g. linear, polynomial, RBF
- The maximum likelihood solution is computationally efficient (pseudo-inverse)
- Danger of overfitting