
IAML: Linear Regression

Nigel Goddard
School of Informatics

Semester 1


Overview

- The linear model
- Fitting the linear model to data
- Probabilistic interpretation of the error function
- Examples of regression problems
- Dealing with multiple outputs
- Generalized linear regression
- Radial basis function (RBF) models


The Regression Problem

- Classification and regression problems:
  - Classification: the target of prediction is discrete
  - Regression: the target of prediction is continuous
- Training data: a set D of pairs (xᵢ, yᵢ) for i = 1, ..., n, where xᵢ ∈ ℝᴰ and yᵢ ∈ ℝ
- Today: linear regression, i.e., the relationship between x and y is linear.
- Although this is simple (and limited), it is:
  - more powerful than you would expect
  - the basis for more complex nonlinear methods
  - a good way to learn a lot about regression and classification


Examples of regression problems

- Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
- Electricity load forecasting: generating hourly forecasts two days in advance (see W & F, §1.3)
- Predicting staffing requirements at help desks based on historical data and product and sales information
- Predicting the time to failure of equipment based on utilization and environmental conditions


The Linear Model

- Linear model:

$$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = \boldsymbol{\phi}(\mathbf{x})\,\mathbf{w} \qquad (1)$$

  where $\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \dots, x_D) = (1, \mathbf{x}^T)$ and $\mathbf{w} = (w_0, w_1, \dots, w_D)^T$.

- The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.
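
As a concrete sketch (illustrative numpy, not from the slides; the helper names are ours):

```python
import numpy as np

def phi(x):
    """Feature map: prepend a constant 1 so that w0 acts as the intercept."""
    return np.concatenate(([1.0], x))

def f(x, w):
    """Linear model f(x; w) = phi(x) . w"""
    return phi(x) @ w

w = np.array([0.5, 2.0, -1.0])   # (w0, w1, w2), i.e. D = 2
x = np.array([3.0, 4.0])
print(f(x, w))                   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```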


Toy example: Data

[Scatter plot of the toy data: x from −3 to 3, y from −2 to 4.]


Toy example: Data

[The toy data scatter plot shown in two panels (x from −3 to 3, y from −2 to 4).]


With two features

[Figure 3.1 from The Elements of Statistical Learning (2nd ed., Hastie, Tibshirani & Friedman, 2009, ch. 3): linear least squares fitting with X ∈ ℝ². We seek the linear function of X that minimizes the sum of squared residuals from Y.]

Instead of a line, a plane. With more features, a hyperplane.

Figure: Hastie, Tibshirani, and Friedman


With more features

CPU Performance Data Set

- Predict PRP: published relative performance
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)


With more features

PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX


In matrix notation

- The design matrix is n × (D + 1):

$$\Phi = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1D} \\ 1 & x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nD} \end{pmatrix}$$

- xᵢⱼ is the jth component of the training input xᵢ
- Let y = (y₁, ..., yₙ)ᵀ
- Then ŷ = Φw is ...?


Linear Algebra: The 1-Slide Version

What is matrix multiplication?

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

First consider matrix times vector, i.e., Ab. Two answers:

1. Ab is a linear combination of the columns of A:

$$A\mathbf{b} = b_1 \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \end{pmatrix} + b_2 \begin{pmatrix} a_{12} \\ a_{22} \\ a_{32} \end{pmatrix} + b_3 \begin{pmatrix} a_{13} \\ a_{23} \\ a_{33} \end{pmatrix}$$

2. Ab is a vector. Each element of the vector is the dot product between b and one row of A:

$$A\mathbf{b} = \begin{pmatrix} (a_{11}, a_{12}, a_{13})\,\mathbf{b} \\ (a_{21}, a_{22}, a_{23})\,\mathbf{b} \\ (a_{31}, a_{32}, a_{33})\,\mathbf{b} \end{pmatrix}$$
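
Both views are easy to check numerically (an illustrative sketch):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
b = np.array([1., 0., 2.])

# View 1: a linear combination of the columns of A
combo = sum(b[j] * A[:, j] for j in range(3))

# View 2: one dot product per row of A
dots = np.array([A[i, :] @ b for i in range(3)])

assert np.allclose(combo, A @ b)
assert np.allclose(dots, A @ b)
```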


Linear model (part 2)

In matrix notation:

- The design matrix is n × (D + 1):

$$\Phi = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1D} \\ 1 & x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nD} \end{pmatrix}$$

- xᵢⱼ is the jth component of the training input xᵢ
- Let y = (y₁, ..., yₙ)ᵀ
- Then ŷ = Φw is the model's predicted values on the training inputs.
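
A minimal numpy sketch of this step, with made-up data (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # n = 5 inputs, D = 2 features
w = np.array([0.5, 2.0, -1.0])               # (w0, w1, w2)

# Design matrix: a column of ones, then the raw features
Phi = np.column_stack([np.ones(len(X)), X])  # shape (n, D + 1)

y_hat = Phi @ w                              # predictions on the training inputs
print(y_hat)
```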


Solving for Model Parameters

This looks like what we’ve seen in linear algebra

y = Φw

We know y and Φ but not w.

So why not take w = Φ⁻¹y? (You can't, but why?)

Three reasons:

- Φ is not square. It is n × (D + 1).
- The system is overconstrained (n equations for D + 1 parameters); in other words,
- the data has noise.


Loss function

Want a loss function O(w) that:

- we minimize with respect to w,
- at its minimum, makes ŷ look like y
- (recall: ŷ depends on w)

ŷ = Φw


Fitting a linear model to data

[Figure 3.1 again (Hastie, Tibshirani & Friedman): the least squares fit in ℝ², with the residuals drawn as sticks from each point to the fitted plane.]

- A common choice: squared error (makes the maths easy):

$$O(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

- In the picture, this is the sum of the squared lengths of the black sticks.
- (Each difference yᵢ − wᵀxᵢ is called a residual.)


Fitting a linear model to data

$$O(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = (\mathbf{y} - \Phi\mathbf{w})^T (\mathbf{y} - \Phi\mathbf{w})$$

- We want to minimize this with respect to w.
- The error surface is a parabolic bowl:

[Surface plot of E(w) over (w₀, w₁): a parabolic bowl. Figure: Tom Mitchell]

- How do we do this?


The Solution

- Answer: to minimize $O(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$, set the partial derivatives to 0.
- This has an analytical solution:

$$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$$

- (ΦᵀΦ)⁻¹Φᵀ is the pseudo-inverse of Φ.
- First check: does this make sense? Do the matrix dimensions line up?
- Then: why is this called a pseudo-inverse?
- Finally: what happens if there are no features?
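
A numpy sketch of this solution on synthetic data (our own example; in practice np.linalg.lstsq is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

Phi = np.column_stack([np.ones_like(x), x])

# Textbook form: w = (Phi^T Phi)^{-1} Phi^T y
w_direct = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

# Numerically preferable: solve the least-squares problem directly
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_direct, w_lstsq)   # both close to the true (1.0, 0.5)
```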


Probabilistic interpretation of O(w)

- Assume that y = wᵀx + ε, where ε ∼ N(0, σ²).
- (This is an exact linear relationship plus Gaussian noise.)
- This implies that yᵢ | xᵢ ∼ N(wᵀxᵢ, σ²), i.e.

$$-\log p(y_i \mid \mathbf{x}_i) = \log\sqrt{2\pi} + \log\sigma + \frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}$$

- So minimising O(w) is equivalent to maximising the likelihood!
- We can view wᵀx as E[y | x].
- The squared residuals allow estimation of σ²:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$
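
A small simulation sketch (invented parameters) showing that this estimate recovers σ:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.4, size=n)   # true sigma = 0.4

Phi = np.column_stack([np.ones(n), x])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residuals = y - Phi @ w
sigma2_hat = np.mean(residuals ** 2)   # the maximum-likelihood estimate of sigma^2
print(np.sqrt(sigma2_hat))             # should be close to 0.4
```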


Fitting this into the general structure for learning algorithms:

- Define the task: regression
- Decide on the model structure: the linear regression model
- Decide on the score function: squared error (likelihood)
- Decide on the optimization/search method to optimize the score function: calculus (analytic solution)


Sensitivity to Outliers

- Linear regression is sensitive to outliers.
- Example: suppose y = 0.5x + ε, where ε ∼ N(0, √0.25), and then add a point at (2.5, 3):

[Scatter plot of the data plus the added point: x from 0 to 5, y from 0 to 3.]
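
A quick simulation of this setup (a sketch; we read √0.25 as the noise standard deviation, which is our assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=20)
y = 0.5 * x + rng.normal(scale=np.sqrt(0.25), size=20)

def fit(x, y):
    Phi = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w                                       # (intercept, slope)

print(fit(x, y))                                   # slope near 0.5
print(fit(np.append(x, 2.5), np.append(y, 3.0)))   # shifted by a single point
```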


Diagnostics

Graphical diagnostics can be useful for checking:

- Is the relationship obviously nonlinear? Look for structure in the residuals.
- Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.

Examples: plot the residuals against the fitted values. Stats packages will do this for you.
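
For instance, a residuals-versus-fitted plot takes a few lines (an illustrative sketch using deliberately nonlinear synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)   # deliberately nonlinear

Phi = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ w

plt.scatter(fitted, y - fitted)   # curvature in this plot flags nonlinearity
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```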


Dealing with multiple outputs

- Suppose there are q different targets for each input x.
- We introduce a different wᵢ for each target dimension, and do regression separately for each one.
- This is called multiple regression.
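
Because every output shares the same design matrix Φ, all q regressions can be solved in one least-squares call (a numpy sketch with invented dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, D, q = 100, 3, 2                        # q targets per input
X = rng.normal(size=(n, D))
W_true = rng.normal(size=(D + 1, q))

Phi = np.column_stack([np.ones(n), X])
Y = Phi @ W_true + rng.normal(scale=0.1, size=(n, q))

# One call fits all q regressions at once: W gets one column per target
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.max(np.abs(W - W_true)))          # small
```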


Basis expansion

- We can easily transform the original attributes x non-linearly into φ(x) and do linear regression on the result:

[Three panels of example basis functions on [−1, 1]: polynomials, Gaussians, sigmoids.]

Figure credit: Chris Bishop, PRML


- The design matrix is now n × m:

$$\Phi = \begin{pmatrix} \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \dots & \phi_m(\mathbf{x}_1) \\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \dots & \phi_m(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(\mathbf{x}_n) & \phi_2(\mathbf{x}_n) & \dots & \phi_m(\mathbf{x}_n) \end{pmatrix}$$

- Let y = (y₁, ..., yₙ)ᵀ
- Minimize E(w) = |y − Φw|². As before we have an analytical solution:

$$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$$

- (ΦᵀΦ)⁻¹Φᵀ is the pseudo-inverse of Φ


Example: polynomial regression

$\phi(x) = (1, x, x^2, \dots, x^M)^T$

[Four panels: polynomial fits with M = 0, 1, 3, 9 to the same data on [0, 1].]

Figure credit: Chris Bishop, PRML
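
A sketch of polynomial regression through the generalized design matrix (the sin-plus-noise data is our stand-in for the figure's, an assumption):

```python
import numpy as np

def poly_features(x, M):
    """phi(x) = (1, x, x^2, ..., x^M) for each scalar x."""
    return np.column_stack([x ** m for m in range(M + 1)])

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

for M in (0, 1, 3, 9):
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(M, np.sum((y - Phi @ w) ** 2))   # training error falls as M grows
```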


More about the features

- Transforming the features can be important.
- Example: suppose I want to predict CPU performance.
- Maybe one of the features is the manufacturer:

$$x_1 = \begin{cases} 1 & \text{if Intel} \\ 2 & \text{if AMD} \\ 3 & \text{if Apple} \\ 4 & \text{if Motorola} \end{cases}$$

- Let's use this as a feature. Will this work?
- Not the way you want. Do you really believe AMD is double Intel?


More about the features

- Instead, convert this into 0/1 indicators:

x₁ = 1 if Intel, 0 otherwise
x₂ = 1 if AMD, 0 otherwise
...

- Note this is a consequence of linearity. We saw something similar with text in the first week.
- Other good transformations: log, square, square root.
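
In code, the 0/1 ("one-hot") encoding is a small helper (an illustrative sketch):

```python
import numpy as np

MAKERS = ["Intel", "AMD", "Apple", "Motorola"]

def one_hot(maker):
    """0/1 indicator vector with one entry per manufacturer."""
    return np.array([1.0 if maker == m else 0.0 for m in MAKERS])

print(one_hot("AMD"))   # [0. 1. 0. 0.]
```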


Radial basis function (RBF) models

- Set $\phi_i(\mathbf{x}) = \exp\left(-\tfrac{1}{2}\,|\mathbf{x} - \mathbf{c}_i|^2 / \alpha^2\right)$.
- We need to position these "basis functions" at centres cᵢ chosen in advance, with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective.
- Finding the weights is the same as ever: the pseudo-inverse solution.
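
A compact sketch of RBF regression on synthetic data (our own example; the centres 3 and 6 with α = 1 echo the feature slides below, and the target function is invented):

```python
import numpy as np

def rbf_features(x, centres, alpha):
    """phi_i(x) = exp(-0.5 * (x - c_i)^2 / alpha^2), plus a bias column."""
    dists2 = (x[:, None] - centres[None, :]) ** 2
    return np.column_stack([np.ones(len(x)), np.exp(-0.5 * dists2 / alpha**2)])

rng = np.random.default_rng(7)
x = rng.uniform(0, 7, size=80)
y = (np.exp(-0.5 * (x - 3) ** 2) + 2 * np.exp(-0.5 * (x - 6) ** 2)
     + rng.normal(scale=0.1, size=80))

Phi = rbf_features(x, np.array([3.0, 6.0]), alpha=1.0)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # pseudo-inverse solution, as ever
residuals = y - Phi @ w
print(w, np.mean(residuals ** 2))
```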


RBF example

[Plot of the example data: x from 0 to 7, y from roughly −0.5 to 2.0.]


An RBF feature

[Two panels. Left: the original data, y against x. Right: y plotted against the value of the RBF feature with c₁ = 3, α₁ = 1.]


Another RBF feature

Notice how the feature functions “specialize” in input space.

[Two panels. Left: the original data. Right: y plotted against the value of the RBF feature with c₂ = 6, α₂ = 1.]


RBF example

Run the RBF model with both basis functions above, and plot the residuals

$$y_i - \boldsymbol{\phi}(\mathbf{x}_i)^T \mathbf{w}$$

[Two panels. Left: the original data. Right: the residuals of the RBF model plotted against x.]


RBF: Ay, there’s the rub

- So why not use RBFs for everything?
- Short answer: you might need too many basis functions.
- This is especially true in high dimensions (we'll say more later).
- Too many means you probably overfit.
- Extreme example: centre one on each training point.
- Also: notice that we haven't yet seen in the course how to learn the RBF parameters, i.e., the mean and standard deviation of each kernel.
- Main point of presenting RBFs now: to set up later methods like support vector machines.


Summary

- Linear regression is often useful out of the box.
- It is more useful than it would seem, because "linear" means linear in the parameters. You can apply a nonlinear transform to the data first, e.g., polynomial or RBF features. This point will come up again.
- The maximum likelihood solution is computationally efficient (pseudo-inverse).
- There is a danger of overfitting, especially with many features or basis functions.
