Linear Regression / Linear Regression with Shrinkage

TRANSCRIPT

Page 1:

Linear Regression / Linear Regression with Shrinkage

Some slides are due to Tommi Jaakkola, MIT AI Lab

Page 2:

Introduction

- The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
- Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.

Predicting vehicle fuel efficiency (mpg) from 8 attributes:

Page 3:

A generic regression problem

- The input attributes are given as fixed-length vectors $x \in \mathbb{R}^M$ that could come from different sources: raw inputs, transformations of inputs (log, square root, etc.) or basis functions.
- The outputs are assumed to be real-valued, $y \in \mathbb{R}$ (with some possible restrictions).
- Given $n$ i.i.d. training samples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ from an unknown distribution $P(x, y)$, the goal is to minimize the prediction error (loss) on new examples $(x, y)$ drawn at random from the same $P(x, y)$.
- An example of a loss function is the squared loss
$$L(f(x), y) = (f(x) - y)^2,$$
where $f(x)$ is our prediction for $x$.

Page 4:

Regression Function

- We need to define a class of functions (the types of predictions we make).

Linear prediction:
$$f(x; w_0, w_1) = w_0 + w_1 x$$
where $w_0, w_1$ are the parameters we need to estimate.

Page 5:

Linear Regression

Typically we have a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which we estimate the parameters $w_0, w_1$.

Each $x_i = (x_{i1}, \ldots, x_{iM})$ is a vector of measurements for the $i$-th case, or $h(x_i) = (h_0(x_i), \ldots, h_M(x_i))$, a basis expansion of $x_i$.

$$y(x) \approx f(x; w) = w_0 + \sum_{j=1}^{M} w_j h_j(x) = w^t h(x) \qquad (\text{define } h_0(x) = 1)$$

Page 6:

Basis Functions

- There are many basis functions we can use, e.g.:
  - Polynomial: $h_j(x) = x^{j-1}$
  - Radial basis functions: $h_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
  - Sigmoidal: $h_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
  - Splines, Fourier, Wavelets, etc.
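As an illustration (not from the lecture), a minimal numpy sketch of building polynomial and Gaussian RBF design matrices; the function names, centers, and widths below are made up:

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns 1, x, x^2, ..., x^M (the constant column plays the role of h_0 = 1)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** j for j in range(M + 1)]).T      # shape (n, M+1)

def rbf_basis(x, centers, s):
    """Gaussian RBFs h_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with a constant column prepended."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centers, dtype=float)[None, :]
    H = np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), H])

x = np.linspace(-2, 2, 50)
H_poly = polynomial_basis(x, M=3)
H_rbf = rbf_basis(x, centers=[-1.0, 0.0, 1.0], s=0.5)
```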

Page 7:

Estimation Criterion

- We need a fitting/estimation criterion to select appropriate values for the parameters $w_0, w_1$ based on the training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
- For example, we can use the empirical loss:
$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2$$
(note: the loss is the same one used in evaluation)
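A small numpy sketch of this empirical loss for the linear model (data and parameter values are illustrative):

```python
import numpy as np

def empirical_loss(w0, w1, x, y):
    """J_n(w0, w1) = (1/n) * sum_i (y_i - (w0 + w1 * x_i))^2."""
    residuals = np.asarray(y) - (w0 + w1 * np.asarray(x))
    return np.mean(residuals ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(empirical_loss(0.0, 1.0, x, y))   # mean squared error of the candidate fit w0=0, w1=1
```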

Page 8:

Empirical loss: motivation

- Ideally, we would like to find the parameters $w_0, w_1$ that minimize the expected loss (unlimited training data):
$$J(w_0, w_1) = E_{(x,y) \sim P} \left( y - f(x; w_0, w_1) \right)^2$$
where the expectation is over samples from $P(x, y)$.
- When the number of training examples $n$ is large:
$$E_{(x,y) \sim P} \left( y - f(x; w_0, w_1) \right)^2 \approx \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2$$
(expected loss on the left, empirical loss on the right)

Page 9:

Linear Regression Estimation

- Minimize the empirical squared loss:
$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right)^2$$
- By setting the derivatives with respect to $w_1, w_0$ to zero we get necessary conditions for the "optimal" parameter values:
$$\frac{\partial}{\partial w_1} J_n(w_0, w_1) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) x_i = 0$$
$$\frac{\partial}{\partial w_0} J_n(w_0, w_1) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) = 0$$
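Solving these two conditions gives the familiar closed form of simple linear regression; a minimal numpy sketch (data made up) that also checks both conditions numerically:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Solve the two optimality conditions for (w0, w1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = y.mean() - w1 * x.mean()
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 1.1, 1.9, 3.2])
w0, w1 = fit_simple_linear(x, y)

r = y - w0 - w1 * x
print(np.sum(r), np.sum(r * x))   # both residual sums are (numerically) zero
```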

Page 10:

Interpretation

The optimality conditions
$$-\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) x_i = 0, \qquad -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) = 0$$
ensure that the prediction error
$$\varepsilon_i = y_i - w_0 - w_1 x_i$$
is decorrelated with any linear function of the inputs.

Page 11:

Linear Regression: Matrix Form

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1^t \\ \vdots & \vdots \\ 1 & x_n^t \end{pmatrix}, \qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{pmatrix}$$

($X$ stacks the $n$ samples as rows and has $M+1$ columns, the first being all ones.)

$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right)^2 = \frac{1}{n} \left( y - Xw \right)^t \left( y - Xw \right)$$

Page 12:

Linear Regression Solution

- By setting the derivatives of $\left( y - Xw \right)^t \left( y - Xw \right)$ with respect to $w$ to zero,
$$\frac{2}{n} \left( X^t y - X^t X w \right) = 0$$
we get the solution:
$$\hat{w} = \left( X^t X \right)^{-1} X^t y$$

The solution is a linear function of the outputs y.
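A minimal numpy sketch of this closed form (illustrative data; np.linalg.lstsq is shown as the numerically preferred alternative to explicitly forming the normal equations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, M))])   # leading column of ones
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w_hat = (X^t X)^{-1} X^t y (solve rather than invert explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically robust route
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)
```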

Page 13:

Statistical view of linear regression

- In a statistical regression model we model both the function and the noise:

Observed output = function + noise
$$y(x) = f(x; w) + \varepsilon, \quad \text{where, e.g., } \varepsilon \sim N(0, \sigma^2)$$

- Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Page 14:

Statistical view of linear regression

- $f(x; w)$ is trying to capture the mean of the observations $y$ given the input $x$:
$$E[y \mid x] = E[f(x; w) + \varepsilon \mid x] = f(x; w)$$
- where $E[y \mid x]$ is the conditional expectation of $y$ given $x$, evaluated according to the model (not according to the underlying distribution of $X$).

Page 15:

Statistical view of linear regression

- According to our statistical model
$$y(x) = f(x; w) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
the outputs $y$ given $x$ are normally distributed with mean $f(x; w)$ and variance $\sigma^2$:
$$p(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - f(x; w) \right)^2 \right)$$
(we model the uncertainty in the predictions, not just the mean)

Page 16:

Maximum likelihood estimation

- Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we find the parameters $w$ that maximize the likelihood of the outputs:
$$L(w, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2 \right)$$
- Maximize the log-likelihood:
$$\log L(w, \sigma^2) = n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$
Maximizing the log-likelihood over $w$ therefore amounts to minimizing the sum of squared errors.

Page 17:

Maximum likelihood estimation

- Thus
$$w_{MLE} = \arg\min_w \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$
- But the empirical squared loss is
$$J_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$

Least-squares Linear Regression is MLE for Gaussian noise !!!
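As a sanity check of this equivalence, a small sketch (assuming numpy and scipy, with made-up data and a fixed σ²) minimizes the Gaussian negative log-likelihood numerically and compares the result with the closed-form least-squares fit:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([0.5, 2.0]) + 0.3 * rng.normal(size=n)
sigma2 = 0.3 ** 2

def neg_log_likelihood(w):
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(r ** 2) / sigma2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle, w_ls)   # the two estimates agree up to optimizer tolerance
```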

Page 18:

Linear Regression (is it “good”?)

- Simple model
- Straightforward solution

BUT

- MLS is not a good estimator for prediction error
- The matrix $X^T X$ could be ill-conditioned:
  - Inputs are correlated
  - Input dimension is large
  - Training set is small

Page 19:

Linear Regression - Example

In this example the output is generated by the model (where $\varepsilon$ is a small white Gaussian noise):
$$y = 0.8 x_1 + 0.2 x_2 + \varepsilon$$

Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

Page 20:

Linear Regression – Example

And the results are…

[Figure: the three fitted regressions (x2 vs x1). Panel labels and fit summaries: x1, x2 uncorrelated: w = {0.807, 0.209}, RSS = 0.090; x1, x2 correlated: w = {0.435, 0.568}, RSS = 0.105; x1, x2 strongly correlated: w = {30.24, -29.22}, RSS = 0.073.]

Strong correlation can cause the coefficients to be very large, and they will cancel each other out to achieve a good RSS.
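A rough simulation sketch of this effect (assuming numpy; the correlation levels and noise scale are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

def fit_and_report(rho):
    # Draw x1, x2 with correlation rho, generate y = 0.8*x1 + 0.2*x2 + noise, fit LS.
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y = 0.8 * x1 + 0.2 * x2 + 0.05 * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    print(f"rho={rho:5.3f}  w1={w[1]:8.2f}  w2={w[2]:8.2f}  RSS={rss:.3f}")

for rho in [0.0, 0.9, 0.999]:
    fit_and_report(rho)
# With rho close to 1, w1 and w2 can become huge and nearly cancel while RSS stays small.
```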

Page 21:

Linear Regression

What can be done?

Page 22:

Shrinkage methods

- Ridge Regression
- Lasso
- PCR (Principal Components Regression)
- PLS (Partial Least Squares)

Page 23:

Shrinkage methods

Before we proceed:

- Since the following methods are not invariant under input scale, we will assume the input is normalized (mean 0, variance 1):
$$x'_{ij} \leftarrow x_{ij} - \bar{x}_j, \qquad x_{ij} \leftarrow \frac{x'_{ij}}{\sigma_j}$$
- The offset is always estimated as $w_0 = \sum_i y_i / N$, and we will work with centered $y$ (meaning $y \leftarrow y - w_0$).
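A small numpy sketch of this preprocessing (data and dimensions are made up):

```python
import numpy as np

def standardize(X, y):
    """Center each input column to mean 0 / variance 1 and center y by w0 = mean(y)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    w0 = y.mean()
    return Xs, y - w0, w0

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 3.0
Xs, yc, w0 = standardize(X, y)
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6), w0)
```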

Page 24:

Ridge Regression

- Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
- In ridge regression, we add a quadratic penalty on the weights:
$$J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{M} w_j^2$$
where $\lambda \geq 0$ is a tuning parameter that controls the amount of shrinkage.
- The size constraint prevents the phenomenon of wildly large coefficients that cancel each other out.

Page 25:

Ridge Regression Solution

- Ridge regression in matrix form:
$$J(w) = \left( y - Xw \right)^t \left( y - Xw \right) + \lambda w^t w$$
- The solution is
$$\hat{w}^{ridge} = \left( X^t X + \lambda I_M \right)^{-1} X^t y, \qquad \hat{w}^{LS} = \left( X^t X \right)^{-1} X^t y$$
- The solution adds a positive constant to the diagonal of $X^t X$ before inversion. This makes the problem non-singular even if $X$ does not have full column rank.
- For orthogonal inputs the ridge estimates are scaled versions of the least squares estimates:
$$\hat{w}^{ridge} = \gamma \, \hat{w}^{LS}, \quad 0 \leq \gamma \leq 1$$
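A minimal numpy sketch of the ridge closed form on centered, standardized data (λ and the data are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^t X + lam*I)^{-1} X^t y on centered/scaled data."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = 0.99 * x1 + 0.01 * rng.normal(size=n)          # strongly correlated inputs
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 0.8 * x1 + 0.2 * x2 + 0.05 * rng.normal(size=n)
y = y - y.mean()

print(ridge_fit(X, y, lam=0.0))    # ordinary LS: coefficients may be large and cancel
print(ridge_fit(X, y, lam=1.0))    # ridge: coefficients are shrunk
```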

Page 26:

Ridge Regression (insights)

The matrix $X$ can be represented by its SVD:
$$X = U D V^T$$
- $U$ is an $N \times M$ matrix; its columns span the column space of $X$
- $D$ is an $M \times M$ diagonal matrix of singular values
- $V$ is an $M \times M$ matrix; its columns span the row space of $X$

Let's see how this method looks in the principal components coordinates of $X$.

Page 27:

Ridge Regression (insights)

In the rotated (principal components) coordinates, define
$$X' \equiv XV, \qquad w' \equiv V^T w, \qquad \text{so that } Xw = X'w'.$$

The least squares solution is given by $\hat{w}^{ls} = \left( X^t X \right)^{-1} X^t y$, so
$$\hat{w}'^{\,ls} = V^T \hat{w}^{ls} = V^T \left( V D U^T U D V^T \right)^{-1} V D U^T y = D^{-1} U^T y$$

The ridge regression solution is similarly given by
$$\hat{w}'^{\,ridge} = V^T \hat{w}^{ridge} = V^T \left( V D U^T U D V^T + \lambda I \right)^{-1} V D U^T y = \left( D^2 + \lambda I \right)^{-1} D U^T y = \left( D^2 + \lambda I \right)^{-1} D^2 \, \hat{w}'^{\,ls}$$
where $\left( D^2 + \lambda I \right)^{-1} D^2$ is diagonal.

Page 28:

Ridge Regression (insights)

$$\hat{w}'^{\,ridge}_j = \frac{d_j^2}{d_j^2 + \lambda} \, \hat{w}'^{\,ls}_j$$

In the PCA axes, the ridge coefficients are just scaled LS coefficients!

The coefficients that correspond to smaller input variance directions are scaled down more.

[Figure: data scattered along principal directions with singular values d1, d2.]
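A small numpy sketch (illustrative data) checking that the closed-form ridge solution matches this SVD view, i.e. that each principal-axis coefficient is the LS coefficient scaled by $d_j^2/(d_j^2 + \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 60, 3
X = rng.normal(size=(n, M))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
y = y - y.mean()
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T

w_ls_rot = (U.T @ y) / d                           # w'_ls = D^{-1} U^T y
w_ridge_rot = d**2 / (d**2 + lam) * w_ls_rot       # scaled by d_j^2 / (d_j^2 + lambda)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
print(np.allclose(Vt.T @ w_ridge_rot, w_ridge))    # True: both routes agree
```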

Page 29:

Ridge Regression

In the following simulations, quantities are plotted versus the quantity
$$df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$$
This monotonically decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
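A tiny sketch of this quantity (the singular values and λ values below are illustrative):

```python
import numpy as np

def effective_df(d, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.asarray(d, float)
    return np.sum(d**2 / (d**2 + lam))

d = np.array([3.0, 1.5, 0.2])                  # singular values of X (illustrative)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_df(d, lam))           # decreases from M toward 0 as lam grows
```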

Page 30:

Ridge Regression

Simulation results: 4 dimensions (w = 1, 2, 3, 4)

[Figure: ridge regression coefficients and RSS plotted versus the effective degrees of freedom, for a training set with no correlation between inputs and one with strong correlation.]

Page 31:

Ridge regression is MAP with Gaussian prior

$$\begin{aligned}
J(w) &= \log P(D \mid w) P(w) \\
&= \log \prod_{i=1}^{n} N(y_i \mid w^t x_i, \sigma^2) \, N(w \mid 0, \tau^2) \\
&= -\frac{1}{2\sigma^2} \left( y - Xw \right)^t \left( y - Xw \right) - \frac{1}{2\tau^2} w^t w + const
\end{aligned}$$

This is the same objective function that ridge solves, using $\lambda = \sigma^2 / \tau^2$:

Ridge:
$$J(w) = \left( y - Xw \right)^t \left( y - Xw \right) + \lambda w^t w$$
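A quick numeric check of this correspondence (illustrative data; σ and τ chosen arbitrarily): the ridge solution with λ = σ²/τ² coincides with the mode of the Gaussian posterior.

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 50, 2
X = rng.normal(size=(n, M))
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=n)

sigma2, tau2 = 0.2 ** 2, 1.0 ** 2
lam = sigma2 / tau2

# Ridge solution with lambda = sigma^2 / tau^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# MAP / posterior mode for Gaussian likelihood and Gaussian prior N(0, tau^2 I)
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(M) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))   # True
```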

Page 32:

Lasso

Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by
$$\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{M} x_{ij} w_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \leq t$$

There is a similarity to the ridge regression problem: the L2 ridge regression penalty is replaced by the L1 lasso penalty.

The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
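The lecture does not name an implementation; as an illustrative sketch, scikit-learn's Lasso (which solves the equivalent penalized form by coordinate descent rather than quadratic programming) shows the sparsity effect on made-up data:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
n, M = 80, 6
X = rng.normal(size=(n, M))
# Only two inputs actually matter; the rest are irrelevant.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

ls = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)     # alpha plays the role of the penalty strength

print(ls.coef_.round(3))               # all coefficients nonzero
print(lasso.coef_.round(3))            # irrelevant coefficients driven exactly to 0
```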

Page 33:

Lasso

If $t$ is chosen to be larger than
$$t_0 \equiv \sum_{j=1}^{M} \left| \hat{w}_j^{ls} \right|,$$
then the lasso estimates are identical to the least squares ones. On the other hand, for say $t = t_0/2$, the least squares coefficients are shrunk by about 50% on average.

For convenience we will plot the simulation results versus the shrinkage factor $s$:
$$s \equiv \frac{t}{t_0} = \frac{t}{\sum_{j=1}^{M} \left| \hat{w}_j^{ls} \right|}$$

Page 34:

Lasso

Simulation results:

[Figure: lasso coefficients and RSS plotted versus the shrinkage factor, for a training set with no correlation between inputs and one with strong correlation.]

Page 35:

L2 vs L1 penalties

[Figure: contours of the LS error function together with the constraint regions. L1: $|\beta_1| + |\beta_2| \leq t$; L2: $\beta_1^2 + \beta_2^2 \leq t^2$. In the L1 panel the solution sits at a corner where $\beta_1 = 0$.]

In the lasso the constraint region has corners; when the solution hits a corner the corresponding coefficient becomes 0 (when M > 2, possibly more than one).

Page 36:

Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.