Linear Regression / Linear Regression with Shrinkage

TRANSCRIPT

Page 1:

Linear Regression / Linear Regression with Shrinkage

Some slides are due to Tommi Jaakkola, MIT AI Lab

Page 2:

Introduction

- The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
- Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.

Predicting vehicle fuel efficiency (mpg) from 8 attributes:

Page 3:

A generic regression problem

- The input attributes are given as fixed-length vectors $x \in \mathbb{R}^M$ that could come from different sources: raw inputs, transformations of inputs (log, square root, etc.) or basis functions.
- The outputs are assumed to be real-valued, $y \in \mathbb{R}$ (with some possible restrictions).
- Given $n$ i.i.d. training samples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ from an unknown distribution $P(x, y)$, the goal is to minimize the prediction error (loss) on new examples $(x, y)$ drawn at random from the same $P(x, y)$.
- An example of a loss function is the squared loss
$$L(f(x), y) = (f(x) - y)^2,$$
where $f(x)$ is our prediction for $x$.

Page 4:

Regression Function

- We need to define a class of functions (the types of predictions we make).

Linear prediction:
$$f(x; w_0, w_1) = w_0 + w_1 x$$
where $w_0, w_1$ are the parameters we need to estimate.

Page 5:

Linear Regression

Typically we have a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which we estimate the parameters $w_0, w_1$.

Each $x_i = (x_{i1}, \ldots, x_{iM})$ is a vector of measurements for the $i$-th case, or $h(x_i) = (h_0(x_i), \ldots, h_M(x_i))$, a basis expansion of $x_i$.

$$y(x) \approx f(x; w) = w_0 + \sum_{j=1}^{M} w_j h_j(x) = w^t h(x) \qquad (\text{define } h_0(x) = 1)$$

Page 6:

Basis Functions

- There are many basis functions we can use, e.g.:
  - Polynomial: $h_j(x) = x^{j-1}$
  - Radial basis functions: $h_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
  - Sigmoidal: $h_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
  - Splines, Fourier, Wavelets, etc.
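As an illustration (not from the lecture), a minimal numpy sketch of building polynomial and Gaussian RBF design matrices; the function names, centers, and widths below are made up:

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns 1, x, x^2, ..., x^M (the constant column plays the role of h_0 = 1)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** j for j in range(M + 1)]).T      # shape (n, M+1)

def rbf_basis(x, centers, s):
    """Gaussian RBFs h_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with a constant column prepended."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centers, dtype=float)[None, :]
    H = np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), H])

x = np.linspace(-2, 2, 50)
H_poly = polynomial_basis(x, M=3)
H_rbf = rbf_basis(x, centers=[-1.0, 0.0, 1.0], s=0.5)
```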

Page 7:

Estimation Criterion

- We need a fitting/estimation criterion to select appropriate values for the parameters $w_0, w_1$ based on the training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
- For example, we can use the empirical loss:
$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2$$
(note: the loss is the same one used in evaluation)
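A small numpy sketch of this empirical loss for the linear model (data and parameter values are illustrative):

```python
import numpy as np

def empirical_loss(w0, w1, x, y):
    """J_n(w0, w1) = (1/n) * sum_i (y_i - (w0 + w1 * x_i))^2."""
    residuals = np.asarray(y) - (w0 + w1 * np.asarray(x))
    return np.mean(residuals ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(empirical_loss(0.0, 1.0, x, y))   # mean squared error of the candidate fit w0=0, w1=1
```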

Page 8:

Empirical loss: motivation

- Ideally, we would like to find the parameters $w_0, w_1$ that minimize the expected loss (unlimited training data):
$$J(w_0, w_1) = E_{(x,y) \sim P} \left( y - f(x; w_0, w_1) \right)^2$$
where the expectation is over samples from $P(x, y)$.
- When the number of training examples $n$ is large:
$$E_{(x,y) \sim P} \left( y - f(x; w_0, w_1) \right)^2 \approx \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2$$
(expected loss on the left, empirical loss on the right)

Page 9:

Linear Regression Estimation

- Minimize the empirical squared loss:
$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w_0, w_1) \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right)^2$$
- By setting the derivatives with respect to $w_1, w_0$ to zero we get necessary conditions for the "optimal" parameter values:
$$\frac{\partial}{\partial w_1} J_n(w_0, w_1) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) x_i = 0$$
$$\frac{\partial}{\partial w_0} J_n(w_0, w_1) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) = 0$$
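Solving these two conditions gives the familiar closed form of simple linear regression; a minimal numpy sketch (data made up) that also checks both conditions numerically:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Solve the two optimality conditions for (w0, w1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = y.mean() - w1 * x.mean()
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 1.1, 1.9, 3.2])
w0, w1 = fit_simple_linear(x, y)

r = y - w0 - w1 * x
print(np.sum(r), np.sum(r * x))   # both residual sums are (numerically) zero
```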

Page 10:

Interpretation

The optimality conditions
$$-\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) x_i = 0, \qquad -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right) = 0$$
ensure that the prediction error
$$\varepsilon_i = y_i - w_0 - w_1 x_i$$
is decorrelated with any linear function of the inputs.

Page 11:

Linear Regression: Matrix Form

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1^t \\ \vdots & \vdots \\ 1 & x_n^t \end{pmatrix}, \qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{pmatrix}$$

($X$ stacks the $n$ samples as rows and has $M+1$ columns, the first being all ones.)

$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_i \right)^2 = \frac{1}{n} \left( y - Xw \right)^t \left( y - Xw \right)$$

Page 12:

Linear Regression Solution

- By setting the derivatives of $\left( y - Xw \right)^t \left( y - Xw \right)$ with respect to $w$ to zero,
$$\frac{2}{n} \left( X^t y - X^t X w \right) = 0$$
we get the solution:
$$\hat{w} = \left( X^t X \right)^{-1} X^t y$$

The solution is a linear function of the outputs y.
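A minimal numpy sketch of this closed form (illustrative data; np.linalg.lstsq is shown as the numerically preferred alternative to explicitly forming the normal equations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, M))])   # leading column of ones
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w_hat = (X^t X)^{-1} X^t y (solve rather than invert explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically robust route
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)
```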

Page 13:

Statistical view of linear regression

- In a statistical regression model we model both the function and the noise:

Observed output = function + noise
$$y(x) = f(x; w) + \varepsilon, \quad \text{where, e.g., } \varepsilon \sim N(0, \sigma^2)$$

- Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Page 14:

Statistical view of linear regression

- $f(x; w)$ is trying to capture the mean of the observations $y$ given the input $x$:
$$E[y \mid x] = E[f(x; w) + \varepsilon \mid x] = f(x; w)$$
- where $E[y \mid x]$ is the conditional expectation of $y$ given $x$, evaluated according to the model (not according to the underlying distribution of $X$).

Page 15:

Statistical view of linear regression

- According to our statistical model
$$y(x) = f(x; w) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
the outputs $y$ given $x$ are normally distributed with mean $f(x; w)$ and variance $\sigma^2$:
$$p(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - f(x; w) \right)^2 \right)$$
(we model the uncertainty in the predictions, not just the mean)

Page 16:

Maximum likelihood estimation

- Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we find the parameters $w$ that maximize the likelihood of the outputs:
$$L(w, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2 \right)$$
- Maximize the log-likelihood:
$$\log L(w, \sigma^2) = n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$
Maximizing the log-likelihood over $w$ therefore amounts to minimizing the sum of squared errors.

Page 17:

Maximum likelihood estimation

- Thus
$$w_{MLE} = \arg\min_w \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$
- But the empirical squared loss is
$$J_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$$

Least-squares Linear Regression is MLE for Gaussian noise !!!
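As a sanity check of this equivalence, a small sketch (assuming numpy and scipy, with made-up data and a fixed σ²) minimizes the Gaussian negative log-likelihood numerically and compares the result with the closed-form least-squares fit:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([0.5, 2.0]) + 0.3 * rng.normal(size=n)
sigma2 = 0.3 ** 2

def neg_log_likelihood(w):
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(r ** 2) / sigma2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle, w_ls)   # the two estimates agree up to optimizer tolerance
```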

Page 18:

Linear Regression (is it “good”?)

- Simple model
- Straightforward solution

BUT

- MLS is not a good estimator for prediction error
- The matrix $X^T X$ could be ill-conditioned:
  - Inputs are correlated
  - Input dimension is large
  - Training set is small

Page 19:

Linear Regression - Example

In this example the output is generated by the model (where $\varepsilon$ is a small white Gaussian noise):
$$y = 0.8 x_1 + 0.2 x_2 + \varepsilon$$

Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

Page 20:

Linear Regression – Example

And the results are…

[Figure: the three fitted regressions (x2 vs x1). Panel labels and fit summaries: x1, x2 uncorrelated: w = {0.807, 0.209}, RSS = 0.090; x1, x2 correlated: w = {0.435, 0.568}, RSS = 0.105; x1, x2 strongly correlated: w = {30.24, -29.22}, RSS = 0.073.]

Strong correlation can cause the coefficients to be very large, and they will cancel each other out to achieve a good RSS.
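A rough simulation sketch of this effect (assuming numpy; the correlation levels and noise scale are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

def fit_and_report(rho):
    # Draw x1, x2 with correlation rho, generate y = 0.8*x1 + 0.2*x2 + noise, fit LS.
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y = 0.8 * x1 + 0.2 * x2 + 0.05 * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    print(f"rho={rho:5.3f}  w1={w[1]:8.2f}  w2={w[2]:8.2f}  RSS={rss:.3f}")

for rho in [0.0, 0.9, 0.999]:
    fit_and_report(rho)
# With rho close to 1, w1 and w2 can become huge and nearly cancel while RSS stays small.
```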

Page 21:

Linear Regression

What can be done?

Page 22:

Shrinkage methods

- Ridge Regression
- Lasso
- PCR (Principal Components Regression)
- PLS (Partial Least Squares)

Page 23:

Shrinkage methods

Before we proceed:

- Since the following methods are not invariant under input scale, we will assume the input is normalized (mean 0, variance 1):
$$x'_{ij} \leftarrow x_{ij} - \bar{x}_j, \qquad x_{ij} \leftarrow \frac{x'_{ij}}{\sigma_j}$$
- The offset is always estimated as $w_0 = \sum_i y_i / N$, and we will work with centered $y$ (meaning $y \leftarrow y - w_0$).
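A small numpy sketch of this preprocessing (data and dimensions are made up):

```python
import numpy as np

def standardize(X, y):
    """Center each input column to mean 0 / variance 1 and center y by w0 = mean(y)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    w0 = y.mean()
    return Xs, y - w0, w0

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 3.0
Xs, yc, w0 = standardize(X, y)
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6), w0)
```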

Page 24:

Ridge Regression

- Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
- In ridge regression, we add a quadratic penalty on the weights:
$$J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{M} w_j^2$$
where $\lambda \geq 0$ is a tuning parameter that controls the amount of shrinkage.
- The size constraint prevents the phenomenon of wildly large coefficients that cancel each other out.

Page 25:

Ridge Regression Solution

- Ridge regression in matrix form:
$$J(w) = \left( y - Xw \right)^t \left( y - Xw \right) + \lambda w^t w$$
- The solution is
$$\hat{w}^{ridge} = \left( X^t X + \lambda I_M \right)^{-1} X^t y, \qquad \hat{w}^{LS} = \left( X^t X \right)^{-1} X^t y$$
- The solution adds a positive constant to the diagonal of $X^t X$ before inversion. This makes the problem non-singular even if $X$ does not have full column rank.
- For orthogonal inputs the ridge estimates are scaled versions of the least squares estimates:
$$\hat{w}^{ridge} = \gamma \, \hat{w}^{LS}, \quad 0 \leq \gamma \leq 1$$
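A minimal numpy sketch of the ridge closed form on centered, standardized data (λ and the data are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^t X + lam*I)^{-1} X^t y on centered/scaled data."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = 0.99 * x1 + 0.01 * rng.normal(size=n)          # strongly correlated inputs
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 0.8 * x1 + 0.2 * x2 + 0.05 * rng.normal(size=n)
y = y - y.mean()

print(ridge_fit(X, y, lam=0.0))    # ordinary LS: coefficients may be large and cancel
print(ridge_fit(X, y, lam=1.0))    # ridge: coefficients are shrunk
```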

Page 26:

Ridge Regression (insights)

The matrix $X$ can be represented by its SVD:
$$X = U D V^T$$
- $U$ is an $N \times M$ matrix; its columns span the column space of $X$
- $D$ is an $M \times M$ diagonal matrix of singular values
- $V$ is an $M \times M$ matrix; its columns span the row space of $X$

Let's see how this method looks in the principal components coordinates of $X$.

Page 27:

Ridge Regression (insights)

In the rotated (principal components) coordinates, define
$$X' \equiv XV, \qquad w' \equiv V^T w, \qquad \text{so that } Xw = X'w'.$$

The least squares solution is given by $\hat{w}^{ls} = \left( X^t X \right)^{-1} X^t y$, so
$$\hat{w}'^{\,ls} = V^T \hat{w}^{ls} = V^T \left( V D U^T U D V^T \right)^{-1} V D U^T y = D^{-1} U^T y$$

The ridge regression solution is similarly given by
$$\hat{w}'^{\,ridge} = V^T \hat{w}^{ridge} = V^T \left( V D U^T U D V^T + \lambda I \right)^{-1} V D U^T y = \left( D^2 + \lambda I \right)^{-1} D U^T y = \left( D^2 + \lambda I \right)^{-1} D^2 \, \hat{w}'^{\,ls}$$
where $\left( D^2 + \lambda I \right)^{-1} D^2$ is diagonal.

Page 28:

Ridge Regression (insights)

$$\hat{w}'^{\,ridge}_j = \frac{d_j^2}{d_j^2 + \lambda} \, \hat{w}'^{\,ls}_j$$

In the PCA axes, the ridge coefficients are just scaled LS coefficients!

The coefficients that correspond to smaller input variance directions are scaled down more.

[Figure: data scattered along principal directions with singular values d1, d2.]
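A small numpy sketch (illustrative data) checking that the closed-form ridge solution matches this SVD view, i.e. that each principal-axis coefficient is the LS coefficient scaled by $d_j^2/(d_j^2 + \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 60, 3
X = rng.normal(size=(n, M))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
y = y - y.mean()
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T

w_ls_rot = (U.T @ y) / d                           # w'_ls = D^{-1} U^T y
w_ridge_rot = d**2 / (d**2 + lam) * w_ls_rot       # scaled by d_j^2 / (d_j^2 + lambda)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
print(np.allclose(Vt.T @ w_ridge_rot, w_ridge))    # True: both routes agree
```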

Page 29:

Ridge Regression

In the following simulations, quantities are plotted versus the quantity
$$df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$$
This monotonically decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
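A tiny sketch of this quantity (the singular values and λ values below are illustrative):

```python
import numpy as np

def effective_df(d, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.asarray(d, float)
    return np.sum(d**2 / (d**2 + lam))

d = np.array([3.0, 1.5, 0.2])                  # singular values of X (illustrative)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_df(d, lam))           # decreases from M toward 0 as lam grows
```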

Page 30:

Ridge Regression

Simulation results: 4 dimensions (w = 1, 2, 3, 4)

[Figure: ridge regression coefficients and RSS plotted versus the effective degrees of freedom, for a training set with no correlation between inputs and one with strong correlation.]

Page 31:

Ridge regression is MAP with Gaussian prior

$$\begin{aligned}
J(w) &= \log P(D \mid w) P(w) \\
&= \log \prod_{i=1}^{n} N(y_i \mid w^t x_i, \sigma^2) \, N(w \mid 0, \tau^2) \\
&= -\frac{1}{2\sigma^2} \left( y - Xw \right)^t \left( y - Xw \right) - \frac{1}{2\tau^2} w^t w + const
\end{aligned}$$

This is the same objective function that ridge solves, using $\lambda = \sigma^2 / \tau^2$:

Ridge:
$$J(w) = \left( y - Xw \right)^t \left( y - Xw \right) + \lambda w^t w$$
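A quick numeric check of this correspondence (illustrative data; σ and τ chosen arbitrarily): the ridge solution with λ = σ²/τ² coincides with the mode of the Gaussian posterior.

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 50, 2
X = rng.normal(size=(n, M))
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=n)

sigma2, tau2 = 0.2 ** 2, 1.0 ** 2
lam = sigma2 / tau2

# Ridge solution with lambda = sigma^2 / tau^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# MAP / posterior mode for Gaussian likelihood and Gaussian prior N(0, tau^2 I)
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(M) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))   # True
```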

Page 32:

Lasso

Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by
$$\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{M} x_{ij} w_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \leq t$$

There is a similarity to the ridge regression problem: the L2 ridge regression penalty is replaced by the L1 lasso penalty.

The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
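The lecture does not name an implementation; as an illustrative sketch, scikit-learn's Lasso (which solves the equivalent penalized form by coordinate descent rather than quadratic programming) shows the sparsity effect on made-up data:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
n, M = 80, 6
X = rng.normal(size=(n, M))
# Only two inputs actually matter; the rest are irrelevant.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

ls = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)     # alpha plays the role of the penalty strength

print(ls.coef_.round(3))               # all coefficients nonzero
print(lasso.coef_.round(3))            # irrelevant coefficients driven exactly to 0
```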

Page 33:

Lasso

If $t$ is chosen to be larger than
$$t_0 \equiv \sum_{j=1}^{M} \left| \hat{w}_j^{ls} \right|,$$
then the lasso estimates are identical to the least squares ones. On the other hand, for say $t = t_0/2$, the least squares coefficients are shrunk by about 50% on average.

For convenience we will plot the simulation results versus the shrinkage factor $s$:
$$s \equiv \frac{t}{t_0} = \frac{t}{\sum_{j=1}^{M} \left| \hat{w}_j^{ls} \right|}$$

Page 34:

Lasso

Simulation results:

[Figure: lasso coefficients and RSS plotted versus the shrinkage factor, for a training set with no correlation between inputs and one with strong correlation.]

Page 35:

L2 vs L1 penalties

[Figure: contours of the LS error function together with the constraint regions. L1: $|\beta_1| + |\beta_2| \leq t$; L2: $\beta_1^2 + \beta_2^2 \leq t^2$. In the L1 panel the solution sits at a corner where $\beta_1 = 0$.]

In the lasso the constraint region has corners; when the solution hits a corner the corresponding coefficient becomes 0 (when M > 2, possibly more than one).

Page 36:

Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.