
    Modern regression 1: Ridge regression

    Ryan Tibshirani

    Data Mining: 36-462/36-662

    March 19 2013

    Optional reading: ISL 6.2.1, ESL 3.4.1


    Reminder: shortcomings of linear regression

    Last time we talked about:

    1. Predictive ability: recall that we can decompose prediction error into squared bias and variance. Linear regression has low bias (zero bias) but suffers from high variance. So it may be worth sacrificing some bias to achieve a lower variance

    2. Interpretative ability: with a large number of predictors, it can be helpful to identify a smaller subset of important variables. Linear regression doesn't do this

    Also: linear regression is not defined when p > n (Homework 4)

    Setup: given fixed covariates $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, we observe

    $$y_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n,$$

    where $f: \mathbb{R}^p \to \mathbb{R}$ is unknown (think $f(x_i) = x_i^T \beta$ for a linear model) and $\epsilon_i \in \mathbb{R}$ with $\mathbb{E}[\epsilon_i] = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2$, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$
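
    For reference, the decomposition mentioned in point 1 above can be written out for the prediction error at a fixed input $x_0$ (a standard identity under this setup, stated here for completeness):

    $$\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \sigma^2 + \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{squared bias}} + \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}$$

    This is why the prediction errors in the examples below take the form $1 + \text{bias}^2 + \text{variance}$: the running example has $\sigma^2 = 1$.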


    Example: subset of small coefficients

    Recall our example: we have $n = 50$, $p = 30$, and $\sigma^2 = 1$. The true model is linear with 10 large coefficients (between 0.5 and 1) and 20 small ones (between 0 and 0.3). Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 7]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    We reasoned that we can do better by shrinking the coefficients, to reduce variance
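
    A minimal R sketch of this kind of experiment (the design matrix and the exact coefficient draws below are assumptions; the slides only state $n$, $p$, $\sigma^2$, and the coefficient ranges):

        # Monte Carlo estimate of the squared bias and variance of the
        # least squares fit, averaged over the training inputs.
        set.seed(1)
        n <- 50; p <- 30; sigma2 <- 1
        X <- matrix(rnorm(n * p), n, p)                  # fixed covariates (assumed design)
        beta <- c(runif(10, 0.5, 1), runif(20, 0, 0.3))  # 10 large, 20 small coefficients
        mu <- drop(X %*% beta)                           # true means f(x_i) = x_i^T beta

        nrep <- 2000
        fits <- replicate(nrep, {
          y <- mu + rnorm(n, sd = sqrt(sigma2))          # fresh noise each repetition
          lm.fit(X, y)$fitted.values                     # least squares fitted values
        })
        bias2    <- mean((rowMeans(fits) - mu)^2)        # average squared bias
        variance <- mean(apply(fits, 1, var))            # average variance
        c(bias2 = bias2, variance = variance, pred.err = sigma2 + bias2 + variance)

    With a design like this, the output should show a near-zero squared bias and a variance of roughly $p\sigma^2/n = 0.6$, in line with the numbers above.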


    [Figure: prediction error (y-axis, 1.50 to 1.80) versus amount of shrinkage (x-axis, low to high); curves for linear regression and ridge regression]

    Linear regression:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    Ridge regression, at its best:

        Squared bias ≈ 0.077
        Variance ≈ 0.403
        Pred. error ≈ 1 + 0.077 + 0.403 ≈ 1.48


    Ridge regression

    Ridge regression is like least squares but shrinks the estimated coefficients towards zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, the ridge regression coefficients are defined as

    $$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \underbrace{\lambda \|\beta\|_2^2}_{\text{Penalty}}$$

    Here $\lambda \geq 0$ is a tuning parameter, which controls the strength of the penalty term. Note that:

    When $\lambda = 0$, we get the linear regression estimate

    When $\lambda = \infty$, we get $\hat{\beta}^{\text{ridge}} = 0$

    For $\lambda$ in between, we are balancing two ideas: fitting a linear model of $y$ on $X$, and shrinking the coefficients
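
    Because the criterion is quadratic in $\beta$, the minimizer has the closed form $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. A minimal R sketch (assuming, per a later slide, that $y$ and the columns of $X$ have been centered so no intercept is needed):

        # Ridge coefficients via the closed form (X'X + lambda I)^(-1) X'y
        ridge_coef <- function(X, y, lambda) {
          solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
        }

    Note that $X^T X + \lambda I$ is invertible for any $\lambda > 0$ even when $p > n$, which addresses the shortcoming of linear regression noted earlier.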


    Example: visual representation of ridge coefficients

    Recall our last example ($n = 50$, $p = 30$, and $\sigma^2 = 1$; 10 large true coefficients, 20 small). Here is a visual representation of the ridge regression coefficients for $\lambda = 25$:

    [Figure: true, linear regression, and ridge regression coefficients plotted side by side]


    Important details

    When including an intercept term in the regression, we usually leave this coefficient unpenalized. Otherwise we could add some constant amount $c$ to the vector $y$, and this would not result in the same solution. Hence ridge regression with intercept solves

    $$(\hat{\beta}_0, \hat{\beta}^{\text{ridge}}) = \operatorname*{argmin}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \|y - \beta_0 \mathbb{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

    If we center the columns of $X$, then the intercept estimate ends up just being $\hat{\beta}_0 = \bar{y}$, so we usually just assume that $y$, $X$ have been centered and don't include an intercept

    Also, the penalty term $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is unfair if the predictor variables are not on the same scale. (Why?) Therefore, if we know that the variables are not measured in the same units, we typically scale the columns of $X$ (to have sample variance 1), and then we perform ridge regression
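
    Putting these two details together, a short sketch of the preprocessing (reusing the hypothetical ridge_coef from above):

        # Center y, center and scale the columns of X, then fit ridge.
        # With centered data the unpenalized intercept estimate is mean(y).
        standardized_ridge <- function(X, y, lambda) {
          Xs <- scale(X)        # center columns, scale to sample variance 1
          yc <- y - mean(y)     # center the response
          list(intercept = mean(y),
               beta = ridge_coef(Xs, yc, lambda))  # coefficients on the standardized scale
        }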


    Bias and variance of ridge regression

    The bias and variance are not quite as simple to write down for ridge regression as they were for linear regression, but closed-form expressions are still possible (Homework 4). Recall that

    $$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

    The general trend is:

    The bias increases as $\lambda$ (amount of shrinkage) increases

    The variance decreases as $\lambda$ (amount of shrinkage) increases

    What is the bias at $\lambda = 0$? The variance at $\lambda = \infty$?
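
    For reference (a standard calculation, not shown on the slide; it is essentially what Homework 4 asks for), starting from the closed form $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$:

    $$\mathrm{Bias}(\hat{\beta}^{\text{ridge}}) = -\lambda (X^T X + \lambda I)^{-1} \beta, \qquad \mathrm{Var}(\hat{\beta}^{\text{ridge}}) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X \,(X^T X + \lambda I)^{-1}$$

    Both expressions are consistent with the trend above: the bias is zero at $\lambda = 0$, and the variance shrinks to zero as $\lambda \to \infty$.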


    Example: bias and variance of ridge regression

    Bias and variance for our last example ($n = 50$, $p = 30$, $\sigma^2 = 1$; 10 large true coefficients, 20 small):

    [Figure: Bias^2 and Var of ridge regression as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.6]


    Mean squared error for our last example:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.8]

    Ridge regression in R: see the function lm.ridge in the package MASS, or the glmnet function and package
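
    A minimal usage sketch of both interfaces, with X an n x p numeric matrix and y the response vector (one caution: glmnet standardizes by default and parameterizes its penalty on a different scale, so its lambda values are not directly comparable to the $\lambda$ on these slides):

        library(MASS)
        library(glmnet)

        # MASS::lm.ridge -- formula interface, one fit per value in `lambda`
        fit_ridge <- lm.ridge(y ~ X, lambda = seq(0, 25, by = 0.5))

        # glmnet -- alpha = 0 selects the ridge penalty (alpha = 1 is the lasso)
        fit_glmnet <- glmnet(X, y, alpha = 0)
        coef(fit_glmnet, s = 0.5)   # coefficients at a chosen penalty level s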


    What you may (should) be thinking now

    Thought 1:

    Yeah, OK, but this only works for some values of $\lambda$. So how would we choose $\lambda$ in practice?

    This is actually quite a hard question. We'll talk about this in detail later

    Thought 2:

    What happens when none of the coefficients are small?

    In other words, if all the true coefficients are moderately large, is it still helpful to shrink the coefficient estimates? The answer is (perhaps surprisingly) still yes. But the advantage of ridge regression here is less dramatic, and the corresponding range for good values of $\lambda$ is smaller


    Example: moderate regression coefficients

    Same setup as our last example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Except now the true coefficients are all moderately large (between 0.5 and 1). Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 5]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.628
        Pred. error ≈ 1 + 0.006 + 0.628 ≈ 1.634

    Why are these numbers essentially the same as those from the last example, even though the true coefficients changed?


    Ridge regression can still outperform linear regression in terms of mean squared error:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 30; y-axis from 0.0 to 2.0]

    It only works for $\lambda$ less than about 5; otherwise ridge is very biased. (Why?)


    Variable selection

    At the other extreme (of a subset of small coefficients), suppose that there is a group of true coefficients that are identically zero. This means that the mean response doesn't depend on these predictors at all; they are completely extraneous.

    The problem of picking out the relevant variables from a larger set is called variable selection. In the linear model setting, this means estimating some coefficients to be exactly zero. Aside from predictive accuracy, this can be very important for the purposes of model interpretation

    Thought 3:

    How does ridge regression perform if a group of the true coefficients were exactly zero?

    The answer depends on whether we are interested in prediction or interpretation. We'll consider the former first


    Example: subset of zero coefficients

    Same general setup as our running example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Now, the true coefficients: 10 are large (between 0.5 and 1) and 20 are exactly 0. Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 20]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    Note again that these numbers haven't changed


    Ridge regression performs well in terms of mean-squared error:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.8]

    Why is the bias not as large here for large $\lambda$?


    Remember that as we vary $\lambda$ we get different ridge regression coefficients; the larger the $\lambda$, the more shrunken. Here we plot them again:

    [Figure: ridge coefficient paths as functions of $\lambda$ (0 to 25), with the true nonzero and true zero coefficients distinguished by color]

    The red paths correspond to the true nonzero coefficients; the gray paths correspond to true zeros. The vertical dashed line at $\lambda = 15$ marks the point above which ridge regression's MSE starts losing to that of linear regression

    An important thing to notice is that the gray coefficient paths are not exactly zero; they are shrunken, but still nonzero


    Ridge regression doesn't perform variable selection

    We can show that ridge regression doesn't set coefficients exactly to zero unless $\lambda = \infty$, in which case they're all zero. Hence ridge regression cannot perform variable selection, and even though it performs well in terms of prediction accuracy, it does poorly in terms of offering a clear interpretation

    E.g., suppose that we are studying the level of prostate-specific antigen (PSA), which is often elevated in men who have prostate cancer. We look at n = 97 men with prostate cancer, and p = 8 clinical measurements.[1] We are interested in identifying a small number of predictors, say 2 or 3, that drive PSA.[2]

    [1] Data from Stamey et al. (1989), "Prostate specific antigen in the diag..."
    [2] Figure from http://www.mens-hormonal-health.com/psa-score.html


    Example: ridge regression coefficients for prostate data

    We perform ridge regression over a wide range of $\lambda$ values (after centering and scaling). The resulting coefficient profiles:

    [Figure: two panels of ridge coefficient profiles for the prostate data, plotted against $\lambda$ (0 to 1000) and against df($\lambda$) (0 to 8), for the variables lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45]
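
    A sketch of how profiles like these can be produced with lm.ridge; the `prostate` data frame is an assumption here (it accompanies ESL in several R packages), with the eight predictors above and response lpsa:

        library(MASS)

        # Drop any extra columns (e.g., a train/test indicator) before fitting;
        # lm.ridge centers and scales internally, so the coefficients in $coef
        # are on the standardized scale.
        fit <- lm.ridge(lpsa ~ ., data = prostate, lambda = seq(0, 1000, by = 5))
        matplot(fit$lambda, t(fit$coef), type = "l", lty = 1,
                xlab = "lambda", ylab = "Coefficients")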

    This doesn't give us a clear answer to our question ...


    Recap: ridge regression

    We learned ridge regression, which minimizes the usual regression criterion plus a penalty term on the squared $\ell_2$ norm of the coefficient vector. As such, it shrinks the coefficients towards zero. This introduces some bias, but can greatly reduce the variance, resulting in a better mean squared error

    The amount of shrinkage is controlled by $\lambda$, the tuning parameter that multiplies the ridge penalty. Large $\lambda$ means more shrinkage, and so we get different coefficient estimates for different values of $\lambda$. Choosing an appropriate value of $\lambda$ is important, and also difficult. We'll return to this later

    Ridge regression performs particularly well when there is a subset of true coefficients that are small or even zero. It doesn't do as well when all of the true coefficients are moderately large; however, in this case it can still outperform linear regression over a pretty narrow range of (small) $\lambda$ values


    Next time: the lasso

    The lasso combines some of the shrinking advantages of ridge with variable selection

    (From ESL page 71)
