
    Modern regression 1: Ridge regression

    Ryan Tibshirani

    Data Mining: 36-462/36-662

    March 19 2013

    Optional reading: ISL 6.2.1, ESL 3.4.1


    Reminder: shortcomings of linear regression

    Last time we talked about:

    1. Predictive ability: recall that we can decompose prediction error into squared bias and variance. Linear regression has low bias (zero bias) but suffers from high variance. So it may be worth sacrificing some bias to achieve a lower variance

    2. Interpretative ability: with a large number of predictors, it can be helpful to identify a smaller subset of important variables. Linear regression doesn't do this

    Also: linear regression is not defined when p > n (Homework 4)

    Setup: given fixed covariates $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, we observe

    $$y_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n,$$

    where $f: \mathbb{R}^p \to \mathbb{R}$ is unknown (think $f(x_i) = x_i^T \beta$ for a linear model) and $\epsilon_i \in \mathbb{R}$ with $\mathbb{E}[\epsilon_i] = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2$, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$
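
    For reference, the decomposition mentioned in point 1 above can be written out for the prediction error at a fixed input $x_0$ (a standard identity under this setup, stated here for completeness):

    $$\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \sigma^2 + \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{squared bias}} + \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}$$

    This is why the prediction errors in the examples below take the form $1 + \text{bias}^2 + \text{variance}$: the running example has $\sigma^2 = 1$.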


    Example: subset of small coefficients

    Recall our example: we have $n = 50$, $p = 30$, and $\sigma^2 = 1$. The true model is linear with 10 large coefficients (between 0.5 and 1) and 20 small ones (between 0 and 0.3). Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 7]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    We reasoned that we can do better by shrinking the coefficients, to reduce variance
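
    A minimal R sketch of this kind of experiment (the design matrix and the exact coefficient draws below are assumptions; the slides only state $n$, $p$, $\sigma^2$, and the coefficient ranges):

        # Monte Carlo estimate of the squared bias and variance of the
        # least squares fit, averaged over the training inputs.
        set.seed(1)
        n <- 50; p <- 30; sigma2 <- 1
        X <- matrix(rnorm(n * p), n, p)                  # fixed covariates (assumed design)
        beta <- c(runif(10, 0.5, 1), runif(20, 0, 0.3))  # 10 large, 20 small coefficients
        mu <- drop(X %*% beta)                           # true means f(x_i) = x_i^T beta

        nrep <- 2000
        fits <- replicate(nrep, {
          y <- mu + rnorm(n, sd = sqrt(sigma2))          # fresh noise each repetition
          lm.fit(X, y)$fitted.values                     # least squares fitted values
        })
        bias2    <- mean((rowMeans(fits) - mu)^2)        # average squared bias
        variance <- mean(apply(fits, 1, var))            # average variance
        c(bias2 = bias2, variance = variance, pred.err = sigma2 + bias2 + variance)

    With a design like this, the output should show a near-zero squared bias and a variance of roughly $p\sigma^2/n = 0.6$, in line with the numbers above.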


    [Figure: prediction error (y-axis, 1.50 to 1.80) versus amount of shrinkage (x-axis, low to high); curves for linear regression and ridge regression]

    Linear regression:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    Ridge regression, at its best:

        Squared bias ≈ 0.077
        Variance ≈ 0.403
        Pred. error ≈ 1 + 0.077 + 0.403 ≈ 1.48


    Ridge regression

    Ridge regression is like least squares but shrinks the estimated coefficients towards zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, the ridge regression coefficients are defined as

    $$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \underbrace{\lambda \|\beta\|_2^2}_{\text{Penalty}}$$

    Here $\lambda \geq 0$ is a tuning parameter, which controls the strength of the penalty term. Note that:

    When $\lambda = 0$, we get the linear regression estimate

    When $\lambda = \infty$, we get $\hat{\beta}^{\text{ridge}} = 0$

    For $\lambda$ in between, we are balancing two ideas: fitting a linear model of $y$ on $X$, and shrinking the coefficients
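
    Because the criterion is quadratic in $\beta$, the minimizer has the closed form $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. A minimal R sketch (assuming, per a later slide, that $y$ and the columns of $X$ have been centered so no intercept is needed):

        # Ridge coefficients via the closed form (X'X + lambda I)^(-1) X'y
        ridge_coef <- function(X, y, lambda) {
          solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
        }

    Note that $X^T X + \lambda I$ is invertible for any $\lambda > 0$ even when $p > n$, which addresses the shortcoming of linear regression noted earlier.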


    Example: visual representation of ridge coefficients

    Recall our last example ($n = 50$, $p = 30$, and $\sigma^2 = 1$; 10 large true coefficients, 20 small). Here is a visual representation of the ridge regression coefficients for $\lambda = 25$:

    [Figure: true, linear regression, and ridge regression coefficients plotted side by side]


    Important details

    When including an intercept term in the regression, we usually leave this coefficient unpenalized. Otherwise we could add some constant amount $c$ to the vector $y$, and this would not result in the same solution. Hence ridge regression with intercept solves

    $$(\hat{\beta}_0, \hat{\beta}^{\text{ridge}}) = \operatorname*{argmin}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \|y - \beta_0 \mathbb{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

    If we center the columns of $X$, then the intercept estimate ends up just being $\hat{\beta}_0 = \bar{y}$, so we usually just assume that $y$, $X$ have been centered and don't include an intercept

    Also, the penalty term $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is unfair if the predictor variables are not on the same scale. (Why?) Therefore, if we know that the variables are not measured in the same units, we typically scale the columns of $X$ (to have sample variance 1), and then we perform ridge regression
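
    Putting these two details together, a short sketch of the preprocessing (reusing the hypothetical ridge_coef from above):

        # Center y, center and scale the columns of X, then fit ridge.
        # With centered data the unpenalized intercept estimate is mean(y).
        standardized_ridge <- function(X, y, lambda) {
          Xs <- scale(X)        # center columns, scale to sample variance 1
          yc <- y - mean(y)     # center the response
          list(intercept = mean(y),
               beta = ridge_coef(Xs, yc, lambda))  # coefficients on the standardized scale
        }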


    Bias and variance of ridge regression

    The bias and variance are not quite as simple to write down for ridge regression as they were for linear regression, but closed-form expressions are still possible (Homework 4). Recall that

    $$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

    The general trend is:

    The bias increases as $\lambda$ (amount of shrinkage) increases

    The variance decreases as $\lambda$ (amount of shrinkage) increases

    What is the bias at $\lambda = 0$? The variance at $\lambda = \infty$?
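
    For reference (a standard calculation, not shown on the slide; it is essentially what Homework 4 asks for), starting from the closed form $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$:

    $$\mathrm{Bias}(\hat{\beta}^{\text{ridge}}) = -\lambda (X^T X + \lambda I)^{-1} \beta, \qquad \mathrm{Var}(\hat{\beta}^{\text{ridge}}) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X \,(X^T X + \lambda I)^{-1}$$

    Both expressions are consistent with the trend above: the bias is zero at $\lambda = 0$, and the variance shrinks to zero as $\lambda \to \infty$.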


    Example: bias and variance of ridge regression

    Bias and variance for our last example ($n = 50$, $p = 30$, $\sigma^2 = 1$; 10 large true coefficients, 20 small):

    [Figure: Bias^2 and Var of ridge regression as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.6]


    Mean squared error for our last example:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.8]

    Ridge regression in R: see the function lm.ridge in the package MASS, or the glmnet function and package
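
    A minimal usage sketch of both interfaces, with X an n x p numeric matrix and y the response vector (one caution: glmnet standardizes by default and parameterizes its penalty on a different scale, so its lambda values are not directly comparable to the $\lambda$ on these slides):

        library(MASS)
        library(glmnet)

        # MASS::lm.ridge -- formula interface, one fit per value in `lambda`
        fit_ridge <- lm.ridge(y ~ X, lambda = seq(0, 25, by = 0.5))

        # glmnet -- alpha = 0 selects the ridge penalty (alpha = 1 is the lasso)
        fit_glmnet <- glmnet(X, y, alpha = 0)
        coef(fit_glmnet, s = 0.5)   # coefficients at a chosen penalty level s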


    What you may (should) be thinking now

    Thought 1:

    Yeah, OK, but this only works for some values of $\lambda$. So how would we choose $\lambda$ in practice?

    This is actually quite a hard question. We'll talk about this in detail later

    Thought 2:

    What happens when none of the coefficients are small?

    In other words, if all the true coefficients are moderately large, is it still helpful to shrink the coefficient estimates? The answer is (perhaps surprisingly) still yes. But the advantage of ridge regression here is less dramatic, and the corresponding range for good values of $\lambda$ is smaller


    Example: moderate regression coefficients

    Same setup as our last example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Except now the true coefficients are all moderately large (between 0.5 and 1). Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 5]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.628
        Pred. error ≈ 1 + 0.006 + 0.628 ≈ 1.634

    Why are these numbers essentially the same as those from the last example, even though the true coefficients changed?


    Ridge regression can still outperform linear regression in terms of mean squared error:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 30; y-axis from 0.0 to 2.0]

    It only works for $\lambda$ less than about 5; otherwise ridge is very biased. (Why?)


    Variable selection

    At the other extreme (of a subset of small coefficients), suppose that there is a group of true coefficients that are identically zero. This means that the mean response doesn't depend on these predictors at all; they are completely extraneous.

    The problem of picking out the relevant variables from a larger set is called variable selection. In the linear model setting, this means estimating some coefficients to be exactly zero. Aside from predictive accuracy, this can be very important for the purposes of model interpretation

    Thought 3:

    How does ridge regression perform if a group of the true coefficients were exactly zero?

    The answer depends on whether we are interested in prediction or interpretation. We'll consider the former first


    Example: subset of zero coefficients

    Same general setup as our running example: $n = 50$, $p = 30$, and $\sigma^2 = 1$. Now, the true coefficients: 10 are large (between 0.5 and 1) and 20 are exactly 0. Histogram:

    [Histogram: true coefficients, x-axis from 0.0 to 1.0, frequency up to 20]

    The linear regression fit:

        Squared bias ≈ 0.006
        Variance ≈ 0.627
        Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633

    Note again that these numbers haven't changed


    Ridge regression performs well in terms of mean-squared error:

    [Figure: Linear MSE, Ridge MSE, Ridge Bias^2, and Ridge Var as functions of $\lambda$, for $\lambda$ from 0 to 25; y-axis from 0.0 to 0.8]

    Why is the bias not as large here for large $\lambda$?


    Remember that as we vary $\lambda$ we get different ridge regression coefficients; the larger the $\lambda$, the more shrunken. Here we plot them again:

    [Figure: ridge coefficient paths as functions of $\lambda$ (0 to 25), with the true nonzero and true zero coefficients distinguished by color]

    The red paths correspond to the true nonzero coefficients; the gray paths correspond to true zeros. The vertical dashed line at $\lambda = 15$ marks the point above which ridge regression's MSE starts losing to that of linear regression

    An important thing to notice is that the gray coefficient paths are not exactly zero; they are shrunken, but still nonzero


    Ridge regression doesn't perform variable selection

    We can show that ridge regression doesn't set coefficients exactly to zero unless $\lambda = \infty$, in which case they're all zero. Hence ridge regression cannot perform variable selection, and even though it performs well in terms of prediction accuracy, it does poorly in terms of offering a clear interpretation

    E.g., suppose that we are studying the level of prostate-specific antigen (PSA), which is often elevated in men who have prostate cancer. We look at n = 97 men with prostate cancer, and p = 8 clinical measurements.[1] We are interested in identifying a small number of predictors, say 2 or 3, that drive PSA.[2]

    [1] Data from Stamey et al. (1989), "Prostate specific antigen in the diag..."
    [2] Figure from http://www.mens-hormonal-health.com/psa-score.html


    Example: ridge regression coefficients for prostate data

    We perform ridge regression over a wide range of $\lambda$ values (after centering and scaling). The resulting coefficient profiles:

    [Figure: two panels of ridge coefficient profiles for the prostate data, plotted against $\lambda$ (0 to 1000) and against df($\lambda$) (0 to 8), for the variables lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45]
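
    A sketch of how profiles like these can be produced with lm.ridge; the `prostate` data frame is an assumption here (it accompanies ESL in several R packages), with the eight predictors above and response lpsa:

        library(MASS)

        # Drop any extra columns (e.g., a train/test indicator) before fitting;
        # lm.ridge centers and scales internally, so the coefficients in $coef
        # are on the standardized scale.
        fit <- lm.ridge(lpsa ~ ., data = prostate, lambda = seq(0, 1000, by = 5))
        matplot(fit$lambda, t(fit$coef), type = "l", lty = 1,
                xlab = "lambda", ylab = "Coefficients")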

    This doesn't give us a clear answer to our question ...


    Recap: ridge regression

    We learned ridge regression, which minimizes the usual regression criterion plus a penalty term on the squared $\ell_2$ norm of the coefficient vector. As such, it shrinks the coefficients towards zero. This introduces some bias, but can greatly reduce the variance, resulting in a better mean squared error

    The amount of shrinkage is controlled by $\lambda$, the tuning parameter that multiplies the ridge penalty. Large $\lambda$ means more shrinkage, and so we get different coefficient estimates for different values of $\lambda$. Choosing an appropriate value of $\lambda$ is important, and also difficult. We'll return to this later

    Ridge regression performs particularly well when there is a subset of true coefficients that are small or even zero. It doesn't do as well when all of the true coefficients are moderately large; however, in this case it can still outperform linear regression over a pretty narrow range of (small) $\lambda$ values


    Next time: the lasso

    The lasso combines some of the shrinking advantages of ridge with variable selection

    (From ESL page 71)
