
Univariate shrinkage in the Cox model for high dimensional data

Robert Tibshirani ∗

January 6, 2009

Abstract

We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors) p exceeds the number of observations n. The method assumes that features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and to nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We compare the new method to other proposed methods for survival prediction with a large number of predictors.

Keywords: survival data, proportional hazards model, Cox model, shrinkage, subset selection

1 Introduction

High-dimensional regression problems are challenging, and a current focus of statistical research. One simplifying assumption that can be useful is independence of the features. While this assumption will rarely be even approximately true, in practice it leads to simple estimates whose predictive performance is sometimes as good as or better than that of more ambitious multivariate methods.

∗Depts. of Health Research & Policy, and Statistics, Stanford Univ., Stanford CA 94305; [email protected]


In linear regression with squared error loss and a lasso penalty, independence of the features leads to the univariate soft-thresholding estimates, which often behave well.

In classification problems, the idea of class-wise independence forms the basis for diagonal discriminant analysis. In that setting, if we assume that the features are independent within each class, and incorporate a lasso penalty, the resulting estimator is the nearest shrunken centroid procedure proposed by Tibshirani et al. (2001); we review this fact in the next section. The nearest shrunken centroid classifier is very simple but useful in practice, and has many attractive properties. It has a single tuning parameter that automatically selects the features to retain in the model, and it provides a single ranking of all features that is independent of the value of the tuning parameter.

We review the details of the regression and classification problems in the next section. Then we extend the idea of class-wise feature independence to the Cox model for survival data: this is the main contribution of the paper. We study this new procedure on both real and simulated datasets, and compare it to competing methods for this problem.

2 Regression and classification using feature independence

2.1 Regression

Consider the usual regression situation: we have data (x_i, y_i), i = 1, 2, . . . , n, where x_i = (x_i1, . . . , x_ip)^T and y_i are the regressors and response for the ith observation. Assume the x_ij are standardized so that Σ_i x_ij/n = 0 and Σ_i x_ij²/n = 1. The lasso finds β = (β_1, . . . , β_p)^T to minimize

\sum_{i=1}^{n} \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.   (1)

Now if we assume that the features are independent (uncorrelated), it is easy to show that

\hat\beta_j = \mathrm{sign}(\hat\beta^o_j)\,(|\hat\beta^o_j| - \lambda)_+   (2)

where β̂^o_j are the ordinary (univariate) least squares estimates. These are univariate soft-thresholded estimates, and they perform particularly well when p ≫ n. Zou & Hastie (2005) study these estimates as a special case of their elastic net procedure. See Fan & Fan (2008) for some theoretical support of univariate soft-thresholding.
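For illustration, here is a minimal Python sketch of the soft-thresholding estimate (2); the NumPy implementation, the function name, and the assumption of standardized columns are illustrative choices, not code from the original analysis.

```python
# Minimal sketch of univariate soft-thresholding (equation (2)).
# Assumes each column of X is standardized: sum_i x_ij / n = 0, sum_i x_ij^2 / n = 1.
import numpy as np

def univariate_soft_threshold(X, y, lam):
    """Return soft-thresholded univariate least squares coefficients."""
    n = X.shape[0]
    beta_ols = X.T @ y / n  # univariate OLS estimates under this standardization
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
```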


2.2 Classification

Here we review the nearest shrunken centroid method for high-dimensional classification and its connection to the lasso. Consider the diagonal-covariance linear discriminant rule for classification. The discriminant score for class k is

\delta_k(x^*) = -\sum_{j=1}^{p} \frac{(x^*_j - \bar{x}_{kj})^2}{s_j^2} + 2 \log \pi_k.   (3)

Here x* = (x*_1, x*_2, . . . , x*_p)^T is a vector of expression values for a test observation, s_j is the pooled within-class standard deviation of the jth gene, and x̄_kj = Σ_{i∈C_k} x_ij/n_k is the mean of the n_k values for gene j in class k, with C_k being the index set for class k. We call x̃_k = (x̄_k1, x̄_k2, . . . , x̄_kp)^T the centroid of class k. The first part of (3) is simply the (negative) standardized squared distance of x* to the kth centroid. The second part is a correction based on the class prior probability π_k, where Σ_{k=1}^K π_k = 1. The classification rule is then

C(x^*) = \ell \quad \text{if} \quad \delta_\ell(x^*) = \max_k \delta_k(x^*).   (4)

We see that the diagonal LDA classifier is equivalent to a nearest centroid classifier after appropriate standardization.
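A minimal sketch of the discriminant score (3) and rule (4), assuming precomputed class centroids, pooled standard deviations and class priors; the function name and array layout are illustrative assumptions.

```python
# Sketch of diagonal-covariance LDA (equations (3)-(4)).
import numpy as np

def diagonal_lda_predict(x_star, centroids, s, priors):
    """centroids: K x p class means; s: pooled within-class sd per feature (length p);
    priors: class prior probabilities (length K). Returns the predicted class index."""
    # delta_k = -sum_j (x*_j - xbar_kj)^2 / s_j^2 + 2 log pi_k
    deltas = -np.sum((x_star - centroids) ** 2 / s ** 2, axis=1) + 2.0 * np.log(priors)
    return int(np.argmax(deltas))
```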

The nearest shrunken centroid classifier regularizes this further, and is defined as follows. Let

d_{kj} = \frac{\bar{x}_{kj} - \bar{x}_j}{m_k (s_j + s_0)},   (5)

where x̄_j is the overall mean for gene j, m_k² = 1/n_k − 1/n, and s_0 is a small positive constant, typically chosen to be the median of the s_j values. This constant guards against large d_kj values that arise from expression values near zero. Each d_kj is then shrunken towards zero by soft thresholding, and the resulting shrunken centroids replace the raw centroids x̄_kj in the discriminant rule (3).
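A sketch of the statistics (5) and their soft-thresholding follows, assuming classes are coded as integer labels and writing the shrinkage threshold as delta; this is illustrative code, not the authors' implementation.

```python
# Sketch of the nearest shrunken centroid statistics (equation (5)) and their shrinkage.
import numpy as np

def shrunken_centroids(X, labels, delta, s0=None):
    """X: n x p data; labels: length-n integer class labels; delta: shrinkage threshold."""
    classes = np.unique(labels)
    n, p = X.shape
    xbar = X.mean(axis=0)                                    # overall mean per gene
    centroids = np.array([X[labels == k].mean(axis=0) for k in classes])
    # pooled within-class standard deviation per gene
    ss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum(axis=0)
             for k in classes)
    s = np.sqrt(ss / (n - len(classes)))
    if s0 is None:
        s0 = np.median(s)                                    # small positive constant
    nk = np.array([(labels == k).sum() for k in classes])
    mk = np.sqrt(1.0 / nk - 1.0 / n)
    d = (centroids - xbar) / (mk[:, None] * (s + s0))        # equation (5)
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0) # soft thresholding
    # shrunken class centroids, used in place of the raw centroids in rule (3)
    return xbar + mk[:, None] * (s + s0) * d_shrunk
```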

Finally, we can derive the nearest shrunken centroid classifier as the solution to a lasso-regularized problem that assumes independence of the features. Consider a (naive Bayes) Gaussian model for classification in which the features j = 1, 2, . . . , p are assumed to be independent within each class k = 1, 2, . . . , K. With observations i = 1, 2, . . . , n and C_k equal to the set of indices of the n_k observations in class k, we observe x_ij ∼ N(µ_j + µ_jk, σ_j²) for i ∈ C_k, with Σ_{k=1}^K µ_jk = 0. We set σ̂_j² = s_j², the pooled within-class variance for feature j, and consider the lasso-style minimization problem

\min_{\{\mu_j, \mu_{jk}\}} \left\{ \frac{1}{2} \sum_{j=1}^{p} \sum_{k=1}^{K} \sum_{i \in C_k} \frac{(x_{ij} - \mu_j - \mu_{jk})^2}{s_j^2} + \lambda \sum_{j=1}^{p} \sum_{k=1}^{K} \sqrt{n_k}\, \frac{|\mu_{jk}|}{s_j} \right\}.   (6)

By direct calculation, it is easy to show that the solution is equivalent to the nearest shrunken centroid estimator (5), with s_0 set to zero and m_k² equal to 1/n_k instead of 1/n_k − 1/n as before. Some interesting theoretical results for diagonal linear discriminant analysis are given in Bickel & Levina (2004). Efron (2008) proposes an Empirical Bayes variation on nearest shrunken centroids.


3 Survival data and Cox's proportional hazards model

For the rest of this paper we consider the right-censored survival data setting. The data available are of the form (y_1, x_1, δ_1), . . . , (y_n, x_n, δ_n), the survival time y_i being complete if δ_i = 1 and right censored if δ_i = 0, with x_i denoting the usual vector of predictors (x_i1, x_i2, . . . , x_ip) for the ith individual. Denote the distinct failure times by t_1 < · · · < t_K, there being d_k failures at time t_k.

The proportional-hazards model for survival data, also known as the Cox model, assumes that

\lambda(t \mid x) = \lambda_0(t) \exp\Big( \sum_j x_j \beta_j \Big)   (7)

where λ(t|x) is the hazard at time t given predictor values x = (x_1, . . . , x_p), and λ_0(t) is an arbitrary baseline hazard function.

One usually estimates the parameter β = (β_1, β_2, . . . , β_p)^T in the proportional-hazards model (7), without specification of λ_0(t), through maximization of the partial likelihood

PL(\beta) = \prod_{k \in D} \frac{\exp(x_k^T \beta)}{\sum_{m \in R_k} \exp(x_m^T \beta)}.

Here D is the set of indices of the failure times, R_k is the set of indices of the individuals at risk at time t_k − 0, and we have assumed there are no ties in the survival times. In the case of ties, we use the "Breslow" approximation to the partial likelihood. We also assume that the censoring is noninformative, so that the construction of the partial likelihood is justified.

If p ≫ n then the maximizer of the partial likelihood is not unique, and some regularization must be used. One possibility is to include a "lasso" penalty and jointly optimize over the parameters, as suggested by Tibshirani (1997).

Here we extend the idea of feature independence to the Cox model for survival data. The partial likelihood term exp(x_k^T β)/Σ_{m∈R_k} exp(x_m^T β) equals Pr(k | x_i, i ∈ R_k), the probability that individual k is the one who fails at time t_k, given the risk set and their feature vectors. Now we assume that, both conditionally on each risk set and marginally, the features are independent of one another. Then using Bayes' theorem we obtain Pr(k | x_i, i ∈ R_k) ∝ Π_j Pr(k | x_ij, i ∈ R_k). As a result, up to a proportionality constant, the partial likelihood becomes a simple product

PL(\beta) = c \cdot \prod_{j=1}^{p} \prod_{k \in D} \frac{\exp(x_{kj} \beta_j)}{\sum_{m \in R_k} \exp(x_{mj} \beta_j)}.


The log partial likelihood is

\ell(\beta) = \sum_{j=1}^{p} \sum_{k=1}^{K} \Big( x_{kj} \beta_j - \log \sum_{m \in R_k} \exp(x_{mj} \beta_j) \Big)   (8)
            \equiv \sum_{j=1}^{p} g_j(\beta_j).   (9)

We propose the Cox univariate shrinkage (CUS) estimator as the maximizer of the penalized partial log-likelihood

J(\beta) = \sum_{j=1}^{p} g_j(\beta_j) - \lambda \sum_j |\beta_j|.   (10)

Here λ ≥ 0 is a tuning parameter, and we solve this problem for a range of λ values. Note that this is just a set of one-dimensional maximizations, since we can maximize each function g_j(β_j) − λ|β_j| separately. The maximizers of (10) are not simply soft-thresholded versions of the unpenalized estimates β̂_j, as they are in the least squares regression case. We discuss and illustrate this fact in the next section. From this, we obtain solution paths (β̂_1(λ), β̂_2(λ), . . . , β̂_p(λ)), and we can use these to predict survival in a separate test sample.
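Because (10) separates over features, each coefficient can be computed by a one-dimensional search. The sketch below maximizes g_j(β_j) − λ|β_j| for a single feature, assuming no tied failure times and using SciPy's bounded scalar minimizer; the search interval and function names are illustrative assumptions, not the author's implementation.

```python
# Sketch of the Cox univariate shrinkage (CUS) coordinate problem:
# maximize g_j(beta) - lambda * |beta| for one feature (no ties assumed).
import numpy as np
from scipy.optimize import minimize_scalar

def g_j(beta, x, time, event):
    """Log partial likelihood (8) for a single feature x (length n)."""
    ll = 0.0
    for k in np.where(event == 1)[0]:
        risk = time >= time[k]                      # risk set R_k
        ll += x[k] * beta - np.log(np.sum(np.exp(x[risk] * beta)))
    return ll

def cus_coordinate(x, time, event, lam, bound=5.0):
    """Maximize g_j(beta) - lam * |beta| over a bounded interval."""
    obj = lambda b: -(g_j(b, x, time, event) - lam * abs(b))
    res = minimize_scalar(obj, bounds=(-bound, bound), method="bounded")
    # keep an exact zero when the penalized objective is not improved over beta = 0
    return res.x if -res.fun > -obj(0.0) else 0.0
```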

Note that the estimates β̂ will tend to be biased towards zero, especially for larger values of λ. In the test set and cross-validation computations in this paper, we fit the predictor z = xβ̂ in the new (test or validation) data, and so obtain a scale factor to debias the estimate. If we are interested in the values of β̂ themselves, a better approach would be to fit the single predictor γx_iβ̂ in a Cox model on the training data, and hence obtain the adjusted estimates γ̂β̂.

The lasso penalty above might not make sense if the predictors are in different units. Therefore we first standardize each predictor x_j by s_j, the square root of the observed (negative) Fisher information at β_j = 0.

3.1 Example

Zhao et al. (2005) collected gene expression data on 14,814 genes from 177 kidney cancer patients. Survival times (possibly censored) were also measured for each patient. The data were split into 88 samples to form the training set; the remaining 89 formed the test set. Figure 1 shows the solution paths for Cox univariate shrinkage for 500 randomly chosen genes.

Notice that the solution paths have a particularly nice form: they are monotonically decreasing in absolute value, and once they hit zero, they stay at zero. It is easy to prove that this holds in general; the proof is given in the Appendix.

Figure 1: Kidney cancer example: solution paths for univariate shrinkage, for 500 randomly chosen genes.

Note also that the profiles decrease at different rates. This is somewhat surprising. In the least squares regression case under orthogonality of the predictors, the lasso estimates are simply the soft-thresholded versions of the least squares estimates β̂_j, that is, sign(β̂_j)(|β̂_j| − t)_+. Thus every profile decreases at the same rate. Here, each profile decreases at rate 1/|g_j″(β̂_j)|, the reciprocal of the observed Fisher information.

Let U_j and V_j be the gradient of the (unpenalized) log partial likelihood g_j and the (negative) observed Fisher information, both evaluated at β_j = 0. Then we can also show that

\hat\beta_j(\lambda) \ne 0 \iff |U_j| / \sqrt{V_j} > \lambda.   (11)

The proof is given in the Appendix. Therefore we have the following useful fact: for any value of λ, the set of predictors that appear in the model estimated by Cox univariate shrinkage is just the set of those whose score statistic exceeds λ in absolute value.

That is, the Cox univariate shrinkage method uses a fixed ranking of all features based on the Cox score statistic. This is very convenient for interpretation, as the Cox score is often used for determining the univariate significance of features (e.g. in the SAM procedure of Tusher et al. (2001)). However, the non-zero coefficients given to these predictors are not simply the score statistics, or soft-thresholded versions of them.
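The score statistics can be computed directly from the risk sets. The sketch below evaluates U_j/√V_j at β_j = 0 for every feature and applies rule (11), assuming no tied failure times; the names are illustrative assumptions.

```python
# Sketch: Cox score statistics at beta = 0 and the CUS active set (equation (11)).
import numpy as np

def cox_score_stats(X, time, event):
    """Return U_j / sqrt(V_j) for each column of X (no ties assumed)."""
    U = np.zeros(X.shape[1])
    V = np.zeros(X.shape[1])
    for k in np.where(event == 1)[0]:
        R = time >= time[k]                 # risk set at the k-th failure
        U += X[k] - X[R].mean(axis=0)       # gradient contribution at beta = 0
        V += X[R].var(axis=0)               # observed information contribution at beta = 0
    return U / np.sqrt(V)

# Features entering the CUS model at penalty lam:
# active = np.abs(cox_score_stats(X, time, event)) > lam
```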

Figure 3 shows the drop in test sample deviance for CUS (Cox univariate shrinkage), lasso, SPC (supervised principal components) and UST (univariate soft thresholding), plotted against the number of genes in the model. The CUS and SPC methods work best.

Figure 2: Kidney cancer example: cross-validation curve. Drop in deviance on the validation set, with plus and minus one standard error bars.

Figure 4 shows the univariate Cox coefficients plotted against the Cox score statistics, for the kidney cancer data. Although the two measures are correlated, this correlation is far from perfect, and again this underlines the difference between our proposed univariate shrinkage method, which enters features based on the size of the Cox score, and simple soft-thresholding of the univariate coefficients. The latter ranks features based on the size of their unregularized coefficient, while the former looks at the standardized effect size at the null end of the scale.

To estimate the tuning parameter λ we can use cross-validation. However it is not clear how to apply cross-validation in the setting of partial likelihood, since this function is not a simple sum of independent terms. Verweij & van Houwelingen (1993) propose an interesting method for leave-one-out cross-validation of the partial likelihood: for each left-out observation they compute the probability of its sample path over the risk sets. For the fitting methods presented in this paper, this leave-one-out cross-validation was too slow, requiring too many model fits. When we instead tried 5- or 10-fold partial likelihood cross-validation, the resulting curve was quite unstable. So we settled on a simpler approach: 5-fold cross-validation where we fit a Cox model to the left-out one-fifth and record the drop in deviance. The results for the kidney cancer data are shown in Figure 2. The curve is a reasonable estimate of the test deviance curve of Figure 3.
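One ingredient of this scheme is the drop in deviance when the single risk score z = xβ̂ is fit to held-out data. A sketch under the no-ties assumption is given below; the bounded scalar search and function names are illustrative assumptions, not the author's code.

```python
# Sketch: drop in partial-likelihood deviance for a single risk score z on held-out data.
import numpy as np
from scipy.optimize import minimize_scalar

def cox_loglik_1d(gamma, z, time, event):
    """Partial log likelihood with the single predictor z and coefficient gamma."""
    ll = 0.0
    for k in np.where(event == 1)[0]:
        risk = time >= time[k]
        ll += gamma * z[k] - np.log(np.sum(np.exp(gamma * z[risk])))
    return ll

def drop_in_deviance(z, time, event, bound=10.0):
    """2 * (maximized log likelihood - null log likelihood) for predictor z."""
    res = minimize_scalar(lambda g: -cox_loglik_1d(g, z, time, event),
                          bounds=(-bound, bound), method="bounded")
    return 2.0 * (-res.fun - cox_loglik_1d(0.0, z, time, event))
```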

Figure 3: Kidney cancer example: drop in test sample deviance for CUS (Cox univariate shrinkage), lasso, SPC (supervised principal components) and UST (univariate soft thresholding).

Figure 4: Kidney cancer data: the univariate Cox coefficients plotted against the Cox score statistics.

In some microarray datasets there are duplicate genes, or multiple clones representing the same gene. In this case, the data for the different features will be highly correlated, and it makes sense to average the data for each gene before applying CUS. One can also average genes whose pairwise correlation is very high, whether or not they are supposed to represent the same gene. For example, when we averaged genes with correlation greater than 0.9 in the kidney cancer dataset, it reduced the number of genes by a few hundred and caused very little change in the test deviance curve.

3.2 Selection of a more parsimonious model

The CUS procedure enters predictors into the model based on their univariate Cox scores. Thus if two predictors are strongly predictive and correlated with each other, both will appear in the model. In some situations it is desirable to pare down the predictors to a smaller, more independent set. The lasso does this selection automatically, but sometimes doesn't predict as well as other methods (as in Figure 3). An alternative approach is pre-conditioning (Paul et al. 2008), in which we apply the standard (regression) lasso to the fitted values from a model. In Paul et al. (2008) the initial fitting method used was supervised principal components. Here we start with the CUS fit using 500 genes, which is approximately the best model from Figure 2. We then apply the regression version of the lasso to the fitted values, and we report the drop in test deviance in Figure 5, along with that for CUS itself (green curve copied from Figure 3). We see that pre-conditioning gives about the same drop in test deviance as the 500-gene CUS model, but using fewer than 100 genes. And it performs better here than the Cox model lasso (blue curve in Figure 3), a finding supported theoretically in Paul et al. (2008). Figure 6 shows the univariate Cox scores for the predictors entered by each method. We see that the pre-conditioning approach enters only predictors that are individually strong, while the usual lasso enters many predictors that are individually weak.
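A sketch of the pre-conditioning step, using scikit-learn's Lasso as an illustrative stand-in for the regression lasso applied to the CUS fitted values; the penalty value and names are assumptions.

```python
# Sketch of pre-conditioning: regression lasso applied to the CUS fitted values.
import numpy as np
from sklearn.linear_model import Lasso

def precondition_with_lasso(X, cus_fitted, alpha):
    """X: n x p expression matrix; cus_fitted: n-vector of CUS linear predictors x beta_hat."""
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X, cus_fitted)
    selected = np.nonzero(model.coef_)[0]   # genes retained by the lasso
    return model, selected
```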

4 Simulation study

We generated Gaussian samples of size n = 50 with p = 1000 predictors, with correlation ρ(x_i, x_j) between each pair of predictors. The outcome y was generated as an exponential random variable with mean exp(Σ_{j=1}^{1000} x_j β_j), with a randomly chosen s out of the p coefficients equal to ±β and the rest equal to zero. We considered two scenarios, s = 20 and s = 200, and tried both ρ(x_i, x_j) = 0 and ρ(x_i, x_j) = 0.5^{|i−j|}. Three fitting methods for the Cox model were compared: lasso, supervised principal components, and CUS, over a range of the tuning parameter for each method. A test set of size 400 was generated and the test set deviance of each fitted model was computed. Figure 7 shows the median test deviance plus and minus one standard error over 10 simulations, plotted against the number of genes in each model.
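A sketch of one cell of this design (correlated Gaussian predictors, s nonzero coefficients, exponential survival times, no censoring); the generator details are illustrative assumptions.

```python
# Sketch of the simulation design: n = 50, p = 1000, correlation rho^{|i-j|},
# s nonzero coefficients of size +/- beta, exponential survival times.
import numpy as np

def simulate(n=50, p=1000, s=20, beta=1.0, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Correlated Gaussian predictors: cov[i, j] = rho^{|i - j|} (identity when rho = 0)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    coef = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)
    coef[idx] = rng.choice([-beta, beta], size=s)
    # Exponential survival times with mean exp(x^T coef); no censoring in this sketch
    y = rng.exponential(scale=np.exp(X @ coef))
    event = np.ones(n, dtype=int)
    return X, y, event, coef
```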


Figure 5: Kidney cancer example: drop in test sample deviance for CUS (Cox univariate shrinkage), and pre-conditioning applied to CUS. In the latter we apply the regression version of the lasso to the fitted values from CUS.

Figure 6: Kidney cancer example: univariate Cox scores for all predictors (black) and those entered into the model (red). The panel on the left corresponds to L1-penalized partial likelihood, fit by coxpath. On the right is the standard regression version of the lasso applied to the fitted values from CUS.


Figure 7: Simulation results: average drop in test set deviance (± one standard error) for three different estimators (CUS, lasso, SPC), over the four simulation settings (Corr = 0 or 0.5; #nonzero = 20 or 200).


Figure 8 shows the number of genes correctly included in the model versus the number of genes in the model, for each of CUS and lasso. Neither method is very good at isolating the true non-zero predictors, but surprisingly the lasso does no better than the simple CUS procedure.

Figure 8: Simulation results: number of correctly included predictors in the model, for CUS and lasso, over the four different simulation settings (Corr = 0 or 0.5; #nonzero = 20 or 200).


5 Some real data examples

We applied three methods (lasso, supervised principal components, and CUS) to three real datasets. The first is the kidney data discussed earlier; the other datasets are the diffuse large cell lymphoma data of Rosenwald et al. (2002) (7399 genes and 240 patients), and the breast cancer data of van't Veer et al. (2002) (24,881 genes, 295 patients). We randomly divided each dataset into training and test sets ten times, and averaged the results. The first two datasets included test sets of size 88 and 80 respectively, so we retained those test set sizes. For the third dataset, we took a test set of about one-third (98 patients), and to ease the computational load, restricted attention to the 10,000 genes with largest variance. Figure 9 shows the median test deviance plus and minus one standard error over the ten random splits, plotted against the number of genes in each model. The CUS method yields a larger drop in test deviance than the lasso in at least two of the datasets, and performs a little better than SPC overall.

Figure 9: Top to bottom: results for the kidney, lymphoma and breast cancer datasets; shown is the average drop in test set deviance (± one standard error) for CUS (Cox univariate shrinkage), lasso and SPC (supervised principal components).

Acknowledgments. I thank Daniela Witten for helpful comments, and acknowledge support from National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

Appendix

Monotonicity of profiles. The subgradient equation for each β_j is

\sum_{k=1}^{K} \Big( s_{kj} - d_k \, \frac{\sum_{m \in R_k} x_{mj} \exp(x_{mj} \beta_j)}{\sum_{m \in R_k} \exp(x_{mj} \beta_j)} \Big) - \lambda \cdot t_j = 0   (12)

where s_kj denotes the sum of x_ij over the d_k individuals failing at time t_k, and t_j ∈ sign(β_j), that is, t_j = sign(β_j) if β_j ≠ 0 and t_j ∈ [−1, 1] if β_j = 0.

Claim: |β̂_j(λ)| is strictly decreasing in λ when β̂_j(λ) is non-zero, and if β̂_j(λ) = 0 then β̂_j(λ′) = 0 for all λ′ > λ.

Proof: For ease of notation, suppose we have a single parameter β with subgradient equation

g(β) − λt = 0   (13)

with t ∈ sign(β), where g now denotes the gradient of the log partial likelihood. Then g′(β) is the negative of the observed information, and it is easy to show that g′(β) < 0. Denote the solution to (13) for parameter λ by β̂(λ). Suppose that we have a solution β̂(λ) ≠ 0 for some λ; WLOG assume β̂(λ) > 0. Then by (13) we have

g(β̂(λ)) = λ.



Then if λ′ > λ, we cannot have β̂(λ′) ≥ β̂(λ), since this would imply g(β̂(λ′)) ≤ g(β̂(λ)) = λ < λ′. Hence β̂(λ′) < β̂(λ).

On the other hand, if β̂(λ) = 0 then

g(0) − λt = 0

for some t with |t| ≤ 1. Then if λ′ > λ, the equation g(0) − λ′t′ = 0 can be solved by taking t′ = t(λ/λ′). Hence β̂(λ′) = 0.

Proof of (11): Let

f_j(\beta_j) = \sum_{k=1}^{K} \Big( s_{kj} - d_k \, \frac{\sum_{m \in R_k} x_{mj} \exp(x_{mj} \beta_j)}{\sum_{m \in R_k} \exp(x_{mj} \beta_j)} \Big).

Suppose β̂_j(0) > 0. Then β̂_j(λ) ≠ 0 if and only if the equation f_j(β_j) − λ = 0 has a solution in (0, β̂_j(0)). Since the x_j have been standardized by √V_j, this is equivalent to f_j(0)/√V_j > λ. This latter quantity is the square root of the usual score statistic for β_j = 0.

References

Bickel, P. & Levina, E. (2004), 'Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations', Bernoulli 10, 989–1010.

Efron, B. (2008), Empirical Bayes estimates for large scale prediction problems. Unpublished.

Fan, J. & Fan, Y. (2008), 'Sure independence screening for ultra-high dimensional feature space', Journal of the Royal Statistical Society Series B, to appear.

Paul, D., Bair, E., Hastie, T. & Tibshirani, R. (2008), '"Pre-conditioning" for feature selection and regression in high-dimensional problems', Annals of Statistics 36(4), 1595–1618.

Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B. & Staudt, L. M. (2002), 'The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma', The New England Journal of Medicine 346, 1937–1947.

Tibshirani, R. (1997), 'The lasso method for variable selection in the Cox model', Statistics in Medicine 16, 385–395.

Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. (2001), 'Diagnosis of multiple cancer types by shrunken centroids of gene expression', Proceedings of the National Academy of Sciences 99, 6567–6572.

Tusher, V., Tibshirani, R. & Chu, G. (2001), 'Significance analysis of microarrays applied to transcriptional responses to ionizing radiation', Proc. Natl. Acad. Sci. USA 98, 5116–5121.

van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R. & Friend, S. H. (2002), 'Gene expression profiling predicts clinical outcome of breast cancer', Nature 415, 530–536.

Verweij, P. & van Houwelingen, H. (1993), 'Cross-validation in survival analysis', Statistics in Medicine 12, 2305–2314.

Zhao, H., Tibshirani, R. & Brooks, J. (2005), 'Gene expression profiling predicts survival in conventional renal cell carcinoma', PLoS Medicine pp. 511–533.

Zou, H. & Hastie, T. (2005), 'Regularization and variable selection via the elastic net', Journal of the Royal Statistical Society Series B 67(2), 301–320.
