Robust Model Selection with LARS Based on S-estimators

Claudio Agostinelli¹ and Matias Salibian-Barrera²

¹ Dipartimento di Statistica, Ca' Foscari University, Venice, Italy, [email protected]

² Department of Statistics, The University of British Columbia, Vancouver, BC, Canada, [email protected]

Abstract. We consider the problem of selecting a parsimonious subset of explanatory variables from a potentially large collection of covariates. We are concerned with the case when data quality may be unreliable (e.g. there might be outliers among the observations). When the number of available covariates is moderately large, fitting all possible subsets is not a feasible option. Sequential methods like forward or backward selection are generally "greedy" and may fail to include important predictors when these are correlated. To avoid this problem Efron et al. (2004) proposed the Least Angle Regression algorithm to produce an ordered list of the available covariates (sequencing) according to their relevance. We introduce outlier-robust versions of the LARS algorithm based on S-estimators for regression (Rousseeuw and Yohai (1984)). This algorithm is computationally efficient and suitable even when the number of variables exceeds the sample size. Simulation studies show that it is also robust to the presence of outliers in the data and compares favourably to previous proposals in the literature.

Keywords: robustness, model selection, LARS, S-estimators, robust regression

In: Y. Lechevallier, G. Saporta (eds.), Proceedings of COMPSTAT'2010, DOI 10.1007/978-3-7908-2604-3_6, © Springer-Verlag Berlin Heidelberg 2010.

1 Introduction

As a result of the recent dramatic increase in the ability to collect data, researchers sometimes have a very large number of potentially relevant explanatory variables available to them. Typically, some of these covariates are correlated among themselves and hence not all of them need to be included in a statistical model with good prediction performance. In addition, models with few variables are generally easier to interpret than models with many. Model selection refers to the process of finding a parsimonious model with good prediction properties. Many model selection methods consist of sequentially fitting models from a pre-specified list and comparing their goodness-of-fit, prediction properties, or a combination of both. In this paper we consider the case where a proportion of the data may not satisfy the model assumptions and we are interested in predicting the non-outlying observations. Therefore, we consider model selection methods for linear models based on robust methods.

As is the case with point estimation and other inference procedures, likelihood-type model selection methods (e.g. AIC (Akaike (1970)), Mallows' Cp (Mallows (1973)), and BIC (Schwarz (1978))) may be severely affected by a small proportion of atypical observations in the data. These outliers may not necessarily consist of large values, but might not follow the model that applies to the majority of the data. Model selection procedures that are resistant to the presence of outliers in the sample have only recently started to receive some attention in the literature. Seminal papers include Hampel (1983), Ronchetti (1985, 1997) and Ronchetti and Staudte (1994). Other proposals include Sommer and Staudte (1995), Ronchetti, Field and Blanchard (1997), Qian and Künsch (1998), Agostinelli (2002a, 2002b), Agostinelli and Markatou (2005), and Morgenthaler, Welsch and Zenide (2003). See also the recent book by Maronna, Martin and Yohai (2006). These proposals are based on robustified versions of classical selection criteria (e.g. robust Cp, robust final prediction error, etc.). More recently, Müller and Welsh (2005) proposed a model selection criterion that combines a measure of goodness-of-fit, a penalty term to avoid over-fitting, and the expected prediction error conditional on the data. Salibian-Barrera and Van Aelst (2008) use the fast and robust bootstrap of Salibian-Barrera and Zamar (2002) to obtain a faster bootstrap-based model selection method that is feasible to calculate for larger numbers of covariates. Although less expensive from a computational point of view than the stratified bootstrap of Müller and Welsh (2005), this method, like the previous ones, needs to compute the estimator on the full model.

A different approach to variable selection that is attractive when the number of explanatory variables is large is based on ordering the covariates according to their estimated importance in the full model. Forward stepwise and backward elimination procedures are examples of this approach, whereby in each step of the procedure a variable may enter or leave the linear model (see, e.g. Weisberg (1985) or Miller (2002)). With backward elimination one starts with the full model and then finds the best possible submodel with one less covariate in it. This procedure is repeated until a model with a single covariate is fit or a stopping criterion is reached. A similar procedure is forward stepwise, where we first select the covariate (say x1) with the highest absolute correlation with the response variable y. We take the residuals of the regression of y on x1 as our new response, project all covariates orthogonally to x1, and add the variable with the highest absolute correlation to the model. At the same step, variables already in the model may be deleted according to a criterion. These steps are repeated until no variables are added or deleted. Unfortunately, when p is large (p = 100, for example), these procedures become infeasible for highly robust estimators; furthermore, these algorithms are known to be greedy and may relegate important covariates if they are correlated with those selected earlier in the sequence.
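To make the forward search concrete, here is a minimal (non-robust) R sketch of forward selection driven by correlations with the current residuals. The function name and conventions (a matrix x of standardized covariates, a centered response y, refitting the active set by ordinary least squares rather than by explicit orthogonal projection) are ours, not the paper's:

    ## Forward selection by correlation with the current residuals.
    ## x: n x p matrix of standardized covariates; y: centered response;
    ## steps: how many variables to sequence.
    forward_stepwise <- function(x, y, steps = ncol(x)) {
      active <- integer(0)
      r <- y
      for (k in seq_len(steps)) {
        cand <- setdiff(seq_len(ncol(x)), active)
        ## next variable: largest absolute correlation with the residuals
        j <- cand[which.max(abs(cor(r, x[, cand, drop = FALSE])))]
        active <- c(active, j)
        ## refit on the current active set and update the residuals
        r <- residuals(lm(y ~ x[, active, drop = FALSE]))
      }
      active  # covariates in the order in which they entered
    }

Because each pass commits fully to the best current variable, a covariate that is correlated with one already selected can be pushed far down the sequence, which is the greediness referred to above.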

The Least Angle Regression (LARS) of Efron et al. (2004) is a generalization of stepwise methods, where the length of the step is selected so as to strike a balance between "fast-but-greedy" and "slow-but-conservative" alternatives, such as those in stagewise selection (see, e.g. Hastie, Tibshirani and Friedman (2001)). It is easy to verify that this method is not robust to the presence of a small proportion of atypical observations. McCann and Welsch (2007) proposed to add an indicator variable for each observation and then run the usual LARS on the extended set of covariates. When high-leverage outliers are possible, they suggest building models from randomly drawn subsamples of the data, and then selecting the best of them based on their (robustly estimated) prediction error. Khan, Van Aelst and Zamar (2007b) showed that the LARS algorithm can be expressed in terms of the pairwise sample correlations between covariates and the response variable, and proposed to apply this algorithm using robust correlation estimates. This is a "plug-in" proposal in the sense that it takes a method derived using least squares or L2 estimators and replaces the required point estimates by robust counterparts.

In this paper we derive an algorithm based on LARS, but using an S-regression estimator (Rousseeuw and Yohai (1984)). Section 2 contains a brief description of the LARS algorithm, while Section 3 describes our proposal. Simulation results are discussed in Section 4 and concluding remarks can be found in Section 5.

    2 Review of Least Angle Regression

Let $(y_1, x_1), \ldots, (y_n, x_n)$ be $n$ independent observations, where $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. We are interested in fitting a linear model of the form
\[
y_j = \alpha + x_j^\top \beta + \epsilon_j , \qquad j = 1, \ldots, n ,
\]
where $\beta \in \mathbb{R}^p$ and the errors $\epsilon_j$ are assumed to be independent with zero mean and constant variance $\sigma^2$. In what follows we will assume, without loss of generality, that the variables have been centered and standardized to satisfy
\[
\sum_{i=1}^n y_i = 0 , \qquad \sum_{i=1}^n x_{i,j} = 0 , \qquad \sum_{i=1}^n x_{i,j}^2 = 1 \quad \text{for } 1 \le j \le p ,
\]
so that the linear model above does not contain the intercept term.

The Least Angle Regression algorithm (LARS) is a generalization of the Forward Stagewise procedure. The latter is an iterative technique that starts with the predictor vector $\hat{\mu} = 0 \in \mathbb{R}^n$, and at each step sets
\[
\hat{\mu} \leftarrow \hat{\mu} + \delta \, \mathrm{sign}(\hat{c}_j) \, x^{(j)} ,
\]
where $j = \arg\max_{1 \le i \le p} \mathrm{cor}(y - \hat{\mu}, x^{(i)})$, $x^{(i)} \in \mathbb{R}^n$ denotes the $i$-th column of the design matrix, $\hat{c}_j = \mathrm{cor}(y - \hat{\mu}, x^{(j)})$, and $\delta > 0$ is a small constant. Typically, the parameter $\delta$ controls the speed and greediness of the method: small values produce better results at a large computational cost, while large values result in a faster algorithm that may relegate an important covariate if it happens to be correlated with one that has entered the model earlier.
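As an illustration, here is a minimal R sketch of the Forward Stagewise update above. The names are ours; the columns of x are assumed centered and scaled to unit length and y centered, as in the standardization described earlier:

    ## Standardize as assumed above, e.g.:
    ##   x <- scale(x); x <- sweep(x, 2, sqrt(colSums(x^2)), "/"); y <- y - mean(y)

    ## Forward Stagewise: many tiny steps of size delta along the covariate
    ## most correlated with the current residuals.
    forward_stagewise <- function(x, y, delta = 0.01, n_steps = 5000) {
      mu <- rep(0, nrow(x))                  # current predictor mu-hat
      beta <- rep(0, ncol(x))                # implied coefficients at the end
      for (s in seq_len(n_steps)) {
        cc <- drop(crossprod(x, y - mu))     # current correlations c_j = x_j'(y - mu)
        j <- which.max(abs(cc))
        mu <- mu + delta * sign(cc[j]) * x[, j]
        beta[j] <- beta[j] + delta * sign(cc[j])
      }
      list(mu = mu, beta = beta)
    }

Smaller values of delta trace a smoother path at the price of many more iterations, which is exactly the trade-off LARS removes by computing the step length analytically.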

The LARS iterations can be described as follows. Start with the predictor $\hat{\mu} = 0$. Let $\hat{\mu}_{\mathcal{A}}$ be the current predictor and let
\[
\hat{c} = X^\top ( y - \hat{\mu}_{\mathcal{A}} ) ,
\]
where $X \in \mathbb{R}^{n \times p}$ denotes the design matrix. In other words, $\hat{c}$ is the vector of current correlations $\hat{c}_j$, $j = 1, \ldots, p$. Let $\mathcal{A}$ denote the active set, which corresponds to those covariates with largest absolute correlations: $\hat{C} = \max_j \{ |\hat{c}_j| \}$ and $\mathcal{A} = \{ j : |\hat{c}_j| = \hat{C} \}$. Assume, without loss of generality, that $\mathcal{A} = \{1, \ldots, m\}$. Let $s_j = \mathrm{sign}(\hat{c}_j)$ for $j \in \mathcal{A}$, and let $X_{\mathcal{A}} \in \mathbb{R}^{n \times m}$ be the matrix formed by the corresponding signed columns $s_j x^{(j)}$ of the design matrix $X$. Note that the vector $u_{\mathcal{A}} = v_{\mathcal{A}} / \| v_{\mathcal{A}} \|$, where
\[
v_{\mathcal{A}} = X_{\mathcal{A}} ( X_{\mathcal{A}}^\top X_{\mathcal{A}} )^{-1} 1_{\mathcal{A}} ,
\]
satisfies
\[
X_{\mathcal{A}}^\top u_{\mathcal{A}} = A_{\mathcal{A}} 1_{\mathcal{A}} , \qquad (1)
\]
where $A_{\mathcal{A}} = 1 / \| v_{\mathcal{A}} \| \in \mathbb{R}$. In other words, the unit vector $u_{\mathcal{A}}$ makes equal angles with the columns of $X_{\mathcal{A}}$. LARS updates $\hat{\mu}_{\mathcal{A}}$ to
\[
\hat{\mu}_{\mathcal{A}} \leftarrow \hat{\mu}_{\mathcal{A}} + \hat{\gamma} \, u_{\mathcal{A}} ,
\]
where $\hat{\gamma}$ is taken to be the smallest positive value such that a new covariate joins the active set $\mathcal{A}$ of explanatory variables with largest absolute correlation. More specifically, note that if, for each $\gamma$, we let $\hat{\mu}(\gamma) = \hat{\mu}_{\mathcal{A}} + \gamma u_{\mathcal{A}}$, then for each $j = 1, \ldots, p$ we have
\[
\hat{c}_j(\gamma) = \mathrm{cor}\bigl( y - \hat{\mu}(\gamma), x^{(j)} \bigr) = x^{(j)\top} \bigl( y - \hat{\mu}(\gamma) \bigr) = \hat{c}_j - \gamma a_j ,
\]
where $a_j = x^{(j)\top} u_{\mathcal{A}}$. For $j \in \mathcal{A}$, equation (1) implies that
\[
| \hat{c}_j(\gamma) | = \hat{C} - \gamma A_{\mathcal{A}} ,
\]
so all maximal current correlations decrease at a constant rate along this direction. We then determine the smallest positive value of $\gamma$ that makes the correlation between the current active covariates and the residuals equal to that of another covariate $x^{(k)}$ not in the active set $\mathcal{A}$. This variable enters the model, the active set becomes
\[
\mathcal{A} \leftarrow \mathcal{A} \cup \{ k \} ,
\]
and the correlations are updated to $\hat{C} \leftarrow \hat{C} - \hat{\gamma} A_{\mathcal{A}}$. We refer the interested reader to Efron et al. (2004) for more details.
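For concreteness, the following R sketch performs one such LARS move: it computes the equiangular direction $u_{\mathcal{A}}$ and the step $\hat{\gamma}$ using the closed-form minimum given in Efron et al. (2004). Variable names are ours, and no attempt is made to handle ties or the LASSO modification:

    ## One LARS move from the current predictor mu with active set 'active'.
    lars_step <- function(x, y, active, mu) {
      c_hat <- drop(crossprod(x, y - mu))               # current correlations
      C_hat <- max(abs(c_hat[active]))
      s <- sign(c_hat[active])
      xa <- sweep(x[, active, drop = FALSE], 2, s, `*`) # signed active columns
      va <- drop(xa %*% solve(crossprod(xa), rep(1, length(active))))
      Aa <- 1 / sqrt(sum(va^2))                         # A_A = 1 / ||v_A||
      ua <- va * Aa                                     # equiangular direction u_A
      a  <- drop(crossprod(x, ua))                      # a_j = x_j' u_A
      inactive <- setdiff(seq_len(ncol(x)), active)
      ## smallest positive gamma at which an inactive correlation catches up
      gammas <- c((C_hat - c_hat[inactive]) / (Aa - a[inactive]),
                  (C_hat + c_hat[inactive]) / (Aa + a[inactive]))
      gammas <- replace(gammas, !is.finite(gammas) | gammas <= 1e-12, Inf)
      i_min  <- which.min(gammas)
      k_new  <- inactive[(i_min - 1) %% length(inactive) + 1]
      list(mu = mu + gammas[i_min] * ua, new_var = k_new,
           active = c(active, k_new))
    }

Starting from mu = 0 and active = which.max(abs(crossprod(x, y))), repeated calls to this function sequence the covariates in LARS order.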


    3 LARS based on S-estimators

S-regression estimators (Rousseeuw and Yohai (1984)) are defined as the vector of coefficients that produces the smallest residuals in the sense of minimizing a robust M-scale estimator. Formally, we have
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \hat{\sigma}(\beta) ,
\]
where $\hat{\sigma}(\beta)$ satisfies
\[
\frac{1}{n} \sum_{i=1}^n \rho\!\left( \frac{ r_i(\beta) }{ \hat{\sigma}(\beta) } \right) = b ,
\]
$\rho : \mathbb{R} \to \mathbb{R}_+$ is a symmetric, bounded, non-decreasing and continuous function, and $b \in (0, 1)$ is a fixed constant. The choice $b = E_{F_0}[\rho(\epsilon)]$ ensures that the resulting estimator is consistent when the errors have distribution function $F_0$.
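To fix ideas, here is a small R sketch of the M-scale $\hat{\sigma}(\beta)$, using Tukey's bisquare $\rho$ with $b = 0.5$ (the usual maximal-breakdown choice; the tuning constant $c \approx 1.5477$ makes $b = E_{\Phi}[\rho(\epsilon)]$ at the standard normal). The fixed-point iteration and starting value are standard but are our choices, not a prescription taken from the paper:

    ## Tukey bisquare rho, scaled so that sup(rho) = 1.
    rho_bisq <- function(u, cc = 1.5477) {
      v <- pmin(abs(u) / cc, 1)
      1 - (1 - v^2)^3
    }

    ## M-scale of a residual vector r: the solution s of mean(rho(r/s)) = b,
    ## computed by the usual fixed-point iteration.
    m_scale <- function(r, b = 0.5, cc = 1.5477, tol = 1e-8, max_it = 200) {
      s <- median(abs(r)) / 0.6745            # robust starting value
      if (s <= 0) return(0)
      for (it in seq_len(max_it)) {
        s_new <- s * sqrt(mean(rho_bisq(r / s, cc)) / b)
        if (abs(s_new / s - 1) < tol) break
        s <- s_new
      }
      s_new
    }

    ## The S-estimator minimizes this scale over the regression coefficients,
    ## schematically (for a handful of covariates) via a general optimizer:
    ##   beta_hat <- optim(coef(lm(y ~ x)),
    ##                     function(b) m_scale(y - cbind(1, x) %*% b))$par
    ## Practical implementations (e.g. lmrob.S in the robustbase package) use
    ## subsampling plus local improvement steps instead.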

For a given active set $\mathcal{A}$ of $k$ covariates, let $\hat{\beta}_{\mathcal{A}}$, $\hat{\beta}_{0\mathcal{A}}$, $\hat{\sigma}_{\mathcal{A}}$ be the S-estimators obtained by regressing the current residuals on the $k$ active variables with indices in $\mathcal{A}$. Consider the parameter vector $\theta = (\gamma, \gamma_0, \sigma)$ that satisfies
\[
\frac{1}{n - k - 1} \sum_{i=1}^n \rho\!\left( \frac{ r_i - x_{i,k}^\top ( \gamma \hat{\beta}_{\mathcal{A}} ) - \gamma_0 }{ \sigma } \right) = b ,
\]
where $x_{i,k}$ denotes the vector of the $k$ active covariates for the $i$-th observation. A robust measure of covariance between the residuals associated with $\theta$ and the $j$-th covariate is given by
\[
\mathrm{cov}_j(\theta) = \sum_{i=1}^n \psi\!\left( \frac{ r_i - x_{i,k}^\top ( \gamma \hat{\beta}_{\mathcal{A}} ) - \gamma_0 }{ \sigma } \right) x_{ij} ,
\]
where $\psi = \rho'$, and the corresponding correlation is
\[
\tilde{\rho}_j(\theta) = \mathrm{cov}_j(\theta) \Big/ \left[ \sum_{i=1}^n \psi\!\left( \frac{ r_i - x_{i,k}^\top ( \gamma \hat{\beta}_{\mathcal{A}} ) - \gamma_0 }{ \sigma } \right)^2 \right]^{1/2} .
\]
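The following R sketch evaluates these quantities for a given $\theta = (\gamma, \gamma_0, \sigma)$, reusing the bisquare functions from the previous sketch; the argument names are ours:

    ## Derivative of the bisquare rho above (psi = rho').
    psi_bisq <- function(u, cc = 1.5477) {
      ifelse(abs(u) <= cc, 6 * u * (1 - (u / cc)^2)^2 / cc^2, 0)
    }

    ## Robust covariance/correlation between the residuals implied by
    ## theta = (gamma, gamma0, sigma) and a candidate covariate xj.
    ## r: current residuals; xa: n x k matrix of active covariates;
    ## beta_a: S-estimate of the coefficients on the active set.
    robust_cor_j <- function(r, xa, beta_a, xj, gamma, gamma0, sigma) {
      u <- psi_bisq((r - gamma * drop(xa %*% beta_a) - gamma0) / sigma)
      sum(u * xj) / sqrt(sum(u^2))   # cov_j(theta) and its normalization
    }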

    Our algorithm can be described as follows:

1. Set $k = 0$ and compute the S-estimators $\hat{\theta}_0 = (\hat{\beta}_0, \hat{\beta}_{00}, \hat{\sigma}_0)$ by regressing $y$ against the intercept. The first variable to enter is the one associated with the largest robust correlation,
\[
\hat{\lambda}_1 = \max_{1 \le j \le p} \tilde{\rho}_j(\hat{\theta}_0) .
\]
Without loss of generality, assume that it corresponds to the first covariate.


2. Set $k = k + 1$ and compute the current residuals
\[
r_{i,k} = r_{i,k-1} - x_{i,k-1}^\top ( \hat{\gamma}_{k-1} \hat{\beta}_{k-1} ) - \hat{\gamma}_{0,k-1} .
\]
3. Let $\hat{\beta}_k$, $\hat{\beta}_{0k}$, $\hat{\sigma}_k$ be the S-estimators obtained by regressing $r_k$ against $x_k$.
4. For each $j$ in the inactive set, find $\gamma_j$ such that $\lambda_j = | \tilde{\rho}_j | = | \tilde{\rho}_m |$ for all $1 \le m \le k$, where
\[
\sum_{i=1}^n \psi\!\left( \frac{ r_{i,k} - x_{i,k}^\top ( \gamma_j \hat{\beta}_k ) - \gamma_{0k} }{ \sigma_k } \right) = 0 \quad \text{and} \quad
\sum_{i=1}^n \rho\!\left( \frac{ r_{i,k} - x_{i,k}^\top ( \gamma_j \hat{\beta}_k ) - \gamma_{0k} }{ \sigma_k } \right) = b \, ( n - k - 1 ) .
\]
5. Let $\hat{\lambda}_{k+1} = \max_{j > k} \lambda_j$; the associated index, say $v$, corresponds to the next variable to enter the active set. Let $\hat{\gamma}_k = \gamma_v$.
6. Repeat until $k = d$.

Given an active set $\mathcal{A}$, the above algorithm finds the step length $\hat{\gamma}_k$ such that the robust correlation between the current residuals and the active covariates matches that of an explanatory variable yet to enter the model. The variable that achieves this with the smallest step is included in the model, and the procedure is then iterated. It is in this sense that our proposal is based on LARS.
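A schematic R sketch of the search in step 4 follows. For a trial value of $\gamma$, the intercept and scale are obtained from a simple location-scale M-fit of the partially de-fitted residuals (an approximation: the paper solves the two displayed equations, with the scale equation normalized by $n - k - 1$), and uniroot() then locates the value of $\gamma$ at which an inactive covariate's robust correlation catches up with the largest active one. All function and argument names are ours, the helpers m_scale(), psi_bisq() and robust_cor_j() are the sketches given earlier, and the root bracket is assumed to contain a sign change:

    ## Location-scale M-fit with the bisquare functions defined earlier:
    ## gamma0 approximately solves sum(psi((res - gamma0)/sigma)) = 0 and
    ## sigma is the corresponding M-scale.
    loc_scale_m <- function(res, cc = 1.5477, n_it = 50) {
      g0 <- median(res)
      s  <- m_scale(res - g0)
      for (it in seq_len(n_it)) {
        u <- (res - g0) / s
        w <- ifelse(abs(u) < 1e-10, 6 / cc^2, psi_bisq(u, cc) / u)
        g0 <- sum(w * res) / sum(w)     # weighted-mean update for gamma0
        s  <- m_scale(res - g0)         # scale solving the rho-equation
      }
      list(gamma0 = g0, sigma = s)
    }

    ## Step length at which covariate xj matches the active robust correlations.
    gamma_cross <- function(r, xa, beta_a, xj, upper = 1) {
      gap <- function(gamma) {
        fit <- loc_scale_m(r - gamma * drop(xa %*% beta_a))
        act <- max(abs(apply(xa, 2, robust_cor_j, r = r, xa = xa,
                             beta_a = beta_a, gamma = gamma,
                             gamma0 = fit$gamma0, sigma = fit$sigma)))
        ina <- abs(robust_cor_j(r, xa, beta_a, xj, gamma,
                                fit$gamma0, fit$sigma))
        act - ina
      }
      uniroot(gap, lower = 0, upper = upper)$root
    }

Applying gamma_cross() to every inactive covariate and keeping the one reached with the smallest step reproduces, in outline, the selection made in steps 4 and 5.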

    4 Simulation results

To study the performance of our proposal we conducted a simulation study using a design similar to that reported by Khan et al. (2007b). We generated the response variable $y$ according to the following model:
\[
y = L_1 + L_2 + \cdots + L_k + \sigma \epsilon ,
\]
where the $L_j$, $j = 1, \ldots, k$, and $\epsilon$ are independent random variables with a standard normal distribution. The value of $\sigma$ is chosen to obtain a signal-to-noise ratio of 3. We then generate $d$ candidate covariates as follows:
\[
\begin{aligned}
X_i &= L_i + \tau \, \epsilon_i , \qquad i = 1, \ldots, k , \\
X_{k+1} &= L_1 + \delta \, \epsilon_{k+1} , \\
X_{k+2} &= L_1 + \delta \, \epsilon_{k+2} , \\
X_{k+3} &= L_2 + \delta \, \epsilon_{k+3} , \\
X_{k+4} &= L_2 + \delta \, \epsilon_{k+4} , \\
&\;\;\vdots \\
X_{3k-1} &= L_k + \delta \, \epsilon_{3k-1} , \\
X_{3k} &= L_k + \delta \, \epsilon_{3k} ,
\end{aligned}
\]
and $X_j = \epsilon_j$ for $j = 3k + 1, \ldots, d$. The choices $\tau = 5$ and $\delta = 0.3$ result in $\mathrm{cor}(X_1, X_{k+1}) = \mathrm{cor}(X_1, X_{k+2}) = \mathrm{cor}(X_2, X_{k+3}) = \mathrm{cor}(X_2, X_{k+4}) = \cdots = \mathrm{cor}(X_k, X_{3k}) = 0.5$. We consider the following contamination cases:


a. $\epsilon \sim N(0, 1)$, no contamination;
b. $\epsilon \sim 0.90\, N(0, 1) + 0.10\, N(0, 1)/U(0, 1)$, 10% of symmetric outliers with the Slash distribution;
c. $\epsilon \sim 0.90\, N(0, 1) + 0.10\, N(20, 1)$, 10% of asymmetric Normal outliers;
d. 10% of high-leverage asymmetric Normal outliers (the corresponding covariates were sampled from a $N(50, 1)$ distribution).

For each case we generated 500 independent samples with $n = 150$, $k = 6$ and $d = 50$. For each of these datasets we recorded the order in which the 50 covariates entered the model. We used the classical LARS algorithm as implemented in the R package lars, our proposal (LARSROB), and the robust plug-in algorithm of Khan et al. (2007b) (RLARS).
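For reference, here is an R sketch of this data-generating mechanism for contamination case (c). The symbols tau and delta stand for the two noise constants in the display above, and the signal-to-noise ratio is read as a ratio of standard deviations; both are our interpretation of the design, not code taken from the paper:

    ## Simulated data: k latent variables, d candidate covariates, 10% of
    ## asymmetric N(20, 1) outliers in the error term (case (c)).
    gen_data <- function(n = 150, k = 6, d = 50, tau = 5, delta = 0.3,
                         prop_out = 0.10) {
      L <- matrix(rnorm(n * k), n, k)
      sigma <- sqrt(k) / 3                     # signal-to-noise ratio of 3
      e <- rnorm(n)
      out <- rbinom(n, 1, prop_out) == 1
      e[out] <- rnorm(sum(out), mean = 20)     # asymmetric outliers
      y <- rowSums(L) + sigma * e
      x <- matrix(rnorm(n * d), n, d)          # pure-noise covariates
      x[, 1:k] <- L + tau * x[, 1:k]           # X_i = L_i + tau * eps_i
      for (i in 1:k) {                         # two noisy copies of each L_i
        x[, k + 2 * i - 1] <- L[, i] + delta * x[, k + 2 * i - 1]
        x[, k + 2 * i]     <- L[, i] + delta * x[, k + 2 * i]
      }
      list(x = x, y = y)
    }

With d = 50 and k = 6, columns 19 to 50 of x remain pure noise.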

For case (a), where no outliers were present in the data, all methods performed very similarly. The results of our simulation for cases (b), (c) and (d) above are summarized in Figures 1 to 3. For each sequence of covariates, consider the number $t_m$ of target explanatory variables included among the first $m$ covariates entering the model, $m = 1, \ldots, d$. Ideally we would like a sequence that satisfies $t_m = k$ for $m \ge k$. In Figures 1 to 3 we plot the average of $t_m$ over the 500 samples, as a function of the model size $m$, for each of the three methods. We see that for symmetric low-leverage outliers LARSROB and RLARS are very close to each other, with both giving better results than the classical LARS. For asymmetric outliers LARSROB performed marginally better than RLARS, while for high-leverage outliers the performance of LARSROB deteriorates noticeably.

    5 Conclusion

We have proposed a new robust algorithm to select covariates for a linear model. Our method is based on the LARS procedure of Efron et al. (2004). Rather than replacing classical correlation estimates by robust ones and applying the same LARS algorithm, we derived our method directly following the intuition behind LARS but starting from robust S-regression estimates. Simulation studies suggest that our method is robust to the presence of low-leverage outliers in the data, and that in this case it compares well with the "plug-in" approach of Khan et al. (2007b). A possible way to make our proposal more resistant to high-leverage outliers is to downweight extreme values of the covariates in the robust correlation measure we utilize. Further research along these lines is ongoing.

An important feature of our approach is that it naturally extends the relationship between the LARS algorithm and the sequence of LASSO solutions (Tibshirani (1996)). Hence, with our approach we can obtain a resistant algorithm to calculate the LASSO based on S-estimators. Details of the algorithm discussed here, and of its connection with a robust LASSO method, will be published separately.


Fig. 1. Case (b) - Average number of correctly selected covariates as a function of the model size. The solid line corresponds to LARS, the dashed line to our proposal (LARSROB) and the dotted line to the RLARS algorithm of Khan et al. (2007b).

Fig. 2. Case (c) - Average number of correctly selected covariates as a function of the model size. The solid line corresponds to LARS, the dashed line to our proposal (LARSROB) and the dotted line to the RLARS algorithm of Khan et al. (2007b).


Fig. 3. Case (d) - Average number of correctly selected covariates as a function of the model size. The solid line corresponds to LARS, the dashed line to our proposal (LARSROB) and the dotted line to the RLARS algorithm of Khan et al. (2007b).

    References

AGOSTINELLI, C. (2002a): Robust model selection in regression via weighted likelihood methodology. Statistics and Probability Letters 56, 289-300.

AGOSTINELLI, C. (2002b): Robust stepwise regression. Journal of Applied Statistics 29(6), 825-840.

AGOSTINELLI, C. and MARKATOU, M. (2005): Robust model selection by cross-validation via weighted likelihood. Unpublished manuscript.

AKAIKE, H. (1970): Statistical predictor identification. Annals of the Institute of Statistical Mathematics 22, 203-217.

EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004): Least angle regression. The Annals of Statistics 32(2), 407-499.

HAMPEL, F.R. (1983): Some aspects of model choice in robust statistics. In: Proceedings of the 44th Session of the ISI, volume 2, 767-771. Madrid.

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer-Verlag, New York.

KHAN, J.A., VAN AELST, S. and ZAMAR, R.H. (2007a): Building a robust linear model with forward selection and stepwise procedures. Computational Statistics and Data Analysis 52, 239-248.

KHAN, J.A., VAN AELST, S. and ZAMAR, R.H. (2007b): Robust linear model selection based on least angle regression. Journal of the American Statistical Association 102, 1289-1299.

MALLOWS, C.L. (1973): Some comments on Cp. Technometrics 15, 661-675.

MARONNA, R.A., MARTIN, D.R. and YOHAI, V.J. (2006): Robust Statistics: Theory and Methods. Wiley, New York.


McCANN, L. and WELSCH, R.E. (2007): Robust variable selection using least angle regression and elemental set sampling. Computational Statistics and Data Analysis 52, 249-257.

MILLER, A.J. (2002): Subset Selection in Regression. Chapman & Hall, New York.

MORGENTHALER, S., WELSCH, R.E. and ZENIDE, A. (2003): Algorithms for robust model selection in linear regression. In: M. Hubert, G. Pison, A. Struyf and S. Van Aelst (Eds.): Theory and Applications of Recent Robust Methods. Birkhäuser-Verlag, Basel, 195-206.

MÜLLER, S. and WELSH, A.H. (2005): Outlier robust model selection in linear regression. Journal of the American Statistical Association 100, 1297-1310.

QIAN, G. and KÜNSCH, H.R. (1998): On model selection via stochastic complexity in robust linear regression. Journal of Statistical Planning and Inference 75, 91-116.

RONCHETTI, E. (1985): Robust model selection in regression. Statistics and Probability Letters 3, 21-23.

RONCHETTI, E. (1997): Robustness aspects of model choice. Statistica Sinica 7, 327-338.

RONCHETTI, E. and STAUDTE, R.G. (1994): A robust version of Mallows' Cp. Journal of the American Statistical Association 89, 550-559.

RONCHETTI, E., FIELD, C. and BLANCHARD, W. (1997): Robust linear model selection by cross-validation. Journal of the American Statistical Association 92, 1017-1023.

ROUSSEEUW, P.J. and YOHAI, V.J. (1984): Robust regression by means of S-estimators. In: J. Franke, W. Härdle and D. Martin (Eds.): Robust and Nonlinear Time Series Analysis, Lecture Notes in Statistics 26. Springer-Verlag, Berlin, 256-272.

SALIBIAN-BARRERA, M. and VAN AELST, S. (2008): Robust model selection using fast and robust bootstrap. Computational Statistics and Data Analysis 52, 5121-5135.

SALIBIAN-BARRERA, M. and ZAMAR, R.H. (2002): Bootstrapping robust estimates of regression. The Annals of Statistics 30, 556-582.

SCHWARZ, G. (1978): Estimating the dimension of a model. The Annals of Statistics 6, 461-464.

SOMMER, S. and STAUDTE, R.G. (1995): Robust variable selection in regression in the presence of outliers and leverage points. Australian Journal of Statistics 37, 323-336.

TIBSHIRANI, R. (1996): Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

WEISBERG, S. (1985): Applied Linear Regression. Wiley, New York.
