Penalized Regression Models with Autoregressive
Error Terms
Young Joo Yoon
Department of Applied Statistics, Daejeon University, Daejeon, 300-716, Korea
Cheolwoo Park
Department of Statistics, University of Georgia, Athens, GA 30602, USA
Taewook Lee
Department of Information Statistics, Hankuk University of Foreign Studies, Korea
February 22, 2012
Abstract
Penalized regression methods have recently gained enormous attention in statistics and the
field of machine learning due to their ability to reduce prediction error and identify
important variables at the same time. Numerous studies have been conducted on penalized re-
gression, but most of them are limited to the case where the data are independently observed. In
this paper, we study a variable selection problem in penalized regression models with autoregres-
sive error terms. We consider three estimators, adaptive LASSO (Least Absolute Shrinkage and
Selection Operator), bridge, and SCAD (Smoothly Clipped Absolute Deviation), and propose a
computational algorithm that enables us to select a relevant set of variables and also the order of
autoregressive error terms simultaneously. In addition, we provide their asymptotic properties
such as consistency, selection consistency, and asymptotic normality. The performances of the
three estimators are compared with one another using simulated and real examples.
Key words: Asymptotic normality, Autoregressive error models, Consistency, Oracle property,
Penalized regression, Variable selection.
1 Introduction
The high dimensional nature of many current data sets has brought statisticians’ attention to
variable selection methods over the years. Variable selection focuses on searching for the best set
of relevant variables to include in a model. It is a key way to reduce a large number of variables to
relatively few and produce a sparse model that predicts well.
In linear regression, subset selection and penalized methods are two popular variable selection
approaches. Subset selection methods, including forward selection, backward elimination, and step-
wise procedures, are practically useful, but they often show high variability and do not reduce the
prediction error of the full model (Hastie et al., 2009). Penalized regression methods such as ridge,
LASSO (Least Absolute Shrinkage and Selection Operator), SCAD (Smoothly Clipped Absolute
Deviation) and bridge operate as a continuous process and select variables and estimate coefficients simul-
taneously. As a result, they can be more stable than subset selection.
One of the recent popular penalized methods is the LASSO that uses the L1 penalty (Tibshirani,
1996). The LASSO shrinks the coefficients toward zero, which results in reducing the variance and
identifying important variables. Fan and Li (2001) develop a variable selection method based on
the SCAD penalty function. The estimator possesses the unbiasedness property unlike the LASSO.
It is also shown to have the oracle property; it works as well as if the correct submodel were known.
The adaptive LASSO, developed by Zou (2006) and Zhang and Lu (2007), permits different
weights for different parameters, and it is also shown to have the oracle property. Bridge regression
(Frank and Friedman, 1993) utilizes the Lγ (γ > 0) penalty, and thus it includes the ridge (Hoerl
and Kennard, 1970) (γ = 2) and the LASSO (γ = 1) as special cases. It is known that bridge
estimators produce sparse models when 0 < γ ≤ 1.
Other recent developments of penalized regression methods are rich. The elastic net (Zou and
Hastie, 2005) combines the L1 and L2 penalties and possesses a grouping effect, i.e. if there is a set
of variables among which the pairwise correlations are high, the elastic net groups the correlated
variables together. Park and Casella (2008) propose the Bayesian LASSO which provides interval
estimates that can guide variable selection. Zou and Yuan (2008) impose the F∞ norm on support
vector machines in the classification context. Zou and Zhang (2009) propose the adaptive elastic
net that combines the strengths of the quadratic regularization and the adaptively weighted LASSO
shrinkage.
While these methods are widely used in practice, most studies on their theoretical and empirical
properties have been done under the assumption that observations are independent of each other.
However, if the data are collected sequentially in time, the existing penalized methods may suffer
from the temporal correlation. Wang et al. (2007) consider linear regression with autoregressive
errors (REGAR) model (Tsay, 1984). They employ the modified LASSO-type penalty not only on
regression coefficients but also on autoregression coefficients, which results in selecting relevant co-
variates and some of AR error terms. Hsu et al. (2008) also develop a subset selection method using
the LASSO for vector autoregressive (VAR) processes. Recently, Alquier and Doukhan (2011) study
extensions of the LASSO and other L1-penalized methods to the case of dependent observations.
In this paper we also consider penalized regression in the REGAR model. The proposed ap-
proach offers two major contributions that differentiate it from the approach of Wang et al. (2007):
(i) Since the order of the AR error terms is not usually available in advance, we estimate it from the
data, which makes the proposed method readily applicable. Hence, the proposed computational
algorithm allows one to select an important set of variables and also the order of autoregressive
error terms efficiently; (ii) We construct three different penalized regression methods, adaptive
LASSO, SCAD and bridge, in the REGAR model. We study asymptotic properties of the three
methods and compare their finite sample performances in various numerical settings. According to
our numerical study, the bridge method shows superior performance.
The rest of the paper is organized as follows. Section 2 reviews the adaptive LASSO, SCAD,
and bridge estimators. In Section 3, we introduce a computational algorithm of penalized regression
with AR error terms for the three methods, and study asymptotic properties of the estimators, whose
proofs are provided in Section 5. We also discuss how penalized regression can be implemented
in ARCH models. Section 4 presents the comparison of the three estimators using simulated and
real examples.
2 Penalized regression
We consider a linear regression model with p predictors and T0 observations:
$$Y_t = x_t'\beta + e_t, \quad t = 1, \ldots, T_0, \qquad (2.1)$$
where β = (β_1, ..., β_p)′, x_t = (x_{t1}, ..., x_{tp})′, and the error term e_t follows the autoregressive process
$$e_t = \phi_1 e_{t-1} + \phi_2 e_{t-2} + \cdots + \phi_q e_{t-q} + \varepsilon_t, \qquad (2.2)$$
where φ = (φ_1, ..., φ_q)′ is the vector of autoregression coefficients and the ε_t are independent and identically
distributed random variables with mean 0 and variance σ². More conditions on e_t will be discussed
in the next section. We assume that the Y_t's are centered and the covariates x_t's are standardized,
that is,
$$\sum_{t=1}^{T_0} Y_t = 0, \qquad \sum_{t=1}^{T_0} x_{tj} = 0, \quad \text{and} \quad \frac{1}{T_0}\sum_{t=1}^{T_0} x_{tj}^2 = 1, \qquad j = 1, \ldots, p.$$
Penalized regression estimates β by minimizing the penalized least squares objective function:
$$Q_{T_0}^{I}(\beta) = \sum_{t=1}^{T_0} (Y_t - x_t'\beta)^2 + T_0 \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \qquad (2.3)$$
where p_{λ_j}(·) is a penalty function and the λ_j's are penalty parameters. In this paper, we consider three
forms of p_{λ_j}(·):

(i) (adaptive LASSO) p_{λ_j}(|β|) = λ_j|β|,

(ii) (SCAD)
$$p_{\lambda_j}(|\beta|) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\ -\dfrac{\beta^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases}$$

(iii) (bridge) p_{λ_j}(|β|) = λ|β|^γ.
Since Fan and Li (2001) show that the Bayes risks are not sensitive to the choice of a for the SCAD
penalty and a = 3.7 is a good choice for various problems, we also use the same a value in our
numerical examples. Note that the adaptive LASSO imposes a different amount of penalty on each
coefficient while the other two utilize the same parameter λ. If λj = λ in the adaptive LASSO, it
leads to the ordinary LASSO.
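For concreteness, the three penalties can be written as short vectorized functions; the following NumPy sketch is our own illustration (the function names and vectorized form are not from the paper), with the SCAD branch using a = 3.7 as recommended above.

```python
import numpy as np

def penalty_alasso(beta, lam_j):
    """Adaptive LASSO penalty: lambda_j * |beta|, where lambda_j may differ per coefficient."""
    return lam_j * np.abs(beta)

def penalty_scad(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), evaluated coordinate-wise."""
    b = np.abs(beta)
    small = lam * b                                       # |beta| <= lambda
    middle = -(b**2 - 2*a*lam*b + lam**2) / (2*(a - 1))   # lambda < |beta| <= a*lambda
    large = (a + 1) * lam**2 / 2                          # |beta| > a*lambda
    return np.where(b <= lam, small, np.where(b <= a*lam, middle, large))

def penalty_bridge(beta, lam, gamma):
    """Bridge penalty: lambda * |beta|^gamma; sparse solutions arise for 0 < gamma <= 1."""
    return lam * np.abs(beta)**gamma
```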
Zou (2006) shows that if the different tuning parameters λ_j are data-dependent and well cho-
sen, then the oracle property can be obtained for the adaptive LASSO. Zou (2006) suggests
λ_j = λ|β̂_j^{OLS}|^{−τ} as the penalty parameters and achieves better variable selection results than the
LASSO. The adaptive LASSO model can be extended to generalized linear models (Zou, 2006) and
Cox proportional hazards models (Zhang and Lu, 2007). Wang et al. (2007) propose adaptive
LASSO estimators for the REGAR model using λ_j = λ log(T_0)/(T_0|β̂_j^{OLS}|) as penalty parameters.
The adaptive LASSO with τ = 1 is equivalent to this modified method for models with independent
and identically distributed errors.
The SCAD penalty possesses the following three important properties: (i) unbiasedness for a
large true coefficient to avoid excessive estimation bias; (ii) sparsity (estimating a small coefficient
as zero) to reduce model complexity; and (iii) continuity to avoid unnecessary variation in model
prediction. The SCAD penalty has proved to be successful in many other statistical contexts such
as regression (Fan and Li, 2001), classification (Zhang et al., 2006), Cox model (Fan and Li, 2002),
and varying coefficient models (Wang et al., 2008).
The bridge estimator is a generalization of ridge and LASSO. It does variable selection when
0 < γ ≤ 1, and shrinks the coefficients when γ > 1. Frank and Friedman (1993) introduce bridge
regression, but do not solve for the estimator of bridge regression for any given γ > 0. Fu (1998)
studies the structure of bridge estimators and proposes a general algorithm to solve for γ ≥ 1.
Knight and Fu (2000) show asymptotic properties of bridge estimators with γ > 0 when p is fixed.
Huang et al. (2008) study asymptotic properties of bridge estimators in sparse, high dimensional,
linear regression models when the number of covariates may increase along with the sample size.
Liu et al. (2007) introduce an Lγ support vector machine algorithm which selects γ from the data.
Park and Yoon (2011) also consider an adaptive choice of γ in a linear regression model, and propose
an algorithm that selects grouped variables.
Figure 1 shows the three thresholding functions with λj = λ = 2 for each penalty function.
Here, we take a simple linear regression model with one parameter θ and a single observation
z = θ + ε, where ε is a normal random error with mean 0 and variance σ2. The thresholding
functions depicted in Figure 1 are as follows:
(i) (adaptive LASSO (τ = 1))
$$\hat\theta = \begin{cases} 0, & |z| \le \sqrt{2}, \\ \mathrm{sgn}(z)\left(|z| - \dfrac{2}{|z|}\right), & |z| > \sqrt{2}. \end{cases}$$

(ii) (SCAD)
$$\hat\theta = \begin{cases} \mathrm{sgn}(z)\,(|z| - 2)_+, & |z| \le 4, \\ \{(a-1)z - \mathrm{sgn}(z)\, 2a\}/(a-2), & 4 < |z| \le 2a, \\ z, & |z| > 2a. \end{cases}$$

(iii) (bridge (γ = 0.5))
$$\hat\theta = \begin{cases} 0, & |z| \le 2^{2/3}\left(\tfrac{3}{2}\right), \\ \text{the solution of } \mathrm{sgn}(z)\Big(|\theta| + \dfrac{1}{\sqrt{|\theta|}}\Big) = z \text{ satisfying } 1 - \dfrac{1}{|\theta|\sqrt{|\theta|}} > 0, & |z| > 2^{2/3}\left(\tfrac{3}{2}\right). \end{cases}$$
For the bridge, the computational algorithm for the solution can be found in Knight and Fu (2000).
Figure 1: Plot of the thresholding functions with λ_j = λ = 2 for each penalty function: (a) adaptive LASSO (τ = 1), (b) SCAD (a = 3.7), (c) bridge (γ = 0.5).
3 Penalized regression in the REGAR model
In Section 3.1, we propose our main algorithm by incorporating the information about the depen-
dence structure of a given time series under the REGAR model (2.1) with the error et in (2.2).
In practice, there are some obstacles to finding the estimators because the objective function is not
convex and the AR order q is not available in advance. The proposed algorithm utilizes the local
quadratic approximation to obtain local convexity and applies the BIC for finding the correct q
along with other optimal tuning parameters. In Section 3.2, we present asymptotic properties of
the estimators assuming that the tuning parameters including the AR order q can be properly
identified. We show that the estimators are consistent and enjoy selection consistency as well as
asymptotic normality. In Section 3.3, we briefly discuss extension to penalized regression in ARCH
models.
3.1 Computation in the REGAR model
In estimating the coefficients β and φ, the estimators can be obtained by minimizing the following
objective function:
$$Q_T^*(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|) + T\sum_{j=1}^{q} p_{\delta_j}(|\phi_j|), \qquad (3.1)$$
where T = T0 − q. Wang et al. (2007) consider the same model with the modified LASSO penalty
and fixed order q. Their algorithm implicitly suggests that the correct order of AR error terms
could be identified by setting some of their coefficients to zero if q is chosen larger than the true order.
Instead, the proposed computational algorithm includes the procedure of explicitly selecting
the order q as well as important sets of variables. For a simple computation, we do not penalize the
autoregression coefficients φ in (3.1) because selecting AR error terms is relatively less important.
Therefore, we minimize the following objective function:
$$Q_T(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|). \qquad (3.2)$$
One can always include $T\sum_{j=1}^{q} p_{\delta_j}(|\phi_j|)$ back in the algorithm if necessary.
In what follows we introduce an algorithm for the three penalized methods (adaptive LASSO,
SCAD and bridge). For the adaptive LASSO, SCAD and bridge with 0 < γ ≤ 1, the minimization
problem in (3.2) is either not differentiable at the origin or concave, so we use the local quadratic
approximation proposed by Fan and Li (2001) to circumvent this issue. Under some mild conditions,
the penalty term can be locally approximated at β^{(0)} = (β_1^{(0)}, ..., β_p^{(0)})′ by a quadratic function:
$$p_{\lambda_j}(|\beta_j|) \approx p_{\lambda_j}(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\big(\beta_j^2 - \beta_j^{(0)2}\big) = \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^2 + p_{\lambda_j}(|\beta_j^{(0)}|) - \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^{(0)2}, \qquad (3.3)$$
where p′ is the derivative of the penalty function p. In (3.3), note that the last two constant terms
do not affect the minimization problem in (3.2). By using the local quadratic approximation
at β^{(0)}, the minimization problem near β^{(0)} can be reduced to a quadratic minimization problem:
$$\sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + \frac{T}{2}\sum_{j=1}^{p} \frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^2. \qquad (3.4)$$
Because there are both regression and autoregression parameters in equation (3.4), we solve it
iteratively by minimizing the following two objective functions:
$$\beta^{(k)} = \arg\min_{\beta}\; \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j^{(k-1)}(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} \frac{p_{\lambda_j}'(|\beta_j^{(k-1)}|)}{2|\beta_j^{(k-1)}|}\beta_j^2$$
with a fixed φ^{(k−1)}, and
$$\phi^{(k)} = \arg\min_{\phi}\; \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta^{(k)} - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta^{(k)})\Big\}^2$$
with a fixed β^{(k)}. The algorithm stops when there is little change in (β^{(k)}, φ^{(k)}), for example when
$$\max\{|\beta_1^{(k)} - \beta_1^{(k-1)}|, \ldots, |\beta_p^{(k)} - \beta_p^{(k-1)}|, |\phi_1^{(k)} - \phi_1^{(k-1)}|, \ldots, |\phi_q^{(k)} - \phi_q^{(k-1)}|\} < \epsilon,$$
where ε is a pre-selected positive value. In our numerical analysis, ε = 10^{−3} is used. During the iteration, if |β_j^{(k)}| < 10^{−5}, we delete
the jth variable to make the algorithm stable and also exclude it from the final model.
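To make the alternating updates concrete, the following Python sketch (our own illustration; the array conventions, the ridge-like solve implied by the local quadratic approximation, and the function name are assumptions, not code from the paper) performs one pass of the two minimizations for a generic penalty derivative pprime.

```python
import numpy as np

def lqa_step(Y, X, beta, phi, pprime, eps_del=1e-5):
    """One alternating update of (beta, phi) using the local quadratic approximation.

    Y: (T0,) response; X: (T0, p) covariates; beta: (p,); phi: (q,);
    pprime(b): derivative of the penalty evaluated at |b| (vectorized).
    """
    T0, p = X.shape
    q = len(phi)
    T = T0 - q
    # Filtered response and design: Y_t - sum_j phi_j Y_{t-j}, x_t - sum_j phi_j x_{t-j}
    Yf = Y[q:] - sum(phi[j] * Y[q - 1 - j:T0 - 1 - j] for j in range(q))
    Xf = X[q:] - sum(phi[j] * X[q - 1 - j:T0 - 1 - j] for j in range(q))
    # beta update: the LQA turns the penalty into a quadratic (ridge-like) term
    keep = np.abs(beta) >= eps_del                    # drop coefficients shrunk near zero
    D = np.diag(T * pprime(np.abs(beta[keep])) / (2 * np.abs(beta[keep])))
    beta_new = np.zeros(p)
    beta_new[keep] = np.linalg.solve(Xf[:, keep].T @ Xf[:, keep] + D,
                                     Xf[:, keep].T @ Yf)
    # phi update: least squares of the current residuals on their own q lags
    e = Y - X @ beta_new
    W = np.column_stack([e[q - 1 - j:T0 - 1 - j] for j in range(q)])
    phi_new = np.linalg.lstsq(W, e[q:], rcond=None)[0]
    return beta_new, phi_new
```

In practice the step would be repeated until the maximum change in (β, φ) falls below the tolerance ε = 10^{−3} described above.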
We use the ordinary least squares estimator, with no consideration of the autocorrelation structure,
as an initial estimator for the regression coefficient vector β:
$$\beta^{(0)} = (X'X)^{-1}X'Y.$$
Then we compute the residuals ê_t = Y_t − x_t′β^{(0)} and obtain the initial estimator for the autore-
gression coefficient vector φ as follows:
$$\phi^{(0)} = (W'W)^{-1}W'V,$$
where V = (ê_{q+1}, ..., ê_{T_0})′ and W is a T × q matrix whose tth row is (ê_{t+q−1}, ..., ê_t).
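A matching initialization, again a sketch under the same hypothetical array conventions as above:

```python
import numpy as np

def initial_estimates(Y, X, q):
    """OLS fit ignoring autocorrelation, then AR(q) least squares on the residuals."""
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]      # beta^(0) = (X'X)^{-1} X'Y
    e = Y - X @ beta0                                 # residuals e_t
    T0 = len(Y)
    W = np.column_stack([e[q - 1 - j:T0 - 1 - j] for j in range(q)])  # lagged residuals
    phi0 = np.linalg.lstsq(W, e[q:], rcond=None)[0]   # phi^(0) = (W'W)^{-1} W'V
    return beta0, phi0
```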
After obtaining these initial estimators, we select the tuning parameter λ for the adaptive
LASSO (we use λ_j = λ/|β̂_j^{OLS}| as suggested by Zou (2006)) and the SCAD, or (λ, γ) for the bridge,
together with the autoregressive order q. As in Wang et al. (2007), we apply the BIC:
$$\mathrm{BIC} = \log(\hat\sigma^2) + \mathrm{df}\,\frac{\log(T)}{T},$$
where
$$\hat\sigma^2 = \frac{1}{T}\sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\hat\beta - \sum_{j=1}^{q} \hat\phi_j(Y_{t-j} - x_{t-j}'\hat\beta)\Big\}^2,$$
and df is the number of nonzero coefficients of (β̂, φ̂). Therefore, we choose the tuning parameter(s)
and the order q that minimize the BIC.
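Schematically, the selection step can be organized as a grid search; the sketch below is our own illustration and assumes a hypothetical routine fit_penalized_regar(Y, X, q, lam, gamma) that runs the iterative algorithm above and returns (beta_hat, phi_hat).

```python
import numpy as np
from itertools import product

def bic_regar(Y, X, beta, phi, q):
    """BIC = log(sigma2_hat) + df * log(T) / T for a fitted REGAR model."""
    T0 = len(Y)
    T = T0 - q
    resid = (Y[q:] - X[q:] @ beta
             - sum(phi[j] * (Y[q - 1 - j:T0 - 1 - j] - X[q - 1 - j:T0 - 1 - j] @ beta)
                   for j in range(q)))
    sigma2 = np.sum(resid ** 2) / T
    df = np.count_nonzero(beta) + np.count_nonzero(phi)
    return np.log(sigma2) + df * np.log(T) / T

def select_by_bic(Y, X, fit, lams, gammas=(None,), qs=(1, 2, 3, 4)):
    """Pick (lambda, gamma, q) minimizing the BIC; `fit` is the user-supplied fitting routine."""
    best = None
    for lam, gamma, q in product(lams, gammas, qs):   # gamma is only used by the bridge
        beta, phi = fit(Y, X, q=q, lam=lam, gamma=gamma)
        bic = bic_regar(Y, X, beta, phi, q)
        if best is None or bic < best[0]:
            best = (bic, lam, gamma, q, beta, phi)
    return best
```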
3.2 Asymptotic properties
In this subsection we study asymptotic behaviors of the bridge and SCAD estimators assuming
that the AR order q is properly chosen. See Wang et al. (2007) for the properties of the adaptive
LASSO. Since we do not penalize the autoregression coefficients φ, we focus on the behavior of the
regression coefficient estimator β̂ obtained from (3.2). All proofs are deferred to Section 5.
Suppose that β_0 = (β_{01}, ..., β_{0p})′ and φ_0 = (φ_{01}, ..., φ_{0q})′ are the true coefficient vectors and there
are p_0 ≤ p nonzero regression coefficients. We further assume the following conditions.

(i) The sequence {x_t} is independent of {ε_t}.

(ii) All roots of the polynomial $1 - \sum_{j=1}^{q} \phi_{0j}z^j$ are outside the unit circle.

(iii) The error ε_t has a finite fourth-order moment.

(iv) The covariate x_t is strictly stationary and ergodic with a finite second-order moment. Furthermore, the following matrix is positive definite:
$$B = E\Big[\Big(x_t - \sum_{j=1}^{q} \phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q} \phi_{0j}x_{t-j}\Big)'\Big].$$
Note that λ_1 = ··· = λ_p = λ_T for the penalty term since we do not consider the adaptive LASSO
in this subsection. Define Σ = diag(B, C), where C = (ξ_{|i−j|}) and ξ_k = E(e_t e_{t+k}). Let Σ_β be the
submatrix of Σ corresponding to β.
Theorem 1. For some λ_0 ≥ 0, assume that λ_T T^{1−γ/2} → λ_0, where 0 < γ < 1, for the bridge
and λ_T√T → λ_0 for the SCAD. Then, under the conditions (i)-(iv),
$$\sqrt{T}(\hat\beta - \beta_0) \overset{d}{\to} \arg\min(V),$$
where
$$V(u) = -2u'w + u'\Sigma_\beta u + \lambda_0\sum_{j=1}^{p} |u_j|^{\gamma}\, I(\beta_j = 0)$$
for the bridge with 0 < γ < 1 and
$$V(u) = -2u'w + u'\Sigma_\beta u + \lambda_0\sum_{j=1}^{p} \big[u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \ne 0) + |u_j|\,I(\beta_j = 0)\big]$$
for the SCAD. Here, w ∼ N(0, σ²Σ_β).
Theorem 1 describes the limiting distribution of β̂, similarly to Knight and Fu (2000).
It implies that λ_T cannot grow faster than T^{−1/2} for the SCAD. For the bridge with 0 < γ < 1,
one can estimate the nonzero regression coefficients at the usual rate without asymptotic bias while
shrinking the estimates of the zero regression coefficients to zero with positive probability.
The following lemma shows the √T-consistency of the two estimators.

Lemma 1. Assume that a_T = max{p′_{λ_T}(|β_{0j}|) : β_{0j} ≠ 0} = o(1) as T → ∞. If max{p′′_{λ_T}(|β_{0j}|) :
β_{0j} ≠ 0} → 0, then, under the conditions (i)-(iv), there is a local minimizer (β̂, φ̂) of Q_T(β, φ)
such that
$$\|\hat\beta - \beta_0\| = O_p(T^{-1/2} + a_T).$$
The next theorem shows that the bridge and SCAD estimators possess selection consistency
for the insignificant regression coefficients. Let β_0 = (β′_{01}, β′_{02})′ and assume that
β_{02} = 0 without loss of generality. Denote the corresponding estimators by β̂_1 and β̂_2, respectively.

Theorem 2. Assume that
$$\liminf_{T\to\infty}\ \liminf_{\theta\to 0^{+}}\ p_{\lambda_T}'(\theta)/\lambda_T > 0, \qquad (3.5)$$
and ∥β̂ − β_0∥ = O_p(T^{−1/2}). If λ_T → 0 and √Tλ_T → ∞ as T → ∞, then
$$P(\hat\beta_2 = 0) \to 1.$$
Finally, the following theorem describes the asymptotic distribution of the bridge and SCAD
estimators. It implies that the estimators can be as efficient as the oracle estimator in an asymptotic
sense.

Theorem 3. Assume that the penalty function p_{λ_T}(|θ|) satisfies condition (3.5). If λ_T → 0 and
√Tλ_T → ∞ as T → ∞, then
$$\sqrt{T}(\hat\beta_1 - \beta_{01}) \overset{d}{\to} N(0, \sigma^2\Sigma_{0\beta}^{-1}),$$
where Σ_{0β} is the submatrix of Σ_β corresponding to β_{01}.
Remark The algorithm in Section 3.1 divides the original objective function (3.2) into two parts to
estimate the coefficients of the regression and AR error terms iteratively. By using an argument similar
to that in Wang et al. (2007), it can be shown that this iterative process leads to a unique local minimizer.
3.3 Extension to ARCH models
In this subsection, we briefly discuss how penalized regression can be developed for ARCH models.
Let {ε_t} be a sequence of random variables following an ARCH(r) model:
$$\varepsilon_t = z_t\sigma_t(\theta), \qquad \sigma_t^2(\theta) = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2.$$
Here, θ = (c_0, ..., c_r)′ denotes the ARCH model parameter, whose parameter space is given as
(0, ∞) × [0, ∞)^r. Also, z_t, t = 1, ..., T, is a sequence of independent and identically distributed
random variables with zero mean and unit variance, and r represents the order of an ARCH model.
First, we consider the parameter estimation for the pure ARCH models. If r is known in advance,
one may consider the least squares estimator for θ, obtained from the AR(r) representation for ε_t²:
$$\varepsilon_t^2 = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2 + u_t, \qquad (3.6)$$
where u_t = ε_t² − σ_t²(θ) is a martingale difference. However, it is more likely that the ARCH
order r is unknown in practice. In such a case, traditional model selection criteria, including the AIC
and BIC, can be employed to select r, since ARCH models can be expressed as AR models via
(3.6). It is worth pointing out that this selection may also be carried out within the penalized
regression framework, using the adaptive LASSO, bridge, or SCAD penalties considered in this paper.
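To illustrate this AR representation, a least squares fit of the squared series together with a BIC search over r might look as follows (our own sketch; the function names and the BIC form used here are assumptions, not part of the paper).

```python
import numpy as np

def fit_arch_ls(eps, r):
    """Least squares for the AR(r) representation eps_t^2 = c0 + sum_i c_i eps_{t-i}^2 + u_t."""
    s = eps ** 2
    n = len(s)
    Z = np.column_stack([np.ones(n - r)] + [s[r - i:n - i] for i in range(1, r + 1)])
    theta = np.linalg.lstsq(Z, s[r:], rcond=None)[0]   # (c0, c1, ..., cr)
    resid = s[r:] - Z @ theta
    return theta, resid

def select_arch_order(eps, max_r=5):
    """Choose the ARCH order r by a BIC computed from the squared-series regression."""
    best = None
    for r in range(1, max_r + 1):
        theta, resid = fit_arch_ls(eps, r)
        n = len(resid)
        bic = np.log(np.mean(resid ** 2)) + (r + 1) * np.log(n) / n
        if best is None or bic < best[0]:
            best = (bic, r, theta)
    return best
```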
Second, we consider a linear regression model with AR-ARCH error terms, where {et} in (2.1)
is the sequence of random variables following an AR(q)-ARCH(r) model:
$$e_t = \sum_{i=1}^{q} \phi_i e_{t-i} + \varepsilon_t, \qquad \varepsilon_t = z_t\sigma_t(\theta), \qquad \sigma_t^2(\theta) = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2.$$
Note that this is a general extension of the REGAR model. Then, the objective function in (3.2)
can be modified as follows:
$$Q_T(\beta, \phi, \theta) = \sum_{t=q+1}^{T_0} \left\{\frac{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)}{\sigma_t(\theta)}\right\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
Since the conditional variance σ_t²(θ) is involved in this objective function, the proposed algorithm
in Section 3.1 may not be appropriate. We leave the development of a new algorithm for order
selection and parameter estimation in this model, and the investigation of its asymptotic properties,
as future work.
4 Numerical Analysis
4.1 Simulation
We present a Monte Carlo simulation study to evaluate the finite sample performance. We compare
the least squares (LS) without penalty, adaptive LASSO, SCAD and bridge under various AR error
models. The simulation settings are as follows.
1. Setting I: We generate the data from the model (2.1), where β = (3, 1.5, 0, 0, 2, 0, 0, 0)′ and e_t
follows the AR(q) process in (2.2). The covariates x_t = (x_{t1}, ..., x_{t8})′ are independently
generated from the multivariate normal distribution with mean 0_{8×1}, and the pairwise cor-
relation between x_{t,j1} and x_{t,j2} is 0.5^{|j1−j2|}. In the simulation, q = 1, 2, 6, σ = 1, 3, and two
sample sizes (T_0 = 100, 300) are considered. We use φ = 0.9 for q = 1, φ = (0.4, 0.4)′ for
q = 2, and φ = (0.4, 0.2, 0.1, 0.05, 0.025, 0.0125)′ for q = 6. A similar setting is considered by
Wang et al. (2007), but we add a higher order case.
2. Setting II: We generate the data from the model (2.1), where the regression coefficients are
set to be
$$\beta = (\underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10}, \underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10})'.$$
In addition, we assume the same structure as in Setting I for the error term e_t. The covariates
x_t = (x_{t1}, ..., x_{t40})′ are independently generated from the multivariate normal distribution
with mean 0_{40×1}, and the pairwise correlation between x_{t,j1} and x_{t,j2} is 0.5^{|j1−j2|}. In this
setting, q = 1, 2, σ = 1, 3, and two sample sizes (T_0 = 100, 300) are considered. We use
φ = 0.9 for q = 1 and φ = (0.4, 0.4)′ for q = 2.
3. Setting III: We consider a seasonal AR model with order q; that is, the error term e_t follows
the autoregressive process
$$e_t = \phi_1 e_{t-s} + \phi_2 e_{t-2s} + \cdots + \phi_q e_{t-qs} + \varepsilon_t.$$
In this simulation, q = 2 with φ = (0.4, 0.4)′ and s = 4 (quarterly) are considered. The other
setups are the same as those in Setting I.
In the three settings, the ranges of the tuning parameters are λ = 2^{k−15}, k = 1, 2, ..., 20,
for the three penalized methods, γ = 0.1, 0.4, 0.7, 1 for the bridge, and q = 1, 2, 3, 4, except for the
higher order case in Setting I, where we consider q = 1, 2, ..., 8. We choose the value of λ (for the
adaptive LASSO and SCAD) or the combination (λ, γ) (for the bridge) and the AR order q
that produce the lowest BIC. For each setting, we repeat the simulation 100 times and compare the average
model error ME = (β̂ − β)′E(xx′)(β̂ − β), the average numbers of correct and incorrect zeros in
β̂, and the number of correctly estimated autoregressive orders.
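For reference, one replication of Setting I can be generated as in the following sketch (our own code; the burn-in length and the use of the known covariate covariance matrix in the model-error formula are our assumptions).

```python
import numpy as np

def simulate_setting1(T0, phi, sigma, rng):
    """Generate (Y, X) from model (2.1) with AR(q) errors as in Setting I."""
    beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)
    p, q = len(beta), len(phi)
    cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr 0.5^{|j1-j2|}
    X = rng.multivariate_normal(np.zeros(p), cov, size=T0)
    burn = 200                                         # burn-in so e_t is close to stationarity
    eps = sigma * rng.standard_normal(T0 + burn)
    e = np.zeros(T0 + burn)
    for t in range(q, T0 + burn):
        e[t] = np.dot(phi, e[t - q:t][::-1]) + eps[t]  # AR(q) recursion
    Y = X @ beta + e[burn:]
    return Y, X, beta, cov

def model_error(beta_hat, beta, cov):
    """ME = (beta_hat - beta)' E(x x') (beta_hat - beta)."""
    d = beta_hat - beta
    return float(d @ cov @ d)

# Example: one replication with q = 1, phi = 0.9, sigma = 1, T0 = 100
rng = np.random.default_rng(0)
Y, X, beta, cov = simulate_setting1(100, np.array([0.9]), 1.0, rng)
```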
Table 1: Results of Setting I (q = 1)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.053(0.003) 0 0 94 0.449(0.025) 0 0 83
ALASSO 0.026(0.002) 4.73 0 88 0.277(0.023) 4.06 0 86
SCAD 0.026(0.002) 4.45 0 84 0.282(0.024) 3.80 0 66
BRIDGE 0.025(0.002) 4.75 0 92 0.219(0.020) 4.81 0 88
T0 = 300 σ = 1 σ = 3
LS 0.016(0.001) 0 0 95 0.148(0.009) 0 0 94
ALASSO 0.009(0.001) 4.67 0 96 0.097(0.008) 4.33 0 93
SCAD 0.008(0.001) 4.52 0 73 0.092(0.009) 3.99 0 82
BRIDGE 0.008(0.001) 4.87 0 96 0.065(0.007) 4.90 0 95
Tables 1, 2, and 3 report the results of Setting I with q = 1 and 2, and 6, respectively. In terms
of ME, the bridge shows the best performance in all cases while the LS is the worst performer. The
SCAD performs well when the noise level is small. It can be seen that the adaptive LASSO also
Table 2: Results of Setting I (q = 2)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.071(0.004) 0 0 84 0.611(0.033) 0 0 82
ALASSO 0.034(0.003) 4.21 0 90 0.391(0.034) 3.94 0 85
SCAD 0.030(0.003) 4.36 0 82 0.455(0.037) 3.71 0 78
BRIDGE 0.032(0.003) 4.78 0 90 0.284(0.027) 4.84 0 85
T0 = 300 σ = 1 σ = 3
LS 0.021(0.001) 0 0 92 0.176(0.009) 0 0 99
ALASSO 0.014(0.001) 4.68 0 95 0.119(0.011) 4.29 0 95
SCAD 0.009(0.001) 4.54 0 86 0.120(0.012) 4.24 0 91
BRIDGE 0.010(0.001) 4.92 0 94 0.080(0.008) 4.96 0 98
Table 3: Results of Setting I (q = 6)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.103(0.006) 0 0 48 0.929(0.056) 0 0 48
ALASSO 0.060(0.005) 4.01 0 52 0.650(0.054) 3.71 0.01 49
SCAD 0.044(0.005) 4.24 0 56 0.627(0.056) 3.83 0.01 49
BRIDGE 0.050(0.005) 4.66 0 53 0.525(0.054) 4.64 0.04 53
T0 = 300 σ = 1 σ = 3
LS 0.025(0.001) 0 0 94 0.221(0.011) 0 0 94
ALASSO 0.013(0.001) 4.62 0 92 0.145(0.012) 4.36 0 90
SCAD 0.010(0.001) 4.54 0 91 0.149(0.012) 4.45 0 88
BRIDGE 0.010(0.001) 4.91 0 94 0.093(0.008) 4.91 0 94
performs competitively. The bridge method correctly selects important variables corresponding to
nonzero coefficients, and sets most of the true zero coefficients to zero; its average numbers are closest
to the true value of 5 among the methods. The SCAD and adaptive LASSO select more variables,
including noninformative ones, than the bridge, especially at the higher noise level, which may
explain their worse performance in terms of ME. All the methods select the AR
order correctly in most cases for the lower order settings, but the bridge shows a slightly better result.
For the higher order case q = 6, which is more challenging, all the estimators produce higher ME
values and lower numbers of correct zero coefficients and correctly selected AR orders compared to the lower order
cases, particularly when the sample size is small. However, the main findings remain the same.
Table 4: Results of Setting II (q = 1)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.860(0.025) 0 0 11 9.067(0.516) 0 0 8
ALASSO 0.307(0.029) 16.08 0 52 4.922(0.406) 13.42 0.06 39
SCAD 0.304(0.024) 17.45 0 47 4.591(0.415) 13.75 0.05 25
BRIDGE 0.289(0.020) 17.58 0 55 4.169(0.297) 15.64 0.06 38
T0 = 300 σ = 1 σ = 3
LS 0.095(0.002) 0 0 78 0.837(0.021) 0 0 83
ALASSO 0.049(0.002) 19.08 0 91 0.461(0.017) 18.48 0 87
SCAD 0.043(0.002) 19.96 0 89 0.483(0.020) 19.21 0 68
BRIDGE 0.045(0.002) 19.74 0 90 0.411(0.014) 19.70 0 91
In Tables 4 and 5 the results of Setting II with q = 1 and 2 are shown. Note that there are more
variables in Setting II than Setting I. In terms of ME, the three penalized methods perform much
better than the LS. The bridge is the best or second best performer in all cases and the SCAD is
comparable to the bridge in most cases, especially at the smaller noise level. In terms of variable selection,
the bridge again shows the best performance by correctly selecting relevant variables and setting
more coefficients to zero than the others, close to the true number of 20 zero coefficients. The selection
Table 5: Results of Setting II (q = 2)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.937(0.048) 0 0 14 8.729(0.412) 0 0 17
ALASSO 0.359(0.036) 17.38 0 53 4.576(0.367) 15.13 0.05 38
SCAD 0.347(0.038) 18.40 0 54 4.036(0.328) 14.99 0.06 34
BRIDGE 0.388(0.031) 17.98 0 46 3.797(0.311) 17.86 0.09 48
T0 = 300 σ = 1 σ = 3
LS 0.121(0.003) 0 0 91 1.102(0.029) 0 0 92
ALASSO 0.064(0.003) 19.07 0 91 0.650(0.023) 18.26 0 84
SCAD 0.055(0.002) 19.98 0 96 0.806(0.032) 19.38 0 84
BRIDGE 0.060(0.002) 19.65 0 96 0.560(0.021) 19.62 0 94
of the AR order q in this setting becomes more challenging, and none of the methods works well
for T0 = 100. For T0 = 300, the adaptive LASSO and bridge achieve about 90% accuracy.
Table 6 shows the results of Setting III with a seasonal component. It can be seen that the
results are similar to those of Setting I with q = 2; the penalized regression methods perform better
than the LS and the bridge performs the best.
In summary, the three penalized regression methods provide more accurate results compared
to the least squares in terms of prediction, variable selection, and AR order selection. Among
the three, the bridge has a slight edge over the other two because it has a more flexible penalty
form and it is adaptive in the sense that the order of the penalty function, γ, is estimated from
the data. Also, Park and Yoon (2011) conduct a thorough simulation study for the comparison of
several penalized methods in a regression problem and show the superior performance of the bridge
estimators in various settings.
Table 6: Results of Setting III
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.083(0.005) 0 0 73 0.746(0.040) 0 0 73
ALASSO 0.043(0.004) 4.10 0 73 0.484(0.044) 3.88 0.02 74
SCAD 0.038(0.004) 4.35 0 69 0.470(0.042) 3.72 0.01 69
BRIDGE 0.038(0.003) 4.73 0 77 0.399(0.042) 4.70 0.03 76
T0 = 300 σ = 1 σ = 3
LS 0.021(0.001) 0 0 90 0.188(0.001) 0 0 90
ALASSO 0.013(0.001) 4.71 0 91 0.126(0.012) 4.41 0 91
SCAD 0.009(0.001) 4.46 0 80 0.120(0.010) 4.27 0 87
BRIDGE 0.009(0.001) 4.92 0 90 0.083(0.008) 4.92 0 90
4.2 Real data
In this subsection we analyze the data set from Ramanathan (1998), the consumption of electricity
by residential customers served by San Diego Gas and Electric Company. This data set contains
87 quarterly observations from the second quarter of 1972 through the fourth quarter of 1993. The
response variable is the electricity consumption, measured by the logarithm of the kWh sales
per residential customer. The independent variables are the per-capita income (LY), the price of
electricity (LPRICE), cooling degree days (CDD), and heating degree days (HDD).
The basic model considered in Ramanathan (1998) is given as:
LKWH = β0 + β1LY + β2LPRICE + β3CDD + β4HDD + et.
The expected signs for the β’s (except β0) are as follows (Ramanathan, 1998):
β1 > 0, β2 < 0, β3 > 0, β4 > 0.
We first apply the OLS (ordinary least squares) method and report the estimated coefficients
in the second column of Table 7. The signs of LPRICE, CDD, and HDD are consistent with the
expected ones, but LY has the reverse sign. This unexpected result may be due to ignoring the
autocorrelation structure of the error term, and hence Ramanathan (1998) suggests the REGAR
model. We next consider the REGAR model with autoregressive orders q = 1, 2, 3, 4 and select
one of these orders which minimizes BIC. We again compare the LS, adaptive LASSO, SCAD and
bridge as in Section 4.1.
Table 7: The estimated coefficients and AR orders
Variable OLS LS ALASSO SCAD BRIDGE
LY -0.00127 0.10256 0.04304 - -
LPRICE -0.01807 -0.09824 -0.09453 -0.09775 -0.09772
CDD 0.06374 0.00028 0.00028 0.00028 0.00028
HDD 0.09849 0.00023 0.00023 0.00023 0.00023
AR order - 4 4 4 4
Table 7 further reports the estimated coefficients and the selected autoregressive orders for the
four methods. When the LS and adaptive LASSO are used, the sign of
β1 is positive, as Ramanathan (1998) suggests. For the SCAD and bridge, β1 is exactly
zero, which suggests that the per-capita income (LY) does not contribute to the model. For the
other variables, all the estimators produce similar coefficient values. The autoregressive order q is
consistently estimated as 4 by the four methods.
We next compare the performances of the four methods using the one-step ahead forecast. We
estimate the regression and autoregressive coefficients using the previous 65 observations, and then
forecast the electricity consumption for the next quarter, as sketched below.
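A minimal sketch of the one-step-ahead REGAR forecast used in this comparison (our own illustration; it assumes fitted coefficient arrays beta_hat and phi_hat and the same array conventions as in Section 3.1):

```python
import numpy as np

def forecast_one_step(Y, X, x_next, beta_hat, phi_hat):
    """Forecast Y_{T+1} as x'_{T+1} beta + sum_j phi_j (Y_{T+1-j} - x'_{T+1-j} beta)."""
    q = len(phi_hat)
    resid = Y - X @ beta_hat                 # in-sample residuals e_t
    return x_next @ beta_hat + np.dot(phi_hat, resid[-q:][::-1])
```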
Table 8 summarizes the mean, standard error (s.e.) of the one-step forecast errors, and the
number of forecasts included in the 95% confidence intervals for the four methods. While the four
methods perform similarly, the SCAD and bridge produce slightly lower forecast errors and
have more forecasts included in the 95% confidence intervals.
Table 8: The one-step ahead forecast errors and the number of forecasts included in the 95%
confidence intervals
LS ALASSO SCAD BRIDGE
mean 0.00263 0.00259 0.00252 0.00247
s.e. 0.00093 0.00089 0.00091 0.00088
No. 16/22 15/22 17/22 17/22
5 Proofs
In this section, we provide the proofs of the theorems in Section 3.2. We follow steps similar to
those given in Wang et al. (2007).
Define
$$L_T(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2,$$
and then
$$Q_T(\beta, \phi) = L_T(\beta, \phi) + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
Since we focus on β, we omit φ from the notation, that is,
$$Q_T(\beta) = L_T(\beta) + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
5.1 Proof of Theorem 1
Define
$$V_T(u) = Q_T(\beta_0 + u/\sqrt{T}) - Q_T(\beta_0) = L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) + T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big].$$
It can be shown from Knight and Fu (2000) that
$$T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big] \to \lambda_0\sum_{j=1}^{p} |u_j|^{\gamma}\, I(\beta_j = 0)$$
for the bridge and
$$T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big] \to \lambda_0\sum_{j=1}^{p}\big[u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \ne 0) + |u_j|\,I(\beta_j = 0)\big]$$
for the SCAD. Also, taking an approach similar to that of Wang et al. (2007),
$$L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) = \sum_{t}\Big[Y_t - x_t'(\beta_0 + u/\sqrt{T}) - \sum_{j=1}^{q}\phi_{0j}\big\{Y_{t-j} - x_{t-j}'(\beta_0 + u/\sqrt{T})\big\}\Big]^2 - \sum_{t}\varepsilon_t^2$$
$$= \sum_{t}\Big[\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big]^2 - \sum_{t}\varepsilon_t^2 = S_1 + S_2,$$
where
$$S_1 = -2T^{-1/2}u'\sum_{t}\varepsilon_t\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big), \qquad S_2 = T^{-1}u'\sum_{t}\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)'u.$$
Using the martingale central limit theorem and the ergodic theorem, it can be shown that
$$L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) \overset{d}{\to} -2u'w + u'\Sigma_\beta u.$$
To prove that
$$\arg\min(V_T(u)) \overset{d}{\to} \arg\min(V),$$
it is sufficient to show that arg min(V_T) = O_p(1) (Kim and Pollard, 1990). Note that for the bridge,
$$V_T(u) \ge \sum_{t}\Big[\Big\{\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big\}^2 - \varepsilon_t^2\Big] - T\sum_{j=1}^{p} p_{\lambda_T}(|u_j/\sqrt{T}|)$$
$$\ge \sum_{t}\Big[\Big\{\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big\}^2 - \varepsilon_t^2\Big] - (\lambda_0 + \epsilon_0)\sum_{j=1}^{p}|u_j|^{\gamma} + o_p(1) \equiv \widetilde{V}_T(u),$$
where ε_0 is a positive constant. Using the same argument as in the proof of Theorem 3 in Knight and
Fu (2000), the conclusion follows. One can use a similar argument for the SCAD.
5.2 Proof of Lemma 1
Let α_T = T^{−1/2} + a_T and let {β_0 + α_T u : ∥u∥ ≤ C} be the ball around β_0. As in Fan and Li (2001),
we will show that for any given ε > 0, there exists a large constant C such that
$$P\Big\{\inf_{\|u\| = C} Q_T(\beta_0 + \alpha_T u) > Q_T(\beta_0)\Big\} \ge 1 - \epsilon, \qquad (5.1)$$
which implies that there exists a local minimum in the ball with probability at least 1 − ε.

Using p_{λ_T}(0) = 0, we have
$$D_T(u) \equiv Q_T(\beta_0 + \alpha_T u) - Q_T(\beta_0) \ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) + T\sum_{j=1}^{p_0}\big\{p_{\lambda_T}(|\beta_{0j} + \alpha_T u_j|) - p_{\lambda_T}(|\beta_{0j}|)\big\}$$
$$\ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) - \sum_{j=1}^{p_0}\big[T\alpha_T\, p_{\lambda_T}'(|\beta_{0j}|)\,\mathrm{sgn}(\beta_{0j})\,u_j + T\alpha_T^2\, p_{\lambda_T}''(|\beta_{0j}|)\,u_j^2\{1 + o(1)\}\big]$$
$$\ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) - \sqrt{p_0}\,T\alpha_T a_T\|u\| - T\alpha_T^2\max\{p_{\lambda_T}''(|\beta_{0j}|) : \beta_{0j} \ne 0\}\|u\|^2. \qquad (5.2)$$
Here,
$$L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) = \sum_{t}\Big[\varepsilon_t - \alpha_T u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big]^2 - \sum_{t}\varepsilon_t^2 = (\mathrm{I}) + (\mathrm{II}),$$
where
$$(\mathrm{I}) = \alpha_T^2 u'\Big[\sum_{t}\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)'\Big]u, \qquad (\mathrm{II}) = -2\alpha_T u'\sum_{t}\varepsilon_t\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big).$$
From the proof of Lemma 1 in Wang et al. (2007), we have
$$(\mathrm{I}) = T\alpha_T^2\{u'\Sigma_\beta u + o_p(1)\}, \qquad (\mathrm{II}) = u'O_p(T\alpha_T^2),$$
and since (I) dominates (II) and the last term in (5.2) for a sufficiently large C, (5.1) follows. This shows that there exists
a local minimizer such that ∥β̂ − β_0∥ = O_p(α_T).
5.3 Proof of Theorem 2
For j = p_0 + 1, ..., p, the local minimizer β̂ satisfies
$$\frac{\partial Q_T(\hat\beta)}{\partial \beta_j} = \frac{\partial L_T(\hat\beta)}{\partial \beta_j} + T\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j) = \frac{\partial L_T(\beta_0)}{\partial \beta_j} + T\Sigma_j(\hat\beta - \beta_0)\{1 + o_p(1)\} + T\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j)$$
$$= T\lambda_T\Big\{O_p\big(T^{-1/2}/\lambda_T\big) + \lambda_T^{-1} p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j)\Big\},$$
where Σ_j denotes the jth row of Σ_β. The last equality follows from the fact that the first and second
terms are O_p(T^{1/2}) by the central limit theorem and the condition in Theorem 2, respectively. Since
lim inf_{T→∞} lim inf_{θ→0+} λ_T^{−1}p′_{λ_T}(θ) > 0 and T^{−1/2}/λ_T → 0, the sign of the derivative is determined
by that of β̂_j for β̂_j ≠ 0. Hence, β̂_j = 0 in probability for j = p_0 + 1, ..., p.
5.4 Proof of Theorem 3
By Lemma 1 and Theorem 2, P(β̂_2 = 0) → 1. Therefore, the minimizer of Q_T(β) is the same as
that of Q_T(β_1) with probability tending to 1. This implies that β̂_1 satisfies the equation
$$\frac{\partial Q_T(\beta)}{\partial \beta_1}\bigg|_{\beta_1 = \hat\beta_1} = 0.$$
Note that β̂_1 is a consistent estimator by Lemma 1, and thus we have
$$0 = \frac{1}{\sqrt{T}}\frac{\partial L_T(\hat\beta_1)}{\partial \beta_1} + \sqrt{T}\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j) = \frac{1}{\sqrt{T}}\frac{\partial L_T(\beta_{01})}{\partial \beta_1} + \Sigma_{0\beta}\sqrt{T}(\hat\beta_1 - \beta_{01}) + o_p(1)$$
$$\quad + \sqrt{T}\Big(p_{\lambda_T}'(|\beta_{0j}|)\,\mathrm{sgn}(\beta_{0j}) + \{p_{\lambda_T}''(|\beta_{0j}|) + o_p(1)\}(\hat\beta_j - \beta_{0j})\Big).$$
By Slutsky's theorem and the central limit theorem, we have
$$\sqrt{T}(\hat\beta_1 - \beta_{01}) = -\frac{\Sigma_{0\beta}^{-1}}{\sqrt{T}}\frac{\partial L_T(\beta_{01})}{\partial \beta_1} + o_p(1) \overset{d}{\to} N(0, \sigma^2\Sigma_{0\beta}^{-1}).$$
References
Alquier, P. and Doukhan, P. (2011). Sparsity considerations for dependent variables. Electronic
Journal of Statistics, 5:750–774.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96:1348–1360.
Fan, J. and Li, R. (2002). Variable selection for Cox’s proportional hazards model and frailty
model. The Annals of Statistics, 30:74–99.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools.
Technometrics, 35:109–148.
Fu, W. J. (1998). Penalized regression: the Bridge versus the Lasso. Journal of Computational and
Graphical Statistics, 7:397–416.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, New York, second edition.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal
problems. Technometrics, 12:55–67.
Hsu, N.-J., Hung, H.-L., and Chang, Y.-M. (2008). Subset selection for vector autoregressive
processes using LASSO. Computational Statistics and Data Analysis, 52:3645–3647.
Huang, J., Horowitz, J. L., and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse
high-dimensional regression models. The Annals of Statistics, 36:587–613.
Kim, J. and Pollard, D. (1990). Cube root asymptotics. The Annals of Statistics, 18:191–219.
Knight, K. and Fu, W. J. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics,
28:1356–1378.
Liu, Y., Zhang, H., Park, C., and Ahn, J. (2007). Support vector machines with adaptive Lq
penalty. Computational Statistics and Data Analysis, 51:6380–6394.
Park, C. and Yoon, Y. J. (2011). Bridge regression: adaptivity and group selection. Journal of
Statistical Planning and Inference, 141:3506–3519.
Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Associ-
ation, 103:681–686.
Ramanathan, R. (1998). Introductory Econometrics with Applications. The Dryden Press, Harcourt
Brace College Publishers, Fort Worth.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B, 58:267–288.
Tsay, R. S. (1984). Regression models with time series errors. Journal of the American Statistical
Association, 79:118–124.
Wang, H., Li, G., and Tsai, C. (2007). Regression coefficient and autoregressive order shrinkage
and selection via the lasso. Journal of the Royal Statistical Society Series B, 69:63–78.
Wang, L., Li, H., and Huang, J. (2008). Variable selection in nonparametric varying-coefficient
models for analysis of repeated measurements. Journal of the American Statistical Association,
103:1556–1569.
Zhang, H., Ahn, J., Lin, X., and Park, C. (2006). Gene selection using support vector machines
with nonconvex penalty. Bioinformatics, 22:88–95.
Zhang, H. H. and Lu, W. (2007). Adaptive-LASSO for Cox’s proportional hazard model. Bio-
metrika, 94:691–703.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101:1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society Series B, 67:301–320.
Zou, H. and Yuan, M. (2008). The F∞-norm support vector machine. Statistica Sinica, 18:379–398.
Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parame-
ters. The Annals of Statistics, 37:1733–1751.