Penalized Regression Models with Autoregressive
Error Terms
Young Joo Yoon
Department of Applied Statistics, Daejeon University, Daejeon, 300-716, Korea
Cheolwoo Park
Department of Statistics, University of Georgia, Athens, GA 30602, USA
Taewook Lee
Department of Information Statistics, Hankuk University of Foreign Studies, Korea
February 22, 2012
Abstract
Penalized regression methods have recently gained enormous attention in statistics and the
field of machine learning due to their ability to reduce prediction error and identify
important variables at the same time. Numerous studies have been conducted on penalized re-
gression, but most of them are limited to the case where the data are independently observed. In
this paper, we study a variable selection problem in penalized regression models with autoregres-
sive error terms. We consider three estimators, adaptive LASSO (Least Absolute Shrinkage and
Selection Operator), bridge, and SCAD (Smoothly Clipped Absolute Deviation), and propose a
computational algorithm that enables us to select a relevant set of variables and also the order of
autoregressive error terms simultaneously. In addition, we provide their asymptotic properties
such as consistency, selection consistency, and asymptotic normality. The performances of the
three estimators are compared with one another using simulated and real examples.
Key words: Asymptotic normality, Autoregressive error models, Consistency, Oracle property,
Penalized regression, Variable selection.
1 Introduction
The high dimensional nature of many current data sets has brought statisticians’ attention to
variable selection methods over the years. Variable selection focuses on searching for the best set
of relevant variables to include in a model. It is a key way to reduce a large number of variables to
relatively few and produce a sparse model that predicts well.
In linear regression, subset selection and penalized methods are two popular variable selection
approaches. Subset selection methods, including forward selection, backward elimination, and step-
wise procedures, are practically useful, but they often show high variability and do not reduce the
prediction error of the full model (Hastie et al., 2009). Penalized regression methods such as ridge,
LASSO (Least Absolute Shrinkage and Selection Operator), SCAD (Smoothly Clipped Absolute
Deviation) and bridge operate as a continuous process and select variables and estimate coefficients simul-
taneously. As a result, they can be more stable than subset selection.
One of the recent popular penalized methods is the LASSO that uses the L1 penalty (Tibshirani,
1996). The LASSO shrinks the coefficients toward zero, which results in reducing the variance and
identifying important variables. Fan and Li (2001) develop a variable selection method based on
the SCAD penalty function. The estimator possesses the unbiasedness property unlike the LASSO.
It is also shown to have the oracle property; it works as well as if the correct submodel were known.
The adaptive LASSO, developed by Zou (2006) and Zhang and Lu (2007), permits different
weights for different parameters, and it is also shown to have the oracle property. Bridge regression
(Frank and Friedman, 1993) utilizes the Lγ (γ > 0) penalty, and thus it includes the ridge (Hoerl
and Kennard, 1970) (γ = 2) and the LASSO (γ = 1) as special cases. It is known that bridge
estimators produce sparse models when 0 < γ ≤ 1.
Other recent developments of penalized regression methods are rich. The elastic net (Zou and
Hastie, 2005) combines the L1 and L2 penalties and possesses a grouping effect, i.e. if there is a set
of variables among which the pairwise correlations are high, the elastic net groups the correlated
variables together. Park and Casella (2008) propose the Bayesian LASSO which provides interval
estimates that can guide variable selection. Zou and Yuan (2008) impose the F∞ norm on support
vector machines in the classification context. Zou and Zhang (2009) propose the adaptive elastic
net that combines the strengths of the quadratic regularization and the adaptively weighted LASSO
shrinkage.
While these methods are widely used in practice, most studies on their theoretical and empirical
properties have been done under the assumption that observations are independent of each other.
However, if the data are collected sequentially in time, the existing penalized methods may suffer
from the temporal correlation. Wang et al. (2007) consider linear regression with autoregressive
errors (REGAR) model (Tsay, 1984). They employ the modified LASSO-type penalty not only on
regression coefficients but also on autoregression coefficients, which results in selecting relevant co-
variates and some of AR error terms. Hsu et al. (2008) also develop a subset selection method using
the LASSO for vector autoregressive (VAR) processes. Recently, Alquier and Doukhan (2011) study
extensions of the LASSO and other L1-penalized methods to the case of dependent observations.
In this paper we also consider penalized regression in the REGAR model. The proposed ap-
proach offers two major contributions that differentiate it from the approach of Wang et al. (2007):
(i) Since the order of the AR error terms is not usually available in advance, we estimate it from the
data, which makes the proposed method readily applicable. Hence, the proposed computational
algorithm allows one to select an important set of variables and also the order of autoregressive
error terms efficiently; (ii) We construct three different penalized regression methods, adaptive
LASSO, SCAD and bridge, in the REGAR model. We study asymptotic properties of the three
methods and compare their finite sample performances in various numerical settings. According to
our numerical study, the bridge method shows superior performance.
The rest of the paper is organized as follows. Section 2 reviews the adaptive LASSO, SCAD,
and bridge estimators. In Section 3, we introduce a computational algorithm of penalized regression
with AR error terms for the three methods, and study asymptotic properties of the estimators, whose
proofs are provided in Section 5. We also discuss how penalized regression can be implemented
in ARCH models. Section 4 presents the comparison of the three estimators using simulated and
real examples.
2 Penalized regression
We consider a linear regression model with p predictors and T0 observations:
$$Y_t = x_t'\beta + e_t, \quad t = 1, \ldots, T_0, \qquad (2.1)$$
where β = (β_1, ..., β_p)′, x_t = (x_{t1}, ..., x_{tp})′, and the error term e_t follows the autoregressive process
$$e_t = \phi_1 e_{t-1} + \phi_2 e_{t-2} + \cdots + \phi_q e_{t-q} + \varepsilon_t, \qquad (2.2)$$
where φ = (φ_1, ..., φ_q)′ is the vector of autoregression coefficients and the ε_t are independent and identically
distributed random variables with mean 0 and variance σ². More conditions on e_t will be discussed
in the next section. We assume that the Y_t's are centered and the covariates x_t's are standardized,
that is,
$$\sum_{t=1}^{T_0} Y_t = 0, \qquad \sum_{t=1}^{T_0} x_{tj} = 0, \quad \text{and} \quad \frac{1}{T_0}\sum_{t=1}^{T_0} x_{tj}^2 = 1, \qquad j = 1, \ldots, p.$$
Penalized regression estimates β by minimizing the penalized least squares objective function:
$$Q_{T_0}^{I}(\beta) = \sum_{t=1}^{T_0} (Y_t - x_t'\beta)^2 + T_0 \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \qquad (2.3)$$
where p_{λ_j}(·) is a penalty function and the λ_j's are penalty parameters. In this paper, we consider three
forms of p_{λ_j}(·):

(i) (adaptive LASSO) p_{λ_j}(|β|) = λ_j|β|,

(ii) (SCAD)
$$p_{\lambda_j}(|\beta|) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\ -\dfrac{\beta^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases}$$

(iii) (bridge) p_{λ_j}(|β|) = λ|β|^γ.
Since Fan and Li (2001) show that the Bayes risks are not sensitive to the choice of a for the SCAD
penalty and a = 3.7 is a good choice for various problems, we also use the same a value in our
numerical examples. Note that the adaptive LASSO imposes a different amount of penalty on each
coefficient while the other two utilize the same parameter λ. If λj = λ in the adaptive LASSO, it
leads to the ordinary LASSO.
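For concreteness, the three penalties can be written as short vectorized functions; the following NumPy sketch is our own illustration (the function names and vectorized form are not from the paper), with the SCAD branch using a = 3.7 as recommended above.

```python
import numpy as np

def penalty_alasso(beta, lam_j):
    """Adaptive LASSO penalty: lambda_j * |beta|, where lambda_j may differ per coefficient."""
    return lam_j * np.abs(beta)

def penalty_scad(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), evaluated coordinate-wise."""
    b = np.abs(beta)
    small = lam * b                                       # |beta| <= lambda
    middle = -(b**2 - 2*a*lam*b + lam**2) / (2*(a - 1))   # lambda < |beta| <= a*lambda
    large = (a + 1) * lam**2 / 2                          # |beta| > a*lambda
    return np.where(b <= lam, small, np.where(b <= a*lam, middle, large))

def penalty_bridge(beta, lam, gamma):
    """Bridge penalty: lambda * |beta|^gamma; sparse solutions arise for 0 < gamma <= 1."""
    return lam * np.abs(beta)**gamma
```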
Zou (2006) shows that if the different tuning parameters λ_j are data-dependent and well cho-
sen, then the oracle property can be obtained for the adaptive LASSO. Zou (2006) suggests
λ_j = λ|β̂_j^{OLS}|^{−τ} as the penalty parameters and achieves better variable selection results than the
LASSO. The adaptive LASSO model can be extended to generalized linear models (Zou, 2006) and
Cox proportional hazards models (Zhang and Lu, 2007). Wang et al. (2007) propose adaptive
LASSO estimators for the REGAR model using λ_j = λ log(T_0)/(T_0|β̂_j^{OLS}|) as penalty parameters.
The adaptive LASSO with τ = 1 is equivalent to this modified method for models with independent
and identically distributed errors.
The SCAD penalty possesses the following three important properties: (i) unbiasedness for a
large true coefficient to avoid excessive estimation bias; (ii) sparsity (estimating a small coefficient
as zero) to reduce model complexity; and (iii) continuity to avoid unnecessary variation in model
prediction. The SCAD penalty has proved to be successful in many other statistical contexts such
as regression (Fan and Li, 2001), classification (Zhang et al., 2006), Cox model (Fan and Li, 2002),
and varying coefficient models (Wang et al., 2008).
The bridge estimator is a generalization of ridge and LASSO. It does variable selection when
0 < γ ≤ 1, and shrinks the coefficients when γ > 1. Frank and Friedman (1993) introduce bridge
regression, but do not solve for the estimator of bridge regression for any given γ > 0. Fu (1998)
studies the structure of bridge estimators and proposes a general algorithm to solve for γ ≥ 1.
Knight and Fu (2000) show asymptotic properties of bridge estimators with γ > 0 when p is fixed.
Huang et al. (2008) study asymptotic properties of bridge estimators in sparse, high dimensional,
linear regression models when the number of covariates may increase along with the sample size.
Liu et al. (2007) introduce an Lγ support vector machine algorithm which selects γ from the data.
Park and Yoon (2011) also consider an adaptive choice of γ in a linear regression model, and propose
an algorithm that selects grouped variables.
Figure 1 shows the three thresholding functions with λj = λ = 2 for each penalty function.
Here, we take a simple linear regression model with one parameter θ and a single observation
z = θ + ε, where ε is a normal random error with mean 0 and variance σ2. The thresholding
functions depicted in Figure 1 are as follows:
(i) (adaptive LASSO (τ = 1))
$$\hat\theta = \begin{cases} 0, & |z| \le \sqrt{2}, \\ \mathrm{sgn}(z)\left(|z| - \dfrac{2}{|z|}\right), & |z| > \sqrt{2}. \end{cases}$$

(ii) (SCAD)
$$\hat\theta = \begin{cases} \mathrm{sgn}(z)\,(|z| - 2)_+, & |z| \le 4, \\ \{(a-1)z - \mathrm{sgn}(z)\, 2a\}/(a-2), & 4 < |z| \le 2a, \\ z, & |z| > 2a. \end{cases}$$

(iii) (bridge (γ = 0.5))
$$\hat\theta = \begin{cases} 0, & |z| \le 2^{2/3}\left(\tfrac{3}{2}\right), \\ \text{the solution of } \mathrm{sgn}(z)\Big(|\theta| + \dfrac{1}{\sqrt{|\theta|}}\Big) = z \text{ satisfying } 1 - \dfrac{1}{|\theta|\sqrt{|\theta|}} > 0, & |z| > 2^{2/3}\left(\tfrac{3}{2}\right). \end{cases}$$
For the bridge, the computational algorithm for the solution can be found in Knight and Fu (2000).
Figure 1: Plot of the thresholding functions with λ_j = λ = 2 for each penalty function: (a) adaptive LASSO (τ = 1), (b) SCAD (a = 3.7), (c) bridge (γ = 0.5).
3 Penalized regression in the REGAR model
In Section 3.1, we propose our main algorithm by incorporating the information about the depen-
dence structure of a given time series under the REGAR model (2.1) with the error et in (2.2).
In practice, there are some obstacles to finding the estimators because the objective function is not
convex and the AR order q is not available in advance. The proposed algorithm utilizes the local
quadratic approximation to obtain local convexity and applies the BIC for finding the correct q
along with other optimal tuning parameters. In Section 3.2, we present asymptotic properties of
the estimators assuming that the tuning parameters including the AR order q can be properly
identified. We show that the estimators are consistent and enjoy selection consistency as well as
asymptotic normality. In Section 3.3, we briefly discuss extension to penalized regression in ARCH
models.
3.1 Computation in the REGAR model
In estimating the coefficients β and φ, the estimators can be obtained by minimizing the following
objective function:
$$Q_T^*(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|) + T\sum_{j=1}^{q} p_{\delta_j}(|\phi_j|), \qquad (3.1)$$
where T = T0 − q. Wang et al. (2007) consider the same model with the modified LASSO penalty
and fixed order q. Their algorithm implicitly suggests that the correct order of AR error terms
could be identified by setting some of their coefficients to zero if q is chosen larger than the true order.
Instead, the proposed computational algorithm includes the procedure of explicitly selecting
the order q as well as important sets of variables. For a simple computation, we do not penalize the
autoregression coefficients φ in (3.1) because selecting AR error terms is relatively less important.
Therefore, we minimize the following objective function:
$$Q_T(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|). \qquad (3.2)$$
One can always include $T\sum_{j=1}^{q} p_{\delta_j}(|\phi_j|)$ back in the algorithm if necessary.
In what follows we introduce an algorithm for the three penalized methods (adaptive LASSO,
SCAD and bridge). For the adaptive LASSO, SCAD and bridge with 0 < γ ≤ 1, the minimization
problem in (3.2) is either not differentiable at the origin or concave, so we use the local quadratic
approximation proposed by Fan and Li (2001) to circumvent this issue. Under some mild conditions,
the penalty term can be locally approximated at β^{(0)} = (β_1^{(0)}, ..., β_p^{(0)})′ by a quadratic function:
$$p_{\lambda_j}(|\beta_j|) \approx p_{\lambda_j}(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\big(\beta_j^2 - \beta_j^{(0)2}\big) = \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^2 + p_{\lambda_j}(|\beta_j^{(0)}|) - \frac{1}{2}\,\frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^{(0)2}, \qquad (3.3)$$
where p′ is the derivative of the penalty function p. In (3.3), note that the last two constant terms
do not affect the minimization problem in (3.2). By using the local quadratic approximation
at β^{(0)}, the minimization problem near β^{(0)} can be reduced to a quadratic minimization problem:
$$\sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + \frac{T}{2}\sum_{j=1}^{p} \frac{p_{\lambda_j}'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\beta_j^2. \qquad (3.4)$$
Because there are both regression and autoregression parameters in equation (3.4), we solve it
iteratively by minimizing the following two objective functions:
$$\beta^{(k)} = \arg\min_{\beta}\; \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j^{(k-1)}(Y_{t-j} - x_{t-j}'\beta)\Big\}^2 + T\sum_{j=1}^{p} \frac{p_{\lambda_j}'(|\beta_j^{(k-1)}|)}{2|\beta_j^{(k-1)}|}\beta_j^2$$
with a fixed φ^{(k−1)}, and
$$\phi^{(k)} = \arg\min_{\phi}\; \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta^{(k)} - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta^{(k)})\Big\}^2$$
with a fixed β^{(k)}. The algorithm stops when there is little change in (β^{(k)}, φ^{(k)}), for example when
$$\max\{|\beta_1^{(k)} - \beta_1^{(k-1)}|, \ldots, |\beta_p^{(k)} - \beta_p^{(k-1)}|, |\phi_1^{(k)} - \phi_1^{(k-1)}|, \ldots, |\phi_q^{(k)} - \phi_q^{(k-1)}|\} < \epsilon,$$
where ε is a pre-selected positive value. In our numerical analysis, ε = 10^{−3} is used. During the iteration, if |β_j^{(k)}| < 10^{−5}, we delete
the jth variable to make the algorithm stable and also exclude it from the final model.
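To make the alternating updates concrete, the following Python sketch (our own illustration; the array conventions, the ridge-like solve implied by the local quadratic approximation, and the function name are assumptions, not code from the paper) performs one pass of the two minimizations for a generic penalty derivative pprime.

```python
import numpy as np

def lqa_step(Y, X, beta, phi, pprime, eps_del=1e-5):
    """One alternating update of (beta, phi) using the local quadratic approximation.

    Y: (T0,) response; X: (T0, p) covariates; beta: (p,); phi: (q,);
    pprime(b): derivative of the penalty evaluated at |b| (vectorized).
    """
    T0, p = X.shape
    q = len(phi)
    T = T0 - q
    # Filtered response and design: Y_t - sum_j phi_j Y_{t-j}, x_t - sum_j phi_j x_{t-j}
    Yf = Y[q:] - sum(phi[j] * Y[q - 1 - j:T0 - 1 - j] for j in range(q))
    Xf = X[q:] - sum(phi[j] * X[q - 1 - j:T0 - 1 - j] for j in range(q))
    # beta update: the LQA turns the penalty into a quadratic (ridge-like) term
    keep = np.abs(beta) >= eps_del                    # drop coefficients shrunk near zero
    D = np.diag(T * pprime(np.abs(beta[keep])) / (2 * np.abs(beta[keep])))
    beta_new = np.zeros(p)
    beta_new[keep] = np.linalg.solve(Xf[:, keep].T @ Xf[:, keep] + D,
                                     Xf[:, keep].T @ Yf)
    # phi update: least squares of the current residuals on their own q lags
    e = Y - X @ beta_new
    W = np.column_stack([e[q - 1 - j:T0 - 1 - j] for j in range(q)])
    phi_new = np.linalg.lstsq(W, e[q:], rcond=None)[0]
    return beta_new, phi_new
```

In practice the step would be repeated until the maximum change in (β, φ) falls below the tolerance ε = 10^{−3} described above.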
We use the ordinary least squares estimator, with no consideration of the autocorrelation structure,
as an initial estimator for the regression coefficient vector β:
$$\beta^{(0)} = (X'X)^{-1}X'Y.$$
Then we compute the residuals ê_t = Y_t − x_t′β^{(0)} and obtain the initial estimator for the autore-
gression coefficient vector φ as follows:
$$\phi^{(0)} = (W'W)^{-1}W'V,$$
where V = (ê_{q+1}, ..., ê_{T_0})′ and W is a T × q matrix whose tth row is (ê_{t+q−1}, ..., ê_t).
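A matching initialization, again a sketch under the same hypothetical array conventions as above:

```python
import numpy as np

def initial_estimates(Y, X, q):
    """OLS fit ignoring autocorrelation, then AR(q) least squares on the residuals."""
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]      # beta^(0) = (X'X)^{-1} X'Y
    e = Y - X @ beta0                                 # residuals e_t
    T0 = len(Y)
    W = np.column_stack([e[q - 1 - j:T0 - 1 - j] for j in range(q)])  # lagged residuals
    phi0 = np.linalg.lstsq(W, e[q:], rcond=None)[0]   # phi^(0) = (W'W)^{-1} W'V
    return beta0, phi0
```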
After obtaining these initial estimators, we select the tuning parameter λ for the adaptive
LASSO (we use λ_j = λ/|β̂_j^{OLS}| as suggested by Zou (2006)) and the SCAD, or (λ, γ) for the bridge,
together with the autoregressive order q. As in Wang et al. (2007), we apply the BIC:
$$\mathrm{BIC} = \log(\hat\sigma^2) + \mathrm{df}\,\frac{\log(T)}{T},$$
where
$$\hat\sigma^2 = \frac{1}{T}\sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\hat\beta - \sum_{j=1}^{q} \hat\phi_j(Y_{t-j} - x_{t-j}'\hat\beta)\Big\}^2,$$
and df is the number of nonzero coefficients of (β̂, φ̂). Therefore, we choose the tuning parameter(s)
and the order q that minimize the BIC.
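Schematically, the selection step can be organized as a grid search; the sketch below is our own illustration and assumes a hypothetical routine fit_penalized_regar(Y, X, q, lam, gamma) that runs the iterative algorithm above and returns (beta_hat, phi_hat).

```python
import numpy as np
from itertools import product

def bic_regar(Y, X, beta, phi, q):
    """BIC = log(sigma2_hat) + df * log(T) / T for a fitted REGAR model."""
    T0 = len(Y)
    T = T0 - q
    resid = (Y[q:] - X[q:] @ beta
             - sum(phi[j] * (Y[q - 1 - j:T0 - 1 - j] - X[q - 1 - j:T0 - 1 - j] @ beta)
                   for j in range(q)))
    sigma2 = np.sum(resid ** 2) / T
    df = np.count_nonzero(beta) + np.count_nonzero(phi)
    return np.log(sigma2) + df * np.log(T) / T

def select_by_bic(Y, X, fit, lams, gammas=(None,), qs=(1, 2, 3, 4)):
    """Pick (lambda, gamma, q) minimizing the BIC; `fit` is the user-supplied fitting routine."""
    best = None
    for lam, gamma, q in product(lams, gammas, qs):   # gamma is only used by the bridge
        beta, phi = fit(Y, X, q=q, lam=lam, gamma=gamma)
        bic = bic_regar(Y, X, beta, phi, q)
        if best is None or bic < best[0]:
            best = (bic, lam, gamma, q, beta, phi)
    return best
```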
3.2 Asymptotic properties
In this subsection we study asymptotic behaviors of the bridge and SCAD estimators assuming
that the AR order q is properly chosen. See Wang et al. (2007) for the properties of the adaptive
LASSO. Since we do not penalize the autoregression coefficients φ, we focus on the behavior of the
regression coefficient estimator β̂ obtained from (3.2). All proofs are deferred to Section 5.
Suppose that β_0 = (β_{01}, ..., β_{0p})′ and φ_0 = (φ_{01}, ..., φ_{0q})′ are the true coefficient vectors and there
are p_0 ≤ p nonzero regression coefficients. We further assume the following conditions.

(i) The sequence {x_t} is independent of {ε_t}.

(ii) All roots of the polynomial $1 - \sum_{j=1}^{q} \phi_{0j}z^j$ are outside the unit circle.

(iii) The error ε_t has a finite fourth-order moment.

(iv) The covariate x_t is strictly stationary and ergodic with a finite second-order moment. Furthermore, the following matrix is positive definite:
$$B = E\Big[\Big(x_t - \sum_{j=1}^{q} \phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q} \phi_{0j}x_{t-j}\Big)'\Big].$$
Note that λ_1 = ··· = λ_p = λ_T for the penalty term since we do not consider the adaptive LASSO
in this subsection. Define Σ = diag(B, C), where C = (ξ_{|i−j|}) and ξ_k = E(e_t e_{t+k}). Let Σ_β be the
submatrix of Σ corresponding to β.
Theorem 1. For some λ_0 ≥ 0, assume that λ_T T^{1−γ/2} → λ_0, where 0 < γ < 1, for the bridge
and λ_T√T → λ_0 for the SCAD. Then, under the conditions (i)-(iv),
$$\sqrt{T}(\hat\beta - \beta_0) \overset{d}{\to} \arg\min(V),$$
where
$$V(u) = -2u'w + u'\Sigma_\beta u + \lambda_0\sum_{j=1}^{p} |u_j|^{\gamma}\, I(\beta_j = 0)$$
for the bridge with 0 < γ < 1 and
$$V(u) = -2u'w + u'\Sigma_\beta u + \lambda_0\sum_{j=1}^{p} \big[u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \ne 0) + |u_j|\,I(\beta_j = 0)\big]$$
for the SCAD. Here, w ∼ N(0, σ²Σ_β).
Theorem 1 describes the limiting distribution of β̂, similarly to Knight and Fu (2000).
It implies that λ_T cannot grow faster than T^{−1/2} for the SCAD. For the bridge with 0 < γ < 1,
one can estimate the nonzero regression coefficients at the usual rate without asymptotic bias while
shrinking the estimates of the zero regression coefficients to zero with positive probability.
The following lemma shows the √T-consistency of the two estimators.

Lemma 1. Assume that a_T = max{p′_{λ_T}(|β_{0j}|) : β_{0j} ≠ 0} = o(1) as T → ∞. If max{p′′_{λ_T}(|β_{0j}|) :
β_{0j} ≠ 0} → 0, then, under the conditions (i)-(iv), there is a local minimizer (β̂, φ̂) of Q_T(β, φ)
such that
$$\|\hat\beta - \beta_0\| = O_p(T^{-1/2} + a_T).$$
The next theorem shows that the bridge and SCAD estimators possess selection consistency
for the insignificant regression coefficients. Let β_0 = (β′_{01}, β′_{02})′ and assume that
β_{02} = 0 without loss of generality. Denote the corresponding estimators by β̂_1 and β̂_2, respectively.

Theorem 2. Assume that
$$\liminf_{T\to\infty}\ \liminf_{\theta\to 0^{+}}\ p_{\lambda_T}'(\theta)/\lambda_T > 0, \qquad (3.5)$$
and ∥β̂ − β_0∥ = O_p(T^{−1/2}). If λ_T → 0 and √Tλ_T → ∞ as T → ∞, then
$$P(\hat\beta_2 = 0) \to 1.$$
Finally, the following theorem describes the asymptotic distribution of the bridge and SCAD
estimators. It implies that the estimators can be as efficient as the oracle estimator in an asymptotic
sense.

Theorem 3. Assume that the penalty function p_{λ_T}(|θ|) satisfies condition (3.5). If λ_T → 0 and
√Tλ_T → ∞ as T → ∞, then
$$\sqrt{T}(\hat\beta_1 - \beta_{01}) \overset{d}{\to} N(0, \sigma^2\Sigma_{0\beta}^{-1}),$$
where Σ_{0β} is the submatrix of Σ_β corresponding to β_{01}.
Remark The algorithm in Section 3.1 divides the original objective function (3.2) into two parts to
estimate the coefficients of the regression and AR error terms iteratively. By using an argument similar
to that in Wang et al. (2007), it can be shown that this iterative process leads to a unique local minimizer.
3.3 Extension to ARCH models
In this subsection, we briefly discuss how penalized regression can be developed for ARCH models.
Let {ε_t} be a sequence of random variables following an ARCH(r) model:
$$\varepsilon_t = z_t\sigma_t(\theta), \qquad \sigma_t^2(\theta) = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2.$$
Here, θ = (c_0, ..., c_r)′ denotes the ARCH model parameter, whose parameter space is given as
(0, ∞) × [0, ∞)^r. Also, z_t, t = 1, ..., T, is a sequence of independent and identically distributed
random variables with zero mean and unit variance, and r represents the order of an ARCH model.
First, we consider the parameter estimation for the pure ARCH models. If r is known in advance,
one may consider the least squares estimator for θ, obtained from the AR(r) representation for ε_t²:
$$\varepsilon_t^2 = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2 + u_t, \qquad (3.6)$$
where u_t = ε_t² − σ_t²(θ) is a martingale difference. However, it is more likely that the ARCH
order r is unknown in practice. In such a case, traditional model selection criteria, including the AIC
and BIC, can be employed to select r, since ARCH models can be expressed as AR models via
(3.6). It is worth pointing out that this selection may also be carried out within the penalized
regression framework, using the adaptive LASSO, bridge, or SCAD penalties considered in this paper.
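To illustrate this AR representation, a least squares fit of the squared series together with a BIC search over r might look as follows (our own sketch; the function names and the BIC form used here are assumptions, not part of the paper).

```python
import numpy as np

def fit_arch_ls(eps, r):
    """Least squares for the AR(r) representation eps_t^2 = c0 + sum_i c_i eps_{t-i}^2 + u_t."""
    s = eps ** 2
    n = len(s)
    Z = np.column_stack([np.ones(n - r)] + [s[r - i:n - i] for i in range(1, r + 1)])
    theta = np.linalg.lstsq(Z, s[r:], rcond=None)[0]   # (c0, c1, ..., cr)
    resid = s[r:] - Z @ theta
    return theta, resid

def select_arch_order(eps, max_r=5):
    """Choose the ARCH order r by a BIC computed from the squared-series regression."""
    best = None
    for r in range(1, max_r + 1):
        theta, resid = fit_arch_ls(eps, r)
        n = len(resid)
        bic = np.log(np.mean(resid ** 2)) + (r + 1) * np.log(n) / n
        if best is None or bic < best[0]:
            best = (bic, r, theta)
    return best
```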
Second, we consider a linear regression model with AR-ARCH error terms, where {et} in (2.1)
is the sequence of random variables following an AR(q)-ARCH(r) model:
$$e_t = \sum_{i=1}^{q} \phi_i e_{t-i} + \varepsilon_t, \qquad \varepsilon_t = z_t\sigma_t(\theta), \qquad \sigma_t^2(\theta) = c_0 + \sum_{i=1}^{r} c_i\varepsilon_{t-i}^2.$$
Note that this is a general extension of the REGAR model. Then, the objective function in (3.2)
can be modified as follows:
$$Q_T(\beta, \phi, \theta) = \sum_{t=q+1}^{T_0} \left\{\frac{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)}{\sigma_t(\theta)}\right\}^2 + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
Since the conditional variance σ_t²(θ) is involved in this objective function, the proposed algorithm
in Section 3.1 may not be appropriate. We leave the development of a new algorithm for order
selection and parameter estimation in this model, and the investigation of its asymptotic properties,
as future work.
4 Numerical Analysis
4.1 Simulation
We present a Monte Carlo simulation study to evaluate the finite sample performance. We compare
the least squares (LS) without penalty, adaptive LASSO, SCAD and bridge under various AR error
models. The simulation settings are as follows.
1. Setting I: We generate the data from the model (2.1), where β = (3, 1.5, 0, 0, 2, 0, 0, 0)′ and e_t
follows the AR(q) process in (2.2). The covariates x_t = (x_{t1}, ..., x_{t8})′ are independently
generated from the multivariate normal distribution with mean 0_{8×1}, and the pairwise cor-
relation between x_{t,j1} and x_{t,j2} is 0.5^{|j1−j2|}. In the simulation, q = 1, 2, 6, σ = 1, 3, and two
sample sizes (T_0 = 100, 300) are considered. We use φ = 0.9 for q = 1, φ = (0.4, 0.4)′ for
q = 2, and φ = (0.4, 0.2, 0.1, 0.05, 0.025, 0.0125)′ for q = 6. A similar setting is considered by
Wang et al. (2007), but we add a higher order case.
2. Setting II: We generate the data from the model (2.1), where the regression coefficients are
set to be
$$\beta = (\underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10}, \underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10})'.$$
In addition, we assume the same structure as in Setting I for the error term e_t. The covariates
x_t = (x_{t1}, ..., x_{t40})′ are independently generated from the multivariate normal distribution
with mean 0_{40×1}, and the pairwise correlation between x_{t,j1} and x_{t,j2} is 0.5^{|j1−j2|}. In this
setting, q = 1, 2, σ = 1, 3, and two sample sizes (T_0 = 100, 300) are considered. We use
φ = 0.9 for q = 1 and φ = (0.4, 0.4)′ for q = 2.
3. Setting III: We consider a seasonal AR model with order q; that is, the error term e_t follows
the autoregressive process
$$e_t = \phi_1 e_{t-s} + \phi_2 e_{t-2s} + \cdots + \phi_q e_{t-qs} + \varepsilon_t.$$
In this simulation, q = 2 with φ = (0.4, 0.4)′ and s = 4 (quarterly) are considered. The other
setups are the same as those in Setting I.
In the three settings, the ranges of the tuning parameters are λ = 2^{k−15}, k = 1, 2, ..., 20,
for the three penalized methods, γ = 0.1, 0.4, 0.7, 1 for the bridge, and q = 1, 2, 3, 4, except for the
higher order case in Setting I, where we consider q = 1, 2, ..., 8. We choose the value of λ (for the
adaptive LASSO and SCAD) or the combination (λ, γ) (for the bridge) and the AR order q
that produce the lowest BIC. For each setting, we repeat the simulation 100 times and compare the average
model error ME = (β̂ − β)′E(xx′)(β̂ − β), the average numbers of correct and incorrect zeros in
β̂, and the number of correctly estimated autoregressive orders.
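For reference, one replication of Setting I can be generated as in the following sketch (our own code; the burn-in length and the use of the known covariate covariance matrix in the model-error formula are our assumptions).

```python
import numpy as np

def simulate_setting1(T0, phi, sigma, rng):
    """Generate (Y, X) from model (2.1) with AR(q) errors as in Setting I."""
    beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)
    p, q = len(beta), len(phi)
    cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr 0.5^{|j1-j2|}
    X = rng.multivariate_normal(np.zeros(p), cov, size=T0)
    burn = 200                                         # burn-in so e_t is close to stationarity
    eps = sigma * rng.standard_normal(T0 + burn)
    e = np.zeros(T0 + burn)
    for t in range(q, T0 + burn):
        e[t] = np.dot(phi, e[t - q:t][::-1]) + eps[t]  # AR(q) recursion
    Y = X @ beta + e[burn:]
    return Y, X, beta, cov

def model_error(beta_hat, beta, cov):
    """ME = (beta_hat - beta)' E(x x') (beta_hat - beta)."""
    d = beta_hat - beta
    return float(d @ cov @ d)

# Example: one replication with q = 1, phi = 0.9, sigma = 1, T0 = 100
rng = np.random.default_rng(0)
Y, X, beta, cov = simulate_setting1(100, np.array([0.9]), 1.0, rng)
```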
Table 1: Results of Setting I (q = 1)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.053(0.003) 0 0 94 0.449(0.025) 0 0 83
ALASSO 0.026(0.002) 4.73 0 88 0.277(0.023) 4.06 0 86
SCAD 0.026(0.002) 4.45 0 84 0.282(0.024) 3.80 0 66
BRIDGE 0.025(0.002) 4.75 0 92 0.219(0.020) 4.81 0 88
T0 = 300 σ = 1 σ = 3
LS 0.016(0.001) 0 0 95 0.148(0.009) 0 0 94
ALASSO 0.009(0.001) 4.67 0 96 0.097(0.008) 4.33 0 93
SCAD 0.008(0.001) 4.52 0 73 0.092(0.009) 3.99 0 82
BRIDGE 0.008(0.001) 4.87 0 96 0.065(0.007) 4.90 0 95
Tables 1, 2, and 3 report the results of Setting I with q = 1 and 2, and 6, respectively. In terms
of ME, the bridge shows the best performance in all cases while the LS is the worst performer. The
SCAD performs well when the noise level is small. It can be seen that the adaptive LASSO also
Table 2: Results of Setting I (q = 2)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.071(0.004) 0 0 84 0.611(0.033) 0 0 82
ALASSO 0.034(0.003) 4.21 0 90 0.391(0.034) 3.94 0 85
SCAD 0.030(0.003) 4.36 0 82 0.455(0.037) 3.71 0 78
BRIDGE 0.032(0.003) 4.78 0 90 0.284(0.027) 4.84 0 85
T0 = 300 σ = 1 σ = 3
LS 0.021(0.001) 0 0 92 0.176(0.009) 0 0 99
ALASSO 0.014(0.001) 4.68 0 95 0.119(0.011) 4.29 0 95
SCAD 0.009(0.001) 4.54 0 86 0.120(0.012) 4.24 0 91
BRIDGE 0.010(0.001) 4.92 0 94 0.080(0.008) 4.96 0 98
Table 3: Results of Setting I (q = 6)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.103(0.006) 0 0 48 0.929(0.056) 0 0 48
ALASSO 0.060(0.005) 4.01 0 52 0.650(0.054) 3.71 0.01 49
SCAD 0.044(0.005) 4.24 0 56 0.627(0.056) 3.83 0.01 49
BRIDGE 0.050(0.005) 4.66 0 53 0.525(0.054) 4.64 0.04 53
T0 = 300 σ = 1 σ = 3
LS 0.025(0.001) 0 0 94 0.221(0.011) 0 0 94
ALASSO 0.013(0.001) 4.62 0 92 0.145(0.012) 4.36 0 90
SCAD 0.010(0.001) 4.54 0 91 0.149(0.012) 4.45 0 88
BRIDGE 0.010(0.001) 4.91 0 94 0.093(0.008) 4.91 0 94
performs competitively. The bridge method correctly selects important variables corresponding to
nonzero coefficients, and sets most of the true zero coefficients to zero; its average numbers are closest
to the true value of 5 among the methods. The SCAD and adaptive LASSO select more variables,
including noninformative ones, than the bridge, especially at the higher noise level, which may
explain their worse performance in terms of ME. All the methods select the AR
order correctly in most cases for the lower order settings, but the bridge shows a slightly better result.
For the higher order case q = 6, which is more challenging, all the estimators produce higher ME
values and lower numbers of correct zero coefficients and correctly selected AR orders compared to the lower order
cases, particularly when the sample size is small. However, the main findings remain the same.
Table 4: Results of Setting II (q = 1)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.860(0.025) 0 0 11 9.067(0.516) 0 0 8
ALASSO 0.307(0.029) 16.08 0 52 4.922(0.406) 13.42 0.06 39
SCAD 0.304(0.024) 17.45 0 47 4.591(0.415) 13.75 0.05 25
BRIDGE 0.289(0.020) 17.58 0 55 4.169(0.297) 15.64 0.06 38
T0 = 300 σ = 1 σ = 3
LS 0.095(0.002) 0 0 78 0.837(0.021) 0 0 83
ALASSO 0.049(0.002) 19.08 0 91 0.461(0.017) 18.48 0 87
SCAD 0.043(0.002) 19.96 0 89 0.483(0.020) 19.21 0 68
BRIDGE 0.045(0.002) 19.74 0 90 0.411(0.014) 19.70 0 91
In Tables 4 and 5 the results of Setting II with q = 1 and 2 are shown. Note that there are more
variables in Setting II than Setting I. In terms of ME, the three penalized methods perform much
better than the LS. The bridge is the best or second best performer in all cases and the SCAD is
comparable to the bridge in most cases, especially at the smaller noise level. In terms of variable selection,
the bridge again shows the best performance by correctly selecting relevant variables and setting
more coefficients to zero than the others, close to the true number of 20 zero coefficients. The selection
Table 5: Results of Setting II (q = 2)
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.937(0.048) 0 0 14 8.729(0.412) 0 0 17
ALASSO 0.359(0.036) 17.38 0 53 4.576(0.367) 15.13 0.05 38
SCAD 0.347(0.038) 18.40 0 54 4.036(0.328) 14.99 0.06 34
BRIDGE 0.388(0.031) 17.98 0 46 3.797(0.311) 17.86 0.09 48
T0 = 300 σ = 1 σ = 3
LS 0.121(0.003) 0 0 91 1.102(0.029) 0 0 92
ALASSO 0.064(0.003) 19.07 0 91 0.650(0.023) 18.26 0 84
SCAD 0.055(0.002) 19.98 0 96 0.806(0.032) 19.38 0 84
BRIDGE 0.060(0.002) 19.65 0 96 0.560(0.021) 19.62 0 94
of the AR order q in this setting becomes more challenging, and none of the methods works well
for T0 = 100. For T0 = 300, the adaptive LASSO and bridge achieve about 90% accuracy.
Table 6 shows the results of Setting III with a seasonal component. It can be seen that the
results are similar to those of Setting I with q = 2; the penalized regression methods perform better
than the LS and the bridge performs the best.
In summary, the three penalized regression methods provide more accurate results compared
to the least squares in terms of prediction, variable selection, and AR order selection. Among
the three, the bridge has a slight edge over the other two because it has a more flexible penalty
form and it is adaptive in the sense that the order of the penalty function, γ, is estimated from
the data. Also, Park and Yoon (2011) conduct a thorough simulation study for the comparison of
several penalized methods in a regression problem and show the superior performance of the bridge
estimators in various settings.
Table 6: Results of Setting III
Reg. coef. AR order Reg. coef. AR order
Method ME Cor Incor No. cor ME Cor Incor No. cor
T0 = 100 σ = 1 σ = 3
LS 0.083(0.005) 0 0 73 0.746(0.040) 0 0 73
ALASSO 0.043(0.004) 4.10 0 73 0.484(0.044) 3.88 0.02 74
SCAD 0.038(0.004) 4.35 0 69 0.470(0.042) 3.72 0.01 69
BRIDGE 0.038(0.003) 4.73 0 77 0.399(0.042) 4.70 0.03 76
T0 = 300 σ = 1 σ = 3
LS 0.021(0.001) 0 0 90 0.188(0.001) 0 0 90
ALASSO 0.013(0.001) 4.71 0 91 0.126(0.012) 4.41 0 91
SCAD 0.009(0.001) 4.46 0 80 0.120(0.010) 4.27 0 87
BRIDGE 0.009(0.001) 4.92 0 90 0.083(0.008) 4.92 0 90
4.2 Real data
In this subsection we analyze the data set from Ramanathan (1998), the consumption of electricity
by residential customers served by San Diego Gas and Electric Company. This data set contains
87 quarterly observations from the second quarter of 1972 through the fourth quarter of 1993. The
response variable is the electricity consumption, measured by the logarithm of the kWh sales
per residential customer. The independent variables are the per-capita income (LY), the price of
electricity (LPRICE), cooling degree days (CDD), and heating degree days (HDD).
The basic model considered in Ramanathan (1998) is given as:
LKWH = β0 + β1LY + β2LPRICE + β3CDD + β4HDD + et.
The expected signs for the β’s (except β0) are as follows (Ramanathan, 1998):
β1 > 0, β2 < 0, β3 > 0, β4 > 0.
We first apply the OLS (ordinary least squares) method and report the estimated coefficients
in the second column of Table 7. The signs of LPRICE, CDD, and HDD are consistent with the
expected ones, but LY has the reverse sign. This unexpected result may be due to ignoring the
autocorrelation structure of the error term, and hence Ramanathan (1998) suggests the REGAR
model. We next consider the REGAR model with autoregressive orders q = 1, 2, 3, 4 and select
one of these orders which minimizes BIC. We again compare the LS, adaptive LASSO, SCAD and
bridge as in Section 4.1.
Table 7: The estimated coefficients and AR orders
Variable OLS LS ALASSO SCAD BRIDGE
LY -0.00127 0.10256 0.04304 - -
LPRICE -0.01807 -0.09824 -0.09453 -0.09775 -0.09772
CDD 0.06374 0.00028 0.00028 0.00028 0.00028
HDD 0.09849 0.00023 0.00023 0.00023 0.00023
AR order - 4 4 4 4
Table 7 further reports the estimated coefficients and the selected autoregressive orders for the
four methods. When the LS and adaptive LASSO are used, the sign of
β1 is positive, as Ramanathan (1998) suggests. For the SCAD and bridge, β1 is exactly
zero, which suggests that the per-capita income (LY) does not contribute to the model. For the
other variables, all the estimators produce similar coefficient values. The autoregressive order q is
consistently estimated as 4 by the four methods.
We next compare the performances of the four methods using the one-step ahead forecast. We
estimate the regression and autoregressive coefficients using the previous 65 observations, and then
forecast the electricity consumption for the next quarter, as sketched below.
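A minimal sketch of the one-step-ahead REGAR forecast used in this comparison (our own illustration; it assumes fitted coefficient arrays beta_hat and phi_hat and the same array conventions as in Section 3.1):

```python
import numpy as np

def forecast_one_step(Y, X, x_next, beta_hat, phi_hat):
    """Forecast Y_{T+1} as x'_{T+1} beta + sum_j phi_j (Y_{T+1-j} - x'_{T+1-j} beta)."""
    q = len(phi_hat)
    resid = Y - X @ beta_hat                 # in-sample residuals e_t
    return x_next @ beta_hat + np.dot(phi_hat, resid[-q:][::-1])
```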
Table 8 summarizes the mean, standard error (s.e.) of the one-step forecast errors, and the
number of forecasts included in the 95% confidence intervals for the four methods. While the four
methods perform similarly, the SCAD and bridge produce slightly lower forecast errors and
have more forecasts included in the 95% confidence intervals.
Table 8: The one-step ahead forecast errors and the number of forecasts included in the 95%
confidence intervals
LS ALASSO SCAD BRIDGE
mean 0.00263 0.00259 0.00252 0.00247
s.e. 0.00093 0.00089 0.00091 0.00088
No. 16/22 15/22 17/22 17/22
5 Proofs
In this section, we provide the proofs of the theorems in Section 3.2. We follow steps similar to
those given in Wang et al. (2007).
Define
$$L_T(\beta, \phi) = \sum_{t=q+1}^{T_0} \Big\{Y_t - x_t'\beta - \sum_{j=1}^{q} \phi_j(Y_{t-j} - x_{t-j}'\beta)\Big\}^2,$$
and then
$$Q_T(\beta, \phi) = L_T(\beta, \phi) + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
Since we focus on β, we omit φ from the notation, that is,
$$Q_T(\beta) = L_T(\beta) + T\sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|).$$
5.1 Proof of Theorem 1
Define
$$V_T(u) = Q_T(\beta_0 + u/\sqrt{T}) - Q_T(\beta_0) = L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) + T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big].$$
It can be shown from Knight and Fu (2000) that
$$T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big] \to \lambda_0\sum_{j=1}^{p} |u_j|^{\gamma}\, I(\beta_j = 0)$$
for the bridge and
$$T\sum_{j=1}^{p}\big[p_{\lambda_T}(|\beta_{0j} + u_j/\sqrt{T}|) - p_{\lambda_T}(|\beta_{0j}|)\big] \to \lambda_0\sum_{j=1}^{p}\big[u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \ne 0) + |u_j|\,I(\beta_j = 0)\big]$$
for the SCAD. Also, taking an approach similar to that of Wang et al. (2007),
$$L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) = \sum_{t}\Big[Y_t - x_t'(\beta_0 + u/\sqrt{T}) - \sum_{j=1}^{q}\phi_{0j}\big\{Y_{t-j} - x_{t-j}'(\beta_0 + u/\sqrt{T})\big\}\Big]^2 - \sum_{t}\varepsilon_t^2$$
$$= \sum_{t}\Big[\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big]^2 - \sum_{t}\varepsilon_t^2 = S_1 + S_2,$$
where
$$S_1 = -2T^{-1/2}u'\sum_{t}\varepsilon_t\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big), \qquad S_2 = T^{-1}u'\sum_{t}\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)'u.$$
Using the martingale central limit theorem and the ergodic theorem, it can be shown that
$$L_T(\beta_0 + u/\sqrt{T}) - L_T(\beta_0) \overset{d}{\to} -2u'w + u'\Sigma_\beta u.$$
To prove that
$$\arg\min(V_T(u)) \overset{d}{\to} \arg\min(V),$$
it is sufficient to show that arg min(V_T) = O_p(1) (Kim and Pollard, 1990). Note that for the bridge,
$$V_T(u) \ge \sum_{t}\Big[\Big\{\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big\}^2 - \varepsilon_t^2\Big] - T\sum_{j=1}^{p} p_{\lambda_T}(|u_j/\sqrt{T}|)$$
$$\ge \sum_{t}\Big[\Big\{\varepsilon_t - T^{-1/2}u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big\}^2 - \varepsilon_t^2\Big] - (\lambda_0 + \epsilon_0)\sum_{j=1}^{p}|u_j|^{\gamma} + o_p(1) \equiv \widetilde{V}_T(u),$$
where ε_0 is a positive constant. Using the same argument as in the proof of Theorem 3 in Knight and
Fu (2000), the conclusion follows. One can use a similar argument for the SCAD.
5.2 Proof of Lemma 1
Let α_T = T^{−1/2} + a_T and let {β_0 + α_T u : ∥u∥ ≤ C} be the ball around β_0. As in Fan and Li (2001),
we will show that for any given ε > 0, there exists a large constant C such that
$$P\Big\{\inf_{\|u\| = C} Q_T(\beta_0 + \alpha_T u) > Q_T(\beta_0)\Big\} \ge 1 - \epsilon, \qquad (5.1)$$
which implies that there exists a local minimum in the ball with probability at least 1 − ε.

Using p_{λ_T}(0) = 0, we have
$$D_T(u) \equiv Q_T(\beta_0 + \alpha_T u) - Q_T(\beta_0) \ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) + T\sum_{j=1}^{p_0}\big\{p_{\lambda_T}(|\beta_{0j} + \alpha_T u_j|) - p_{\lambda_T}(|\beta_{0j}|)\big\}$$
$$\ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) - \sum_{j=1}^{p_0}\big[T\alpha_T\, p_{\lambda_T}'(|\beta_{0j}|)\,\mathrm{sgn}(\beta_{0j})\,u_j + T\alpha_T^2\, p_{\lambda_T}''(|\beta_{0j}|)\,u_j^2\{1 + o(1)\}\big]$$
$$\ge L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) - \sqrt{p_0}\,T\alpha_T a_T\|u\| - T\alpha_T^2\max\{p_{\lambda_T}''(|\beta_{0j}|) : \beta_{0j} \ne 0\}\|u\|^2. \qquad (5.2)$$
Here,
$$L_T(\beta_0 + \alpha_T u) - L_T(\beta_0) = \sum_{t}\Big[\varepsilon_t - \alpha_T u'\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big]^2 - \sum_{t}\varepsilon_t^2 = (\mathrm{I}) + (\mathrm{II}),$$
where
$$(\mathrm{I}) = \alpha_T^2 u'\Big[\sum_{t}\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big)'\Big]u, \qquad (\mathrm{II}) = -2\alpha_T u'\sum_{t}\varepsilon_t\Big(x_t - \sum_{j=1}^{q}\phi_{0j}x_{t-j}\Big).$$
From the proof of Lemma 1 in Wang et al. (2007), we have
$$(\mathrm{I}) = T\alpha_T^2\{u'\Sigma_\beta u + o_p(1)\}, \qquad (\mathrm{II}) = u'O_p(T\alpha_T^2),$$
and since (I) dominates (II) and the last term in (5.2) for a sufficiently large C, (5.1) follows. This shows that there exists
a local minimizer such that ∥β̂ − β_0∥ = O_p(α_T).
5.3 Proof of Theorem 2
For j = p_0 + 1, ..., p, the local minimizer β̂ satisfies
$$\frac{\partial Q_T(\hat\beta)}{\partial \beta_j} = \frac{\partial L_T(\hat\beta)}{\partial \beta_j} + T\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j) = \frac{\partial L_T(\beta_0)}{\partial \beta_j} + T\Sigma_j(\hat\beta - \beta_0)\{1 + o_p(1)\} + T\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j)$$
$$= T\lambda_T\Big\{O_p\big(T^{-1/2}/\lambda_T\big) + \lambda_T^{-1} p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j)\Big\},$$
where Σ_j denotes the jth row of Σ_β. The last equality follows from the fact that the first and second
terms are O_p(T^{1/2}) by the central limit theorem and the condition in Theorem 2, respectively. Since
lim inf_{T→∞} lim inf_{θ→0+} λ_T^{−1}p′_{λ_T}(θ) > 0 and T^{−1/2}/λ_T → 0, the sign of the derivative is determined
by that of β̂_j for β̂_j ≠ 0. Hence, β̂_j = 0 in probability for j = p_0 + 1, ..., p.
5.4 Proof of Theorem 3
By Lemma 1 and Theorem 2, P(β̂_2 = 0) → 1. Therefore, the minimizer of Q_T(β) is the same as
that of Q_T(β_1) with probability tending to 1. This implies that β̂_1 satisfies the equation
$$\frac{\partial Q_T(\beta)}{\partial \beta_1}\bigg|_{\beta_1 = \hat\beta_1} = 0.$$
Note that β̂_1 is a consistent estimator by Lemma 1, and thus we have
$$0 = \frac{1}{\sqrt{T}}\frac{\partial L_T(\hat\beta_1)}{\partial \beta_1} + \sqrt{T}\, p_{\lambda_T}'(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j) = \frac{1}{\sqrt{T}}\frac{\partial L_T(\beta_{01})}{\partial \beta_1} + \Sigma_{0\beta}\sqrt{T}(\hat\beta_1 - \beta_{01}) + o_p(1)$$
$$\quad + \sqrt{T}\Big(p_{\lambda_T}'(|\beta_{0j}|)\,\mathrm{sgn}(\beta_{0j}) + \{p_{\lambda_T}''(|\beta_{0j}|) + o_p(1)\}(\hat\beta_j - \beta_{0j})\Big).$$
By Slutsky's theorem and the central limit theorem, we have
$$\sqrt{T}(\hat\beta_1 - \beta_{01}) = -\frac{\Sigma_{0\beta}^{-1}}{\sqrt{T}}\frac{\partial L_T(\beta_{01})}{\partial \beta_1} + o_p(1) \overset{d}{\to} N(0, \sigma^2\Sigma_{0\beta}^{-1}).$$
References
Alquier, P. and Doukhan, P. (2011). Sparsity considerations for dependent variables. Electronic
Journal of Statistics, 5:750–774.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96:1348–1360.
Fan, J. and Li, R. (2002). Variable selection for Cox’s proportional hazards model and frailty
model. The Annals of Statistics, 30:74–99.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools.
Technometrics, 35:109–148.
Fu, W. J. (1998). Penalized regression: the Bridge versus the Lasso. Journal of Computational and
Graphical Statistics, 7:397–416.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, New York, second edition.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal
problems. Technometrics, 12:55–67.
Hsu, N.-J., Hung, H.-L., and Chang, Y.-M. (2008). Subset selection for vector autoregressive
processes using LASSO. Computational Statistics and Data Analysis, 52:3645–3647.
Huang, J., Horowitz, J. L., and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse
high-dimensional regression models. The Annals of Statistics, 36:587–613.
Kim, J. and Pollard, D. (1990). Cube root asymptotics. The Annals of Statistics, 18:191–219.
Knight, K. and Fu, W. J. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics,
28:1356–1378.
Liu, Y., Zhang, H., Park, C., and Ahn, J. (2007). Support vector machines with adaptive Lq
penalty. Computational Statistics and Data Analysis, 51:6380–6394.
Park, C. and Yoon, Y. J. (2011). Bridge regression: adaptivity and group selection. Journal of
Statistical Planning and Inference, 141:3506–3519.
Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Associ-
ation, 103:681–686.
Ramanathan, R. (1998). Introductory Econometrics with Applications. The Dryden Press, Harcourt
Brace College Publishers, Fort Worth.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B, 58:267–288.
Tsay, R. S. (1984). Regression models with time series errors. Journal of the American Statistical
Association, 79:118–124.
Wang, H., Li, G., and Tsai, C. (2007). Regression coefficient and autoregressive order shrinkage
and selection via the lasso. Journal of the Royal Statistical Society Series B, 69:63–78.
Wang, L., Li, H., and Huang, J. (2008). Variable selection in nonparametric varying-coefficient
models for analysis of repeated measurements. Journal of the American Statistical Association,
103:1556–1569.
Zhang, H., Ahn, J., Lin, X., and Park, C. (2006). Gene selection using support vector machines
with nonconvex penalty. Bioinformatics, 22:88–95.
Zhang, H. H. and Lu, W. (2007). Adaptive-LASSO for Cox’s proportional hazard model. Bio-
metrika, 94:691–703.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101:1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society Series B, 67:301–320.
Zou, H. and Yuan, M. (2008). The F∞-norm support vector machine. Statistica Sinica, 18:379–398.
Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parame-
ters. The Annals of Statistics, 37:1733–1751.