estimation for a partial-linear single-index modelanson.ucdavis.edu/~wang/plsi_zhu08.pdf · in this...
TRANSCRIPT
ESTIMATION FOR A PARTIAL-LINEAR
SINGLE-INDEX MODEL
Jane-Ling Wang1, Liugen Xue2, Lixing Zhu3, and Yun Sam Chong4
1University of California at Davis
2Beijing University of Technology, Beijing, China
3Hong Kong Baptist University, Hong Kong, China
4Wecker Associate
In this paper, we study the estimation for a partial-linear single-index model.
A two-stage estimation procedure is proposed to estimate the link function for
the single index and the parameters in the single index, as well as the parame-
ters in the linear component of the model. Asymptotic normality is established
for both parametric components. For the index, a constrained estimating equa-
tion leads to an asymptotically more efficient estimator than existing estimators
in the sense that it is of a smaller limiting variance. The estimator of the non-
parametric link function achieves optimal convergence rates; and the structural
error variance is obtained. In addition, the results facilitate the construction
of confidence regions and hypothesis testing for the unknown parameters. A
simulation study is performed and an application to a real dataset is illustrated.
The extension to multiple indices is briefly sketched.
0Lixing Zhu is the corresponding author. Email: [email protected] Wang’s research was partially supported by NSF grant DMS-0406430. Liugen Xue’s research
was supported by the National Natural Science Foundation of China (10871013), the Natural Science Founda-tion of Beijing (1072004) and Ph. D. Program Foundation of Ministry of Education of China (20070005003).Lixing Zhu’s research was supported by a grant of The Research Grant Council of Hong Kong, Hong Kong,China (HKBU7060/04P and HKBU 2030/07P). The authors thank the Editor, the Associate Editor, andthe two referees for their insightful comments and suggestions which have led to substantial improvementsin the presentation of the manuscript.
0AMS 2000 subject classifications. Primary 62G05; secondary 62G20.0Key words and phrases. Dimension reduction, local linear smoothing, bandwidth, two-stage estimation,
kernel smoother.
1 Introduction
Partial linear models have attracted lots of attention due to their flexibility to combine
traditional linear models with nonparametric regression models. See, e.g. Heckman (1986),
Rice (1986), Chen (1988), Bhattacharya and Zhao (1997), Xia and Hardle (2006), and the
recent comprehensive books by Hardle, Gao, and Liang (2000) and Ruppert, Wand and
Carroll (2003) for additional references. However, the nonparametric components are subject
to the curse of dimensionality and can only accommodate low dimensional covariates X. To
remedy this, a dimension reduction model which assumes that the influence of the covariate
X can be collapsed to a single index, XTβ, through a nonparametric link function g is a
viable option and termed the partial-linear single-index model. Specifically, it takes the form:
Y = ZTθ0 + g(XTβ0) + e, (1.1)
where (X, Z) ∈ Rp × Rq are covariates of the response variable Y , g is an unknown link
function for the single index, and e is the error term with E(e) = 0 and 0 < Var(e) = σ2 < ∞.
For the sake of identifiability, it is often assumed that ‖β0‖ = 1 and the rth component of
β0 is positive, where ‖ · ‖ denotes the Euclidean metric.
This model is quite general, it includes the aforementioned partial-linear model when
the dimension of X is one and also the popular single-index model in the absence of the
linear covariate Z. There is an extensive literature for the single-index model with three
main approaches: projection pursuit regression (PPR) [Friedman and Stuetzle (1981), Hall
(1989), Hardle, Hall and Ichimura (1993)]; the average derivative approach [Stoker (1986),
Doksum and Samarov (1995), and Hristache, Juditsky and Spokoiny (2001)]; and sliced
inverse regression (SIR) and related methods [Li (1991), Cook and Li (2002), Xia, Tong, Li
and Zhu (2002), and Yin and Cook (2002)]. All these approaches rely on the assumption
that the predictors in X are continuous variables, while model (1.1) compensates for this by
allowing discrete or other continuous variables to be linearly associated with the response
variable. To our knowledge, Carroll, Fan, Gijbels and Wand (1997) were the first to explore
model (1.1) and they actually considered a generalized version, where a known link function
is employed in the regression function while model (1.1) assumes an identity link function.
2
However, their approaches may become computationally unstable as observed by Yu and
Ruppert (2002) and confirmed by our simulations in Section 3. The theory of Carroll, Fan,
Gijbels and Wand (1997) also relies on the strong assumption that their estimator for θ0 is
already√
n-consistent. Yu and Ruppert (2002) alleviated both difficulties by employing a
link function g which falls in a finite-dimensional spline space, yielding essentially a flexible
parametric model. Xia and Hardle (2006) used a method that is based on a local polynomial
smoother and a modified version of least squares in Hardle, Hall and Ichimura (1993).
In this paper, we propose a new estimation procedure. Our approach requires no iteration
and works well under the mild condition that a few indices based on X suffice to explain Z.
Namely,
Z = φ(XTβZ) + η, (1.2)
where φ(·) is an unknown function from Rd to Rq, βZ is a p × d matrix with orthonormal
columns, η has mean zero and is independent of X. The dimension d is often much smaller
than the dimension p of X. Such an assumption is not stringent and common in most
dimension reduction approaches in the literature. A theoretical justification is provided in
Li, Wen and Zhu (2008). Model (1.2) implies that a few indices of X suffice to summarize
all the information carried in X to predict Z, which is often the case in reality, such as
for the Boston Housing data in section 4, where a single index was selected for model (1.2)
and Z is a discrete variable. In this data, first analyzed in Harrison and Rubinfeld (1978),
the response variable is the median value of houses in 506 census tracts in the Boston area.
The covariates include: average number of rooms, the proportion of houses built before
1940, eight variables describing the neighborhood, two variables describing the accessibility
to highways and employment centers, and two variables describing air pollution. A key
covariate of interest is a binary variable that specifies whether a house borders the river or
not. Our analysis presented in Section 4 based on the dimension reduction assumptions of
(1.1) and with Z equal to this binary variable in (1.2) demonstrates the advantages of our
model assumption, only one index (d = 1) was needed in model (1.2) for this data.
To avoid the computational complications that we experienced with the procedure in
Carroll et al. (1997), who aim at estimating β0 and θ0 simultaneously, we choose to estimate
3
β0 and θ0 sequentially. The idea is simple: θ0 can be estimated optimally through approaches
developed for partial linear models once we have a√
n estimate of β0 and plug it in (1.1).
However, β0 and θ0 may be correlated, leading to difficulties in identifying β0. This is where
model (1.2) comes in handy, as it allows us to remove the part of Z that is related to X so
that the residual η in (1.2) is independent of X. Again, we need to impose the identifiability
condition that βZ has norm one and a positive first component. The procedure is as follows:
First estimate βZ via any dimension reduction approach, such as SIR or PPR for q = 1,
and the projective resampling method in Li, Wen and Zhu (2008) for q > 1. Once βZ has
been estimated we proceed to estimate φ via a d-dimensional smoother and then obtain the
residual for η. Since η = Z − φ(XTβZ), plugging this into (1.1) we get
Y = ηTθ0 + h(XTβ0, XTβZ) + e,
where h is an unknown function, but now η and X are independent of each other. It is thus
possible to employ a least squares approach to estimate θ0 and the resulting estimate will
be√
n-consistent. We then employ a dimension reduction procedure to Y − ZTθ0 and X to
obtain an estimate for β0 and g. This concludes the first stage , where the resulting estimates
for θ0 and β0 are already√
n consistent but will serve the role as initial estimates for the next
stage, where we update all the estimates but use a more sophisticated approach. Specifically
for θ0 we apply the profile method, also called partial regression in Speckman (1988), to
estimate θ0. Theoretical results in Section 2.2 indicate that the two-stage procedure is fully
efficient, so there is no need for iteration. More importantly, to estimate the index β0, we use
an estimating equation to obtain asymptotic normality, which takes the constraint ‖β0‖ = 1
into account. The estimator based on this new estimating equation performs better in several
ways, summarized as follows.
1. Our estimation procedure directly targets the model parameters θ0, β0, βZ , φ(·) and
g(·) and no iteration is needed.
2. We obtain the asymptotic normality of the estimator of β0 and the optimal convergence
rate of the estimator of g(·), as well as the asymptotic normality of the estimator of
θ0. The most attractive feature of this new method is that the estimator of β0 has
4
smaller limiting variance when compared to three existing approaches in : Hardle et
al. (1993) when the model is reduced to the single-index model, Carroll et al.(1997)
if their link function is the identity function, and Xia and Hardle (2006) when their
model is homoscedastic. This is the first result providing such a small limiting variance
in this area.
3. We also provide the asymptotic normality of the estimator of σ2. It allows us to
consider the construction of confidence regions and hypothesis testing for θ0 and β0.
The rest of the paper is organized as follows. In Section 2, we elaborate on the new
methodology and then present the asymptotic properties for the estimators. Section 3 reports
the results of a simulation study and Section 4 an application to a real data example for
illustration. Section 5 gives the proofs of the main theorems. Some lemmas and their proofs
are relegated to the Appendix.
2 Methodology and Main Results
2.1 Estimating Procedures
The observations are {(Xi, Yi, Zi); 1 ≤ i ≤ n}, a sequence of independent and identically
distributed (i.i.d.) samples from (1.1), i.e.
Yi = ZTi θ0 + g(XT
i β0) + ei, i = 1, . . . , n,
where e1, · · · , en are i.i.d. random errors with E(ei) = 0 and Var(ei) = σ2 > 0, {εi; 1 ≤ i ≤ n}are independent of {(Xi, Zi); 1 ≤ i ≤ n}, Xi = (Xi1, . . . , Xip)
T, Zi = (Zi1, . . . , Ziq)T, β0 ∈ Rp
and θ0 ∈ Rq. For simplicity of presentation, we initially assume that Z can be recovered
from a single-index of X. That is, d = 1 in (1.2). The general case will be explored at the
end of this section in Remarks 2. Below we first outline the steps for each stage and then
elaborate on each of these steps.
Algorithm for Stage One:
5
1. Apply a dimension reduction method for the regression of Zi versus Xi to find an
estimator βZ of βZ ;
2. Smooth the Zi over XTi βZ to get an estimator φ(·) of φ(·), then compute the residuals
ηi = Zi − φ(XTi βZ);
3. Perform a linear regression of Yi versus ηi’s to find an initial estimator θ0 of θ0;
4. Apply a dimension reduction method to the regression of Yi − ZTi θ0 versus Xi to find
an initial estimator β0 of β0
5. Smooth the Yi−ZTi θ0 versus the XT
i β0 to obtain an estimator for g and for its deriva-
tive g′.
Algorithm for Stage Two:
6. Use the initial estimate β0 from Step 4 to update the estimate of θ0 through a profile
approach for the partial linear model by minimizing (2.5).
7. Use the updated estimate θ of θ0 from Step 6 to form the new residual Y −ZTθ, then
update the estimate of β0 by solving the estimating equation (2.10).
8. Use the updated estimates of θ0 and β0 in Steps 6 and 7 to update the estimate of g,
following the procedure as described in Step 5.
This completes the algorithm and, as we show in Section 2.2, the resulting estimators
are already theoretically efficient. However, the practical performance can be improved by
iterating Steps 6 and 7 one or more times. Our experience, through simulation studies not
reported in this paper, reveals limited benefits when iterating more than once.
Next, we elaborate on each of the steps in the above algorithms for the simple case
of a single index (d = 1). For the dimension reduction method in Step 4, one can use
any of several existing methods, such as SIR or one of its variants, PPR, or the minimum
average variance estimator (MAVE) of Xia, Tong, Li and Zhu (2002). These methods are
for univariate responses and hence can also be applied in Step 1 when q = 1. However,
6
when q > 1, a different method is needed in Step 1 for the case of a multivariate response,
and we recommend the dimension reduction method in Li, Wen and Zhu (2008). This and
other results in the literature already demonstrate the√
n-consistency of these dimension
reduction methods.
For the smoothing involved in Step 5, one can choose any one-dimensional smoother. We
employ the local polynomial smoother (Fan and Gijbels, 1996) to obtain estimators of the
link function g and its derivative g′, which will be used in the second stage of the estimation
procedure. Specifically, for a kernel function K(·) on R1 and a bandwidth sequence b = bn,
define Kb(·) = b−1K(·/b). For a fixed β and θ, the local linear smoother aims at minimizing
the weighted sum of squares
n∑
i=1
[Yi − ZTi θ − d0 − d1(X
Ti β − t)]2Kb(X
Ti β − t)
with respect to the parameters dν , ν = 0, 1. Let h = hn and h1 = h1n denote the bandwidths
for estimating g(·) and g′(·), respectively. A simple calculation shows that the local linear
smoother with these specifications can be represented as
g(t; β, θ) =n∑
i=1
Wni(t, β)(Yi − ZTi θ), (2.1)
and
g′(t; β, θ) =n∑
i=1
Wni(t, β)(Yi − ZTi θ), (2.2)
where
Wni(t; β) =Kh(X
Ti β − t)[Sn,2(t; β, h)− (XT
i β − t)Sn,1(t; β, h)]
Sn,0(t; β, h)Sn,2(t; β, h)− S2n,1(t; β, h)]
, (2.3)
Wni(t; β) =Kh1(X
Ti β − t)[(XT
i β − t)Sn,0(t; β, h1)− Sn,1(t; β, h1)]
Sn,0(t; β, h1)Sn,2(t; β, h1)− S2n,1(t; β, h1)]
, (2.4)
and
Sn,l(t; β, h) =1
n
n∑
i=1
(XTi β − t)lKh(X
Ti β − t), l = 0, 1, 2.
The above estimators are for generic fixed values of β and θ. To obtain the estimates
needed in Step 5, one replaces them with the initial values β0 obtained in Step 1 and θ0
obtained in Step 3, respectively. We will show in Theorem 2 that this results in standard
convergence rates for the estimate of g.
7
Likewise, a local linear smoother can be employed in Step 2 for estimating the unknown
function φ in model (1.2). The resulting estimator is defined as
φ(t; βZ) =n∑
i=1
Wni(t; βZ)Zi.
Several possibilities are available for the estimator of θ0 in Step 6, such as the profile
approach (termed “partial regression” in Speckman, 1988) or the partial spline approach
(Heckman, 1986). Here the the partial spline approach is not suitable for correlated X and
Z, so we adopt a profile approach and a local linear smoother. In short, this amounts to
minimizing, over all θ, the sum of squared errors,
n∑
i=1
[Yi − ZTi θ − g(XT
i β0; β0, θ)]2, (2.5)
where g is the estimator in (2.1) of g, obtained by smoothing Yi−ZTi θ versus XT
i β0, and β0
is an initial estimator of β0, which could be the initial estimator β0 in Step 4 or the refined
estimator from Step 7 when an iterated estimator for θ0 is desirable. Because this smoother
is expressed as a function of θ, the estimate derived from (2.5) is a profile estimate. More
details about the derivation and advantages of the profile approach can be found in Speckman
(1988). Specifically, let β0 be the current estimator, Y = (Y1, . . . , Yn)T, Z = (Z1, . . . , Zn)T,
where
Yi = Yi − g1(XTi β0; β0), Zi = Zi − g2(X
Ti β0; β0),
g1(t; β0) =∑n
i=1 Wni(t; β0)Yi, g2(t; β0) =∑n
i=1 Wni(t; β0)Zi,
with g1 and g2 the respective estimators of g1(t) = E(Y |XTβ0 = t) and g2(t) = E(Z|XTβ0 =
t). The resulting partial regression estimator is thus
θ = (ZTZ)−1ZTY. (2.6)
For the estimator of β0 in Step 7, we propose a novel method that takes advantage of the
constraint ‖β0‖ = 1 and hence is more efficient than existing approaches, including the
PPR approach in Hardle et al (1993), the MAVE method in Xia et al. (2002), and the
least squares approaches of Carroll et al (1997) and Xia and Hardle (2006) for the single-
index partial linear model in (1.1). It is worth mentioning that Xia and Hardle (2006)
allow possible heteroscadestic structure in (1.1), and least squares approaches have been
8
standard dimension methods and lead to the same asymptotic variances for estimators of
β0. For instance, in the homoscadestic case, the estimator in Xia and Hardle (2006) has an
asymptotic variance that is identical to that of Hardle et al (1993). Our approach, based
on an estimating equation under the constraint ‖β0‖ = 1, is computationally stable and
asymptotically more efficient, i.e., its asymptotic variance is smaller. The efficiency gain can
be attributed to a re-parametrization, making use of the constraint ‖β0‖ = 1 by transferring
restricted least squares to un-restricted least squares, which makes it possible to search for
the solution of the estimating equation over a restricted region in the Euclidean space Rp−1.
Without loss of generality, we may assume that the true parameter β0 has a positive
component (otherwise, consider −β0), say β0r > 0 for β0 = (β01, . . . , β0p)T and 1 ≤ r ≤ p.
For β = (β1, . . . , βp)T, let β(r) = (β1, . . . , βr−1, βr+1, . . . , βp)
T be a p−1 dimensional parameter
vector after removing the rth component βr in β. Then we may write
β = β(β(r)) = (β1, . . . , βr−1, (1− ‖β(r)‖2)1/2, βr+1, . . . , βp)T. (2.7)
The true parameter β(r)0 must satisfy the constraint ‖β(r)
0 ‖ < 1, and β is infinitely differen-
tiable in a neighborhood of β(r)0 . This “remove-one-component” method for β has also been
applied in Yu and Ruppert (2002).
To obtain the estimator, consider a Jacobian matrix of β with respect to β(r),
Jβ(r) =∂β
∂β(r)= (γ1, . . . , γp)
T, (2.8)
where γs (1 ≤ s ≤ p, s 6= r) is a p − 1 dimensional unit vector with sth component 1, and
γr = −(1 − ‖β(r)‖2)−1/2β(r). To motivate the estimating equation, we start with the least
squares criterion:
D(β) :=n∑
i=1
[Yi − ZTi θ − g(XT
i β; β, θ)]2. (2.9)
From (2.7) and (2.9) we find D(β) = D(β(β(r))) = D(β(r)). Therefore, we may obtain an
estimator of β(r)0 , say β(r), by minimizing D(β(r)), and then obtain an estimator of β0, β, via
a transformation. This means that we transform a restricted least squares problem to an
unrestricted least squares problem by solving the estimation equation:
n∑
i=1
[Yi − ZTi θ − g(XT
i β; β, θ)]g′(XTi β; β, θ)JT
β(r)Xi = 0. (2.10)
9
We define the resulting estimator β of β0 as the final target estimator. Theorem 3 implies
that our estimator for β0 has a smaller limiting variance than the estimators in Xia and
Hardle (2006) and Carroll et al. (1997).
With θ and β, the final estimator g∗ of g in Step 8 can be defined by
g∗(t) := g(t; β, θ) =n∑
i=1
Wni(t; β)(Yi − ZTi θ),
and the estimator σ2 of σ2 by σ2 = 1n
∑ni=1 [Yi − ZT
i θ − g∗(XTi β)]2. Asymptotic results for
the final parameter estimates of θ and β are established in Theorem 1 and Theorem 2, and
results for the link estimate of g follow from Theorem 4.
Remark 1 We consider a homoscedastic model of (1.1) with d = 1 in model (1.2). While
the estimation procedure can be extended easily to heteroscedastic errors, an additional
dimension reduction assumption on the variance function of of η, given X, is needed to avoid
the curse of high dimensional smoother needed in Step 2 to estimate φ. This assumption
requires that this variance function is also a function of a few indices based on X. Moreover,
the extension of asymptotic theory is not straightforward. For instance, the asymptotic
efficiency of the estimator β0 is technically challenging in the heteroscedastic case and its
study is beyond the scope of this paper.
Remark 2 So far, we have assumed that d = 1. This assumption can be extended
without difficulty to the general case where d might be greater than 1. In this case, a
multivariate smoother will be employed for estimating φ(·). The asymptotic results for the
parameter estimates of β and θ remain unchanged, except that the rate of convergence for
the link estimate of φ(·) changes with the dimension of d.
Remark 3 Other dimension reduction approaches, such as MAVE (Xia et al., 2002) and
other variants of SIR, such as SIR2 (Li, 1991) and SAVE (Cook and Wiseberg, 1991), could
be employed in Steps 1 and 4 for the case of q = d = 1 in (1.2), especially when SIR fails
for the case of symmetric design of X. While MAVE is perhaps the most efficient method
of all, the benefits over SIR are limited, as all estimates are updated in Stage 2, and it is
in this step where the major efficiency gains occur. In addition, MAVE is computationally
10
more intensive than SIR and encounters difficulties in estimating βZ , unless the covariate Z
is one-dimensional and the dimension d of βZ is also small. In fact, the√
n-consistency may
not hold when d > 3 in (1.2) as shown in Xia, Tong, Li and Zhu (2002).
Also, SIR2/SAVE was shown in Li and Zhu (2007) to be not√
n-consistent, unless a
bias correction is performed. In contrast, either SIR or pHd (Li, 1992) can be employed to
identify the directions when d > 1 and q = 1, and both lead to√
n-consistency.
Remark 4 When the dimension q of Z is greater than 1, a multivariate extension of
SIR (Li et al., 2003) can be employed conceptually in Step 1 of the algorithm. However,
the number of observations per slice may become sparse, so we recommend an alternative
multivariate approach as in Li, Wen and Zhu (2008) or Zhu, Zhu, Ferre and Wang (2008) in
Step 1.
Remark 5 The single-index assumption in (1.1) can be easily extended to multiple
indices through SIR or its variants, but the estimation of the multivariate link function g
would encounter the curse of high dimensionality. Since no more than three indices will be
needed in many applications, the approach in this paper can indeed be extended in practice
to multiple indices.
2.2 Main results
In this section, the√
n asymptotics for initial estimates of β and θ in Stage 1 are taken for
granted as they follow from existing results, so we do not formally list the needed assumptions
for this to hold but have provided sources after Theorem 1 below. However, the asymptotics
for the initial estimate of g and each of the parametric and nonaprametric estimates in Stage
2 are fully developed in Section 2.2 with detailed assumptions listed for each estimator.
In order to study the asymptotic behavior of the estimators, we list the following condi-
tions:
C1. (i) The distribution of X has a compact support set A.
(ii) The density function of XTβ is positive and satisfies a Lipschitz condition of order
1 for β in a neighborhood of β0. Further, XTβ0 has a positive and bounded density
function f(t) on T , where T = {t = xTβ0 : x ∈ A}.
11
C2. (i) The functions g and g2i have two bounded and continuous derivatives, where g2i is
the ith component of g2(t), 1 ≤ i ≤ q;
(ii) g3j satisfies a Lipschitz condition of order 1, where g3j is the jth component of g3(t),
and g3(t) = E(X|XTβ0 = t), 1 ≤ j ≤ p.
C3. (i) The kernel K is a bounded, continuous and symmetric probability density function,
satisfying ∫ ∞
−∞u2K(u)du 6= 0,
∫ ∞
−∞|u|2K(u)du < ∞;
(ii) K satisfies a Lipschitz condition on R1.
C4. (i) supt E(‖Z‖2|XT1 β0 = t) < ∞;
(ii) E(e) = 0, Var(e) = σ2 < ∞, E(e4) < ∞.
C5. (i) nh2/ log2 n →∞, lim supn→∞
nh5 < ∞;
(ii) nhh31/ log2 n →∞, nh4 → 0, lim sup
n→∞nh5
1 < ∞.
C6. (i) Σ = Cov(Z − E(Z|XTβ0)) is a positive definite matrix;
(ii) V = E[g′(XTβ0)2JT
β(r)0
XXTJβ
(r)0
] is a positive definite matrix, where Jβ
(r)0
is defined
by (2.8).
Remark 6 The Lipschitz condition and the two derivatives in C1 and C2 are standard
smoothness conditions. C3 is the usual assumption for second-order kernels. C1 is used to
bound the density function of XTβ away from zero. This ensures that the denominators
of g(t; β, θ0) and g′(t; β, θ0) are, with high probability, bounded away from 0 for t = xTβ,
x ∈ A and β near β0. C4 is a necessary condition for the asymptotic normality of an
estimator. In C5(i), the range of h for the estimators θ and g is fairly large and contains the
rate n−1/5 of “optimal” bandwidths. However, when analyzing the asymptotic properties
of the estimator β of β0, we have to estimate the derivative g′ of g. As is well known, the
convergence rate of the estimator of g′ is slower than that of the estimator of g if the same
bandwidth is used. This leads to a slower convergence rate for β than√
n, unless we use a
kernel of order 3 or undersmoothing to deal with the bias of the estimator. This motivates
the introduction of another bandwidth h1 in C5(ii) to control the variability of the estimator
of g′, and condition C5(ii) for bandwidths h and h1. Chiou and Muller (1998) also consider
12
the use of two bandwidths to construct the estimator of β in a relevant model. C6 ensures
that the limiting variances for the estimators θ and β exist.
The following theorems state the asymptotic behavior of the estimators proposed in
Section 2.1. We first establish the asymptotic efficiency of θ.
Theorem 1 Suppose that conditions C1, C2(i), C3(i), C4(i), C5(i) and C6(i) hold.
When ‖βZ − βZ‖ = OP (n−1/2) and ‖β0 − β0‖ = OP (n−1/2), we have
√n(θ − θ0)
D−→ N(0, σ2Σ−1).
Remark 7 Carroll et al.(1997) give similar results with β = 1 and p = 1 (The case of
a partially linear model). Theorem 1 generalizes their Theorems 2 and 3.
In Theorem 1, when we start with√
n-consistent estimators for βZ and β0, θ is consistent
for θ0 with the same asymptotic efficiency as an estimator that we would have obtained had
we known β0 and g, and thus the oracle property. Numerous examples of√
n- consistent
estimators already exist in the literature. For instance, Hall (1989) showed that one can
obtain a√
n-consistent estimator for β0 using projection pursuit regression. Under the
linearity condition that is slightly weaker than elliptical symmetry of X, Li (1991), Hsing
and Carroll (1992) and Zhu and Ng (1995) proved that SIR, proposed by Li (1991), leads to
a√
n-consistent estimator of βZ and of β0, the latter when Z is not present in (1.1). Li and
Zhu (2007) further show that, when including a bias-correction and under a condition almost
equivalent to normality of X, sliced average variance estimation (SAVE, Cook and Weisberg
1991) performs similarly. We expect the results for β0 to hold when Z is dependent of X,
provided a good estimator of βZ is available. Under very general regularity conditions and
for q = 1, Xia, Tong, Li, and Zhu (2002) proposed the minimum average variance estimation
(MAVE) and Xia (2006) a refined version of MAVE, and both methods can provide√
n-
consistent estimators for the single-index β0. However, there is no result in the literature
regarding MAVE when the dimension of Z is larger than 1, and the√
n-consistency needs
further study when d is larger than or equal to 3, even for univariate Z. Therefore, for
general theory, SIR may be a good choice for the initial estimators of βZ and β0.
13
Theorem 2 Suppose that conditions C1–C6 hold. If the rth component of β0 is positive,
we have√
n(β − β0)D−→ N(0, σ2J
β(r)0
V−1QV−1JT
β(r)0
),
where Q = E{g′(XTβ0)2JT
β(r)0
[X − E(X|XTβ0)][X − E(X|XTβ0)]TJ
β(r)0}, V and J
β(r)0
are
defined in condition C6.
¿From Hardle et al (1993) and Carroll et al (1997), we can see that the estimator β of β
has an asymptotic variance that corresponds to a generalized inverse σ2Q−1 where
Q1 = E{g′(XT β0)
2[X − E(X|XT β0)
] [X − E(X|XT β0)
]T}
.
Note that there may be infinitely many inverse matrices of Q1, but there is a unique gen-
eralized inverse associated with the Jacobian Jβ
(r)0
. The following theorem shows that the
variance-cavariance matrix in Theorem 2 is smaller than σ2Q−1 , the variance associated with
Jβ
(r)0
, in the sense that σ2Q−1 − σ2J
β(r)0
V−1QV−1JT
β(r)0
is a non-negative definite matrix. We
use the usual notation: for two non-negative matrices A and B, A ≥ B denotes that A−B
is a non-negative definite matrix.
Theorem 3 Under the conditions of Theorem 2, we have
i) there is a generalized inverse of Q1 that is of the form JT
β(r)0
Q−1Jβ
(r)0
;
ii) JT
β(r)0
Q−1Jβ
(r)0≥ J
β(r)0
V−1QV−1JT
β(r)0
.
Remark 8 Theorem 3 shows that our estimator of β0 is asymptotically more efficient
than those of Hardle et al.(1993) and of Carroll et al. (1997). In addition, Carroll et al.(1997)
use an iterated procedure to estimate β0 and θ0, while our estimation procedure does not
require iteration.
¿From Theorem 2, we obtain an asymptotic result regarding the angle between β and
β0, which can be used to study issues of sufficient dimension reduction (SDR). We refer to
Cook (1998, 2007) for more details.
Corollary 1 Suppose that the conditions of Theorem 2 hold. Then
cos(β, β0)− 1 = OP (n−1/2),
14
where cos(β, β0) is the cosine of the angle between β and β0.
The next two theorems provide the convergence rate of the estimator g∗(·) of g(·) and
the asymptotic normality of the estimator of σ2.
Theorem 4 Suppose that the conditions of Theorem 1 hold. If ‖β − β0‖ = OP (n−1/2).
Then
sup(x,β)∈An
|g∗(xTβ)− g(xTβ0)| = OP ((nh/ log n)−1/2),
where An = {(x, β) : (x, β) ∈ A×Rp, ‖β − β0‖ ≤ cn−1/2} for a constant c > 0.
Theorem 5 Suppose that conditions C1–C6 hold and 0 < Var(e21) < ∞. Then
√n(σ2 − σ2)/(Var(e2
1))1/2 D−→ N(0, 1).
Note that n−1ZTZP−→ Σ in Lemma A.5 of the Appendix. By Theorems 1 and 4 , we
obtain
(ZTZ)1/2(θ − θ0)/σD−→ N(0, Iq).
We are now in the position to construct confidence regions for θ0. From Theorem 10.2d
in Arnold (1981) we obtain the following result.
Theorem 6 Under the conditions of Theorem 5, we have
(θ − θ0)T(ZTZ)(θ − θ0)/σ
2 D−→ χ2q,
where χ2q is chi-square distributed with q degrees of freedom. Let χ2
q(1 − α) be the (1 − α)-
quantile of χ2q for 0 < α < 1, an asymptotic confidence region of θ0 is
Rα = {θ : (θ − θ)T(ZTZ)(θ − θ)/σ2 ≤ χ2q(1− α)}.
To construct confidence regions for β0, a plug-in estimator of the limiting variance of β
is needed. We respectively define the following estimators V and Q of V and Q by
V =1
n
n∑
i=1
g′(XTi β; β, θ)2JT
β(r)XiXTi Jβ(r)
15
and
Q =1
n
n∑
i=1
g′(XTi β; β, θ)2JT
β(r) [Xi − g3(XTi β; β)][Xi − g3(X
Ti β; β)]TJβ(r) ,
where g3(t; β) =∑n
i=1 Wni(t; β)Xi is the estimator of g3(t) = E(X|XTβ0 = t) and Jβ(r) is
the estimator of Jβ
(r)0
. It is easy to prove that Jβ(r)
P−→ Jβ
(r)0
, VP−→ V and Q
P−→ Q. Then
for any p× l matrix A of full rank with l < p, Theorems 2 and 5 imply that
(n−1ATJβ(r)V−1QV−1JT
β(r)A)−1/2AT (β − β0)/σD−→ N(0, Il).
We again use Theorem 10.2d in Arnold (1981) to obtain the following limiting distribution.
Theorem 7 Suppose that the conditions of Theorem 5 hold. Then
(β − β0)TA(n−1ATJβ(r)V
−1QV−1JTβ(r)A)−1AT (β − β0)/σ
2 D−→ χ2l .
The asymptotic confidence region of AT β0 is, letting χ2l (1− α) be the (1− α)-quantile of χ2
l
for 0 < α < 1,
Rα = {AT β : (β − β)TA(n−1ATJβ(r)V−1QV−1JT
β(r)A)−1AT (β − β)/σ2 ≤ χ2l (1− α)}.
3 Simulation study
In this section, we examine the performance of the procedures in Section 2, for the
estimation of both β0 and θ0. We report the accuracy of estimators using PPR and SIR as
dimension-reduction methods. The sample size for the simulated data is n = 100 and the
number of simulated samples is 2000 for the parametric components. When SIR is applied,
using 5 or 10 elements per slice generally yields good results. In other words, each slice
contains 10 to 20 points. A quadratic model of the form
Y = (XTβ0 − 0.5)2 + Zθ0 + 0.2e,
was used, where θ0 = 1 is a scalar, β0 = (0.75, 0.5,−0.25,−0.25, 0.25)T, X is a 5-dimensional
vector with independent uniform [0,1] components, and e is a standard normal variable.
The dependency between X and Z was prescribed by defining Z as a binary variable with
16
probability exp(βZX)/(1 + exp(XTβZ)) to be 1 and 0 otherwise. Two extreme cases of
βZ are reported in Table 1 and Table 2, one based n choosing the same value as β0 with
βZ = β0, and the other on βZ = (0.5, 0, 0.5, 0.5,−0.5)T, so that βZ is orthogonal to β0. We
also checked scenarios where βZ and β0 are neither orthogonal nor parallel to each other,
and the results are in agreement with the two extreme cases reported here.
For the smoothing steps, we used a local linear smoother with a Gaussian kernel through-
out. A product Gaussian kernel was used when bivariate smoothing was involved and equal
bandwidths were selected for each kernel to save computing time. A pilot study revealed
that the bandwidth chosen at the first stage to estimate the residual η has little effect on
the accuracy of the final estimates of θ0, so we choose an initial bandwidth of 0.5 to estimate
φ in (1.2), as this value was frequently selected by generalized cross validation (GCV). The
subsequent smoothing steps utilized the GCV method as proposed in Craven and Wahba
(1979). For instance, when estimating g and θ0 in the second stage, the GCV statistic is
given by the formula
GCV(h) =1
n
n∑
i=1
(Yi − ZTi θ − gh(X
Ti β; β, θ)2/(n−1tr(I− Sh))
2, (3.1)
where gh(·) is the estimator of g(·) with a bandwidth h and Sh is the smoothing matrix
corresponding to a bandwidth of h. The GCV bandwidth was selected to minimize (3.1).
We use the optimal bandwidth, hopt, for g and θ. When calculating the estimator β, we
chose the bandwidths,
h = hoptn1/5n−1/3 = hoptn
−2/15 and h1 = hopt, (3.2)
respectively, because this guarantees that the required bandwidth has the correct order of
magnitude for optimal asymptotic performance [see Carroll al.(1997), Stute and Zhu (2005),
and Zhu and Ng (2003)]. Note that choices (3.2) satisfy condition C5(ii). Relevant discussion
on choosing two distinct bandwidths can be found in Chiou and Muller (1998).
In the simulation, PPR and SIR were used to obtain the initial estimators of β0 and
βZ . The notation SIRc means that when we used SIR to estimate βZ , the number of data
points per slice is c. The resulting estimates for θ0 and the one-step iterated estimates are
summarized in Tables 1 and 2, where we report bias, standard deviation (SD), and mean
17
square error (MSE). The case with known β0 is also reported in the last row and serves as
a gold standard. The right columns under “One-step iterated estimate” in Tables 1 and 2
represent the results obtained when iterating the algorithms in Section 2.1 one more time
after obtaining the estimates in the left columns.
Tables 1 and 2 are about here
¿From Tables 1 and 2 we find that the three methods have small mean square errors with
projection pursuit regression outperforming both SIR procedures. This is expected, as the
simulated model structure satisfies the additive assumption of PPR and the estimates of the
β-directions were iteratively updated through estimates of the unknown link functions, φ and
g. In other non-additive situations, SIR might be more reliable than PPR. Iterated estimates
improved the results for all cases and markedly so for the orthogonal case. Compared to
the case when β0 is known, PPR typically attains 80% or more of the efficiency after one
iteration.
For the estimation of β0, we computed the angle (in radians) between β and β0 as a
measure of accuracy. The mean, standard deviation (SD), and mean squared error (MSE)
of the angle between β and β0 are reported in Table 3. Here, PPR leads to by far superior
estimates compared to SIR.
Table 3 is about here
The performance of the nonparametric estimates for g is demonstrated in Figure 1. Again,
GCV was used for bandwidth choice and compared to the estimates based on the optimal
fixed bandwidth. The true function g and the mean of each estimated g-function over the
2000 replicates are plotted. In general, GCV seems to work well for all parametric and
nonparametric components. This is consistent with the results reported in Chen and Shiau
(1994) for the analysis of partially linear models based on generalized cross validation (GCV).
18
Theoretical properties of the current models in regard to GCV will be a topic for further
investigation.
Figure 1 is about here
A final remark is that we tried to compare our procedure with that proposed in Carroll, et
al. (1997), for the quadratic model used in the above simulations with βZ and β0 orthogonal.
However, we were not able to obtain any results for the method in Carroll et al. (1997), as
their procedure seems to be very sensitive to the choice of the initial estimates. We then used
our estimates for β0 and θ0 as the initial values for their procedure. Nevertheless, we were still
unable to obtain any meaningful comparison results as out of the seven attempted trials their
procedure crashed six times on the first simulation and once on the second simulation. Since
θ0 is only a scalar, we postulate that their procedure has difficulties with high dimensional
β0, which is here a five-dimensional vector.
4 Data Example
We analyze the Boston Housing data mentioned in Section 1. The goal is to determine the
effect of the various variables on housing price, including a binary variable, which describes
whether the census tract borders the Charles River. According to Harrison and Rubinfeld
(1978), bordering the river should have a positive effect on the median housing price of the
census tract. They used a linear model that included a log transformation for the response
variable and three of the covariates, and power transformations for three other covariates.
Their final model is
log(MV ) = a1 + a2RM2 + a3AGE + a4 log(DIS) + a5 log(RAD) + a6TAX
+ a7PTRATIO + a8(B − 0.63)2 + a9 log(LSTAT ) + a10CRIM
+ a11ZN + a12INDUS + a13CHAS + a14NOXp + e.
19
The coefficient a13 is estimated to be 0.088, which is significant with a p-value of less than
0.01 for the hypothesis H0 : a13 = 0 versus H1 : a13 6= 0. The coefficient of determination
R2 attained by their analysis is 0.81, where R2 is the squared correlation between the true
dimension-reduction variable XTβ0 and the estimated dimension-reduction variable XTβ0.
This data set was also analyzed by Chen and Li (1998), who used sliced inverse regression
with all thirteen covariates. After examining the initial results, Chen and Li (1998) trimmed
the data and then dropped some of the variables. We fit the data on the first SIR direction
of the initial analysis reported in their article and obtained an R2 of 0.705 using GCV
bandwidth 0.43. Note that the assumptions of sliced inverse regression are probably not met
because some of the covariates are discrete. We thus proposed to use a partial-linear single-
index model. Several choices of Z were attempted, but they did not yield better results, in
terms of R2, than the one using only the Charles River variable as Z and the other covariates
as X. We thus focus on this model, where a log transformation was applied on Y .
To select the number of observations per slice in the dimension reduction step of SIR, we
borrow our experience in the simulation presented in Section 3, where 5 or 10 observations
per slice worked well for a total sample size of 100, leading to about 20 to 10 slices. Since
the sample size for the housing data is much larger, we use SIR with 20 data points per slice
and this leads to a total of 26 slices. As Chen and Li (1998) pointed out, SIR is not sensitive
to the choice of slice number, and they tried slicing with 10 or 30 points per slice leading to
17 or 50 slices, and obtain very similar results. The GCV bandwidth for estimating g and
θ is 0.367, which is smaller than the bandwidth 0.43 chosen by the GCV method for the
SIR approach of Chen and Li (1998). To estimate β by (2.10), the bandwidths selected by
(3.2) for h = 0.16 and for h1 is 0.367. The R2 is 0.8047, which is essentially equal to that
obtained by Harrison and Rubinfeld and higher than that using SIR on all thirteen variables.
The value of the test statistic for H0 : θ = 0 versus H1 : θ 6= 0 is 3.389 when the degrees
of freedom are calculated according to Hastie and Tibshirani, and 3.419 when n degrees of
freedom are used. Either way the result is significant with p-value < 0.01.
We also omitted the Charles River variable and used a dimension-reduction model on Y
and X. After obtaining an estimate for β0, we then estimate the relationship between Y
20
and XTβ. GCV yields a bandwidth of 0.16, and we obtain R2 = 0.8021. Even though the
Charles River variable is significant, its inclusion leads to only a minor increase in R2.
Figure 2 is about here
Figure 2 shows the estimated g along with the data. On the x-axis of the above graph,
the estimated value xTβ is given, and on the y-axis, the estimated value g∗(t). Figure 2
shows a downward trend in the effective dimension reduction (EDR) variate obtained. The
upward curvature of the function at high values of the EDR variate may or may not be a
real effect.
The advantage of our procedure over the one used by Harrison and Rubinfeld is that
Harrison and Rubinfeld have to make choices regarding transformations for every variable
in the model. We only need to choose the bandwidth or bandwidths used for smoothing.
5 Proofs of Theorems
Since the proofs of the theorems are rather long, the proofs of Theorems 1–4 are presented
in this section, and more details of the proofs are divided into Lemmas A.2–A.7 in the
Appendix.
In this section and the Appendix, we use c > 0 to represent any constant which may take
different values for each appearance, and a ∧ b = min(a, b).
Proof of Theorem 1. Denote
G = (g(XT1 β0)− g(XT
1 β0; β0, θ0), . . . , g(XTn β0)− g(XT
n β0; β0, θ0))T.
¿From (2.6) we have
√n(θ − θ0) =
√n(ZTZ)−1ZTG +
√n(ZTZ)−1ZTe.
21
Lemma A.5 in the Appendix implies
n(ZTZ)−1 P−→ Σ−1. (5.1)
Therefore, Lemma A.6 in the Appendix leads to
√n(ZTZ)−1ZTG
P−→ 0.
It remains to show that√
n(ZTZ)−1ZTeD−→ N(0, σ2Σ−1). (5.2)
Since
ZTe =n∑
i=1
[Zi − g2(XTi β0)]ei +
n∑
i=1
[g2(XTi β0)− g2(X
Ti β0; β0)]ei
=: M1 + M2.
The central limit theorem implies n−1/2M1D−→ N(0,Σ). Similarly to the proof of (A.17),
it is easy to obtain that n−1/2M2P−→ 0. This together with (5.1) and Slutsky’s Theorem
proves (5.2), and hence Theorem 1.
Proof of Theorem 2. The proof is divided into two steps: From (2.9), step (I) provides
the existence of the least squares estimator β of β0, and from (3.1), step (II) proves the
asymptotic normality of β.
(I) Proof of existence. We prove the following fact: Under conditions C1–C5 and
with probability one there exists an estimator of β0 minimizing expression (2.9) in B1n,
where B1n = {β : ‖β − β0‖ = B1n−1/2} for some constant such that 0 < B1 < ∞.
In fact, let Y = (Y1, . . . , Yn)T and Z = (Z1, . . . , Zn)T. We have
D(β) = (Y − Zθ)T(I− Sβ)T(I− Sβ)(Y − Zθ)
= (Y − Zθ0)T(I− Sβ)T(I− Sβ)(Y − Zθ0)
− 2(Y − Zθ0)T(I− Sβ)T(I− Sβ)Z(θ − θ0)
+ {Z(θ − θ0)}T(I− Sβ)T(I− Sβ)Z(θ − θ0)
=: D1(β)−D2(β) + D3(β).
22
The same arguments as in the proof of Theorem 1 can be used to obtain that D2(β) =
R0 + oP (1) and D3(β) = oP (1), where R0 is a constant independent of β. This implies
D(β) = D1(β)−R0+oP (1). Thus, minimizing D(β) simultaneously with respect to β is very
much like separately minimizing D1(β) with respect to β. It follows from (2.7) that we only
need to prove the existence of an estimator of β(r)0 in B2n, where B2n = {β(r) : ‖β(r)−β
(r)0 ‖ =
B2n−1/2} for some constant such that 0 < B2 < ∞. Since R(β(r)) = (−1
2)∂D1(β)
∂β(r) , where
R(β(r)) is defined in (A.19) of Lemma A.7. For an arbitrary β(r) ∈ B2n with the value of
constant B2 in B2n to be determined, we have from Lemma A.7 below that
(β(r) − β(r)0 )TR(β(r))
= (β(r) − β(r)0 )TU(β
(r)0 )− n(β(r) − β
(r)0 )TV(β(r) − β
(r)0 ) + oP (1). (5.3)
The following arguments are similar to those used by Weisberg and Welsh (1994), which
in turn use (6.3.4) of Ortega and Rheinboldt (1973). We note that term (5.3) is dominated
by the term ∼ B22 because
√n‖β(r) − β
(r)0 ‖ = B2, whereas |(β(r) − β
(r)0 )TU(β
(r)0 )| = B2OP (1)
and n(β(r)−β(r)0 )TV(β(r)−β
(r)0 ) ∼ B2
2 . So, for any given η > 0, if B2 is chosen large enough,
then it will follows that (β(r) − β(r)0 )TR(β
(r)0 ) < 0 on an event with probability 1− η. From
the arbitrariness of η, we can prove the existence of the least squares estimator of β(r)0 in B2n
as in the proof of Theorem 5.1 of Welsh (1989). The details are omitted.
(II) Proof of asymptotic normality. From step (I) we find that β(r) is a solution in
B2n to the equation R(β(r)) = 0. That is, R(β(r)) = 0. By Lemma A.7, we have
0 = U(β(r)0 )− nV(β(r) − β
(r)0 ) + oP (
√n ),
and hence√
n(β(r) − β(r)0 ) = V−1n−1/2U(β0) + oP (1).
We now consider the estimator β. A simple calculation yields
2√
1− ‖β(r)0 ‖2
√1− ‖β(r)‖2 +
√1− ‖β(r)
0 ‖2− 1 =
√1− ‖β(r)
0 ‖2 −√
1− ‖β(r)‖2
√1− ‖β(r)‖2 +
√1− ‖β(r)
0 ‖2= OP (n−1/2),
23
and hence
√1− ‖β(r)‖2 −
√1− ‖β(r)
0 ‖2
= − (β(r) + β(r)0 )T (β(r) − β
(r)0 )√
1− ‖β(r)‖2 +√
1− ‖β(r)0 ‖2
= −2β(r)T0 (β(r) − β
(r)0 ) + ‖β(r) − β
(r)0 ‖2
√1− ‖β(r)‖2 +
√1− ‖β(r)
0 ‖2
= −β(r)T0 (β(r) − β
(r)0 )√
1− ‖β(r)0 ‖2
+ OP (n−1).
It follows from (2.7) and the above equation, that
β − β0 =
β1
...
βr−1√1− ‖β(r)‖2
βr+1
...
βp
−
β01
...
β0(r−1)√1− ‖β(r)
0 ‖2
β0(r+1)
...
β0p
=
β1 − β01
...
βr−1 − β0(r−1)
−β(r)T0 (β(r)−β
(r)0 )√
1−‖β(r)0 ‖2
βr+1 − β0(r+1)
...
βp − β0p
+ OP (n−1).
That is, from the definition of Jβ
(r)0
of (2.8)
β − β0 = Jβ
(r)0
(β(r) − β(r)0 ) + OP (n−1).
Thus, we have√
n(β − β0) = Jβ
(r)0
V−1n−1/2U(β(r)0 ) + oP (1).
Theorem 2 follows from this, Central Limit Theorem and Slutsky’s Theorem.
Proof of Theorem 3. Recalling the definition of Q, we can see that Q = JT
β(r)0
Q1Jβ(r)0
.
Define
Π0 := Jβ
(r)0
Q−1JT
β(r)0
, Π1 := Jβ
(r)0
V−1QV−1JT
β(r)0
.
We now prove that Π0 is a generalized inverse of Q1. To this end, we need to prove that
Π0Q1Π0 = Π0 and Q1Π0Q1 = Q1. Note that
Π0Q1Π0 = Jβ
(r)0
Q−1JT
β(r)0
Q1Jβ(r)0
Q−1JT
β(r)0
= Jβ
(r)0
Q−1QQ−1JT
β(r)0
= Π0.
24
We now prove Q1Π0Q1 = Q1. First, by QR decomposition (see, e.g. Gentle 1998, Section
3.2.2, pages 95-97 for more details) for β0, we can find its orthogonal complement such that
B = (b1, β0) is an orthogonal matrix, and β0 = B
0
1
. Thus, J
β(r)0
= BBTJβ
(r)0
=: BR
where R =
R1
R2
with R1 being a (p− 1)× (p− 1) nonsigular matrix. Further, note that
Q = JT
β(r)0
Q1Jβ(r)0
= RTBTQ1BR
= RT
bT1 Q1b1 0
0 0
R
= RT1 bT
1 Q1b1R1.
To prove the result, we rewrite Q1 in another form. Define S = B
R1 0
0 1
. S is a
nonsingular matrix. Then
Q1 = (ST )−1STQ1SS−1 = (ST )−1
RT1 0
0 1
BTQ1B
R1 0
0 1
S−1
= (ST )−1STQ1SS−1 = (ST )−1
RT1 0
0 1
bT1 Q1b1 0
0 0
R1 0
0 1
S−1
= (ST )−1
Q 0
0 0
S−1.
We now prove that Q1Π0Q1 = Q1 that is of the above form. From the above and noting
25
that S−1 =
R−11 0
0 1
BT and (ST )−1 = B
(RT1 )−1 0
0 1
, we have
Q1Π0Q1
= (ST )−1
Q 0
0 0
S−1BRQ−1RTBT (ST )−1
Q 0
0 0
S−1
= (ST )−1
Q 0
0 0
R−11 0
0 1
BTBRQ−1RTBTB
(RT1 )−1 0
0 1
Q 0
0 0
S−1
= (ST )−1
Q 0
0 0
1
R2
Q−1
(1 R2
)
Q 0
0 0
S−1
= (ST )−1
Q
0
Q−1
(Q 0
)S−1 = (ST )−1
Q 0
0 0
S−1 = Q1.
Thus, Π0 is one of the solutions of Q−1 . To prove that the asymptotic variance-covariance
matrix σ2Π1 of our estimator is smaller than the corresponding matrix σ2Π0 given in Hardle
et al. (1993), we only need to show that Π0 −Π1 is a positive semi-definite matrix, that
is, Π0 > Π1. Recall that V = JT
β(r)0
E{g′(XT β0)
2XXT}J
β(r)0
. Note that both Q and
V are positive definite matrices and obviously V ≥ Q. Thus, Q−1 ≥ V−1, and then
V−1 ≥ V−1QV−1. From these two inequalities, it is easy to see that
Π0 ≥ Jβ
(r)0
V−1JT
β(r)0
≥ Jβ
(r)0
V−1QV−1JT
β(r)0
= Π1.
The proof is now complete.
Proof of Corollary 1. Let • denote the inner product of two vectors. Theorem 2
implies ‖β − β0‖ = OP (n−1/2) and
| cos(β, β0)− 1| = |(β − β0) • β0/‖β‖+ (‖β0‖ − ‖β‖)/‖β‖ |≤ 3‖β − β0‖/‖β‖ = OP (n−1/2).
This completes the proof of Corollary 1.
Proof of Theorem 4. Denote θ0 = (θ01, . . . , θ0q)T, θ = (θ1, . . . , θq)
T. Theorem 1 and
26
Lemma A.4 in the Appendix yield
sup(x,β)∈An
|g∗(xTβ)− g(xTβ0)|
≤q∑
s=1
sup(x,β)∈An
|g2s(xTβ; β)− g2s(x
Tβ0)||θs − θ0s|
+ sup(x,β)∈An
|g(xTβ; β, θ0)− g(xTβ0)|
+q∑
s=1
supx∈A
|g2s(xTβ0)||θs − θ0s| = OP ((nh/ log n)−1/2),
and hence Theorem 4 follows.
Proof of Theorem 5. Decomposing σ2 into several parts, we have
σ2 =1
n
n∑
i=1
e2i +
1
n
n∑
i=1
[ZTi (θ0 − θ) + g(XT
i β0)− g(XTi β; β, θ0)]
2
+2
n
n∑
i=1
eiZTi (θ0 − θ) +
2
n
n∑
i=1
ei[g(XTi β0)− g(XT
i β; β, θ0)]
=: I1 + I2 + I3 + I4.
Note that√
n‖θ − θ0‖ = OP (1) and using (A.8) of Lemma A.4, we have
√n|I2| ≤ 1
n
n∑
i=1
‖Zi‖2√
n‖θ − θ0‖2
+√
n sup(x,β)∈An
|g(xTβ0)− g(xTβ; β, θ0)|2
= OP (n−1/2) + OP ((nh2/ log2 n)−1/2)P−→ 0.
Since Eei = 0, we obtain√
nI3P−→ 0. Similarly to the proof of (A.17) in the Appendix, we
also have√
nI4P−→ 0. This proves that σ2 = 1
n
∑ni=1 e2
i + oP (n−1/2). Therefore, we have
√n(σ2 − σ2) =
1√n
n∑
i=1
(e2i − σ2) + oP (1).
The proof can now be completed by employing the central limit theorem.
APPENDIX
The following Lemmas A.1–A.7 are needed to prove Theorems 1, 2, 4, 5. Lemma A.1
gives an important probability inequality and Lemmas A.2 and A.3 provide bounds for the
27
moments of the relevant estimators. They are used to obtain the rates of convergence for the
estimators of the nonparametric component, and are used in the proof of Lemmas A.4–A.7.
Lemma A.4 presents the uniform rates of convergence in probability for the estimators g,
g2s and g′. These results are very useful for the nonparametric estimations. The proof of
Lemma A.5–A.7, as well as Theorem 3 and 4, rely on Lemma A.4. To simplify the proof
of Theorem 1, we divide the main steps of the proofs into Lemmas A.5 and A.6. Lemma
A.5 is used to obtain the limiting variance of the estimator θ, and Lemma A.6 together with
Lemma A.5 shows that the rate of convergence of the nonlinear section of ei for θ − θ0 is
oP (n−1/2). Lemma A.7 provides the main step for the proof of Theorem 2.
Lemma A.1 Let ξ1(x, β), . . . , ξn(x, β) be a sequence of random variables. Denote
fx,β(Vi) = ξi(x, β) for i = 1, . . . , n, where V1, . . . , Vn be a sequence of random variables,
and fx,β is a function on An, where An = {(x, β) : (x, β) ∈ A× Rp, ‖β − β0‖ ≤ cn−1/2} for
a constant c > 0. Assume that fx,β satisfies
1
n
n∑
i=1
|fx,β(Vi)− fx∗,β∗(Vi)| ≤ cna[‖β − β∗‖+ ‖x− x∗‖] (A.1)
for some constants x∗, β∗, a > 0 and c > 0. Let εn > 0 depend only on n. If
P
{∣∣∣∣1
n
n∑
i=1
ξi(x, β)∣∣∣∣ >
1
2εn
}≤ 1
2, (A.2)
for (x, β) ∈ An, then we have
P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
ξi(x, β)∣∣∣∣ >
1
2εn
}
≤ c1n2paε−2p
n E
{sup
(x,β)∈An
2 exp( −n2ε2
n/128∑ni=1 ξ2
i (x, β)
)∧ 1
}, (A.3)
where c1 > 0 is a constant.
Proof. Let {ξ′1(x, β), . . . , ξ′n(x, β)} be an independent version of {ξ1(x, β), . . . , ξn(x, β)}.Now generate independent sign random variables σ1, . . . , σn for which P{σi = 1} = P{σi =
−1} =1
2, and {σi, 1 ≤ i ≤ n} independent of {ξi(x, β), ξ′i(x, β), 1 ≤ i ≤ n}. By symmetry,
σi(ξi − ξ′i) has the same distribution as (ξi − ξ′i). The symmetrization Lemma in Pollard
28
(1984) implies
P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
ξi(x, β)∣∣∣∣ > εn
}
≤ 2P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
[ξi(x, β)− ξ′i(x, β)]∣∣∣∣ >
1
2εn
}
= 2P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
σi[ξi(x, β)− ξ′i(x, β)]∣∣∣∣ >
1
2εn
}
≤ 4P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
σiξi(x, β)∣∣∣∣ >
1
4εn
}. (A.4)
Let Pn be the empirical measure that puts equal mass1
nat each of the n observations
V1, . . . , Vn. Let F = {fx,β(·) : ‖x‖ ≤ C, ‖β‖ ≤ B} be a class of functions indexed by x
and β consisting of fx,β(Vi) = ξi(x, β). Denote V = (V1, . . . , Vn). Given V , choose function
f ◦1 , . . . , f ◦m, each in F , such that
minj∈{1,...,m}
1
n
n∑
i=1
|fx,β(Vi)− f ◦j (Vi)| < εn (A.5)
for each fx,β in F . Let N(εn, Pn,F) be the minimum m for all sets that satisfies (A.5).
Denote f ∗x,β for the f ◦j at which the minimum is achieved, we then have
P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
σiξi(x, β)∣∣∣∣ >
1
4εn
∣∣∣∣V}
= P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
σifx,β(Vi)∣∣∣∣ >
1
4εn|V
}
≤ P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
σif∗x,β(Vi)
∣∣∣∣ >1
8εn|V
}
≤ N(εn, Pn,F) maxj∈{1,...,N}
P
{∣∣∣∣1
n
n∑
i=1
σif◦j (Vi)
∣∣∣∣ >1
8εn|V
}. (A.6)
Now we need to determine the order of N(εn, Pn,F). For each set satisfying (A.5), each f ◦j
has a pair (xj, βj) such that f ◦j (v) = fxj ,βj(v). Then for all (x, β) ∈ An, we have from (A.1)
that1
n
n∑
i=1
|fx,β(Vi)− fxj ,βj(Vi)| ≤ cna(‖β − βj‖+ ‖x− xj‖).
Next, we want to bound the right-hand side of the above formula by εn. Thus for each
(x, β) ∈ An, we need a pair (xj, βj) within radius rn = O(n−aεn) of (x, β). Therefore, the
29
number N needed to satisfy (A.5) is bounded by r−pn r−p
n = cn2paε−2pn , i.e.
N(εn, Pn,F) ≤ cn2paε−2pn . (A.7)
Now conditioning on V , σif◦j (Vi) is bounded. Hoeffding’s inequality [(Hoeffding (1963)] yields
P
{∣∣∣∣1
n
n∑
i=1
σif◦j (Vi)
∣∣∣∣ >1
8εn|V
}≤ 2 exp
( −2n(εn/8)2
∑ni=1 4f 2
xj ,βj(Vi)
)∧ 1.
This together with (A.4), (A.6) and (A.7) proves (A.3).
Lemma A.2 Suppose that conditions C1, C2 and C3(i) hold. If h = cn−a for any
0 < a < 1/2 and some constants c > 0, then, for i = 1, · · · , n, we have
E
g(XT
i β0)−n∑
j=1
Wnj(XTi β0; β0)g(XT
j β0)
2
= O(h4),
E
g(xTβ)−
n∑
j=1
Wnj(xTβ; β)g(XT
j β)
2
= O(h4),
E
g′(XT
i β0)−n∑
j=1
Wnj(XTi β0; β0)g(XT
j β0)
2
= O(h21)
and
E
n∑
j=1
Wni(XTj β0; β0)ϕ(XT
j β0)− ϕ(XTi β0)
2
= O(√
h ),
where ϕ(t) = g′(t)g3s(t) and g3s is the sth component of g3(t) = E(X|XTβ0 = t).
Proof. See Lemma 1 of Zhu and Xue (2006).
Lemma A.3 Under the assumptions of Lemma A.2, we have
E{W 2ni(X
Ti β0; β0)} = O((nh)−2),
E
n∑
j=1,j 6=i
W 2nj(X
Ti β0; β0)
= O((nh)−1),
E
n∑
j=1
W 2nj(x
Tβ; β)
= O((nh)−1)
and
E{W 2ni(X
Ti β0; β0)} = O((nh1)
−2 + (n3h51)−1),
E
n∑
j=1,j 6=i
W 2nj(X
Ti β0; β0)
= O((nh3
1)−1).
30
Proof. See Lemma 2 of Zhu and Xue (2006).
Lemma A.4 Suppose that conditions C1–C4 and C5(i) hold. We then have
sup(x,β)∈An
|g(xTβ0)− g(xTβ; β, θ0)| = OP ((nh/ log n)−1/2) (A.8)
and
sup(x,β)∈An
|g2s(xTβ0)− g2s(x
Tβ; β)| = OP ((nh/ log n)−1/2). (A.9)
If in addition, C5(ii) also holds, then we have
sup(x,β)∈An
|g′(xTβ0)− g′(xTβ; β, θ0)| = OP ((nh31/ log n)−1/2), (A.10)
where An = {(x, β) : (x, β) ∈ A×Rp, ‖β − β0‖ ≤ cn−1/2} for a constant c > 0.
Proof. We only prove (A.8), the proofs for (A.9) and (A.10) are similar. Write g(Xi, ei) =
g(xTβ0)− g(XTi β0)− ei, i = 1, . . . , n. We have
g(xTβ0)− g(xTβ; β, θ0) =n∑
i=1
Wni(xTβ; β)g(Xi, ei). (A.11)
Let ξi(x, β) = n(nh/ log n)1/2Wni(xTβ; β)g(Xi, ei), fx,β(Vi) = ξi(x, β), Vi = (Xi, ei), i =
1, . . . , n. Using lemma A.1, we have to verify (A.1) and (A.2). A simple calculation yields
(A.1), so we now verify (A.2). By lemmas A.2 and A.3, and noting that sup(x,β)∈An|g(xTβ)−
g(xTβ0)| = O(n−1/2), we have
E[g(xTβ0)− g(xTβ; β, θ0)]2 = E
[n∑
i=1
Wni(xTβ; β)g(Xi, ei)
]2
≤ cE
[g(xTβ)−
n∑
i=1
Wni(xTβ; β)g(XT
i β)
]2
+ cE
{n∑
i=1
W 2ni(x
Tβ; β)
}+ O(n−1)
≤ ch4 + c(nh)−1. (A.12)
Given a M > 0, by Chevbychev’s inequality and (A.12), we have
P
{∣∣∣∣1
n
n∑
i=1
ξi(x, β)∣∣∣∣ >
1
2M
}≤ 4M−2E
[1
n
n∑
i=1
ξi(x, β)
]2
≤ 4M−2nh(log n)−1E
[n∑
i=1
Wni(xTβ; β)g(Xi, ei)
]2
≤ cM−2(cnh5 + c(log n)−1). (A.13)
31
Therefore, from C5(i), we can choose M large enough so that the right hand side of (A.13)
is less than or equal to1
2. Hence, (A.2) is satisfied. We now can use (A.3) of Lemma A.1 to
get (A.8). By Lemma A.3, we obtain
n−2n∑
i=1
Eξ2i (x, β) = nh(log n)−1
n∑
i=1
E[Wni(x
Tβ; β)g(Xi, ei)]2
≤ cnh(log n)−1n∑
i=1
EW 2ni(x
Tβ; β) ≤ c(log n)−1.
This implies that n−2 ∑ni=1 ξ2
i (x, β) = OP ((log n)−1). Hence, from Lemma A.1 we have
P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
ξi(x, β)∣∣∣∣ >
1
2M
}≤ cn2paM−2p exp (− cM2 log n).
The right-hand side of the above formula tends to zero when M is large enough. Therefore,
(A.8) follows.
Lemma A.5 Under the assumptions of Theorem 1, we have
n−1ZTZP−→ Σ.
where Σ is defined in condition C6.
Proof. Noting that Z = (I− S)Z, the (i, s) element of Z is
Zis = [Zis − g2s(XTi β0)] + [g2s(X
Ti β0)− g2s(X
Ti β0; β0)].
The (s, t) element of ZTZ is
n∑
i=1
ZisZit =n∑
i=1
[Zis − g2s(XTi β0)][Zit − g2t(X
Ti β0)]
+n∑
i=1
[Zis − g2s(XTi β0)][g2t(X
Ti β0)− g2t(X
Ti β0; β0)]
+n∑
i=1
[Zit − g2t(XTi β0)][g2s(X
Ti β0)− g2s(X
Ti β0; β0)]
+n∑
i=1
[g2s(XTi β0)− g2s(X
Ti β0; β0)][g2t(X
Ti β0)− g2t(X
Ti β0; β0)]
=:I1 + I2 + I3 + I4. (A.14)
By the law of large numbers, we have
n−1I1P−→ E{[Z1s − E(Z1s|XT
1 β0)][Z1t − E(Z1t|XT1 β0)]} =: Σst, (A.15)
32
where Σst is the (s, t) element of Σ. Noting that
1
n
n∑
i=1
|Zis − g2s(XTi β0)| P−→ E|Z1s − g2s(X
T1 β0)| < ∞,
this together with (A.9) of Lemma A.4 proves that
n−1I2 ≤ OP (1) sup(x,β)∈An
|g2t(xTβ0)− g2t(x
Tβ; β)| P−→ 0.
Similarly, we can prove n−1I3P−→ 0 and n−1I4
P−→ 0. This together with (A.14) and
(A.15) proves Lemma A.5.
Lemma A.6 Under the assumptions of Theorem 1, we have
n−1/2ZTG :=1√n
n∑
i=1
Zi[g(XTi β0)− g(XT
i β0; β0, θ0)]P−→ 0.
Proof. The sth component of ZTG is
n∑
i=1
Zis[g(XTi β0)− g(XT
i β0; β0, θ0)]
=n∑
i=1
[Zis − g2s(XTi β0)][g(XT
i β0)− g(XTi β0; β0, θ0)]
+n∑
i=1
[g2s(XTi β0)− g2s(X
Ti β0; β0)][g(XT
i β0)− g(XTi β0; β0, θ0)]
=: J1 + J2. (A.16)
For J2, from (A.8) and (A.9) of Lemma A.4 we have
n−1/2|J2| ≤√
n sup(x,β)∈An
|g2s(xTβ0)− g2s(x
Tβ; β)|
× sup(x,β)∈An
|g(xTβ0)− g(xTβ; β, θ0)| = OP ((nh2/ log2 n)−1).
Noting that nh2/ log2 n →∞, we obtain n−1/2J2P−→ 0. It remains to prove that
n−1/2J1P−→ 0, (A.17)
as this together with (A.16) implies Lemma A.6. To prove (A.17) , we only need to show
that
supβ∈B′n
∣∣∣∣1
n
n∑
i=1
√n[Zis − g2s(X
Ti β0)][g(XT
i β0)− g(XTi β, β)]
∣∣∣∣P−→ 0, (A.18)
33
where B′n = {β : ‖β − β0‖ ≤ cn−1/2} for a constant c > 0. Toward this goal, we note that
Lemma A.1 can be used when the variable x is removed. Let
ξi(β) =√
n[Zis − g2s(XTi β0)][g(XT
i β0)− g(XTi β, β, θ0)],
fβ(Vi) = ξi(β), Vi = (Xi, Zis, ei), i = 1, . . . , n.
We now verify that (A.1) and (A.2) are satisfied. By the condition C3(ii) on the kernel
function, we calculate that
1
n
n∑
i=1
|fβ(Vi)− fβ∗(Vi)| ≤ cn5/2h−2‖β − β∗‖ = cna‖β − β∗|
where a = 52
+ 2λ(15≤ λ < 1
2). Hence, (A.1) is satisfied.
We next verify that (A.2) is satisfied. Denote ζi = Zis − g2s(XTi β0). From condition C4,
Lemmas A.2 and A3, we have
E
[1
n
n∑
i=1
ξi(β)
]2
≤ 2n−1n∑
i=1
E
g(XT
i β0)−n∑
j=1
Wnj(XTi β; β)g(XT
j β0)
2
E(ζ2i |XT
i β0)
+ 2n−1∑
i
∑
j
∑
k
∑
l
E[Wnj(XTi β; β)Wnl(X
Tk β; β)ζiζkejel]
≤ ch4 + cn−1 + cn−1
n∑
i=1
EW 2ni(X
Ti β; β) +
∑
i6=j
EW 2nj(X
Ti β; β)
≤ ch4 + cn−1 + c(nh)−1 −→ 0.
Hence, we can obtain
P
{∣∣∣∣1
n
n∑
i=1
ξi(β)∣∣∣∣ >
1
2ε
}≤ ch4 + cn−1 + c(nh)−1 <
1
2
when n large enough. Therefore, (A.2) is satisfied. By (A.8) of Lemma A.4, we have
1
n2
n∑
i=1
ξ2i (β) = OP (1) sup
(x,β)∈An
[g(xTβ0)− g(xTβ; β, θ0)]2 = OP (nh/ log n)−1).
By using Lemma A.1, we obtain
P
{sup
(x,β)∈An
∣∣∣∣1
n
n∑
i=1
ξi(β)∣∣∣∣ >
1
2ε
}≤ cn2paε−2p exp(−cnh/ log n) −→ 0.
34
by nh/ log n →∞. This proves (A.18) and thus completes the proof of Lemma A.6.
Lemma A.7. Suppose that conditions C1–C6 are satisfied, then we have
supβ(r)∈Bn
‖R(β(r))− U(β(r)0 ) + nV(β(r) − β
(r)0 )‖ = oP (
√n),
where Bn = {β(r) : ‖β(r) − β(r)0 ‖ ≤ Cn−1/2} for a constant C > 0, V is defined in condition
C6,
R(β(r)) =n∑
i=1
[Yi − ZTi θ0 − g(XT
i β; β, θ0)]g′(XT
i β; β, θ0)JTβ(r)Xi, (A.19)
and
U(β(r)0 ) =
n∑
i=1
eig′(XT
i β0)JT
β(r)0
[Xi − E(Xi|XTi β0)].
Proof. Separating R(β(r)), we have
R(β(r)) =n∑
i=1
eig′(XT
i β0)JTβ(r) [Xi − E(Xi|XT
i β0)]
+n∑
i=1
ei[g′(XT
i β; β, θ0)− g′(XTi β0)]J
Tβ(r)Xi
−n∑
i=1
g′(XTi β0)J
Tβ(r)Xi{g(XT
i β; β, θ0)− g(XTi β0; β0, θ0)}
−n∑
i=1
g′(XTi β0)J
Tβ(r){Xi[g(XT
i β0; β0, θ0)− g(XTi β0)]− eig3(X
Ti β0)}
−n∑
i=1
[g(XTi β; β, θ0)− g(XT
i β0)][g′(XT
i β; β, θ0)− g′(XTi β0)]J
Tβ(r)Xi
=: R1(β(r)) + R2(β
(r))−R3(β(r))−R4(β
(r))−R5(β(r)). (A.20)
Noting that Jβ(r) − Jβ
(r)0
= OP (n−1/2) for all β(r) ∈ Bn, we have
supβ(r)∈Bn
‖R1(β(r))− U(β
(r)0 )‖ = oP (
√n ). (A.21)
Since ‖β(r)−β(r)0 ‖ ≤ Cn−1/2 implies ‖β−β0‖ ≤ Cn−1/2 for all β(r) ∈ Bn, similar to the proof
of (A.17) we can show that
supβ(r)∈Bn
‖R2(β(r))‖ = oP (
√n ). (A.22)
35
For R3(β(r)), by a Taylor expansion of β(r) − β
(r)0 with a suitable mean β(r) ∈ Bn and
β = β(β(r)), we get
R3(β(r)) =
n∑
i=1
g′(XTi β0)g
′(XTi β; β, θ0)J
Tβ(r)XiX
Ti Jβ(r)(β(r) − β
(r)0 )
=n∑
i=1
g′(XTi β0)[g
′(XTi β; β, θ0)− g′(XT
i β0)]
× JTβ(r)XiX
Ti Jβ(r)(β(r) − β
(r)0 )
+n∑
i=1
g′(XTi β0)
2JTβ(r)XiX
Ti Jβ(r)(β(r) − β
(r)0 )
=:R31(β(r), β(r)) + R32(β
(r), β(r)).
By (A.10) of Lemma A.4 and the law of large numbers, we obtain that
supβ(r),β(r)∈Bn
‖R31(β(r), β(r))‖ = oP (
√n )
and
supβ(r),β(r)∈Bn
‖R32(β(r), β(r))− nV(β(r) − β
(r)0 )‖ = oP (
√n ).
Therefore, we have
supβ(r)∈Bn
‖R3(β(r))− nV(β(r) − β
(r)0 )‖ = oP (
√n ). (A.23)
We now consider R4(β(r)). Write R4(β
(r)) = JTβ(r)R
∗4(β
(r)). Let R∗4,s denote the sth
component of R∗4(β
(r)). First, from Lemma A.2 and A.3 we have
n−1E(R∗24,s) ≤ cn−1
n∑
i=1
E
n∑
j=1
Wni(XTj β0; β0)g
′(XTj β0)Xjs − g′(XT
i β0)g3s(XTi β0)
2
+ cn∑
i=1
E
n∑
j=1
Wnj(XTi β0; β0)g(XT
j β0)− g(XTi β0)
2
≤ c(nh)−1 + c√
h + cnh4 −→ 0.
This implies
supβ(r)∈Bn
‖R4(β(r))‖ = oP (
√n), (A.24)
and by Lemma A.4 we obtain
supβ(r)∈Bn
‖R5(β(r))‖ = oP (
√n). (A.25)
Substituting (A.21)–(A.25) into (A.20), we prove Lemma A.7.
36
REFERENCES
Arnold, S. f. (1981). The Theory of Linear Models and Multivariate Analysis. John Wiley
& Sons, New York.
Bhattacharya, P. K. and Zhao, P.-L. (1997). Semiparametric inference in a partial linear
model. Ann. Statist. 25 244-262.
Carroll, R. J., Fan, J. Gijbels, I. and Wand, M. P. (1997). Generalized partially linear
single-index models. J. Amer. Statist. Assoc. 92 477–489.
Chen, C.-H. and Li, K.-C.(1998). Can SIR be as popular as multiple linear regression.
Statistics Sinica 8 289–316.
Chen, H. (1988). Convergence rates for parametric components in a partly linear model.
Ann. Statist. 16 136–141.
Chen, H. and Shiau, J.-J. H. (1994). Data-driven efficient estimators for a Partially linear
model. Ann. Statist. 22 211–237.
Chiou, J. M. and Muller, H. G. (1998). Quasi-likelihood regression with unknown link and
variance functions. J. Amer. Statist. Assoc. 93 1376–1387.
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics.
Wiley, New York.
Cook, R. D. (2007). Fisher Lecture: Dimension Reduction in Regression1, 2 R. Dennis
Cook. Statist. Sci. 22 1-26.
Cook, R. D. and Li, B. (2002). Dimension reduction for the conditional mean in regression.
Ann. Statist. 30 455–474.
Cook, R. D. and Wiseberg, S. (1991). Comment on “Sliced inverse regression for dimension
reduction,” by K. C. Li. J. Amer. Statist. Assoc. 86 328–332.
37
Craven, P. and Wahba, G. (1979), Smoothing and noisy data with spline functions: esti-
mating the correct degree of smoothing by the method of generalized cross-validation,
Numer. Math. 31 377-403.
Doksum, K. and Samarov, A. (1995), Nonparametric estimation of global functionals and
a measure of the explanatory power of covariates in regression. Ann. of Statist. 23
1443-1473.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman
and Hall, London.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist.
Assoc. 76 817–823.
Gentle, J. E.(1998). Numerical Linear Algebra for Applications in Statistics. Berlin:
Springer-Verlag.
Hall, P. (1989). On projection pursuit regression. Ann. Statist. 17 573–588.
Hardle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models.
Ann. Statist. 21 157–178.
Hardle, W., Gao, J. and Liang, H. (2000) Partially Linear Models. Springer, New York
Harrison, D. and Rubinfeld, D. (1978). Hedonic housing pries and the demand for clean
air Environmental Economics and Management 5 81–102.
Heckman, N. (1986). Spline smoothing in a partly linear model. J. Royal Statist. Soci.
Ser.A 48 244–248.
Hristache, M., Juditsky, A. and Spokoiny, V. (2001). Direct estimation of the index coeffi-
cient in a single-index model. Ann. Statist. 29 595–623.
Hsing, T. and Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression.
Ann. Statist. 20 1040–1061.
38
Li, B., Wen, S. Q. and Zhu L. X. (2008). On a projective resampling method for dimension
reduction with multivariate responses. J. Amer. Statist. Assoc. 106 1177-1186.
Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). J.
Amer. Statist. Assoc. 86 316–342.
Li, K. C. (1992). On principal Hessian directions for data visualization and dimension
reduction: Another application of Stein’s lemma. J. Amer. Statist. Assoc. 87 1025–
1039.
Li, K. C., Aragon, Y., Shedden, K., and Agnan, C. T. (2003). Dimension reduction for
multivariate response data. J. Amer. Statist. Assoc. 98 99–109.
Li, Y. X. and Zhu, L. X. (2007). Asymptotics for sliced average variance estimation. Ann.
Statist. 35 41–69.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag New York Inc.,
New York.
Rice, J. (1986), Convergence rates for partially splined models. Statist. Prob. Lett. 4
203-208.
Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. New
York: Cambridge University Press,
Speckman, P. (1988). Kernel smoothing in partial linear models. J. Roy. Statist. Soci.
Ser.B 50 413–434.
Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica 54 1461–
1481.
Stute,W. and Zhu, L. X. (2005). Nonparametric checks for single-index models. Ann.
Statist. 33 1048–1083.
Weisberg, S. and Welsh, A. H. (1994). Adapting for the Missing Linear Link. Ann. Statist.
22 1674–1700.
39
Welsh, A. H. (1989). On M-processes and M-estimation. Ann. Statist. 17 337–361.
[Correction(1990) 18 1500.]
Xia, Y. and Hardle, W. (2006). Semi-parametric estimation of partially linear single-index
models. J. Multi. Anal. 97 1162 - 1184
Xia, Y., Tong, H. Li, W. K. and Zhu L. X. (2002). An adaptive estimation of dimension
reduction space. J. R. Statist. Soc. B 64 363–410.
Xia, Y. (2006). Asymptotic distributions for two estimators of the single-index model.
Econometric Theory 22 1112–1137.
Yin, X. and Cool, R. D. (2002). Dimension reduction for the conditional k-th moment in
regression. J. R. Statist. Soc. B 64 159–175.
Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partially linear single-index
models. J. Amer. Statist. Assoc. 97 1042–1054.
Zhu, L. X. and Ng, K. W. (1995). Asymptotics for Sliced Inverse Regression. Statistica
Sinica 5 727-736.
Zhu, L. X. and Ng, K. W. (2003) Checking the adequacy of a partial linear model. Statist.
Sinica 13 763-781.
Zhu L. X. and Xue L. G. (2006) Empirical likelihood confidence regions in a partially linear
single-index model. J. Roy. Statist. Soc. ser. B, 68 549–570.
Zhu, L., Zhu, L. X., Ferre L., and Wang, T. (2008). Sufficient Dimension Reduction
Through Discretization-Expectation Estimation. Unpublished manuscript, Hong Kong
Baptist University.
40
TABLE 1
Simulation results for θ with βZ and β0 parallel
Resulting estimate One-step iterated estimate
Bias SD MSE Bias SD MSE
PPR 0.0058 0.0706 0.00502 0.0046 0.0701 0.00493
SIR5 0.0095 0.0862 0.00753 0.0083 0.0869 0.00762
SIR10 0.0113 0.0788 0.00634 0.0098 0.0808 0.00663
β0 given 0.0031 0.0660 0.00436
TABLE 2
Simulation results for θ with βZ and β0 orthogonal
Resulting estimate One-step iterated estimate
Bias SD MSE Bias SD MSE
PPR -0.0087 0.0972 0.00952 -0.0047 0.0711 0.00508
SIR5 -0.0115 0.1395 0.01960 -0.0072 0.0919 0.00850
SIR10 -0.0102 0.1362 0.01865 -0.0083 0.0959 0.00926
β0 given -0.0024 0.0696 0.00485
TABLE 3
Simulation results for the angles between β and β0
βZ and β0 parallel βZ and β0 orthogonal
Mean SD MSE Mean SD MSE
PPR 0.0148 0.0056 0.00025 0.0157 0.0066 0.00029
SIR5 0.0467 0.0223 0.00268 0.0482 0.0232 0.00286
SIR10 0.0496 0.0230 0.00299 0.0528 0.0229 0.00331
41
0.0 0.2 0.4 0.6 0.8 1.0
0.00
0.05
0.10
0.15
0.20
single−index
estim
ated
val
ues
of g
Figure 1: Curve estimate for a single replication of the quadratic model simulation study,
with orthogonal βZ and β0. The true cure g(solid curve), the mean of g∗ with GCV band-
width (dashed curve) and a fixed optimal bandwidth hopt = 0.439 (dotted curve) over 2000
simulations are shown.
42
0.3 0.4 0.5 0.6 0.7 0.8
9.0
9.5
10.0
10.5
11.0
single−index
estim
ated
val
ues
of g
Figure 2: Curve estimate for the Boston Housing data, with xTβ on the x-axis and g∗(t) on
the y-axis.
43