estimation for a partial-linear single-index modelanson.ucdavis.edu/~wang/plsi_zhu08.pdf · in this...

ESTIMATION FOR A PARTIAL-LINEAR

SINGLE-INDEX MODEL

Jane-Ling Wang1, Liugen Xue2, Lixing Zhu3, and Yun Sam Chong4

1University of California at Davis

2Beijing University of Technology, Beijing, China

3Hong Kong Baptist University, Hong Kong, China

4Wecker Associate

In this paper, we study the estimation for a partial-linear single-index model.

A two-stage estimation procedure is proposed to estimate the link function for

the single index and the parameters in the single index, as well as the parame-

ters in the linear component of the model. Asymptotic normality is established

for both parametric components. For the index, a constrained estimating equa-

tion leads to an asymptotically more efficient estimator than existing estimators

in the sense that it is of a smaller limiting variance. The estimator of the non-

parametric link function achieves optimal convergence rates; and the structural

error variance is obtained. In addition, the results facilitate the construction

of confidence regions and hypothesis testing for the unknown parameters. A

simulation study is performed and an application to a real dataset is illustrated.

The extension to multiple indices is briefly sketched.

0Lixing Zhu is the corresponding author. Email: [email protected] Wang’s research was partially supported by NSF grant DMS-0406430. Liugen Xue’s research

was supported by the National Natural Science Foundation of China (10871013), the Natural Science Founda-tion of Beijing (1072004) and Ph. D. Program Foundation of Ministry of Education of China (20070005003).Lixing Zhu’s research was supported by a grant of The Research Grant Council of Hong Kong, Hong Kong,China (HKBU7060/04P and HKBU 2030/07P). The authors thank the Editor, the Associate Editor, andthe two referees for their insightful comments and suggestions which have led to substantial improvementsin the presentation of the manuscript.

0AMS 2000 subject classifications. Primary 62G05; secondary 62G20.0Key words and phrases. Dimension reduction, local linear smoothing, bandwidth, two-stage estimation,

kernel smoother.

1 Introduction

Partial linear models have attracted lots of attention due to their flexibility to combine

traditional linear models with nonparametric regression models. See, e.g. Heckman (1986),

Rice (1986), Chen (1988), Bhattacharya and Zhao (1997), Xia and Hardle (2006), and the

recent comprehensive books by Hardle, Gao, and Liang (2000) and Ruppert, Wand and

Carroll (2003) for additional references. However, the nonparametric components are subject

to the curse of dimensionality and can only accommodate low dimensional covariates X. To

remedy this, a dimension reduction model which assumes that the influence of the covariate

X can be collapsed to a single index, XTβ, through a nonparametric link function g is a

viable option and termed the partial-linear single-index model. Specifically, it takes the form:

Y = ZTθ0 + g(XTβ0) + e, (1.1)

where (X, Z) ∈ Rp × Rq are covariates of the response variable Y , g is an unknown link

function for the single index, and e is the error term with E(e) = 0 and 0 < Var(e) = σ2 < ∞.

For the sake of identifiability, it is often assumed that ‖β0‖ = 1 and the rth component of

β0 is positive, where ‖ · ‖ denotes the Euclidean metric.

This model is quite general, it includes the aforementioned partial-linear model when

the dimension of X is one and also the popular single-index model in the absence of the

linear covariate Z. There is an extensive literature for the single-index model with three

main approaches: projection pursuit regression (PPR) [Friedman and Stuetzle (1981), Hall

(1989), Hardle, Hall and Ichimura (1993)]; the average derivative approach [Stoker (1986),

Doksum and Samarov (1995), and Hristache, Juditsky and Spokoiny (2001)]; and sliced

inverse regression (SIR) and related methods [Li (1991), Cook and Li (2002), Xia, Tong, Li

and Zhu (2002), and Yin and Cook (2002)]. All these approaches rely on the assumption

that the predictors in X are continuous variables, while model (1.1) compensates for this by

allowing discrete or other continuous variables to be linearly associated with the response

variable. To our knowledge, Carroll, Fan, Gijbels and Wand (1997) were the first to explore

model (1.1) and they actually considered a generalized version, where a known link function

is employed in the regression function while model (1.1) assumes an identity link function.

2

However, their approaches may become computationally unstable as observed by Yu and

Ruppert (2002) and confirmed by our simulations in Section 3. The theory of Carroll, Fan,

Gijbels and Wand (1997) also relies on the strong assumption that their estimator for θ0 is

already√

n-consistent. Yu and Ruppert (2002) alleviated both difficulties by employing a

link function g which falls in a finite-dimensional spline space, yielding essentially a flexible

parametric model. Xia and Hardle (2006) used a method that is based on a local polynomial

smoother and a modified version of least squares in Hardle, Hall and Ichimura (1993).

In this paper, we propose a new estimation procedure. Our approach requires no iteration

and works well under the mild condition that a few indices based on X suffice to explain Z.

Namely,

Z = φ(XTβZ) + η, (1.2)

where φ(·) is an unknown function from Rd to Rq, βZ is a p × d matrix with orthonormal

columns, η has mean zero and is independent of X. The dimension d is often much smaller

than the dimension p of X. Such an assumption is not stringent and common in most

dimension reduction approaches in the literature. A theoretical justification is provided in

Li, Wen and Zhu (2008). Model (1.2) implies that a few indices of X suffice to summarize

all the information carried in X to predict Z, which is often the case in reality, such as

for the Boston Housing data in section 4, where a single index was selected for model (1.2)

and Z is a discrete variable. In this data, first analyzed in Harrison and Rubinfeld (1978),

the response variable is the median value of houses in 506 census tracts in the Boston area.

The covariates include: average number of rooms, the proportion of houses built before

1940, eight variables describing the neighborhood, two variables describing the accessibility

to highways and employment centers, and two variables describing air pollution. A key

covariate of interest is a binary variable that specifies whether a house borders the river or

not. Our analysis presented in Section 4 based on the dimension reduction assumptions of

(1.1) and with Z equal to this binary variable in (1.2) demonstrates the advantages of our

model assumption, only one index (d = 1) was needed in model (1.2) for this data.

To avoid the computational complications that we experienced with the procedure in

Carroll et al. (1997), who aim at estimating β0 and θ0 simultaneously, we choose to estimate

3

β0 and θ0 sequentially. The idea is simple: θ0 can be estimated optimally through approaches

developed for partial linear models once we have a√

n estimate of β0 and plug it in (1.1).

However, β0 and θ0 may be correlated, leading to difficulties in identifying β0. This is where

model (1.2) comes in handy, as it allows us to remove the part of Z that is related to X so

that the residual η in (1.2) is independent of X. Again, we need to impose the identifiability

condition that βZ has norm one and a positive first component. The procedure is as follows:

First estimate βZ via any dimension reduction approach, such as SIR or PPR for q = 1,

and the projective resampling method in Li, Wen and Zhu (2008) for q > 1. Once βZ has

been estimated we proceed to estimate φ via a d-dimensional smoother and then obtain the

residual for η. Since η = Z − φ(XTβZ), plugging this into (1.1) we get

Y = ηTθ0 + h(XTβ0, XTβZ) + e,

where h is an unknown function, but now η and X are independent of each other. It is thus

possible to employ a least squares approach to estimate θ0 and the resulting estimate will

be√

n-consistent. We then employ a dimension reduction procedure to Y − ZTθ0 and X to

obtain an estimate for β0 and g. This concludes the first stage , where the resulting estimates

for θ0 and β0 are already√

n consistent but will serve the role as initial estimates for the next

stage, where we update all the estimates but use a more sophisticated approach. Specifically

for θ0 we apply the profile method, also called partial regression in Speckman (1988), to

estimate θ0. Theoretical results in Section 2.2 indicate that the two-stage procedure is fully

efficient, so there is no need for iteration. More importantly, to estimate the index β0, we use

an estimating equation to obtain asymptotic normality, which takes the constraint ‖β0‖ = 1

into account. The estimator based on this new estimating equation performs better in several

ways, summarized as follows.

1. Our estimation procedure directly targets the model parameters θ0, β0, βZ , φ(·) and

g(·) and no iteration is needed.

2. We obtain the asymptotic normality of the estimator of β0 and the optimal convergence

rate of the estimator of g(·), as well as the asymptotic normality of the estimator of

θ0. The most attractive feature of this new method is that the estimator of β0 has

4

smaller limiting variance when compared to three existing approaches in : Hardle et

al. (1993) when the model is reduced to the single-index model, Carroll et al.(1997)

if their link function is the identity function, and Xia and Hardle (2006) when their

model is homoscedastic. This is the first result providing such a small limiting variance

in this area.

3. We also provide the asymptotic normality of the estimator of σ2. It allows us to

consider the construction of confidence regions and hypothesis testing for θ0 and β0.

The rest of the paper is organized as follows. In Section 2, we elaborate on the new

methodology and then present the asymptotic properties for the estimators. Section 3 reports

the results of a simulation study and Section 4 an application to a real data example for

illustration. Section 5 gives the proofs of the main theorems. Some lemmas and their proofs

are relegated to the Appendix.

2 Methodology and Main Results

2.1 Estimating Procedures

The observations are {(Xi, Yi, Zi); 1 ≤ i ≤ n}, a sequence of independent and identically

distributed (i.i.d.) samples from (1.1), i.e.

Yi = ZTi θ0 + g(XT

i β0) + ei, i = 1, . . . , n,

where e1, · · · , en are i.i.d. random errors with E(ei) = 0 and Var(ei) = σ2 > 0, {εi; 1 ≤ i ≤ n}are independent of {(Xi, Zi); 1 ≤ i ≤ n}, Xi = (Xi1, . . . , Xip)

T, Zi = (Zi1, . . . , Ziq)T, β0 ∈ Rp

and θ0 ∈ Rq. For simplicity of presentation, we initially assume that Z can be recovered

from a single-index of X. That is, d = 1 in (1.2). The general case will be explored at the

end of this section in Remarks 2. Below we first outline the steps for each stage and then

elaborate on each of these steps.

Algorithm for Stage One:

5

1. Apply a dimension reduction method for the regression of Zi versus Xi to find an

estimator βZ of βZ ;

2. Smooth the Zi over XTi βZ to get an estimator φ(·) of φ(·), then compute the residuals

ηi = Zi − φ(XTi βZ);

3. Perform a linear regression of Yi versus ηi’s to find an initial estimator θ0 of θ0;

4. Apply a dimension reduction method to the regression of Yi − ZTi θ0 versus Xi to find

an initial estimator β0 of β0

5. Smooth the Yi−ZTi θ0 versus the XT

i β0 to obtain an estimator for g and for its deriva-

tive g′.

Algorithm for Stage Two:

6. Use the initial estimate β0 from Step 4 to update the estimate of θ0 through a profile

approach for the partial linear model by minimizing (2.5).

7. Use the updated estimate θ of θ0 from Step 6 to form the new residual Y −ZTθ, then

update the estimate of β0 by solving the estimating equation (2.10).

8. Use the updated estimates of θ0 and β0 in Steps 6 and 7 to update the estimate of g,

following the procedure as described in Step 5.

This completes the algorithm and, as we show in Section 2.2, the resulting estimators

are already theoretically efficient. However, the practical performance can be improved by

iterating Steps 6 and 7 one or more times. Our experience, through simulation studies not

reported in this paper, reveals limited benefits when iterating more than once.

Next, we elaborate on each of the steps in the above algorithms for the simple case

of a single index (d = 1). For the dimension reduction method in Step 4, one can use

any of several existing methods, such as SIR or one of its variants, PPR, or the minimum

average variance estimator (MAVE) of Xia, Tong, Li and Zhu (2002). These methods are

for univariate responses and hence can also be applied in Step 1 when q = 1. However,

6

when q > 1, a different method is needed in Step 1 for the case of a multivariate response,

and we recommend the dimension reduction method in Li, Wen and Zhu (2008). This and

other results in the literature already demonstrate the√

n-consistency of these dimension

reduction methods.

For the smoothing involved in Step 5, one can choose any one-dimensional smoother. We

employ the local polynomial smoother (Fan and Gijbels, 1996) to obtain estimators of the

link function g and its derivative g′, which will be used in the second stage of the estimation

procedure. Specifically, for a kernel function K(·) on R1 and a bandwidth sequence b = bn,

define Kb(·) = b−1K(·/b). For a fixed β and θ, the local linear smoother aims at minimizing

the weighted sum of squares

n∑

i=1

[Yi − ZTi θ − d0 − d1(X

Ti β − t)]2Kb(X

Ti β − t)

with respect to the parameters dν , ν = 0, 1. Let h = hn and h1 = h1n denote the bandwidths

for estimating g(·) and g′(·), respectively. A simple calculation shows that the local linear

smoother with these specifications can be represented as

g(t; β, θ) =n∑

i=1

Wni(t, β)(Yi − ZTi θ), (2.1)

and

g′(t; β, θ) =n∑

i=1

Wni(t, β)(Yi − ZTi θ), (2.2)

where

Wni(t; β) =Kh(X

Ti β − t)[Sn,2(t; β, h)− (XT

i β − t)Sn,1(t; β, h)]

Sn,0(t; β, h)Sn,2(t; β, h)− S2n,1(t; β, h)]

, (2.3)

Wni(t; β) =Kh1(X

Ti β − t)[(XT

i β − t)Sn,0(t; β, h1)− Sn,1(t; β, h1)]

Sn,0(t; β, h1)Sn,2(t; β, h1)− S2n,1(t; β, h1)]

, (2.4)

and

Sn,l(t; β, h) =1

n

n∑

i=1

(XTi β − t)lKh(X

Ti β − t), l = 0, 1, 2.

The above estimators are for generic fixed values of β and θ. To obtain the estimates

needed in Step 5, one replaces them with the initial values β0 obtained in Step 1 and θ0

obtained in Step 3, respectively. We will show in Theorem 2 that this results in standard

convergence rates for the estimate of g.

7

Likewise, a local linear smoother can be employed in Step 2 for estimating the unknown

function φ in model (1.2). The resulting estimator is defined as

φ(t; βZ) =n∑

i=1

Wni(t; βZ)Zi.

Several possibilities are available for the estimator of θ0 in Step 6, such as the profile

approach (termed “partial regression” in Speckman, 1988) or the partial spline approach

(Heckman, 1986). Here the the partial spline approach is not suitable for correlated X and

Z, so we adopt a profile approach and a local linear smoother. In short, this amounts to

minimizing, over all θ, the sum of squared errors,

n∑

i=1

[Yi − ZTi θ − g(XT

i β0; β0, θ)]2, (2.5)

where g is the estimator in (2.1) of g, obtained by smoothing Yi−ZTi θ versus XT

i β0, and β0

is an initial estimator of β0, which could be the initial estimator β0 in Step 4 or the refined

estimator from Step 7 when an iterated estimator for θ0 is desirable. Because this smoother

is expressed as a function of θ, the estimate derived from (2.5) is a profile estimate. More

details about the derivation and advantages of the profile approach can be found in Speckman

(1988). Specifically, let β0 be the current estimator, Y = (Y1, . . . , Yn)T, Z = (Z1, . . . , Zn)T,

where

Yi = Yi − g1(XTi β0; β0), Zi = Zi − g2(X

Ti β0; β0),

g1(t; β0) =∑n

i=1 Wni(t; β0)Yi, g2(t; β0) =∑n

i=1 Wni(t; β0)Zi,

with g1 and g2 the respective estimators of g1(t) = E(Y |XTβ0 = t) and g2(t) = E(Z|XTβ0 =

t). The resulting partial regression estimator is thus

θ = (ZTZ)−1ZTY. (2.6)

For the estimator of β0 in Step 7, we propose a novel method that takes advantage of the

constraint ‖β0‖ = 1 and hence is more efficient than existing approaches, including the

PPR approach in Hardle et al (1993), the MAVE method in Xia et al. (2002), and the

least squares approaches of Carroll et al (1997) and Xia and Hardle (2006) for the single-

index partial linear model in (1.1). It is worth mentioning that Xia and Hardle (2006)

allow possible heteroscadestic structure in (1.1), and least squares approaches have been

8

standard dimension methods and lead to the same asymptotic variances for estimators of

β0. For instance, in the homoscadestic case, the estimator in Xia and Hardle (2006) has an

asymptotic variance that is identical to that of Hardle et al (1993). Our approach, based

on an estimating equation under the constraint ‖β0‖ = 1, is computationally stable and

asymptotically more efficient, i.e., its asymptotic variance is smaller. The efficiency gain can

be attributed to a re-parametrization, making use of the constraint ‖β0‖ = 1 by transferring

restricted least squares to un-restricted least squares, which makes it possible to search for

the solution of the estimating equation over a restricted region in the Euclidean space Rp−1.

Without loss of generality, we may assume that the true parameter β0 has a positive

component (otherwise, consider −β0), say β0r > 0 for β0 = (β01, . . . , β0p)T and 1 ≤ r ≤ p.

For β = (β1, . . . , βp)T, let β(r) = (β1, . . . , βr−1, βr+1, . . . , βp)

T be a p−1 dimensional parameter

vector after removing the rth component βr in β. Then we may write

β = β(β(r)) = (β1, . . . , βr−1, (1− ‖β(r)‖2)1/2, βr+1, . . . , βp)T. (2.7)

The true parameter β(r)0 must satisfy the constraint ‖β(r)

0 ‖ < 1, and β is infinitely differen-

tiable in a neighborhood of β(r)0 . This “remove-one-component” method for β has also been

applied in Yu and Ruppert (2002).

To obtain the estimator, consider a Jacobian matrix of β with respect to β(r),

Jβ(r) =∂β

∂β(r)= (γ1, . . . , γp)

T, (2.8)

where γs (1 ≤ s ≤ p, s 6= r) is a p − 1 dimensional unit vector with sth component 1, and

γr = −(1 − ‖β(r)‖2)−1/2β(r). To motivate the estimating equation, we start with the least

squares criterion:

D(β) :=n∑

i=1


i β; β, θ)]2. (2.9)

From (2.7) and (2.9) we find D(β) = D(β(β(r))) = D(β(r)). Therefore, we may obtain an

estimator of β(r)0 , say β(r), by minimizing D(β(r)), and then obtain an estimator of β0, β, via

a transformation. This means that we transform a restricted least squares problem to an

unrestricted least squares problem by solving the estimation equation:

n∑

i=1


i β; β, θ)]g′(XTi β; β, θ)JT

β(r)Xi = 0. (2.10)

9

We define the resulting estimator β of β0 as the final target estimator. Theorem 3 implies

that our estimator for β0 has a smaller limiting variance than the estimators in Xia and

Hardle (2006) and Carroll et al. (1997).

With θ and β, the final estimator g∗ of g in Step 8 can be defined by

g∗(t) := g(t; β, θ) =n∑

i=1

Wni(t; β)(Yi − ZTi θ),

and the estimator σ2 of σ2 by σ2 = 1n

∑ni=1 [Yi − ZT

i θ − g∗(XTi β)]2. Asymptotic results for

the final parameter estimates of θ and β are established in Theorem 1 and Theorem 2, and

results for the link estimate of g follow from Theorem 4.

Remark 1 We consider a homoscedastic model of (1.1) with d = 1 in model (1.2). While

the estimation procedure can be extended easily to heteroscedastic errors, an additional

dimension reduction assumption on the variance function of of η, given X, is needed to avoid

the curse of high dimensional smoother needed in Step 2 to estimate φ. This assumption

requires that this variance function is also a function of a few indices based on X. Moreover,

the extension of asymptotic theory is not straightforward. For instance, the asymptotic

efficiency of the estimator β0 is technically challenging in the heteroscedastic case and its

study is beyond the scope of this paper.

Remark 2 So far, we have assumed that d = 1. This assumption can be extended

without difficulty to the general case where d might be greater than 1. In this case, a

multivariate smoother will be employed for estimating φ(·). The asymptotic results for the

parameter estimates of β and θ remain unchanged, except that the rate of convergence for

the link estimate of φ(·) changes with the dimension of d.

Remark 3 Other dimension reduction approaches, such as MAVE (Xia et al., 2002) and

other variants of SIR, such as SIR2 (Li, 1991) and SAVE (Cook and Wiseberg, 1991), could

be employed in Steps 1 and 4 for the case of q = d = 1 in (1.2), especially when SIR fails

for the case of symmetric design of X. While MAVE is perhaps the most efficient method

of all, the benefits over SIR are limited, as all estimates are updated in Stage 2, and it is

in this step where the major efficiency gains occur. In addition, MAVE is computationally

10

more intensive than SIR and encounters difficulties in estimating βZ , unless the covariate Z

is one-dimensional and the dimension d of βZ is also small. In fact, the√

n-consistency may

not hold when d > 3 in (1.2) as shown in Xia, Tong, Li and Zhu (2002).

Also, SIR2/SAVE was shown in Li and Zhu (2007) to be not√

n-consistent, unless a

bias correction is performed. In contrast, either SIR or pHd (Li, 1992) can be employed to

identify the directions when d > 1 and q = 1, and both lead to√

n-consistency.

Remark 4 When the dimension q of Z is greater than 1, a multivariate extension of

SIR (Li et al., 2003) can be employed conceptually in Step 1 of the algorithm. However,

the number of observations per slice may become sparse, so we recommend an alternative

multivariate approach as in Li, Wen and Zhu (2008) or Zhu, Zhu, Ferre and Wang (2008) in

Step 1.

Remark 5 The single-index assumption in (1.1) can be easily extended to multiple

indices through SIR or its variants, but the estimation of the multivariate link function g

would encounter the curse of high dimensionality. Since no more than three indices will be

needed in many applications, the approach in this paper can indeed be extended in practice

to multiple indices.

2.2 Main results

In this section, the√

n asymptotics for initial estimates of β and θ in Stage 1 are taken for

granted as they follow from existing results, so we do not formally list the needed assumptions

for this to hold but have provided sources after Theorem 1 below. However, the asymptotics

for the initial estimate of g and each of the parametric and nonaprametric estimates in Stage

2 are fully developed in Section 2.2 with detailed assumptions listed for each estimator.

In order to study the asymptotic behavior of the estimators, we list the following condi-

tions:

C1. (i) The distribution of X has a compact support set A.

(ii) The density function of XTβ is positive and satisfies a Lipschitz condition of order

1 for β in a neighborhood of β0. Further, XTβ0 has a positive and bounded density

function f(t) on T , where T = {t = xTβ0 : x ∈ A}.

11

C2. (i) The functions g and g2i have two bounded and continuous derivatives, where g2i is

the ith component of g2(t), 1 ≤ i ≤ q;

(ii) g3j satisfies a Lipschitz condition of order 1, where g3j is the jth component of g3(t),

and g3(t) = E(X|XTβ0 = t), 1 ≤ j ≤ p.

C3. (i) The kernel K is a bounded, continuous and symmetric probability density function,

satisfying ∫ ∞

−∞u2K(u)du 6= 0,

∫ ∞

−∞|u|2K(u)du < ∞;

(ii) K satisfies a Lipschitz condition on R1.

C4. (i) supt E(‖Z‖2|XT1 β0 = t) < ∞;

(ii) E(e) = 0, Var(e) = σ2 < ∞, E(e4) < ∞.

C5. (i) nh2/ log2 n →∞, lim supn→∞

nh5 < ∞;

(ii) nhh31/ log2 n →∞, nh4 → 0, lim sup

n→∞nh5

1 < ∞.

C6. (i) Σ = Cov(Z − E(Z|XTβ0)) is a positive definite matrix;

(ii) V = E[g′(XTβ0)2JT

β(r)0

XXTJβ

(r)0

] is a positive definite matrix, where Jβ

(r)0

is defined

by (2.8).

Remark 6 The Lipschitz condition and the two derivatives in C1 and C2 are standard

smoothness conditions. C3 is the usual assumption for second-order kernels. C1 is used to

bound the density function of XTβ away from zero. This ensures that the denominators

of g(t; β, θ0) and g′(t; β, θ0) are, with high probability, bounded away from 0 for t = xTβ,

x ∈ A and β near β0. C4 is a necessary condition for the asymptotic normality of an

estimator. In C5(i), the range of h for the estimators θ and g is fairly large and contains the

rate n−1/5 of “optimal” bandwidths. However, when analyzing the asymptotic properties

of the estimator β of β0, we have to estimate the derivative g′ of g. As is well known, the

convergence rate of the estimator of g′ is slower than that of the estimator of g if the same

bandwidth is used. This leads to a slower convergence rate for β than√

n, unless we use a

kernel of order 3 or undersmoothing to deal with the bias of the estimator. This motivates

the introduction of another bandwidth h1 in C5(ii) to control the variability of the estimator

of g′, and condition C5(ii) for bandwidths h and h1. Chiou and Muller (1998) also consider

12

the use of two bandwidths to construct the estimator of β in a relevant model. C6 ensures

that the limiting variances for the estimators θ and β exist.

The following theorems state the asymptotic behavior of the estimators proposed in

Section 2.1. We first establish the asymptotic efficiency of θ.

Theorem 1 Suppose that conditions C1, C2(i), C3(i), C4(i), C5(i) and C6(i) hold.

When ‖βZ − βZ‖ = OP (n−1/2) and ‖β0 − β0‖ = OP (n−1/2), we have

√n(θ − θ0)

D−→ N(0, σ2Σ−1).

Remark 7 Carroll et al.(1997) give similar results with β = 1 and p = 1 (The case of

a partially linear model). Theorem 1 generalizes their Theorems 2 and 3.

In Theorem 1, when we start with√

n-consistent estimators for βZ and β0, θ is consistent

for θ0 with the same asymptotic efficiency as an estimator that we would have obtained had

we known β0 and g, and thus the oracle property. Numerous examples of√

n- consistent

estimators already exist in the literature. For instance, Hall (1989) showed that one can

obtain a√

n-consistent estimator for β0 using projection pursuit regression. Under the

linearity condition that is slightly weaker than elliptical symmetry of X, Li (1991), Hsing

and Carroll (1992) and Zhu and Ng (1995) proved that SIR, proposed by Li (1991), leads to

a√

n-consistent estimator of βZ and of β0, the latter when Z is not present in (1.1). Li and

Zhu (2007) further show that, when including a bias-correction and under a condition almost

equivalent to normality of X, sliced average variance estimation (SAVE, Cook and Weisberg

1991) performs similarly. We expect the results for β0 to hold when Z is dependent of X,

provided a good estimator of βZ is available. Under very general regularity conditions and

for q = 1, Xia, Tong, Li, and Zhu (2002) proposed the minimum average variance estimation

(MAVE) and Xia (2006) a refined version of MAVE, and both methods can provide√

n-

consistent estimators for the single-index β0. However, there is no result in the literature

regarding MAVE when the dimension of Z is larger than 1, and the√

n-consistency needs

further study when d is larger than or equal to 3, even for univariate Z. Therefore, for

general theory, SIR may be a good choice for the initial estimators of βZ and β0.

13

Theorem 2 Suppose that conditions C1–C6 hold. If the rth component of β0 is positive,

we have√

n(β − β0)D−→ N(0, σ2J

β(r)0

V−1QV−1JT

β(r)0

),

where Q = E{g′(XTβ0)2JT

β(r)0

[X − E(X|XTβ0)][X − E(X|XTβ0)]TJ

β(r)0}, V and J

β(r)0

are

defined in condition C6.

¿From Hardle et al (1993) and Carroll et al (1997), we can see that the estimator β of β

has an asymptotic variance that corresponds to a generalized inverse σ2Q−1 where

Q1 = E{g′(XT β0)

2[X − E(X|XT β0)

] [X − E(X|XT β0)

]T}

.

Note that there may be infinitely many inverse matrices of Q1, but there is a unique gen-

eralized inverse associated with the Jacobian Jβ

(r)0

. The following theorem shows that the

variance-cavariance matrix in Theorem 2 is smaller than σ2Q−1 , the variance associated with

Jβ

(r)0

, in the sense that σ2Q−1 − σ2J

β(r)0

V−1QV−1JT

β(r)0

is a non-negative definite matrix. We

use the usual notation: for two non-negative matrices A and B, A ≥ B denotes that A−B

is a non-negative definite matrix.

Theorem 3 Under the conditions of Theorem 2, we have

i) there is a generalized inverse of Q1 that is of the form JT

β(r)0

Q−1Jβ

(r)0

;

ii) JT

β(r)0

Q−1Jβ

(r)0≥ J

β(r)0

V−1QV−1JT

β(r)0

.

Remark 8 Theorem 3 shows that our estimator of β0 is asymptotically more efficient

than those of Hardle et al.(1993) and of Carroll et al. (1997). In addition, Carroll et al.(1997)

use an iterated procedure to estimate β0 and θ0, while our estimation procedure does not

require iteration.

¿From Theorem 2, we obtain an asymptotic result regarding the angle between β and

β0, which can be used to study issues of sufficient dimension reduction (SDR). We refer to

Cook (1998, 2007) for more details.

Corollary 1 Suppose that the conditions of Theorem 2 hold. Then

cos(β, β0)− 1 = OP (n−1/2),

14

where cos(β, β0) is the cosine of the angle between β and β0.

The next two theorems provide the convergence rate of the estimator g∗(·) of g(·) and

the asymptotic normality of the estimator of σ2.

Theorem 4 Suppose that the conditions of Theorem 1 hold. If ‖β − β0‖ = OP (n−1/2).

Then

sup(x,β)∈An

|g∗(xTβ)− g(xTβ0)| = OP ((nh/ log n)−1/2),

where An = {(x, β) : (x, β) ∈ A×Rp, ‖β − β0‖ ≤ cn−1/2} for a constant c > 0.

Theorem 5 Suppose that conditions C1–C6 hold and 0 < Var(e21) < ∞. Then

√n(σ2 − σ2)/(Var(e2

1))1/2 D−→ N(0, 1).

Note that n−1ZTZP−→ Σ in Lemma A.5 of the Appendix. By Theorems 1 and 4 , we

obtain

(ZTZ)1/2(θ − θ0)/σD−→ N(0, Iq).

We are now in the position to construct confidence regions for θ0. From Theorem 10.2d

in Arnold (1981) we obtain the following result.

Theorem 6 Under the conditions of Theorem 5, we have

(θ − θ0)T(ZTZ)(θ − θ0)/σ

2 D−→ χ2q,

where χ2q is chi-square distributed with q degrees of freedom. Let χ2

q(1 − α) be the (1 − α)-

quantile of χ2q for 0 < α < 1, an asymptotic confidence region of θ0 is

Rα = {θ : (θ − θ)T(ZTZ)(θ − θ)/σ2 ≤ χ2q(1− α)}.

To construct confidence regions for β0, a plug-in estimator of the limiting variance of β

is needed. We respectively define the following estimators V and Q of V and Q by

V =1

n

n∑

i=1

g′(XTi β; β, θ)2JT

β(r)XiXTi Jβ(r)

15

and

Q =1

n

n∑

i=1

g′(XTi β; β, θ)2JT

β(r) [Xi − g3(XTi β; β)][Xi − g3(X

Ti β; β)]TJβ(r) ,

where g3(t; β) =∑n

i=1 Wni(t; β)Xi is the estimator of g3(t) = E(X|XTβ0 = t) and Jβ(r) is

the estimator of Jβ

(r)0

. It is easy to prove that Jβ(r)

P−→ Jβ

(r)0

, VP−→ V and Q

P−→ Q. Then

for any p× l matrix A of full rank with l < p, Theorems 2 and 5 imply that

(n−1ATJβ(r)V−1QV−1JT

β(r)A)−1/2AT (β − β0)/σD−→ N(0, Il).

We again use Theorem 10.2d in Arnold (1981) to obtain the following limiting distribution.

Theorem 7 Suppose that the conditions of Theorem 5 hold. Then

(β − β0)TA(n−1ATJβ(r)V

−1QV−1JTβ(r)A)−1AT (β − β0)/σ

2 D−→ χ2l .

The asymptotic confidence region of AT β0 is, letting χ2l (1− α) be the (1− α)-quantile of χ2

l

for 0 < α < 1,

Rα = {AT β : (β − β)TA(n−1ATJβ(r)V−1QV−1JT

β(r)A)−1AT (β − β)/σ2 ≤ χ2l (1− α)}.

3 Simulation study

In this section, we examine the performance of the procedures in Section 2, for the

estimation of both β0 and θ0. We report the accuracy of estimators using PPR and SIR as

dimension-reduction methods. The sample size for the simulated data is n = 100 and the

number of simulated samples is 2000 for the parametric components. When SIR is applied,

using 5 or 10 elements per slice generally yields good results. In other words, each slice

contains 10 to 20 points. A quadratic model of the form

Y = (XTβ0 − 0.5)2 + Zθ0 + 0.2e,

was used, where θ0 = 1 is a scalar, β0 = (0.75, 0.5,−0.25,−0.25, 0.25)T, X is a 5-dimensional

vector with independent uniform [0,1] components, and e is a standard normal variable.

The dependency between X and Z was prescribed by defining Z as a binary variable with

16

probability exp(βZX)/(1 + exp(XTβZ)) to be 1 and 0 otherwise. Two extreme cases of

βZ are reported in Table 1 and Table 2, one based n choosing the same value as β0 with

βZ = β0, and the other on βZ = (0.5, 0, 0.5, 0.5,−0.5)T, so that βZ is orthogonal to β0. We

also checked scenarios where βZ and β0 are neither orthogonal nor parallel to each other,

and the results are in agreement with the two extreme cases reported here.

For the smoothing steps, we used a local linear smoother with a Gaussian kernel through-

out. A product Gaussian kernel was used when bivariate smoothing was involved and equal

bandwidths were selected for each kernel to save computing time. A pilot study revealed

that the bandwidth chosen at the first stage to estimate the residual η has little effect on

the accuracy of the final estimates of θ0, so we choose an initial bandwidth of 0.5 to estimate

φ in (1.2), as this value was frequently selected by generalized cross validation (GCV). The

subsequent smoothing steps utilized the GCV method as proposed in Craven and Wahba

(1979). For instance, when estimating g and θ0 in the second stage, the GCV statistic is

given by the formula

GCV(h) =1

n

n∑

i=1

(Yi − ZTi θ − gh(X

Ti β; β, θ)2/(n−1tr(I− Sh))

2, (3.1)

where gh(·) is the estimator of g(·) with a bandwidth h and Sh is the smoothing matrix

corresponding to a bandwidth of h. The GCV bandwidth was selected to minimize (3.1).

We use the optimal bandwidth, hopt, for g and θ. When calculating the estimator β, we

chose the bandwidths,

h = hoptn1/5n−1/3 = hoptn

−2/15 and h1 = hopt, (3.2)

respectively, because this guarantees that the required bandwidth has the correct order of

magnitude for optimal asymptotic performance [see Carroll al.(1997), Stute and Zhu (2005),

and Zhu and Ng (2003)]. Note that choices (3.2) satisfy condition C5(ii). Relevant discussion

on choosing two distinct bandwidths can be found in Chiou and Muller (1998).

In the simulation, PPR and SIR were used to obtain the initial estimators of β0 and

βZ . The notation SIRc means that when we used SIR to estimate βZ , the number of data

points per slice is c. The resulting estimates for θ0 and the one-step iterated estimates are

summarized in Tables 1 and 2, where we report bias, standard deviation (SD), and mean

17

square error (MSE). The case with known β0 is also reported in the last row and serves as

a gold standard. The right columns under “One-step iterated estimate” in Tables 1 and 2

represent the results obtained when iterating the algorithms in Section 2.1 one more time

after obtaining the estimates in the left columns.

Tables 1 and 2 are about here

¿From Tables 1 and 2 we find that the three methods have small mean square errors with

projection pursuit regression outperforming both SIR procedures. This is expected, as the

simulated model structure satisfies the additive assumption of PPR and the estimates of the

β-directions were iteratively updated through estimates of the unknown link functions, φ and

g. In other non-additive situations, SIR might be more reliable than PPR. Iterated estimates

improved the results for all cases and markedly so for the orthogonal case. Compared to

the case when β0 is known, PPR typically attains 80% or more of the efficiency after one

iteration.

For the estimation of β0, we computed the angle (in radians) between β and β0 as a

measure of accuracy. The mean, standard deviation (SD), and mean squared error (MSE)

of the angle between β and β0 are reported in Table 3. Here, PPR leads to by far superior

estimates compared to SIR.

Table 3 is about here

The performance of the nonparametric estimates for g is demonstrated in Figure 1. Again,

GCV was used for bandwidth choice and compared to the estimates based on the optimal

fixed bandwidth. The true function g and the mean of each estimated g-function over the

2000 replicates are plotted. In general, GCV seems to work well for all parametric and

nonparametric components. This is consistent with the results reported in Chen and Shiau

(1994) for the analysis of partially linear models based on generalized cross validation (GCV).

18

Theoretical properties of the current models in regard to GCV will be a topic for further

investigation.

Figure 1 is about here

A final remark is that we tried to compare our procedure with that proposed in Carroll, et

al. (1997), for the quadratic model used in the above simulations with βZ and β0 orthogonal.

However, we were not able to obtain any results for the method in Carroll et al. (1997), as

their procedure seems to be very sensitive to the choice of the initial estimates. We then used

our estimates for β0 and θ0 as the initial values for their procedure. Nevertheless, we were still

unable to obtain any meaningful comparison results as out of the seven attempted trials their

procedure crashed six times on the first simulation and once on the second simulation. Since

θ0 is only a scalar, we postulate that their procedure has difficulties with high dimensional

β0, which is here a five-dimensional vector.

4 Data Example

We analyze the Boston Housing data mentioned in Section 1. The goal is to determine the

effect of the various variables on housing price, including a binary variable, which describes

whether the census tract borders the Charles River. According to Harrison and Rubinfeld

(1978), bordering the river should have a positive effect on the median housing price of the

census tract. They used a linear model that included a log transformation for the response

variable and three of the covariates, and power transformations for three other covariates.

Their final model is

log(MV ) = a1 + a2RM2 + a3AGE + a4 log(DIS) + a5 log(RAD) + a6TAX

+ a7PTRATIO + a8(B − 0.63)2 + a9 log(LSTAT ) + a10CRIM

+ a11ZN + a12INDUS + a13CHAS + a14NOXp + e.

19

The coefficient a13 is estimated to be 0.088, which is significant with a p-value of less than

0.01 for the hypothesis H0 : a13 = 0 versus H1 : a13 6= 0. The coefficient of determination

R2 attained by their analysis is 0.81, where R2 is the squared correlation between the true

dimension-reduction variable XTβ0 and the estimated dimension-reduction variable XTβ0.

This data set was also analyzed by Chen and Li (1998), who used sliced inverse regression

with all thirteen covariates. After examining the initial results, Chen and Li (1998) trimmed

the data and then dropped some of the variables. We fit the data on the first SIR direction

of the initial analysis reported in their article and obtained an R2 of 0.705 using GCV

bandwidth 0.43. Note that the assumptions of sliced inverse regression are probably not met

because some of the covariates are discrete. We thus proposed to use a partial-linear single-

index model. Several choices of Z were attempted, but they did not yield better results, in

terms of R2, than the one using only the Charles River variable as Z and the other covariates

as X. We thus focus on this model, where a log transformation was applied on Y .

To select the number of observations per slice in the dimension reduction step of SIR, we

borrow our experience in the simulation presented in Section 3, where 5 or 10 observations

per slice worked well for a total sample size of 100, leading to about 20 to 10 slices. Since

the sample size for the housing data is much larger, we use SIR with 20 data points per slice

and this leads to a total of 26 slices. As Chen and Li (1998) pointed out, SIR is not sensitive

to the choice of slice number, and they tried slicing with 10 or 30 points per slice leading to

17 or 50 slices, and obtain very similar results. The GCV bandwidth for estimating g and

θ is 0.367, which is smaller than the bandwidth 0.43 chosen by the GCV method for the

SIR approach of Chen and Li (1998). To estimate β by (2.10), the bandwidths selected by

(3.2) for h = 0.16 and for h1 is 0.367. The R2 is 0.8047, which is essentially equal to that

obtained by Harrison and Rubinfeld and higher than that using SIR on all thirteen variables.

The value of the test statistic for H0 : θ = 0 versus H1 : θ 6= 0 is 3.389 when the degrees

of freedom are calculated according to Hastie and Tibshirani, and 3.419 when n degrees of

freedom are used. Either way the result is significant with p-value < 0.01.

We also omitted the Charles River variable and used a dimension-reduction model on Y

and X. After obtaining an estimate for β0, we then estimate the relationship between Y

20

and XTβ. GCV yields a bandwidth of 0.16, and we obtain R2 = 0.8021. Even though the

Charles River variable is significant, its inclusion leads to only a minor increase in R2.

Figure 2 is about here

Figure 2 shows the estimated g along with the data. On the x-axis of the above graph,

the estimated value xTβ is given, and on the y-axis, the estimated value g∗(t). Figure 2

shows a downward trend in the effective dimension reduction (EDR) variate obtained. The

upward curvature of the function at high values of the EDR variate may or may not be a

real effect.

The advantage of our procedure over the one used by Harrison and Rubinfeld is that

Harrison and Rubinfeld have to make choices regarding transformations for every variable

in the model. We only need to choose the bandwidth or bandwidths used for smoothing.

5 Proofs of Theorems

Since the proofs of the theorems are rather long, the proofs of Theorems 1–4 are presented

in this section, and more details of the proofs are divided into Lemmas A.2–A.7 in the

Appendix.

In this section and the Appendix, we use c > 0 to represent any constant which may take

different values for each appearance, and a ∧ b = min(a, b).

Proof of Theorem 1. Denote

G = (g(XT1 β0)− g(XT

1 β0; β0, θ0), . . . , g(XTn β0)− g(XT

n β0; β0, θ0))T.

¿From (2.6) we have

√n(θ − θ0) =

√n(ZTZ)−1ZTG +

√n(ZTZ)−1ZTe.

21

Lemma A.5 in the Appendix implies

n(ZTZ)−1 P−→ Σ−1. (5.1)

Therefore, Lemma A.6 in the Appendix leads to

√n(ZTZ)−1ZTG

P−→ 0.

It remains to show that√

n(ZTZ)−1ZTeD−→ N(0, σ2Σ−1). (5.2)

Since

ZTe =n∑

i=1

[Zi − g2(XTi β0)]ei +

n∑

i=1

[g2(XTi β0)− g2(X

Ti β0; β0)]ei

=: M1 + M2.

The central limit theorem implies n−1/2M1D−→ N(0,Σ). Similarly to the proof of (A.17),

it is easy to obtain that n−1/2M2P−→ 0. This together with (5.1) and Slutsky’s Theorem

proves (5.2), and hence Theorem 1.

Proof of Theorem 2. The proof is divided into two steps: From (2.9), step (I) provides

the existence of the least squares estimator β of β0, and from (3.1), step (II) proves the

asymptotic normality of β.

(I) Proof of existence. We prove the following fact: Under conditions C1–C5 and

with probability one there exists an estimator of β0 minimizing expression (2.9) in B1n,

where B1n = {β : ‖β − β0‖ = B1n−1/2} for some constant such that 0 < B1 < ∞.

In fact, let Y = (Y1, . . . , Yn)T and Z = (Z1, . . . , Zn)T. We have

D(β) = (Y − Zθ)T(I− Sβ)T(I− Sβ)(Y − Zθ)

= (Y − Zθ0)T(I− Sβ)T(I− Sβ)(Y − Zθ0)

− 2(Y − Zθ0)T(I− Sβ)T(I− Sβ)Z(θ − θ0)

+ {Z(θ − θ0)}T(I− Sβ)T(I− Sβ)Z(θ − θ0)

=: D1(β)−D2(β) + D3(β).

22

The same arguments as in the proof of Theorem 1 can be used to obtain that D2(β) =

R0 + oP (1) and D3(β) = oP (1), where R0 is a constant independent of β. This implies

D(β) = D1(β)−R0+oP (1). Thus, minimizing D(β) simultaneously with respect to β is very

much like separately minimizing D1(β) with respect to β. It follows from (2.7) that we only

need to prove the existence of an estimator of β(r)0 in B2n, where B2n = {β(r) : ‖β(r)−β

(r)0 ‖ =

B2n−1/2} for some constant such that 0 < B2 < ∞. Since R(β(r)) = (−1

2)∂D1(β)

∂β(r) , where

R(β(r)) is defined in (A.19) of Lemma A.7. For an arbitrary β(r) ∈ B2n with the value of

constant B2 in B2n to be determined, we have from Lemma A.7 below that

(β(r) − β(r)0 )TR(β(r))

= (β(r) − β(r)0 )TU(β

(r)0 )− n(β(r) − β

(r)0 )TV(β(r) − β

(r)0 ) + oP (1). (5.3)

The following arguments are similar to those used by Weisberg and Welsh (1994), which

in turn use (6.3.4) of Ortega and Rheinboldt (1973). We note that term (5.3) is dominated

by the term ∼ B22 because

√n‖β(r) − β

(r)0 ‖ = B2, whereas |(β(r) − β

(r)0 )TU(β

(r)0 )| = B2OP (1)

and n(β(r)−β(r)0 )TV(β(r)−β

(r)0 ) ∼ B2

2 . So, for any given η > 0, if B2 is chosen large enough,

then it will follows that (β(r) − β(r)0 )TR(β

(r)0 ) < 0 on an event with probability 1− η. From

the arbitrariness of η, we can prove the existence of the least squares estimator of β(r)0 in B2n

as in the proof of Theorem 5.1 of Welsh (1989). The details are omitted.

(II) Proof of asymptotic normality. From step (I) we find that β(r) is a solution in

B2n to the equation R(β(r)) = 0. That is, R(β(r)) = 0. By Lemma A.7, we have

0 = U(β(r)0 )− nV(β(r) − β

(r)0 ) + oP (

√n ),

and hence√

n(β(r) − β(r)0 ) = V−1n−1/2U(β0) + oP (1).

We now consider the estimator β. A simple calculation yields

2√

1− ‖β(r)0 ‖2

√1− ‖β(r)‖2 +

√1− ‖β(r)

0 ‖2− 1 =

√1− ‖β(r)

0 ‖2 −√

1− ‖β(r)‖2

√1− ‖β(r)‖2 +

√1− ‖β(r)

0 ‖2= OP (n−1/2),

23

and hence

√1− ‖β(r)‖2 −

√1− ‖β(r)

0 ‖2

= − (β(r) + β(r)0 )T (β(r) − β

(r)0 )√

1− ‖β(r)‖2 +√

1− ‖β(r)0 ‖2

= −2β(r)T0 (β(r) − β

(r)0 ) + ‖β(r) − β

(r)0 ‖2

√1− ‖β(r)‖2 +

√1− ‖β(r)

0 ‖2

= −β(r)T0 (β(r) − β

(r)0 )√

1− ‖β(r)0 ‖2

+ OP (n−1).

It follows from (2.7) and the above equation, that

β − β0 =

β1

...

βr−1√1− ‖β(r)‖2

βr+1

...

βp

−

β01

...

β0(r−1)√1− ‖β(r)

0 ‖2

β0(r+1)

...

β0p

=

β1 − β01

...

βr−1 − β0(r−1)

−β(r)T0 (β(r)−β

(r)0 )√

1−‖β(r)0 ‖2

βr+1 − β0(r+1)

...

βp − β0p

+ OP (n−1).

That is, from the definition of Jβ

(r)0

of (2.8)

β − β0 = Jβ

(r)0

(β(r) − β(r)0 ) + OP (n−1).

Thus, we have√

n(β − β0) = Jβ

(r)0

V−1n−1/2U(β(r)0 ) + oP (1).

Theorem 2 follows from this, Central Limit Theorem and Slutsky’s Theorem.

Proof of Theorem 3. Recalling the definition of Q, we can see that Q = JT

β(r)0

Q1Jβ(r)0

.

Define

Π0 := Jβ

(r)0

Q−1JT

β(r)0

, Π1 := Jβ

(r)0

V−1QV−1JT

β(r)0

.

We now prove that Π0 is a generalized inverse of Q1. To this end, we need to prove that

Π0Q1Π0 = Π0 and Q1Π0Q1 = Q1. Note that

Π0Q1Π0 = Jβ

(r)0

Q−1JT

β(r)0

Q1Jβ(r)0

Q−1JT

β(r)0

= Jβ

(r)0

Q−1QQ−1JT

β(r)0

= Π0.

24

We now prove Q1Π0Q1 = Q1. First, by QR decomposition (see, e.g. Gentle 1998, Section

3.2.2, pages 95-97 for more details) for β0, we can find its orthogonal complement such that

B = (b1, β0) is an orthogonal matrix, and β0 = B

0

1

. Thus, J

β(r)0

= BBTJβ

(r)0

=: BR

where R =

R1

R2

with R1 being a (p− 1)× (p− 1) nonsigular matrix. Further, note that

Q = JT

β(r)0

Q1Jβ(r)0

= RTBTQ1BR

= RT

bT1 Q1b1 0

0 0

R

= RT1 bT

1 Q1b1R1.

To prove the result, we rewrite Q1 in another form. Define S = B

R1 0

0 1

. S is a

nonsingular matrix. Then

Q1 = (ST )−1STQ1SS−1 = (ST )−1

RT1 0

0 1

BTQ1B

R1 0

0 1

S−1

= (ST )−1STQ1SS−1 = (ST )−1

RT1 0

0 1

bT1 Q1b1 0

0 0

R1 0

0 1

S−1

= (ST )−1

Q 0

0 0

S−1.

We now prove that Q1Π0Q1 = Q1 that is of the above form. From the above and noting

25

that S−1 =

R−11 0

0 1

BT and (ST )−1 = B

(RT1 )−1 0

0 1

, we have

Q1Π0Q1

= (ST )−1

Q 0

0 0

S−1BRQ−1RTBT (ST )−1

Q 0

0 0

S−1

= (ST )−1

Q 0

0 0

R−11 0

0 1

BTBRQ−1RTBTB

(RT1 )−1 0

0 1

Q 0

0 0

S−1

= (ST )−1

Q 0

0 0

1

R2

Q−1

(1 R2

)

Q 0

0 0

S−1

= (ST )−1

Q

0

Q−1

(Q 0

)S−1 = (ST )−1

Q 0

0 0

S−1 = Q1.

Thus, Π0 is one of the solutions of Q−1 . To prove that the asymptotic variance-covariance

matrix σ2Π1 of our estimator is smaller than the corresponding matrix σ2Π0 given in Hardle

et al. (1993), we only need to show that Π0 −Π1 is a positive semi-definite matrix, that

is, Π0 > Π1. Recall that V = JT

β(r)0

E{g′(XT β0)

2XXT}J

β(r)0

. Note that both Q and

V are positive definite matrices and obviously V ≥ Q. Thus, Q−1 ≥ V−1, and then

V−1 ≥ V−1QV−1. From these two inequalities, it is easy to see that

Π0 ≥ Jβ

(r)0

V−1JT

β(r)0

≥ Jβ

(r)0

V−1QV−1JT

β(r)0

= Π1.

The proof is now complete.

Proof of Corollary 1. Let • denote the inner product of two vectors. Theorem 2

implies ‖β − β0‖ = OP (n−1/2) and

| cos(β, β0)− 1| = |(β − β0) • β0/‖β‖+ (‖β0‖ − ‖β‖)/‖β‖ |≤ 3‖β − β0‖/‖β‖ = OP (n−1/2).

This completes the proof of Corollary 1.

Proof of Theorem 4. Denote θ0 = (θ01, . . . , θ0q)T, θ = (θ1, . . . , θq)

T. Theorem 1 and

26

Lemma A.4 in the Appendix yield

sup(x,β)∈An

|g∗(xTβ)− g(xTβ0)|

≤q∑

s=1

sup(x,β)∈An

|g2s(xTβ; β)− g2s(x

Tβ0)||θs − θ0s|

+ sup(x,β)∈An

|g(xTβ; β, θ0)− g(xTβ0)|

+q∑

s=1

supx∈A

|g2s(xTβ0)||θs − θ0s| = OP ((nh/ log n)−1/2),

and hence Theorem 4 follows.

Proof of Theorem 5. Decomposing σ2 into several parts, we have

σ2 =1

n

n∑

i=1

e2i +

1

n

n∑

i=1

[ZTi (θ0 − θ) + g(XT

i β0)− g(XTi β; β, θ0)]

2

+2

n

n∑

i=1

eiZTi (θ0 − θ) +

2

n

n∑

i=1

ei[g(XTi β0)− g(XT

i β; β, θ0)]

=: I1 + I2 + I3 + I4.

Note that√

n‖θ − θ0‖ = OP (1) and using (A.8) of Lemma A.4, we have

√n|I2| ≤ 1

n

n∑

i=1

‖Zi‖2√

n‖θ − θ0‖2

+√

n sup(x,β)∈An

|g(xTβ0)− g(xTβ; β, θ0)|2

= OP (n−1/2) + OP ((nh2/ log2 n)−1/2)P−→ 0.

Since Eei = 0, we obtain√

nI3P−→ 0. Similarly to the proof of (A.17) in the Appendix, we

also have√

nI4P−→ 0. This proves that σ2 = 1

n

∑ni=1 e2

i + oP (n−1/2). Therefore, we have

√n(σ2 − σ2) =

1√n

n∑

i=1

(e2i − σ2) + oP (1).

The proof can now be completed by employing the central limit theorem.

APPENDIX

The following Lemmas A.1–A.7 are needed to prove Theorems 1, 2, 4, 5. Lemma A.1

gives an important probability inequality and Lemmas A.2 and A.3 provide bounds for the

27

moments of the relevant estimators. They are used to obtain the rates of convergence for the

estimators of the nonparametric component, and are used in the proof of Lemmas A.4–A.7.

Lemma A.4 presents the uniform rates of convergence in probability for the estimators g,

g2s and g′. These results are very useful for the nonparametric estimations. The proof of

Lemma A.5–A.7, as well as Theorem 3 and 4, rely on Lemma A.4. To simplify the proof

of Theorem 1, we divide the main steps of the proofs into Lemmas A.5 and A.6. Lemma

A.5 is used to obtain the limiting variance of the estimator θ, and Lemma A.6 together with

Lemma A.5 shows that the rate of convergence of the nonlinear section of ei for θ − θ0 is

oP (n−1/2). Lemma A.7 provides the main step for the proof of Theorem 2.

Lemma A.1 Let ξ1(x, β), . . . , ξn(x, β) be a sequence of random variables. Denote

fx,β(Vi) = ξi(x, β) for i = 1, . . . , n, where V1, . . . , Vn be a sequence of random variables,

and fx,β is a function on An, where An = {(x, β) : (x, β) ∈ A× Rp, ‖β − β0‖ ≤ cn−1/2} for

a constant c > 0. Assume that fx,β satisfies

1

n

n∑

i=1

|fx,β(Vi)− fx∗,β∗(Vi)| ≤ cna[‖β − β∗‖+ ‖x− x∗‖] (A.1)

for some constants x∗, β∗, a > 0 and c > 0. Let εn > 0 depend only on n. If

P

{∣∣∣∣1

n

n∑

i=1

ξi(x, β)∣∣∣∣ >

1

2εn

}≤ 1

2, (A.2)

for (x, β) ∈ An, then we have

P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

ξi(x, β)∣∣∣∣ >

1

2εn

}

≤ c1n2paε−2p

n E

{sup

(x,β)∈An

2 exp( −n2ε2

n/128∑ni=1 ξ2

i (x, β)

)∧ 1

}, (A.3)

where c1 > 0 is a constant.

Proof. Let {ξ′1(x, β), . . . , ξ′n(x, β)} be an independent version of {ξ1(x, β), . . . , ξn(x, β)}.Now generate independent sign random variables σ1, . . . , σn for which P{σi = 1} = P{σi =

−1} =1

2, and {σi, 1 ≤ i ≤ n} independent of {ξi(x, β), ξ′i(x, β), 1 ≤ i ≤ n}. By symmetry,

σi(ξi − ξ′i) has the same distribution as (ξi − ξ′i). The symmetrization Lemma in Pollard

28

(1984) implies

P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

ξi(x, β)∣∣∣∣ > εn

}

≤ 2P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

[ξi(x, β)− ξ′i(x, β)]∣∣∣∣ >

1

2εn

}

= 2P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

σi[ξi(x, β)− ξ′i(x, β)]∣∣∣∣ >

1

2εn

}

≤ 4P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

σiξi(x, β)∣∣∣∣ >

1

4εn

}. (A.4)

Let Pn be the empirical measure that puts equal mass1

nat each of the n observations

V1, . . . , Vn. Let F = {fx,β(·) : ‖x‖ ≤ C, ‖β‖ ≤ B} be a class of functions indexed by x

and β consisting of fx,β(Vi) = ξi(x, β). Denote V = (V1, . . . , Vn). Given V , choose function

f ◦1 , . . . , f ◦m, each in F , such that

minj∈{1,...,m}

1

n

n∑

i=1

|fx,β(Vi)− f ◦j (Vi)| < εn (A.5)

for each fx,β in F . Let N(εn, Pn,F) be the minimum m for all sets that satisfies (A.5).

Denote f ∗x,β for the f ◦j at which the minimum is achieved, we then have

P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

σiξi(x, β)∣∣∣∣ >

1

4εn

∣∣∣∣V}

= P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

σifx,β(Vi)∣∣∣∣ >

1

4εn|V

}

≤ P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

σif∗x,β(Vi)

∣∣∣∣ >1

8εn|V

}

≤ N(εn, Pn,F) maxj∈{1,...,N}

P

{∣∣∣∣1

n

n∑

i=1

σif◦j (Vi)

∣∣∣∣ >1

8εn|V

}. (A.6)

Now we need to determine the order of N(εn, Pn,F). For each set satisfying (A.5), each f ◦j

has a pair (xj, βj) such that f ◦j (v) = fxj ,βj(v). Then for all (x, β) ∈ An, we have from (A.1)

that1

n

n∑

i=1

|fx,β(Vi)− fxj ,βj(Vi)| ≤ cna(‖β − βj‖+ ‖x− xj‖).

Next, we want to bound the right-hand side of the above formula by εn. Thus for each

(x, β) ∈ An, we need a pair (xj, βj) within radius rn = O(n−aεn) of (x, β). Therefore, the

29

number N needed to satisfy (A.5) is bounded by r−pn r−p

n = cn2paε−2pn , i.e.

N(εn, Pn,F) ≤ cn2paε−2pn . (A.7)

Now conditioning on V , σif◦j (Vi) is bounded. Hoeffding’s inequality [(Hoeffding (1963)] yields

P

{∣∣∣∣1

n

n∑

i=1

σif◦j (Vi)

∣∣∣∣ >1

8εn|V

}≤ 2 exp

( −2n(εn/8)2

∑ni=1 4f 2

xj ,βj(Vi)

)∧ 1.

This together with (A.4), (A.6) and (A.7) proves (A.3).

Lemma A.2 Suppose that conditions C1, C2 and C3(i) hold. If h = cn−a for any

0 < a < 1/2 and some constants c > 0, then, for i = 1, · · · , n, we have

E

g(XT

i β0)−n∑

j=1

Wnj(XTi β0; β0)g(XT

j β0)

2

= O(h4),

E

g(xTβ)−

n∑

j=1

Wnj(xTβ; β)g(XT

j β)

2

= O(h4),

E

g′(XT

i β0)−n∑

j=1


j β0)

2

= O(h21)

and

E

n∑

j=1

Wni(XTj β0; β0)ϕ(XT

j β0)− ϕ(XTi β0)

2

= O(√

h ),

where ϕ(t) = g′(t)g3s(t) and g3s is the sth component of g3(t) = E(X|XTβ0 = t).

Proof. See Lemma 1 of Zhu and Xue (2006).

Lemma A.3 Under the assumptions of Lemma A.2, we have

E{W 2ni(X

Ti β0; β0)} = O((nh)−2),

E

n∑

j=1,j 6=i

W 2nj(X

Ti β0; β0)

= O((nh)−1),

E

n∑

j=1

W 2nj(x

Tβ; β)

= O((nh)−1)

and

E{W 2ni(X

Ti β0; β0)} = O((nh1)

−2 + (n3h51)−1),

E

n∑

j=1,j 6=i

W 2nj(X

Ti β0; β0)

= O((nh3

1)−1).

30

Proof. See Lemma 2 of Zhu and Xue (2006).

Lemma A.4 Suppose that conditions C1–C4 and C5(i) hold. We then have

sup(x,β)∈An

|g(xTβ0)− g(xTβ; β, θ0)| = OP ((nh/ log n)−1/2) (A.8)

and

sup(x,β)∈An

|g2s(xTβ0)− g2s(x

Tβ; β)| = OP ((nh/ log n)−1/2). (A.9)

If in addition, C5(ii) also holds, then we have

sup(x,β)∈An

|g′(xTβ0)− g′(xTβ; β, θ0)| = OP ((nh31/ log n)−1/2), (A.10)

where An = {(x, β) : (x, β) ∈ A×Rp, ‖β − β0‖ ≤ cn−1/2} for a constant c > 0.

Proof. We only prove (A.8), the proofs for (A.9) and (A.10) are similar. Write g(Xi, ei) =

g(xTβ0)− g(XTi β0)− ei, i = 1, . . . , n. We have

g(xTβ0)− g(xTβ; β, θ0) =n∑

i=1

Wni(xTβ; β)g(Xi, ei). (A.11)

Let ξi(x, β) = n(nh/ log n)1/2Wni(xTβ; β)g(Xi, ei), fx,β(Vi) = ξi(x, β), Vi = (Xi, ei), i =

1, . . . , n. Using lemma A.1, we have to verify (A.1) and (A.2). A simple calculation yields

(A.1), so we now verify (A.2). By lemmas A.2 and A.3, and noting that sup(x,β)∈An|g(xTβ)−

g(xTβ0)| = O(n−1/2), we have

E[g(xTβ0)− g(xTβ; β, θ0)]2 = E

[n∑

i=1

Wni(xTβ; β)g(Xi, ei)

]2

≤ cE

[g(xTβ)−

n∑

i=1

Wni(xTβ; β)g(XT

i β)

]2

+ cE

{n∑

i=1

W 2ni(x

Tβ; β)

}+ O(n−1)

≤ ch4 + c(nh)−1. (A.12)

Given a M > 0, by Chevbychev’s inequality and (A.12), we have

P

{∣∣∣∣1

n

n∑

i=1

ξi(x, β)∣∣∣∣ >

1

2M

}≤ 4M−2E

[1

n

n∑

i=1

ξi(x, β)

]2

≤ 4M−2nh(log n)−1E

[n∑

i=1

Wni(xTβ; β)g(Xi, ei)

]2

≤ cM−2(cnh5 + c(log n)−1). (A.13)

31

Therefore, from C5(i), we can choose M large enough so that the right hand side of (A.13)

is less than or equal to1

2. Hence, (A.2) is satisfied. We now can use (A.3) of Lemma A.1 to

get (A.8). By Lemma A.3, we obtain

n−2n∑

i=1

Eξ2i (x, β) = nh(log n)−1

n∑

i=1

E[Wni(x

Tβ; β)g(Xi, ei)]2

≤ cnh(log n)−1n∑

i=1

EW 2ni(x

Tβ; β) ≤ c(log n)−1.

This implies that n−2 ∑ni=1 ξ2

i (x, β) = OP ((log n)−1). Hence, from Lemma A.1 we have

P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

ξi(x, β)∣∣∣∣ >

1

2M

}≤ cn2paM−2p exp (− cM2 log n).

The right-hand side of the above formula tends to zero when M is large enough. Therefore,

(A.8) follows.

Lemma A.5 Under the assumptions of Theorem 1, we have

n−1ZTZP−→ Σ.

where Σ is defined in condition C6.

Proof. Noting that Z = (I− S)Z, the (i, s) element of Z is

Zis = [Zis − g2s(XTi β0)] + [g2s(X

Ti β0)− g2s(X

Ti β0; β0)].

The (s, t) element of ZTZ is

n∑

i=1

ZisZit =n∑

i=1

[Zis − g2s(XTi β0)][Zit − g2t(X

Ti β0)]

+n∑

i=1

[Zis − g2s(XTi β0)][g2t(X

Ti β0)− g2t(X

Ti β0; β0)]

+n∑

i=1

[Zit − g2t(XTi β0)][g2s(X

Ti β0)− g2s(X

Ti β0; β0)]

+n∑

i=1

[g2s(XTi β0)− g2s(X

Ti β0; β0)][g2t(X

Ti β0)− g2t(X

Ti β0; β0)]

=:I1 + I2 + I3 + I4. (A.14)

By the law of large numbers, we have

n−1I1P−→ E{[Z1s − E(Z1s|XT

1 β0)][Z1t − E(Z1t|XT1 β0)]} =: Σst, (A.15)

32

where Σst is the (s, t) element of Σ. Noting that

1

n

n∑

i=1

|Zis − g2s(XTi β0)| P−→ E|Z1s − g2s(X

T1 β0)| < ∞,

this together with (A.9) of Lemma A.4 proves that

n−1I2 ≤ OP (1) sup(x,β)∈An

|g2t(xTβ0)− g2t(x

Tβ; β)| P−→ 0.

Similarly, we can prove n−1I3P−→ 0 and n−1I4

P−→ 0. This together with (A.14) and

(A.15) proves Lemma A.5.

Lemma A.6 Under the assumptions of Theorem 1, we have

n−1/2ZTG :=1√n

n∑

i=1

Zi[g(XTi β0)− g(XT

i β0; β0, θ0)]P−→ 0.

Proof. The sth component of ZTG is

n∑

i=1

Zis[g(XTi β0)− g(XT

i β0; β0, θ0)]

=n∑

i=1

[Zis − g2s(XTi β0)][g(XT

i β0)− g(XTi β0; β0, θ0)]

+n∑

i=1

[g2s(XTi β0)− g2s(X

Ti β0; β0)][g(XT

i β0)− g(XTi β0; β0, θ0)]

=: J1 + J2. (A.16)

For J2, from (A.8) and (A.9) of Lemma A.4 we have

n−1/2|J2| ≤√

n sup(x,β)∈An

|g2s(xTβ0)− g2s(x

Tβ; β)|

× sup(x,β)∈An

|g(xTβ0)− g(xTβ; β, θ0)| = OP ((nh2/ log2 n)−1).

Noting that nh2/ log2 n →∞, we obtain n−1/2J2P−→ 0. It remains to prove that

n−1/2J1P−→ 0, (A.17)

as this together with (A.16) implies Lemma A.6. To prove (A.17) , we only need to show

that

supβ∈B′n

∣∣∣∣1

n

n∑

i=1

√n[Zis − g2s(X

Ti β0)][g(XT

i β0)− g(XTi β, β)]

∣∣∣∣P−→ 0, (A.18)

33

where B′n = {β : ‖β − β0‖ ≤ cn−1/2} for a constant c > 0. Toward this goal, we note that

Lemma A.1 can be used when the variable x is removed. Let

ξi(β) =√

n[Zis − g2s(XTi β0)][g(XT

i β0)− g(XTi β, β, θ0)],

fβ(Vi) = ξi(β), Vi = (Xi, Zis, ei), i = 1, . . . , n.

We now verify that (A.1) and (A.2) are satisfied. By the condition C3(ii) on the kernel

function, we calculate that

1

n

n∑

i=1

|fβ(Vi)− fβ∗(Vi)| ≤ cn5/2h−2‖β − β∗‖ = cna‖β − β∗|

where a = 52

+ 2λ(15≤ λ < 1

2). Hence, (A.1) is satisfied.

We next verify that (A.2) is satisfied. Denote ζi = Zis − g2s(XTi β0). From condition C4,

Lemmas A.2 and A3, we have

E

[1

n

n∑

i=1

ξi(β)

]2

≤ 2n−1n∑

i=1

E

g(XT

i β0)−n∑

j=1

Wnj(XTi β; β)g(XT

j β0)

2

E(ζ2i |XT

i β0)

+ 2n−1∑

i

∑

j

∑

k

∑

l

E[Wnj(XTi β; β)Wnl(X

Tk β; β)ζiζkejel]

≤ ch4 + cn−1 + cn−1

n∑

i=1

EW 2ni(X

Ti β; β) +

∑

i6=j

EW 2nj(X

Ti β; β)

≤ ch4 + cn−1 + c(nh)−1 −→ 0.

Hence, we can obtain

P

{∣∣∣∣1

n

n∑

i=1

ξi(β)∣∣∣∣ >

1

2ε

}≤ ch4 + cn−1 + c(nh)−1 <

1

2

when n large enough. Therefore, (A.2) is satisfied. By (A.8) of Lemma A.4, we have

1

n2

n∑

i=1

ξ2i (β) = OP (1) sup

(x,β)∈An

[g(xTβ0)− g(xTβ; β, θ0)]2 = OP (nh/ log n)−1).

By using Lemma A.1, we obtain

P

{sup

(x,β)∈An

∣∣∣∣1

n

n∑

i=1

ξi(β)∣∣∣∣ >

1

2ε

}≤ cn2paε−2p exp(−cnh/ log n) −→ 0.

34

by nh/ log n →∞. This proves (A.18) and thus completes the proof of Lemma A.6.

Lemma A.7. Suppose that conditions C1–C6 are satisfied, then we have

supβ(r)∈Bn

‖R(β(r))− U(β(r)0 ) + nV(β(r) − β

(r)0 )‖ = oP (

√n),

where Bn = {β(r) : ‖β(r) − β(r)0 ‖ ≤ Cn−1/2} for a constant C > 0, V is defined in condition

C6,

R(β(r)) =n∑

i=1

[Yi − ZTi θ0 − g(XT

i β; β, θ0)]g′(XT

i β; β, θ0)JTβ(r)Xi, (A.19)

and

U(β(r)0 ) =

n∑

i=1

eig′(XT

i β0)JT

β(r)0

[Xi − E(Xi|XTi β0)].

Proof. Separating R(β(r)), we have

R(β(r)) =n∑

i=1

eig′(XT

i β0)JTβ(r) [Xi − E(Xi|XT

i β0)]

+n∑

i=1

ei[g′(XT

i β; β, θ0)− g′(XTi β0)]J

Tβ(r)Xi

−n∑

i=1

g′(XTi β0)J

Tβ(r)Xi{g(XT

i β; β, θ0)− g(XTi β0; β0, θ0)}

−n∑

i=1

g′(XTi β0)J

Tβ(r){Xi[g(XT

i β0; β0, θ0)− g(XTi β0)]− eig3(X

Ti β0)}

−n∑

i=1

[g(XTi β; β, θ0)− g(XT

i β0)][g′(XT

i β; β, θ0)− g′(XTi β0)]J

Tβ(r)Xi

=: R1(β(r)) + R2(β

(r))−R3(β(r))−R4(β

(r))−R5(β(r)). (A.20)

Noting that Jβ(r) − Jβ

(r)0

= OP (n−1/2) for all β(r) ∈ Bn, we have

supβ(r)∈Bn

‖R1(β(r))− U(β

(r)0 )‖ = oP (

√n ). (A.21)

Since ‖β(r)−β(r)0 ‖ ≤ Cn−1/2 implies ‖β−β0‖ ≤ Cn−1/2 for all β(r) ∈ Bn, similar to the proof

of (A.17) we can show that

supβ(r)∈Bn

‖R2(β(r))‖ = oP (

√n ). (A.22)

35

For R3(β(r)), by a Taylor expansion of β(r) − β

(r)0 with a suitable mean β(r) ∈ Bn and

β = β(β(r)), we get

R3(β(r)) =

n∑

i=1

g′(XTi β0)g

′(XTi β; β, θ0)J

Tβ(r)XiX

Ti Jβ(r)(β(r) − β

(r)0 )

=n∑

i=1

g′(XTi β0)[g

′(XTi β; β, θ0)− g′(XT

i β0)]

× JTβ(r)XiX


(r)0 )

+n∑

i=1

g′(XTi β0)

2JTβ(r)XiX


(r)0 )

=:R31(β(r), β(r)) + R32(β

(r), β(r)).

By (A.10) of Lemma A.4 and the law of large numbers, we obtain that

supβ(r),β(r)∈Bn

‖R31(β(r), β(r))‖ = oP (

√n )

and

supβ(r),β(r)∈Bn

‖R32(β(r), β(r))− nV(β(r) − β

(r)0 )‖ = oP (

√n ).

Therefore, we have

supβ(r)∈Bn

‖R3(β(r))− nV(β(r) − β

(r)0 )‖ = oP (

√n ). (A.23)

We now consider R4(β(r)). Write R4(β

(r)) = JTβ(r)R

∗4(β

(r)). Let R∗4,s denote the sth

component of R∗4(β

(r)). First, from Lemma A.2 and A.3 we have

n−1E(R∗24,s) ≤ cn−1

n∑

i=1

E

n∑

j=1

Wni(XTj β0; β0)g

′(XTj β0)Xjs − g′(XT

i β0)g3s(XTi β0)

2

+ cn∑

i=1

E

n∑

j=1


j β0)− g(XTi β0)

2

≤ c(nh)−1 + c√

h + cnh4 −→ 0.

This implies

supβ(r)∈Bn

‖R4(β(r))‖ = oP (

√n), (A.24)

and by Lemma A.4 we obtain

supβ(r)∈Bn

‖R5(β(r))‖ = oP (

√n). (A.25)

Substituting (A.21)–(A.25) into (A.20), we prove Lemma A.7.

36

REFERENCES

Arnold, S. f. (1981). The Theory of Linear Models and Multivariate Analysis. John Wiley

& Sons, New York.

Bhattacharya, P. K. and Zhao, P.-L. (1997). Semiparametric inference in a partial linear

model. Ann. Statist. 25 244-262.

Carroll, R. J., Fan, J. Gijbels, I. and Wand, M. P. (1997). Generalized partially linear

single-index models. J. Amer. Statist. Assoc. 92 477–489.

Chen, C.-H. and Li, K.-C.(1998). Can SIR be as popular as multiple linear regression.

Statistics Sinica 8 289–316.

Chen, H. (1988). Convergence rates for parametric components in a partly linear model.

Ann. Statist. 16 136–141.

Chen, H. and Shiau, J.-J. H. (1994). Data-driven efficient estimators for a Partially linear

model. Ann. Statist. 22 211–237.

Chiou, J. M. and Muller, H. G. (1998). Quasi-likelihood regression with unknown link and

variance functions. J. Amer. Statist. Assoc. 93 1376–1387.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics.

Wiley, New York.

Cook, R. D. (2007). Fisher Lecture: Dimension Reduction in Regression1, 2 R. Dennis

Cook. Statist. Sci. 22 1-26.

Cook, R. D. and Li, B. (2002). Dimension reduction for the conditional mean in regression.


Cook, R. D. and Wiseberg, S. (1991). Comment on “Sliced inverse regression for dimension

reduction,” by K. C. Li. J. Amer. Statist. Assoc. 86 328–332.

37

Craven, P. and Wahba, G. (1979), Smoothing and noisy data with spline functions: esti-

mating the correct degree of smoothing by the method of generalized cross-validation,

Numer. Math. 31 377-403.

Doksum, K. and Samarov, A. (1995), Nonparametric estimation of global functionals and

a measure of the explanatory power of covariates in regression. Ann. of Statist. 23

1443-1473.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman

and Hall, London.

Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist.

Assoc. 76 817–823.

Gentle, J. E.(1998). Numerical Linear Algebra for Applications in Statistics. Berlin:

Springer-Verlag.

Hall, P. (1989). On projection pursuit regression. Ann. Statist. 17 573–588.

Hardle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models.


Hardle, W., Gao, J. and Liang, H. (2000) Partially Linear Models. Springer, New York

Harrison, D. and Rubinfeld, D. (1978). Hedonic housing pries and the demand for clean

air Environmental Economics and Management 5 81–102.

Heckman, N. (1986). Spline smoothing in a partly linear model. J. Royal Statist. Soci.

Ser.A 48 244–248.

Hristache, M., Juditsky, A. and Spokoiny, V. (2001). Direct estimation of the index coeffi-

cient in a single-index model. Ann. Statist. 29 595–623.

Hsing, T. and Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression.

Ann. Statist. 20 1040–1061.

38

Li, B., Wen, S. Q. and Zhu L. X. (2008). On a projective resampling method for dimension

reduction with multivariate responses. J. Amer. Statist. Assoc. 106 1177-1186.

Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). J.

Amer. Statist. Assoc. 86 316–342.

Li, K. C. (1992). On principal Hessian directions for data visualization and dimension

reduction: Another application of Stein’s lemma. J. Amer. Statist. Assoc. 87 1025–

1039.

Li, K. C., Aragon, Y., Shedden, K., and Agnan, C. T. (2003). Dimension reduction for

multivariate response data. J. Amer. Statist. Assoc. 98 99–109.

Li, Y. X. and Zhu, L. X. (2007). Asymptotics for sliced average variance estimation. Ann.

Statist. 35 41–69.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag New York Inc.,

New York.

Rice, J. (1986), Convergence rates for partially splined models. Statist. Prob. Lett. 4

203-208.

Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. New

York: Cambridge University Press,

Speckman, P. (1988). Kernel smoothing in partial linear models. J. Roy. Statist. Soci.

Ser.B 50 413–434.

Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica 54 1461–

1481.

Stute,W. and Zhu, L. X. (2005). Nonparametric checks for single-index models. Ann.

Statist. 33 1048–1083.

Weisberg, S. and Welsh, A. H. (1994). Adapting for the Missing Linear Link. Ann. Statist.

22 1674–1700.

39

Welsh, A. H. (1989). On M-processes and M-estimation. Ann. Statist. 17 337–361.

[Correction(1990) 18 1500.]

Xia, Y. and Hardle, W. (2006). Semi-parametric estimation of partially linear single-index

models. J. Multi. Anal. 97 1162 - 1184

Xia, Y., Tong, H. Li, W. K. and Zhu L. X. (2002). An adaptive estimation of dimension

reduction space. J. R. Statist. Soc. B 64 363–410.

Xia, Y. (2006). Asymptotic distributions for two estimators of the single-index model.

Econometric Theory 22 1112–1137.

Yin, X. and Cool, R. D. (2002). Dimension reduction for the conditional k-th moment in

regression. J. R. Statist. Soc. B 64 159–175.

Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partially linear single-index

models. J. Amer. Statist. Assoc. 97 1042–1054.

Zhu, L. X. and Ng, K. W. (1995). Asymptotics for Sliced Inverse Regression. Statistica

Sinica 5 727-736.

Zhu, L. X. and Ng, K. W. (2003) Checking the adequacy of a partial linear model. Statist.

Sinica 13 763-781.

Zhu L. X. and Xue L. G. (2006) Empirical likelihood confidence regions in a partially linear

single-index model. J. Roy. Statist. Soc. ser. B, 68 549–570.

Zhu, L., Zhu, L. X., Ferre L., and Wang, T. (2008). Sufficient Dimension Reduction

Through Discretization-Expectation Estimation. Unpublished manuscript, Hong Kong

Baptist University.

40

TABLE 1

Simulation results for θ with βZ and β0 parallel

Resulting estimate One-step iterated estimate

Bias SD MSE Bias SD MSE

PPR 0.0058 0.0706 0.00502 0.0046 0.0701 0.00493

SIR5 0.0095 0.0862 0.00753 0.0083 0.0869 0.00762

SIR10 0.0113 0.0788 0.00634 0.0098 0.0808 0.00663

β0 given 0.0031 0.0660 0.00436

TABLE 2

Simulation results for θ with βZ and β0 orthogonal

Resulting estimate One-step iterated estimate

Bias SD MSE Bias SD MSE

PPR -0.0087 0.0972 0.00952 -0.0047 0.0711 0.00508

SIR5 -0.0115 0.1395 0.01960 -0.0072 0.0919 0.00850

SIR10 -0.0102 0.1362 0.01865 -0.0083 0.0959 0.00926

β0 given -0.0024 0.0696 0.00485

TABLE 3

Simulation results for the angles between β and β0

βZ and β0 parallel βZ and β0 orthogonal

Mean SD MSE Mean SD MSE

PPR 0.0148 0.0056 0.00025 0.0157 0.0066 0.00029

SIR5 0.0467 0.0223 0.00268 0.0482 0.0232 0.00286

SIR10 0.0496 0.0230 0.00299 0.0528 0.0229 0.00331

41

0.0 0.2 0.4 0.6 0.8 1.0

0.00

0.05

0.10

0.15

0.20

single−index

estim

ated

val

ues

of g

Figure 1: Curve estimate for a single replication of the quadratic model simulation study,

with orthogonal βZ and β0. The true cure g(solid curve), the mean of g∗ with GCV band-

width (dashed curve) and a fixed optimal bandwidth hopt = 0.439 (dotted curve) over 2000

simulations are shown.

42

0.3 0.4 0.5 0.6 0.7 0.8

9.0

9.5

10.0

10.5

11.0

single−index

estim

ated

val

ues

of g

Figure 2: Curve estimate for the Boston Housing data, with xTβ on the x-axis and g∗(t) on

the y-axis.

43

estimation for a partial-linear single-index modelanson.ucdavis.edu/~wang/plsi_zhu08.pdf · in this...

Documents