

Ann Inst Stat Math
DOI 10.1007/s10463-012-0393-6

Estimation in semiparametric models with missing data

Song Xi Chen · Ingrid Van Keilegom

Received: 7 May 2012 / Revised: 22 October 2012
© The Institute of Statistical Mathematics, Tokyo 2012

Abstract This paper considers the problem of parameter estimation in a general class of semiparametric models when observations are subject to missingness at random. The semiparametric models allow for estimating functions that are non-smooth with respect to the parameter. We propose a nonparametric imputation method for the missing values, which then leads to imputed estimating equations for the finite dimensional parameter of interest. The asymptotic normality of the parameter estimator is proved in a general setting, and is investigated in detail for a number of specific semiparametric models. Finally, we study the small sample performance of the proposed estimator via simulations.

Keywords Copulas · Imputation · Kernel smoothing · Missing at random · Nuisance function · Partially linear model · Semiparametric model · Single-index model

1 Introduction

Semiparametric models encompass a large class of statistical models. They have the advantage of being more interpretable and parsimonious than nonparametric models, and at the same time, they are less restrictive than purely parametric models. Let

S. X. Chen
Guanghua School of Management and Center for Statistical Science,
Peking University, Beijing 100871, China

S. X. Chen
Iowa State University, Ames, IA, USA

I. Van Keilegom (B)
Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays 20,
1348 Louvain-la-Neuve, Belgium
e-mail: [email protected]


(X, Y) and (Xi, Yi), i = 1, …, n, be independent and identically distributed random vectors. We denote Θ for a finite dimensional parameter set (a compact subset of IR^p) and H for a set of infinite dimensional functions depending on X and/or Y. The functions in H are allowed to depend on θ. Suppose that g(X, Y, θ, h) is an estimating function which is known up to the finite dimensional parameter θ ∈ Θ and the infinite dimensional nuisance function h ∈ H, and which satisfies

G(θ, h) := E{g(X, Y, θ, h)} = 0 (1)

at θ = θ0 and h = h0, which are respectively the true parameter value and the true infinite dimensional nuisance function.

Model (1) includes as special cases many well-known semiparametric models. For instance, by adding a nonparametric functional component h(·) to the classical linear regression model, we get the following partially linear model:

Y = θ^T X1 + h(X2) + ε,   (2)

where X = (X1, X2) is a covariate vector, Y is a univariate response, and the error ε satisfies some identifiability constraint, like E(ε|X) = 0 or med(ε|X) = 0. Here h is a nuisance function that summarizes the nonparametric covariate effect due to a group of predictors X2. In the context of the generalized linear model (McCullagh and Nelder 1983), if the known link function is replaced by an unknown nonparametric link function h, we arrive at the single-index regression model Y = h(θ^T X) + ε. Other semiparametric models that are special cases of model (1) include copula models, semiparametric transformation models, Cox models, among many others.

Missing values are commonly encountered in statistical applications. In survey sampling, e.g., there are typically non-responses of respondents to some survey questions. In biological applications, part of the data vector is often incompletely collected. The presence of missing values means that the entire sample {(Xi, Yi)}_{i=1}^n is not available. Without loss of generality, we assume that Yi is subject to missingness, whereas Xi is always available. Note that this convention implies that if the vector (Xi, Yi) follows a regression model, the vector Yi possibly contains some covariates, and the vector Xi possibly contains the (or a) response.

There are basically two streams of inference methods for missing values. The first one is the imputation approach. The celebrated multiple imputation method of Little and Rubin (2002) is a popular representative of this approach. The second approach is based on inverse weighting by the missing propensity function, proposed by James Robins and colleagues; see for instance Robins et al. (1994). The implementation of both approaches usually requires a parametric model for the missing propensity function or the missing at random mechanism (Rubin 1976).

The aim of this paper is to provide a general estimator of the finite dimensional parameter θ in the presence of the nuisance function h and of missing values. To let the estimation fully respect the underlying missing value mechanism without assuming a parametric model, we impute for each missing Yi multiple copies drawn from a kernel estimator of the conditional distribution of Yi given Xi, under the assumption of missingness at random. This nonparametric imputation method can be viewed as


a nonparametric counterpart of the multiple-imputation approach of Little and Rubin (2002). With the imputed missing values and a separate estimator for the nonparametric function h, the estimator of θ is obtained by solving an estimating equation based on (1). The consistency and asymptotic normality of the estimator are established under a set of mild conditions.

We end this section by mentioning some related papers on parametric and semiparametric models with missing data. Recent contributions have been made, e.g., by Wang et al. (1998, 2004, 2010), Müller et al. (2006), Chen et al. (2007), Wang and Sun (2007), Liang (2008), Muller (2009) and Wang (2009). All these contributions are however limited to specific (often quite narrow) classes of models, whereas we aim in this paper at developing a general approach, applicable not only to regression models (with missing responses and/or covariates), but also to any other semiparametric model with missing data. We also refer to Chen et al. (2008), who studied semiparametric efficiency bounds and efficient estimation of parameters defined through general moment restrictions with missing data. Their method relies, however, on auxiliary data containing information about the distribution of the missing variables conditional on proxy variables that are observed in both the primary and the auxiliary database, when such distribution is common to the two datasets.

The paper is organized as follows. The nonparametric imputation and the estimation framework are introduced in Sect. 2. Section 3 reports the general asymptotic result regarding the consistency and the asymptotic normality of the proposed estimator. The general result is illustrated and applied to a set of popular semiparametric models in Sect. 4. In Sect. 5, we study the small sample performance of the proposed estimator by simulations. All the technical details are provided in the Appendix.

2 General method

Let X be a dx-dimensional vector that is always observable, and let Y be a dy-dimensional vector that is subject to missingness. Define Δ = 1 if Y is observed, and Δ = 0 if Y is missing. We assume that Y is missing at random, i.e., Δ and Y are conditionally independent given X:

P(Δ = 1 | X, Y) = P(Δ = 1 | X) =: p(X).

Note that using Y to denote the missing vector does not mean that we work under a regression model with Y as the response in that model. In particular, Y can represent a set of covariates in a regression problem. Hence, our framework includes the case where the covariates are subject to missingness.
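As a small numerical illustration of the missing at random assumption above, the indicator Δ can be generated from a propensity that depends on X only; the logistic form of p(x), the sample size and all variable names below are our own assumptions for the sketch, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy MAR mechanism: Delta depends on X only, so Delta and Y are
# conditionally independent given X.
n = 10_000
X = rng.uniform(0.0, 1.0, n)
Y = 2.0 * X + rng.normal(0.0, 0.5, n)      # Y is correlated with X

def p(x):
    """Missingness propensity p(x) = P(Delta = 1 | X = x) (assumed logistic)."""
    return 1.0 / (1.0 + np.exp(-(0.5 + x)))

Delta = rng.binomial(1, p(X))              # Delta = 1 means Y is observed

# Sanity check: within a narrow band of X values, the mean of the observed
# Y's matches the mean of all Y's in that band (no selection bias under MAR),
# even though the overall observed-Y mean is biased upward.
band = (X > 0.4) & (X < 0.6)
print(Y[band].mean(), Y[band & (Delta == 1)].mean())
```

Under missingness *not* at random (Δ depending on Y given X), the two band means would differ systematically.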

In the absence of missing values, the semiparametric model is defined by an r-dimensional real-valued estimating function g(X, Y, θ, h), where θ is a finite dimensional parameter taking values in a compact Θ ⊂ IR^p, and h is an unknown function that takes values in a functional space H (an infinite dimensional parameter set of functions) and depends on X and/or Y. The functions in H are allowed to depend on θ as well (but we will often suppress this dependence when no confusion is possible). Let θ0 and h0 be the true unknown finite and infinite dimensional


parameters. We often omit the arguments of the function h for notational convenience, i.e., (θ, h) ≡ (θ, hθ), (θ, h0) ≡ (θ, h0θ) and (θ0, h0) ≡ (θ0, h0θ0). The estimating function g is known up to θ and h. Suppose that r ≥ p, meaning that the number of estimating functions may be larger than the dimension of θ, so we allow for an over-identified set of equations, popular in, e.g., econometrics. Moreover, by allowing the function g to be a non-smooth function of its arguments, our general model also includes, e.g., quantile regression models or change-point models.

Let G(θ, h) = E[g(X, Y, θ, h)], which is a nonrandom vector-valued function G : Θ × H → IR^r, such that G(θ, h0) = 0 for θ = θ0. If all data (Xi, Yi), i = 1, …, n, were observed, the parameter θ could be estimated by minimizing a weighted norm of n^{-1} ∑_{i=1}^n g(Xi, Yi, θ, ĥθ), where ĥθ is an appropriate estimator of hθ. See Chen et al. (2003) for more details.

The issue of interest here is the estimation of θ in the presence of missing values. Let (Xi, ΔiYi, Δi), i = 1, …, n, be i.i.d. random vectors having the same distribution as (X, ΔY, Δ). We use a nonparametric approach to impute the missing values via a nonparametric kernel estimator of F(y|x) = P(Y ≤ y | X = x), the conditional distribution of Y given X = x. The kernel estimator of F(y|x) is

F̂(y|x) = [∑_{j=1}^n Δj Ka(Xj − x) I(Yj ≤ y)] / [∑_{l=1}^n Δl Ka(Xl − x)],   (3)

based on the portion of the sample without missing data, where K is a dx-dimensional kernel function, a = an is a bandwidth sequence and Ka(·) = K(·/an)/an^{dx}.

For each missing Yi, we generate κ (conditionally) independent Y*i1, …, Y*iκ from F̂(·|Xi) as imputed values for the missing Yi. Define now the imputed estimating function:

Gn(θ, h) = n^{-1} ∑_{i=1}^n { Δi g(Xi, Yi, θ, h) + (1 − Δi) κ^{-1} ∑_{l=1}^κ g(Xi, Y*il, θ, h) }.

Note that the value of κ controls the variance of the imputed component. Theoretically speaking, we will let κ tend to infinity. Our numerical experience shows that the choice κ = 50 is sufficient when the dimension is not very large. For a larger dimension, κ should be chosen larger. We note that κ^{-1} ∑_{l=1}^κ g(Xi, Y*il, θ, h) approximates ∫ g(Xi, y, θ, h) dF̂(y|Xi). If this integral can be computed directly, then explicit imputation can be avoided. If a parametric model F(y|x; θ) is available for the conditional distribution, where θ is a finite dimensional parameter and θ̂ is its maximum likelihood estimator, then we can use F(y|x; θ̂) instead of the nonparametric estimator F̂(y|x) to generate the imputed Y*i. We would like to emphasize here that our general model is not necessarily a regression model, and hence in general, we cannot impute missing Y values using conditional mean imputation. Even if we had a regression structure, the conditional mean imputation approach would still not be applicable in general, since Y does not necessarily represent the response variable in that model. See also Wang and Chen (2009), where a similar imputation approach has been used in the context of parametric estimating equations with missing values.
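Since the estimator F̂(·|x) in (3) is a discrete distribution that puts mass proportional to Δj Ka(Xj − x) on each observed response Yj, drawing the κ imputed copies amounts to weighted resampling of the complete-case responses. A minimal sketch with a Gaussian kernel and a one-dimensional X (all data-generating choices are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_draws(x, X, Y, Delta, a, kappa, rng):
    """Draw kappa values from the kernel estimator F_hat(.|x) of (3).

    F_hat(.|x) is a discrete distribution putting weight proportional to
    Delta_j * K_a(X_j - x) on each observed response Y_j, so sampling from
    it is weighted resampling of the complete cases.
    """
    w = Delta * np.exp(-0.5 * ((X - x) / a) ** 2)   # Gaussian K_a (assumed)
    return rng.choice(Y, size=kappa, p=w / w.sum())

# Toy sample; for simplicity the response is missing completely at random.
n = 500
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2.0 * np.pi * X) + rng.normal(0.0, 0.1, n)
Delta = rng.binomial(1, 0.7, n)

draws = impute_draws(0.25, X, Y, Delta, a=0.05, kappa=50, rng=rng)
print(draws.mean())   # rough approximation of E(Y | X = 0.25) = sin(pi/2) = 1
```

The same routine applied at each Xi with Δi = 0 produces the Y*i1, …, Y*iκ entering Gn above.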


From the imputed estimating function Gn(θ, h) and for a given estimator ĥθ of hθ, depending on the particular model at hand, we define the estimator of θ by:

θ̂ = argmin_{θ∈Θ} ‖Gn(θ, ĥθ)‖W,   (4)

where ‖A‖W = (tr(A^T W A))^{1/2} for any r-dimensional vector A and for some fixed symmetric r × r positive definite matrix W (and where tr stands for the trace of a matrix). Note that when r = p (so the number of equations equals the number of parameters to be estimated) and when the function g is smooth in θ, the system of equations Gn(θ, ĥθ) = 0 has a solution [namely θ̂ defined in (4)] under certain regularity conditions. In other situations (e.g., in the case of quantile regression or in an over-identified case), there is no vector θ that solves this equation, in which case we have to use the (more general) definition given in (4).
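To make the definition in (4) concrete, the sketch below evaluates ‖A‖W and minimizes ‖Gn(θ)‖W for a linear, over-identified toy moment function with r = 3 and p = 2; the matrices M, W and the vector c are invented for illustration. In this linear case the minimizer has the familiar closed form (M^T W M)^{-1} M^T W c, while for non-smooth g a derivative-free optimizer would be used instead.

```python
import numpy as np

def norm_W(A, W):
    """Weighted norm ||A||_W = (A' W A)^(1/2) of an r-dimensional vector A."""
    return float(np.sqrt(A @ W @ A))

# Toy over-identified moment function G_n(theta) = M theta - c (r = 3 > p = 2).
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
c = np.array([1.0, 2.0, 3.5])
W = np.diag([1.0, 1.0, 2.0])

def Gn(theta):
    return M @ theta - c

# No theta solves Gn(theta) = 0 here, so we minimize ||Gn(theta)||_W;
# for this linear toy the minimizer is available in closed form.
theta_hat = np.linalg.solve(M.T @ W @ M, M.T @ W @ c)
print(theta_hat, norm_W(Gn(theta_hat), W))
```

The residual norm at theta_hat is strictly positive, exactly the over-identified situation described in the text.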

3 Main result

Below, we state the asymptotic normality of the estimator θ̂ and we also give the formula of its asymptotic variance. The conditions under which this result is valid are given in the Appendix, and they will be checked in detail in Sect. 4 for a number of specific semiparametric models.

Theorem 1 Assume that conditions (A1)–(A5), (B1)–(B5) and (C1)–(C3) hold. Then,

θ̂ − θ0 = n^{-1} ∑_{i=1}^n (Γ^T W Γ)^{-1} Γ^T W k(Xi, ΔiYi, Δi) + oP(n^{-1/2}),

and

n^{1/2}(θ̂ − θ0) →d N(0, Σ),

where Σ = (Γ^T W Γ)^{-1} Γ^T W Var{k(X, ΔY, Δ)} W Γ (Γ^T W Γ)^{-1},

k(x, δy, δ) = (δ/p(x)) g(x, y, θ0, h0) + (1 − δ/p(x)) E[g(x, Y, θ0, h0) | X = x] + ξ(x, δy, δ),

the function ξ is defined in condition (A4), and Γ = Γ(θ0), with

Γ(θ) = (d/dθ) G(θ, h0) = lim_{τ→0} τ^{-1} [G(θ + τ, h_{0,θ+τ}) − G(θ, h0θ)].

The proof of this result can be found in the Appendix.

Remark 1 (i) Instead of using an imputation approach to take care of the missing values, we could also estimate θ by minimizing

‖ n^{-1} ∑_{i=1}^n Δi g(Xi, Yi, θ, h) / p(Xi, β) ‖W


with respect to θ , where p(Xi ) = p(Xi , β) follows, e.g., a logistic model, and β

can be estimated by

β = argmaxβ

n∑

i=1

{�i log p(Xi , β) + (1 − �i ) log(1 − p(Xi , β))}.

See Robins et al. (1994), among many other papers, for more details on this estimation procedure based on inverse weighting by the missing propensity function.
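This inverse-weighting route can be sketched for the simplest possible estimating function g(X, Y, θ) = Y − θ (so θ0 = E(Y)): fit the logistic missingness model by Newton–Raphson on the log-likelihood of Remark 1(i), then solve the weighted estimating equation. The data-generating design and all names below are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: Y depends on X, and Y is MAR with a logistic propensity.
n = 20_000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])  # intercept + covariate
Y = 1.0 + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, n)            # E(Y) = 2
Delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([0.3, 1.0])))))

# Newton-Raphson for the logistic log-likelihood of Remark 1(i).
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (Delta - p)                       # score vector
    hess = -(X * (p * (1.0 - p))[:, None]).T @ X   # Hessian (negative definite)
    beta = beta - np.linalg.solve(hess, grad)

# Solve n^{-1} sum_i Delta_i (Y_i - theta) / p(X_i, beta_hat) = 0 for theta.
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
theta_ipw = np.sum(Delta * Y / p_hat) / np.sum(Delta / p_hat)
print(theta_ipw)   # close to E(Y) = 2
```

The naive complete-case mean Y[Delta == 1].mean() would be biased upward here, since units with large X (and hence large Y) are observed more often.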

(ii) Based on Theorem 1, the efficiency of the proposed estimator can be studied, and the optimal choice of the weight matrix W can be obtained. We do not elaborate on this in this paper, and refer, e.g., to Section 6 in Ai and Chen (2003) for more details.

(iii) In the above i.i.d. representation of θ̂ − θ0, the function ξ comes from the estimation of h0 by ĥ. In addition, note that if there are no missing data, then δ = 1 and p(·) ≡ 1, so k(x, δy, δ) = g(x, y, θ0, h0) + ξ(x, y, 1) in that case.

(iv) Note that when r = p (i.e., there are as many equations as there are parameters to be estimated), then Σ reduces to

Σ = Γ^{-1} Var{k(X, ΔY, Δ)} (Γ^T)^{-1},

provided Γ is of full rank.

4 Examples

4.1 Partially linear regression model

The first example we consider is the partially linear mean regression model:

Y = θ^T X1 + h(X2) + ε,   (5)

where E(ε|X) = 0, X = (X1^T, X2)^T is (d+1)-dimensional, Y is one-dimensional and for identifiability reasons we let E(h(X2)) = 0 (the linear part θ^T X1 contains an intercept). We suppose that the response Y is missing at random. Let (X1, Y1), …, (Xn, Yn) be i.i.d. coming from model (5). Define g(x, y, θ, h) = (y − θ^T x1 − h(x2)) x1, and let

ĥθ(x2) = ∑_{i=1}^n Wni(x2, bn) { Δi [Yi − θ^T X1i] + (1 − Δi) [m̂(Xi) − θ^T X1i] },

where

Wni(x2, bn) = Lb(X2i − x2) / ∑_{j=1}^n Lb(X2j − x2),


L is a univariate kernel function, b = bn → 0 is a bandwidth sequence, Lb(·) = L(·/b)/b, and m̂(x) = ∫ y dF̂(y|x), with F̂(y|x) given in (3). Finally, let θ̂ = argmin_θ ‖Gn(θ, ĥθ)‖, where ‖·‖ is the Euclidean norm. Instead of working with the above estimator of hθ(x2), we could also work with a weighted average of the non-missing observations only.

We now verify conditions (A1)–(A5) and (B1)–(B5). Let h0θ(x2) = h0(x2) − (θ − θ0)^T E(X1|X2 = x2) = E(Y|X2 = x2) − θ^T E(X1|X2 = x2), and let ‖h‖H = sup_{θ,x2} |hθ(x2)| for any h. First, note that

ĥθ(x2) − h0θ(x2) = {ĥθ(x2) − E[ĥθ(x2) | 𝒳]} + {E[ĥθ(x2) | 𝒳] − h0θ(x2)} = (T1 + T2)(x2),

where 𝒳 = (X1, …, Xn). Denoting Ỹi = Δi Yi + (1 − Δi) m̂(Xi), we have that E[Ỹi | 𝒳] = p(Xi) m(Xi) + (1 − p(Xi))(m(Xi) + OP(an^q)) = m(Xi) + oP(n^{-1/4}) uniformly in i, provided n an^{4q} → 0, and where m(x) = E(Y | X = x) and q is the order of the kernel K. Hence,

T1(x2) = ∑_{i=1}^n Wni(x2, bn) Δi [Yi − m(Xi)] + ∑_{i=1}^n Wni(x2, bn) (1 − Δi) [m̂(Xi) − m(Xi)] + oP(n^{-1/4})
= OP((n bn)^{-1/2} (log n)^{1/2}) + OP((n an^{d+1})^{-1/2} (log n)^{1/2}) + oP(n^{-1/4}) = oP(n^{-1/4}),

provided n bn^2 (log n)^{-2} → ∞ and n an^{2(d+1)} (log n)^{-2} → ∞. Next, consider

T2(x2) = ∑_{i=1}^n Wni(x2, bn) [m(Xi) − θ^T X1i] − h0θ(x2) + oP(n^{-1/4})
= ∑_{i=1}^n Wni(x2, bn) [E(Y|Xi) − θ^T X1i − E(Y|X2i) + θ^T E(X1|X2i)] + OP(bn^2) + oP(n^{-1/4})
= OP((n bn)^{-1/2} (log n)^{1/2}) + oP(n^{-1/4}) = oP(n^{-1/4}),

uniformly in θ , provided nb8n → 0. Hence, (A1) is verified. Next, for (A2), define

H = C1M (RX2), where the space C1

M (RX2) is defined in Remark 2, and where RX2

is the (compact) support of X2. Using a similar derivation as for verifying condition(A1), we can show that sup‖θ−θ0‖=o(1) supx2

|h′θ (x2) − h′

0θ (x2)| = oP (1), providednb3

n(log n)−1 → ∞, nad+1n b2

n(log n)−1 → ∞ and aqn b−1

n → 0. It then follows thatP (hθ ∈ C1

M (RX2)) → 1. Moreover, the second part of (A2) is valid for s = 1


by Remark 2. For condition (A3), note that for any θ and h, Γ(θ, h0)[h − h0] = −E{(hθ(X2) − h0θ(X2)) X1}. Hence,

‖Γ(θ, h0)[ĥ − h0] − Γ(θ0, h0)[ĥ − h0]‖
= ‖(θ − θ0)^T E{[∑_{i=1}^n Wni(X2, bn) X1i − E(X1|X2)] X1}‖ = oP(‖θ − θ0‖).

Next, consider (A4). Using the above derivations, write

ĥθ0(x2) − h0(x2) = ∑_{i=1}^n Wni(x2, bn) [ Δi {Yi − m(Xi)} + (1 − Δi) {m̂(Xi) − m(Xi)} ] + oP(n^{-1/2}),

provided n an^{2q} → 0 and n bn^4 → 0. Replacing Wni(x2, bn) by (n bn)^{-1} L((X2i − x2)/bn) / f_{X2}(x2), and noting that

bn^{-1} E{ (X1 / f_{X2}(X2)) L((X2i − X2)/bn) } = E(X1 | X2 = X2i) + O(bn^2),

uniformly in i, we obtain

Γ(θ0, h0)[ĥ − h0] = −E{(ĥθ0(X2) − h0(X2)) X1}
= −n^{-1} ∑_{i=1}^n E(X1|X2i) [ Δi {Yi − m(Xi)} + (1 − Δi) {m̂(Xi) − m(Xi)} ] + oP(n^{-1/2}).

It can be easily seen that

n^{-1} ∑_{i=1}^n E(X1|X2i) (1 − Δi) {m̂(Xi) − m(Xi)}
= n^{-1} ∑_{i=1}^n ((1 − p(Xi))/p(Xi)) E(X1|X2i) Δi {Yi − m(Xi)} + oP(n^{-1/2}),

and hence

ξ(Xi, ΔiYi, Δi) = −E(X1|X2i) (Δi/p(Xi)) {Yi − m(Xi)}.


For (A5) consider

Gn(θ, h) − G̃n(θ, h) = n^{-1} ∑_{i=1}^n (1 − Δi) [ κ^{-1} ∑_{l=1}^κ Y*il − E(Y | X = Xi) ] X1i
= n^{-1} ∑_{i=1}^n (1 − Δi) (m̂(Xi) − m(Xi)) X1i + oP(1) = oP(1),

as κ and n tend to infinity. Since this difference does not depend on θ, the second part of (A5) is obvious.

We now turn to the B-conditions. Condition (B1) is an identifiability condition, whereas (B2) holds true provided E|X1j| < ∞ for j = 1, …, d. Next, for (B3) it is easily seen that

Γ = −E[(X1 − E(X1|X2)) X1^T] = −E[Var(X1|X2)],

which we assume to be of full rank. Condition (B4) is automatically fulfilled, since G(θ, h) is linear in h. Finally, condition (B5) holds true for s = 1.

It now follows from Theorem 1 that n^{1/2}(θ̂ − θ0) is asymptotically normally distributed with asymptotic variance given by

Σ = Γ^{-1} Var[ (Δ/p(X)) {Y − θ0^T X1 − h0(X2)} {X1 − E(X1|X2)} ] (Γ^T)^{-1},

since E[g(x, Y, θ0, h0) | X = x] = 0. Note that if all data were observed, the matrix Σ would equal the variance–covariance matrix given in Robinson (1988), who considered the special case where ε is independent of X.

4.2 Single-index regression model

We now consider a single-index regression model:

Y = h(θ^T X) + ε,   (6)

where X is d-dimensional, Y is one-dimensional, E(ε|X) = 0 and θ ∈ Θ = {θ ∈ IR^d : ‖θ‖ = 1} for identifiability reasons. See Powell et al. (1989), Ichimura and Lee (1991), Ichimura (1993) and Härdle et al. (1993), among many others, for important results on the estimation and inference for this model when all the data are completely observed. We assume here that the response Y is missing at random. Suppose that the density of θ0^T X is bounded away from zero on its support (which is supposed to be compact). Let (X1, Y1), …, (Xn, Yn) be a sample of i.i.d. data drawn from model (6). Define g(x, y, θ, h, h′) = (y − h(θ^T x)) h′(θ^T x) x, and let


ĥθ(u) = ∑_{j=1}^n Wnj(u) Yj

be a kernel estimator of h0θ(u) = E(Y | θ^T X = u), where

Wnj(u) = Δj Lb(θ^T Xj − u) / ∑_{ℓ=1}^n Δℓ Lb(θ^T Xℓ − u),

L is a kernel function and b is an appropriate bandwidth. Finally, let θ̂ be a solution of the equation Gn(θ, ĥθ, ĥ′θ) = 0.

We restrict attention here to the calculation of the matrix Γ and the function ξ, which determine the asymptotic variance of θ̂. The verification of conditions (A1)–(A5) and (B1)–(B5) can be done by adapting the arguments used in the previous example to the context of single-index models. Details are omitted. Note that

Γ = (d/dθ) E[ (Y − h0θ(θ^T X)) h′0θ(θ^T X) X ] |_{θ=θ0}.

It can be easily seen that Γ can also be written as

Γ = −E[ h′0(θ0^T X)^2 {X − μX(θ0^T X)} {X − μX(θ0^T X)}^T ],

where μX(u) = E(X | θ0^T X = u). Also note that

Γ(θ0, h0)[ĥ − h0, ĥ′ − h′0] = −E[ {ĥθ0(θ0^T X) − h0(θ0^T X)} h′0(θ0^T X) X ]
+ E[ {Y − h0(θ0^T X)} {ĥ′θ0(θ0^T X) − h′0(θ0^T X)} X ].

Since the latter term equals zero, we have that

Γ(θ0, h0)[ĥ − h0, ĥ′ − h′0]
= −n^{-1} ∑_{i=1}^n (Δi/p(Xi)) {Yi − h0(θ0^T Xi)} h′0(θ0^T Xi) E(X | θ0^T X = θ0^T Xi) + oP(n^{-1/2}).

It now follows that the asymptotic variance of θ̂ equals

Σ = Γ^{-1} Var[ (Δ/p(X)) {Y − h0(θ0^T X)} h′0(θ0^T X) {X − E(X | θ0^T X)} ] (Γ^T)^{-1}.


It is easily seen that if all data were observed and the model were homoscedastic, the matrix Σ would reduce to the asymptotic variance formula given in Härdle et al. (1993) (see their Section 2.5).

4.3 Copula model

In the third example, we consider a copula model for two random variables X and Y, with X always observed and Y missing at random. We suppose that the copula belongs to a parametric family {Cθ : θ ∈ Θ}, where Θ is a subset of IR^p. Hence, for any x, y,

F(x, y) = P(X ≤ x, Y ≤ y) = Cθ(FX(x), FY(y)),

where FX(x) = P(X ≤ x) and FY(y) = P(Y ≤ y) are the marginals of X and Y, which are completely unspecified and will be estimated nonparametrically. We assume that FX and FY are continuous, so that θ0 is unique. Let (Xi, Yi), i = 1, …, n, be i.i.d. with common distribution F. Define F̂X(x) = (n + 1)^{-1} ∑_{i=1}^n I(Xi ≤ x) and

F̂Y(y) = (n + 1)^{-1} ∑_{i=1}^n { Δi I(Yi ≤ y) + (1 − Δi) F̂(y|Xi) },

where F̂(y|x) is defined in (3). Note that in a second step, the estimator F̂Y(y) could be improved by replacing F̂(y|x) by F̂θ(y|x) = C^2_θ(F̂X(x), F̂Y(y)), where C^2_θ denotes the derivative of Cθ with respect to the second argument (similar notations are used to denote higher-order derivatives). In what follows, we restrict attention to the estimator defined in the first step. Now, define

g(x, y, θ, FX, FY) = C^{12′}_θ(FX(x), FY(y)) / C^{12}_θ(FX(x), FY(y))

(the prime in C^{12′}_θ stands for the derivative with respect to θ), i.e., g(x, y, θ, FX, FY) is the derivative of the log-likelihood function under the copula model, and define θ̂ = argmin_{θ∈Θ} ‖Gn(θ, F̂X, F̂Y)‖.

We now calculate the function ξ defined in condition (A4), from which the formula of the asymptotic variance of θ̂ can be easily obtained. We omit the verification of the conditions of Theorem 1, but more details can be obtained from the authors. Let dθ(u, v) = C^{12′}_θ(u, v)/C^{12}_θ(u, v). Then, straightforward calculations show that

Γ(θ, FX, FY)[F̂X − FX, F̂Y − FY] = E[ (∂/∂u) dθ(u, FY(Y))|_{u=FX(X)} (F̂X(X) − FX(X))
+ (∂/∂v) dθ(FX(X), v)|_{v=FY(Y)} (F̂Y(Y) − FY(Y)) ].


Next, note that

F̂Y(y) − FY(y) = n^{-1} ∑_{i=1}^n [ Δi I(Yi ≤ y) + (1 − Δi) F(y|Xi) − FY(y) ]
+ n^{-1} ∑_{i=1}^n (1 − Δi) [ F̂(y|Xi) − F(y|Xi) ].

For the second term above, it can be shown that

n^{-1} ∑_{i=1}^n (1 − Δi) E[ (∂/∂v) dθ(FX(X), v)|_{v=FY(Y)} (F̂(Y|Xi) − F(Y|Xi)) ]
= n^{-1} ∑_{i=1}^n ((1 − p(Xi))/p(Xi)) Δi E[ (∂/∂v) dθ(FX(X), v)|_{v=FY(Y)} (I(Yi ≤ Y) − F(Y|Xi)) ] + oP(n^{-1/2}).

It now follows that

Γ(θ0, FX, FY)[F̂X − FX, F̂Y − FY]
= n^{-1} ∑_{i=1}^n E[ (∂/∂u) dθ0(u, FY(Y))|_{u=FX(X)} { I(Xi ≤ X) − FX(X) }
+ (∂/∂v) dθ0(FX(X), v)|_{v=FY(Y)} { Δi I(Yi ≤ Y) + (1 − Δi) F(Y|Xi) − FY(Y)
+ ((1 − p(Xi))/p(Xi)) Δi (I(Yi ≤ Y) − F(Y|Xi)) } ] + oP(n^{-1/2}).

The formula of the asymptotic variance can now be easily obtained from Theorem 1.

5 Simulation study

In this section, we present simulation results for a single-index mean regression model. Consider

Y = h(θ^T X) + ε,

where X = (X1, X2, X3)^T, Xj ∼ Unif[0, 1] (j = 1, 2, 3), X1, X2, X3 are mutually independent, ε ∼ N(0, 0.5^2), θ ∈ Θ = {θ ∈ IR^3 : ‖θ‖ = 1}, and h(u) = exp(u). The probability that the response Y is missing depends on the covariate X via a logistic model:

P(Δ = 1 | X, Y) = exp(β^T X) / (1 + exp(β^T X)).


Table 1 Bias, standard deviation and MSE of α̂1 and α̂2 when β = (1, 1, 0)^T (leading to 27 % of missing values)

n     κ     a      b      α̂1 Bias    α̂1 Std    α̂2 Bias    α̂2 Std    MSE

50 50 0.14 0.30 −0.0161 0.114 0.0318 0.137 0.0331

0.14 0.28 −0.0203 0.119 0.0234 0.142 0.0351

0.12 0.22 −0.0131 0.123 0.0131 0.141 0.0355

0.18 0.30 −0.0205 0.111 0.0255 0.149 0.0356

0.14 0.20 −0.0078 0.128 0.0171 0.140 0.0363

0.08 0.26 −0.0001 0.126 0.0121 0.144 0.0366

0.12 0.26 −0.0004 0.124 0.0180 0.145 0.0367

0.02 0.28 −0.0167 0.133 0.0202 0.136 0.0369

0.16 0.24 −0.0103 0.127 0.0088 0.146 0.0375

100 50 0.14 0.24 −0.0135 0.083 0.0192 0.091 0.0156

0.12 0.26 −0.0080 0.081 0.0206 0.093 0.0158

0.14 0.28 −0.0068 0.078 0.0284 0.094 0.0159

0.12 0.16 −0.0062 0.082 0.0108 0.096 0.0160

0.16 0.30 −0.0116 0.078 0.0193 0.097 0.0161

0.08 0.24 −0.0171 0.081 0.0210 0.095 0.0164

0.10 0.20 −0.0069 0.084 0.0069 0.098 0.0167

0.10 0.30 −0.0238 0.081 0.0347 0.093 0.0169

0.14 0.20 −0.0109 0.083 0.0151 0.099 0.0170

Note that Θ can be regarded as compact by reparametrizing the unit sphere via a polar transformation and using the angles of the polar transformation as the new parameter to replace θ, i.e., we write θ(α) = (sin(α1) sin(α2), sin(α1) cos(α2), cos(α1))^T for some α1, α2. Note that ‖θ(α)‖ = 1 for any value of α1 and α2. In the simulations, we work with α1 = π/3 and α2 = π/6, and we take β = (1, 1, 0)^T in Table 1 and β = (0.5, 0.5, 0)^T in Table 2.

The estimating function for α1 and α2 is given by

g(x, y, α, h) = {y − h(θ(α)^T x)} h′(θ(α)^T x) ( cos(α1) sin(α2) x1 + cos(α1) cos(α2) x2 − sin(α1) x3 ,  sin(α1) cos(α2) x1 − sin(α1) sin(α2) x2 )^T.

Next, define ĥ(u) = ∑_{j=1}^n Wnj(u) Yj, where

Wnj(u) = Δj Kb(u − θ(α)^T Xj) / ∑_{i=1}^n Δi Kb(u − θ(α)^T Xi).


Table 2 Bias, standard deviation and MSE of α̂1 and α̂2 when β = (0.5, 0.5, 0)^T (leading to 38 % of missing values)

n     κ     a      b      α̂1 Bias    α̂1 Std    α̂2 Bias    α̂2 Std    MSE

50 50 0.02 0.22 −0.0125 0.147 0.0120 0.150 0.0442

0.18 0.30 −0.0076 0.132 0.0175 0.164 0.0447

0.12 0.26 −0.0009 0.144 0.0138 0.160 0.0463

0.14 0.22 −0.0224 0.142 0.0123 0.161 0.0469

0.18 0.26 −0.0078 0.147 0.0178 0.162 0.0481

0.02 0.26 −0.0199 0.151 0.0267 0.157 0.0486

0.14 0.28 −0.0069 0.140 0.0173 0.170 0.0486

0.12 0.24 −0.0084 0.144 0.0102 0.167 0.0487

0.18 0.28 −0.0034 0.146 0.0159 0.166 0.0490

100 50 0.12 0.30 −0.0166 0.092 0.0225 0.103 0.0199

0.14 0.18 −0.0068 0.092 0.0119 0.107 0.0202

0.10 0.28 −0.0092 0.091 0.0119 0.110 0.0207

0.20 0.30 −0.0046 0.089 0.0304 0.108 0.0207

0.14 0.24 −0.0079 0.094 0.0125 0.110 0.0211

0.14 0.28 −0.0111 0.093 0.0276 0.108 0.0211

0.12 0.16 −0.0028 0.096 0.0126 0.109 0.0211

0.14 0.22 −0.0097 0.094 0.0136 0.112 0.0216

0.20 0.22 −0.0038 0.094 0.0122 0.112 0.0216

The imputed estimating equation is

Gn(α, h) = n^{-1} ∑_{i=1}^n [ Δi g(Xi, Yi, α, h) + (1 − Δi) κ^{-1} ∑_{l=1}^κ g(Xi, Y*il, α, h) ].   (7)

We now define α̂ as the solution of the equation Gn(α, ĥ) = 0 with respect to all vectors α ∈ [0, 2π]^2. Throughout the simulations, κ was set to 50.

Two bandwidths need to be selected. The first one is the bandwidth a in the kernel estimator of the conditional distribution function F(y|x) defined in (3), which is used for the imputation procedure. The second one is the bandwidth b in the kernel estimator of the function h. Because the cross-validation method for selecting bandwidths is complicated when imputation is involved, and because it can only produce one combination of bandwidths, we prefer to use a sequence of bandwidths for a and b to study the influence of the bandwidths on the estimator. In our simulation, we considered for both a and b all values between 0.02 and 0.30 with step size 0.02. So, we obtain 225 combinations of the bandwidths used for estimating α. We report in Tables 1 and 2 the results corresponding to the 9 pairs of bandwidths a and b that lead to the smallest MSEs.
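The bandwidth grid just described is a 15 × 15 Cartesian product; in code (a trivial sketch, with the simulation loop and MSE ranking omitted):

```python
import numpy as np

# All bandwidth values between 0.02 and 0.30 with step size 0.02.
grid = np.round(np.arange(0.02, 0.30 + 1e-9, 0.02), 2)
pairs = [(a, b) for a in grid for b in grid]
print(len(grid), len(pairs))   # 15 bandwidth values, 225 (a, b) combinations
```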

Table 1 summarizes the bias, standard deviation and MSE of the estimators of α1 and α2 based on the selected a and b. The MSE is the sum of the MSEs of α̂1 and


α̂2. Each entry is based on 500 simulations. The table shows that the influence of the bandwidth is rather small. We also notice that the bias is reasonably close to 0 and, as the sample size increases, the standard deviation and MSE of α̂1 and α̂2 decrease. In Table 2, the percentage of missing values is 38 % compared to 27 % in Table 1. As can be expected, the standard deviation and the MSE of α̂1 and α̂2 are larger compared to Table 1.

Appendix

Below we list the assumptions that are needed for the main result in Sect. 3. The following notations are needed. We equip the space H with a semi-norm ‖·‖H, defined by ‖h‖H = sup_{θ∈Θ} ‖hθ‖S for any h ∈ H, i.e., ‖·‖H is a sup-norm with respect to the θ-argument and a semi-norm ‖·‖S with respect to all the other arguments. In addition, N(λ, H, ‖·‖H) is the covering number of the class H with respect to the norm ‖·‖H, i.e., the minimal number of balls of ‖·‖H-radius λ needed to cover H. We use the notation Γ(θ) = (d/dθ) G(θ, h0) to denote the complete derivative of G(θ, h0) with respect to θ, i.e.,

Γ(θ) = (d/dθ) G(θ, h0) = lim_{τ→0} τ^{-1} [G(θ + τ, h_{0,θ+τ}) − G(θ, h0θ)],   (8)

and the notation $\Gamma(\theta, h_0)[h - h_0]$ to denote the functional derivative of $G(\theta, h_0)$ in the direction $[h - h_0]$, i.e.,

\[
\Gamma(\theta, h_0)[h - h_0] = \lim_{\tau \to 0} \frac{1}{\tau} \big[ G(\theta, h_0 + \tau(h - h_0)) - G(\theta, h_0) \big].
\]

We also need to introduce the function

\[
G_n(\theta, h) = n^{-1} \sum_{i=1}^{n} \big[ \Delta_i\, g(X_i, Y_i, \theta, h) + (1 - \Delta_i)\, E\{ g(X_i, Y, \theta, h) \mid X = X_i \} \big].
\]

Finally, $\|\cdot\|$ denotes the Euclidean norm.

Assumptions on the estimator $\hat h$

(A1) $\sup_{\theta\in\Theta} \|\hat h_\theta - h_{0\theta}\|_{\mathcal{S}} = o_P(1)$, and $\sup_{\|\theta-\theta_0\|\le\delta_n} \|\hat h_\theta - h_{0\theta}\|_{\mathcal{S}} = o_P(n^{-1/4})$, where $\delta_n \to 0$, and where $h_{0\theta}$ is such that $h_{0\theta_0} \equiv h_0$.

(A2) $P(\hat h_\theta \in \mathcal{H}) \to 1$ as $n$ tends to infinity, uniformly over all $\theta$ with $\|\theta-\theta_0\| = o(1)$, and $\int_0^\infty \sqrt{\log N(\lambda^{1/s}, \mathcal{H}, \|\cdot\|_{\mathcal{H}})}\, d\lambda < \infty$, where $0 < s \le 1$ is defined in condition (B5).

(A3) For $\theta$ in a neighborhood of $\theta_0$, $\|\Gamma(\theta, h_0)[\hat h - h_0] - \Gamma(\theta_0, h_0)[\hat h - h_0]\|_W = o_P(\|\theta - \theta_0\|)$.

(A4) $\Gamma(\theta_0, h_0)[\hat h - h_0] = n^{-1} \sum_{i=1}^n \xi(X_i, \Delta_i Y_i, \Delta_i) + o_P(n^{-1/2})$, where the function $\xi = (\xi_1, \ldots, \xi_r)$ satisfies $E[\xi_j(X, \Delta Y, \Delta)] = 0$ and $E[\xi_j^2(X, \Delta Y, \Delta)] < \infty$ for $j = 1, \ldots, r$.


(A5) $\sup_{\theta\in\Theta} \|\hat G_n(\theta, \hat h) - G_n(\theta, \hat h)\|_W = o_P(1)$, and for any $\delta_n \to 0$,

\[
\sup_{\|\theta-\theta_0\|\le\delta_n} \|\hat G_n(\theta, \hat h) - G_n(\theta, \hat h) - \hat G_n(\theta_0, h_0) + G_n(\theta_0, h_0)\|_W = o_P(n^{-1/2}).
\]

Assumptions on the function $G$

(B1) For all $\delta > 0$, there exists $\varepsilon > 0$ such that $\inf_{\|\theta-\theta_0\|>\delta} \|G(\theta, h_0)\|_W \ge \varepsilon > 0$.

(B2) Uniformly for all $\theta \in \Theta$, $G(\theta, h)$ is continuous in $h$ with respect to the $\|\cdot\|_{\mathcal{H}}$-norm at $h = h_0$.

(B3) The matrix $\Gamma(\theta)$ exists for $\theta$ in a neighborhood of $\theta_0$, and is continuous at $\theta = \theta_0$. Moreover, $\Gamma \equiv \Gamma(\theta_0)$ is of full rank.

(B4) For all $\theta$ in a neighborhood of $\theta_0$, $\Gamma(\theta, h_0)[h - h_0]$ exists in all directions $[h - h_0]$, and for some $0 < c < \infty$, $\|G(\theta, h) - G(\theta, h_0) - \Gamma(\theta, h_0)[h - h_0]\|_W \le c \|h - h_0\|_{\mathcal{H}}^2$.

(B5) For each $j = 1, \ldots, r$,

\[
E\Big[ \sup_{(\theta',h'):\, \|\theta'-\theta\|\le\delta,\ \|h'-h\|_{\mathcal{H}}\le\delta} |g_j(X, Y, \theta', h') - g_j(X, Y, \theta, h)|^2 \Big] \le K \delta^{2s},
\]

for all $(\theta, h) \in \Theta \times \mathcal{H}$ and $\delta > 0$, and for some constants $0 < K < \infty$ and $0 < s \le 1$.

Regularity assumptions

(C1) The kernel $K$ satisfies $K(u_1, \ldots, u_{d_x}) = \prod_{j=1}^{d_x} k(u_j)$, where $k$ is a $q$th-order ($q \ge 2$) univariate probability density function supported on $[-1, 1]$, and $k$ is bounded, symmetric and Lipschitz continuous. The bandwidth $a_n$ satisfies $n a_n^{d_x} \to \infty$ and $n a_n^{2q} \to 0$. Moreover, $\kappa \to \infty$.

(C2) For all $x_0 \in \mathcal{X}$ and $\theta \in \Theta$, the function $x \mapsto E[g^{\ell}(x_0, Y, \theta, h_0) \mid X = x]$ is uniformly continuous in $x$ for $\ell = 1$ and $2$, and $E[g^2(X, Y, \theta, h_0)] < \infty$.

(C3) For all $x_0 \in \mathcal{X}$ and $\theta \in \Theta$, the functions $x \mapsto p(x)$, $f_X(x)$, $m_g(x_0, x, \theta, h_0)$ and $m_{g^2}(x_0, x, \theta, h_0)$ are $q$ times continuously differentiable with respect to the components of $x$ on the interior of their support. Here, $f_X$ is the probability density function of $X$ and $m_{g^{\ell}}(x_0, x, \theta, h) = E[g^{\ell}(x_0, Y, \theta, h) \mid X = x]$. Moreover, $\inf_{x\in\mathcal{X}} p(x) > 0$.

Remark 2 Suppose the functions in $\mathcal{H}$ have a compact support $R$ of dimension $d \le d_x + d_y$. To check condition (A2), define for any vector $a = (a_1, \ldots, a_d)$ of $d$ integers the differential operator

\[
D^a = \frac{\partial^{|a|}}{\partial u_1^{a_1} \cdots \partial u_d^{a_d}},
\]

where $|a| = \sum_{i=1}^d a_i$. For any smooth function $h : R \to \mathbb{R}$ and some $\alpha > 0$, let $\underline{\alpha}$ be the largest integer smaller than $\alpha$, and


\[
\|h\|_{\infty,\alpha} = \max_{|a| \le \underline{\alpha}} \sup_u |D^a h(u)| + \max_{|a| = \underline{\alpha}} \sup_{u \ne u'} \frac{|D^a h(u) - D^a h(u')|}{\|u - u'\|^{\alpha - \underline{\alpha}}}.
\]

Further, let $C_M^{\alpha}(R)$ be the set of all continuous functions $h : R \to \mathbb{R}$ with $\|h\|_{\infty,\alpha} \le M$. If $\mathcal{H} \subset C_M^{\alpha}(R)$ with $\|\cdot\|_{\mathcal{H}} = \|\cdot\|_{\infty}$, then $\log N(\lambda, \mathcal{H}, \|\cdot\|_{\infty}) \le K (M/\lambda)^{d/\alpha}$, where $K$ is a constant depending only on $d$, $\alpha$ and the Lebesgue measure of the domain $R$. Hence,

\[
\int_0^{\infty} \sqrt{\log N(\lambda^{1/s}, \mathcal{H}, \|\cdot\|_{\infty})}\, d\lambda < \infty \quad \text{if } \alpha > \frac{d}{2s}
\]

(see Theorem 2.7.1 in Van der Vaart and Wellner (1996)).
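To make the last implication explicit (this short verification is ours, derived from the covering bound just stated): since $N(\lambda^{1/s}, \mathcal{H}, \|\cdot\|_{\infty}) = 1$ once $\lambda^{1/s}$ exceeds the $\|\cdot\|_{\infty}$-diameter of $\mathcal{H}$, the entropy integral effectively runs over a bounded interval $[0, \lambda_0]$, and

```latex
\log N(\lambda^{1/s},\mathcal{H},\|\cdot\|_\infty)
  \le K\,(M/\lambda^{1/s})^{d/\alpha}
  = K M^{d/\alpha}\,\lambda^{-d/(s\alpha)},
\qquad\text{so}\qquad
\int_0^{\lambda_0}\!\sqrt{\log N(\lambda^{1/s},\mathcal{H},\|\cdot\|_\infty)}\,d\lambda
  \le K^{1/2} M^{d/(2\alpha)}\int_0^{\lambda_0}\lambda^{-d/(2s\alpha)}\,d\lambda,
```

and the right-hand integral is finite precisely when $d/(2s\alpha) < 1$, i.e., when $\alpha > d/(2s)$.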

Proof of Theorem 1 The proof is based on Theorems 1 and 2 in Chen et al. (2003) (CLV hereafter). In these theorems, high-level conditions are given under which the estimator $\hat\theta$ is, respectively, weakly consistent and asymptotically normal. We start with verifying the conditions of Theorem 1 in CLV. Condition (1.1) holds by definition of $\hat\theta$, while the second, third and fourth condition are guaranteed by assumptions (B1), (B2) and (A1). Finally, condition (1.5) can be treated in a similar way as condition (2.5) of Theorem 2 of CLV, which we check below. Next, we verify conditions (2.1)–(2.6) of Theorem 2 in CLV. Condition (2.1) is, as for condition (1.1), valid by construction of the estimator $\hat\theta$. For (2.2) and the first part of (2.3), use assumptions (B3) and (B4). For the second part of (2.3), it follows from the proof in CLV that it suffices to assume (A3). Next, for condition (2.4), we use assumptions (A1) and (A2). For (2.5), note that

\[
\begin{aligned}
&\|\hat G_n(\theta, h) - G(\theta, h) - \hat G_n(\theta_0, h_0) + G(\theta_0, h_0)\|_W \\
&\quad \le \|\hat G_n(\theta, h) - G_n(\theta, h) - \hat G_n(\theta_0, h_0) + G_n(\theta_0, h_0)\|_W \\
&\qquad + \|G_n(\theta, h) - G(\theta, h) - G_n(\theta_0, h_0) + G(\theta_0, h_0)\|_W.
\end{aligned}
\]

For the first term above, note that it follows from the proof of Theorem 2 in CLV that it suffices to take $h = \hat h$. Hence, assumption (A5) gives the required rate. For the second term, we verify the conditions of Theorem 3 in CLV. For condition (3.2), note that for $j = 1, \ldots, r$ and for fixed $(\theta, h) \in \Theta \times \mathcal{H}$,

\[
\begin{aligned}
E\big[ \sup\nolimits^{*} \big| &\Delta\, g_j(X, Y, \theta', h') - \Delta\, g_j(X, Y, \theta, h) - (1-\Delta) E\{g_j(X, Y, \theta', h') \mid X\} \\
&+ (1-\Delta) E\{g_j(X, Y, \theta, h) \mid X\} \big| \big] \le 2\, E\big[ \sup\nolimits^{*} |g_j(X, Y, \theta', h') - g_j(X, Y, \theta, h)| \big],
\end{aligned}
\]

where $\sup^{*}$ is the supremum over all $\|\theta' - \theta\| \le \delta_n$ and $\|h' - h\|_{\mathcal{H}} \le \delta_n$, with $\delta_n \to 0$. Hence, condition (3.2) follows from assumption (B5), whereas condition (3.3) is given in (A2). Finally, condition (2.6) is valid by combining assumption (A4), Proposition 1 and the central limit theorem. $\square$


Proposition 1 Assume that conditions (A1)–(A5), (B1)–(B5) and (C1)–(C3) hold. Then,

\[
\hat G_n(\theta_0, h_0) = n^{-1} \sum_{i=1}^n \Big\{ \frac{\Delta_i}{p(X_i)}\, g(X_i, Y_i, \theta_0, h_0) + \Big( 1 - \frac{\Delta_i}{p(X_i)} \Big) E[g(X_i, Y, \theta_0, h_0) \mid X = X_i] \Big\} + o_P(n^{-1/2}).
\]
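A short sanity check, added here and not part of the original statement: under missingness at random ($\Delta \perp Y \mid X$), the summands in this expansion are mean zero at $\theta_0$, which is what the central limit theorem step in the proof of Theorem 1 requires. Writing $m_g(X) = E[g(X, Y, \theta_0, h_0) \mid X]$,

```latex
E\Big[\frac{\Delta}{p(X)}\,g(X,Y,\theta_0,h_0)+\Big(1-\frac{\Delta}{p(X)}\Big)m_g(X)\Big]
 = E[m_g(X)]
   + E\Big[\frac{E[\Delta\mid X,Y]}{p(X)}\,\{g(X,Y,\theta_0,h_0)-m_g(X)\}\Big]
 = E[g(X,Y,\theta_0,h_0)] = G(\theta_0,h_0) = 0,
```

since $E[\Delta \mid X, Y] = p(X)$ under MAR, so that the second expectation reduces to $E[g - m_g(X)] = 0$, and $G(\theta_0, h_0) = 0$ at the true parameter.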

Proof First we consider

\[
\begin{aligned}
\hat G_n(\theta_0, h_0) - G_n(\theta_0, h_0)
&= n^{-1} \sum_{i=1}^n (1 - \Delta_i) \Big\{ \frac{1}{\kappa} \sum_{l=1}^{\kappa} g(X_i, Y^{*}_{il}, \theta_0, h_0) - m_{ga}(X_i, \theta_0, h_0) \Big\} \\
&\quad + n^{-1} \sum_{i=1}^n (1 - \Delta_i) \big\{ m_{ga}(X_i, \theta_0, h_0) - m_g(X_i, \theta_0, h_0) \big\} \\
&:= V_{n1}(\theta_0, h_0) + V_{n2}(\theta_0, h_0),
\end{aligned}
\]

where $m_g(x, \theta_0, h_0) = E[g(x, Y, \theta_0, h_0) \mid X = x]$ and

\[
m_{ga}(x, \theta_0, h_0) = \sum_{j=1}^n \frac{\Delta_j K_a(X_j - x)}{\sum_{l=1}^n \Delta_l K_a(X_l - x)}\, g(x, Y_j, \theta_0, h_0)
\]

is the conditional mean imputation based on the kernel estimator of the conditional distribution. We will first show that $V_{n1}(\theta_0, h_0) = o_P(n^{-1/2})$, which means that we can substitute $\kappa^{-1} \sum_{l=1}^{\kappa} g(X_i, Y^{*}_{il}, \theta_0, h_0)$ by the conditional mean imputation $m_{ga}(X_i, \theta_0, h_0)$, which simplifies the theoretical analysis. However, the proposed imputation is attractive in practical implementations as it separates the imputation and analysis steps, as proposed by Little and Rubin (2002). Let $\chi_{nc} = \{(X_j, Y_j, \Delta_j = 1) : j = 1, \ldots, n\}$ be the complete part of the sample with no missing values. Given $X_i$ with $\Delta_i = 0$, write $m_{g\kappa}(X_i, \theta_0, h_0) = \kappa^{-1} \sum_{l=1}^{\kappa} g(X_i, Y^{*}_{il}, \theta_0, h_0)$. From the way we impute $Y^{*}_{il}$,

\[
E\{ m_{g\kappa}(X_i, \theta_0, h_0) \mid \chi_{nc}, X_i, \Delta_i = 0 \} = m_{ga}(X_i, \theta_0, h_0). \qquad (9)
\]
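The imputation step and the conditional mean it targets can be sketched numerically. This is our own illustration, not the paper's code: it assumes an Epanechnikov product kernel in one covariate dimension, and draws $Y^{*}_{il}$ by resampling complete-case responses with the kernel weights, which is one way of sampling from the kernel estimate of $F(\cdot|x)$; none of the function names come from the paper.

```python
import numpy as np

def epanechnikov(u):
    """Second-order kernel supported on [-1, 1], as required by (C1)."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def mean_imputation(x0, X, Y, delta, a, g):
    """Kernel-weighted conditional mean m_ga(x0): sum_j w_j(x0, a) g(x0, Y_j),
    with Nadaraya-Watson weights over the complete cases (delta_j = 1)."""
    w = delta * epanechnikov((X - x0) / a)
    w = w / w.sum()
    return sum(wj * g(x0, yj) for wj, yj in zip(w, Y))

def draw_imputations(x0, X, Y, delta, a, kappa, rng):
    """Draw kappa values Y*_il from the kernel estimate of F(.|x0) by
    resampling complete-case responses with the weights w_j(x0, a)."""
    w = delta * epanechnikov((X - x0) / a)
    w = w / w.sum()
    return rng.choice(Y, size=kappa, replace=True, p=w)
```

Averaging $g(x_0, \cdot)$ over the $\kappa$ draws has conditional mean $m_{ga}(x_0, \theta_0, h_0)$ given the complete cases, which is exactly identity (9).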

Hence, $E\{V_{n1}(\theta_0, h_0)\} = 0$. We next calculate the variance of $V_{n1}(\theta_0, h_0)$. Note that

\[
\begin{aligned}
\operatorname{Var}\{V_{n1}(\theta_0, h_0)\} = n^{-2} \sum_{i,j} \operatorname{Cov}\big\{ & (1-\Delta_i)[m_{g\kappa}(X_i, \theta_0, h_0) - m_{ga}(X_i, \theta_0, h_0)], \\
& (1-\Delta_j)[m_{g\kappa}(X_j, \theta_0, h_0) - m_{ga}(X_j, \theta_0, h_0)] \big\}.
\end{aligned}
\]


If $i \ne j$, then conditioning on $\chi_{nc}$, $(X_i, \Delta_i = 0)$ and $(X_j, \Delta_j = 0)$,

\[
\operatorname{Cov}\big[ m_{g\kappa}(X_i, \theta_0, h_0),\, m_{g\kappa}(X_j, \theta_0, h_0) \mid \chi_{nc}, (X_i, \Delta_i = 0), (X_j, \Delta_j = 0) \big] = 0.
\]

This together with (9) implies that

\[
\begin{aligned}
\operatorname{Var}\{V_{n1}(\theta_0, h_0)\}
&= n^{-2} \sum_{i=1}^n \operatorname{Var}\big\{ (1-\Delta_i)[m_{g\kappa}(X_i, \theta_0, h_0) - m_{ga}(X_i, \theta_0, h_0)] \big\} \qquad (10) \\
&= n^{-1} E\big[ \operatorname{Var}\big\{ m_{g\kappa}(X_i, \theta_0, h_0) - m_{ga}(X_i, \theta_0, h_0) \mid \chi_{nc}, (X_i, \Delta_i = 0) \big\} \big] \\
&= (n\kappa)^{-1} E\bigg[ \frac{\sum_{l=1}^n \Delta_l K_a(X_l - X_i)\, (g g^{\mathrm T})(X_i, Y_l, \theta_0, h_0)}{\sum_{l=1}^n \Delta_l K_a(X_l - X_i)} - (m_{ga} m_{ga}^{\mathrm T})(X_i, \theta_0, h_0) \bigg] \\
&= O\{(n\kappa)^{-1}\}, \qquad (11)
\end{aligned}
\]

provided (C2) and (C3) hold true. Hence, $V_{n1}(\theta_0, h_0) = o_P(n^{-1/2})$. Next, we consider the term $V_{n2}(\theta_0, h_0)$. Defining $w_j(x, a) = \Delta_j K_a(X_j - x) / \big\{ \sum_{l=1}^n \Delta_l K_a(X_l - x) \big\}$, we have,

\[
\begin{aligned}
V_{n2}(\theta_0, h_0)
&= n^{-1} \sum_{i=1}^n (1-\Delta_i) \sum_{k=1}^n w_k(X_i, a) \big\{ g(X_i, Y_k, \theta_0, h_0) - m_g(X_i, \theta_0, h_0) \\
&\qquad\qquad - g(X_k, Y_k, \theta_0, h_0) + m_g(X_k, \theta_0, h_0) \big\} \\
&\quad + n^{-1} \sum_{i=1}^n (1-\Delta_i) \sum_{k=1}^n w_k(X_i, a) \big\{ g(X_k, Y_k, \theta_0, h_0) - m_g(X_k, \theta_0, h_0) \big\} \\
&= V_{n21}(\theta_0, h_0) + V_{n22}(\theta_0, h_0).
\end{aligned}
\]

First, we will show that $V_{n21}(\theta_0, h_0) = o_P(n^{-1/2})$. Note that $E[V_{n21}(\theta_0, h_0)] = 0$ and, using the notation

\[
\begin{aligned}
\gamma_{nj}(x_1, x_2) = (1 - p(x_1))(1 - p(x_2))\, \operatorname{Cov}\big[ & g(x_1, Y_j, \theta_0, h_0) - g(X_j, Y_j, \theta_0, h_0), \\
& g(x_2, Y_j, \theta_0, h_0) - g(X_j, Y_j, \theta_0, h_0) \mid X_j \big],
\end{aligned}
\]

we have,

\[
\begin{aligned}
\operatorname{Var}[V_{n21}(\theta_0, h_0)]
&= E\big\{ \operatorname{Var}[V_{n21}(\theta_0, h_0) \mid X_1, \ldots, X_n] \big\} \\
&= n^{-2} \sum_{i,j,k=1}^n E\big\{ w_j(X_i, a)\, w_j(X_k, a)\, \gamma_{nj}(X_i, X_k) \big\} \\
&= n^{-2} \sum_{i,j,k=1}^n E\big\{ w_j(X_i, a)\, w_j(X_k, a)\, \gamma_{nj}(X_j, X_j) \big\} + O(n^{-1} a_n^q) \\
&= O(n^{-1} a_n^q),
\end{aligned}
\]

since $\gamma_{nj}(X_j, X_j) = 0$. Hence, $V_{n21}(\theta_0, h_0) = o_P(n^{-1/2})$. Next, note that

\[
V_{n22}(\theta_0, h_0) = n^{-1} \sum_{i=1}^n \Delta_i \big\{ g(X_i, Y_i, \theta_0, h_0) - m_g(X_i, \theta_0, h_0) \big\} \frac{1 - p(X_i)}{p(X_i)} + o_P(n^{-1/2}),
\]

where the rate of the remainder term can be shown in a similar way as for $V_{n21}(\theta_0, h_0)$. This shows that

\[
\begin{aligned}
\hat G_n(\theta_0, h_0)
&= G_n(\theta_0, h_0) + [\hat G_n(\theta_0, h_0) - G_n(\theta_0, h_0)] \\
&= n^{-1} \sum_{j=1}^n \big[ \Delta_j\, g(X_j, Y_j, \theta_0, h_0) + (1 - \Delta_j)\, m_g(X_j, \theta_0, h_0) \big] \\
&\quad + n^{-1} \sum_{j=1}^n \Delta_j \big[ g(X_j, Y_j, \theta_0, h_0) - m_g(X_j, \theta_0, h_0) \big] \frac{1 - p(X_j)}{p(X_j)} + o_P(n^{-1/2}) \\
&= n^{-1} \sum_{j=1}^n \Big[ \frac{\Delta_j}{p(X_j)}\, g(X_j, Y_j, \theta_0, h_0) + \Big( 1 - \frac{\Delta_j}{p(X_j)} \Big) m_g(X_j, \theta_0, h_0) \Big] + o_P(n^{-1/2}). \qquad \square
\end{aligned}
\]
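The final equality is pure algebra; spelling it out (our added step, with arguments suppressed, writing $g$, $m$, $\Delta$, $p$ for $g(X_j, Y_j, \theta_0, h_0)$, $m_g(X_j, \theta_0, h_0)$, $\Delta_j$, $p(X_j)$):

```latex
\Delta g + (1-\Delta)\,m + \Delta\,(g-m)\,\frac{1-p}{p}
 = \Delta g\Big(1+\frac{1-p}{p}\Big) + m\Big(1-\Delta-\Delta\,\frac{1-p}{p}\Big)
 = \frac{\Delta}{p}\,g + \Big(1-\frac{\Delta}{p}\Big)\,m .
```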

Acknowledgments Song Xi Chen acknowledges support from a National Science Foundation Grant SES-0518904, a National Natural Science Foundation of China Key Grant 11131002, and the MOE Key Lab in Econometrics and Quantitative Finance at Peking University. Ingrid Van Keilegom acknowledges financial support from IAP research network P6/03 of the Belgian Government (Belgian Science Policy), and from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC Grant agreement No. 203650. Both authors thank two referees for constructive comments and suggestions which have improved the presentation of the paper.

References

Ai, C., Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71, 1795–1843.

Chen, Q., Zeng, D., Ibrahim, J. G. (2007). Sieve maximum likelihood estimation for regression models with covariates missing at random. Journal of the American Statistical Association, 102, 1309–1317.

Chen, X., Hong, H., Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. Annals of Statistics, 36, 808–843.

Chen, X., Linton, O. B., Van Keilegom, I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71, 1591–1608.

Härdle, W., Hall, P., Ichimura, H. (1993). Optimal smoothing in single-index models. Annals of Statistics, 21, 157–178.

Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58, 71–120.

Ichimura, H., Lee, L.-F. (1991). Semiparametric least squares estimation of multiple index models: Single equation estimation. In W. A. Barnett, J. Powell, G. Tauchen (Eds.), Nonparametric and semiparametric methods in statistics and econometrics. Cambridge: Cambridge University Press (Chapter 1).

Liang, H. (2008). Generalized partially linear models with missing covariates. Journal of Multivariate Analysis, 99, 880–895.

Little, R. J. A., Rubin, D. B. (2002). Statistical analysis with missing data. New York: Wiley.

McCullagh, P., Nelder, J. A. (1983). Generalized linear models. London: Chapman & Hall.

Müller, U. U. (2009). Estimating linear functionals in nonlinear regression with responses missing at random. Annals of Statistics, 37, 2245–2277.

Müller, U. U., Schick, A., Wefelmeyer, W. (2006). Imputing responses that are not missing. In M. Nikulin, D. Commenges, C. Huber (Eds.), Probability, statistics and modelling in public health (pp. 350–363). New York: Springer.

Powell, J. L., Stock, J. H., Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica, 57, 1403–1430.

Robins, J. M., Rotnitzky, A., Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56, 931–954.

Rubin, D. B. (1976). Inference and missing values (with discussion). Biometrika, 63, 581–592.

Van der Vaart, A. W., Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer.

Wang, C. Y., Wang, S., Gutierrez, R. G., Carroll, R. J. (1998). Local linear regression for generalized linear models with missing data. Annals of Statistics, 26, 1028–1050.

Wang, D., Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. Annals of Statistics, 37, 490–517.

Wang, Q., Sun, Z. (2007). Estimation in partially linear models with missing responses at random. Journal of Multivariate Analysis, 98, 1470–1493.

Wang, Q., Linton, O., Härdle, W. (2004). Semiparametric regression analysis with missing response at random. Journal of the American Statistical Association, 99, 334–345.

Wang, Q.-H. (2009). Statistical estimation in partial linear models with covariate data missing at random. Annals of the Institute of Statistical Mathematics, 61, 47–84.

Wang, Y., Shen, J., He, S., Wang, Q. (2010). Estimation of single index model with missing response at random. Journal of Statistical Planning and Inference, 140, 1671–1690.