

ESTIMATION AND INFERENCE ABOUT CONDITIONAL AVERAGE TREATMENT EFFECT AND OTHER CAUSAL FUNCTIONS

By Vira Semenova† and Victor Chernozhukov∗

Harvard University† and MIT∗

Our framework can be viewed as inference on low-dimensional nonparametric functions in the presence of a high-dimensional nuisance function (where dimensionality refers to the number of covariates). Specifically, we consider the setting where we have a signal Y = Y(η0) that is an unbiased predictor of causal/structural objects, such as treatment effects, structural derivatives, outcomes given a treatment, and others, conditional on a set of very high-dimensional controls/confounders Z. We are interested in simpler, lower-dimensional nonparametric summaries of Y, namely g(x) = E[Y(η0)|X = x], conditional on a low-dimensional subset of covariates X. The random variable Y = Y(η), which we refer to as a signal, depends on a nuisance function η0(Z) of the high-dimensional controls, which is unknown.

In the first stage, we need to learn the function η0(Z) using any machine learning method that is able to approximate η0 accurately under very high dimensionality of Z. For example, under approximate sparsity with respect to a dictionary, ℓ1-penalized methods can be used; in other cases, tools such as deep neural networks can be used. To estimate Y(η0) we would use Y(η̂), but to make the subsequent inference valid, we need to carefully construct Y(·) so that the signal is orthogonal to perturbations of η, namely ∂η E[Y(η0)|X] = 0. This property allows us to use arbitrary, high-quality machine learning tools to learn η, because it eliminates the first-order biases arising from the biased/regularized estimation of η that necessarily occurs in such cases. As a result, the second-stage low-dimensional nonparametric inference enjoys quasi-oracle properties, as if we knew η0.

In the second stage, we approximate the target function g(x) by a linear form p(x)′β0, where p(x) is a vector of approximating functions and β0 is the Best Linear Predictor parameter, with the dimension of p(x) growing more slowly than the sample size. We develop a complete set of results about estimation and approximately Gaussian inference on x ↦ p(x)′β and x ↦ g(x). If p(x) is sufficiently rich and g(x) admits a good approximation, then g(x) gets automatically targeted by the inference; otherwise, the best linear approximation p(x)′β to g(x) gets targeted. In some applications, p(x) can simply be specified as a collection of group indicators, in which case p(x)′β describes group-average treatment/structural effects (GATEs), providing a very practical summary of the heterogeneous effects.

MSC 2010 subject classifications: Primary 62G08; secondary 62G05, 62G20, 62G35.
Keywords and phrases: High-dimensional statistics, heterogeneous treatment effect, conditional average treatment effect, orthogonal estimation, machine learning, double robustness.

arXiv:1702.06240v3 [stat.ME] 3 Feb 2020


1. Introduction and Motivation. This paper gives a method to estimate and conduct inference about a nonparametric function g(x) that summarizes heterogeneous treatment/causal/structural effects conditionally on a low-dimensional subset X ⊂ Z of the (potentially very) high-dimensional controls Z. We assume that this function can be represented as a conditional expectation function

g(x) = E[Y(η0) | X = x],    (1.1)

where the random variable Y = Y(η), which we refer to as a signal, depends on a nuisance function η0(Z) of the controls. Examples of the nonparametric target function include the Group Average Treatment Effect, Continuous Treatment Effects, the Conditional Average Treatment Effect (CATE), the regression function with Partially Missing Outcome, and the Conditional Average Partial Derivative (CAPD). Examples of the nuisance functions

η0 = η0(Z)

include the propensity score (the probability of treatment assignment), a conditional density, and a regression function, among others. In summary,

dim(Z) is high, dim(X) is low.

Although there are multiple possible choices of signals Y = Y(η) for the target function g(x), we focus on signals that possess the Neyman-type orthogonality property (Neyman (1959)). Formally, we require the pathwise derivative of the conditional expectation to be zero conditionally on X:

∂r E[Y(η0 + r(η − η0)) | X = x]|r=0 = 0   for all x and η.    (1.2)

If the signal Y = Y(η) is orthogonal, its plug-in estimate Y(η̂) is insensitive to the biased estimation of η that results from the application of modern adaptive learning methods in high dimensions, and it delivers a high-quality estimator of the target function g(x) under mild conditions.

We demonstrate the importance of the orthogonality property using the CATE as an example. We define the CATE function as

g(x) = E[Y^1 − Y^0 | X = x],

where Y^1 (Y^0) are the potential outcomes corresponding to receipt (non-receipt) of a binary treatment D ∈ {1, 0}, and X ∈ 𝒳 ⊂ R^r is a covariate vector of interest. Consider an Inverse Probability Weighting (IPW, Horvitz and Thompson (1952)) type signal

Y(s) = D Y^o / s(Z) − (1 − D) Y^o / (1 − s(Z)),

where Y^o = D Y^1 + (1 − D) Y^0 is the observed outcome, and the true value of the first-stage nuisance parameter s(Z) is the propensity score s0(Z) = Pr[D = 1 | Z]. The function s0(Z) is estimated by modern regularized methods, and the bias of its estimation error


ŝ(Z) − s0(Z) converges slowly. The standard choice of signal Y is not orthogonal to the first-stage bias:

∂r E[Y(s0 + r(s − s0)) | X]|r=0 = E[(−D Y^o / s0^2(Z) − (1 − D) Y^o / (1 − s0(Z))^2)(s(Z) − s0(Z)) | X] ≠ 0.

Consequently, the bias of the estimation error ŝ(Z) − s0(Z) in the propensity score translates into a bias of the estimated signal Y(ŝ). As a result, the estimate of the target function based on Y(ŝ) is of low quality.

To overcome the transmission of the first-stage bias, we consider a signal of the Robins and Rotnitzky (1995) type:

Y(η) = D(Y^o − µ(1, Z)) / s(Z) − (1 − D)(Y^o − µ(0, Z)) / (1 − s(Z)) + µ(1, Z) − µ(0, Z),

where µ0(D, Z) = E[Y^o | D, Z] is the regression function, and the true value of the first-stage nuisance parameter is η0(Z) = {s0(Z), µ0(1, Z), µ0(0, Z)}. Our proposed choice of signal Y(η) is orthogonal with respect to the first-stage bias:

∂r E[Y(η0 + r(η − η0)) | X]|r=0
= E[(−D(Y^o − µ0(1, Z)) / s0^2(Z) − (1 − D)(Y^o − µ0(0, Z)) / (1 − s0(Z))^2)(s(Z) − s0(Z)) | X]
+ E[(1 − D/s0(Z))(µ(1, Z) − µ0(1, Z)) | X]
− E[(1 − (1 − D)/(1 − s0(Z)))(µ(0, Z) − µ0(0, Z)) | X]
= 0.

Consequently, the bias of the first-stage estimation error η̂(Z) − η0(Z) does not translate into a bias of the estimated signal Y(η̂). As a result, the estimate of the target function based on Y(η̂) is of high quality.
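To make the role of orthogonality concrete, the following self-contained simulation (our own illustration, not part of the paper; the data-generating design and all names are assumptions) evaluates the IPW signal and the orthogonal signal above at deliberately perturbed nuisance values: a first-order nuisance error biases the IPW signal, while the orthogonal signal absorbs it up to second order.

```python
# Illustrative sketch: a toy design in which a small, systematic nuisance error
# biases the IPW signal but leaves the orthogonal (doubly robust) signal
# nearly unbiased. All choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(size=n)
s0 = 1.0 / (1.0 + np.exp(-Z))                 # true propensity score s0(Z)
D = rng.binomial(1, s0)
mu1, mu0 = Z + 1.0, Z                         # true regressions; true CATE = 1
Yo = np.where(D == 1, mu1, mu0) + rng.normal(size=n)

eps = 0.05                                    # first-stage estimation error
s_hat = np.clip(s0 + eps, 0.01, 0.99)
mu1_hat, mu0_hat = mu1 + eps, mu0 - eps

Y_ipw = D * Yo / s_hat - (1 - D) * Yo / (1 - s_hat)
Y_orth = (mu1_hat - mu0_hat
          + D * (Yo - mu1_hat) / s_hat
          - (1 - D) * (Yo - mu0_hat) / (1 - s_hat))

print("target E[Y^1 - Y^0] = 1.000")
print("IPW        signal mean: %.3f" % Y_ipw.mean())   # retains first-order bias
print("orthogonal signal mean: %.3f" % Y_orth.mean())  # first-order bias cancels
```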

In the second stage, we consider a linear projection of the orthogonal signal Y(η0) on a vector of basis functions p(X):

β0 := arg min_{β∈R^d} E(Y(η0) − p(X)′β)^2.

The choice of basis functions depends on the problem and the desired interpretation of the CATE. Perhaps the simplest choice is to take 𝒳 = ⊔_{k=1}^{d} Gk, a partition of the support of X into d mutually exclusive groups. Setting

pk(x) = 1{x ∈ Gk},   k ∈ {1, 2, . . . , d},   x ∈ 𝒳,

yields a p(x)′β0 that can be interpreted as the Group Average Treatment Effect. For another example, let p(x) ∈ R^d be a d-dimensional dictionary of series basis functions (e.g., polynomials or splines). Then p(x)′β0 corresponds to the best linear approximation to the target function g(x) in the given dictionary. Under some smoothness conditions, as the dimension of the dictionary becomes large, p(x)′β0 approximates g(x) at the given point x.
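For concreteness, the two choices of p(x) just described can be generated as follows; this is our own sketch for a scalar x, and the helper names are illustrative, not the paper's.

```python
# Illustrative construction of the technical regressors p(x) for a scalar x.
import numpy as np

def group_basis(x, cuts):
    """p_k(x) = 1{x in G_k} for d = len(cuts) + 1 mutually exclusive groups."""
    k = np.digitize(x, cuts)              # group index of each observation
    return np.eye(len(cuts) + 1)[k]       # N x d matrix of indicators

def polynomial_basis(x, degree):
    """Series dictionary 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)
```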


1.1. Overview of Main Results. The first main contribution of this paper is to provide sufficient conditions for pointwise and uniform asymptotic Gaussian approximation of the target function. We approximate the target function g(x) at a point x by a linear form p(x)′β0:

g(x) = p(x)′β0 + rg(x),

where p(x) is a d-vector of technical transformations of the covariates x, rg(x) is the misspecification error due to the linear approximation¹, and β0 is the Best Linear Predictor, defined by the normal equation:

E p(X)[g(X) − p(X)′β0] = E p(X) rg(X) = 0.    (1.3)

The two-stage estimator β̂ of β0, which we refer to as the Orthogonal Estimator, is constructed as follows. In the first stage, we construct an estimate η̂ of the nuisance parameter η0 using any high-quality machine learning estimator. In the second stage, we construct an estimate Ŷi of the signal Yi as Ŷi := Yi(η̂) and run ordinary least squares of Ŷi on the technical regressors p(Xi). We use different samples for the estimation of η in the first stage and the estimation of β0 in the second stage, in the form of cross-fitting described in the following definition.

Definition 1.1 (Cross-fitting).
1. For a random sample of size N, denote a K-fold random partition of the sample indices [N] = {1, 2, ..., N} by (Jk)_{k=1}^{K}, where K is the number of folds and the sample size of each fold is n = N/K. Also, for each k ∈ [K] = {1, 2, ..., K}, define Jk^c = {1, 2, ..., N} \ Jk.
2. For each k ∈ [K], construct an estimator η̂k = η̂((Vi)_{i∈Jk^c})² of the nuisance parameter η0 using only the data from Jk^c; and for any observation i ∈ Jk, define the estimate Ŷi := Yi(η̂k).

Definition 1.2 (Orthogonal Estimator). Given (Ŷi)_{i=1}^{N}, define the Orthogonal Estimator as:

β̂ := (1/N ∑_{i=1}^{N} p(Xi)p(Xi)′)^{−1} (1/N ∑_{i=1}^{N} p(Xi)Ŷi).    (1.4)
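As a concrete illustration (our own sketch, not the authors' code), the following puts Definitions 1.1 and 1.2 together for the orthogonal CATE signal displayed in the introduction, assuming scikit-learn learners for the first stage; every function and variable name below is an illustrative assumption.

```python
# Minimal sketch of cross-fitting (Definition 1.1) plus the Orthogonal
# Estimator (Definition 1.2) for the doubly robust CATE signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def orthogonal_estimator(Yo, D, Z, pX, K=5):
    """OLS of the cross-fitted orthogonal signal on the basis matrix pX."""
    N = len(Yo)
    Y_hat = np.empty(N)
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(Z):
        # First stage on J_k^c: propensity score and outcome regressions.
        s_fit = RandomForestClassifier(random_state=0).fit(Z[train], D[train])
        m1_fit = RandomForestRegressor(random_state=0).fit(
            Z[train][D[train] == 1], Yo[train][D[train] == 1])
        m0_fit = RandomForestRegressor(random_state=0).fit(
            Z[train][D[train] == 0], Yo[train][D[train] == 0])
        s = np.clip(s_fit.predict_proba(Z[test])[:, 1], 0.01, 0.99)
        m1, m0 = m1_fit.predict(Z[test]), m0_fit.predict(Z[test])
        # Orthogonal signal evaluated on the held-out fold J_k.
        Y_hat[test] = (m1 - m0
                       + D[test] * (Yo[test] - m1) / s
                       - (1 - D[test]) * (Yo[test] - m0) / (1 - s))
    # Second stage: ordinary least squares of the signal on p(X).
    beta_hat, *_ = np.linalg.lstsq(pX, Y_hat, rcond=None)
    return beta_hat, Y_hat
```

With pX taken to be the group-indicator matrix from the sketch above, the entries of beta_hat are group-average treatment effects (GATEs); with a polynomial or spline dictionary, p(x)′beta_hat estimates the best linear approximation to the CATE.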

Under mild conditions on η̂, the Orthogonal Estimator delivers a high-quality estimate p(x)′β̂ of the pseudo-target function p(x)′β0 with the following properties:

• With probability approaching one, the mean squared error of p(x)′β̂ is bounded by

(EN(p(Xi)′(β̂ − β0))^2)^{1/2} = OP(√(d/N)) + zd,

where zd is the effect of the misspecification error rg(x).

¹ Our analysis allows for either a vanishing or a non-vanishing misspecification error rg(x).
² The results of this paper hold without relying on the specific choice of the first-stage estimator η̂(Z). For possible ways to estimate η0, see the discussion.


• The estimator p(x)′β̂ of the pseudo-target function p(x)′β0 is asymptotically linear:

√N p(x)′(β̂ − β0) / √(p(x)′Ω p(x)) = GN(x) + oP(1/√(log N)),

where the empirical process GN(x) is approximated by a Gaussian process with N(0, 1) coordinates, uniformly over x ∈ 𝒳, and the covariance matrix Ω can be consistently estimated by a sample analog Ω̂.

• If the misspecification error rg(x) is small, the pseudo-target function p(x)′β0 can be replaced by the target function g(x):

√N (p(x)′β̂ − g(x)) / √(p(x)′Ω p(x)) = GN(x) + oP(1/√(log N)).

• The quantiles of the suprema of the approximating Gaussian process can be consistently approximated by the bootstrap, from which valid uniform confidence bands for x ↦ p(x)′β̂ and x ↦ g(x) can be constructed.

The results of this paper accommodate the estimation of η by high-dimensional/highly complex modern machine learning (ML) methods, such as random forests, neural networks, and ℓ1-shrinkage estimators, as well as estimation of η by previously developed methods. The only requirement we impose on the estimation of η is its mean-square convergence to the true nuisance parameter η0 at a high-quality rate oP(N^{−1/4−δ}) for some δ > 0. This requirement is satisfied under structured assumptions on η0, such as approximate sparsity of η0 with respect to some dictionary, or well-approximability of η0 by trees or by sparse and deep neural nets.

Another important example, worth highlighting here, is that of Continuous Treatment Effects, studied in Kennedy et al. (2017), where

g(x) = E[Y^x],

where X ∈ 𝒳 ⊂ R is a continuous treatment and Y^x, x ∈ 𝒳, are the potential outcomes. We prove that the doubly robust signal of Kennedy et al. (2017),

Y(η) := w(X)(Y^o − µ(X, Z)) / s(X|Z) + ∫ µ(X, Z) dP(Z),

is conditionally orthogonal with respect to the nuisance parameter η0(X, Z), consisting of the conditional treatment density s0(X|Z), the regression function µ0(X, Z), and the marginal density w0(X) = ∫ s0(X|Z) dP(Z):

η0(X, Z) := {s0(X|Z), µ0(X, Z), w0(X)}.

The proposed estimator of the signal takes the form

Y†i(η̂) = ŵ(Xi)(Y^o_i − µ̂(Xi, Zi)) / ŝ(Xi|Zi) + (1/(n − 1)) ∑_{j∈Jk, j≠i} µ̂(Xi, Zj),   i ∈ Jk,


where η̂(X, Z) is estimated on Jk^c and the sample average in the second summand is taken over the data (Vj)_{j∈Jk} excluding observation i. Having replaced Ŷi(η̂) by Y†i(η̂) in Definition 1.2, we obtain an asymptotically linear estimator p(x)′β̂† of the pseudo-target function:

√N p(x)′(β̂† − β0) / √(p(x)′Ω† p(x)) = GN(x) + oP(1/√(log N)),    (1.5)

where the empirical process GN(x) is approximated by a Gaussian process with the same covariance function uniformly over x ∈ 𝒳.

To the best of our knowledge, the inferential approach via series approximation that we propose for this example and all the others is novel, and so are the inferential results.

1.2. Literature Review. This paper builds on three bodies of research within the semiparametric literature: orthogonal (debiased) machine learning, least squares series estimation, and treatment effects/missing data problems. Orthogonal machine learning (Chernozhukov et al. (2016a), Chernozhukov et al. (2016b)) concerns inference on a fixed-dimensional target parameter β in the presence of a high-dimensional nuisance function η in a semiparametric moment problem. When the moment condition is orthogonal to perturbations of η, the estimation of η by ML methods has no first-order effect on the asymptotic distribution of the target parameter β. In particular, plugging in an estimate of η obtained on a separate sample results in a √N-consistent, asymptotically normal estimate whose asymptotic variance is the same as if η = η0 were known. This result allows one to use highly complex machine learning methods to estimate the nuisance function η, such as ℓ1-penalized methods in sparse models (Bühlmann and van de Geer (2011), Belloni et al. (2016)), L2 boosting in sparse linear models (Luo and Spindler (2016)), and other methods for classes of neural nets, regression trees, and random forests. In many cases, the orthogonal moment conditions are also doubly robust (Robins and Rotnitzky (1995), Kennedy et al. (2017)) with respect to misspecification of one of the nuisance components.

The second building block of our paper is the literature on least squares series estimation (Newey (2007), Newey (2009), Belloni et al. (2015), Chen and Christensen (2015)), which establishes the pointwise and uniform limit theory for least squares series estimation. We extend this theory by allowing the dependent variable of the series projection to depend upon an unknown nuisance parameter η. We show that the series properties continue to hold without any additional strong assumptions on the problem design.

Finally, we also contribute to the literature on missing data and on estimation of the conditional average treatment effect and the group average treatment effect (Robins and Rotnitzky (1995), Hahn (1998), Graham (2011), Graham et al. (2012), Hirano et al. (2003), Abrevaya et al. (2015), Athey and Imbens (2015), Kennedy et al. (2017), Grimmer et al. (2017)). It is also worth mentioning several references that appeared well after our paper was posted to arXiv (1702.06240). They include Fan et al. (2019), Colangelo and Lee (2019), Jacob et al. (2019), Oprescu et al. (2019), and Zimmert and Lechner (2019), and they mainly analyze inference on (average, group, or local average) features of the CATE (Fan et al. (2019)) and the CTE (Colangelo and Lee (2019)). Our framework covers many more examples than features of the CATE (which was our original motivation), and it uses series estimators instead of kernels for localization (e.g., in contrast to Fan et al. (2019), Colangelo and Lee (2019)).

2. Examples. The examples below apply the proposed framework to study Continuous Treatment Effects, the Conditional Average Treatment Effect, the regression function with a Partially Missing Outcome, and the Conditional Average Partial Derivative.

Here we introduce a new target function, the average potential outcome in the case of a continuous treatment D, and derive an orthogonal signal Y(η) for it.

Example 1 (Continuous Treatment Effects). Let X = D ∈ R be a continuous treatment variable, Z be a vector of controls, and Y^d, d ∈ 𝒟, stand for the potential outcomes corresponding to the subject's response after receiving d units of treatment. The observed data V = (D, Z, Y^o) consist of the treatment D, the controls Z, and the treatment response Y^o = Y^D. For a given value x = d, the average potential outcome, as defined in Gill and Robins (2001) and Kennedy et al. (2017), is

g(x) = E[Y^x] = E[Y^d].    (2.1)

Let us define the following nuisance functions. Define the conditional density of X given Z as

s0(x|z) = dP(X ≤ t | Z = z)/dt |_{t=x}

and the conditional expectation of the outcome Y^o given X, Z as µ0(x, z) = E[Y^o | X = x, Z = z]. Finally, let

w0(x) = ∫_{z∈𝒵} s0(x|z) dPZ(z)

be the marginal density of X. The orthogonal signal Y(η), conditionally on X = x, is given by

Y(η) := w(X)(Y^o − µ(X, Z)) / s(X|Z) + ∫ µ(X, Z) dP(Z),    (2.2)

where the nuisance parameter η0 is a vector-valued function:

η0(X, Z) = {s0(X|Z), w0(X), µ0(X, Z)}.

In particular, (2.2) coincides with the doubly robust signal of Kennedy et al. (2017). Corollary 5.1 shows that Equation (2.2) is orthogonal with respect to the nuisance parameter η0(D, Z). Theorem 5.1 gives pointwise and uniform asymptotic theory for ĝ(x). This theorem presents a novel asymptotic linearity representation of √N α′(β̂ − β0) that accounts for the estimation error ∫ µ(X, Z) d(PN − P)(Z) of the integral ∫ µ(X, Z) dP(Z).


Example 2 (Conditional Average Treatment Effect). Let Y^1 and Y^0 be the potential outcomes, corresponding to the response of a subject with and without receiving a treatment, respectively. Let D ∈ {1, 0} indicate the subject's presence in the treatment group. The object of interest is the Conditional Average Treatment Effect

g(x) := E[Y^1 − Y^0 | X = x].

Since an individual cannot be treated and non-treated at the same time, only the actual outcome Y^o = DY^1 + (1 − D)Y^0, but not the treatment effect Y^1 − Y^0, is observed.

A standard way to make progress in this problem is to assume unconfoundedness (Rosenbaum and Rubin (1983)). Suppose there exists an observable control vector Z such that the treatment status D is independent of the potential outcomes Y^1, Y^0 conditionally on Z, and X ⊆ Z.

Assumption 2.1 (Unconfoundedness). The treatment status D is independent of the potential outcomes Y^1, Y^0 conditionally on Z: (Y^1, Y^0) ⊥ D | Z.

Define the conditional probability of treatment receipt as s0(Z) = Pr[D = 1 | Z]. Consider an orthogonal signal Y of the Robins and Rotnitzky (1995) type:

Y(η) := µ(1, Z) − µ(0, Z) + D[Y^o − µ(1, Z)] / s(Z) − (1 − D)[Y^o − µ(0, Z)] / (1 − s(Z)),    (2.3)

where µ0(D, Z) = E[Y^o | D, Z] is the conditional expectation function of Y^o given D, Z. Corollary 5.2 shows that (2.3) is orthogonal to the estimation error of the nuisance parameter η0(Z) := (s0(Z), µ0(1, Z), µ0(0, Z)).

Example 3 (Regression Function with Partially Missing Outcome). Suppose a researcher is interested in the conditional expectation, given a covariate vector X, of a variable Y∗,

g(x) := E[Y∗ | X = x],

that is partially missing. Let D ∈ {1, 0} indicate whether the outcome Y∗ is observed, and let Y^o = DY∗ be the observed outcome. Since the researcher does not control the presence status D, a standard way to make progress is to assume the existence of an observable control vector Z such that Y∗ ⊥ D | Z.

Assumption 2.2 (Missingness at Random). The presence indicator D is independent of the outcome Y∗ conditionally on Z: Y∗ ⊥ D | Z.

Corollary 5.3 shows that the signal Y, defined as

Y(η) := µ(Z) + D[Y^o − µ(Z)] / s(Z),    (2.4)

is orthogonal to the estimation error of the nuisance parameter η0(Z) := (s0(Z), µ0(Z)). Here, the function µ0(Z) = E[Y^o | D = 1, Z] = E[Y∗ | D = 1, Z] is the conditional expectation function of the observed outcome Y^o given Z and D = 1.
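In code, evaluating (2.4) is immediate once the first-stage estimates have been fit on the complementary fold; the snippet below is our own illustration, and the argument names are assumptions rather than the paper's notation.

```python
# Illustrative computation of the missing-outcome signal (2.4), given
# first-stage estimates mu_hat = mu(Z_i) and s_hat = s(Z_i) on fold J_k.
import numpy as np

def missing_outcome_signal(Yo, D, mu_hat, s_hat):
    s = np.clip(s_hat, 0.01, 1.0)        # guard against tiny propensities
    return mu_hat + D * (Yo - mu_hat) / s
```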


Example 4 (Experiment with Partially Missing Outcome). Let Y^1 and Y^0 be the potential outcomes, corresponding to the response of a subject with and without receiving a treatment, respectively. Suppose a researcher is interested in the Conditional Average Treatment Effect

g(x) := E[Y^1 − Y^0 | X = x].

Conditionally on a vector of stratifying variables ZD, the researcher randomly assigns the treatment status T ∈ {1, 0} to measure the outcome TY^1 + (1 − T)Y^0. In the presence of a partially missing outcome, let D ∈ {1, 0} indicate the presence of the outcome record Y^o = D(TY^1 + (1 − T)Y^0) in the data. Since the presence indicator D may be co-determined with the covariates X, estimating the treatment effect function g(x) on the observed outcomes only, without accounting for the missingness, may lead to an inconsistent estimate of g(x).

Since the researcher does not control the presence status D, a standard way to make progress is to assume Missingness at Random, namely the existence of an observable control vector Z such that the outcome is independent of D conditionally on Z. Setting Z to be the full vector of observables (in particular, X ⊆ Z) makes Missingness at Random (Assumption 2.2) the least restrictive. Define the conditional probability of presence

s0(Z, T) = Pr[D = 1 | Y^o, Z, T] = Pr[D = 1 | Z, T]

and the treatment propensity score

h0(Z) := Pr[T = 1 | Z].

An orthogonal signal Y for the CATE g(x) can be obtained as follows:

Y(η) = µ(1, Z) − µ(0, Z) + DT[Y^o − µ(1, Z)] / (s(Z, T) h(Z)) − D(1 − T)[Y^o − µ(0, Z)] / (s(Z, T)(1 − h(Z))),    (2.5)

where µ0(T, Z) = E[Y^o | T, Z] is the conditional expectation function of Y^o given T, Z.

Example 5 (Conditional Average Partial Derivative). Let

µ(x, w) := E[Y^o | X = x, W = w]

be the conditional expectation function of an outcome Y^o given a set of variables X, W. Suppose a researcher is interested in the conditional average derivative of µ(x, w) with respect to w given X = x, denoted by

g(x) = E[∂wµ(X, w)|_{w=W} | X = x].

An orthogonal signal Y(η) of the Newey and Stoker (1993) type takes the form

Y(η) := −∂w log f(W|X) [Y^o − µ(X, W)] + ∂wµ(X, W),    (2.6)

where f(W|X = x) is the conditional density of W given X = x. The nuisance parameter η = {µ(X, W), f(W|X)} consists of the conditional expectation function µ(X, W) and the conditional density f(W|X). Corollary 5.4 shows that Y is orthogonal to the estimation error of the nuisance parameter η(X, W) = {µ(X, W), f(W|X)}.


3. Empirical Application. To show the immediate usefulness of the method, we consider an important problem of inference about structural derivatives. We apply our methods to study the household demand for gasoline, a question studied in Hausman and Newey (1995), Schmalensee and Stoker (1999), Yatchew and No (2001), and Blundell et al. (2012). These papers estimated the demand function and the average price elasticity for various demographic groups. The dependence of the price elasticity on household income was highlighted in Blundell et al. (2012), who estimated the elasticity for low-, middle-, and high-income groups and found its relationship with income to be non-monotonic. To gain more insight into this question, we estimate the average price elasticity as a function of income and provide simultaneous confidence bands for it.

The data for our analysis are the same as in Yatchew and No (2001), coming from the National Private Vehicle Use Survey, conducted by Statistics Canada between October 1994 and September 1996. The data set is based on fuel purchase diaries and contains detailed information about fuel prices, fuel consumption patterns, vehicles, and demographic characteristics. We employ the same selection procedure as in Yatchew and No (2001) and Belloni et al. (2011), focusing on a sample of households with a non-zero number of licensed drivers, vehicles, and distance driven, which leaves us with 5001 observations.

The object of interest is the average predicted percentage change in demand due to a unit percentage change in price, holding the observed demographic characteristics fixed, conditional on income. In the context of Example 5, this corresponds to the conditional average derivative

g(x) = E[∂wµ(X, Z, W) | X = x],
µ(w, x, z) = E[Y^o | X = x, Z = z, W = w],

where Y^o is the logarithm of gas consumption, W is the logarithm of the price per liter, X is log income, and Z are observed subject characteristics such as household size and composition, distance driven, and the type of fuel usage. The orthogonal signal Y for the target function g(x) is given by

Y = −∂w log f(W|X, Z) (Y^o − µ(X, Z, W)) + ∂wµ(X, Z, W),    (3.1)

where f(w|x, z) = f(W = w | X = x, Z = z) is the conditional density of the price variable W given income X and subject characteristics Z. The conditional density f(w|x, z) and the conditional expectation function µ(w, x, z) comprise the set of nuisance parameters to be estimated in the first stage.

The choice of estimators in the first and second stages is as follows. To estimate the conditional expectation function µ(w, x, z) and its partial derivative ∂wµ(w, x, z), we consider a linear model that includes price, price squared, income, income squared, and their interactions with 28 time, geographical, and household composition dummies. All in all, we have 91 explanatory variables. We estimate µ(w, x, z) using Lasso with the penalty level chosen as in Belloni et al. (2014), and we estimate the derivative ∂wµ(w, x, z) using the estimated coefficients of µ(w, x, z). To estimate the conditional density f(w|x, z), we consider the model

W = l(X, Z) + U,   U ⊥ X, Z,


where l(x, z) = E[W | X = x, Z = z] is the conditional expectation of the price variable W given the income variable X and covariates Z, and U is an independent, continuously distributed shock with univariate density φ(·). Under this assumption, the derivative of the log density ∂w log f(W|X, Z) equals

∂w log f(W = w | X = x, Z = z) = φ′(w − l(x, z)) / φ(w − l(x, z)).

We estimate φ(u) : R → R+ by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth. Finally, we plug the estimates of µ(w, x, z), ∂wµ(w, x, z), and f(w|x, z) into Equation (3.1) to obtain an estimate Ŷ of Y, and we estimate g(x) by a least squares series regression of Ŷ on X. We try both polynomial basis functions and B-splines to construct the technical regressors.
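For concreteness, the following schematic sketch traces the three steps just described. It is our own reconstruction with illustrative library choices: the cross-validated lasso, the Gaussian kernel density estimator, and the reduced dictionary stand in for the exact specification (Belloni et al. (2014) penalty, Portnoy-Koenker adaptive kernel, full 91-variable dictionary) used for the figures.

```python
# Schematic sketch of the first two estimation steps for the gasoline
# application; all names and tuning choices are illustrative assumptions.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LassoCV

def price_elasticity_signal(Yo, W, X, Z):
    """Return an estimated signal Y_hat of (3.1) for each observation."""
    controls = np.column_stack([X, X ** 2, Z])
    # Step 1: mu(w, x, z) from a lasso on (W, W^2, controls); the partial
    # derivative d mu / d w follows from the fitted coefficients.
    design_mu = np.column_stack([W, W ** 2, controls])
    mu_fit = LassoCV(cv=5).fit(design_mu, Yo)
    mu_hat = mu_fit.predict(design_mu)
    dmu_dw = mu_fit.coef_[0] + 2.0 * mu_fit.coef_[1] * W
    # Step 2: conditional density of W given (X, Z) via the location model
    # W = l(X, Z) + U: estimate l by lasso, then the density of the residual.
    l_fit = LassoCV(cv=5).fit(controls, W)
    U = W - l_fit.predict(controls)
    phi = gaussian_kde(U)                                  # density of U
    h = 1e-3
    dlogf = (phi(U + h) - phi(U - h)) / (2 * h * phi(U))   # phi'(u) / phi(u)
    # Orthogonal signal (3.1).
    return -dlogf * (Yo - mu_hat) + dmu_dw
```

The remaining step regresses this signal on a polynomial or B-spline basis in log income, exactly as in Definition 1.2.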

Figures 1 and 3 report the estimate of the target function (the black line) and the pointwise (dashed blue lines) and uniform (solid blue lines) confidence bands for the average price elasticity conditional on income, where the significance level is α = 0.05. The panels of Figure 1 correspond to different choices of the first-stage estimates of the nuisance functions µ(w, x, z) and f(w|x, z) and of the dictionaries of technical regressors. The panels of Figure 3 correspond to the subsamples of large and small households and to different choices of the dictionaries.

The summary of our empirical findings based on Figures 1 and 3 is as follows. We find the elasticity to be in the range (−1, 0) and significant for the majority of income levels. The estimates based on B-splines (Figures 1c, 1d) are monotonically increasing in income, which is intuitive. The estimates based on polynomial functions are non-monotonic in income. For every algorithm in Figure 1, we cannot reject the null hypothesis of a constant price elasticity across all income levels: for each estimation procedure, the uniform confidence bands contain a constant function. Figure 3 shows the average price elasticity conditional on income for small and large households.³ For the majority of income levels, we find large households to be more price elastic than small ones, but the difference is not significant at any income level.

To demonstrate the relevance of the demographic data Z in the first-stage estimation, we also show the average predicted effect of the price change on gasoline consumption (in logs) without accounting for the covariates in the first stage. In particular, this effect equals E[∂wµ(X, W) | X = x], where µ(x, w) = E[Y | X = x, W = w] is the conditional expectation of gas consumption given income and price. This predictive effect consists of two effects: the effect of the price change on consumption holding the demographic covariates fixed, which we refer to as the average price elasticity, and the association of the price change with changes in household characteristics that also affect consumption. Figure 2 shows this predictive effect, approximated by polynomials of degree k ∈ {1, 2}, conditional on income. In contrast to the results in Figure 1, the polynomial of degree k = 1 has a negative slope, that is, a negative relationship between income and price elasticity, which presents evidence that the demographics Z confound the relationship between income and price elasticity.

³ A large household is a household with at least 4 members.


[Figure 1 panels: (a) Step 2: l(x, z) is estimated by Lasso; Step 3: polynomials of degree 3. (b) Step 2: l(x, z) is estimated by random forest; Step 3: polynomials of degree 3. (c) Step 2: l(x, z) is estimated by Lasso; Step 3: B-splines of order 2 with 1 knot. (d) Step 2: l(x, z) is estimated by random forest; Step 3: B-splines of order 2 with 1 knot.]

Fig 1: 95% confidence bands for the best linear approximation of the average price elasticity conditional on income, accounting for the demographic controls in the first stage. The black line is the estimated function; the dashed (solid) blue lines are the pointwise (uniform) confidence bands. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function µ(w, x, z), (2) second-stage estimation of the conditional density f(w|x, z), and (3) third-stage estimation of the target function g(x) by least squares series. Step 1 is performed using Lasso with standardized covariates and the penalty choice λ = 2.2√n σ̂ Φ^{−1}(1 − γ/2p), where γ = 0.1/log n and σ̂ is the estimate of the residual variance. Step 2 is performed by estimating the regression function l(x, z) = E[W|X = x, Z = z] and estimating the density f(w − l(x, z)) of the residual w − l(x, z) by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth. The regression function l(x, z) is estimated by lasso (1a, 1c) and by random forest (1b, 1d). Step 3 is performed using B-splines of order 2 with one knot (1c, 1d) and polynomial functions of order 3 (1a, 1b). B = 200 weighted bootstrap repetitions.


[Figure 2 panels: (a) Polynomial degree q = 1. (b) Polynomial degree q = 2.]

Fig 2: 95% confidence bands for the best linear approximation of the average price elasticity conditional on income, without accounting for the demographic controls in the first stage. The black line is the estimated function; the dashed and solid blue lines are the pointwise and uniform confidence bands. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function µ(w, x) = E[Y|W = w, X = x], (2) second-stage estimation of the conditional density f(W = w|X = x), and (3) third-stage estimation of the target function g(x) by least squares series. Step 1 is performed using least squares series regression with polynomial functions 1, x, . . . , x^q, q = 3, where the power q is chosen by cross-validation out of {1, 2, 3}. Step 2 is performed by a kernel density estimator with the Silverman choice of bandwidth. Step 3 is performed using polynomial functions 1, x, . . . , x^q and is shown for q = 1 and q = 2. B = 200 weighted bootstrap repetitions.


[Figure 3 panels: (a) Large households, polynomials of degree 3. (b) Small households, polynomials of degree 3. (c) Large households, B-splines of degree 2 with 1 knot. (d) Small households, B-splines of degree 2 with 1 knot.]

Fig 3: 95% confidence bands for the best linear approximation of the average price elasticity conditional on income, accounting for the demographic controls in the first stage, by household size. The black line is the estimated function; the dashed (solid) blue lines are the pointwise (uniform) confidence bands. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function µ(w, x, z), (2) second-stage estimation of the conditional density f(w|x, z), and (3) third-stage estimation of the target function g(x) by least squares series. Step 1 is performed using Lasso with standardized covariates and the penalty choice λ = 2.2√n σ̂ Φ^{−1}(1 − γ/2p), where γ = 0.1/log n and σ̂ is the estimate of the residual variance. Step 2 is performed by estimating the regression function l(x, z) = E[W|X = x, Z = z] and estimating the density f(w − l(x, z)) of the residual w − l(x, z) by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth. The regression function l(x, z) is estimated by lasso. Step 3 is performed using B-splines of order 2 with one knot (3c, 3d) and non-orthogonalized polynomial functions of degree 3 (3a, 3b). B = 200 weighted bootstrap repetitions.


4. Main Theoretical Results.

Notation. For two sequences of random variables aN, bN, N ≥ 1, aN ≲P bN means aN = OP(bN). For two sequences of numbers aN, bN, N ≥ 1, aN ≲ bN means aN = O(bN). Let a ∧ b = min{a, b} and a ∨ b = max{a, b}. The ℓ2 norm of a vector is denoted by ‖·‖, the ℓ1 norm by ‖·‖1, the ℓ∞ norm by ‖·‖∞, and the ℓ0 norm denotes the number of non-zero components of a vector. Given a vector δ ∈ R^p and a set of indices T ⊂ {1, ..., p}, we denote by δT the vector in R^p for which (δT)j = δj for j ∈ T and (δT)j = 0 for j ∉ T.

Let ξd := sup_{x∈𝒳} ‖p(x)‖ = sup_{x∈𝒳} (∑_{j=1}^{d} pj(x)^2)^{1/2}. For a matrix Q, let ‖Q‖ be the maximal eigenvalue of Q. For a random variable V, let ‖V‖_{P,q} := (∫ |V|^q dP)^{1/q}. Let ζd := sup_{x∈𝒳} ‖p(x)‖∞.

The random sample (Vi)_{i=1}^{N} is a sequence of independent copies of a random element V taking values in a measurable space (𝒱, 𝒜_𝒱) according to a probability law P. We shall use empirical process notation. For a generic function f and a generic sample (Xi)_{i=1}^{N}, denote the sample average by

EN f(Xi) := (1/N) ∑_{i=1}^{N} f(Xi)

and the √N-scaled, demeaned sample average by

GN f(Xi) := (1/√N) ∑_{i=1}^{N} (f(Xi) − E f(Xi)).

All asymptotic statements below are with respect to N → ∞. Let

En,k f(xi) := (1/n) ∑_{i∈Jk} f(xi),   Gn,k f(xi) := (1/√n) ∑_{i∈Jk} (f(xi) − E[f(xi) | (Vi)_{i∈Jk^c}]).

For an observation index i ∈ Jk that belongs to fold Jk, k ∈ {1, 2, ..., K}, define Ŷi(η̂) = Yi(η̂k) for i ∈ Jk, where η̂k is estimated on Jk^c as in Definition 1.1. Similarly, define η̂(Zi) = η̂k(Zi) for i ∈ Jk.

4.1. Asymptotic Theory for Least Squares Series.

Assumption 4.1 (Identification). Let Q := E p(X)p(X)′ = Qd denote the population covariance matrix of the technical regressors. Assume that there exist 0 < Cmin < Cmax < ∞ that do not depend on d such that Cmin < min eig(Q) < max eig(Q) < Cmax for all d.

Assumption 4.2 (Growth Condition). We assume that the sup-norm of the technical regressors ξd := sup_{x∈𝒳} ‖p(x)‖ = sup_{x∈𝒳} (∑_{j=1}^{d} pj(x)^2)^{1/2} grows sufficiently slowly:

√(ξd^2 log N / N) = o(1).


Assumption 4.3 (Misspecification Error). There exist sequences of finite constants ld, rd such that the norms of the misspecification error are controlled as follows:

‖rg‖_{P,2} := (∫ rg(x)^2 dP(x))^{1/2} ≲ rd   and   ‖rg‖_{P,∞} := sup_{x∈𝒳} |rg(x)| ≲ ld rd.

Assumption 4.3 introduces the rate of decay of the misspecification error. Specifically, the sequence of constants rd bounds the mean squared misspecification error. In addition, the sequence ld rd bounds the worst-case misspecification error uniformly over the domain 𝒳 of X, where ld is the modulus of continuity of the worst-case error with respect to the mean squared error.

Define the sampling error U as follows:

U := Y − g(X).

Assumption 4.4 (Sampling Error). The second moment of the sampling error U conditionally on X is bounded from above by σ^2:

sup_{x∈𝒳} E[U^2 | X = x] ≲ σ^2.

To describe the first-stage rate requirement, Assumption 4.5 introduces a sequence of nuisance realization sets TN for the nuisance parameter η0. As the sample size N increases, the sets TN shrink around the true value η0. The shrinkage speed is described in terms of the statistical rates BN and ΛN.

Assumption 4.5 (Small Bias Condition). There exists a sequence εN = o(1) such that, with probability at least 1 − εN, the first-stage estimate η̂, obtained by cross-fitting (Definition 1.1), belongs to a shrinking neighborhood of η0, denoted by TN. Uniformly over TN, the following mean-square convergence holds:

BN := √N sup_{η∈TN} ‖E p(X)[Y(η) − Y(η0)]‖ = o(1),    (4.1)
ΛN := sup_{η∈TN} (E ‖p(X)[Y(η) − Y(η0)]‖^2)^{1/2} = o(1).    (4.2)

In particular, ΛN can be bounded as

ΛN ≲ ξd sup_{η∈TN} (E(Y(η) − Y(η0))^2)^{1/2}.

Assumption 4.5 is stated in a high-level form in order to accommodate various machine learning estimators. We demonstrate the plausibility of Assumption 4.5 for a high-dimensional sparse model in Example 3 (see Example 6).


4.1.1. Pointwise Limit Theory.

Theorem 4.1 (Pointwise Limit Theory of Orthogonal Estimator). Let Assumptions 4.1, 4.2, 4.3, 4.4, 4.5 hold. Then the following statements hold:

(a) The ℓ2 norm of the estimation error is bounded as:

‖β̂ − β0‖ ≲P √(d/N) + [√(d/N) ld rd ∧ ξd rd/√N],

which implies a bound on the mean squared error of the estimate p(x)′β̂ of the pseudo-target function p(x)′β0:

(EN(p(Xi)′(β̂ − β0))^2)^{1/2} ≲P √(d/N) + [√(d/N) ld rd ∧ ξd rd/√N].

(b) For any α ∈ S^{d−1} := {α ∈ R^d : ‖α‖ = 1}, the estimator β̂ is approximately linear:

√N α′(β̂ − β0) = α′Q^{−1} GN p(Xi)(Ui + rg(Xi)) + R1,N(α),

where the remainder term R1,N(α) is bounded as

R1,N(α) ≲P ΛN + BN + √(ξd^2 log N / N) (1 + max{ld rd √d, ξd rd, ΛN + BN}).

(c) Define the asymptotic covariance matrix of (β̂ − β0) as follows:

Ω = Q^{−1} E p(X)p(X)′(U + rg(X))^2 Q^{−1}.

If R1,N(α) = oP(1) and the Lindeberg condition holds, sup_{x∈𝒳} E[U^2 1{|U| > M} | X = x] → 0 as M → ∞, then the estimator is approximately Gaussian pointwise:

lim_{N→∞} sup_{t∈R} | P(√N α′(β̂ − β0) / √(α′Ωα) < t) − Φ(t) | = 0.    (4.3)

In particular, for any point x0 ∈ 𝒳 and α = p(x0)/‖p(x0)‖, the estimator p(x0)′β̂ of the pseudo-target value p(x0)′β0 is asymptotically normal:

lim_{N→∞} sup_{t∈R} | P(√N p(x0)′(β̂ − β0) / √(p(x0)′Ω p(x0)) < t) − Φ(t) | = 0.    (4.4)

Theorem 4.1 is our first main result. Under the small bias condition, the Orthogonal Estimator has the oracle rate, the oracle asymptotic linearity representation, and the oracle asymptotic variance Ω, where the oracle knows the true value of the first-stage nuisance parameter η0.


4.1.2. Uniform Limit Theory. Let α(x) := p(x)/‖p(x)‖ denote the normalized value of the technical regressors p(x). Define their Lipschitz constant as:

ξd^L = sup_{x,x′∈𝒳, x≠x′} ‖α(x) − α(x′)‖ / ‖x − x′‖.

Assumption 4.6 (Tail Bounds). There exists m > 2 such that the m-th moment of |U| is bounded conditionally on X:

sup_{x∈𝒳} E[|U|^m | X = x] ≲ 1.

Assumption 4.6 bounds the tail of the distribution of the sampling error U.

Assumption 4.7 (Basis). The basis functions are well-behaved, namely (i) (ξd^L)^{2m/(m−2)} log N / N ≲ 1 and (ii) log ξd^L ≲ log d, for the same m as in Assumption 4.6.

Assumption 4.8 (Condition for Matrix Estimation). There exists a sequence εN = o(1) such that, with probability at least 1 − εN, the first-stage estimate η̂, obtained by cross-fitting (Definition 1.1), belongs to a shrinking neighborhood of η0, denoted by TN. Uniformly over TN, the following convergence holds:

κN := sup_{η∈TN} (E max_{1≤i≤N} (Yi(η) − Yi(η0))^2)^{1/2} = o(1).    (4.5)

Theorem 4.2 (Uniform Limit Theory of Orthogonal Estimator). Let Assumptions 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7 hold.

(a) The estimator is approximately linear uniformly over the domain 𝒳:

|√N α(x)′(β̂ − β0) − α(x)′ GN p(Xi)[Ui + rg(Xi)]| ≤ R1,N(α(x)),

where R1,N(α(x)), summarizing the impact of the unknown design and the first-stage error, obeys

sup_{x∈𝒳} R1,N(α(x)) ≲P √(ξd^2 log N / N)(N^{1/m} √(log N) + √d ld rd) + (BN + ΛN)(1 + √(ξd^2 log N / N)) =: R1N

uniformly over x ∈ 𝒳. Moreover,

|√N α(x)′(β̂ − β0) − α(x)′ GN p(Xi) Ui| ≤ R1,N(α(x)) + R2,N(α(x)),

where R2,N(α(x)), summarizing the impact of the misspecification error, obeys

R2,N(α(x)) ≲P √(log N) ld rd =: R2N

uniformly over x ∈ 𝒳.


(b) The estimator p(x)′β̂ of the pseudo-target p(x)′β0 converges uniformly over 𝒳:

sup_{x∈𝒳} |p(x)′(β̂ − β0)| ≲P (ξd/√N)[√(log N) + R1N + R2N].

(c) In addition, suppose Assumption 4.8 holds and R1N + R2N ≲ √(log N). Then Ω can be consistently estimated by the sample analog:

Ω̂ := Q̂^{−1} EN p(Xi)p(Xi)′(Yi(η̂) − p(Xi)′β̂)^2 Q̂^{−1}.    (4.6)

Theorem 4.2 is our second main result in the paper. Under the small bias condition, the Orthogonal Estimator achieves the oracle asymptotic linearity representation uniformly over the domain 𝒳 ⊂ R^r of the covariates of interest X.

Remark 4.1 (Optimal Uniform Rate in Hölder Class). Suppose the true function g(x) belongs to the Hölder smoothness class of order k, denoted by Σk(𝒳). Suppose ld rd ≲ d^{−k/r}, ξd ≲ √d, and R1N + R2N ≲ (log N)^{1/2}. Then the optimal number d of technical regressors that comprise the vector p(x) obeys

d ≍ (log N/N)^{−r/(2k+r)}.

This choice of d yields the optimal uniform rate:

sup_{x∈𝒳} |ĝ(x) − g(x)| ≲P (log N/N)^{k/(2k+r)}.

Our result on strong approximation by a Gaussian process plays an important role in our second inferential result, which concerns the weighted bootstrap. Consider a set of weights h1, h2, . . . , hN that are i.i.d. draws from the standard exponential distribution and are independent of the data. For each draw of such weights, define the weighted bootstrap draw of the least squares estimator as a solution to the least squares problem weighted by h1, h2, . . . , hN, namely

β̂^b ∈ arg min_{b∈R^d} EN[hi(Ŷi − p(Xi)′b)^2].

For all x ∈ 𝒳, define ĝ^b(x) = p(x)′β̂^b. The following theorem establishes the validity of the weighted bootstrap for approximating the distribution of the series process.

Theorem 4.3 (Validity of Weighted Bootstrap). (a) Let Assumption 4.6 be satisfied with m > 3. In addition, assume that R1N = oP(aN^{−1}) and aN^6 d^4 ξd^2 (1 + ld^3 rd^3)^2 log^2 N / N → 0. Then, for some 𝒩d ∼ N(0, Id),

√N α(x)′(β̂ − β0) / ‖α(x)′Ω^{1/2}‖ =d (α(x)′Ω^{1/2} / ‖α(x)′Ω^{1/2}‖) 𝒩d + oP(aN^{−1})    (4.7)

in ℓ∞(𝒳),


√N α(x)′(β̂ − β0) / ‖α(x)′Ω̂^{1/2}‖ =d (α(x)′Ω^{1/2} / ‖α(x)′Ω^{1/2}‖) 𝒩d + oP(aN^{−1})

in ℓ∞(𝒳), so that for e(x) = Ω^{1/2} p(x),

√N p(x)′(β̂ − β0) / ‖e(x)‖ =d (e(x)′ / ‖e(x)‖) 𝒩d + oP(aN^{−1})

in ℓ∞(𝒳).

(b) The weighted bootstrap process satisfies:

√N α(x)′(β̂^b − β̂) = α(x)′ GN[(hi − 1) p(Xi)(Ui + rg(Xi))] + R^b_{1,N}(α(x)),

where the remainder obeys

R^b_{1,N}(α(x)) ≲P √(ξd^2 log^3 N / N)(N^{1/m} √(log N) + √d ld rd) + (BN + ΛN)(1 + √(ξd^2 log N / N)) =: R^b_{1N}.

(c) √N p(x)′(β̂^b − β̂) / ‖e(x)‖ =d (e(x)′ / ‖e(x)‖) 𝒩d + oP(aN^{−1}) in ℓ∞(𝒳), and so

(d) √N (ĝ^b(x) − ĝ(x)) / ‖e(x)‖ =d (e(x)′ / ‖e(x)‖) 𝒩d + oP(aN^{−1}) in ℓ∞(𝒳).

Theorem 4.3(a) establishes strong approximation of α(x)′(β̂ − β0) by a Gaussian process. Theorem 4.3(b) verifies the validity of the weighted bootstrap. Consider the following t-statistic:

tN(x) := (ĝ(x) − g(x)) / σ̂(x),

where σ̂(x) = √(p(x)′Ω̂p(x)/N) and Ω̂ is defined in (4.6). Denote the bootstrapped t-statistic as

t^b_N(x) := (ĝ^b(x) − ĝ(x)) / σ̂(x)    (4.8)

and the critical value cN(1 − α) as the (1 − α)-quantile of sup_{x∈𝒳} |t^b_N(x)|. Define the uniform confidence bands for g(x) as

[i(x), ī(x)] := [ĝ(x) − cN(1 − α)σ̂(x), ĝ(x) + cN(1 − α)σ̂(x)],   x ∈ 𝒳.    (4.9)
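The following compact sketch shows how (4.6), (4.8), and (4.9) could be computed. It is our own illustration: the exponential weights and the B repetitions follow the description above, while the function names and numerical details are assumptions.

```python
# Sketch: uniform confidence bands via the weighted bootstrap (4.8)-(4.9).
import numpy as np

def uniform_bands(pX, Y_hat, p_grid, B=200, alpha=0.05, seed=0):
    """pX: N x d basis at the data; p_grid: G x d basis at evaluation points."""
    rng = np.random.default_rng(seed)
    N, d = pX.shape
    Q_hat = pX.T @ pX / N
    beta_hat = np.linalg.solve(Q_hat, pX.T @ Y_hat / N)
    resid = Y_hat - pX @ beta_hat
    # Sample-analog covariance matrix Omega_hat from (4.6).
    Sigma = (pX * resid[:, None] ** 2).T @ pX / N
    Omega = np.linalg.solve(Q_hat, np.linalg.solve(Q_hat, Sigma).T)
    sigma = np.sqrt(np.einsum("gd,de,ge->g", p_grid, Omega, p_grid) / N)
    g_hat = p_grid @ beta_hat
    # Weighted bootstrap: exponential weights, sup-t critical value.
    sup_t = np.empty(B)
    for b in range(B):
        h = rng.exponential(size=N)
        beta_b = np.linalg.solve((pX * h[:, None]).T @ pX,
                                 (pX * h[:, None]).T @ Y_hat)
        sup_t[b] = np.max(np.abs(p_grid @ (beta_b - beta_hat)) / sigma)
    c = np.quantile(sup_t, 1 - alpha)
    return g_hat, g_hat - c * sigma, g_hat + c * sigma
```

Here Y_hat is the cross-fitted orthogonal signal from the earlier sketch, so the routine produces the point estimate ĝ(x) together with the lower and upper uniform band.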


Corollary 4.1 (Uniform Validity of Confidence Bands). Let Assumptions 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8 hold. Let Assumption 4.6 be satisfied with m > 4. Let vN = (E max_{1≤i≤N} |Ui|^2)^{1/2} and κN from Assumption 4.8 obey (vN ∨ 1 + ld rd)(√(ξd^2 log N / N) + κN ∨ 1) + κN^2 = o(1/√(log N)). Furthermore, let R1N, R2N from Theorem 4.2(a) obey R1N + R2N ≲ (log N)^{−1/2}. Finally, let d, ξd, N be such that d^4 ξd^2 (1 + ld^3 rd^3)^2 log^5 N / N → 0, N^{1/m} √(ξd^2 log^2 N / N) = o(1/√(log N)), and sup_{x∈𝒳} √N rg(x)/‖p(x)‖ = o(1/√(log N)). Then

P(sup_{x∈𝒳} |tN(x)| ≤ cN(1 − α)) = 1 − α + o(1).

As a consequence, the confidence bands defined in (4.9) satisfy

P(g(x) ∈ [i(x), ī(x)] for all x ∈ 𝒳) = 1 − α + o(1).

The width of the confidence bands, 2cN(1 − α)σ̂(x), obeys

2cN(1 − α)σ̂(x) ≲P σ̂(x)√(log N) ≲ √(ξd^2 log N / N)

uniformly over x ∈ 𝒳.

5. Applications. In this section we apply the results of Section 4.1 to the empirically relevant settings described in Examples 1, 2, 3, and 5.

5.1. Continuous Treatment Effects. In the setup of Example 1, let the data vector V = (D, Z, Y^o) consist of the treatment D ∈ R, exogenous covariates Z, and the observed response Y^o = Y^D. Let X = D be the conditioning vector. Define the true values of the first-stage nuisance functions as

s0(x|z) = dP(X ≤ t | Z = z)/dt |_{t=x},
µ0(x, z) = E[Y^o | X = x, Z = z],
w0(x) = ∫_{z∈𝒵} s0(x|z) dPZ(z).

Assumption 5.1 (Unconfoundedness in the Case of Continuous Treatment). (1) Consistency: D = d implies Y^o = Y^d. (2) Overlap: there exist 0 < πmin < πmax < ∞ such that πmin ≤ s0(x|z) ≤ πmax for all (x, z) ∈ 𝒳 × 𝒵. (3) Ignorability: E[Y^d | D, Z] = E[Y^d | Z].

Under Assumption 5.1, the target function g(x) is identified as

g(x) = E[Y^x] = EZ E[Y^x | Z] = EZ E[Y^x | X = x, Z] = EZ E[Y^o | X = x, Z] = ∫ µ0(x, Z) dP(Z).


We provide sufficient low-level conditions on the functions s(x|z), µ(x, z), w(x) such that the pointwise and uniform Gaussian approximations of the target function g(x) (Theorems 4.1 and 4.2) hold.

Definition 5.1 (First-Stage Rate for Continuous Treatment Effects). Assume that there exist a sequence of numbers εN = o(1) and sequences of neighborhoods SN of s0(·|·), MN of µ0(·, ·), and WN of w0(·) such that the first-stage estimate

(ŝ(·|·), µ̂(·, ·), ŵ(·))

belongs to the set SN × MN × WN with probability at least 1 − εN. These neighborhoods shrink at the following rates:

sN := sup_{s∈SN} (E(s(X|Z) − s0(X|Z))^2)^{1/2},
mN := sup_{µ∈MN} (E(µ(X, Z) − µ0(X, Z))^2)^{1/2},
wN := sup_{w∈WN} (E(w(X) − w0(X))^2)^{1/2}.

Definition 5.1 introduces L2-convergence rates for the three nuisance functions: the conditional density s(X|Z), the regression function µ(X, Z), and the marginal density w(X).

Assumption 5.2 (Assumptions on the First-Stage Rate). Let sN, mN, wN be as in Definition 5.1. We assume that the rates sN, mN, wN decay sufficiently fast: √N √d (mN sN ∨ mN wN) = o(1) and ξd(sN ∨ mN ∨ wN) = o(1). Furthermore, assume that there exists a constant C < ∞ that bounds the functions in MN, SN, and WN uniformly over their domains:

sup_{µ∈MN} sup_{(x,z)∈𝒳×𝒵} |µ(x, z)| < C,   sup_{s∈SN} sup_{(x,z)∈𝒳×𝒵} max{s(x|z), 1/s(x|z)} < C,   sup_{w∈WN} sup_{x∈𝒳} 1/w(x) < C.

Corollary 5.1 (Continuous Treatment Effects). Let Assumptions 5.1 and 5.2 hold. Then the orthogonal signal Y(η), defined in (2.2), satisfies Assumption 4.5. Therefore, the statements of Theorems 4.1, 4.2, 4.3 and Corollary 4.1 hold.

The orthogonal signal in Corollary 5.1 takes the form

Yi(η) = w(Xi)(Y^o_i − µ(Xi, Zi)) / s(Xi|Zi) + ∫ µ(Xi, Z) dP(Z).    (5.1)

Its second summand ∫ µ(Xi, Z) dP(Z) involves an integral with respect to the unknown distribution P(Z) and needs to be estimated. Below we give a feasible estimator Y†i(η̂) of the signal Yi(η̂):

Y†i(η̂) = ŵ(Xi)(Y^o_i − µ̂(Xi, Zi)) / ŝ(Xi|Zi) + (1/(n − 1)) ∑_{j∈Jk, j≠i} µ̂(Xi, Zj),   i ∈ Jk,    (5.2)


where the second summand involves the sample average over the data (Vj)_{j∈Jk} excluding observation i, and µ̂(x, z) is estimated on Jk^c. Define the feasible estimator β̂† of the Best Linear Predictor as:

β̂† = Q̂^{−1} (1/N) ∑_{i=1}^{N} p(Xi) Y†i(η̂).    (5.3)
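The leave-one-out average in (5.2) can be computed by evaluating µ̂ on the full cross of (Xi, Zj) within the fold. The fragment below is our own sketch: mu_hat, s_hat, w_hat are assumed first-stage estimates fitted on Jk^c that accept array arguments, and all names are illustrative.

```python
# Sketch of the feasible signal (5.2) on one fold J_k and the estimator (5.3).
import numpy as np

def feasible_signal_fold(Yo_k, X_k, Z_k, mu_hat, s_hat, w_hat):
    n = len(Yo_k)
    resid_term = w_hat(X_k) * (Yo_k - mu_hat(X_k, Z_k)) / s_hat(X_k, Z_k)
    # Leave-one-out average (1/(n-1)) sum_{j != i} mu_hat(X_i, Z_j):
    # build the n x n matrix M_ij = mu_hat(X_i, Z_j), then drop the diagonal.
    M = np.array([mu_hat(np.repeat(X_k[i], n), Z_k) for i in range(n)])
    loo_mean = (M.sum(axis=1) - np.diag(M)) / (n - 1)
    return resid_term + loo_mean

def feasible_blp(pX, Y_dagger):
    """beta_dagger from (5.3): Q_hat^{-1} (1/N) sum_i p(X_i) Y_i^dagger."""
    Q_hat = pX.T @ pX / len(Y_dagger)
    return np.linalg.solve(Q_hat, pX.T @ Y_dagger / len(Y_dagger))
```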

Since µ(X, Z) enters (2.2) linearly, the error term Y†i(η̂) − Yi(η̂) does not introduce bias into β̂†, but it introduces an extra summand in its asymptotic linearity representation. We demonstrate this representation below. For a given function µ(x, z), define its demeaned analog as

µ^0(x, z) := µ(x, z) − ∫ µ(x, Z) dP(Z).

Observe that

β̂† = β̂ + Q̂^{−1} (1/K) ∑_{k=1}^{K} (1/(n(n − 1))) ∑_{i,j∈Jk, i≠j} p(Xi) µ̂^0(Xi, Zj)
    =: β̂ + Q̂^{−1} (1/K) ∑_{k=1}^{K} (1/(n(n − 1))) ∑_{i,j∈Jk, i≠j} τ(Vi, Vj; µ̂).

Conditionally on the data (Vj)_{j∈Jk^c}, (1/(n(n − 1))) ∑_{i,j∈Jk, i≠j} τ(Vi, Vj; µ̂) is a U-statistic of order 2 with the kernel function

τ(Vi, Vj; µ) = (1/2)(p(Xi) µ^0(Xi, Zj) + p(Xj) µ^0(Xj, Zi)).

Its Hájek projection function is equal to

τ1(v; µ) := E τ(v, Vi; µ) = E p(X) µ^0(X, z) = τ1(z; µ),

where z ⊂ v is a subvector of v.

Theorem 5.1 (Pointwise and Uniform Limit Theory for Continuous Treatment Effects). Suppose Assumptions 4.1, 4.2, 4.3, 4.4, 5.1 and 5.2 hold. Then the following statements hold.

(a) For any α ∈ S^{d−1} := {α ∈ R^d : ‖α‖ = 1}, the estimator β̂† is approximately linear:

√N α′(β̂† − β0) = α′Q^{−1} GN[p(Xi)(Ui + rg(Xi)) + τ1(Zi; µ0)] + R1,N(α) + R†1,N(α),

where R†1,N(α), summarizing the remainder of the U-statistic projection and the first-stage error, obeys

R†1,N(α) ≲P N^{−1/2} C(√d + ξd N^{−1})
+ √N √d ζd √(ξd^2 log N / N) ((log d / N) + (log d / N)^{3/2} + (log d / N)^{5/4})
+ mN (1 + √d √(ξd^2 log N / N)).


(b) Let

Σ† = E[p(X)(U + rg(X)) + τ1(Z; µ0)][p(X)(U + rg(X)) + τ1(Z; µ0)]′

and

Ω† = Q^{−1} Σ† Q^{−1}.

If R1,N(α) + R†1,N(α) = oP(1), ξd ld rd = o(√N), and the Lindeberg condition holds, sup_{x∈𝒳} E[U^2 1{|U| > M} | X = x] → 0 as M → ∞, then the estimator is approximately Gaussian pointwise:

lim_{N→∞} sup_{t∈R} | P(√N α′(β̂† − β0) / √(α′Ω†α) < t) − Φ(t) | = 0.    (5.4)

(c) In addition, suppose Assumptions 4.6 and 4.7 hold with ξd^L/Cmin > e^2/16 ∨ e. The estimator β̂† is approximately linear uniformly over the domain 𝒳:

|√N α(x)′(β̂† − β0) − α(x)′ GN[p(Xi)(Ui + rg(Xi)) + τ1(Zi; µ0)]| ≤ R1,N(α(x)) + R†1,N(α(x)),

where

R†1,N(α(x)) ≲P N^{−1/2} C(√d log N + ξd N^{−1} log^2 N)
+ √N √d ζd √(ξd^2 log N / N) ((log d / N) + (log d / N)^{3/2} + (log d / N)^{5/4})
+ √d mN (1 + √(ξd^2 log N / N)).

Estimation of the average potential outcome in the presence of a continuous treatment is the subject of a growing literature. Here we compare our Corollary 5.1 to Kennedy et al. (2017), who introduced the doubly robust score for the ATE in the presence of a continuous treatment. First, by virtue of sample splitting, Theorem 5.1 does not impose any complexity requirements on the estimator of the first-stage nuisance parameters. In particular, unlike Theorem 2 of Kennedy et al. (2017), Theorem 5.1 does not require the function class containing µ(X, Z), s(X|Z), w(X) to have bounded uniform entropy integrals. As a result, our method accommodates a wide class of modern regularized methods for estimating the first-stage parameters. For example, L2-convergence rates for the nuisance function µ0(X, Z) are available for the linear lasso, the logistic lasso (Belloni et al. (2013)), deep neural networks (Farrell et al. (2018), Schmidt-Hieber (2019)), and random forests (Wager and Walther (2015)). Third, Theorem 5.1 offers both pointwise and uniform inference about g(x), while Kennedy et al. (2017) offers only a pointwise inference result. Fourth, and foremost, Kennedy et al. (2017) employs local linear regression in the second stage and delivers convergence at a √(Nh^r) rate, where h = h(N) is the kernel bandwidth. In contrast, our method delivers a √N approximation for the normalized

Page 25: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

24 SEMENOVA AND CHERNOZHUKOV

projectionp(x)′β0

‖p(x)‖. As a result, the estimation of the integral

∫µ(x, Z)dP (Z) introduces

an extra term of asymptotic linearity representation which does not appear in kernelasymptotic approximations.

5.2. Conditional Average Treatment Effect Using the setup of Example 2, let Y 1 andY 0 be the potential outcomes, D ∈ 1, 0 indicate the presence in the treatment group,Y o = DY 1+(1−D)Y 0 be the actual outcome, s0(z) = E[D = 1|Z = z] be the propensityscore and µ0(d, z) = E[Y o|D = d, Z = z] be the conditional expectation function. Weprovide sufficient low-level conditions on the regression functions µ0(1, ·), µ0(0, ·) and thepropensity score s0(·) such that the pointwise and uniform Gaussian approximations ofthe target function g(x) (Theorems 4.1 and 4.2) hold.

Assumption 5.3 (Strong Overlap). A The propensity score is bounded above andbelow: ∃π0 > 0 0 < π0 < s0(z) < 1− π0 < 1 ∀z ∈ Z.

B The propensity score is bounded below: ∃π0 > 0 0 < π0 < s0(z) < 1 ∀z ∈ Z.

In context of Example 2 Assumption 5.3(a) ensures that the probability of assignmentto the treatment and control group is bounded away from zero. In context of Example 3Assumption 5.3(b) ensures that the probability of observing the response Y ∗ is boundedaway from zero.

Definition 5.2 (First Stage Rate for CATE with Binary Treatment). Given the truefunctions s0(·), µ0(1, ·), µ0(0, ·) and sequences of shrinking neighborhoods SN of s0(·), MN

of µ0(1, ·), MN of µ0(0, ·) define the following rates:

sN := sups∈SN

(E(s(Z)− s0(Z))2)1/2,

mN := supµ∈MN

(E(µ(1, Z)− µ0(1, Z))2)1/2 ∨ supµ∈MN

(E(µ(0, Z)− µ0(0, Z))2)1/2,

where the expectation is taken with respect to Z.

We will refer to sN as the propensity score rate and mN as the regression functionrate.

Assumption 5.4 (Assumptions on the Propensity Score and the Regression Func-tion). Assume that there exists a sequence of numbers εN = o(1) and sequences ofneighborhoods SN of s0(·), MN of µ0(1, ·) and µ0(0, ·) such that the first-stage estimates(·), µ(1, ·), µ(0, ·) belongs to the set SN×MN×MN w.p. at least 1− εN . The ratessN ,mN of Definition 5.2 decay sufficiently fast: ξd(sN ∨mN ) = o(1), and

√N√dsNmN = o(1).

Finally, assume that there exists C > 0, C <∞ that bounds the functions in MN uniformlyover their domain supµ∈MN

supz∈Z supd∈1,0 |µ(d, z)| < C sups∈SN supz∈Z1s(z) < C.

Page 26: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

25

Plausibility of Assumption 5.4 is discussed in the introduction of the paper. In casethe propensity score and regression function can be well-approximated by a logistic (lin-ear) high-dimensional sparse model, Assumption 5.4 holds under a low-level conditionsanalogous to those in Example 6.

Corollary 5.2 (Gaussian Approximation for Conditional Average Treatment Effect).Under Assumptions 5.3[A] and 5.4, Y (η), given by Equation 2.3, satisfies Assumption4.5. Then, the statements of Theorems 4.1 ,4.2, 4.3 and Corollary 4.1 hold for Condi-tional Average Treatment Effect.

5.3. Regression Function with Partially Missing Outcome Using the setup of Exam-ple 3, let Y ∗ be a partially missing outcome variable, D ∈ 1, 0 indicate the presenceof Y ∗ in the sample, Y o = DY ∗ be the observed outcome, s0(Z) = E[D = 1|Z] be thepropensity score and µ0(Z) = E[Y o|D = 1, Z] be the conditional expectation function.We provide sufficient low-level conditions on the regression functions µ(Z), s(Z) suchthat Theorems 4.1, 4.2, 4.3 hold.

Assumption 5.5 (Assumptions on the Propensity Score and the Regression Function).Assume that there exists a sequence of numbers εN = o(1) and sequences of neighborhoodsSN of s0(·), MN of µ0(·) such that the first-stage estimate s(·), µ(·) belongs to the setSN×MN w.p. at least 1− εN . The propensity score rate sN defined in Definition 5.2and the regression function rate mN := supµ∈MN

(E(µ(Z)−µ0(Z))2)1/2 decay sufficientlyfast: ξd(sN ∨mN ) = o(1), and

√N√dsNmN = o(1).

Finally, assume that there exists C > 0, C <∞ that bounds the functions in MN uniformlyover their domain supµ∈MN

supz∈Z |µ(z)| < C sups∈SN supz∈Z | 1s(z) | < C.

Corollary 5.3 (Gaussian Approximation for Regression Function with PartiallyMissing Outcome). Under Assumptions 5.3 [B] and Assumption 5.5 the orthogonalsignal Y (η), given by Equation 2.4, satisfies Assumption 4.5. Then, the statements ofTheorems 4.1 ,4.2, 4.3 and Corollary 4.1 hold for regression function with PartiallyMissing Outcome.

Here give an example of a model and and a first-stage estimator that satisfy Assump-tion 5.4.

Example 6 (Partially Missing Outcome with High-Dimensional Sparse Design). Con-sider the setup of Example 3. Let the observable vector (D,X,DY ∗) consist of the covari-ate vector of interest X and a partially observed variable Y ∗, whose presence is indicatedby D ∈ 1, 0. In addition, suppose there exists an observable vector Z such that Miss-ingnesss at Random (Assumption 2.2) is satisfied conditionally on Z. Let pµ(z), ps(z)be high-dimensional basis functions of the vector Z that approximate the conditional

Page 27: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

26 SEMENOVA AND CHERNOZHUKOV

expectation functions µ0(z), s0(z) using the linear and logistic links, respectively:

µ0(z) = pµ(z)′θ0 + rµ(z)(5.5)

s0(z) = L(ps(z)′δ0) + rs(z) :=

exp(ps(z)′δ)

exp(ps(z)′δ) + 1+ rs(z)(5.6)

where θ, δ are the vectors in Rpθ ,Rpδ whose dimensions are allowed to be larger thanthe sample size N , and rµ(z), rs(z) are the misspecification errors of the respective linkfunctions that vanish as described in Assumptions 5.6, 5.7. For each γ ∈ θ, δ, denotea support set

Tγ := j : γj 6= 0, j ∈ 1, 2, .., pγand its cardinality, which we refer to as sparsity index of γ,

sγ := |T | = ‖γ‖0 ∀ γ ∈ θ, δ.

We allow the cardinality of sδ, sθ to grow with N .Let rN ,∆N → 0 be a fixed sequence of constants approaching zero from above at a

speed at most polynomial in N : for example, rN > 1Nc for some c > 0, `N = logN . For

γ ∈ θ, δ, let cγ , Cγ , κ′γ , κ′′γ and ν ∈ [0, 1] be positive constants that do not depend on

N . Finally, let ‖V ‖PN ,2 := (N−1∑N

i=1 V2i )1/2.

Assumption 5.6 (Regularity Conditions for Linear Link). We assume that the fol-lowing standard conditions hold. (a) With probability 1−∆N , the minimal and maximalempirical RSE are bounded from below by κ′µ and from above by κ′′µ:

κ′µ 6 inf‖δ‖06sθ`N ,‖δ‖=1

‖Dpµ(Z)‖PN ,2 6 sup‖δ‖06sθ`N ,‖δ‖=1

‖Dpµ(Z)‖PN ,2 6 κ′′µ.

(b) There exists absolute constants Bθ, Cθ, cθ > 0: regressors max16j6pθ |pµ,j(Z)| 6Bθ a.s. and cθ 6 max16j6pθ Epµ,j(Z)2 6 Cθ (c) With probability 1−∆N , ENr2

µ(Zi) 6

Cθsθ log(pθ ∨N)/N . (d) Growth restriction: for some rN = o(1), log(pθ ∧N) . rNN1/3.

(e) The moments of the model are boundedly heteroscedastic: cθ 6 E[(Y − µ0(Z))2|Z] 6Cθ, max16j6pθ E[|pµ,j(Z)(Y −µ0(Z))|3 + |pµ,j(Z)Y |3] 6 Cθ. (f) With probability 1−∆N ,

max16j6pθ

[EN − E][p2µ,j(Z)(Y − µ0(Z))2 + p2

µ,j(Z)Y 2] 6 rNN−1/2.

Assumption 5.7 (Regularity Conditions for Logistic Link). We assume that thefollowing standard conditions hold. With probability 1 −∆N , the minimal and maximalempirical RSE are bounded from below by κ′s and from above by κ′′s :

κ′s 6 inf‖δ‖06sδ`N ,‖δ‖=1

‖ps(Z)′δ‖PN ,2 6 sup‖δ‖06sδ`N ,‖δ‖=1

‖ps(Z)′δ‖PN ,2 6 κ′′s .

(b) There exist absolute constants Bδ, Cδ, cδ > 0: regressors max16j6pδ |ps,j(Z)| 6 Bδ a.s.and cδ 6 max16j6pδ Eps,j(Z)2 6 Cδ (c) With probability 1−∆N , ENr2

s(Zi) 6 Cδsδ log(pδ∨N)/N . (d) Growth restriction: for some rN = o(1), log(pδ ∧ N) . rNN

1/3. (e) Withprobability 1−∆N ,

max16j6pδ

[EN − E][p2δ,j(Z)(D − s0(Z))2] 6 rN .

Page 28: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

27

Assumptions 5.6 and 5.7 are a simplification of the Assumption 6.1-6.2 in Belloni et al.(2013). The following estimators of µ0(Z) and s0(Z) are available.

Definition 5.3 (Lasso Estimator of the Regression Function). Let λ = 1.1√NΦ−1(1−

0.05/(N ∨pθ logN)) and Ψθ = diag(l1, l2, . . . , lpθ) be a diagonal matrix of data-dependent

penalty loadings chosen as in Algorithm 6.1 in Belloni et al. (2013). Define θ as a solu-tion to the following optimization problem:

θ := arg minθ∈Rpθ

ENDi(Yoi − pµ(Zi)

′θ)2 + λ‖Ψθθ‖1

and a first-stage estimate of µ as

µ(z) := pµ(z)′θ.

Definition 5.4 (Lasso Estimator of the Propensity Score). Let λ = 1.1√NΦ−1(1−

0.05/(N ∨pδ logN)) and Ψδ = diag(l1, l2, . . . , lpδ) be a diagonal matrix of data-dependentpenalty loadings chosen as in Algorithm 6.1 in Belloni et al. (2013). Let s > 0 be a positiveconstant. Define δ as a solution to the following optimization problem:

δ := arg minδ∈Rpδ

EN [log(1 + exp(ps(Zi)′δ))−Dips(Zi)

′δ] + λ‖Ψδδ‖1

and a first-stage estimate of s0 as

s(z) := max(s/2,L(ps(z)′δ)).

Lemma 5.1 (Sufficient Conditions for Assumption 5.5). Suppose Assumptions 5.6and 5.7 hold. Then, the following statements hold. (1) There exists Cθ < ∞ be such

that w.p. 1− o(1): ‖pµ(Z)′(θ − θ0)‖PN,2 6 Cθ

√sθ log pθN

and ‖θ − θ0‖1 6 Cθ

√s2θ log pθN

.

There exists Cδ < ∞ be such that w.p. 1 − o(1): ‖ps(Z)′(δ − δ0)‖PN,2 6 Cδ

√sδ log pδN

and ‖δ − δ0‖1 6 Cδ

√s2δ log pδN

.

(2) Define the nuisance realization sets MN and SN as:

MN :=

µ(z) = pµ(z)′θ : θ ∈ Rpθ : ‖pµ(Z)′(θ − θ0)‖PN,2 6 Cθ

√sθ log pθN

,(5.7)

‖θ − θ0‖1 6 Cθ

√s2θ log pθN

SN :=

s(z) = L(ps(z)

′δ) : δ ∈ Rpδ : ‖pδ(Z)′(δ − δ0)‖PN,2 6 Cδ

√sδ log pδN

,(5.8)

‖δ − δ0‖1 6 C ′δ

√s2δ log pδN

Page 29: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

28 SEMENOVA AND CHERNOZHUKOV

Then, w.p. 1−o(1), µ(·) ∈MN and the set MN shrinks at rate mN :=

√sθ log pθN

. Then,

w.p. 1− o(1), s(·) ∈ SN and the set SN shrinks at rate sN :=

√sδ log pδN

.

(3) Suppose ξd(mN ∨ sN ) = o(1) and the product of sparsity indices sθsδ grows suffi-ciently slow:

√N√dmNsN =

√d

√sθsδ log pθ log pδ

N= o(1).

Then, Assumption 5.5 holds.

5.4. Conditional Average Partial Derivative Using the setup of Example 5, let Y o bean outcome of interest, µ(x,w) := E[Y o|X = x,W = w] be a conditional expectation ofY o on X,W , and f(W |X) be the conditional density of W given X. We provide sufficientlow-level conditions on the regression functions f(W |X), µ(X,W ) such that the pointwiseand uniform Gaussian approximations of the target function g(x) (Theorems 4.1 and 4.2)hold.

Definition 5.5 (First Stage Rate). Given a true function f0(·|·), µ0(·, ·), let FN ,MN

be a sequence of shrinking neighborhoods of f0(·|·) and µ0(·, ·) constrained as follows:

fN := supf∈FN

(E(f(W |X)− f0(W |X))2)1/2 ∨ supf∈FN

(E(∂wf(W |X)− ∂wf0(W |X))2)1/2

mN := supµ∈MN

(E(µ(X,W )− µ0(X,W ))2)1/2 ∨ supµ∈MN

(E(∂wµ(X,W )− ∂wµ0(X,W ))2)1/2

where expectation is taken with respect to W,X.

We will refer to fN as the density rate and mN as regression function rate.

Assumption 5.8 (Assumptions on the Conditional Density and the Regression Func-tion). Assume that there exists a sequence of numbers εN = o(1) and a sequenceof neighborhoods FN ,MN of functions f0(·|·), µ0(·, ·) such that the first-stage estimatef(·), µ(·) belongs to the set FN×MN w.p. at least 1− εN . The rates fN ,mN definedin Definition 5.5 decay sufficient fast: ξd(fN ∨mN ) = o(1) and

√N√dfNmN = o(1).

Finally, assume that there exists C > 0 that bounds the functions in FN ,MN uniformlyover their domain:

supµ∈MN

supx,w∈X×W

|µ(x,w)| < C

and

supf∈FN

supx,w∈X×W

maxf(w|x),1

f(w|x) < C.

Corollary 5.4 (Gaussian Approximation for Conditional Average Partial Deriva-tive). Let Assumption 5.8 hold. Then, the orthogonal signal Y , given by Equation 2.4,satisfies Assumption 4.5. Then, the statements of Theorems 4.1 ,4.2, 4.3 and Corollary4.1 hold for Conditional Average Partial Derivative.

Page 30: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

29

6. Proofs

6.1. Useful Technical Lemmas In this section we collect the main technical toolsthat our analysis rely upon, namely Law of Large Numbers for Matrices, and maximalinequalities for U -statistics.

Lemma 6.1 (LLN for Matrices). Let Qi, i = 1, 2, . . . , N be i.n.i.d symmetric non-negative d×d-matrices such that d > e2 and ‖Qi‖ 6 M a.s. Let Q = 1

N

∑Ni=1 Epip′i

denote average value of population covariance matrices. Then, the following statementholds:

E‖Q−Q‖ 6√M(1 + ‖Q‖) logN

N.

In particular, if Qi = pip′i with ‖pi‖ 6 ξd, then

E‖Q−Q‖ 6

√ξ2d(1 + ‖Q‖) logN

N.

Proof. Proof can be found in Rudelson (1999).

Lemma 6.2 (Conditional Convergence Implies Unconditional). Let Xmm>1 andYmm>1 be sequences of random vectors. (i) If for εm → 0,P(‖Xm‖ > εm|Ym) →P 0,then P(‖Xm‖ > εm) → 0. In particular, this occurs if E[‖Xm‖q/εqm|Ym] →P 0 for someq > 1, by Markov inequality. (ii) Let Amm>1 be a sequence of positive constants.If ‖Xm‖ = OP (Am) conditional on Ym, namely, that for any `m → ∞,P(‖Xm‖ >`mAm|Ym) →P 0 , then Xm = OP (Am) unconditionally, namely, that for any `m →∞,P(‖Xm‖ > `mAm)→ 0.

Proof. The Lemma is a restatement of Lemma 6.1 of Chernozhukov et al. (2016a)

Lemma 6.3 (A maximal inequality for canonical U -statistics). Let (Vi)Ni=1 be a sample

of i.i.d random variables in a separable and measurable space (V,V). Let τ : V×V → Rdbe an AV×AV-measurable, symmetric, and canonical kernel such that E|τm(V1, V2)| <∞for all m ∈ 1, 2, . . . , d. Let Vn = 1

n(n−1)

∑16i 6=j6n τ(Vi, Vj),

M = max16i 6=j6n

max16m6d

|τm(Vi, Vj)|,

and Dq = max16m6d(E|τm(V1, V2)|q)1/q, q > 0. If 2 6 d 6 exp(bn) for some constantb > 0, then there exists an absolute constant K > 0 such that

E[|Vn|∞] 6 K(1 + b1/2)

log d

ND2 +

(log d

N

)5/4

D4 +

(log d

N

)3/2

‖M‖4.

Page 31: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

30 SEMENOVA AND CHERNOZHUKOV

The Lemma is a restatement of Theorem 5.1 of Chen (2018).Let F be a class of symmetric measurable functions f : Vr → R. We assume that

there is a symmetric measurable envelope F for F such that P rF 2 < ∞. Consider theassociated U -process

U (r)n (f) =

1

|In,r|∑

(i1,i2,...,ir)∈In,r

f(Vi,1, Vi,2, . . . , Vi,r), f ∈ F ,

where In,r is a collection of permutations. For each k ∈ 1, 2, . . . , r, the Hoeffdingprojection (with respect to P ) is defined by

πk(f)(x1, x2, . . . xk) := (δx1 − P ) . . . (δxk − P )P r−kf,

where Pf :=∫fdP . Let σk be any positive constant such that supf∈F ‖P r−kf‖Pk,2 6

σk 6 ‖P r−kF‖Pk,2 whenever ‖P r−kF‖Pk,2 > 0 and let σk = 0 otherwise. Finally, let Mk

be defined as

Mk = max16i6[n/k]

(P r−kF )(V iki(k−1)+1),

where V iki(k−1)+1 = (Vi(k−1)+1, Vi(k−1)+2, . . . Vik) and Jk(δ) be a uniform entropy integral

as defined in Chen and Kato (2019) (Equation (19), p.22). The following statementshold.

Lemma 6.4 (Local maximal inequalities for U -processes). Suppose that F is point-wise measurable and that Jk(1) < ∞ for k = 1, 2, . . . , r. Let δk = σk/‖P r−kF‖Pk,2 fork = 1, 2, . . . , r. Then

nk/2E[‖U (k)n (πkf)‖F ] . Jk(δk)‖P r−kF‖Pk,2 +

J2k (δk)‖Mk‖P,2

δ2k

√n

.

The Lemma is a restatement of Theorem 5.1 of Chen and Kato (2019).

Lemma 6.5 (Local maximal inequalities for U -processes indexed by VC-type classes).If F is pointwise measurable and VC-type with characteristics A > e2(r−1)/16 ∨ e andv > 1, then

nk/2E[‖U (k)n (πkf)‖F ] . σk

v log(A‖P r−kF‖Pk,2/σk)

k/2+‖Mk‖P,2√

n

v log(A‖P r−kF‖Pk,2/σk)

kThe Lemma is a restatement of Corollary 5.3. of Chen and Kato (2019).

Page 32: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

31

7. Auxiliary Lemmas

Lemma 7.1 (No Effect of First Stage error). Suppose Assumption 4.5 holds. Then,

√N‖ENpi[Yi(η)− Yi(η0)]‖ = OP (BN + ΛN ) = o(1)(7.1)

Proof. Define an event EN := ηk ∈ TN ∀k ∈ [K], such that the nuisance param-eter estimate ηk belongs to the realization set TN for each fold k ∈ [K]. By union bound,this event holds w.h.p.

P(EN ) > 1−KεN = 1− o(1).

ENpi[Yi(η)− Yi(η0)] =1

K

K∑k=1

En,kpi[Yi(η)− Yi(η0)]− E[pi[Yi(η)− Yi(η0)]|(Vi)i∈Jck

]︸ ︷︷ ︸I1,k

+ E[pi[Yi(η)− Yi(η0)]|(Vi)i∈Jck

]− Epi[Yi(η)− Yi(η0)]︸ ︷︷ ︸

I2,k

Conditionally on (Vi)i∈Jk , the estimator η = ηk is non-stochastic. On the event EN

E[‖√nI1,k‖2|EN , (Vi)i∈Jck ] 6 E[‖pi[Yi(η)− Yi(η0)]‖2|EN , (Vi)i∈Jck ]

6 supη∈TN

E‖pi[Yi(η)− Yi(η0)]‖2 = Λ2N(Assumption 4.5)

Hence,√nI1,k = OP (ΛN ) by Lemma 6.2. To bound I2,k, recognize that on the event EN

E[‖√nI2,k‖|EN , (Vi)i∈Jck ] 6 sup

η∈TN

√n‖Epi(Yi(η)− Yi(η0))|(Vi)i∈Jk‖

6 supη∈TN

√n‖Epi(Yi(η)− Yi(η0))‖ 6 BN .

Therefore,√nI2,k = OP (BN ) and

∑Kk=1

1K

√N(I1,k + I2,k) = OP (ΛN +BN ).

In what follows, let pi := p(Xi), ri := rg(Xi), Q = Epip′i, Σ := Epip′i(Ui + ri)2. Let β

be the Orthogonal Estimator of Definition 1.2 and Ui := Yi(η)−p′iβ be the estimated re-gression error. Let vN := (Emax16i6N |Ui|2)1/2 and κN := supη∈TN (Emax16i6N (Yi(η)−Yi(η0))2)1/2. By Assumption 4.8, κN = o(1). Define the estimator of the matrix Σ as

Σ := ENpip>i U2i

and Ω := Q−1ΣQ−1.

Lemma 7.2 (Matrices Estimation). Let R1N , R2N be as defined in Theorem 4.2. As-

sume that (1) R1N+R2N .√

logN and (2) (vN∨1+ldrd)(

√ξ2d logN

N +κN∨1)+κ2N = o(1).

Page 33: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

32 SEMENOVA AND CHERNOZHUKOV

Then, the following bounds hold

‖Σ− Σ‖ .P (vN ∨ 1 + ldrd)(

√ξ2d logN

N+ κN ∨ 1) + κ2

N , and

‖Ω− Ω‖ .P (vN ∨ 1 + ldrd)(

√ξ2d logN

N+ κN ∨ 1) + κ2

N .

Let σ(x) =√p(x)′Ωp(x)/N and σ(x) =

√p(x)′Ωp(x)/N . Uniformly over x ∈ X ,

∣∣∣∣ σ(x)

σ(x)− 1

∣∣∣∣ .P ‖Ω− Ω‖ .P (vn ∨ 1 + ldrd)(

√ξ2d logN

N+ κN ∨ 1) + κ2

N(7.2)

Proof of Lemma 7.2. The difference Σ− Σ can be decomposed as

Σ− Σ = ENpip′i[U2i − Ui + ri2]︸ ︷︷ ︸I1

+ENpip′iUi + ri2 − Σ︸ ︷︷ ︸I2

.

In Belloni et al. (2015), it was shown that I2 .P (vN ∨ 1 + ldrd)

√ξ2d logN

N . Therefore, itsuffices to prove a bound on I1. Recognize that

Ui = Yi(η0)− p′iβ0 + p′iβ0 − p′iβ + Yi(η)− Yi(η0)

= (Ui + ri) + (p′i(β0 − β)) + (Yi(η)− Yi(η0)) = a+ b+ c.

Plugging |(a+ b+ c)2− a2| = |2a(b+ c) + (b+ c)2| 6 2(c2 + b2 + |ac|+ |ab|) into I1 gives:

I1 := ‖ENpip′i[U2i − Ui + ri2]‖ 6 2‖ENpip′i(Yi(η)− Yi(η0))2‖

+ 2‖ENpip′i(Ui + ri)(Yi(η)− Yi(η0))‖

+ 2‖ENpip′i(p′i(β − β0))2‖

+ 2‖ENpip′i(Ui + ri)p′i(β − β0)‖

:= 2(C2 +AC +B2 +AB).

Define an event EN := ηk ∈ TN ∀k ∈ [K], such that the nuisance parameter estimateηk belongs to the realization set TN for each fold k ∈ [K]. By union bound, this eventholds w.h.p.

P(EN ) > 1−KεN = 1− o(1).

For each k ∈ 1, 2, . . . ,K, conditionally on (Vj)j∈Jck and the event EN ,

maxi∈Jk

(Yi(η)− Yi(η0))2 .P E[maxi∈Jk

(Yi(η)− Yi(η0))2|(Vj)j∈Jck , EN ] + oP (1)

. supη∈TN

Emaxi∈Jk

(Yi(η)− Yi(η0))2 = κ2n.

Page 34: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

33

Therefore,max

i∈1,2,...,N(Yi(η)− Yi(η0))2 .P Kκ

2n + oP (1).

C2 .P max16i6N

(Yi(η)− Yi(η0))2‖ENpip′i‖

.P max16i6N

(Yi(η)− Yi(η0))2OP (1) .P κ2n κ2

N

The bound on AC can be seen as

AC .P max16i6N

|Yi(η)− Yi(η0)| max16i6N

|Ui|+ |ri|‖ENpip′i‖

.P max16i6N

|Yi(η)− Yi(η0)| max16i6N

|Ui|+ |ri|OP (1)

.P (κN ∨ 1)(vN ∨ 1 + ldrd),

where we have used max16i6N |Yi(η)− Yi(η0)| 6 1 ∨max16i6N |Yi(η)− Yi(η0)|2 .P κ2N .

The bounds on B2 and AB are established in Belloni et al. (2015) as

B2 +AB .P (vN ∨ 1 + ldrd)

√ξ2d logN

N.

Collecting the bounds on C2 +AC +B2 +BC yields:

I1 6 2(C2 +AC +B2 +AB) .P (vN ∨ 1 + ldrd)(

√ξ2d logN

N+ κN ∨ 1) + κ2

N .(7.3)

Thus, the statement of Lemma follows from Theorem 4.6 of Belloni et al. (2015) and(7.3). Finally, (7.2) follows from Lemma 5.1 in Belloni et al. (2015).

7.1. Proofs of Theorems from Section 4.1

Proof of Theorem 4.1 (a).‖β − β0‖ = ‖Q−1ENpiYi(η)− β0‖

6 ‖Q−1‖‖ENpi[Yi(η)− Yi(η0)]‖︸ ︷︷ ︸S1

+ ‖Q−1‖‖ENpi[Yi(η0)− p′iβ0]‖︸ ︷︷ ︸S2

= S1 + ‖Q−1‖‖ENpi [Yi(η0)− g(xi)]︸ ︷︷ ︸Ui

‖+ ‖Q−1‖‖ENpi [g(xi)− p′iβ0]︸ ︷︷ ︸ri

= S1 + ‖Q−1‖‖ENpiUi‖+ ‖Q−1‖‖ENpiri‖

‖ENpiUi‖ .P (E[‖ENpiUi‖2

])1/2(Markov)

6 (EU2i p′ipi/N)1/2(i.i.d data)

. σ√d/N(7.4)

Page 35: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

34 SEMENOVA AND CHERNOZHUKOV

‖ENpiri‖ .P (E[‖ENpiri‖2

])1/2(Markov)

. ldrd

√E‖pi‖2N

= ldrd

√d

N(Assumption 4.3)

Alternatively,

‖ENpiri‖ .P (E[‖ENpiri‖2

])1/2(Markov)

6 ξd

√Er2

i

N= ξdrd/

√N(Assumption 4.3)

With high probability, ‖Q−1‖ 6 2‖Q−1‖ 6 2/Cmin. Lemma 7.1 implies

‖S1‖ 6 ‖Q−1‖‖ENpi[Yi(η)− Yi(η0)‖ 6 ‖Q−1‖[BN/√N + ΛN/

√N]

= oP (1√N

)

Proof of Theorem 4.1 (b). By Definition 1.2,

β = Q−1ENpiYi(η)

Decomposing

√N [ENpiYi(η)− Qβ0] =

√NENpi[Yi(η)− Yi(η0)] +

√NENpi[Yi(η0)− p′iβ0]

=√NENpi[Yi(η)− Yi(η0)] + GNpi[Yi(η0)− p′iβ0]

=√NENpi[Yi(η)− Yi(η0)] + GNpiUi + GNpiri

we obtain:

√Nα′[β − β0] =

√Nα′Q−1[ENpiYi(η)− Qβ0](7.5)

=√Nα′Q−1[ENpi[Yi(η)− Yi(η0)]](7.6)

+ α′Q−1GN [pi[ri + Ui]]

= α′Q−1GN [pi[ri + Ui]]

+ α>[Q−1 −Q−1]GN [pi(Ui + ri)]︸ ︷︷ ︸I1

+√Nα>Q−1ENpi[Yi(η)− Yi(η0)]︸ ︷︷ ︸

I2

+√Nα>[Q−1 −Q−1]ENpi[Yi(η)− Yi(η0)]︸ ︷︷ ︸

I3

Total remainder term equals:

R1,N (α) = I1 + I2 + I3

Page 36: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

35

Decomposing I1 into sampling and approximation parts:

I1 =√Nα>[Q−1 −Q−1]ENpiUi︸ ︷︷ ︸

I1,a

+α>[Q−1 −Q−1]GNpiri︸ ︷︷ ︸I1,b

Definition of regression error E[Ui|xi] = 0 and E[U2i |xi] . σ2 yields:

E[I1,a|(xi)Ni=1] = 0

E[I21,a|(xi)Ni=1] 6 α>[Q−1 −Q−1]Q[Q−1 −Q−1]ασ2

.Pξ2d logN

Nσ2( Lemma 6.1)

Therefore, I1,a = oP (

√ξ2d logN

N ). Using similar argument,

|I1,b| .P

√ξ2d logN

N[ldrd√d ∧ ξdrd]

|I2| .P ‖Q‖−1‖√NENpi[Yi(η)− Yi(η0)]‖ .P 1/Cmin[ΛN +BN ] = o(1)

|I3| .P ‖α‖‖[Q−1 −Q−1]‖‖√NENpi[Yi(η)− Yi(η0)]‖

.P

√ξ2d logN

N[ΛN +BN ] = o(1)

Therefore, with probability approaching one,

supη∈TN

‖R1,N (α)‖ .P ΛN +BN +

√ξ2d logN

N

(1 + max

ldrd√d, ξdrd,ΛN +BN

)

Proof of Theorem 4.1 (c). Proof of Theorem 4.1 (c) follows from Theorem 4.2 inBelloni et al. (2015).

Proof of Theorem 4.2(a). Similar to the Equation 7.5, define

I1(x) = α(x)>[Q−1 −Q−1]GN [pi(Ui + ri)]

I2(x) =√Nα(x)>Q−1ENpi[Yi(η)− Yi(η0)]

I3(x) =√Nα(x)>[Q−1 −Q−1]ENpi[Yi(η)− Yi(η0)]

Decompose

√Nα(x)>(β − β0) =

√Nα(x)>Q−1GN [pi[ri + Ui]] + I1(x) + I2(x) + I3(x)

Page 37: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

36 SEMENOVA AND CHERNOZHUKOV

Step 1. Bound on I1(x) is shown in Step 1 of Lemma 4.2 of Belloni et al. (2015). Wecopy their proof for completeness. Let us show that

supx∈X|α(x)>[Q−1 −Q−1]GNpiUi| .P N

1/m

√ξ2d log2N

N

Conditional on the data, let T = t1, t2, .., tN ∈ RN : ti = α(x)>(Q−1 −Q−1)piUi, x ∈X . Define the norm ‖ · ‖2N,2 =

∑Ni=1 t

2i on RN . Let (γi)

Ni=1 be independent Rademacher

random variables P(γi = 1) = P(γi = −1) = 12 . Symmetrization inequality implies:

E supx∈X|α(x)>[Q−1 −Q−1]GNpiUi| 6 2Eγ sup

x∈X|α(x)>[Q−1 −Q−1]GN [γipiUi]|

6 C

∫ θ

0(logN(ε, T, ‖‖N,2))1/2dε := J

where N(ε, T, ‖‖N,2) is the covering number of set T and

θ = 2 supt∈T‖t‖N,2 6 2‖Q−1 −Q‖‖Q‖1/2 max

16i6N|Ui|

Since for any x 6= x′:[[α(x)− α(x′)]>[Q−1 −Q−1]piUi

]L2P6 ξLd ‖x− x′‖‖Q−1 −Q−1‖‖Q‖1/2 max

16i6N|Ui|

we have for some C > 0:

N(ε, T, ‖‖N,2) 6 (‖Q−1 −Q−1‖‖Q‖1/2 max16i6N |Ui|

ε)d

∫ θ

0(logN(ε, T, ‖‖N,2))1/2dε 6 ‖Q−1 −Q−1‖‖Q‖1/2 max

16i6N|Ui|

∫ 2

0

√d log1/2(CξLd /ε)dε

By Assumption 4.6, we have

E max16i6N

|Ui| 6 (E( max16i6N

|Ui|)m)1/m 6 (EN |Ui|m)1/m 6 N1/m

By Assumption 4.2, ‖Q−Q‖ .P

√ξ2d logN

N . Finally, from Assumption 4.7,

log ξLd . log d . logN

J .P N1/m

√ξ2d log2N

N

Step 2. Observe that:

supx∈X|α(x)>[Q−1 −Q−1]GN [piri]| .P

√ξ2d logN

Nldrd√d

Page 38: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

37

Steps 1 and 2 give the bound on supx∈X I1(x).Step 3. The bound on I2(x) + I3(x) follows from Lemma 7.1

supx∈X|I2(x)| 6 sup

x∈X‖α(x)‖C−1

min‖ENpi[Yi(η)− Yi(η0)]‖

.P BN + ΛN = oP (1)

supx∈X|I3(x)| 6 sup

x∈X‖α(x)‖‖Q−1 − Q−1‖‖ENpi[Yi(η)− Yi(η0)]‖

.P BN + ΛN = oP (1)

Theorem 4.2(b) follows from Theorem 4.3 in Belloni et al. (2015). Theorem 4.2(c)follows from Lemma 7.2.

Proof of Theorem 4.3. Step 1. Note that βb solves the least squares problem forthe rescaled data

√hiYi(η),

√hipi, i = 1, 2, . . . , N. We wish to apply Theorem 4.2(a)

to the original problem (Yi(η), pi) and the weighted problem (√hiYi(η),

√hipi). Rec-

ognize that (hi)Ni=1 are independent of the data (Vi)

Ni=1, Ehi = 1, Eh2

i = 2. We gothrough the steps of the proof of Theorem 4.2(a). Let Ib1(x), Ib2(x), Ib3(x) be the analogsof I1(x), I2(x), I3(x) in the weighted problem. Belloni et al. (2015) show that

supx∈X|Ib1(x)| .P

√ξ2d log3N

N(1 +

√dldrd).

We wish to prove supx∈X |Ib2(x) + Ib3(x)| .P

√ξ2d logN

N (ΛN + BN ) = oP (1). To provethis, we re-establish Lemma 7.1 for the weighted problem. By Assumption 4.5 (1) andE[hi|(Vi)Ni=1] = E[hi] = 1,

supη∈TN

‖√nE[hipi[Yi(η)− Yi(η0)]

]. BN = o(1)

By Assumption 4.5 (2) and E[h2i |(Vi)Ni=1] = 2,

E[‖Gn,k[hipi[Yi(η)− Yi(η0)]]‖2

∣∣∣∣(Vi)i∈Jck , EN] 6 supη∈TN

E‖pihi[Yi(η)− Yi(η0)]‖2]

6 2 supη∈TN

E‖[pi[Yi(η)− Yi(η0)]‖2 . Λ2N = o(1)

Step 2. Application of Theorem 4.2(a) to the original and the weighted problem yields(4.7). By Theorem 4.5 from Belloni et al. (2015), Equation (4.7) implies the rest ofTheorem 3.3.

Page 39: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

38 SEMENOVA AND CHERNOZHUKOV

Proof of Corollary 4.1. Step 1. For aN =√

logN , we have that

‖Ω− Ω‖ .P (vN ∨ 1 + ldrd)(

√ξ2d logN

N+ κN ∨ 1) + κ2

N = o(1√

logN).

As in the proof of Theorem 4.3 and invoking Theorem 4.2(a), we can find Nd ∼ N(0, Id)so that

|√Nα(x)′(β − β0)− α(x)′Ω1/2Nd| = oP (

1√logN

).

Repeating the steps of the proof of Theorem 5.4 from Belloni et al. (2015) and invokingLemma 7.2 (b) (instead of Lemma 5.1 from Belloni et al. (2015)) gives

t(x) = t∗(x) + oP (a−1N ) in `∞(X ),

where t∗(x) =p(x)′Ω1/2Nd/

√N

σ(x), x ∈ X is the Gaussian coupling to t(x) and aN =

√logN .Step 2. Repeating the steps of the proof of Theorem 5.5 from Belloni et al. (2015)

and invoking Theorem 4.2 (instead of Lemma 5.2), Step 1(instead of Theorem 5.4), andLemma 7.2 (b) (instead of Lemma 5.1) gives

supx∈X|t(x)| = sup

x∈X|t∗(x)|+ oP (

1√logN

).

Step 3. Repeating the steps of the proof of Theorem 5.6 from Belloni et al. (2015) andinvoking Theorem 4.2, (instead of Lemma 5.2), Step 1(instead of Theorem 5.4), Step 2(instead of Theorem 5.5), and Lemma 7.2 (b) (instead of Lemma 5.1) gives the statementof the corollary.

Proof of Corollary 5.1. Let us show that Y (η) given by Equation 2.2 is an or-thogonal signal that satisfies Assumption 4.5.

Bayes rule implies that p(Z = z|X = x) =s0(x|z)p(z)w0(x)

. Plugging both statements in

Page 40: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

39

E(Y (η)− Y (η0)|X = x) gives

E (Y (η)− Y (η0)|X = x) =

∫z∈Z

µ0(x, z)− µ(x, z)

s(x|z)/w(x)

s0(x|z)w0(x)

p(z)dz

+

∫z∈Z

[µ(x, z)− µ0(x, z)]p(z)dz

=w(x)

w0(x)

∫z

(µ0(x, z)− µ(x, z))(s0(x|z)− s(x|z))s(x|z)

p(z)dz

+w(x)− w0(x)

w0(x)

∫z[µ0(x, z)− µ(x, z)]p(z)dz

+

∫z∈Z

[µ0(x, z)− µ(x, z)]p(z)dz

+

∫z∈Z

[µ(x, z)− µ0(x, z)]p(z)dz

=: S1(x) + S2(x) + S3(x) + S4(x),

where S3(x) =∫z∈Z [µ0(x, z)− µ(x, z)]p(z)dz = −S4(x). Recognize that

‖Ep(X)(Y (η)− Y (η0))‖2 =d∑j=1

(Epj(X)E[(Y (η)− Y (η0))|X])2.

For each j ∈ 1, 2, . . . , d, Cauchy-Scwartz inequality implies(E[pj(X)E[(Y (η)− Y (η0))|X]

])2

6 Epj(X)2 (E|S1(X) + S2(X)|)2

and therefore

E‖p(X)(Y (η)− Y (η0))‖2 6 ‖p(X)‖2P,2 (E|S1(X) + S2(X)|)2 6 2d([E|S1(X)|]2 + [E|S2(X)|]2),

where we have used ‖p(X)‖2P,2 = E‖p(X)‖2 . d.Define an event EN := ηk ∈ TN ∀k ∈ [K], such that the nuisance parameter

estimate ηk belongs to the realization set TN for each fold k ∈ [K]. By union bound, thisevent holds w.h.p.

P(EN ) > 1−KεN = 1− o(1).

Conditionally on this event, the individual bias terms can be bounded as follows:

[E|S1(X)|]2 6i supx∈X

(w(x)

w0(x)

)2(EX∫z

(µ0(X, z)− µ(X, z))(s0(X|z)− s(X|z))s(X|z)

p(z)dz

)2

6ii

(C

πmin

)2

C2

(EX,Z

[(µ0(X,Z)− µ(X,Z))(s0(X|Z)− s(X|Z))

w0(X)

s0(X|Z)

])2

.iii

(C2

πmin

)2( πmax

πmin

)2

m2Ns

2N

[E|S2(X)|]2 6

(1

πmin

)2( πmax

πmin

)2

(E(w(X)− w0(X))2)E[(µ(X,Z)− µ0(X,Z))]2

. w2Nm

2N ,

Page 41: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

40 SEMENOVA AND CHERNOZHUKOV

By Assumption 5.2, ‖Ep(X)(Y (η) − Y (η0))‖ =√N√doP (wNmN + sNmN ) = o(1).

Therefore, Assumption 4.5 (1) is satisfied with BN :=√N√d(wNmN + sNmN ). To

prove Assumption 4.5 (2), we decompose Y (η)− Y (η0) as follows

Y (η)− Y (η0) = (µ0(X,Z)− µ(X,Z))

(w(X)

s(X,Z)

)+ (Y o − µ0(X,Z))

(w(X)

s(X,Z)− w0(X)

s0(X|Z)

)+

∫z∈Z

(µ(X,Z)− µ0(X,Z))dP (Z)

= M1 +M2 +M3

EM21 = E(µ0(X,Z)− µ(X,Z))2 sup

x∈X

(w(x)

s(x|z)

)2

. m2N C4

EM22 6 σ2E

[w(X)− w0(X)

s(X,Z)+

w0(X)

s0(X|Z)s(X|Z)(s0(X|Z)− s(X|Z))

]2

. 2σ2C2w2N + 2σ2C2

(πmax

πmin

)2

s2N

EM23 = EX

[ ∫z∈Z

(µ(X,Z)− µ0(X,Z))dP (Z)

]2

6 EX[E[(µ(X,Z)− µ0(X,Z))

w0(X)

s0(X|Z)|X]

]2

.

(πmax

πmin

)2

m2N .

Therefore, Assumption 4.5 hold with BN :=√N√d(wNmN + sNmN ) and ΛN =

ξd(mN ∨wN ∨ sN ).

Proof of Theorem 5.1. Define an event EN := ηk ∈ TN ∀k ∈ [K], such thatthe nuisance parameter estimate ηk belongs to the realization set TN for each fold k ∈ [K].By union bound, this event holds w.h.p.

P(EN ) > 1−KεN = 1− o(1).

Fact I. For each fixed value z and v = Q−1α the following inequality holds:

|v′τ1(z;µ)| = |EX [(v′p(X)) · (µ0(X; z)))]|6 v′Qv(EX [µ0(X; z)]2)1/2 = α′Q−1α(EX [µ0(X; z)]2)1/2

6 2C−1min(EX [µ(X; z)]2 + EX [

∫µ(X;Z)dP (Z)]2)1/2

6 2√

2C−1minC,

Page 42: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

41

where the last inequality holds by Assumption 5.2.Fact II. The function τ1(z;µ) is linear in µ. In particular, we will use

τ1(z;µ− µ0) = τ1(z;µ)− τ1(z;µ0).

Fact III. For any µ ∈MN , the following inequality holds:

EZ |v′τ1(Z;µ− µ0)|2 6 4C−2min

[ ∫z∈Z

[

∫x∈X

[µ(x; z)− µ0(x; z)]2w0(x)dx]p(z)dz

+

∫x∈X

[

∫z∈Z

(µ(x; z)− µ0(x; z))p(z)dz]2w0(x)dx)]

6 8C−2minEX,Z [µ(X;Z)− µ0(X;Z)]2

w0(x)p(z)

p(x, z)

= 8C−2minEX,Z [µ(X;Z)− µ0(X;Z)]2

w0(x)

s0(x|z). 8C−2

minm2N .

Proof of Theorem 5.1[a].

Step 0. We decompose the remainder term into R†1,k and R†2,k:

R†1,N (α) :=√Nα′(β† − β − 1

N

N∑i=1

τ1(Zi;µ0))

=√Nα′Q−1 1

K

K∑k=1

[1

n(n− 1)

∑i,j∈Jk,i 6=j

τ(Vi, Vj ; µ)− 1

n

∑i∈Jk

τ1(Zi; µ)

]

+√Nα′Q−1 1

K

K∑k=1

[1

n

∑i∈Jk

τ1(Zi; µ)− τ1(Zi;µ0)

]

=:√Nα′Q−1 1

K

K∑k=1

[R†1,k +R†2,k]

=1

K

K∑k=1

[R1,k +R2,k +R′1,k +R′2,k],

where the latter terms are defined as:

R1,k :=√Nα′Q−1R†1,k, R2,k :=

√Nα′Q−1R†2,k

R′1,k :=√Nα′(Q−1 −Q−1)R†1,k, R′2,k :=

√Nα′(Q−1 −Q−1)R†2,k

Step 1. Term R1,k. Conditionally on the data (Vj)j∈Jck , R†1,k is a degenerate U -statistic

of order 2 and µ is treated as fixed. To bound R†1,k, we invoke Theorem 5.1 of Chen andKato (2019) (see Lemma 6.4) with the function class

Fα = α′Q−1τ(·, ·; µ)

Page 43: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

42 SEMENOVA AND CHERNOZHUKOV

consisting of a single element which we assume to be measurable. The entropy integralof Fα J2(δ) = δ. Its envelope Fα(·, ·) : (V, V ′) → C−1

min‖τ(V, V ′; µ)‖ is bounded in LP,2norm:

E[‖τ(V, V ′; µ)‖2|EN , (Vj)j∈Jck

]6 sup

µ∈MN

E‖τ(V, V ′;µ)‖2

6 C2 1

2(‖p(X)‖2P,2 + ‖p(X ′)‖2P,2) 6 dC2.

Furthermore, for each k ∈ 1, 2, . . . ,K

max16i6[n/2],i∈Jk

‖τ(V2(i−1)+1, V2i; µ)‖ 6 supx∈X‖p(x)‖C = ξdC.

Applying Theorem 5.1 of Chen and Kato (2019) (see Lemma 6.4) to α′Q−1R†1,k with (in

their notation) r = k = 2 and (in their notation) σ2 = ‖F‖P,2 . dC gives

nE[|α′Q−1R†1,k|

∣∣∣∣EN , (Vj)j∈Jck] . C√d+ Cn−1/2ξd.

By Markov inequality, R1 :=√Nα′Q−1R†1,k .P N−1/2C

√d + ξdN

−1C = oP (1) condi-

tionally on (Vj)j∈Jck . By Lemma 6.2, R1,k =√Nα′Q−1R†1,k = oP (1) unconditionnally.

Step 2. Term R′1,k. Recognize that R′1,k .P

√ξ2d logN

N ‖R†1,k‖.

‖R†1,k‖ 6√d‖R†1,k‖∞.(7.7)

We invoke Theorem 5.1 of Chen (2018) (restated in Lemma 6.3) for the canonical two-

sample U -statistic R†1,k conditionally on the event EN and the sample (Vj)j∈Jck . On thisevent, τ(Vi, Vj ; µ) is bounded in absolute norm:

‖R†1,k‖∞ 6 2 max16j6d

|pj(X)|C := 2ζdC.

Therefore, (in the notation of Theorem 5.1 of Chen (2018)), the constants D2, D4,M4

are bounded by ζdC. Thus,

E[‖R†1,k‖∞] . ζdC((

log d

N

)3/2

+log d

N+

(log d

N

)5/4)

and ‖R†1,k‖∞ = CζdOP((

log d

N

)3/2

+log d

N+

(log d

N

)5/4)= oP (1). Thus,

R′1,k .P

√Nζd

√dξ2d logN

NOP

((log d

N

)3/2

+log d

N+

(log d

N

)5/4).

Page 44: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

43

Step 3. Term R2,k. Conditionally on the data (Vj)j∈Jck , the remainder term R†2,k is asample average of mean zero i.i.d r.v.:

R†2,k := En,k[τ1(Zi; µ)− τ1(Zi;µ0)] =i En,kτ1(Zi; µ− µ0),

where i follows from Fact II. Thus,

E[(√nα′Q−1R†2,k)

2|EN , (Vj)j∈Jck ] 6 α′Q−1E[τ1(Zi; µ− µ0)][τ1(Zi; µ− µ0)]>Q−1α

6ii 8C−2minm

2N

ii follows from Fact III. Thus, the conditional variance is bounded as

E[(√nα′Q−1R†2,k)

2|EN , (Vj)j∈Jck ] = O(m2N ) = o(1).

By Markov inequality and Lemma 6.2, R2 =√Nα′Q−1R†2,k = OP (mN ).

Step 4. Term R′2,k. Conditionally on the data (Vj)j∈Jck , R†1,k is a degenerate U -statisticof order 2 and µ is treated as fixed. Finally, the bound on the norm follows as

E[‖√nR†2,k‖

2|EN , (Vj)j∈Jck ] 6 E[‖τ1(Zi; µ)− τ1(Zi;µ0)‖2|EN , (Vj)j∈Jck ]

6 supµ∈MN

E‖τ1(Zi;µ− µ0)‖2

6d∑j=1

Ep2j (X) sup

η∈TNE(µ0(X,Z)− µ0

0(X,Z))2 w0(X)

s0(X|Z). dm2

N .

Therefore, ‖√nR†2,k‖ = OP (

√dmN ) and R′2,k .P

√dmN

√ξ2d logN

N .Step 5. Collecting the terms gives the bound

R†1,N (α) .P (N−1/2√d+ ξdN

−1) +√Nζd

√dξ2d logN

N

log d

N+ mN

(1 +

√dξ2d logN

N

)Proof of Theorem 5.1[b]. Consider a triangular array of i.i.d mean zero r.v. χNi, i =

1, 2, . . . , N .

χNi =α′Q−1[pi(Ui + ri) + τ1(Zi;µ0)]√

N‖α′(Ω†)1/2‖.

We verify the conditions of Lindeberg’s CLT for χNi. Assumption 4.3 and Fact I imply

var(

N∑i=1

χNi) = 1 and |χNi| 6 (ξd|Ui|+ ξdldrd + C√d)/√N

Fix δ > 0 and c > 0. The Lindeberg’s condition takes the form

NEχ2Ni1|χNi|>cδ 6

2

α′Ω†α

(E(α′Q−1pi(Ui + ri))

21|χNi|>δ + E(α′Q−1τ1(Zi;µ0))21|χNi|>δ

).

2α′Q−1α

α′Ω†α

(E(sup

x∈XE[U2

i |Xi = x] + l2dr2d + dC2)1|Ui|>(δ

√N−C

√d)/ξd−ldrd

)= o(1),

Page 45: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

44 SEMENOVA AND CHERNOZHUKOV

where we have used (δ√N −

√dC)/ξd − ldrd →∞ and ldrd = o((δ

√N −

√dC)/ξd).

Proof of Theorem 5.1[c]. Step 1. Conditionally on the data (Vj)j∈Jck , R†1,k is a degen-erate U -statistic of order 2 and µ is treated as fixed. Consider a function class

F = α(x)′Q−1τ(·, ·; µ), x ∈ X

whose envelope F := C−1min‖τ(·, ·; µ)‖ is square integrable by Proof of Theorem 5.1[b](a).

We determine the bracket size. Recognize that

|[α(x)− α(x′)]>Q−1τ(V, V ′; µ)| 6 ξLd ‖x− x′‖‖Q−1‖‖τ(V, V ′; µ)‖

and therefore

supQN(F , L2(Q), ε‖F‖L2

Q) 6 (

ξLd /Cmin

ε)r

Plugging in σ = ‖F‖L2P6√dC, A = ξLd /Cmin, V = r into Corollary 5.3 of Chen and

Kato (2019) (see Lemma 6.5), we obtain:

supx∈X|√Nα(x)>Q−1R†1,k| .P N

−1/2√dC log ξLd +N−1ξd log2 ξLd C(7.8)

.P N−1/2√dC logN +N−1ξd log2N C(7.9)

Step 2. We establish the following bound

supx∈X|√Nα(x)>Q−1R†2,k| .P ‖Q−1‖‖R†2,k‖ = OP (

√dmN )

supx∈X|√Nα(x)>(Q−1 − Q−1)R†2,k| .P ‖Q−Q‖‖R†2,k‖ = OP

(√ξ2d logN

N

√dmN

)

Proof of Corollary 5.2. Let us show that Y (η) given by Equation 2.3 is an or-thogonal signal that satisfies Assumption 4.5.

Y (η)− Y (η0) = [µ(1, Z)− µ0(1, Z)][1− D

s0(Z)]︸ ︷︷ ︸

S1

− [µ(0, Z)− µ0(0, Z)][1− 1−D1− s0(Z)

]︸ ︷︷ ︸S′1

+ [s0(Z)− s(Z)][[Y 1 − µ0(1, Z)]

D

s(Z)s0(Z)︸ ︷︷ ︸S2

− [Y 0 − µ0(0, Z)]1−D

(1− s(Z))(1− s0(Z))

]︸ ︷︷ ︸

S′2

+ [s0(Z)− s(Z)][[µ(1, Z)− µ0(1, Z)]

D

s(Z)s0(Z)︸ ︷︷ ︸S3

− [µ(1, Z)− µ0(0, Z)]1−D

(1− s(Z))(1− s0(Z))

]︸ ︷︷ ︸

S′3

.

Page 46: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

45

Let us see that Assumption 4.5 (1) is satisfied with BN :=√dmNsN . The terms Si, S

′i, i ∈

1, 2 are mean zero conditionally on Z. Assumption 2.1 implies that any technicalregressor p(X) is uncorrelated with S1 + S′1 + S2 + S′2.

E[[S1 + S′1 + S2 + S′2]|Z

]= 0 ⇒ Ep(X)

[[S1 + S′1 + S2 + S′2]

]= 0

‖Ep(X)[S3 + S′3]‖2 =d∑j=1

(Epj(X)[S3 + S′3])2 6d∑j=1

Ep2j (X)[E|S3|+ E|S′3|]2

62

π20

C2d(mN )2(sN )2

Let us see that Assumption 4.5 (2) is satisfied with ΛN = ξd(min(m2N , s

2N )∨mN ∨sN ).

ES21 = E[µ(1, Z)− µ0(1, Z)]2E

[[1− D

s0(Z)]2|Z = z

].

m2N

π20

E(S′1)2 = E[µ(0, Z)− µ0(0, Z)]2E[[1− 1−D

1− s0(Z)]2|Z = z

].

m2N

π20

E[S22 + (S′2)2] = E[s(Z)− s0(Z)]2E

[[[Y 1 − µ0(1, Z)]2|Z = z

] 1

s0(Z)s2(Z)

+ [Y 0 − µ0(0, Z)]2|Z = z] 1

(1− s0(Z))(1− s(Z))2

]. s2

N

C2

π0

E[S23 + (S′3)2] .min(m2

N , s2N )C2

π0

Therefore, Assumption 4.5 (2) is satisfied with ΛN = ξd(min(m2N , s

2N ) ∨mN ∨ sN ).

Proof of Corollary 5.3. Let us show that Y (η) given by Equation 2.4 is an or-thogonal signal that satisfies Assumption 4.5.

Y (η)− Y (η0) = [µ(Z)− µ0(Z)][1− D

s0(Z)]︸ ︷︷ ︸

S1

+ [s0(Z)− s(Z)][Y − µ0(Z)]D

s(Z)s0(Z)︸ ︷︷ ︸S2

+ [s0(Z)− s(Z)][µ(Z)− µ0(Z)]D

s(Z)s0(Z)︸ ︷︷ ︸S3

Let us see that Assumption 4.5 (1) is satisfied with BN :=√dmNsN . The terms S1, S2

are mean zero conditionally on Z. Assumptions 2.2 implies that any technical regressorp(X) is uncorrelated with S1 + S2.

‖Ep(X)S3‖2 6d∑j=1

Ep2j (x)[E|S3|]2 . dm2

Ns2N

Page 47: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

46 SEMENOVA AND CHERNOZHUKOV

ES21 = E[µ(Z)− µ0(Z)]2E

[[1− D

s0(Z)]2|Z = z

]. m2

N

ES22 = E[s(Z)− s0(Z)]2E

[[Y − µ0(Z)]2|Z = z

] 1

s0(Z)s2(Z). s2

N

C2

π0

ES23 = E[s0(Z)− s(Z)]2[µ(Z)− µ0(Z)]2

1

s2(Z)s0(Z). min(m2

N , s2N )C2

π0

Therefore, Assumption 4.5 (2) is satisfied with ΛN = ξd(E(Y (η)−Y (η0))2)1/2 . ξd[mN ∨sN ].

Proof of Corollary 5.4. Let us show that Y (η) given by Equation 2.6 is an or-thogonal signal that satisfies Assumption 4.5.

Y (η)− Y (η0) = −∂w log f0(W |X)[µ(X,W )− µ0(X,W )] + ∂w[µ(X,W )− µ0(X,W )]︸ ︷︷ ︸S1

+ [∂w log f0(W |X)− ∂w log f(W |X)][Y − µ0(X,W )]︸ ︷︷ ︸S2

+ [∂w log f0(W |X)− ∂w log f(W |X)][µ(X,W )− µ0(X,W )]︸ ︷︷ ︸S3

Let us see that Assumption 4.5 (1) is satisfied with BN :=√dmN fN . The terms S1, S2 are

mean zero conditionally on Z. Since X ⊂ Z, any technical regressor p(X) is uncorrelatedwith S1 + S2

‖Ep(X)S3‖2 = ‖Ep(X)S3‖2 6d∑j=1

Ep2j (x)[E|S3|]2 . df2Nm

2N

ES21 6 E(−∂w log f0(W |X)[µ(X,W )− µ0(X,W )])2 + E(∂w[µ(X,W )− µ0(X,W )])2 . m2

N

ES22 6 (E

[[Y − µ0(X,W )]2|X,W

])f2N . f2N

ES23 6 (E[µ(X,W )− µ0(X,W )]2)f2N . min[f2N ∨m2

N ]

Therefore, Assumption 4.5 (2) is satisfied with ΛN = ξd(E(Y (η)−Y (η0))2)1/2 . ξd[mN ∨fN ].

7.2. Proof of Lemma 5.1

Proof. Step 1. Define Y = DY , pµ(Z) = Dpµ(Z), rµ(Z) = Drµ(Z), and ε = D[Y −µ0(Z)]. Here we verify the conditions of Theorem 6.1 from Belloni et al. (2013) for themodel Y , pµ(Z), rµ(Z), ε. Let us show that the original coefficient θ0, defined in Equation(5.5), satisfies

Y = pµ(Z)′θ0 + rµ(Z) + ε, E[ε|D] = 0

Page 48: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

47

Indeed,

E[Y |D,Z] = E[DY |D,Z] = DE[Y |Z,D] = D[µ0(Z)] = Dpµ(Z)′θ0 +Dr(Z)

= pµ(Z)′θ0 + rµ(Z).

E[ε|D] = E[E[[Y − pµ(Z)′θ0 − rµ(Z)]|D,Z

]|D] = 0

Recognize that Assumption 5.6 implies that an analog of Assumption 5.6 holds forpµ(Z), Y . Assumption 5.6 (a) directly assumes bounded Restricted Sparse Eigenaval-ues for Observed Regressors pµ(Z) = pµ(Z)D. Assumption 5.6 (b) is satisfied withEpµ,j(Z)2 = EDp2

µ,j(Z) > sEp2µ,j(Z) =: c′ where c′ := cs is the new lower bound on

the moments of Epµ,j(Z) for observed regressors. Assumption 5.6 (c) is satified with theErµ(Z)2 6 Erµ(Z)2 6 Csθ log(pθ ∨N)/N . By Theorem 6.1 of Belloni et al. (2013), w.p.

1 − o(1), pµ(·)′θ ∈ MN . Under Assumption 5.7, Theorem 6.2 of Belloni et al. (2013)

implies that w.p. 1− o(1), L(ps(·)′δ) ∈ SN .Step 2. Let C ′θ = (C2

θ +√

2Bθ)1/2. For any θ ∈ Rpθ such that pµ(·)′θ ∈ MN the

following inequality holds:

EZ(pµ(Z)′(θ − θ0))2 6 ‖θ − θ0‖2PN,2+∣∣(θ − θ0)>[ENpµ(Zi)p

>µ (Zi)− Epµ(Zi)p

>µ (Zi)](θ − θ0)

∣∣6 ‖θ − θ0‖2PN,2 + ‖θ − θ0‖21 max

16i,j6pθ

∣∣∣∣ENpi,µ(Zi)p>j,µ(Zi)− Epi,µ(Zi)p

>j,µ(Zi)

∣∣∣∣.iP C

2θm

2N +Bθ

√s2θ log pθN

2 log pθN

.P (C ′θ)2m2

N ,

where i follows by McDiarmid maximal inequality for bounded random variables. There-fore, the nuisance realization set MN shrinks at rate mN .

Step 3. Observe that (EZ(L(ps(Z)′δ)−L(ps(Z)′δ0))2)1/2 6 supt∈R L′(t)(EZ(ps(Z)′(δ−δ0))2)1/2 6 1

4(EZ(ps(Z)′(δ − δ0))2)1/2. By the argument similar to Step 2, the nuisancerealization set SN shrinks at rate sN .

References

Abrevaya, J., Hsu, Y.-C., and Lieli, R. (2015). Estimating conditional average treatment effects. Journalof Business and Economic Statistics, pages 485–505.

Athey, S. and Imbens, G. (2015). Recursive partitioning for heterogeneous causal effects.https://arxiv.org/abs/1504.01132.

Belloni, A., Chernozhukov, V., Chetverikov, D., and Fernandez-Val, I. (2011). Conditional quantileprocesses based on series or many regressors.

Belloni, A., Chernozhukov, V., Chetverikov, D., and Kato, K. (2015). Some new asymptotic theory forleast squares series: Pointwise and uniform results. Journal of Econometrics, 186(2):345–366.

Belloni, A., Chernozhukov, V., Fernandez-Val, I., and Hansen, C. (2013). Program evaluation and causalinference with high-dimensional data. arXiv preprint arXiv:1311.2645.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selectionamongst high-dimensional controls. Journal of Economic Perspectives, 28(2).

Belloni, A., Chernozhukov, V., and Wei, Y. (2016). Post-selection inference for generalized linear modelswith many controls. Journal of Business & Economic Statistics, 34(4):606–619.

Page 49: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

48 SEMENOVA AND CHERNOZHUKOV

Blundell, R., Horowitz, J., and Parey, M. (2012). Measuring the price responsiveness of gasoline demand:Economic shape restrictions and nonparametric demand estimation. Quantitative Economics, (3):29–51.

Buhlmann, P. and van der Geer, S. (2011). Statistics for high-dimensional data. Springer Series inStatistics.

Chen, X. (2018). Gaussian and bootstrap approximations for high-dimensional u-statistics and theirapplications. https://projecteuclid.org/euclid.aos/1522742432, 46(2):642–678.

Chen, X. and Christensen, T. (2015). Optimal uniform convergence rates and asymptotic normality forseries estimators under weak dependence and weak conditions. Journal of Econometrics, 188:447–465.

Chen, X. and Kato, K. (2019). Jackknife multiplier bootstrap: Finite sample approximations to theu-process supremum with applications. doi:10.1007/s00440-019-00936-y.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., et al. (2016a). Double machinelearning for treatment and causal parameters. arXiv preprint arXiv:1608.00060.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., and Newey, W. (2016b). Locally robust semipara-metric estimation.

Colangelo, K. and Lee, Y.-Y. (2019). Double debiased machine learning nonparametric inference withcontinuous treatments. https://www.ifs.org.uk/uploads/CW5419-Double-debiased-machine-learning-nonparametric-inference-with-continuous-treatments.pdf.

Fan, Q., Hsu, Y.-C., Lieli, R., and Zhang, Y. (2019). Estimating conditional average treatment effectswith high-dimensional data. https://arxiv.org/pdf/1908.02399.pdf.

Farrell, M., Liang, T., and Misra, S. (2018). Deep neural networks for estimation and inference: Appli-cation to causal effects and other semiparametric estimands. https://arxiv.org/abs/1809.09953.

Gill, R. and Robins, J. (2001). Causal inference for complex longitudinal data: The continuous case.https://projecteuclid.org/euclid.aos/1015345962, 29(6):1785–1811.

Graham, B. (2011). Efficiency bounds for missing data models with semiparametric restrictions. Econo-metrica, 79(2):437–452.

Graham, B., Pinto, C., and Egel, D. (2012). Inverse probability tilting for moment condition modelswith missing data. Review of Economic Studies, 79(3):1053 – 1079.

Grimmer, J., Messing, S., and Westwood, S. (2017). Estimating heterogeneous treatment effects andthe effects of heterogeneous treatments with ensemble methods. https://web.stanford.edu/ jgrim-mer/het.pdf.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of averagetreatment effects. Econometrica, 66(2):315–331.

Hausman, J. and Newey, W. (1995). Nonparametric estimation of exact consumers surplus and dead-weight loss. Econometrica, (63):1445–1476.

Hirano, K., Imbens, G., and Reeder, G. (2003). Efficient estimation of average treatment effects underthe estimated propensity score. Econometrica, 71(4):1161–1189.

Horwitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from a finiteuniverse. Journal of American Statistical Association, 47(260).

Jacob, D., Hardle, W., and Lessmann, S. (2019). Group average treatment effects for observationalstudies. https://arxiv.org/abs/1911.02688.

Kennedy, E., Ma, Z., McHugh, M., and Small, D. (2017). Nonparametric methods for doubly robust esti-mation of continuous treatment effects. Journal of the Royal Statistical Society: Series B, 79(4):1229–1245.

Luo, Y. and Spindler, M. (2016). High-dimensional l2 boosting: Rate of convergence. arXiv:1602.08927.Newey, W. (2007). Convergence rates and asymptotic normality for series estimators. Journal of Econo-

metrics, 79(1):147–168.Newey, W. (2009). Two-step series estimation of sample selection models. Econometrics Journal, 12:217–

229.Newey, W. and Stoker, T. (1993). Efficiency of weighted average derivative estimators and index models.

Econometrica, 61(5):1199–1223.Neyman, J. (1959). Optimal asymptotic tests of composite statistical hypotheses. Probability and Statis-

tics, 213(57).Oprescu, M., Syrgkanis, V., and Wu, Z. (2019). Orthogonal random forest for causal inference.

Page 50: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

49

Portnoy, S. and Koenker, R. (1989). Adaptive l-estimation for linear models. The Annals of Statistics.Robins, J. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with

missing data. Journal of American Statistical Association, 90(429):122–129.Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies

for causal effects. Biometrika, 70(1):41–55.Rudelson, M. (1999). Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–

72.Schmalensee, R. and Stoker, T. (1999). Household gasoline demand in the united states. Econometrica,

(67):645–662.Schmidt-Hieber, J. (2019). Nonparametric regression using deep neural networks with relu activation

function. https://arxiv.org/pdf/1908.00695.pdf.Wager, S. and Walther, G. (2015). Adaptive concentration of regression trees, with application to random

forests. https://arxiv.org/abs/1503.06388.Yatchew, A. and No, J. A. (2001). Household gasoline demand in canada. Econometrica, pages 1697–1709.Zimmert, M. and Lechner, M. (2019). Nonparametric estimation of causal heterogeneity under high-

dimensional confounding. https://arxiv.org/abs/1908.08779.

Page 51: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

50 SEMENOVA AND CHERNOZHUKOV

8. Simulation Evidence In this section we examine the finite sample performanceof the Orthogonal Estimator through the Monte Carlo experiments in context of Example6. We compare the Orthogonal Estimator to other more naive strategies, such as InverseProbability Weighting (IPW) and Ordinary Least Squares (OLS) on the complete dataonly. We show that under the small misspecification of a linear model, all three estimatorshave similar performance, while under larger misspecification, only Orthogonal Estimatorremains valid.

Let us describe our simulation design. Using the setup of Example 6, we generate arandom sample (Di, Xi, Zi, Y

∗i )N=500

i=1 from the following data generating process. Thecontrol vector Z, dim(Z) = 500, Z ∼ N(0, T (ρ)), ρ = 0.5 is generated from a normaldistribution N(0, T (ρ)), where the covariance matrix T (ρ) is the Toeplitz covariance ma-trix with the correlation parameter ρ = 0.5. The propensity score s0(z) = L(z′δ), where

L(t) :=exp t

exp t+ 1is the logistic function, and the parameter δ = (1, 1

2 , . . . ,1

100 , 0, . . . , 0).

The regression function µ(z) = z′γ is a linear function of the control vector, where the

parameter γ = [(1, 122, . . . , 1

(d−1)2)′, c( 1

d2, . . . , 1

300

2)′, 0, . . . , 0] and c is a design constant.

The outcome variable Y and the presence indicator D are generated by

D ∼ B(L(Z ′δ)),

Y ∗ = Z ′θ + ε, ε ∼ N(0, 1),(8.1)

whereB(p) stands for a Bernoulli draw with probability of success p. Suppose a researcheris interested in the conditional expectation of Y given the first d = 6 control variablesX = [Z1, Z2, ..., Zd]:

g(x) := E[Y ∗|X = x].

He approximates g(x) using a linear form p(x)′β, where the vector of technical transfor-mations

p(x) := (1, x)′

consists of the constant and a degree one polynomial of vector x. Let Y o = DY ∗ be theobserved outcome. Having established the setup, let us describe the estimators whosefinite sample performance we compare:

• Ordinary Least Squares: βOLS := (ENDip(Xi)p(Xi)′)−1 ENDip(Xi)Y

oi

• Inverse Probability Weighting: βIPW :=

(EN

Di

s(Zi)p(Xi)p(Xi)

′)−1

ENDi

s(Zi)p(Xi)Y

oi

• Orthogonal Estimator: βOE := (ENp(Xi)p(Xi)′)−1 ENp(Xi)

[ Di

s(Zi)[Y oi − µ(Zi)] +

µ(Zi)]

where the nonparametric estimates of the propensity score s and the regression functionµ are estimated as in Example 6 using the cross-fitting procedure in Definition 1.1.

Table 1 shows the bias, standard deviation, root mean squared error, and rejectionfrequency for Ordinary Least Squares, Inverse Probability Weighting, and OrthogonalEstimator under small misspecification, which is achieved by scaling the coefficient on theomitted controls by a small constant (c = 0.1). In that case, all the three estimators have

Page 52: Simultaneous Inference for Best Linear Predictor of the … · 2018-06-20 · Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment E ect and Other

51

small bias and good coverage property. Since the linear model is close to the true one,OLS is best linear conditionally unbiased estimator, and therefore has smaller variancethan IPW and OE.

Table 2 shows finite sample properties of IPW, OE, and OLS under large misspeci-fication, which is achieved by scaling the coefficient on the omitted controls by a smallconstant (c = 20). As expected, OLS suffers from selection bias, IPW incurs the first-order bias due to the propensity score estimation error, but OE remains valid. In thecase of large misspecification, OE achieves 8% to 100% bias reduction compared to IPW.Moreover, OE maintains valid inference and has its rejection frequency close to the nom-inal under both small and large misspecification.

OLS IPW OE OLS IPW OE OLS IPW OE OLS IPW OEBias St.Error RMSE Rej.Freq.

β1 = 1 1 0.001 -0.002 -0.014 0.049 0.060 0.061 0.049 0.060 0.063 0.080 0.080 0.090β2 = 0.5 0.005 0.003 0.004 0.101 0.126 0.124 0.102 0.126 0.125 0.067 0.090 0.100β3 = 0.3 -0.010 -0.004 -0.009 0.091 0.120 0.119 0.092 0.120 0.119 0.030 0.037 0.050β4 = 0.2 -0.000 -0.002 -0.002 0.096 0.118 0.116 0.096 0.118 0.116 0.037 0.060 0.050β5 = 0.2 0.002 0.005 0.004 0.100 0.116 0.118 0.100 0.116 0.118 0.070 0.060 0.057

Table 1Bias, St.Error, RMSE, Rejection Frequency of OLS, IPW, OE. Small misspecification. Test size

α = 0.05, design constant c = 0.1, R2 = 0.5, Number of Monte Carlo Repetitions 300.

OLS IPW OE OLS IPW OE OLS IPW OE OLS IPW OEBias St.Error RMSE Rej.Freq

β1 = 1 1 0.109 0.113 0.023 0.863 1.156 0.655 0.869 1.161 0.656 0.097 0.107 0.110β2 = 0.5 -0.233 -0.456 -0.092 1.806 2.052 1.346 1.821 2.102 1.349 0.083 0.107 0.120β3 = 0.3 0.025 0.127 -0.056 1.916 2.200 1.339 1.916 2.203 1.340 0.120 0.100 0.103β4 = 0.2 -0.060 -0.007 -0.004 1.886 2.284 1.353 1.887 2.284 1.353 0.093 0.143 0.113β5 = 0.2 0.065 0.001 -0.019 1.865 2.274 1.303 1.866 2.274 1.303 0.113 0.117 0.087

Table 2Bias, St.Error, RMSE, Rejection Frequency of OLS, IPW, OE. Large misspecification. Test size

α = 0.1, design constant c = 20, R2 = 0.2, Number of Monte Carlo Repetitions 300.