Non-Standard Rates of Convergence of
Criterion-Function-Based Set Estimators∗
JASON R. BLEVINS
The Ohio State University, Department of Economics
Working Paper 13-02
December 19, 2013†
Abstract. This paper establishes conditions for consistency and potentially non-standard
rates of convergence for set estimators based on contour sets of criterion functions. These
conditions cover the standard parametric rate n−1/2, non-standard polynomial rates such as
n−1/3, and an extreme case of arbitrarily fast convergence. We also establish the validity of a
subsampling procedure for constructing confidence sets for the identified set. We then provide
more convenient sufficient conditions on the underlying empirical processes for cube root
convergence. We show that these conditions apply to a class of transformation models under
weak semiparametric assumptions which may be partially identified due to potentially limited-
support regressors. We focus in particular on a semiparametric binary response model under a
conditional median restriction and show that a set estimator analogous to the maximum score
estimator is essentially cube-root consistent for the identified set when a continuous but possi-
bly bounded regressor is present. Arbitrarily fast convergence occurs when all regressors are
discrete. Finally, we carry out a series of Monte Carlo experiments which verify our theoretical
findings and shed light on the finite sample performance of the proposed procedures.
Keywords: partial identification, cube-root asymptotics, semiparametric models, limited sup-
port regressors, transformation model, binary response model, maximum score estimator.
JEL Classification: C13, C14, C25.
∗This paper is based in part on Chapter 2 of my Duke University dissertation. I am grateful to the members of
my committee, Han Hong, Shakeeb Khan, Paul Ellickson, and Andrew Sweeting, for their guidance and support.
I also thank Arie Beresteanu, Federico Bugni, Stephen Cosslett, Yanqin Fan, Joseph Hotz, Robert de Jong, Joris
Pinkse, Jean-François Richard, Xiaoxia Shi, Elie Tamer, and Jörg Stoye for helpful discussions and comments, and
seminar participants at Duke University, Ohio State University, the University of Illinois at Urbana-Champaign, the
University of Pittsburgh, the University of Texas at Austin, Vanderbilt University, the 2009 Triangle Econometrics
Conference and the 2010 Annual Meeting of the Midwest Econometrics Group for useful comments.
†This paper is a revision of a working paper whose first draft appeared on October 31, 2011.
1. Introduction
Relaxing distributional assumptions in econometric models is often desirable but can lead to a
failure of point identification. Fortunately, a recent and growing literature on partially identified
models has shown that in many cases we can still carry out inference about the parameters of
interest even under assumptions weaker than those known to provide point identification.1 In
particular, a criterion-function-based approach to set estimation, which has close parallels to
classical extremum estimation, has proven useful for analyzing moment equality and inequality
models. Set estimators for this class of models, under certain regularity conditions, are essentially
√n-consistent (Chernozhukov, Hong, and Tamer, 2007).
However, many interesting econometric models fall outside of this class or do not satisfy the
regularity conditions. This includes transformation models under weak semiparametric assump-
tions such as the semiparametric binary response model we discuss in detail below. As such,
this paper develops the asymptotic properties of criterion-function-based set estimators and a
subsampling-based method for obtaining confidence sets in a new class of models containing
those cases of interest. Of particular interest is a set estimator analogous to the maximum score
estimator of Manski (1975, 1985). Kim and Pollard (1990) showed that the point estimator is
cube-root-consistent and has a non-standard limiting distribution. These properties are often
perceived as disadvantages, but the maximum score estimator has proven to be important and
useful due to its robustness, both to unknown error distributions and to heteroskedasticity of
unknown form. Furthermore, recent work on classical MCMC-based inference by Jun, Pinkse,
and Wan (2011) makes point estimation and inference for this model much more accessible.
The first primary contribution of this paper is to extend the criterion-function-based set
estimation approach to cases for which non-standard rates of convergence may arise. Work on
criterion-function-based estimation and inference in partially identified models started with
Manski and Tamer (2002), who analyzed a semiparametric binary response model with interval-
valued data under a conditional quantile restriction. They derived the sharp identified set for the
model, proposed a set estimator, defined as an appropriately-chosen contour set of a modified
maximum score objective function, and showed that it was consistent. Chernozhukov, Hong,
and Tamer (2007) developed a broad framework for criterion-function-based estimation and
established general conditions for consistency and rates of convergence of estimators in this
class. They also proposed a subsampling-based procedure for obtaining confidence sets with
some pre-specified coverage probability. Subsampling-based inference was further explored
by Romano and Shaikh (2008, 2010) and Andrews and Guggenberger (2009) while Bugni (2010)
and Canay (2010) have proposed bootstrap procedures. These and many other authors have
1See Manski (2003) and Tamer (2010) and the references therein for broad surveys of this literature.
studied inference in moment equality and inequality models, including but certainly not limited
to Andrews, Berry, and Jia (2004), Pakes, Porter, Ho, and Ishii (2006), Beresteanu and Molinari
(2008), Beresteanu, Molchanov, and Molinari (2011), Imbens and Manski (2004), Stoye (2009),
Kim (2008), Khan and Tamer (2009), Menzel (2008), Andrews and Shi (2013), Andrews and Soares
(2010), and Yildiz (2012).
We work under modifications of the conditions of Chernozhukov, Hong, and Tamer (2007) in
order to analyze and perform inference in the non-standard models we consider. A mechanism
analogous to Kim and Pollard’s “sharp-edge effect” is shown to drive cube-root convergence of
set estimators in a class of partially identified models. We also characterize a class of models
in which a natural discontinuity in the limiting objective function results in arbitrarily fast
convergence. We then verify that a subsampling procedure can be used for inference in these
cases. We also derive sufficient conditions for consistency, cube root convergence, and arbitrarily
fast convergence that are easier to verify in some cases, as we illustrate later in the context of a
binary response model.
Our second main contribution is to apply our general results to study identification and
estimation of a semiparametric binary response model under conditional median independence.
Even though the assumptions of the standard maximum score model are already weak, it may
be desirable to relax them even further for additional robustness. Known conditions for point
identification require at least one regressor to be continuous and have sufficiently rich support
(Manski, 1988; Horowitz, 1998). Furthermore, when additional smoothness conditions on the
distributions of the error term and the explanatory variables are satisfied, one can estimate the
parameters at a faster rate, approaching the parametric rate, using the smoothed maximum score
estimator (Horowitz, 1992). However, when the regressors are discrete or bounded, the model
may only be partially identified. Therefore, we propose a set estimator that is closely analogous
to the maximum score point estimator and retains many of its properties: it is essentially cube-
root-consistent for the true parameter value when the model is point identified and for the
identified set otherwise. Thus, the estimator does not differentiate between point and partial
identification.
Even in the point identified case, when it is consistent for a singleton, the maximum score
estimator is effectively a set estimator because the sample criterion function is a step function.
Formally treating it as a set estimator is therefore a very natural way to proceed, but existing
conditions in the literature are not well-suited for analyzing this model because of its irregular
features. However, our aforementioned general asymptotic results allow us to analyze this more
robust set estimator for binary response models. The set estimator we propose is shown to
converge in probability at a rate arbitrarily close to n−1/3 when a continuous (but potentially
bounded) regressor is present. When all regressors are discrete, the estimator is shown to
converge arbitrarily fast to the identified set. We also establish the validity of the proposed
subsampling procedure for constructing confidence sets of a pre-specified level. Finally, we
discuss the results of several Monte Carlo experiments, which verify our theoretical findings and
shed light on the finite sample properties of the estimator.
This paper is also related to a growing literature concerned with semiparametric estimation
of models with limited support regressors, typically involving either discrete or interval-valued
regressors. Bierens and Hartog (1988) showed that there are infinitely many single-index rep-
resentations of the mean regression of a dependent variable when all covariates are discrete.
Horowitz (1998) discussed the generic non-identification of single-index and binary response
models with only discrete regressors, a result which serves to motivate our analysis. Manski
and Tamer (2002) considered partial identification and estimation of binary response models
with an interval-valued regressor. Honoré and Tamer (2006) discuss partial identification due to
the initial conditions problem in dynamic random effects discrete choice models with discrete
regressors. Magnac and Maurin (2008) considered a similar model, but in the cases of discrete
or interval-valued regressors and in the presence of a special regressor which satisfies both a
partial independence and a large support condition. Honoré and Lleras-Muney (2006) estimated
a partially identified competing risks model with interval outcome data and discrete explanatory
variables. Komarova (2008) proposed consistent estimators, based on a linear programming
procedure, of the identified set in a binary response model with discrete regressors. Finally, the
importance of the theoretical topics addressed in this paper is highlighted by empirical work
using maximum score methods, such as that of Bajari, Fox, and Ryan (2008).
The remainder of this paper is organized as follows. Section 2 contains our main results
on consistency, rates of convergence, and confidence sets in general models. More practical
sufficient conditions are given in Section 3. We then verify these sufficient conditions in the
context of a semiparametric binary response model in Section 4, developing a set estimator
analogous to the maximum score estimator. The results from a series of Monte Carlo experiments
based on these models are presented in Section 5. Section 6 concludes. Longer proofs and
auxiliary results are collected in the appendix.
2. Criterion-Function-Based Set Estimators
This paper concerns econometric models characterized by a finite vector of parameters of
interest, denoted θ, which lies in some parameter space Θ ⊂ Rk. This includes semiparametric
models with infinite-dimensional components ψ ∈ Ψ that are not of interest. For example,
ψ might include unknown functions, such as the distribution of an error term. Let P denote
the data generating process—the true distribution of observables—and let Pθ,ψ denote the
distribution induced by (θ,ψ). Suppose that the model is well-specified in the sense that there
exist primitives (θ0,ψ0) such that Pθ0,ψ0 = P . Both θ0 and ψ0 are unknown to the researcher, but
we shall focus on the case where θ0 is of primary interest.
A model is point identified if θ0 is the only element of Θ such that the model would be
consistent with the population distribution P for some ψ ∈Ψ. On the other hand, the model is
partially identified if there are multiple elements θ ∈Θ that could have generated P in the sense
that there is some ψ such that Pθ,ψ = P . The set of all such θ is the identified set and is denoted
Θ0. Formally, Θ0 depends on P and is defined as

Θ0(P) ≡ {θ ∈ Θ : ∃ψ ∈ Ψ such that Pθ,ψ = P}.
We will simply write Θ0 when the context is clear, with the dependence on P being implicit. Note
that this definition encompasses point identification, since Θ0 may be a singleton. In this case,
consistent point or set estimators will converge in probability to that singleton.
This paper focuses on set inference in models where the identified set is characterized by
some criterion function Q. Let Θ1 ≡ argmaxΘ Q denote the set of maximizers of Q. We are thus
careful to distinguish between the theoretical identified set Θ0, which is sharp by definition, and
the set to be estimated, Θ1, which is characterized by the criterion function. Typically, Θ1 can
easily be shown to be a superset of Θ0, but sharpness is not immediate and must be shown. It
may not be sharp if the criterion function does not include some information about Θ0 (e.g., it is
based on necessary conditions or implications from the model that might not fully characterize
Θ0). In the application to semiparametric binary response models, we show that Θ1 is sharp, so
that Θ1 = Θ0, but we maintain the distinction throughout for generality.
Following Manski and Tamer (2002) and the subsequent literature, we apply the analogy
principle in defining an estimator, which suggests estimating Θ1 using Θn, which is defined to be
the set of maximizers of the finite sample objective function Qn. However, taking only the set of
maximizers may result in an inconsistent estimator. Instead, we analyze a class of estimators
defined in terms of upper contour sets of Qn. Let Cn(τn) denote the upper contour set of level
τn, defined as

(1) Cn(τn) ≡ {θ ∈ Θ : Qn(θ) ≥ supΘ Qn − τn},

where τn is a non-negative “slackness” sequence which converges to zero in probability. Estimators
of this form have been used by Chernozhukov, Hong, and Tamer (2007), Romano and Shaikh
(2008, 2010), Bugni (2010), Kim (2008), Yildiz (2012), and many others.
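As a numerical illustration of the contour-set construction in (1), the following sketch computes Cn(τn) on a grid for a toy criterion whose population maximizers form an interval. The criterion, the grid, the noise level, and the choice of τn are all invented for illustration and are not from the paper.

```python
import numpy as np

def contour_set(theta_grid, Qn_values, tau_n):
    """Upper contour set C_n(tau_n) = {theta : Qn(theta) >= sup Qn - tau_n}."""
    return theta_grid[Qn_values >= Qn_values.max() - tau_n]

# Toy criterion with a flat top on [-0.25, 0.25] (a partially identified region),
# perturbed by small sampling noise.
theta_grid = np.linspace(-1.0, 1.0, 201)
Q = -np.maximum(np.abs(theta_grid) - 0.25, 0.0)   # population criterion
rng = np.random.default_rng(0)
Qn = Q + rng.normal(scale=0.005, size=Q.shape)    # sample criterion

estimate = contour_set(theta_grid, Qn, tau_n=0.05)
print(estimate.min(), estimate.max())
```

With a slackness level larger than the noise, the contour set covers the whole flat top plus a small margin, rather than just the (noise-driven) argmax.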
To discuss notions of convergence and consistency, we must be precise about which metric
space we are working in. Again following the literature, we define convergence in terms of the
Hausdorff distance, a generalization of Euclidean distance to spaces of sets. Let (Θ,d) be a metric
space where d is the standard Euclidean distance. For a pair of subsets A, B ⊂ Θ, the Hausdorff
distance between A and B is

(2) dH(A, B) ≡ max{ supθ∈B ρ(θ, A), supθ∈A ρ(θ, B) },

where ρ(θ, A) ≡ infθ′∈A d(θ, θ′) is the shortest distance from the point θ to the set A. Intuitively,
the Hausdorff distance between A and B is the farthest distance from an arbitrary point in
one of the sets to the nearest neighbor in the other set.
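For finite sets of points, the Hausdorff distance in (2) can be computed directly from its definition; a minimal sketch, with the two sets chosen arbitrarily:

```python
import numpy as np

def rho(theta, A):
    """rho(theta, A) = inf over a in A of ||theta - a||."""
    return min(np.linalg.norm(theta - a) for a in A)

def hausdorff(A, B):
    """d_H(A, B) = max{ sup_{b in B} rho(b, A), sup_{a in A} rho(a, B) }."""
    return max(max(rho(b, A) for b in B), max(rho(a, B) for a in A))

A = [np.array([0.0]), np.array([1.0])]
B = [np.array([0.0]), np.array([2.0])]
print(hausdorff(A, B))  # → 1.0, driven by the point 2.0 at distance 1 from A
```

Note that dH(A, B) = 0 requires the sets to coincide (for closed sets), which is why consistency of a set estimator is naturally stated in this metric.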
Let B^c denote the complement of a set B in Θ. In a slight abuse of notation, we also write B^ε
to denote an ε-expansion of a set B in Θ, defined as B^ε ≡ {θ ∈ Θ : ρ(θ, B) ≤ ε}. We write a ∨ b to
denote max{a, b} and a ∧ b to denote min{a, b}.
In the following we provide conditions under which the sequence of set estimates Θn ≡ Cn(τn)
for some sequence τn converges in probability to Θ1 in the Hausdorff metric, derive the rate of
convergence in various cases, and address the problem of constructing confidence sets for Θ1.
Our work in this section follows Chernozhukov, Hong, and Tamer (2007) with some small but
important departures, which we discuss below, that are needed to address the cases of interest.
2.1. Consistency
Theorem 1 provides conditions on Q, Qn, and the sequence τn to ensure that Θn ≡ Cn(τn) is
consistent for Θ1.
Assumption A1 (Compactness). Θ is a nonempty, compact subset of Rk.
Assumption A2 (Well-Separated Maximum). There exists a population criterion function Q such
that for all η > 0, there exists a δη > 0 such that supΘ\Θ1^η Q ≤ supΘ Q − δη.
Assumption A3 (Uniform Convergence). There exists a sample criterion function Qn and a
known sequence of constants an → ∞ such that supΘ |Qn − Q| = Op(1/an).
Assumptions A1 and A3 are analogous to the standard compactness and uniform conver-
gence conditions for consistency of M-estimators for singletons (cf. Amemiya, 1985; Newey
and McFadden, 1994). Assumption A2 requires the population objective function to have a
well-separated maximum. This serves to rule out pathological cases that can arise in the absence
of continuity. It is satisfied, for example, when Q is a continuous function or a step function.
Assumption A3 requires that the sample criterion function Qn converge uniformly in proba-
bility to the limiting criterion function Q. This assumption is similar to others in the literature
on set estimation in that it also requires that the rate of uniform convergence is known. In this
sense it is slightly stronger than the usual assumption for M-estimators, but this rate is easy
to determine in applications. For example, for the semiparametric binary response model of
Section 4 and any other model that satisfies the sufficient conditions of Section 3, we show that
an = √n.
Theorem 1 (Consistency). Suppose that Assumptions A1–A3 hold and let τn be a nonnegative
sequence of random variables such that τn →p 0. Then, supθ∈Θn ρ(θ, Θ1) →p 0. Furthermore, if
an τn →p ∞, then limn→∞ P(Θ1 ⊆ Θn) = 1 and dH(Θn, Θ1) →p 0.
Proof: Appendix A.
This theorem is similar to existing set consistency theorems such as Proposition 3 of Manski
and Tamer (2002) and Theorem 3.1 of Chernozhukov, Hong, and Tamer (2007), but our conditions
are different in ways that accommodate the cases of interest in this paper. For example, the
limiting objective function of Manski and Tamer (2002) is continuous, but we will consider cases
in which this function is discontinuous. Additionally, this theorem applies to cases where the
maximum of the population objective function is unknown, but in the moment equality and
inequality models considered by Chernozhukov, Hong, and Tamer (2007) the minimum of the
population objective function can be normalized to some known constant, typically zero.
Note that the first conclusion of Theorem 1 actually holds without the slackness sequence:
Θn becomes arbitrarily close to being a subset of Θ1 in probability for any sequence τn = op(1),
including τn = 0. The slackness sequence is introduced to ensure that the other inclusion holds,
that Θn covers Θ1 in probability, by slightly expanding the contour sets by an amount which
becomes negligible as n →∞. By expanding it at the right rate—with τn converging to zero
in probability, but not faster than 1/an—we ensure that Θn is large enough to cover Θ1 with
probability approaching one. Combining these two results yields consistency in the Hausdorff
metric.
As an example, suppose that Qn is known to converge to Q uniformly at the rate an = √n. In
this case we can obtain a consistent estimator by choosing τn to be a sequence which converges
to zero slower than n−1/2. Permissible choices are, for example, τn = √(ln n / n) and τn = n−0.49.
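The two requirements on the slackness sequence (τn converging to zero while an τn diverges) are easy to check numerically. A small sketch under the an = √n rate of this example, with the sample sizes chosen arbitrarily:

```python
import math

# a_n = sqrt(n): the uniform convergence rate from Assumption A3 in this example.
a = lambda n: math.sqrt(n)
tau_ok = lambda n: math.sqrt(math.log(n) / n)  # converges to 0 slower than 1/a_n
tau_bad = lambda n: n ** (-0.6)                # too fast: a_n * tau_n -> 0

for n in (10**2, 10**4, 10**6):
    print(n, round(tau_ok(n), 5), round(a(n) * tau_ok(n), 3))
```

Here a(n) * tau_ok(n) = √(ln n) grows without bound, so tau_ok is permissible, while a(n) * tau_bad(n) = n^(−0.1) vanishes, so tau_bad fails the second requirement.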
2.2. Polynomial Rates of Convergence
The rate of convergence of the Hausdorff distance dH (Θn ,Θ1) is the slowest rate at which the
component distances in (2), supθ∈Θ1 ρ(θ, Θn) and supθ∈Θn ρ(θ, Θ1), converge to zero. The second
part of Theorem 1 establishes that with only Assumptions A1–A3, the first distance equals zero
with probability approaching one. Thus, the rate of convergence of the second component
distance determines the overall rate.
In this section we consider models for which Qn satisfies a polynomial curvature condition.
In particular, when Qn(θ) is stochastically bounded from above by a polynomial in ρ(θ,Θ1)
FIGURE 1. Polynomial majorant condition with γ1 = 2.
outside of a shrinking neighborhood of Θ1, we show that the rate of convergence depends on the
curvature of the bounding polynomial, the rate at which the neighborhood shrinks, and the rate
at which τn converges to zero.
Assumption A4 (Polynomial Majorant). There exist positive constants δ, c, γ1, and γ2 with
γ1γ2 ≥ 1 such that for any ε ∈ (0,1) there are positive constants cε and nε such that for all n ≥ nε,
Qn(θ) ≤ supΘ Q − c · (ρ(θ, Θ1) ∧ δ)^γ1

uniformly on the set {θ ∈ Θ : ρ(θ, Θ1) ≥ (cε/an)^γ2} with probability at least 1 − ε.
Assumption A4 is a small but critical generalization of Condition C.2 of Chernozhukov, Hong,
and Tamer (2007). In particular, their assumption is a special case of the above where γ1 = 1/γ2
and where the maximum of Q is known to be zero. Most importantly, we allow the degree of the
bounding polynomial to differ from the exponent determining the rate at which the sequence
of neighborhoods of Θ1 shrinks. As in the case of M-estimators for point identified models, this
generalization allows for models with non-standard rates of convergence (cf. van der Vaart, 1998,
Theorem 5.52). As an example, Figure 1 illustrates a possible quadratic bounding polynomial
(γ1 = 2). Note that this bound must hold only in probability.
Theorem 2 (Rate of Convergence with a Polynomial Majorant). Suppose that Assumptions A1–A4
hold. If τn →p 0 and an τn →p ∞, then dH(Θn, Θ1) = Op(τn^γ2).
Proof: Appendix A.
Although the rate an does not appear explicitly in the conclusion of Theorem 2, the rate of
convergence of Θn depends implicitly on an because the curvature and rate constants γ1 and γ2
must be chosen relative to an in order to satisfy Assumption A4.
FIGURE 2. Constant majorant condition
2.3. Arbitrarily Fast Convergence
This section addresses a special case in which the limiting objective function Q is not continuous,
but has a discontinuity at the boundary of the set Θ1 as illustrated by Figure 2. This occurs,
for example, in semiparametric binary response models with discrete regressors, where Q is a
step function. We show that in such cases Θn converges arbitrarily fast in probability to Θ1, as
opposed to the (potentially) non-standard but polynomial rates we found above.
Assumption A4’ (Constant Majorant). There exists a δ > 0 such that Q(θ) ≤ supΘ Q − δ for all
θ ∈ Θ1^c.
Theorem 3 (Rate of Convergence with a Constant Majorant). Suppose that Assumptions A1, A3,
and A4’ hold. If τn →p 0 and an τn →p ∞, then Θn = Θ1 with probability approaching one.
Proof: Appendix A.
Thus, when Q exhibits a jump at the boundary of the identified set, the probability that the
estimate actually equals the identified set can be made arbitrarily close to one by choosing n
large enough. This is equivalent to saying that Θn converges arbitrarily fast to Θ1. That is, for any
sequence rn, including powers of n and exponential forms, rn dH(Θn, Θ1) →p 0. Also, we note that
Assumption A4’ implies Assumption A2, so in contrast to Theorem 2 for polynomial rates, the
latter assumption is no longer needed.
Intuitively, under the conditions of Theorem 3, with probability approaching one we are able
to perfectly distinguish values of θ that belong to Θ1 from those that do not. This happens
because Qn is converging uniformly to Q at a rate that is faster than the rate at which τn approaches
zero, while at the same time τn will eventually become smaller than δ, the size of the discrete
jump. The result is that the contour sets Cn(τn) become identically equal to Θ1 with probability
approaching one.
Figure 3 illustrates the notion that, due to the discrete nature of Qn(θ), there are only a finite
(though potentially very large) number of possible estimates Θn . For the realization of Qn in the
FIGURE 3. A realization of Qn and the partition of Θ it generates
figure, the contour sets determine a partition of Θ into four disjoint sets: Θ = Θ1 ∪ Θ2 ∪ Θ3 ∪ Θ4.
In the present framework, where Θn is defined to maximize Qn within some threshold τn, so that
it includes all values of θ for which Qn(θ) ≥ sup Qn − τn, there are four possible estimates: Θ2,
Θ2 ∪ Θ3, Θ2 ∪ Θ3 ∪ Θ1, and Θ2 ∪ Θ3 ∪ Θ1 ∪ Θ4.
the combinatorics of the problem dictate that the number of possibilities becomes large very
quickly. On the other hand, as n → ∞, the contour sets of Qn approach those of Q, and the set of
possible estimates contains a set equal to Θ1 with probability approaching one.
For a more concrete example, consider the objective function for the maximum score estima-
tor with discrete regressors in (7), restated slightly here:
Q(θ) = ∑x∈X P(x) [2P(y = 1 | x) − 1] 1{x′θ ≥ 0}.
The support of x is finite, so the expectation becomes a sum of indicator functions which depend
on the parameter vector θ. The population objective function is a step function in this setting,
with the size of the jump near the identified set—the value δ in Assumption A4’—being related
to the probabilities P(x) and P(y = 1 | x). We will address this estimator in more detail in
Section 4.
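To make the step-function structure concrete, here is a minimal numerical sketch of an objective of this form. The two support points, the probabilities P(x), and the choice probabilities P(y = 1 | x) below are invented purely for illustration; they are not from the paper.

```python
import numpy as np

# Toy design: x = (1, x1) with x1 in {0, 1}; all probabilities are illustrative.
support = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
P_x = [0.5, 0.5]    # P(x)
P_y1 = [0.3, 0.8]   # P(y = 1 | x)

def Q(theta):
    """Population objective: sum over x of P(x) [2 P(y=1|x) - 1] 1{x'theta >= 0}."""
    return sum(p * (2.0 * q - 1.0) * float(x @ theta >= 0.0)
               for x, p, q in zip(support, P_x, P_y1))

# theta = (intercept, slope): Q is a step function, constant on regions of theta
# that produce the same signs of x'theta across the finite support.
print(Q(np.array([-1.0, 1.5])))  # includes only the x with 2P(y=1|x)-1 > 0
print(Q(np.array([1.0, 1.0])))   # includes both support points
```

For this toy design Q takes only the values 0, −0.2, 0.1, and 0.3 as θ varies, so sup Q = 0.3 and the jump at the boundary of the maximizing region (the δ of Assumption A4’ for this design) is 0.2.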
Arbitrarily fast rates of convergence arise in other areas of econometrics. The situation is
similar to that of M-estimation with a finite parameter space, such as when estimating the sign of
a parameter, whereΘ= {−1,1}. For example, Andrews and Guggenberger (2008) illustrate a case
where the rate of convergence of the least squares estimator in a nearly-unit-root AR(1) model
is arbitrarily fast. Bhattacharya (2009) found an arbitrarily fast rate of convergence for a set
estimator in the context of treatment assignment problems where a large number of individuals
are optimally assigned to a finite number of treatments.
2.4. Confidence Sets
We now consider the problem of constructing a sequence of sets Bn which have the asymptotic
confidence property
(3) limn→∞ P(Θ1 ⊆ Bn) ≥ 1 − α

for a given value of α ∈ (0,1). This is a complex problem in general if the sets Bn are not restricted
to be members of a more tractable family of sets. As such, we focus on the case where Bn is a
sequence of upper contour sets of Qn . Under this restriction, the problem of choosing a sequence
of arbitrary sets is reduced to that of choosing a sequence of levels κn .
Now, the coverage of a particular contour set can be inferred using the following statistic, for
some scaling sequence bn > 0 (discussed below):
(4) Rn ≡ bn (supΘ Qn − infΘ1 Qn).
This statistic is of interest because it can be related directly to the coverage probability. For any
sequence of levels κn we have
P(Θ1 ⊆ Cn(κn/bn)) = P(Qn(θ) ≥ supΘ Qn − κn/bn for all θ ∈ Θ1)
= P(infΘ1 Qn ≥ supΘ Qn − κn/bn)
= P(Rn ≤ κn).
Thus, coverage probabilities of upper contour sets are related to quantiles of the distribution of
Rn .
To obtain asymptotic confidence sets, we will use subsampling to approximate quantiles
of the limiting distribution of Rn (see Politis, Romano, and Wolf (1999) for a thorough study of
the theory and methods of subsampling). We assume that an iid sample is available, so that
each subsample of size m < n is itself an iid sample. The procedure also requires that the scaling
sequence bn is chosen such that Rn converges in distribution, although the limiting distribution
need not be known.
Assumption A5 (Random Sampling). The sample consists of n independent observations from
the population distribution of observables.
Assumption A6 (Convergence of Rn). Suppose that P(Rn ≤ c) → P(R ≤ c) for some random
variable R for each continuity point c of the distribution of R.
Intuitively, if we select a sequence κn that converges to the 1−α quantile of R, denoted
c(1−α) ≡ inf{c : P(R ≤ c) ≥ 1−α}, then the sequence of contour sets Cn(κn/bn) should have
asymptotic coverage 1−α. First, we formalize this statement in the following lemma. Then we
will describe an algorithm for constructing an appropriate sequence κn and provide conditions
under which it is valid.
Lemma 1. If Assumption A6 holds, then for any sequence κn such that κn →p c(1−α) for some
α ∈ (0,1) with c(1−α) > 0,

P(Θ1 ⊆ Cn(κn/bn)) ≥ (1−α) + op(1).
Furthermore, if the distribution function of R is continuous, then the above inequality holds with
equality.
Proof of Lemma 1. Observe that

P(Θ1 ⊆ Cn(κn/bn)) = P(Rn ≤ κn) = P(R ≤ c(1−α)) + op(1) ≥ (1−α) + op(1).

The first equality holds by definition of Cn in (1) and Rn in (4) and the second by Assumption A6
and κn →p c(1−α). The inequality follows by definition of the quantile c(1−α). When the
distribution function of R is continuous, the quantile is unique and the inequality holds
with equality. ■
Finally, note that evaluating the statistic Rn requires knowledge of the unknown set Θ1. We
work under the assumption that we can approximate Rn by substituting a consistent estimator
Θn for Θ1.
Assumption A7 (Approximability of Rn). For any sequence {Θn} of subsets of Θ with
dH(Θn, Θ1) = op(an^{−γ2}), we have P(R̃n ≤ c) → P(R ≤ c) for each continuity point c of the
distribution of R, where R̃n ≡ bn (supΘ Qn − infΘn Qn).
Both Assumptions A6 and A7 must be verified in specific applications. We show that they
indeed hold for the semiparametric binary response model considered in Section 4.
Now, we provide an algorithm for constructing sequences κn with the desired asymptotic
confidence property in (3). Inference based on subsampling was suggested for moment equality
and inequality models by Chernozhukov, Hong, and Tamer (2007) and Romano and Shaikh
(2010). Our contribution is to establish that this approach, when modified appropriately, is also
valid in the class of potentially irregular models we consider. In particular, in this class the scaling
sequence bn is not the same as the rate of uniform convergence an. For
the cube-root-consistent set estimator for the binary response model in Section 4, we show that
the appropriate rate is bn = n2/3.
Algorithm 1. Suppose that the asymptotic coverage probability 1−α is given, that the sequence
τn is chosen so that the estimator Θn = Cn(τn) is consistent, and that the sequence bn satisfies
Assumption A6.
1. Choose a subsample size m < n such that m → ∞ and m/n → 0 as n → ∞. Let
Mn = (n choose m) denote the number of subsets of size m.
2. Let Qn,m,j denote the sample objective function evaluated using the j-th subsample of
size m. For each j = 1, . . . , Mn, calculate and store the statistic

(5) Rn,m,j ≡ bm (supθ∈Θ Qn,m,j(θ) − infθ∈Θn Qn,m,j(θ)).
3. Choose κn to be the 1−α quantile of the values {Rn,m,j : j = 1, . . . , Mn}.
4. Report Θn as a consistent estimate and Cn(κn/bn) as a confidence set.
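As a concrete sketch of the steps above, the following code runs the procedure on a toy one-dimensional problem with a flat-topped criterion. Everything here is an illustrative assumption rather than the paper's construction: the criterion Qn, the grid approximation of Θ, the choice bn = √n (suitable for this smooth toy problem, whereas the paper derives bn = n^(2/3) for the maximum score application), and the use of a fixed number of random subsamples in place of all (n choose m) subsets.

```python
import numpy as np

rng = np.random.default_rng(0)

def Qn(theta_grid, y):
    """Toy sample criterion: flat on [mean(y) - 0.5, mean(y) + 0.5], sloping off outside."""
    return -np.maximum(np.abs(theta_grid - y.mean()) - 0.5, 0.0)

def contour_set(theta_grid, q, level):
    return theta_grid[q >= q.max() - level]

n = 400
y = rng.normal(loc=1.0, size=n)           # identified set is roughly [0.5, 1.5] here
theta_grid = np.linspace(-1.0, 3.0, 401)

b = lambda k: np.sqrt(k)                  # scaling sequence; problem-specific in general
tau_n = np.log(n) / np.sqrt(n)            # slackness: tau_n -> 0, sqrt(n) * tau_n -> inf
q_full = Qn(theta_grid, y)
Theta_n = contour_set(theta_grid, q_full, tau_n)   # step 4: the consistent estimate

# Steps 1-3: subsample statistics (5) and their 1 - alpha quantile, drawing
# random subsamples instead of enumerating all (n choose m) subsets.
m, n_subsamples, alpha = 50, 200, 0.05
in_est = np.isin(theta_grid, Theta_n)
stats = []
for _ in range(n_subsamples):
    idx = rng.choice(n, size=m, replace=False)
    q_sub = Qn(theta_grid, y[idx])
    stats.append(b(m) * (q_sub.max() - q_sub[in_est].min()))
kappa_n = np.quantile(stats, 1.0 - alpha)

conf_set = contour_set(theta_grid, q_full, kappa_n / b(n))
print(Theta_n.min(), Theta_n.max(), conf_set.min(), conf_set.max())
```

The reported confidence set is the contour set at level κn/bn, exactly as in step 4, with the quantile κn read off the subsample distribution of the statistic in (5).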
The following theorem establishes the validity of this algorithm for obtaining the desired
sequence κn under the maintained assumptions.
Theorem 4. Suppose that Assumptions A1–A7 are satisfied and that m → ∞ and m/n → 0 as
n → ∞. Let 1−α denote the desired coverage level and let κn be the sequence produced by
Algorithm 1. Then, κn →p c(1−α).
Proof: Appendix A.
Therefore, in light of Lemma 1, the sequence κn yields a sequence of confidence sets with
the desired asymptotic coverage.
2.5. Confidence Sets with Arbitrarily Fast Convergence
Previously we considered a subsampling-based algorithm for constructing confidence sets. In the
special case of arbitrarily fast convergence, we first show that the approximability condition
of Assumption A7, required for the validity of the subsampling algorithm, easily holds in this
case. Then, we provide a characterization of general sequences Bn—not necessarily contour
sets—which have the desired asymptotic confidence property (3).
Lemma 2 (Approximability of Rn). Suppose Assumption A6 holds, let Θn be a sequence of subsets
of Θ such that limn→∞ P(Θn = Θ1) = 1, and define R̃n ≡ bn (supΘ Qn − infΘn Qn). Then
P(R̃n ≤ c) → P(R ≤ c) for each continuity point c of the distribution of R.
13
Proof of Lemma 2. Let ε > 0. Then, there exists an nε such that for all n ≥ nε, P (Θn =Θ1) ≥1−ε. Since P
(infΘn bnQn = infΘ1 bnQn
)= P (Θn =Θ1), we have∣∣Rn −Rn
∣∣ p→ 0. Furthermore, by
Assumption A6 we have Rnd→ R. Therefore, Rn
d→ R. ■
Therefore, Algorithm 1 is valid for inference in models for which there exists an arbitrarily fast
estimator ofΘ1. For a general sequence Bn satisfying (3), not necessarily a sequence of contour
sets, we have the following characterization.
Lemma 3. Let Θn and Bn be sequences of subsets of Θ and suppose that limn→∞ P(Θn = Θ1) = 1. Then Bn has the asymptotic confidence property limn→∞ P(Θ1 ⊆ Bn) ≥ 1 − α if and only if limn→∞ P(Θn ⊆ Bn) ≥ 1 − α.

Proof of Lemma 3. Notice that

P(Θ1 ⊆ Bn) = P(Θ1 ⊆ Bn, Θn = Θ1) + P(Θ1 ⊆ Bn, Θn ≠ Θ1) = P(Θn ⊆ Bn) + o(1).

The result follows by taking the limit as n tends to infinity. ■
The conclusion of this lemma is stark in the sense that in models for which we have an
estimator with an arbitrarily fast rate of convergence, any sequence of confidence sets Bn which
asymptotically covers Θ1 with probability at least 1−α must also contain Θn with probability at
least 1−α. Since Θn is itself essentially a probability-one confidence set, it is not reasonable for
Bn to be any larger than Θn . So, the practical conclusion is that any such sequence Bn should
be equal to Θn with probability 1−α. Thus, one possible sequence is Bn = Θn with probability
1−α with Bn unrestricted with probability α. It seems reasonable to simply set Bn = Θn with
probability one.
3. Sufficient Conditions
This section derives sufficient conditions which are sometimes easier to verify than the con-
ditions of the theorems above. These conditions will be used in analyzing the application
considered in Section 4, providing concrete examples of their use. Many of these conditions
are stated in terms of empirical process concepts—they are restrictions on the indexing class
of functions which generate the finite sample and population objective functions. We briefly
summarize some standard notation and definitions below, but refer the reader to Section 2 of
Pakes and Pollard (1989) for further details.
Let P be the joint distribution of all observables, denoted Z . We shall maintain Assump-
tion A5, that n iid observations of Z are available for use in estimation. Let Pn denote the
associated empirical measure. When there is no ambiguity we write Pg instead of ∫ g dP to denote the integral of some function g with respect to a measure P, and Pn g to denote n^{−1} Σ_i g(Z_i).
Let `∞(B) denote the space of uniformly bounded real-valued functions on B endowed with the
uniform metric d∞(f, g) = sup_{b∈B} |f(b) − g(b)|.
We focus here on models for which the objective functions can be expressed in terms of a class
of real-valued functions F, where for each θ ∈ Θ, Q(θ) = Pf(·,θ) and Qn(θ) = Pn f(·,θ) for f(·,θ) ∈ F. As such, we work with empirical processes indexed by classes of functions F = {f(·,θ) : θ ∈ Θ}. Alternatively, we use the parameter space Θ as the indexing set when convenient. Note that this assumption does not include objective functions of the kind considered by Chernozhukov, Hong, and Tamer (2007) and Bugni (2010), for which Q(θ) = ‖Pf(·,θ)‖_{W(θ)} for some appropriate weighting matrix W(θ).
Each such model has a different indexing class F. Let Gn = √n (Pn − P) denote the standardized empirical process indexed by F. Note that P, Pn, and Gn all map classes F to functions in ℓ∞(Θ). We work under conditions below, such as manageability of F, that are sufficient for F to be P-Donsker, meaning that Gn ⇝ G in ℓ∞(F), where ⇝ denotes weak convergence and G is a mean-zero Gaussian process indexed by F with almost surely continuous sample paths and covariance E[f(Z)g(Z)] − E[f(Z)] · E[g(Z)] for all f, g ∈ F.
For establishing the asymptotic properties of Θn , we show below that it is sufficient to verify
certain properties of F . As a practical matter, in many cases this is easier than directly verifying
the assumptions of the theorems above. We provide sufficient conditions for consistency, cube-
root convergence, and arbitrarily fast convergence.
3.1. Sufficient Conditions for Consistency
An envelope for F is a function F such that sup_F |f| ≤ F. For establishing consistency of Θn, a sufficient condition for the uniform convergence required by Assumption A3 is that F is manageable in the sense of Pollard (1989) for a square-integrable envelope F. The following conditions are sufficient for Theorem 1, yielding consistency of Θn for any sequence τn such that τn →p 0 and n^{1/2} τn →p ∞.
Assumption B1. Θ is a nonempty, compact subset of Rk and there exists a class of real-valued
functions F = { f (·,θ) : θ ∈Θ} such that Q(θ) = P f (·,θ) and Qn(θ) = Pn f (·,θ) for all θ ∈Θ.
Assumption B2. Q is piecewise continuous on Θ.
Assumption B3. F is manageable for some envelope F such that PF 2 <∞.
Lemma 4. Suppose that Assumptions B1–B3 hold. Then Assumptions A1–A3 hold with an = n1/2.
Proof of Lemma 4. Compactness of Θ implies Assumption A1 and piecewise continuity of Q
implies Assumption A2. Since F is manageable with PF 2 <∞, it follows from Corollary 3.2 of
Kim and Pollard (1990) that Assumption A3 holds with an = n1/2. ■
Hence, under Assumptions B1–B3, Θn is consistent in the sense of Theorem 1. The piecewise
continuity of Q of Assumption B2 is not always immediate, but follows from the uniform law
of large numbers when f (·,θ) is continuous in θ with probability one and dominated by some
bounded function F (Newey and McFadden, 1994, Lemma 2.4). Furthermore, Assumption B3
is satisfied in models where F is a Vapnik-Chervonenkis (VC) subgraph class, in the sense of
Dudley (1987), with constant envelope F < ∞. In particular, if F is a class of functions such that {subgraph(f) : f ∈ F} is a VC class of sets and sup_F |f| ≤ F < ∞, then F is necessarily manageable and PF² < ∞. The following lemma formalizes these results.
Lemma 5. Suppose that Assumption B1 holds. If F is a VC subgraph class such that |f(·,θ)| ≤ M for all θ ∈ Θ for a constant M < ∞, then Assumption B3 holds. In addition, if f(·,θ) is continuous in θ with probability one, then Assumption B2 holds.
Proof of Lemma 5. Since F is a VC subgraph class, Lemma 2.12 of Pakes and Pollard (1989)
implies that F is Euclidean in the sense of Nolan and Pollard (1987, Definition 8) for any valid
envelope, including the constant function F = M . Since F is Euclidean, it is also manageable for
F = M (cf. Pakes and Pollard, 1989, p. 1033). Since PF 2 = M 2 <∞, this verifies Assumption B3.
Furthermore, if f (·,θ) is continuous in θ with probability one, since it is dominated by F = M
for all θ, continuity of Q follows from Lemma 2.4 of Newey and McFadden (1994), verifying
Assumption B2. ■
3.2. Sufficient Conditions for Cube Root Convergence
For models where Q is continuous, Kim and Pollard’s heuristic for cube-root consistency of
point-estimators translates well to the partially identified case. Let Γ(θ) ≡ Q(θ)−Q(θ0) and
Γn(θ) ≡Qn(θ)−Qn(θ0). We can decompose Γn(θ) into two components, a trend and a stochastic
component: Γn(θ) = Γ(θ) + [Γn(θ) − Γ(θ)]. Suppose that the limiting objective function is approximately quadratic in the distance ρ(θ, Θ1) near Θ1: Γ(θ) = O(ρ²(θ, Θ1)). For models with a feature Kim and Pollard called the “sharp-edge effect”, the variance of the empirical process component is Op(ρ(θ, Θ1)/n). Γn is likely to fall below its maximum only where the trend overtakes the noise. Thus, the maximum is likely to occur where the standard deviation of the random component is of the same magnitude as, or larger than, the trend: that is, where √(ρ(θ, Θ1)/n) > ρ²(θ, Θ1), or equivalently ρ(θ, Θ1) < n^{−1/3}. Therefore Θn, the set of near maximizers of Γn, should lie within a neighborhood of Θ1 of order n^{−1/3}.
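The balancing step in this heuristic can be written out explicitly. Up to generic constants (suppressed here), the noise dominates the quadratic trend exactly on a cube-root neighborhood:

```latex
\begin{align*}
\underbrace{\sqrt{\rho(\theta,\Theta_1)/n}}_{\text{noise scale}}
  \;\gtrsim\; \underbrace{\rho^2(\theta,\Theta_1)}_{\text{trend}}
  &\iff \rho^{3/2}(\theta,\Theta_1) \lesssim n^{-1/2} \\
  &\iff \rho(\theta,\Theta_1) \lesssim n^{-1/3}.
\end{align*}
```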
In the partially identified case, however, ρ(θ,Θ1) is only one component of the Hausdorff
distance. Conveniently, the other component was shown to converge arbitrarily fast under fairly
weak assumptions (Theorem 1) and therefore it does not hinder the rate of convergence.
In terms of Theorem 2, the above argument corresponds to the case where γ1 = 2 and γ2 = 2/3. Since τn can be chosen arbitrarily close to n^{−1/2}, the rate of convergence can be made arbitrarily close to (n^{−1/2})^{γ2} = n^{−1/3}. More formally, the following conditions are sufficient for Θn to attain,
essentially, cube-root consistency.
Assumption B4. There exists a neighborhood Θ1^{η1} of Θ1 with η1 > 0 and a positive constant c such that Q(θ) ≤ sup_Θ Q − c ρ²(θ, Θ1) for all θ ∈ Θ1^{η1}.

Assumption B5. There exists a positive constant η2 such that for all η ≤ η2, the classes Fη ≡ {f(·,θ) : ρ(θ, Θ1) ≤ η} are uniformly manageable with PFη² = O(η), where Fη(·) ≡ sup_{Fη} |f(·,θ)| is the natural envelope of Fη.
Theorem 5. Suppose that Assumptions B1–B5 hold. Let rn = o(n^{1/3}) and choose τn ∝ rn^{−3/2}. Then dH(Θn, Θ1) = Op(rn^{−1}).
Proof: Appendix A.
Theorem 5 implies that for any sequence rn that diverges more slowly than n^{1/3}, we can choose the slackness sequence τn so that Θn converges in probability at the rate rn^{−1}. This rate can thus be made arbitrarily close to n^{−1/3} by choice of τn. In practice, under the assumptions of this section, this is achieved by setting τn close to n^{−1/2}. For example, τn ∝ n^{−0.49} is a reasonable choice.
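On a finite parameter grid, the contour-set estimator Θn = Cn(τn) = {θ ∈ Θ : Qn(θ) ≥ sup_Θ Qn − τn}, with Cn(τn) the τn-contour set of Qn, reduces to a single thresholding step. A minimal sketch, assuming grid-evaluated objective values and an illustrative proportionality constant c (both names are ours, not the paper's):

```python
import numpy as np

def contour_set_estimate(qn_values, n, c=1.0):
    """Return a boolean mask over the grid selecting the contour set
    C_n(tau_n) = {theta : Q_n(theta) >= sup Q_n - tau_n}
    with slackness tau_n = c * n**(-0.49)."""
    tau_n = c * n ** (-0.49)
    return qn_values >= qn_values.max() - tau_n
```

By construction the set always contains every maximizer of Qn, and the slackness shrinks just slowly enough that n^{1/2} τn → ∞.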
3.3. Sufficient Conditions for Arbitrarily Fast Convergence
Finally, note that the constant majorant condition of Assumption A4’ is satisfied when Q is a step
function. Therefore, the following lemma gives sufficient conditions for an arbitrarily fast rate of
convergence.
Lemma 6. Suppose that Assumptions B1–B3 hold and that Q is a step function. If τn →p 0 and n^{1/2} τn →p ∞, then for any positive sequence rn with rn → ∞, rn dH(Θn, Θ1) →p 0.
Proof of Lemma 6. As established by Lemma 4, Assumptions B1–B3 are sufficient for Assumptions A1–A3 with an = n^{1/2}. Since Q is a step function, Assumption A4′ holds for all δ < sup_Θ Q − sup_{Θ\Θ1} Q. The result follows from Theorem 3. ■
4. Semiparametric Binary Response and Transformation Models
Discrete response models have become a standard tool in applied econometrics and their prop-
erties have been studied thoroughly in the econometrics literature (McFadden, 1974; Maddala,
1983; Amemiya, 1985). There is a long line of work studying identification and estimation of
binary response models under various conditions. Once parametric models were understood,
semiparametric methods emerged to estimate models without making parametric assump-
tions about the error distribution. Such methods include maximum score (Manski, 1975, 1985;
Horowitz, 1992), distribution-free maximum likelihood (Cosslett, 1983), average derivative es-
timation (Stoker, 1986), maximum rank correlation (Han, 1987), kernel estimators (Ichimura,
1993; Klein and Spady, 1993), and instrumental variables (Lewbel, 2000). Matzkin (1992) studied
nonparametric identification and estimation of binary response models.
Manski (1988) studies identification of additively separable binary response models in the
presence of a continuous regressor. He compares the identification power of several assumptions,
showing that mean independence has no identifying power but that quantile independence can
be sufficient for point identification. As such, we focus on the binary response model under a
conditional median restriction. Estimators for this model are based on a simple rank condition.
In the point identified case, this includes the maximum score estimator of Manski (1975, 1985)
and smoothed maximum score estimator of Horowitz (1992).
Both these and related semiparametric methods above typically assume the existence of an
exogenous explanatory variable with rich support. Rank conditions have been successful in
estimating more general regression models, but the known conditions for point identification
still include a rich support condition (Han, 1987; Abrevaya, 2000). In practice, however, it is
not uncommon to encounter datasets with genuinely discrete or bounded variables. Without a
regressor with full support on the real line, under semiparametric assumptions, the models we
consider are only partially identified in general (Horowitz, 1998).
We now formalize the basic linear-index binary response model of interest. The model and
assumptions are equivalent to the maximum score model of Manski (1975, 1985), but without
the support and rank assumptions on x. We assume that the conditional median of the error
term u is zero; we use the median only for simplicity, and the restriction could be placed on any other quantile instead (Manski, 1988). In contrast to Magnac and Maurin (2008),
we make no additional assumptions on the support of u.
Model 1 (Semiparametric Binary Response Model). Let the binary outcome variable y ∈ {0,1} be
defined as
y = 1{x ′θ+u ≥ 0}
where x is a random variable with support X ⊆Rk , and θ is the parameter of interest, a member
of some parameter space Θ ⊂ Rk. The distribution of u satisfies Med(u | x) = 0 Fx-a.s.
We note that even in the point identified case, the maximum score estimator is essentially
a set estimator. Because the sample objective function is a step function, there is no guidance
about which point from the set of maximizers to choose as a point estimate. Asymptotically, the
estimator is consistent for any such rule. In the absence of a suitable rule, one may as well regard
the estimator as a set estimator consisting of all values of θ that maximize the objective function.
In this sense, the estimators we present below can be used without regard to whether the model
is point identified or partially identified. Consistency guarantees that the limit is the respective
population point or set of interest.
We also consider a closely related semiparametric transformation model, defined below,
which generalizes Model 1 by substituting an unknown, monotonic function Λ in place of the
indicator function and by allowing the outcome to be continuous.
Model 2 (Semiparametric Transformation Model). Let the continuous outcome variable y ∈R be
defined as
y =Λ(x ′θ+u)
where x is a random variable with support X ⊆Rk , θ is the parameter of interest, a member of
some parameter space Θ ⊂ Rk , and the function Λ : R→ R is monotonic. The distribution of u
satisfies Med(u | x) = 0 Fx-a.s.
Even with continuous variation in y , θ may only be partially identified if no component
of x has full support on R when u may be heteroskedastic. In contrast, Han (1987) achieved
point identification by assuming one component of x has full support and that u and x are
independent. Without independence Model 2 is very similar to Model 1 in terms of identification
and estimation, with both being characterized in terms of a rank condition that can be used to
construct an objective function, so we will focus on the binary choice case of Model 1 in the
remainder.
4.1. Identification
We begin by reviewing existing conditions for point identification before deriving the identified
set that results after relaxing some of these assumptions. In Model 1, Manski (1975, 1985) showed
that a full rank, full support condition on x was sufficient to point identify θ up to scale. That is,
he assumes that x is not contained in a proper linear subspace of Rk and that the first component
of x has positive density everywhere on R for almost every value of the remaining components.
The same conditions were invoked by Han (1987) for the maximum rank correlation estimator
and Horowitz (1992) for the smoothed maximum score estimator. The panel version of this
assumption (for T = 2 time periods) was used by Manski (1987) to establish point identification
of θ (up to scale) in a semiparametric fixed effects panel data model.
Thus, modulo assumptions on the disturbances, point identification of θ hinges on what
one knows, or is willing to assume, about the distribution of x. The validity of a full support
assumption is application-specific. Many common variables such as age, number of children,
years of education, and gender are inherently discrete and so a continuous and full support
assumption is clearly inappropriate in many cases. Similarly, even variables such as income have
only partial support on the real line. One advantage of the estimators proposed in this paper
is that they do not distinguish between the point identified and partially identified cases. That
is, they do not require a regressor with full support, but they exploit the additional information
provided by one when available. We consider two alternative assumptions. The first applies
when at least one component of x is continuous but may be bounded. The second applies when
x is a discrete random variable with finite support.
Assumption C1 (Continuous Regressor). The k-th component of the random vector x, denoted
xk , has positive density everywhere on a set Xk ⊆ R for almost every value of the remaining
components and θk ≠ 0.
Assumption C2 (Discrete Regressors). The random vector x is discrete with finite support X ⊂ Rk.
Note that Assumption C1 does not rule out the possibility that Xk =R, but it also includes
cases where the support of x is bounded in some sense. As we discuss in detail below, the
implications of these two assumptions for estimation are very different.
The primitives of Model 1 are θ and Fu|x , but θ is the only finite-dimensional parameter of
interest. The identified set is the collection of parameters θ that can be made consistent with
the data generating process P for some values of the remaining model primitives. The following
lemma provides a tractable representation of the identified set for θ in terms of observables.
Lemma 7. In Model 1, the identified set is

(6) Θ0 = {θ ∈ Θ : sgn(P(y = 1 | x) − 1/2) = sgn(x′θ) Fx-a.s.}.

Furthermore, Θ0 is nonempty and convex.
Proof of Lemma 7. That the identified set, Θ0, equals the set on the right-hand side of (6) follows from Proposition 2 of Manski (1988). The set Θ0 is nonempty because θ0 ∈ Θ0. To see that the set is convex, let θ1, θ2 ∈ Θ0, let α ∈ (0,1), and define θ ≡ αθ1 + (1−α)θ2. Since θ1, θ2 ∈ Θ0, from (6) it must be the case that sgn(x′θ1) = sgn(x′θ2) Fx-a.s. Furthermore, since α > 0 and 1−α > 0, (6) implies

sgn(x′θ) = sgn(αx′θ1 + (1−α)x′θ2) = sgn(x′θ1) = sgn(P(y = 1 | x) − 1/2) Fx-a.s.

Therefore, θ ∈ Θ0. ■
4.2. Consistent Estimation
Conveniently, we can characterize the identified set in this model using the usual maximum
score objective function. The limiting objective function is
Q(θ) = E[(2y − 1) sgn(x′θ)]

and the sample analog is

Qn(θ) = (1/n) Σ_{i=1}^{n} (2yi − 1) sgn(xi′θ).
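The sample objective is a one-line computation. A minimal sketch; note that the convention sgn(0) = 0 used by `np.sign` is an implementation choice here, since the text does not pin down the tie-breaking convention at x′θ = 0:

```python
import numpy as np

def max_score_objective(theta, x, y):
    """Sample maximum score objective Q_n(theta) = n^{-1} sum_i (2 y_i - 1) sgn(x_i' theta),
    for an (n, k) regressor matrix x and a binary outcome vector y in {0, 1}."""
    return np.mean((2 * y - 1) * np.sign(x @ theta))
```

Evaluating this function over a grid of candidate θ values yields the step-function objective plotted in Figure 4.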
The following lemma establishes that the set of maximizers of Q provides a sharp characterization
of the identified set, justifying the use of the proposed criterion-function-based set estimators.
We then derive the necessary properties of the objective function in order to obtain consistency
and rates of convergence.
Lemma 8. Under the maintained assumptions of Model 1,Θ1 =Θ0.
Proof of Lemma 8. It follows from the structure of Q and the proof of Lemma 7 that Q is maximized whenever the signs of 2y − 1 and x′θ agree Fx-a.s., so that their product equals 1 rather than −1. ■
The objective functions above are generated by applying the population measure P and the
empirical measure Pn to the class of indexing functions of the form f (x, y,θ) = (2y −1)sgn(x ′θ).
Since Θ is compact, Assumption B1 is verified. It is straightforward to verify that Q is continuous
when Assumption C1 holds and that Q is a step function when Assumption C2 holds. Thus,
Assumption B2 is satisfied in both cases. To establish consistency, we verify Assumption B3 in
the proof of the following theorem.
Theorem 6. In Model 1, under Assumption A5 and either Assumption C1 or C2, for any sequence τn such that τn →p 0 and n^{1/2} τn →p ∞, dH(Θn, Θ0) →p 0.
Proof: Appendix A.
The rate of convergence of Θn to Θ0 depends on the support of x. We obtain the rate
separately under both Assumptions C1 and C2 in the following sections.
4.3. Cube-Root Consistency
The properties of the maximum score objective function in the continuous covariate case have
been studied by Kim and Pollard (1990), Abrevaya and Huang (2005), and others. The following
theorem formalizes the cube-root consistency result for the set estimator for our model. Al-
though our assumptions are weaker overall, we still need to make additional assumptions on the
distribution of x, which, for comparison, are intentionally close to the assumptions of Abrevaya
and Huang (2005) and Horowitz (1992) in analyzing the model in the point identified case.
It is well known that θ is identified only up to scale, so we normalize the coefficient on the last component of x, denoted θk, to be either 1 or −1 and consider estimation of β ∈ R^{k−1}, where θ = (β′, θk)′. Let x̃ denote the first k−1 components of x. Because θk is restricted to a finite set, it can be estimated at a faster rate than the remaining components, so without loss of generality we consider only the case θk = 1, giving x′θ = x̃′β + xk. Therefore, in stating the following theorem we abuse notation slightly by writing Q(β) = Q((β′, 1)′). Accordingly, let B ⊂ R^{k−1}, B0 ⊂ B, and Bn denote, respectively, the parameter space for β, the identified set, and the set estimator.

Theorem 7. Suppose that Assumptions A5 and C1 hold in Model 1. In addition, suppose the following:

a. The components of x̃ and x̃x̃′ have finite first absolute moments.

b. The derivative ∂fxk|x̃(xk | x̃)/∂xk exists and, for some M > 0, |∂fxk|x̃(xk | x̃)/∂xk| < M and fxk|x̃(xk | x̃) < M for all xk and almost every x̃.

c. For all u in a neighborhood of 0, all xk in a neighborhood of −x̃′β0, almost every x̃, and some M > 0, the density fu|x(u | x̃, xk) exists and fu|x(u | x̃, xk) < M.

d. For all u in a neighborhood of 0, all xk in a neighborhood of −x̃′β0, almost every x̃, and some M > 0, the derivative ∂Fu|x(u | x̃, xk)/∂xk exists and |∂Fu|x(u | x̃, xk)/∂xk| < M.

e. B0 is compact and contained in the interior of B.

f. The matrix V(β) ≡ E[2 fu|x(0 | x̃, −x̃′β) fxk|x̃(−x̃′β | x̃) x̃x̃′] is positive semidefinite for each β ∈ bd(B0).

Then for any sequence τn such that τn →p 0 and n^{1/2} τn →p ∞, dH(Θn, Θ0) = Op(τn^{2/3}).
Proof: Appendix A.
4.4. Arbitrarily Fast Convergence
Here, we verify the constant majorant condition in the context of Model 1. We can then apply
Theorem 3 to show that Θn converges arbitrarily fast to Θ0.
When the support of x is a finite set X , the objective function Q can be rewritten as
(7) Q(θ) = Ex Ey|x[(2y − 1) sgn(x′θ)] = Σ_{x∈X} P(x) [2P(y = 1 | x) − 1] sgn(x′θ).

Therefore, Q is a step function and there exists a real number δ > 0 such that for all θ ∈ Θ \ Θ0, Q(θ) ≤ sup_Θ Q − δ. In particular, δ is bounded below by the smallest nonzero value of |P(x)[2P(y = 1 | x) − 1]| over x ∈ X. Thus, applying Theorem 3, we have the following result.
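With finitely many support points, equation (7) and the gap δ can be evaluated exactly from the joint distribution. The following sketch assumes the support points, marginal probabilities P(x), and conditional probabilities P(y = 1 | x) are known, as in a Monte Carlo design; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def step_q_and_gap(support, p_x, p_y1_given_x, theta_grid):
    """Evaluate Q(theta) = sum_x P(x) [2 P(y=1|x) - 1] sgn(x' theta) on a grid
    of theta values, and return (q_values, delta), where delta is the lower
    bound on the drop in Q off the maximizing set discussed after equation (7)."""
    support = np.asarray(support, dtype=float)                  # |X| x k support points
    w = np.asarray(p_x) * (2 * np.asarray(p_y1_given_x) - 1)    # P(x)[2P(y=1|x)-1]
    q = np.array([w @ np.sign(support @ th) for th in theta_grid])
    nonzero = np.abs(w)[np.abs(w) > 0]
    delta = nonzero.min() if nonzero.size else 0.0              # smallest nonzero |w|
    return q, delta
```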
Theorem 8. Suppose that Assumptions A5 and C2 hold in Model 1. Then for any sequence τn such that τn →p 0 and n^{1/2} τn →p ∞, and for all positive sequences rn → ∞, rn dH(Θn, Θ1) →p 0.
Komarova (2008) obtained a result similar to Theorem 8 in binary response models with
discrete regressors for a linear-programming-based estimator without a slackness sequence, but
with the additional restriction that none of the population response probabilities can equal one
half. The theorem above applies to the extremum estimator in the presence of the slackness
sequence, which guarantees consistency in a more general class of models, but which weakly
increases the size of the set estimates due to the slackness. Despite this increase, the estimates
still converge arbitrarily fast in probability to the identified set.
4.5. Confidence Sets
Finally, we establish the validity of Algorithm 1 for obtaining asymptotic confidence sets for
the identified set in the binary response model.
Theorem 9. In Model 1, if τn →p 0 and n^{1/2} τn →p ∞ and either (a) Assumption C1 holds and
bn = n2/3, or (b) Assumption C2 holds and bn = n1/2, then Assumptions A6 and A7 are satisfied
and therefore Algorithm 1 is valid for constructing a sequence of confidence sets satisfying (3).
Proof: Appendix A.
Thus, under Assumption C1, in the presence of a continuous regressor, the appropriate
scaling sequence for the inferential statistic Rn is bn = n2/3. Under Assumption C2, with only
discrete regressors, the relevant scaling sequence is bn = n1/2. We verify that the algorithm
indeed behaves as expected in the Monte Carlo experiments that follow.
[Figure 4 appears here: two panels plotting Q(β) and one realization of Qn(β) against β ∈ [−2, 2]; (a) Specification C1, (b) Specification C2.]

FIGURE 4. Q(β) and one realization of Qn(β).
5. Monte Carlo Experiments
In this section, we report the results from a series of Monte Carlo experiments for several
specifications.2 We consider four different specifications of the semiparametric binary choice
model from Section 4 which are designed to illustrate the theoretical results above and to help
understand the finite sample properties of the estimator. All specifications have two regressors,
with the second regressor being discrete in all cases. The first two specifications, denoted C1 and
C2, have a continuous first regressor while the two final specifications, denoted D1 and D2, have
a discrete first regressor. The coefficient on the first regressor is normalized to one in all cases,
leaving the coefficient on the second regressor to be estimated. So, we estimate only β where
θ = (1,β), β ∈ B is a scalar, and B ⊂R denotes the parameter space.
Observations in our simulated samples are generated according to the model
yi = 1{x1i +β0x2i +ui ≥ 0}
where ui ∼ N(0,1). The parameter space is B = [−2,2]. We vary the distributions of x1 and x2 and
the population parameter β0 across the four specifications.
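A draw from this design is straightforward to simulate. The following sketch parameterizes the regressor distributions so that each of the four specifications can be produced by passing different samplers; the function names are illustrative and not taken from the paper's Fortran code:

```python
import numpy as np

def simulate_sample(n, beta0, draw_x1, draw_x2, rng=None):
    """Simulate one dataset from y_i = 1{x_{1i} + beta0 * x_{2i} + u_i >= 0},
    u_i ~ N(0, 1), with user-supplied regressor samplers, e.g. uniform on an
    interval for x1 and uniform on a finite set for x2."""
    rng = np.random.default_rng(rng)
    x1 = draw_x1(rng, n)
    x2 = draw_x2(rng, n)
    u = rng.standard_normal(n)
    y = (x1 + beta0 * x2 + u >= 0).astype(int)
    return x1, x2, y
```

For example, passing `lambda r, n: r.uniform(1, 2, n)` for x1 and `lambda r, n: r.choice([-1.0, 1.0], n)` for x2 with β0 = −1 reproduces the continuous-regressor design described below.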
Specification C1, the first continuous-regressor specification, is a partially identified model
where x1 ∼ U(1,2), x2 ∼ U({−1,1}), and β0 =−1. That is, x1 has a continuous uniform distribution
on an interval and x2 has a discrete uniform distribution on a finite set. The identified set for this
specification is B0 = [−1,1]. The population objective function for this model, Q, is plotted in
Figure 4(a), along with one realization of the finite-sample objective function Qn for n = 50.
Specification C2 is identical to Specification C1 with the exception that the population param-
eter is β0 =−0.3. This results in the identified set B0 = {−0.3}, a singleton. The objective function
2Fortran 2003 source code to reproduce all figures and tables in this section is available from the author’s website
at http://jblevins.org/research/cuberoot.
for this model is plotted in Figure 4(b). Clearly the population objective function is maximized at
a single point while the finite-sample objective function is a step function, as before.
For each specification, we generated R = 1000 simulated data sets for each sample size
n ∈ {250,2000,16000,128000,1024000}. We use these datasets to produce set estimates and
confidence sets using the procedures described in Section 2. We considered four choices for the
slackness sequence τn . The conditions for consistency require that τn tends to zero no faster than
n−1/2. As such, we choose three sequences that are proportional to n−0.49. The fourth sequence
is τn = 0, which corresponds to set estimators obtained by simply maximizing the objective function (i.e., using a degenerate slackness sequence). In general, such estimators are not consistent, but for some of the specifications, in particular the point identified model, the estimator is indeed consistent without the use of a slackness sequence. In practice this would not be known in advance without knowledge
of the joint population distribution of the regressors, which is a justification for using the set
estimator.
We report the set estimates for the first two specifications in Tables 1 and 2. The first column
gives the slackness sequences used, which we describe in more detail below. The second column
reports the sample size, which is increased by factors of eight. The third column reports the
mean of all the set estimates across all R = 1000 replications for each sample size while the fourth
column reports the standard deviations of the endpoints. The final column gives the average
Hausdorff distance between the estimated sets and the true identified set, dH (Bn ,B0).
The first slackness sequence used is τn = n−0.49, where the constant of proportionality is one.
The second and third slackness sequences, which we refer to as Sn−0.49 and Mn−0.49, are chosen
by selecting the constants of proportionality to be equal to, respectively, the supremum of the
functional values over the parameter space and the median of the differences relative to the
maximum value. In practice, we use S ≡ sup_{β∈B} Qn(β) and M ≡ Med{ sup_{β∈B} Qn(β) − Qn(βj) : j = 1, …, J }, where the J grid points βj are uniformly spaced on B. These choices adapt the scale of the
slackness sequence to the scale of the functional values. The constants S and M are determined
for the first (smallest) sample size (for each simulation) and held fixed for all subsequent sample
sizes to ensure consistency (i.e., that τn →p 0 and an τn →p ∞). The fourth and final sequence used is τn = 0.
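The two adaptive scale constants are computed once, from the grid-evaluated objective values at the smallest sample size, and then held fixed across the larger sample sizes. A minimal sketch:

```python
import numpy as np

def slackness_constants(qn_values):
    """Compute the scale constants for the adaptive slackness sequences:
    S = sup_B Q_n and M = median of {sup_B Q_n - Q_n(beta_j)} over the grid,
    yielding tau_n = S * n**(-0.49) or tau_n = M * n**(-0.49)."""
    s = qn_values.max()
    m = np.median(s - qn_values)
    return s, m
```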
All three of the nonzero sequences are guaranteed to produce consistent estimates, but their
finite sample properties differ. In particular the supremum and median sequences yield good
results overall in terms of balancing the coverage and distance to the identified set of the set
estimates with the coverage of the confidence sets (discussed below). However, neither sequence
appears to be uniformly better than the other. As expected, the zero sequence does not produce
asymptotically correct results for all specifications.
If the estimator is consistent for a particular choice of τn , then the Hausdorff distance should
τn n Mean Bn St. Dev. dH
n−0.49
250 [ -1.372, 0.914 ] [ 0.134, 0.523 ] 0.478
2000 [ -1.231, 0.999 ] [ 0.064, 0.203 ] 0.248
16000 [ -1.136, 1.007 ] [ 0.031, 0.002 ] 0.136
128000 [ -1.084, 1.002 ] [ 0.015, 0.001 ] 0.084
1024000 [ -1.051, 1.001 ] [ 0.007, 0.000 ] 0.051
Sn−0.49
250 [ -1.304, 0.673 ] [ 0.130, 0.795 ] 0.597
2000 [ -1.190, 0.874 ] [ 0.066, 0.513 ] 0.312
16000 [ -1.112, 0.984 ] [ 0.032, 0.200 ] 0.131
128000 [ -1.069, 0.997 ] [ 0.016, 0.090 ] 0.073
1024000 [ -1.042, 1.000 ] [ 0.007, 0.000 ] 0.042
Mn−0.49
250 [ -1.154, -0.370 ] [ 0.113, 1.036 ] 1.421
2000 [ -1.077, -0.422 ] [ 0.114, 0.950 ] 1.448
16000 [ -1.046, -0.254 ] [ 0.034, 0.985 ] 1.272
128000 [ -1.026, -0.061 ] [ 0.019, 1.006 ] 1.074
1024000 [ -1.015, 0.205 ] [ 0.011, 0.982 ] 0.805
0
250 [ -1.130, -0.688 ] [ 0.135, 0.859 ] 1.701
2000 [ -1.056, -0.808 ] [ 0.108, 0.673 ] 1.814
16000 [ -1.030, -0.819 ] [ 0.027, 0.621 ] 1.819
128000 [ -1.014, -0.726 ] [ 0.014, 0.708 ] 1.726
1024000 [ -1.007, -0.609 ] [ 0.007, 0.802 ] 1.609
TABLE 1. Estimates for Specification C1, with B0 = [−1,1].
τn n Mean Bn St. Dev. dH
n−0.49
250 [ -0.636, 0.031 ] [ 0.198, 0.171 ] 0.441
2000 [ -0.513, -0.091 ] [ 0.073, 0.080 ] 0.258
16000 [ -0.430, -0.172 ] [ 0.036, 0.035 ] 0.151
128000 [ -0.381, -0.220 ] [ 0.017, 0.017 ] 0.091
1024000 [ -0.350, -0.251 ] [ 0.008, 0.007 ] 0.054
Sn−0.49
250 [ -0.466, -0.131 ] [ 0.178, 0.183 ] 0.293
2000 [ -0.417, -0.188 ] [ 0.085, 0.091 ] 0.173
16000 [ -0.372, -0.228 ] [ 0.044, 0.040 ] 0.099
128000 [ -0.346, -0.253 ] [ 0.021, 0.020 ] 0.060
1024000 [ -0.330, -0.270 ] [ 0.010, 0.009 ] 0.035
Mn−0.49
250 [ -0.391, -0.199 ] [ 0.179, 0.195 ] 0.231
2000 [ -0.372, -0.236 ] [ 0.091, 0.094 ] 0.133
16000 [ -0.344, -0.255 ] [ 0.047, 0.045 ] 0.074
128000 [ -0.329, -0.270 ] [ 0.023, 0.023 ] 0.044
1024000 [ -0.320, -0.281 ] [ 0.011, 0.011 ] 0.026
0
250 [ -0.325, -0.269 ] [ 0.189, 0.188 ] 0.173
2000 [ -0.313, -0.297 ] [ 0.094, 0.095 ] 0.083
16000 [ -0.301, -0.297 ] [ 0.047, 0.047 ] 0.039
128000 [ -0.300, -0.299 ] [ 0.024, 0.024 ] 0.020
1024000 [ -0.300, -0.300 ] [ 0.011, 0.011 ] 0.009
TABLE 2. Estimates for Specification C2, with B0 = {−0.3}.
converge to zero and the sequence of set estimates should converge to the true identified set. Furthermore, according to the theoretical results of Section 4 for the continuous-regressor specifications, when the slackness sequence is such that the estimator is consistent, it should converge at essentially the rate n^{-1/3}. Thus, when the sample size is increased by a factor of eight, the Hausdorff distance should decrease by approximately half for the sequences proportional to n^{-0.49}. For Specification C1, reported in Table 1, we find that the estimates do indeed appear to converge at approximately the cube-root rate when a slackness sequence is used. The estimator appears to be inconsistent without the slackness sequence.
For Specification C2, which is in fact point identified, the estimator is consistent even for τn = 0, as expected given the consistency of the maximum score point estimator (which would be defined here as any point in the maximizing interval). The benefits of formally treating the maximum score estimator as a set estimator are twofold: additional robustness to cases where the support condition may not be satisfied, and avoidance of an ad hoc (and usually implicit) selection rule for choosing a point from the set that maximizes the sample criterion function. The entire set is treated as the estimate since there is usually no a priori reason to prefer any particular point. As before, for slackness sequences proportional to n^{-0.49}, the estimator achieves approximate cube-root convergence.
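To make the estimator concrete, here is a minimal grid-search sketch of the contour-set estimate for a binary response model with one free coefficient. The data-generating process (standard normal regressors, logistic errors) and all names are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DGP (assumed): y = 1{x1*beta0 + x2 + u >= 0}, with the
# coefficient on x2 normalized to one and logistic errors u.
def sample(n, beta0=-0.3):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    u = rng.logistic(size=n)
    y = (x1 * beta0 + x2 + u >= 0).astype(float)
    return x1, x2, y

# Maximum-score-type sample criterion Qn(b) and the tau-contour set
# {b : Qn(b) >= max_b Qn(b) - tau}, reported as an interval [lo, hi].
def set_estimate(x1, x2, y, grid, tau):
    Qn = np.array([np.mean((2 * y - 1) * np.where(x1 * b + x2 >= 0, 1.0, -1.0))
                   for b in grid])
    keep = Qn >= Qn.max() - tau
    return grid[keep].min(), grid[keep].max()

n = 2000
x1, x2, y = sample(n)
grid = np.linspace(-2.0, 2.0, 401)
lo, hi = set_estimate(x1, x2, y, grid, tau=n ** -0.49)   # slackness n^{-0.49}
```

With a continuous regressor present, the interval [lo, hi] shrinks toward β0 as n grows, at roughly the cube-root rate discussed above.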
Next, we evaluate the performance of the subsampling algorithm discussed in Section 2 for constructing confidence sets with particular coverage probabilities. These results are reported in Table 3 for both Specifications C1 and C2. For each sample size, each simulated sample, and each of the four slackness sequences, we use the previously obtained set estimates to approximate the inferential statistic Rn according to Algorithm 1, using a large number of subsamples for each sample. For sample sizes n = 250, 2000, and 16000 we use subsample sizes m = 25, 100, and 400, respectively. In each case, we generate 1000 subsamples from each of the simulated samples of size n in order to approximate the limiting distribution of Rn. We then choose the appropriate quantile for each nominal coverage level 1 − α ∈ {0.50, 0.75, 0.90, 0.99}. We thus obtain 1000 confidence sets for each level (one for each simulated sample) and report the resulting coverage frequency.
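The subsampling step can be sketched schematically as follows. This is a hedged illustration using a toy quadratic criterion, not Algorithm 1 itself; the function names and tuning choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

grid = np.linspace(-1.0, 1.0, 201)

# Toy criterion for illustration only: Qn(b) = -(mean(sample) - b)^2,
# maximized near the sample mean.
def Qn_fn(sample):
    return -(sample.mean() - grid) ** 2

# Approximate the distribution of R_n = b_n (sup_Theta Qn - inf over the set
# estimate of Qn) by recomputing the statistic on subsamples of size m drawn
# without replacement, then return the empirical 1 - alpha quantile.
def subsample_quantile(data, theta_hat_mask, m, M, bm, alpha):
    stats = np.empty(M)
    for j in range(M):
        Q = Qn_fn(data[rng.choice(data.size, size=m, replace=False)])
        stats[j] = bm * (Q.max() - Q[theta_hat_mask].min())
    return np.quantile(stats, 1 - alpha)

data = rng.normal(loc=0.2, scale=1.0, size=16000)
n, m = data.size, 400
Q_full = Qn_fn(data)
theta_hat_mask = Q_full >= Q_full.max() - n ** -0.49      # contour-set estimate
c_hat = subsample_quantile(data, theta_hat_mask, m=m, M=1000,
                           bm=m ** 0.5, alpha=0.10)
# Confidence set: parameter values whose scaled criterion gap is within c_hat.
conf_set = grid[n ** 0.5 * (Q_full.max() - Q_full) <= c_hat]
```

The resulting confidence set always contains the sample maximizer, and its size is governed by the subsampling critical value c_hat.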
The sequence labeled “Infeasible” denotes confidence sets obtained using the subsampling algorithm when the true identified set is used instead of the estimated set, which is infeasible in practice. This helps to discern between simulation noise in the coverage frequencies, due to subsampling itself, and approximation error due to using estimates in place of the true identified set (i.e., basing the statistic on Θ̂n rather than on Θ1). Since the infeasible coverage frequencies are close to the nominal levels, the simulation noise appears to be small.
Next, we consider two specifications with only discrete regressors. For Specification D1, x1 ∼ U({−2, −1, 0, 1, 2}), x2 ∼ U({−2, −1, 0, 1, 2}), and β0 = −0.5. The identified set for this specification
                      Specification C1                   Specification C2
τn          n        0.500   0.750   0.900   0.990      0.500   0.750   0.900   0.990
n^{-0.49}   250      0.808   0.887   0.947   0.991      0.802   0.897   0.956   0.995
            2000     0.861   0.932   0.971   0.996      0.873   0.953   0.981   0.998
            16000    0.861   0.941   0.981   1.000      0.894   0.963   0.989   1.000
Sn^{-0.49}  250      0.753   0.852   0.918   0.982      0.617   0.771   0.886   0.987
            2000     0.818   0.905   0.959   0.994      0.705   0.858   0.946   0.992
            16000    0.827   0.914   0.967   0.999      0.796   0.903   0.965   0.996
Mn^{-0.49}  250      0.469   0.618   0.764   0.930      0.499   0.699   0.849   0.975
            2000     0.526   0.740   0.863   0.970      0.598   0.790   0.917   0.991
            16000    0.582   0.771   0.889   0.985      0.686   0.848   0.941   0.996
0           250      0.425   0.618   0.764   0.930      0.324   0.567   0.785   0.962
            2000     0.411   0.686   0.829   0.966      0.370   0.629   0.840   0.983
            16000    0.466   0.703   0.845   0.978      0.429   0.697   0.880   0.989
Infeasible  250      0.461   0.730   0.901   0.995      0.420   0.706   0.880   0.994
            2000     0.430   0.769   0.925   0.998      0.429   0.711   0.912   0.993
            16000    0.474   0.765   0.916   0.997      0.477   0.754   0.914   0.994
TABLE 3. Confidence sets for Specifications C1 and C2.
FIGURE 5. Q(β) and one realization of Qn(β) for β ∈ [−2, 2]: (a) Specification D1; (b) Specification D2. [plots omitted]
τn          n         Mean B̂n               St. Dev.            dH
n^{-0.49}   250       [ -1.001, -0.012 ]    [ 0.142, 0.098 ]    0.046
            2000      [ -0.986, -0.011 ]    [ 0.080, 0.072 ]    0.024
            16000     [ -0.987, -0.016 ]    [ 0.077, 0.088 ]    0.028
            128000    [ -0.986, -0.018 ]    [ 0.081, 0.093 ]    0.032
            1024000   [ -0.989, -0.011 ]    [ 0.072, 0.073 ]    0.022
Sn^{-0.49}  250       [ -0.952, -0.043 ]    [ 0.161, 0.143 ]    0.097
            2000      [ -0.958, -0.041 ]    [ 0.138, 0.137 ]    0.083
            16000     [ -0.962, -0.042 ]    [ 0.131, 0.139 ]    0.079
            128000    [ -0.951, -0.051 ]    [ 0.148, 0.151 ]    0.100
            1024000   [ -0.956, -0.038 ]    [ 0.141, 0.133 ]    0.082
Mn^{-0.49}  250       [ -0.844, -0.165 ]    [ 0.233, 0.236 ]    0.321
            2000      [ -0.853, -0.155 ]    [ 0.227, 0.231 ]    0.301
            16000     [ -0.860, -0.163 ]    [ 0.224, 0.234 ]    0.302
            128000    [ -0.853, -0.168 ]    [ 0.227, 0.236 ]    0.314
            1024000   [ -0.852, -0.146 ]    [ 0.228, 0.227 ]    0.293
0           250       [ -0.779, -0.236 ]    [ 0.249, 0.251 ]    0.456
            2000      [ -0.769, -0.256 ]    [ 0.249, 0.250 ]    0.486
            16000     [ -0.760, -0.258 ]    [ 0.250, 0.250 ]    0.497
            128000    [ -0.753, -0.254 ]    [ 0.250, 0.250 ]    0.500
            1024000   [ -0.747, -0.248 ]    [ 0.250, 0.250 ]    0.500
TABLE 4. Estimates for Specification D1, with B0 = [−1, 0].
τn          n         Mean B̂n               St. Dev.            dH
n^{-0.49}   250       [ -0.852,  0.104 ]    [ 0.228, 0.203 ]    0.387
            2000      [ -0.506,  0.000 ]    [ 0.059, 0.000 ]    0.007
            16000     [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            128000    [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            1024000   [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
Sn^{-0.49}  250       [ -0.734,  0.046 ]    [ 0.251, 0.148 ]    0.263
            2000      [ -0.501,  0.000 ]    [ 0.032, 0.000 ]    0.002
            16000     [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            128000    [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            1024000   [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
Mn^{-0.49}  250       [ -0.587,  0.007 ]    [ 0.197, 0.128 ]    0.109
            2000      [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            16000     [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            128000    [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            1024000   [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
0           250       [ -0.543, -0.021 ]    [ 0.160, 0.140 ]    0.059
            2000      [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            16000     [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            128000    [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
            1024000   [ -0.500,  0.000 ]    [ 0.000, 0.000 ]    0.000
TABLE 5. Estimates for Specification D2, with B0 = [−0.5, 0].
                      Specification D1                   Specification D2
τn          n        0.500   0.750   0.900   0.990      0.500   0.750   0.900   0.990
n^{-0.49}   250      0.569   0.842   0.938   0.955      0.984   0.997   1.000   1.000
            2000     0.523   0.753   0.931   0.958      1.000   1.000   1.000   1.000
            16000    0.542   0.754   0.914   0.948      1.000   1.000   1.000   1.000
Sn^{-0.49}  250      0.567   0.791   0.840   0.947      0.983   0.997   0.997   1.000
            2000     0.523   0.752   0.835   0.942      1.000   1.000   1.000   1.000
            16000    0.542   0.752   0.847   0.946      1.000   1.000   1.000   1.000
Mn^{-0.49}  250      0.363   0.565   0.731   0.947      0.970   0.982   0.994   1.000
            2000     0.394   0.472   0.743   0.942      1.000   1.000   1.000   1.000
            16000    0.400   0.445   0.728   0.946      1.000   1.000   1.000   1.000
0           250      0.094   0.564   0.731   0.947      0.930   0.982   0.994   1.000
            2000     0.029   0.436   0.743   0.942      1.000   1.000   1.000   1.000
            16000    0.008   0.410   0.728   0.946      1.000   1.000   1.000   1.000
Infeasible  250      0.566   0.841   0.953   1.000      0.953   0.993   1.000   1.000
            2000     0.523   0.753   0.931   0.997      1.000   1.000   1.000   1.000
            16000    0.542   0.754   0.914   0.989      1.000   1.000   1.000   1.000
TABLE 6. Confidence sets for Specifications D1 and D2.
is B0 = [−1, 0]. Specification D2 is similar, but with β0 = −0.3; in this case, the identified set is B0 = [−0.5, 0]. The objective functions are plotted in Figure 5 and the estimates are reported in Tables 4 and 5. In both cases the estimator achieves arbitrarily fast convergence when the slackness sequence is selected appropriately.
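With purely discrete regressors, the sample criterion is a step function of β taking finitely many values, so once sampling error falls below the smallest population gap between plateaus, the contour set pins down the identified interval exactly. A minimal sketch in the spirit of Specification D2 (the logistic error distribution and the grid are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete design in the spirit of Specification D2: x1 and x2 uniform on
# {-2, -1, 0, 1, 2} with beta0 = -0.3; logistic errors are an assumption.
beta0 = -0.3
vals = np.array([-2, -1, 0, 1, 2])
grid = np.linspace(-2.0, 2.0, 401)

n = 16000
x1 = rng.choice(vals, size=n)
x2 = rng.choice(vals, size=n)
u = rng.logistic(size=n)
y = (x1 * beta0 + x2 + u >= 0).astype(float)

# Piecewise-constant sample criterion: its plateaus are determined by the
# finitely many sign patterns of x1*b + x2 over the discrete support.
Qn = np.array([np.mean((2 * y - 1) * np.where(x1 * b + x2 >= 0, 1.0, -1.0))
               for b in grid])
keep = Qn >= Qn.max() - n ** -0.49
lo, hi = grid[keep].min(), grid[keep].max()
```

At this sample size the recovered interval already matches the identified interval up to grid resolution, mirroring the exact recovery in Table 5.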
We proceed as before in constructing confidence sets, except that we use the scaling sequence bn = n^{1/2}, so that the statistic Rn satisfies Assumption A6 (as established by Theorem 9). The coverage frequencies are reported in Table 6. Because the population response probabilities are exactly equal to one half for some values of x in Specification D1, the inference procedure is non-degenerate. In Specification D2, by contrast, the inferential procedure is valid but conservative.
6. Conclusion
This paper has developed asymptotic results for a class of criterion-function-based set estimators
which may exhibit non-standard behavior, such as cube-root consistency or arbitrarily fast
consistency. We have demonstrated these results by applying them to develop a robust set
estimator for semiparametric transformation models, and binary choice models in particular,
without the need to impose support conditions on the regressors. A series of Monte Carlo results
illustrates the theoretical results and provides insights into the practical finite sample behavior
of the estimator.
The general results established in this paper provide a basis for deriving the properties of
set estimators for other models and suggest several areas for future work in this literature. For
example, although we do not consider it explicitly here, the results could be generalized to
the closely-related multinomial choice model of Manski (1975) and other models based on
similar rank conditions. Finally, our Monte Carlo results suggest that further study of data-
driven procedures for selecting the constant of proportionality in the slackness sequence is an
interesting topic for future work.
A. Proofs
This section contains proofs for results not proven in the text. First we establish a preliminary result bounding the difference between the suprema of two bounded functions; then we provide the remaining proofs in order.
Lemma 9. Let $f$ and $g$ be bounded real functions on $A \subset \mathbb{R}^n$. Then
$$\left| \sup_{x \in A} f(x) - \sup_{x \in A} g(x) \right| \le \sup_{x \in A} \left| f(x) - g(x) \right|.$$
Proof of Lemma 9.
$$\begin{aligned}
\sup_{x \in A} \left| f(x) - g(x) \right|
&= \sup_{x \in A} \bigl[ (f(x) - g(x)) \vee (g(x) - f(x)) \bigr]
 = \sup_{x \in A} [f(x) - g(x)] \vee \sup_{x \in A} [g(x) - f(x)] \\
&\ge \sup_{x \in A} \Bigl[ f(x) - \sup_{y \in A} g(y) \Bigr] \vee \sup_{x \in A} \Bigl[ g(x) - \sup_{y \in A} f(y) \Bigr] \\
&= \Bigl[ \sup_{x \in A} f(x) - \sup_{x \in A} g(x) \Bigr] \vee \Bigl[ \sup_{x \in A} g(x) - \sup_{x \in A} f(x) \Bigr]
 = \Bigl| \sup_{x \in A} f(x) - \sup_{x \in A} g(x) \Bigr|. \qquad \blacksquare
\end{aligned}$$
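As a quick numerical sanity check (illustrative only, not part of the proof), the inequality of Lemma 9 can be verified for arbitrary bounded functions evaluated on a finite grid:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two arbitrary bounded functions f and g evaluated on a finite grid A.
A = np.linspace(0.0, 1.0, 1000)
f = np.sin(7.0 * A) + rng.uniform(-1.0, 1.0, size=A.size)
g = np.cos(3.0 * A) + rng.uniform(-1.0, 1.0, size=A.size)

lhs = abs(f.max() - g.max())   # |sup f - sup g|
rhs = np.abs(f - g).max()      # sup |f - g|
```

Lemma 9 guarantees that lhs never exceeds rhs for any such pair of functions.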
Proof of Theorem 1. The proof proceeds in two steps. We first show that $\sup_{\theta \in \hat\Theta_n} \rho(\theta, \Theta_1) \xrightarrow{p} 0$. Then, we show that $\lim_{n \to \infty} P(\Theta_1 \subseteq \hat\Theta_n) = 1$, which implies that $\sup_{\theta \in \Theta_1} \rho(\theta, \hat\Theta_n) \xrightarrow{p} 0$. Combining these steps and using the definition of the Hausdorff distance yields the final result.

Step 1. Let $\eta > 0$ and $\varepsilon > 0$ be given. Uniform convergence in probability of $Q_n$ to $Q$ over the entire parameter space $\Theta$, from Assumption A3, also implies uniform convergence in probability over the subset $\Theta \setminus \Theta_1^\eta$, so we have both $\sup_\Theta |Q_n - Q| = O_p(1/a_n)$ and $\sup_{\Theta \setminus \Theta_1^\eta} |Q_n - Q| = O_p(1/a_n)$. Under Assumption A2, there exists a $\delta_\eta > 0$ such that $\sup_{\Theta \setminus \Theta_1^\eta} Q \le \sup_\Theta Q - \delta_\eta$. Combining the above results, we have
$$\sup_{\Theta \setminus \Theta_1^\eta} Q_n \le \sup_{\Theta \setminus \Theta_1^\eta} Q + O_p(1/a_n) \le \sup_\Theta Q - \delta_\eta + O_p(1/a_n) \le \sup_\Theta Q_n - \delta_\eta + O_p(1/a_n).$$
Recall that by definition of $\hat\Theta_n$,
$$\inf_{\hat\Theta_n} Q_n \ge \sup_\Theta Q_n - \tau_n.$$
Since $\delta_\eta > 0$ is constant and $\tau_n = o_p(1)$, there exists an integer $n_\varepsilon$ such that for all $n \ge n_\varepsilon$, with probability at least $1 - \varepsilon$, both the $O_p(1/a_n)$ term and $\tau_n$ are smaller than $\delta_\eta/2$, and so $-\delta_\eta + O_p(1/a_n) < -\delta_\eta/2 < -\tau_n$. Therefore, for $n \ge n_\varepsilon$, we have $\inf_{\hat\Theta_n} Q_n > \sup_{\Theta \setminus \Theta_1^\eta} Q_n$, which implies $\hat\Theta_n \subseteq \Theta_1^\eta$, which in turn implies $\sup_{\theta \in \hat\Theta_n} \rho(\theta, \Theta_1) \le \eta$, all with probability at least $1 - \varepsilon$. Since $\varepsilon$ and $\eta$ were arbitrary, $\sup_{\theta \in \hat\Theta_n} \rho(\theta, \Theta_1) \xrightarrow{p} 0$.

Step 2. By definition of $\hat\Theta_n$, if $\sup_\Theta Q_n - \inf_{\Theta_1} Q_n < \tau_n$, then $\Theta_1 \subseteq \hat\Theta_n$. We have
$$\begin{aligned}
\sup_\Theta Q_n - \inf_{\Theta_1} Q_n
&= \Bigl[ \sup_\Theta Q_n - \sup_\Theta Q \Bigr] + \Bigl[ \sup_\Theta Q - \inf_{\Theta_1} Q_n \Bigr] \\
&\le \Bigl| \sup_\Theta Q_n - \sup_\Theta Q \Bigr| + \Bigl| \sup_\Theta Q - \inf_{\Theta_1} Q_n \Bigr| \\
&= \Bigl| \sup_\Theta Q_n - \sup_\Theta Q \Bigr| + \Bigl| \inf_{\Theta_1} Q - \inf_{\Theta_1} Q_n \Bigr| \\
&\le \sup_\Theta |Q_n - Q| + \sup_{\Theta_1} |Q_n - Q| \le 2 \sup_\Theta |Q_n - Q|.
\end{aligned}$$
These steps follow by (1) adding and subtracting $\sup_\Theta Q$, (2) taking the absolute value, (3) noting that $\Theta_1$ maximizes $Q$, (4) using the fact that $\inf f = -\sup(-f)$ and applying Lemma 9 twice, and (5) recalling that $\Theta_1 \subseteq \Theta$. By Assumption A3, $2 \sup_\Theta |Q_n - Q| = O_p(1/a_n)$. Finally, the condition that $\tau_n$ approaches zero in probability more slowly than $1/a_n$ implies $\sup_\Theta Q_n - \inf_{\Theta_1} Q_n < \tau_n$ with probability approaching one. $\blacksquare$
Proof of Theorem 2. Let $\varepsilon > 0$ be given and let $\delta$, $c_1$, $\gamma_1$, $\gamma_2$, $c_\varepsilon$, and $n_\varepsilon$ satisfy Assumption A4. Let $a_n$ satisfy Assumption A3 and define
$$\nu_n \equiv \left( \frac{c_1 c_\varepsilon \vee 2 \tau_n a_n}{a_n c_1} \right)^{1/\gamma_1}.$$
There exists an $n'_\varepsilon \ge n_\varepsilon$ such that for all $n \ge n'_\varepsilon$, with probability at least $1 - \varepsilon$, each of the following is true: (a) $\nu_n \ge (c_\varepsilon / a_n)^{\gamma_2}$, (b) $\nu_n \le \delta$, and (c) $\sup_\Theta |Q_n - Q| \le \tau_n$. Condition (a) holds since $\gamma_1 \gamma_2 \ge 1$, and so for sufficiently large $n$, with probability at least $1 - \varepsilon$,
$$\nu_n^{1/\gamma_2} \ge \left( \frac{c_\varepsilon}{a_n} \right)^{\frac{1}{\gamma_1 \gamma_2}} \ge \frac{c_\varepsilon}{a_n}.$$
Condition (b) follows because $\nu_n = o_p(1)$, due to the assumptions on $\tau_n$ and $a_n$, and the fact that $\delta$ is a strictly positive constant. Condition (c) follows from Assumption A3, under which $\sup_\Theta |Q_n - Q| = O_p(1/a_n)$, and the condition $a_n \tau_n \xrightarrow{p} \infty$. Therefore, for all $n \ge n'_\varepsilon$, with probability at least $1 - \varepsilon$,
$$\sup_{\Theta \setminus \Theta_1^{\nu_n}} Q_n \overset{(1)}{<} \sup_\Theta Q - c_1 (\nu_n \wedge \delta)^{\gamma_1} \overset{(2)}{\le} \sup_\Theta Q - c_1 \nu_n^{\gamma_1} \overset{(3)}{\le} \sup_\Theta Q - 2\tau_n \overset{(4)}{\le} \sup_\Theta Q_n - \tau_n \overset{(5)}{\le} \inf_{\hat\Theta_n} Q_n.$$
Inequality (1) holds by Assumptions A2 and A4 and condition (a), under which $\rho(\theta, \Theta_1) > \nu_n \ge (c_\varepsilon / a_n)^{\gamma_2}$ for $\theta \in \Theta \setminus \Theta_1^{\nu_n}$. Inequality (2) is a direct result of condition (b). Inequality (3) holds by definition of $\nu_n$, since $c_1 \nu_n^{\gamma_1} \ge 2\tau_n$. Inequality (4) follows from condition (c). Inequality (5) follows by definition of $\hat\Theta_n$. It follows that for $n \ge n'_\varepsilon$, with probability at least $1 - \varepsilon$, the set $\hat\Theta_n \cap (\Theta \setminus \Theta_1^{\nu_n})$ is empty, or equivalently, $\hat\Theta_n \subseteq \Theta_1^{\nu_n}$. Finally, recall that by Theorem 1 we have $\lim_{n \to \infty} P(\Theta_1 \subseteq \hat\Theta_n) = 1$. Therefore, for all $n \ge n'_\varepsilon$, with probability at least $1 - \varepsilon$, $d_H(\hat\Theta_n, \Theta_1) \le \nu_n$, and hence $d_H(\hat\Theta_n, \Theta_1) = O_p(\tau_n^{\gamma_2})$. $\blacksquare$
Proof of Theorem 3. First, note that Assumption A4′ implies Assumption A2. Then, from Theorem 1 we have $\lim_{n \to \infty} P(\Theta_1 \subseteq \hat\Theta_n) = 1$. We will prove the result by showing that $\lim_{n \to \infty} P(\hat\Theta_n \subseteq \Theta_1) = 1$, so that the Hausdorff distance $d_H(\hat\Theta_n, \Theta_1)$ equals zero with probability approaching one. The logic is very similar to that used in Step 1 of the proof of Theorem 1, but without expanding the set $\Theta_1$. We have
$$\sup_{\Theta \setminus \Theta_1} Q_n \le \sup_{\Theta \setminus \Theta_1} Q + O_p(1/a_n) \le \sup_\Theta Q - \delta + O_p(1/a_n) \le \sup_\Theta Q_n - \delta + O_p(1/a_n),$$
where the first and last inequalities follow from Assumption A3 and the middle inequality follows from Assumption A4′, the constant majorant condition. Since $\tau_n = o_p(1)$ and $\delta > 0$ is constant, with probability approaching one we have $\tau_n < \delta/2$ and $O_p(1/a_n) < \delta/2$, leading to $\sup_{\Theta \setminus \Theta_1} Q_n < \sup_\Theta Q_n - \tau_n \le \inf_{\hat\Theta_n} Q_n$, and therefore $\hat\Theta_n \subseteq \Theta_1$. $\blacksquare$
Proof of Theorem 4. The proof proceeds in three steps and follows the intuition of Theorem 3.3 of Chernozhukov, Hong, and Tamer (2007). In our case, however, the limiting distribution of $R_n$ may not be continuous, in which case the confidence sets may be conservative because the $1 - \alpha$ quantile of the distribution may be equal to higher quantiles as well. First, we derive upper and lower bounds for $\hat R_{n,m,j}$, defined in (5), such that $\underline{R}_{n,m,j} \le \hat R_{n,m,j} \le \overline{R}_{n,m,j}$ with probability approaching one. Second, we prove that the distribution function of $\hat R_{n,m,j}$ converges in probability to the distribution function of $R$ (the limiting distribution of $R_n$), by showing that the distribution functions of both of these bounds converge to that of $R$. Third, we show that $\hat c_n$ converges in probability to $c(1-\alpha)$, the desired quantile of the distribution of $R$.
Step 1. First, we restate the definition of $\hat R_{n,m,j}$ from (5):
$$\hat R_{n,m,j} \equiv b_m \left( \sup_{\theta \in \Theta} Q_{n,m,j}(\theta) - \inf_{\theta \in \hat\Theta_n} Q_{n,m,j}(\theta) \right).$$
By Theorem 2, there exists a sequence $\varepsilon_n \propto \tau_n^{\gamma_2}$ such that $d_H(\hat\Theta_n, \Theta_1) \le \varepsilon_n$, and hence $\hat\Theta_n \subseteq \Theta_1^{\varepsilon_n}$, with probability approaching one. To establish the upper asymptotic bound, define $\overline{R}_{n,m,j}$ as $\hat R_{n,m,j}$ but with $\hat\Theta_n$ replaced by $\Theta_1^{\varepsilon_n}$:
$$\overline{R}_{n,m,j} \equiv b_m \left( \sup_{\theta \in \Theta} Q_{n,m,j}(\theta) - \inf_{\theta \in \Theta_1^{\varepsilon_n}} Q_{n,m,j}(\theta) \right).$$
Then, for $j = 1, \dots, M_n$, with probability approaching one we have $\inf_{\hat\Theta_n} Q_{n,m,j} \ge \inf_{\Theta_1^{\varepsilon_n}} Q_{n,m,j}$, and therefore $\hat R_{n,m,j} \le \overline{R}_{n,m,j}$. To establish the lower bound, let $\mathcal{K}_n$ be the collection of all subsets $K \subseteq \Theta$ such that $d_H(K, \Theta_1) \le \varepsilon_n$, and define $\underline{R}_{n,m,j}$ as $\hat R_{n,m,j}$ but with the infimum taken over all sets $K \in \mathcal{K}_n$:
$$\underline{R}_{n,m,j} \equiv \inf_{K \in \mathcal{K}_n} \left[ b_m \left( \sup_{\theta \in \Theta} Q_{n,m,j}(\theta) - \inf_{\theta \in K} Q_{n,m,j}(\theta) \right) \right].$$
With probability approaching one, $\hat\Theta_n \in \mathcal{K}_n$, so the infimum over $K \in \mathcal{K}_n$ is no larger than the value of the term in square brackets at $K = \hat\Theta_n$. Therefore, for $j = 1, \dots, M_n$, with probability approaching one, $\underline{R}_{n,m,j} \le \hat R_{n,m,j}$.
Step 2. Let $x$ be a continuity point of the distribution of $R$. From Step 1, with probability approaching one,
$$\overline{G}_{n,m}(x) \equiv \frac{1}{M_n} \sum_{j=1}^{M_n} 1\{\overline{R}_{n,m,j} \le x\} \;\le\; \hat G_{n,m}(x) \equiv \frac{1}{M_n} \sum_{j=1}^{M_n} 1\{\hat R_{n,m,j} \le x\} \;\le\; \underline{G}_{n,m}(x) \equiv \frac{1}{M_n} \sum_{j=1}^{M_n} 1\{\underline{R}_{n,m,j} \le x\}.$$
We will show that $\overline{G}_{n,m}(x) \xrightarrow{p} P\{R \le x\}$ and $\underline{G}_{n,m}(x) \xrightarrow{p} P\{R \le x\}$, and therefore $\hat G_{n,m}(x) \xrightarrow{p} P\{R \le x\}$, as $n \to \infty$ (and thus, as $m \to \infty$). Let $\overline{J}_m$ denote the cdf of $\overline{R}_{n,m,j}$. Note that $\overline{G}_{n,m}(x)$ is a U-statistic of degree $m$. It is bounded, since $0 \le \overline{G}_{n,m}(x) \le 1$. Furthermore, $E[\overline{G}_{n,m}(x)] = E[1\{\overline{R}_{n,m,j} \le x\}] = \overline{J}_m(x)$, where the last equality holds by nonreplacement sampling, since under Assumption A5 each subsample of size $m$ is itself an iid sample. By the Hoeffding inequality for bounded U-statistics for iid data (Serfling, 1980, Theorem A, p. 201), for any $t > 0$,
$$P\left\{ \overline{G}_{n,m}(x) - \overline{J}_m(x) \ge t \right\} \le \exp\left[ -2 t^2 \frac{n}{m} \right].$$
A similar inequality follows for $t < 0$ by considering $-\overline{G}_{n,m}(x)$ and $-\overline{J}_m(x)$. Hence, we have a bound on the probability that the absolute difference $\left| \overline{G}_{n,m}(x) - \overline{J}_m(x) \right|$ exceeds $t$, which gives $\overline{G}_{n,m}(x) = \overline{J}_m(x) + o_p(1)$ for fixed $m$. Now, note that $\Theta_1^{\varepsilon_n}$ is a sequence of sets satisfying the condition of Assumption A7, since $\varepsilon_n \propto \tau_n^{\gamma_2}$ and $\tau_n^{\gamma_2} a_n^{\gamma_2} \xrightarrow{p} 0$ implies $d_H(\Theta_1^{\varepsilon_n}, \Theta_1) = o_p(a_n^{-\gamma_2})$. Applying Assumption A7 with $\Theta_1^{\varepsilon_n}$ in place of $\hat\Theta_n$ and $\overline{R}_{n,m,j}$ in place of $R_n$ yields $\overline{J}_m(x) = P\{\overline{R}_{n,m,j} \le x\} = P\{R \le x\} + o_p(1)$. Combining these results gives $\overline{G}_{n,m}(x) \xrightarrow{p} P\{R \le x\}$. A similar argument shows that $\underline{G}_{n,m}(x) \xrightarrow{p} P\{R \le x\}$.
Step 3. Convergence of the distribution function at continuity points implies convergence of the quantile function at continuity points (Shorack, 2000, Proposition 3.1). Therefore, $\hat c_n = \inf\{ x : \hat G_{n,m}(x) \ge 1 - \alpha \} \xrightarrow{p} c(1-\alpha)$. $\blacksquare$
Proof of Theorem 5. We prove the result by verifying the conditions of Theorem 2. By Lemma 4, Assumptions B1–B3 imply Assumptions A1–A3 with $a_n = n^{1/2}$, so it suffices to verify Assumption A4. To clarify the proof, we use the following notational conventions: $\eta$'s denote distances in $\Theta$, $\delta$'s denote distances between functional values, $\varepsilon$'s denote arbitrarily small probabilities, and $\kappa$'s denote various constants.

By definition of $\mathbb{G}_n(\theta)$, we can always write
$$(8) \qquad Q_n(\theta) = (P_n - P) f(\cdot, \theta) + P f(\cdot, \theta) = n^{-1/2} \mathbb{G}_n(\theta) + Q(\theta).$$
Let $\eta_1$ and $\eta_2$ satisfy Assumptions B4 and B5, and choose $\eta$ smaller than the minimum of $\eta_1$ and $\eta_2$. Then from Assumption A2 there exists a $\delta_\eta > 0$ such that
$$(9) \qquad \sup_{\Theta \setminus \Theta_1^\eta} Q \le \sup_\Theta Q - 2\delta_\eta.$$
Combining (8) and (9) and using Assumption A3 gives, for all $\theta \in \Theta \setminus \Theta_1^\eta$,
$$Q_n(\theta) \le n^{-1/2} \mathbb{G}_n(\theta) + \sup_\Theta Q - 2\delta_\eta.$$
$\mathcal{F}$ is P-Donsker by Assumption B3, and since $\sup_\Theta |\cdot|$ is continuous in $\ell^\infty(\Theta)$, $\sup_\Theta |\mathbb{G}_n(\theta)| = O_p(1)$ by the continuous mapping theorem. It follows that for any $\varepsilon_1 \in (0,1)$ there exists an $n_1$ such that for all $n \ge n_1$,
$$(10) \qquad Q_n(\theta) \le \sup_\Theta Q - \delta_\eta$$
uniformly on $\Theta \setminus \Theta_1^\eta$ with probability at least $1 - \varepsilon_1$.

Now, by Assumption B4, there is a neighborhood $\Theta_1^{\eta_1}$ of $\Theta_1$ such that $Q$ is approximately quadratic in the directed distance $\rho(\theta, \Theta_1)$. That is, for some $\kappa_1 > 0$,
$$Q(\theta) \le \sup_\Theta Q - \kappa_1 \rho^2(\theta, \Theta_1)$$
for all $\theta \in \Theta_1^{\eta_1}$. Similarly, by Assumption B5 and Lemma 4.1 of Kim and Pollard (1990), for all $\kappa_2 > 0$ there exists a sequence of random variables $M_n = O_p(1)$ such that
$$(11) \qquad \left| (P_n - P) f(\cdot, \theta) \right| \le \kappa_2 \rho^2(\theta, \Theta_1) + n^{-2/3} M_n^2$$
for $\theta \in \Theta_1^{\eta_2}$. Combining these results for $\kappa_2 = \kappa_1/2$ and using (8) and Assumption A3 yields
$$Q_n(\theta) \le \sup_\Theta Q - \frac{\kappa_1}{2} \rho^2(\theta, \Theta_1) + n^{-2/3} M_n^2$$
for all $\theta \in \Theta_1^{\eta_1 \wedge \eta_2}$.

Notice that when $n^{-2/3} M_n^2$ is smaller than $(\kappa_1/4) \rho^2(\theta, \Theta_1)$, we have
$$(12) \qquad Q_n(\theta) \le \sup_\Theta Q - \frac{\kappa_1}{4} \rho^2(\theta, \Theta_1),$$
which is a polynomial majorant of the form required by Assumption A4. This is true whenever $\rho(\theta, \Theta_1) \ge 4 \kappa_1^{-1/2} n^{-1/3} M_n$. Since $M_n = O_p(1)$, for any $\varepsilon_3 \in (0,1)$ there exist $\kappa_3$ and $n_3$ such that for all $n \ge n_3$, $\rho(\theta, \Theta_1) \ge \kappa_3 n^{-1/3} \ge 4 \kappa_1^{-1/2} n^{-1/3} M_n$, and the bound in (12) holds uniformly on $\Theta_1^\eta \setminus \Theta_1^{\kappa_3 n^{-1/3}}$ with probability at least $1 - \varepsilon_3$. (Note that we can always choose $n_3$ large enough that $\kappa_3 n^{-1/3}$ is smaller than $\eta < \eta_1 \wedge \eta_2$, ensuring that the relevant region of the domain is nonempty.)

To show that Assumption A4 holds, let $\varepsilon \in (0,1)$ be given. For $\varepsilon_1 = \varepsilon/2$, choose $n_1$ and $\delta_\eta$ as above so that (10) holds uniformly on $\Theta \setminus \Theta_1^\eta$ with probability at least $1 - \varepsilon_1$. Then, for $\varepsilon_3 = \varepsilon/2$, choose $n_3$ and $\kappa_3$ such that (12) holds uniformly on $\Theta_1^\eta \setminus \Theta_1^{\kappa_3 n^{-1/3}}$ with probability at least $1 - \varepsilon_3$. To summarize, we have shown that
$$Q_n(\theta) \le \sup_\Theta Q - \max\left\{ \frac{\kappa_1}{4} \rho^2(\theta, \Theta_1),\ \delta_\eta \right\}$$
uniformly on $\Theta \setminus \Theta_1^{\kappa_3 n^{-1/3}}$ with probability at least $1 - \varepsilon$. It follows that Assumption A4 holds with $a_n = n^{1/2}$, $\delta = \delta_\eta$, $c = \kappa_1/4$, $\gamma_1 = 2$, $\gamma_2 = 2/3$, $c_\varepsilon = \kappa_3$, and $n_\varepsilon = \max\{n_1, n_3\}$. Therefore, for any sequence $r_n$ such that $r_n = o(n^{1/3})$, let $\tau_n \propto r_n^{-3/2}$. Since $n^{1/3} r_n^{-1} \to \infty$, we have $(n^{1/3} r_n^{-1})^{3/2} = n^{1/2} r_n^{-3/2} \propto n^{1/2} \tau_n \xrightarrow{p} \infty$, and therefore Theorem 2 implies $d_H(\hat\Theta_n, \Theta_1) = O_p(\tau_n^{2/3}) = O_p(r_n^{-1})$. $\blacksquare$
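For a concrete instance of the final rate calculation, take $r_n = n^{1/3}/\log n$, which satisfies $r_n = o(n^{1/3})$ (an illustrative choice of sequence, not from the text):

```latex
r_n = \frac{n^{1/3}}{\log n}, \qquad
\tau_n \propto r_n^{-3/2} = \frac{(\log n)^{3/2}}{n^{1/2}}, \qquad
n^{1/2} \tau_n \propto (\log n)^{3/2} \to \infty,
```

so Theorem 2 yields $d_H(\hat\Theta_n, \Theta_1) = O_p(\tau_n^{2/3}) = O_p(\log n / n^{1/3})$: cube-root convergence up to a logarithmic factor.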
Proof of Theorem 6. We verify the sufficient conditions for consistency of Lemma 4, with $a_n = n^{1/2}$. Assumptions B1 and B2 were verified above, so it only remains to show that $\mathcal{F}$ is manageable for an envelope $F$ for which $P F^2 < \infty$.

Let $\alpha, \gamma \in \mathbb{R}$ and $\delta \in \mathbb{R}^k$. For each $(x, y, t) \in \mathcal{X} \times \{0,1\} \times \mathbb{R}$, define $g(x, y, t; \alpha, \gamma, \delta) = \alpha t + \gamma y + \delta' x$ and let $\mathcal{G} = \{ g(\cdot, \cdot, \cdot; \alpha, \gamma, \delta) : \alpha, \gamma \in \mathbb{R} \text{ and } \delta \in \mathbb{R}^k \}$. Since $\mathcal{G}$ is a finite-dimensional vector space of real-valued functions on $\mathcal{X} \times \{0,1\} \times \mathbb{R}$, classes of sets of the form $\{g \ge r\}$ or $\{g > r\}$ with $g \in \mathcal{G}$ and $r \in \mathbb{R}$ are VC classes (Pakes and Pollard, 1989, Lemma 2.4). We will use particular choices of $\alpha$, $\gamma$, and $\delta$ to show that $\mathcal{F}$ is a VC subgraph class, that is, that the collection of subgraphs of functions in $\mathcal{F}$ is a VC class.

First, note that we can rewrite $f$ as
$$\begin{aligned}
f(x, y, \theta) &= \left( 1\{y > 0\} - 1\{y \le 0\} \right)\left( 1\{x'\theta \ge 0\} - 1\{x'\theta < 0\} \right) \\
&= 1\{y > 0, x'\theta \ge 0\} - 1\{y > 0, x'\theta < 0\} - 1\{y \le 0, x'\theta \ge 0\} + 1\{y \le 0, x'\theta < 0\}.
\end{aligned}$$
For any $\theta \in \Theta$,
$$\begin{aligned}
\operatorname{subgraph}\left( f(\cdot, \cdot, \theta) \right)
&= \left\{ (x, y, t) \in \mathcal{X} \times \{0,1\} \times \mathbb{R} : 0 < t < f(x, y, \theta) \text{ or } 0 > t > f(x, y, \theta) \right\} \\
&= \left( \{y > 0\} \cap \{x'\theta \ge 0\} \cap \{t \ge 1\}^c \cap \{t > 0\} \right) \\
&\quad \cup \left( \{y > 0\} \cap \{x'\theta \ge 0\}^c \cap \{t > -1\} \cap \{t \ge 0\}^c \right) \\
&\quad \cup \left( \{y > 0\}^c \cap \{x'\theta \ge 0\} \cap \{t > -1\} \cap \{t \ge 0\}^c \right) \\
&\quad \cup \left( \{y > 0\}^c \cap \{x'\theta < 0\} \cap \{t \ge 1\}^c \cap \{t > 0\} \right).
\end{aligned}$$
Now, we construct three functions of the form $g_j(x, y, t) = \alpha_j t + \gamma_j y + \delta_j' x$ with $g_j \in \mathcal{G}$ for $j = 1, 2, 3$, by choosing coefficients $\alpha_1 = 0$, $\gamma_1 = 1$, $\delta_1 = 0$; $\alpha_2 = 0$, $\gamma_2 = 0$, $\delta_2 = \theta$; and $\alpha_3 = 1$, $\gamma_3 = 0$, $\delta_3 = 0$. Then
$$\begin{aligned}
\operatorname{subgraph}\left( f(\cdot, \cdot, \theta) \right)
&= \left( \{g_1 > 0\} \cap \{g_2 \ge 0\} \cap \{g_3 \ge 1\}^c \cap \{g_3 > 0\} \right) \\
&\quad \cup \left( \{g_1 > 0\} \cap \{g_2 \ge 0\}^c \cap \{g_3 > -1\} \cap \{g_3 \ge 0\}^c \right) \\
&\quad \cup \left( \{g_1 > 0\}^c \cap \{g_2 \ge 0\} \cap \{g_3 > -1\} \cap \{g_3 \ge 0\}^c \right) \\
&\quad \cup \left( \{g_1 > 0\}^c \cap \{g_2 < 0\} \cap \{g_3 \ge 1\}^c \cap \{g_3 > 0\} \right).
\end{aligned}$$
Since sets of the form $\{g \ge r\}$ or $\{g > r\}$ for $g \in \mathcal{G}$ are VC classes, and since this property is preserved under complements, unions, and intersections (Pakes and Pollard, 1989, Lemma 2.5), it follows that $\{ \operatorname{subgraph}(f) : f \in \mathcal{F} \}$ is itself a VC class. By Lemma 2.12 of Pakes and Pollard (1989), $\mathcal{F}$ is Euclidean for every envelope, including the constant envelope $F = 1$. Since it is Euclidean, $\mathcal{F}$ is also manageable in the sense of Pollard (1989) (cf. Pakes and Pollard, 1989, p. 1033), as required by Assumption B3. $\blacksquare$
Proof of Theorem 7. Following Abrevaya and Huang (2005), we work with the re-normalized objective function based on $f(x, y, \beta) = (2y - 1)\left( 1\{x'\beta + x_k \ge 0\} - 1\{x'\bar\beta + x_k \ge 0\} \right)$ for some $\bar\beta \in B_0$, which is equivalent to adding a constant to the objective function. We prove the result by verifying Assumptions B1–B5 and using Theorem 5. Assumptions B1–B3 have been verified already by Theorem 6.

Under assumptions a–f, it follows from Abrevaya and Huang (2005, p. 1200) that $\nabla_{\beta\beta'} Q(\beta) = -V(\beta)$ for all $\beta \in \operatorname{bd}(B_0)$ (where $V(\beta)$ is defined in assumption f). Therefore, in a neighborhood $N$ of $B_0$, $Q$ is approximately quadratic and, for some $c > 0$, $Q(\beta) \le \sup Q - c \rho^2(\beta, B_0)$. This verifies Assumption B4.

To show that Assumption B5 holds, let $\eta > 0$ and define $\mathcal{F}_\eta \equiv \{ f(\cdot, \beta) \in \mathcal{F} : \rho(\beta, B_0) \le \eta \}$. We will show that the natural envelope $F_\eta$ of $\mathcal{F}_\eta$ is such that $P F_\eta^2 = O(\eta)$. First, note that by definition of $\mathcal{F}_\eta$, if $f(\cdot, \beta) \in \mathcal{F}_\eta$, then $\beta \in B_0^\eta$. Then,
$$\begin{aligned}
F_\eta(x, y) &= \sup_{\beta \in B_0^\eta} \left| f(x, y, \beta) \right| \\
&= \sup_{\beta \in B_0^\eta} \left| 1\{x'\beta \ge -x_k > x'\bar\beta\} + 1\{x'\bar\beta \ge -x_k > x'\beta\} \right| \\
&\le \sup_{\beta \in B_0^\eta} 1\left\{ -\|x\| \cdot \|\beta - \bar\beta\| \le x'\bar\beta + x_k \le \|x\| \cdot \|\beta - \bar\beta\| \right\}.
\end{aligned}$$
Now, for all $\beta \in B_0^\eta$ and $\bar\beta \in B_0$, we have
$$\|\beta - \bar\beta\| \le \rho(\beta, \operatorname{bd}(B_0)) + d_H(\operatorname{bd}(B_0), \operatorname{int}(B_0)) \le \eta + \delta,$$
where $\delta \equiv d_H(\operatorname{bd}(B_0), \operatorname{int}(B_0))$ is the largest distance from a point in the interior of $B_0$ to the boundary of $B_0$. Therefore $F_\eta(x, y) \le 1\left\{ -(\eta + \delta)\|x\| \le x'\bar\beta + x_k \le (\eta + \delta)\|x\| \right\}$ and
$$\begin{aligned}
P F_\eta^2 &\le \int_{\mathcal{X}} \int_{\left\{ x_k \in \mathcal{X}_k : -(\eta + \delta)\|x\| \le x'\bar\beta + x_k \le (\eta + \delta)\|x\| \right\}} dF_{x_k \mid x}(x_k \mid x) \, dF_x(x) \\
&\le 2(\eta + \delta) \int_{\mathcal{X}} \sup_{x_k \in \mathcal{X}_k} f_{x_k \mid x}(x_k \mid x) \cdot \|x\| \, dF_x(x).
\end{aligned}$$
Since $x$ has finite first absolute moments (assumption a), $f_{x_k \mid x}(x_k \mid x)$ is finite for almost every $x$ (assumption b), and $\delta < \infty$ since $B_0$ is compact (assumption e), this bound is finite and $P F_\eta^2 = O(\eta)$. $\blacksquare$
Proof of Theorem 9. Assumption A7 holds in both cases (a) and (b) due to uniform convergence in probability, which implies stochastic equicontinuity. It remains to verify Assumption A6.

Case (a). Suppose that Assumption C2 holds and $b_n = n^{1/2}$. Since $\sup_\Theta Q$ is a constant and $Q(\theta) = \sup_\Theta Q$ for all $\theta \in \Theta_1$, we can restate $R_n$ as
$$R_n = n^{1/2} \left[ \sup_\Theta Q_n - \inf_{\Theta_1} Q_n \right] = n^{1/2} \left[ \left( \sup_\Theta Q_n - \sup_\Theta Q \right) - \inf_{\Theta_1} (Q_n - Q) \right].$$
Then,
$$|R_n| \le \sup_\Theta \left| n^{1/2}(Q_n - Q) \right| + \sup_{\Theta_1} \left| n^{1/2}(Q_n - Q) \right| = \sup_\Theta |\mathbb{G}_n| + \sup_{\Theta_1} |\mathbb{G}_n| = O_p(1),$$
where the final equality follows because $\mathcal{F}$ is P-Donsker.

Case (b). Suppose that Assumption C1 holds and $b_n = n^{2/3}$. For the following, we rely on two intermediate results from the proof of Theorem 5 above. First, let $\eta$ satisfy Assumption B5 and note that from (11), for all $c > 0$ there exists a sequence of random variables $M_n = O_p(1)$ such that
$$Q_n(\theta) \le Q(\theta) + c \rho^2(\theta, \Theta_1) + n^{-2/3} M_n^2$$
for all $\theta \in \Theta_1^\eta$. Second, from (12), there exists a $\kappa > 0$ such that $Q_n$ is maximized on $\Theta_1^{\kappa n^{-1/3}}$ with probability approaching one. Then, with probability approaching one, we have $\sup_\Theta Q_n = \sup_{\Theta_1^{\kappa n^{-1/3}}} Q_n$ and $\Theta_1^{\kappa n^{-1/3}} \subset \Theta_1^\eta$. Also note that $\sup_{\Theta_1} Q = \inf_{\Theta_1} Q = \sup_\Theta Q$ by definition of $\Theta_1$. Combining these results, with probability approaching one we have
$$0 \le R_n = n^{2/3} \left[ \sup_{\Theta_1^{\kappa n^{-1/3}}} Q_n - \inf_{\Theta_1} Q_n \right] \le c \cdot n^{2/3} \left[ \sup_{\Theta_1^{\kappa n^{-1/3}}} \rho^2(\theta, \Theta_1) - \inf_{\Theta_1} \rho^2(\theta, \Theta_1) \right] = c \kappa^2 < \infty,$$
since $\sup_{\Theta_1^{\kappa n^{-1/3}}} \rho^2(\theta, \Theta_1) = \kappa^2 n^{-2/3}$ and $\inf_{\Theta_1} \rho^2(\theta, \Theta_1) = 0$. Therefore, $R_n = O_p(1)$ as asserted. $\blacksquare$
References
Abrevaya, J. (2000). Rank estimation of a generalized fixed-effects regression model. Journal of
Econometrics 95, 1–23.
Abrevaya, J. and J. Huang (2005). On the bootstrap of the maximum score estimator. Economet-
rica 73, 1175–1204.
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.
Andrews, D. and P. Guggenberger (2008). Asymptotics for stationary very nearly unit root
processes. Journal of Time Series Analysis 29, 203–212.
Andrews, D. W. K., S. Berry, and P. Jia (2004). Confidence regions for parameters in discrete games
with multiple equilibria, with an application to discount chain store location. Working paper,
Yale University.
Andrews, D. W. K. and P. Guggenberger (2009). Validity of subsampling and “plug-in asymptotic”
inference for parameters defined by moment inequalities. Econometric Theory 25, 669–709.
Andrews, D. W. K. and X. Shi (2013). Inference based on conditional moment inequalities.
Econometrica 81, 609–666.
Andrews, D. W. K. and G. Soares (2010). Inference for parameters defined by moment inequalities
using generalized moment selection. Econometrica 78, 119–158.
Bajari, P., J. T. Fox, and S. P. Ryan (2008). Evaluating wireless carrier consolidation using semi-
parametric demand estimation. Quantitative Marketing and Economics 6, 299–338.
Beresteanu, A., I. Molchanov, and F. Molinari (2011). Sharp identification regions in models with
convex moment predictions. Econometrica 79, 1785–1821.
Beresteanu, A. and F. Molinari (2008). Asymptotic properties for a class of partially identified
models. Econometrica 76, 763–814.
Bhattacharya, D. (2009). Inferring optimal peer assignment from experimental data. Journal of
the American Statistical Association 104, 486–500.
Bierens, H. J. and J. Hartog (1988). Non-linear regression with discrete explanatory variables,
with an application to the earnings function. Journal of Econometrics 38, 269–299.
Bugni, F. A. (2010). Bootstrap inference in partially identified models defined by moment
inequalities: Coverage of the identified set. Econometrica 78, 735–753.
Canay, I. A. (2010). EL inference for partially identified models: Large deviations optimality and
bootstrap validity. Journal of Econometrics 156, 408–425.
Chernozhukov, V., H. Hong, and E. Tamer (2007). Estimation and confidence regions for parame-
ter sets in econometric models. Econometrica 75, 1243–1284.
Cosslett, S. R. (1983). Distribution-free maximum likelihood estimation of the binary choice
model. Econometrica 51, 765–782.
Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Annals of Probability 15,
1306–1326.
Han, A. K. (1987). Non-parametric analysis of a generalized regression model: The maximum
rank correlation estimator. Journal of Econometrics 35, 303–316.
Honoré, B. E. and A. Lleras-Muney (2006). Bounds in competing risks models and the war on
cancer. Econometrica 74, 1675–1698.
Honoré, B. E. and E. Tamer (2006). Bounds on parameters in panel dynamic discrete choice
models. Econometrica 74, 611–629.
Horowitz, J. L. (1992). A smoothed maximum score estimator for the binary response model.
Econometrica 60, 505–531.
Horowitz, J. L. (1998). Semiparametric Methods in Econometrics. New York: Springer.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-
index models. Journal of Econometrics 58, 71–120.
Imbens, G. W. and C. F. Manski (2004). Confidence intervals for partially identified parameters.
Econometrica 72, 1845–1857.
Jun, S., J. Pinkse, and Y. Wan (2011). Classical Laplace estimation for n^{1/3}-consistent estimators: Improved convergence rates and rate-adaptive inference. Working paper, Pennsylvania State University.
Khan, S. and E. Tamer (2009). Inference on endogenously censored regression models using
conditional moment inequalities. Journal of Econometrics 152, 104–119.
Kim, J. and D. Pollard (1990). Cube root asymptotics. The Annals of Statistics 18, 191–219.
Kim, K. (2008). Set estimation and inference with models characterized by conditional moment
inequalities. Working paper, University of Minnesota.
Klein, R. W. and R. H. Spady (1993). An efficient semiparametric estimator for binary response
models. Econometrica 61, 387–421.
Komarova, T. (2008). Binary choice models with discrete regressors: Identification and misspeci-
fication. Working paper, London School of Economics.
Lewbel, A. (2000). Semiparametric qualitative response model estimation with unknown het-
eroskedasticity or instrumental variables. Journal of Econometrics 97, 145–177.
Maddala, G. S. (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge
University Press.
Magnac, T. and E. Maurin (2008). Partial identification in monotone binary models: Discrete
regressors and interval data. Review of Economic Studies 75, 835–864.
Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. Journal
of Econometrics 3, 205–228.
Manski, C. F. (1985). Semiparametric analysis of discrete response: Asymptotic properties of the
maximum score estimator. Journal of Econometrics 27, 313–333.
Manski, C. F. (1987). Semiparametric analysis of random effects linear models from binary panel
data. Econometrica 55, 357–362.
Manski, C. F. (1988). Identification of binary response models. Journal of the American Statistical
Association 83, 729–738.
Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer-Verlag.
Manski, C. F. and E. Tamer (2002). Inference on regressions with interval data on a regressor or
outcome. Econometrica 70, 519–546.
Matzkin, R. L. (1992). Nonparametric and distribution-free estimation of the binary threshold
crossing and the binary choice models. Econometrica 60, 239–270.
McFadden, D. L. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in Econometrics, pp. 105–142. Academic Press.
Menzel, K. (2008). Estimation and inference with many moment inequalities. Working paper, Massachusetts Institute of Technology.
Newey, W. K. and D. McFadden (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4. Amsterdam: North-Holland.
Nolan, D. and D. Pollard (1987). U -Processes: Rates of convergence. Annals of Statistics 15,
780–799.
Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators.
Econometrica 57, 1027–1057.
Pakes, A., J. Porter, K. Ho, and J. Ishii (2006). Moment inequalities and their application. Working
paper, Harvard University.
Politis, D. N., J. P. Romano, and M. Wolf (1999). Subsampling. Springer.
Pollard, D. (1989). Asymptotics via empirical processes. Statistical Science 4, 341–366.
Romano, J. P. and A. M. Shaikh (2008). Inference for identifiable parameters in partially identified
econometric models. Journal of Statistical Planning and Inference 138, 2786–2807.
Romano, J. P. and A. M. Shaikh (2010). Inference for the identified set in partially identified
econometric models. Econometrica 78, 169–211.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.
Shorack, G. R. (2000). Probability for Statisticians. Springer.
Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica 54, 1461–1481.
Stoye, J. (2009). More on confidence intervals for partially identified parameters. Econometrica 77,
1299–1315.
Tamer, E. (2010). Partial identification in econometrics. Annual Review of Economics 2, 167–195.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
Yildiz, N. (2012). Consistency of plug-in estimators of upper contour and level sets. Econometric
Theory 28, 309–327.