asimplenonparametricapproachtoestimatingthedistribution...

A Simple Nonparametric Approach to Estimating the Distributionof Random Coefficients in Structural Models

Jeremy T. FoxRice University & NBER

Kyoo il Kim∗

Michigan State UniversityChenyu Yang

University of Rochester

August 2016Forthcoming: Journal of Econometrics

Abstract

We explore least squares and likelihood nonparametric mixtures estimators of the joint distributionof random coefficients in structural models. The estimators fix a grid of heterogenous parametersand estimate only the weights on the grid points, an approach that is computationally attractivecompared to alternative nonparametric estimators. We provide conditions under which the esti-mated distribution function converges to the true distribution in the weak topology on the spaceof distributions. We verify most of the consistency conditions for three discrete choice models.We also derive the convergence rates of the least squares nonparametric mixtures estimator underadditional restrictions. We perform a Monte Carlo study on a dynamic programming model.

Keywords: Random coefficients, mixtures, discrete choices, dynamic programming, sieve estimationJEL codes: C21, C22, C25

∗Corresponding authors: Jeremy Fox at [email protected] and Kyoo il Kim at [email protected]. Our addressesare Department of Economics MS-22 Rice University P. O. Box 1892 Houston, TX 77251 and MSU Department ofEconomics, Marshall-Adams Hall, 486 W Circle Dr Rm 110, East Lansing MI 48824.

1

1 Introduction

Economic researchers often work with models where the parameters are heterogeneous across thepopulation. A classic example is that consumers may have heterogeneous preferences over a set ofproduct characteristics in an industry with differentiated products. These heterogeneous parametersare often known as random coefficients. When working with cross sectional data, the goal is often toestimate the distribution of heterogeneous parameters. Our paper establishes the consistency and ratesof convergence of “fixed grid” nonparametric estimators for a distribution of heterogeneous parametersdue to Bajari, Fox and Ryan (2007), Train (2008, Section 6), Fox, Kim, Ryan and Bajari (2011), andKoenker and Mizera (2014). These estimators are computationally simpler than some alternatives. Weuse FKRB to refer to Fox, Kim, Ryan and Bajari (2011).

We estimate the distribution of heterogeneous parameters F (β) in the model

Pj (x) =

∫gj (x, β) dF (β) , (1)

where j is the index of the jth out of J finite values of the outcome y, x is a vector of observedexplanatory variables, β is the vector of heterogeneous parameters, and gj (x, β) is the probability thatthe jth outcome occurs for an observation with heterogeneous parameters β and explanatory variablesx. Given this structure, Pj (x) is the cross sectional probability of observing the jth outcome whenthe explanatory variables are x. The researcher picks gj (x, β) as the underlying model, has an i.i.d.sample of N observations (yi, xi), and wishes to estimate F (β). As F is only restricted to be a validCDF, the mixture model (1) is nonparametric.

The unknown distribution F (β) enters (1) linearly. The estimators we analyze exploit linearityand achieve a computationally simpler estimator than some alternatives. All the fixed grid estimatorsdivide the support of the vector β into a finite and known grid of vectors β1, . . . , βR. Computationally,the unknown parameters are the weights θ1, . . . , θR on the R grid points. These can be estimated usinga least squares or likelihood criterion with the constraints that each θr ≥ 0 and that

∑Rr=1 θ

r = 1. Theestimator of the distribution F (β) with N observations and R grid points becomes

FN (β) =∑R

r=1θr1 [βr ≤ β] ,

where θr’s denote estimated weights and 1 [βr ≤ β] is equal to 1 when βr ≤ β. Computationally,the least squares and likelihood constrained optimization problems are globally convex and concave,respectively. Particular numerical algorithms are guaranteed to converge to a global optimum.

FKRB discuss the advantages of this estimator for complex structural models, like dynamic pro-gramming models with heterogeneous parameters. In this respect, fixed grid estimators share somecomputational advantages with the parametric approach in Ackerberg (2009). Our Monte Carlo studyin an online appendix is to a discrete choice, dynamic programming model.

FKRB and other previous analyses assume that the R grid points used in a finite sample are indeedthe true grid points that contain the finite support of the true F0 (β). Thus, the true distribution F0 (β)

is assumed to be known up to a finite number of weights θ1, . . . , θR. As economists often lack convincingeconomic rationales to pick one set of grid points over another, assuming that the researcher knows

2

the true distribution up to finite weights is unrealistic.Instead of assuming that the distribution is known up to weights θ1, . . . , θR, this paper requires

the true distribution F0 (β) to satisfy much weaker restrictions. In particular, the true F0 (β) can haveany of continuous, discrete and mixed continuous and discrete supports. The prior approaches areparametric as the true weights θ1, . . . , θR lie in a finite-dimensional subset of a real space. Here, theapproach is nonparametric as the true F0 (β) is known to lie only in the infinite-dimensional space ofmultivariate distributions on the space of heterogeneous parameters β.

In a finite sample of N observations, our estimators are still implemented by choosing a fixed grid ofpoints θ1, . . . , θR , ideally to trade off bias and variance in the estimate FN (β). We, however, recognizethat as the sample increases, R and thus the fineness of the grid of points should also increase in orderto reduce the bias in the approximation of F (β). We write R (N) to emphasize that the number ofgrid points (and implicitly the grid of points itself) is now a function of the sample size. The maintheorem in our paper is that, under restrictions on the economic model and an appropriate choice ofR (N), our least squares and likelihood estimators FN (β) converge to the true F0 (β) as N → ∞, ina function space. We use the Lévy-Prokhorov metric, a common metrization of the weak topology onthe space of multivariate distributions.

We recognize that the nonparametric versions of our estimators are special cases of sieve estimators(Chen 2007). Sieve estimators estimate functions by increasing the flexibility of the approximatingclass used for estimation as the sample size increases. A sieve estimator for a smooth function mightuse an approximating class defined by a Fourier series, for example. As we are motivated by practicalconsiderations in empirical work, our estimators’ choice of basis, a finite grid of points, is justifiedby the estimators’ computational simplicity. Further and unlike a typical sieve estimator, we need toconstrain our estimated functions to be valid distribution functions. Our constrained least squares andlikelihood approaches are both computationally simple and ensure that the estimated CDFs satisfy thetheoretical properties of a valid CDF.

Because our estimators are sieve estimators, we prove their consistency by satisfying high-levelconditions for the consistency of sieve extremum estimators, as given in an appendix lemma in Chenand Pouzo (2012). We repeat this lemma and its proof in our paper so our consistency proof is self-contained. Our fixed grid estimators are not a special case of the two-step sieve estimators exploredusing lower-level conditions in the main text of Chen and Pouzo.1

We prove the consistency of our estimators for the distribution of heterogeneous parameters, infunction space under the weak topology. We present separate theorems for mixtures of discrete gridpoints and mixtures of continuous densities with a grid of points over the parameters of each density.The theorem for the mixture of grid points requires the heterogeneous parameters to lie in a, not nec-essarily known, compact set. The theorem for a mixture of continuous densities allows for unboundedsupport of the heterogeneous parameters. Our consistency theorems are not specific to the economicmodel being estimated.

We provide the rate of convergences for a subset of the models handled by our consistency theo-rem, namely those that are differentiable in the heterogeneous parameters, which include the random

1Note that under the Lévy-Prokhorov metric on the space of multivariate distributions, the problem of optimizingthe population objective function over the space of distributions turns out to be well posed under the definition of Chen(2007). Thus, our method does not rely on a sieve space to regularize the estimation problem to address the ill-posedinverse problem, as much of the sieve literature focuses on.

3

coefficients logit model. The convergence rates, the asymptotic estimation error bounds, consist oftwo terms: the bias and the variance. While obtaining the variance term is rather standard in thesieve estimation literature, deriving the bias term depends on the specific approximation methods (e.g.,power series or splines). Because our use of approximating functions is new in the sieve estimationliterature, deriving the bias term is not trivial. We provide the bias term, which is the smallest possibleapproximation error of the true function using sieves for the class of models we consider.

Our rate of convergence results highlight an important practical issue with any nonparametric es-timator: there is a curse of dimensionality in the dimension of the heterogeneous parameters. Largersample sizes will be needed if the vector of heterogeneous parameters has more elements. Further,the rate results indicate that our baseline estimator is not practical when there is a large number ofheterogeneous parameters. In high dimensional settings, we suggest allowing heterogeneous parameterson only a subset of explanatory variables and estimating homogenous parameters on the remainingexplanatory variables. We extend our consistency result to models where some parameters are ho-mogeneous. However, including homogeneous parameters requires nonlinear optimization, which losessome of the computational advantages of our estimators.

We provide a Monte Carlo study in an online appendix. We estimate a dynamic programming,discrete choice model, adding heterogeneous parameters to the framework of Rust (1987). The dynamicprogramming problem must be solved once for each realization of the heterogeneous parameters. Wepresent results for both the fixed grid likelihood and least squares estimators as well as, for comparison,a likelihood estimator where we estimate both the grid of points and the weights on those points. Weshow that our fixed grid estimators have superior speed but inferior statistical accuracy compared tothe more usual approach of estimating a flexible grid.

The outline of our paper is as follows. Section 2 presents three examples of discrete choice mixturemodels. Section 3 introduces the estimation procedures. Section 4 demonstrates consistency of ourestimators in the space of multivariate distributions. Section 5.1 extends our consistency results tomodels with both heterogeneous parameters and homogeneous parameters and Section 5.2 considersmixtures of smooth basis densities. Section 6 verifies most of the primitive conditions for consistencyestablished in Section 4 using the three examples of mixture models in Section 2. Section 7 derives theconvergence rates of the nonparametric estimator for a class of models. Finally, an online appendixpresents the Monte Carlo study.

2 Examples of Mixture Models

In our framework, the object the econometrician wishes to estimate is F (β), the distribution of thevector of heterogeneous parameters β. One definition of identification is that a unique F (β) solves(1) for all x and all outcomes j = 1, . . . , J . This is the definition used in certain relevant papers onidentification in the statistics literature, for example Teicher (1963).

We will return to these three examples of discrete choice models later in the paper. Each exampleconsiders economic models with heterogeneous parameters that play a large role in empirical work.Some of the example models are nested in others, but verification of the conditions for consistency inSection 6 will use additional restrictions on the supports of x and β that are non-nested across models.

4

Example 1. (logit) Let there be a multinomial choice model such that y is one of J unordered choices,such as types of cars for sale. For j ≥ 2, the utility of choice j to consumer i is ui,j = x′i,jβi+εi,j , wherexi,j is a vector of observable product characteristics of choice j and the demographics of consumer i, βiis a vector of random coefficients giving the marginal utility of each car’s characteristics to consumeri, and εi,j is an additive, consumer- and choice-specific error. There is an outside good 1 with utilityui,1 = εi,1. The consumer picks choice j when ui,j > ui,h ∀h 6= j. The random coefficients logit modeloccurs when εi,j is known to have the type I extreme value distribution. In this example, (1) becomes,for j ≥ 2,

Pj (x) =

∫gj (x, β) dF (β) =

∫ exp(x′jβ

)1 +

∑Jh=2 exp

(x′hβ

)dF (β) ,

where x = (x2, . . . , xJ). A similar expression occurs for other choices h 6= j. Compared to priorempirical work using the random coefficients logit, our goal is to estimate F (β) nonparametrically.

Example 2. (binary choice) Let J = 2 in the previous example, so that there is one inside goodand one outside good. Thus, the utility of the inside good 2 is ui,2 = εi + x′iβ2,i, where βi = (εi, β2,i) isseen as one long vector and εi supplants the logit errors in Example 1 and plays the role of a randomintercept. The outside good 1 has utility ui,1 = 0. In this example, (1) becomes, for j = 2,

P2 (x) =

∫g2 (x, β) dF (β) =

∫1[ε+ x′β2 ≥ 0

]dF (β) ,

where 1 [·] is the indicator function equal to 1 if the inequality in the brackets is true. Without logiterrors, the joint distribution of both the intercept and the slope coefficients is estimated nonparamet-rically. In this example, g2 (x, β) is discontinuous in β.

Example 3. (multinomial choice without logit errors) Consider a multinomial choice modelwhere the distribution of the previously logit errors is also estimated nonparametrically. In this case,the utility to choice j ≥ 2 is ui,j = x′i,j βi + εi,j and the utility of the outside good 1 is ui,1 = 0. The

notation βi is used because the full heterogeneous parameter vector is now βi =(βi, εi,2, . . . , εi,J

),

which is seen as one long vector. We will not assume that the additive errors εi,j are distributedindependently of βi or of each other. In this example, (1) becomes, for j ≥ 2,

Pj (x) =

∫gj (x, β) dF (β) =

∫1[x′j β + εj ≥ max

{0, x′hβ + εh

}∀h 6= j, h ≥ 2

]dF (β) .

3 Estimator

We analyze both least squares (linear probability models) and maximum likelihood criteria. We firstdiscuss the least squares criterion, from FKRB. Recall that yi,j is equal to 1 whenever the outcome yifor the ith observation is j, and 0 otherwise. Start with the model (1) and add yi,j to both sides while

5

moving Pj (x) to the right side. For the statistical observation i, this gives

yi,j =

∫gj (xi, β) dF (β) + (yi,j − Pj (xi)) . (2)

By the definition of Pj (x), the expectation of the composite error term yi,j − Pj (x), conditional onx, is 0. This is a linear probability model with an infinite-dimensional parameter, the distributionF (β). We could work directly with this equation if it was computationally simple to estimate thisinfinite-dimensional parameter while constraining it to be a valid CDF.

Instead, we work with a finite-dimensional sieve space approximation to F . In particular, we letR (N) be the number of grid points in the grid BR(N) =

(β1, . . . , βR(N)

). A grid point is a vector if

β is a vector, so R (N) is the total number of points in all dimensions. The researcher chooses BR(N).Given the choice of BR(N), the researcher estimates θ =

(θ1, . . . , θR(N)

), the weights on each of the

grid points. With this approximation, (2) becomes

yi,j ≈R(N)∑r=1

θrgj (xi, βr) + (yi,j − Pj (x)) . (3)

We use the ≈ symbol to emphasize that (3) uses a sieve approximation to the distribution functionF (β). Because each θr enters yi,j linearly, we estimate

(θ1, . . . , θR(N)

)using the linear probability

model regression of yi,j on the R “regressors” zri,j = gj (xi, βr).

To be a valid CDF, θr ≥ 0 ∀ r and∑R(N)

r=1 θr = 1. Therefore, the estimator is

θ = arg minθ

1

NJ

∑N

i=1

∑J

j=1

yi,j − R(N)∑r=1

θrzri,j

2

(4)

subject to θr ≥ 0∀ r = 1, . . . , R (N) andR(N)∑r=1

θr = 1.

There are J “regression observations” for each statistical observation (yi, xi). This minimization prob-lem is a quadratic programming problem subject to linear inequality constraints. The minimizationproblem is convex and routines like MATLAB’s lsqlin guarantee finding a global optimum. One canconstruct the estimated cumulative distribution function for the heterogeneous parameters as

FN (β) =∑R(N)

r=1θr1 [βr ≤ β] .

Thus, we have a structural estimator for a distribution of heterogeneous parameters in addition to aflexible method for approximating choice probabilities.

Following Train, we can also use the log-likelihood criterion (divided by the sample size)

L =

N∑i=1

log

R(N)∑r=1

θrzri,yi

/N,

6

where the zri,yi are the probabilities computed above for the observed outcome yi for observation i. As

with the least squares criterion, we enforce the constraints θr ≥ 0∀ r = 1, . . . , R (N) and∑R(N)

r=1 θr = 1.Computationally, one can use the EM algorithm in Train’s Section 6 or a nonlinear, gradient-basedsearch routine, which is available in most scientific packages. The performance of the gradient-basedsearch routine will be improved if the gradient of the likelihood is provided to the solver in closed form.The sth element of that gradient is

∂L

∂θs=

N∑i=1

zsi,yi∑R(N)r=1 θrzri,yi

/N.

The log likelihood is globally concave and any local maximum found will be the global maximum.The fixed grid approach, whether based on a least squares or a likelihood criterion, has two main

advantages over other approaches to estimating distributions of heterogeneous parameters. First,the approach is computationally simple: we can always find a global optimum and, by solving forzri,j = gj (xi, β

r) before optimization commences, we avoid many evaluations of complex structuralmodels such as dynamic programming problems. Second, the approach is nonparametric. In the nextsection, we show that if the grid of points is made finer as the sample size N increases, the estimatorsFN (β) converge to the true distribution F0. We do not need to impose that F0 lies in known parametricfamily.

On the other hand, a disadvantage is that the estimates may be sensitive to the choice of tuningparameters. While most nonparametric approaches require choices of tuning parameters, here thechoice of a grid of points is a particularly high-dimensional tuning parameter. FKRB propose cross-validation methods to pick these tuning parameters, including the number of grid points, the supportof the points, and the grid points within the support.

Example. 1 (logit) For the logit example,exp(x′i,jβr)

1+∑Jh=2 exp(x′i,hβr)

= gj (xi, βr). To implement the least

squares estimator, for each statistical observation i, the researcher computes R · J probabilities zri,j =exp(x′i,jβr)

1+∑Jh=2 exp(x′i,hβr)

. This computation is done before optimization commences. The outcome for choos-

ing the outside good 1 does not need to be included in the objective function, as∑J

j=1 gj (xi, βr) = 1.

To implement the likelihood estimator, the researcher computes R probabilities zri,yi for each statisticalobservation i.

Remark 1. FKRB discuss the case of panel data on T periods. Let yTi = (yi,1, . . . , yi,T ) be theactual sequence of T outcomes for panel observation i. Similarly, let the explanatory variables be(xi,1, . . . , xi,T ). The likelihood criterion for panel data is

L =

N∑i=1

log

R(N)∑r=1

θrzri,yTi

/N,

where zri,yTi

=∏Tt=1 gyi,t (xi,t, β

r). We do not explore panel data in our theoretical results.

7

4 Consistency in Function Space

Assume that the true distribution function F0 lies in the space F of distribution functions on thesupport B of the heterogeneous parameters β. We wish to show that the estimated distributionfunction FN (β) =

∑R(N)r=1 θr1 [βr ≤ β] converges to the true F0 ∈ F as the sample size N grows large.

To prove consistency, we use the results for sieve estimators developed by Chen and Pouzo (2012),hereafter referred to as CP. We define a sieve space to approximate F as

FR =

{F | F (β) =

∑R

r=1θr1 [βr ≤ β] , θ ∈ ∆R ≡

{(θ1, . . . , θR

)′ | θr ≥ 0,∑R

r=1θr = 1

}},

for a choice of grid BR =(β1, . . . , βR

)that becomes finer as R increases. We require FR ⊆ FS ⊂ F for

S > R, or that large sieve spaces encompass smaller sieve spaces. The choice of the grids and R (N)

are up to the researcher; however consistency will require conditions on these choices.Based on CP, we prove that the estimator FN converges to the true F0. In their main text, CP study

sieve minimum distance estimators that involve a two-stage procedure. Our estimator is a one-stagesieve least squares estimator (Chen, 2007) and so we cannot proceed by verifying the conditions in thetheorems in the main text of CP. Instead, we show its consistency based on CP’s general consistencytheorem in their appendix, their Lemma A.2, which we quote in the proof of our consistency theoremfor completeness. As a consequence, our consistency proof verifies CP’s high-level conditions for theconsistency of a sieve extremum estimator.

First, we consider the least squares criterion and then the likelihood criterion follows. Let yi denotethe J×1 finite vector of binary outcomes (yi,1, . . . , yi,J) and let g(xi, β) denote the corresponding J×1

vector of choice probabilities (g1 (xi, β) , . . . , gJ (xi, β)) given xi and the heterogeneous parameter β.Then we can define our sample criterion function for least squares as

QN (F ) ≡ 1

NJ

∑N

i=1

∥∥∥∥yi − ∫ g(xi, β)dF (β)

∥∥∥∥2

E

=1

NJ

∑N

i=1

∥∥∥∥yi −∑R

r=1θrg(xi, β

r)

∥∥∥∥2

E

(5)

for F ∈ FR(N), where || · ||E denotes the Euclidean norm. We can rewrite our estimator as

FN = argminF∈FR(N)QN (F ) + C · νN (6)

where we can allow for some tolerance (slackness) of minimization, C · νN , that is a positive sequencetending to zero as N gets larger, if necessary.

Also let

Q(F ) ≡ E

[∥∥∥∥y − ∫ g (x, β) dF (β)

∥∥∥∥2

E

/J

]be the population objective function.

As a distance measure for distributions, we use the Lévy-Prokhorov metric, denoted by dLP(·),which is a metrization of the weak topology for the space of multivariate distributions F . The Lévy-Prokhorov metric in the space of F is defined on a metric space (B, d) with its Borel sigma algebraΣ(B). We use the notation dLP(F1, F2), where the measures are implicit. This denotes the Lévy-Prokhorov metric dLP(µ1, µ2), where µ1 and µ2 are probability measures corresponding to F1 and F2.

8

The Lévy-Prokhorov metric is defined as

dLP (µ1, µ2) = inf {ε > 0 | µ1 (C) ≤ µ2 (Cε) + ε andµ2 (C) ≤ µ1 (Cε) + ε for all Borel measurableC ∈ Σ(B)} ,

where C ⊆ B and Cε = {b ∈ B | ∃ a ∈ C, d (a, b) < ε}. The Lévy-Prokhorov metric is a metric, so thatdLP (µ1, µ2) = 0 only when µ1 = µ2. See Huber (1981, 2004).

The following assumptions are on the economic model and data generating process. We writeP (x, F ) =

∫g(x, β)dF (β).

Assumption 1.

1. Let F be a space of distribution functions on a finite-dimensional real space B. F is compact inthe weak topology and contains the true F0.

2. Let ((yi, xi))Ni=1 be i.i.d.

3. Let β be independently distributed from x.

4. Assume the model g (x, β) is identified, meaning that for any F1 6= F0, F1 ∈ F , the set X ⊆ Xwhere P (x, F0) 6= P (x, F1) has a positive measure in X .2

5. Q (F ) is continuous on F in the dLP (·, ·) metric.

Assumption 1.1, the compactness of F is satisfied if B itself is compact in Euclidean space (Parthasarathy1967, Theorem 6.4). Unfortunately, the compactness of B rules out some examples such as normal dis-tributions of heterogeneous parameters. In part to address this, Section 5.2 provides a consistencytheorem for a related estimator, which can use mixtures of normal distributions, where the supportof the heterogeneous parameters is allowed to be RK . Assumptions 1.2 and 1.3 are standard fornonparametric mixtures models with cross-sectional data.

Assumption 1.4 requires that the model be identified at a set of values of x that occurs with positiveprobability. The assumption rules out so-called fragile identification that could occur at values of xwith measure zero (such as identification at infinity). Assumptions 1.4 and 1.5 need to be verified foreach economic model g (x, β). We will discuss these assumptions for our three examples in Section 6.

Remark 2. Assumption 1.5 is satisfied when g(x, β) is continuous in β for all x because in this caseP (x, F ) is also continuous on F for all x in the Lévy-Prokhorov metric. Then by the dominated conver-gence theorem, the continuity of Q (F ) in the dLP(·, ·) metric follows from the continuity of P (x, F ) onF for all x and P (x, F ) ≤ 1 (uniformly bounded). Here the continuity of P (x, F ) on F means for anyF1 ∈ F and such that dLP(F1, F2) → 0 it must follow that

∣∣∫ gj(x, β)dF1(β)−∫gj(x, β)dF2(β)

∣∣ → 0

for all j. This holds by the definition of weak convergence when g(x, β) is continuous and boundedand because the Lévy-Prokhorov metric is a metrization of the weak topology.

2This is with respect to the probability measure of the underlying probability space. This probability is well definedwhether x is continuous, discrete or some elements of x are functions of other elements (e.g. polynomials or interactions).

9

Remark 3. If the support B is a finite set, the continuity of Q (F ) holds even when gj(x, β) is dis-continuous, because in this case the Lévy-Prokhorov metric becomes equivalent to the total variationmetric (see Huber 1981, p.34). This implies∣∣∣∣∫ gj(x, β)dF1(β)−

∫gj(x, β)dF2(β)

∣∣∣∣→ 0

for any F1, F2 ∈ F such that dLP(F1, F2) → 0, in part because gj(x, β) is bounded between 0 and 1.We do not have a counterexample to continuity when B is not a finite set. Note that our consistencyresult for a mixtures of parametric densities, presented in Section 5.2, has a continuity condition thatis easier to verify for discontinuous gj(x, β), as Section 6 discusses.

In addition to Assumption 1, we also require that the grid of points be chosen so that the grid BRbecomes dense in B in the usual topology on the reals.

Condition 1. Let the choice of grids satisfy the following properties:

1. Let BR become dense in B as R→∞.

2. FR ⊆ FR+1 ⊆ F for all R ≥ 1.

3. R(N)→∞ as N →∞ and it satisfies R(N) logR(N)N → 0 as N →∞.

The first two parts of this condition have previously been mentioned and ensure that the sievespaces give increasingly better approximations to the space of multivariate distributions. Condition1.3 specifies a rate condition so that the convergence of the sample criterion function QN (F ) to thepopulation criterion function Q(F ) is uniform over FR. Uniform convergence of the criterion functionand identification are both key conditions for consistency.

Theorem 1. Suppose Assumption 1 and Condition 1 hold. Then, dLP

(FN , F0

)p→ 0.

See Appendix B.1 for the proof of the forthcoming Theorem 2, which nests Theorem 1.

Remark 4. Appendix A proves that this estimation problem is well posed under the definition of Chen(2007).

Next, we consider the distribution function estimator using the log-likelihood criterion. Define thepopulation criterion function and its sample analog, respectively, as

Q(F ) ≡ −QML(F ) ≡ −E

J∑j=1

yi,j logPj(xi, F )

= −E

J∑j=1

yi,j log

∫gj(xi, β)dF (β)

and

QN (F ) ≡ −QMLN (F ) ≡ − 1

N

∑N

i=1

J∑j=1

yi,j logPj(xi, F ) = − 1

N

∑N

i=1

J∑j=1

yi,j log

∫gj(xi, β)dF (β)

for F ∈ FR(N). Then we can obtain the ML estimator as in (6) and denote the resulting estimator byFMLN . We obtain the following corollary for the consistency of this ML estimator

10

Corollary 1. Suppose Assumption 1 and Condition 1 hold. Further suppose that Pj (x, F ) is boundedaway from zero for all j and F ∈ F . Then, dLP

(FMLN , F0

)p→ 0.

See Appendix B.2 for the proof.

Remark 5. The literature on sieve estimation has not established general results on the asymptoticdistribution of sieve estimators, in function space. However, for rich classes of approximating basisfunctions that do not include our approximation problem, the literature has shown conditions underwhich finite dimensional functionals of sieve estimators have asymptotically normal distributions. Inthe case of nonparametric heterogeneous parameters, we might be interested in inference in the meanor median of β. For demand estimation, say Example 3, we might be interested in average responses(or elasticities) of choice probabilities with respect to changes in particular product characteristics.Let ΠNF0 be a sieve approximation to F0 in our sieve space FR(N). If we could obtain an error boundfor dLP (ΠNF0, F0), we could also derive the convergence rate in the Lévy-Prokhorov metric (Chen2007). If the error bound shrinks fast enough as R(N) increases, we conjecture that we could alsoprove that plug-in estimators for functionals of F0 are asymptotically normal (Chen, Linton, and vanKeilegom 2003).3 Error bounds for discrete approximations are available in the literature for a class ofparametric distributions F , but we are not aware of results for the unrestricted class of multivariatedistributions. For a subset of problems, including the random coefficients logit, we are able to deriveapproximation error bounds. We provide convergence rates for these cases in Section 7.

5 Extensions

5.1 Models with Homogenous Parameters

In many empirical applications, it is common to have both heterogeneous parameters β and finite-dimensional parameters γ ∈ Γ ⊆ Rdim(γ). We write the model choice probabilities as g(x, β, γ) and theaggregate choice probabilities as P (x, F, γ). Here we consider the consistency of estimators for modelswith both homogenous parameters and heterogeneous parameters. For conciseness, we state a theoremfor the least squares criterion and omit a corollary for the likelihood criterion.

Remark 6. Estimating a model allowing a parameter to be a heterogeneous parameter when in truththe parameter is homogeneous will not affect consistency if the model with heterogeneous parametersis identified.

Remark 7. Searching over γ as a homogeneous parameter for the least squares criterion requires nonlin-ear least squares. The optimization problem may also not be globally convex. The objective functionmay not be differentiable for our examples where g(x, β, γ) involves an indicator function.

Our estimator for models with homogeneous parameters is defined as (similarly to (6))

(γN , FN ) = argmin(γ,F )∈Γ×FR(N)QN (γ, F ) + C · νN ,

3We conjecture that we could prove an analog to Theorem 2 in Chen et al (2003) if we could verify analogs toconditions (2.4)–(2.6) in that paper for our sieve space.

11

where QN (γ, F ) denotes the corresponding sample criterion function. Q(γ, F ) is the population cri-terion function based on the model g(x, β, γ). An alternative computational strategy is profiling, asin

FN (γ) = argminF∈FR(N)QN (γ, F ) + C · νN for all γ ∈ Γ.

Profiling gives usγN = argminγ∈ΓQN

(γ, FN (γ)

)+ C · νN ,

and therefore FN = FN (γN ). We make the following assumptions to prove consistency for models withhomogenous parameters.

Assumption 2.

1. Let A ≡ Γ × F where F is compact in the weak topology and Γ is a compact subset of Rdim(γ)

and A contains the true (γ0, F0).


3. Let β be independently distributed from x.

4. Assume the model g (x, β, γ) is identified, meaning that for any (γ1, F1) 6= (γ0, F0), (γ1, F1) ∈Γ×F , the set X ⊆ X where P (x, F0, γ0) 6= P (x, F1, γ1) has a positive measure in X .

5. At least one of the following properties holds.

(a) gj (x, β, γ) is Lipschitz continuous in γ for each outcome j.

(b) (i)(γN , FN

)is well-defined and measurable. (ii) For each outcome j, there exists a vec-

tor of known functions h (x, β, γ) = (h1 (x, β, γ) , . . . , hJ (x, β, γ)) such that, for each j,gj (x, β, γ) = 1 [Aj · h (x, β, γ) > 0] for some vector of known constants Aj. (iii) Each of theJ functions hj (x, β, γ) is Lipschitz continuous in γ.

6. Q (γ, F ) is lower semicontinuous in γ, is continuous on F in the weak topology, and is continuousat (γ0, F0).

If homogeneous parameters were added in the examples we consider, Assumption 2.5.a would holdfor Example 1, the logit model with random coefficients. Assumption 2.5.b might hold for Examples 2and 3. See the remark below.

We present the consistency theorem for the estimator with homogeneous parameters.

Theorem 2. Suppose Assumption 2 and Condition 1 hold. Then, γNp→ γ0 and FN

p→ F0.

See Appendix B.1 for the proof. Again, we omit a corollary for the likelihood case.

Remark 8. Assumption 2.5.b is designed to handle extensions of Examples 2 and 3 to include homoge-neous parameters. In the extensions of these examples, the homogeneous parameters enter inside indi-cator functions. The non-primitive portion, Assumption 2.5.b.i, requires that the estimator

(γN , FN

)be well-defined and measurable, echoing a condition in Lemma A.2 of Chen and Pouzo (2012), which

12

our consistency proof relies on. Remark A.1.i.a in CP states that lower semicontinuity of the leastsquares sample objective function is a sufficient condition for the estimator to be well-defined and mea-surable. Even though an indicator function for an open set is lower semicontinuous, the least squaressample objective (or likelihood) function itself might not be lower semicontinuous in γ even if eachgj (x, β, γ) is lower semicontinuous in γ. For example, multiplication by a negative number is enoughto change lower semicontinuity into upper semicontinuity. Therefore, we follow the main text of CPand, for the case of indicator functions, maintain the non-primitive assumption that the estimator iswell-defined and measurable. Assumption 2.5.b.i states in part that the sample objective function hasa unique global optimum. Although Assumption 2.5.b.i may not actually hold if the homogeneousparameters enter inside indicator functions, the sample criterion is converging to a population criterionwith a unique global optimum due to Assumptions 2.4 and 2.6. Extending the appendix lemma in Chenand Pouzo (2012) to the case where the sample objective function has a continuum of multiple globaloptima (just like in maximum score) is a minor extension that we do not pursue for space reasons. Thecontinuity of the population criterion Q(γ, F ) with respect to F can be satisfied using the sufficientcondition discussed in Remark 3 for the earlier Assumption 1.5. The lower semicontinuity with respectto γ may require further primitive conditions, like the conditions on the support of the explanatoryvariables in the results on binary choice in Ichimura and Thompson (1998, Theorem 1).

5.2 Continuous Distribution Function Estimator

A limitation of the discrete approximation estimator is that the CDF of the heterogeneous parameterswill be a step function. In applied work, it is often attractive to have a smooth distribution of heteroge-neous parameters. In this subsection, we describe one approach to obtain a continuous distribution ordensity function estimator that allows for unbounded supports. Instead of modeling the distribution ofthe heterogeneous parameters as a mixture of point masses, we instead model the density as a mixtureof parametric densities, e.g. normal densities. Approximating a density or distribution function usinga mixture of parametric densities or distributions is popular (e.g. Jacobs, Jordan, Nowlan, and Hinton1991, Li and Barron 2000, McLachlan and Peel 2000, and Geweke and Keane 2007). Our estimator’sadvantage is its computational simplicity.

As a leading case, let a basis r be a normal distribution with mean the K×1 vector µr and standarddeviation the K × 1 vector σr. Let φ (β | µr, σr) denote the joint normal density corresponding to therth basis distribution. Under independent normal basis functions, the joint density for a given r isjust the product of the marginals, or φ (β | µr, σr) =

∏Kk=1 φ (βk | µrk, σrk) . We can also use only a

location mixture with the basis functions φ (β | µr) =∏Kk=1 φ (βk | µrk) or use a multivariate normal

mixture with the basis functions φ (β | µr,Σr), where Σr denotes a variance-covariance matrix. We canalso consider mixtures of other parametric density functions. We use the generic notation φ (β | λr) todenote the rth basis function, where λr is the rth distribution parameter. Let θr denote the probabilityweight given to the rth basis function, φ (β | λr). As in the discrete approximation estimator, the vectorof weights θ lies in the unit simplex, ∆R.

There are many ways to perform K-dimensional numerical integration, such as sparse grid quadra-ture (Heiss and Winschel 2008). We focus on simulation for simplicity. To implement our continuousdensity estimator for a given R, make S simulation draws from φ (β | λr) independently of r (i.e. useindependent simulation draws for each λr). Let a particular draw s be denoted as βr,s. We then create

13

the R “regressors”

zri,j =

∫gj (xi, β)φ (β | λr) dβ ≈ 1

S

S∑s=1

gj (xi, βr,s) .

The ≈ emphasizes the error (possibly quite small) in numerical integration. This numerical integrationstep is done first, before any optimization. We then approximate Pj (xi) as

Pj (xi) ≈R∑r=1

θrzri,j .

Here, we use the ≈ to emphasize both sieve and numerical integration approximations, althoughtypically the sieve approximation will be a larger source of error. We estimate θ using the inequalityconstrained least squares problem as before

θS = arg minθ∈∆R

1

NJ

∑N

i=1

∑J

j=1

(yi,j −

∑R

r=1θrzri,j

)2

. (7)

This is once again inequality-constrained linear least squares, a globally convex optimization problemwith an easily-computed unique solution. The resulting density estimator is fN,S (β) =

∑Rr=1 θ

rSφ (β | λr)

and the distribution function estimator is FN,S (β) =∑R

r=1 θrSΦ (β | λr), where dΦ(·) = φ(·).

5.2.1 Consistency of the Continuous Distribution Function Estimator

We show consistency for the continuous distribution function estimator after imposing additional re-strictions on the data generating process. For this purpose, we restrict the class of the true distributionfunctions to

FM =

{F : F =

∫Φ(β | λ)Pλ(dλ), Pλ ∈ Pλ, dΦ(β | λ) ∈ G =

{dΦ(β | λ) | β ∈ B ⊆ RK , λ ∈ Λ ⊂ Rd

}},

(8)such that any distribution in FM is in truth given by a mixture of parametric distributions in G.Note that we allow for B = RK in FM , thus removing the restriction that B is compact underlyingTheorem 1. Here Pλ denotes a probability measure on Λ, the support of the distribution parameter λ.This means that we assume the true distribution is in the space of possibly continuous mixtures oversome known parametric basis functions. Note, however, that Petersen (1983, as Lemma 3.4 of Zeeviand Meir (1997)) suggests that any density function can be approximated to arbitrary accuracy by aninfinitely countable convex combination of basis densities, including normals, i.e. G can be dense inthe space of continuous density functions. For example Zeevi and Meir (1997) show that G is dense inthe space of all density functions that are bounded away from zero on compact support. Therefore weargue that the class FM can be arbitrary close to the space of any continuous distribution functionswith suitable choices of G. We approximate this possibly continuous mixture using a finite mixtureover the same basis functions. Accordingly we construct our sieve space as

FMR =

{F | F =

∑R

r=1θrΦ(β | λr), dΦ(β | λ) ∈ G, θ ∈ ∆R

}.

14

First consider an estimator ignoring numerical integration error in the regressors

FN (β) =∑R(N)

r=1θrΦ(β|λr), (9)

where θ = arg minθ∈∆R(N)

1NJ

∑Ni=1

∑Jj=1

(yi,j −

∑R(N)r=1 θrzri,j

)2and zri,j =

∫gj(xi, β)dΦ(β | λr) with

a choice of grid ΛR =(λ1, . . . , λR

)on Λ. We consider simulation error later.

Let F0 =∫

Φ(β | λ)Pλ,0(dλ) ∈ FM . Then we can present the consistency result in terms ofestimating the true Pλ,0 because knowing Pλ fully characterizes the true F0 given a researcher’s choiceof Φ(β | λ) (e.g. normal distributions). Note that (9) can be also written in terms of estimating Pλwith the estimator

Pλ,N (λ) =∑R(N)

r=1θr1 [λr ≤ λ]

where the sieve space for Pλ is given by

Pλ,R =

{Pλ | Pλ (λ) =

∑R

r=1θr1 [λr ≤ λ] , θ ∈ ∆R

}.

Therefore FN =∫

Φ(·|λ)dPλ,N (dλ)p→ F0 =

∫Φ(·|λ)dPλ,0(dλ) as long as Pλ,N

p→ Pλ,0 because weassume F0 ∈ FM and because Φ(β|λ) is continuous in λ for all β and therefore F =

∫Φ(·|λ)dPλ(dλ)

is also continuous on Pλ in the Lévy-Prokhorov metric. This facilitates our analysis because we candirectly apply Theorem 1 to Pλ,N under assumptions below. To present the consistency theorem weneed additional notation.

Define a sample criterion function in terms of estimating Pλ,0 as

QN (Pλ) ≡ 1

NJ

∑N

i=1

∥∥∥∥yi − ∫ g(xi, λ)dPλ (λ)

∥∥∥∥2

E

=1

NJ

∑N

i=1

∥∥∥∥yi −∑R

r=1θrg(xi, λ

r)

∥∥∥∥2

E

(10)

for Pλ ∈ Pλ,R(N), where g(x, λr) =∫g(x, β)dΦ(β|λr). Then we can rewrite the estimator Pλ,N as

Pλ,N = argminPλ∈Pλ,R(N)QN (Pλ) + C · νn (11)

for some positive sequence νn tending to zero. Also define the population criterion function as

Q(Pλ) ≡ E

[∥∥∥∥y − ∫ g (x, λ) dPλ (λ)

∥∥∥∥2

E

/J

].

We make the following assumptions, which are similar to those used in Theorem 1

Assumption 3.

1. Let FM be a space of distribution functions generated by Pλ in (8), where Λ is compact. FM

contains the true F0.


3. Let both β and λ be independently distributed from x.

15

4. Assume the model g(x, β) is identified, meaning that for any Pλ,1 6= Pλ,0, the set X ⊆ X whereP (x, F0) 6= P (x, F1) such that F0 =

∫Φ(· | λ)Pλ,0(dλ) and F1 =

∫Φ(· | λ)Pλ,1(dλ) has a positive

measure in X .

5. Q (Pλ) is continuous on Pλ in the weak topology.

We leave out lengthy discussion of these assumptions because most of these assumptions are madefor similar reasons as those in Assumption 1. More discussion is in the beginning of Section 6. We alsorequire that the choice of grids on Λ satisfies the following properties.

Condition 2. 1. Let ΛR =(λ1, . . . , λR

)become dense in Λ as R→∞.

2. Pλ,R ⊆ Pλ,R+1 ⊆ Pλ for all R ≥ 1.

3. R(N)→∞ as N →∞ and it satisfies R(N) logR(N)N → 0 as N →∞.

Theorem 3. Suppose Assumption 3 and Condition 2 hold. Then, dLP

(Pλ,N , Pλ,0

)p→ 0 and dLP

(FN , F0

)p→

0.

A brief proof is in an appendix. However, most of the steps mirror the proofs of Theorems 1 and2, and so the appendix omits the unchanged steps for conciseness.

Next we account for the simulation error in the basis functions from approximating the integralwith respect to Φ(β | λr). Denote the resulting distribution estimator with simulated basis functionsby FN,S (β) ≡

∑R(N)r=1 θrSΦ(β | λr), where

θS = arg minθ∈∆R(N)

1

N

∑N

i=1

∑J

j=1

yi,j − R(N)∑r=1

θr1

S

S∑s=1

g(xi, βr,s)

2

. (12)

Note that all simulation draws are independent and that different draws are used for each r. Weobtain the mixing ratios θS as in (12) using the simulated basis functions and our distribution functionestimator is still FN,S (β), which belongs to FMR . Note that we only use simulation to approximateΦ(β | λr) and to obtain θS in (12). Our distribution estimator is still FN,S (β), not FN,S (β) ≡∑R(N)

r=1 θrS1S

∑Ss=1 1 [βr,s ≤ β], because the CDF Φ(β | λr) is known to researchers. In the normal

distribution case, specialized software calculates the normal CDF Φ(β | λr) with much less error thantypical simulation approaches. We show our estimator is consistent when the number of simulationdraws tends to infinity.

Theorem 4. Suppose Assumption 3 and Condition 2 hold. Let S →∞. Then, dLP

(FN,S , F0

)p→ 0.

The proof is in the appendix.

6 Discussion of Examples

We return to the examples we introduced in Section 2. We discuss the two key conditions for each modelg (x, β): Assumption 1.4, identification of F (β), and Assumption 1.5, continuity of the populationobjective function under the Lévy-Prokhorov metric. Throughout this section, we assume Assumptions

16

1.1–1.3 hold. Note that Matzkin (2007) is an excellent survey of older results on the identification ofmodels with heterogeneity.

For the mixture of continuous densities, the identification of the mixture distribution, Assumption3.4, will typically occur if the underlying distribution of heterogeneous parameters is identified, As-sumption 1.4, and the basis functions are chosen appropriately. We do not need to explicitly verifyAssumption 3.4 once Assumption 1.4 is verified. We also do not explicitly verify Assumption 3.5,continuity of the population objective function in the continuous mixtures cases. An important pointis that even when g(x, β) is nonsmooth as in Examples 2 and 3, g(x, λ) =

∫g(x, β)dΦ(β | λ) can

still be continuous and differentiable in λ because it is smoothed by integration with respect to thedistribution Φ(β | λ). Therefore, Assumption 3.5 can be satisfied by Remark 2.

Example. 1 (logit) The identification of F (β) in the random coefficients logit model is the main con-tent of Fox, Kim, Ryan, and Bajari (2012, Theorem 15).4 Assumption 14 in Fox et al states that “Thesupport of x, X contains x = 0, but not necessarily an open set surrounding it. Further, the supportcontains a nonempty open set of points (open in Rdim(xj)) of the form

(x′2, . . . , x

′j−1, x

′j , x′j+1, . . . , x

′J

)=(

0′, . . . , 0′, x′j , 0′, . . . , 0′

).”5 Fox et al also require the support of X to be a product space, which rules

out including polynomial terms in an element of xj or including interactions of two elements of xj .Given this assumption, Assumption 1.4 holds. Assumption 1.5 holds by Remark 2 in the current paper.

Example. 2 (binary choice) Ichimura and Thompson (1998, Theorem 1) establish the identificationof F (β) under the conditions that i) the coefficient on one of the the non-intercept explanatory variablesin x is known to either always be positive or either always be negative (the sign can be identified), ii)there is some normalization such as the coefficient known to be positive or negative is always either+1 or −1 (more generally the random coefficients lie on an unknown hemisphere), iii) there are largeand product supports on each of the explanatory variables other than the intercept. This rules outpolynomial terms and interactions. If we impose the scale normalization βk∗ = ±1, Assumption 1.4holds if we add large and product support conditions on each explanatory variable in x. An advantageof our estimator is the ease of imposing sign restrictions if necessary. Because the researcher picksBR(N) =

(β1, . . . , βR(N)

), the researcher can choose the grid so that the first element of each vector

βr is always positive, for example. Note that binary choice is a special case of multinomial choice, sothe non-nested identification conditions in example 3, below, can replace these used here. Next, notethat the continuity condition (Assumption 1.5) holds by Remark 3 when the support B is a finite set.We have not shown that continuity holds under only the identification assumptions of Ichimura andThompson (1998, Theorem 1), although we know of no counterexamples.

Example. 3 (multinomial choice without logit errors) Fox and Gandhi (2015, Theorem 2) studythe identification of the multinomial choice model without logit errors. Our linear specification of theutility function for each choice is a special case of what they allow. Fox and Gandhi require a choice-j-specific explanatory variable with large and product support. On other hand, Fox and Gandhi allow

4Theorem 15 of Fox et al also allows homogeneous, product-specific intercepts.5Fox et al discusses what x = 0 means when the means of product characteristics can be shifted.

17

polynomial terms and interactions for x’s other than the choice-j-specific large support explanatoryvariables, unlike examples 1 and 2. They do not require large support for the x’s that are not thelarge support, choice-specific explanatory variables. The most important additional assumption foridentification is that Fox and Gandhi require that F (β) takes on at most a finite number T of supportpoints, although the number T and support point identities β1, . . . , βT are learned in identification.The number T in the true F0 is not related in any way to the finite-sample R (N) used for estimationin this paper. So Assumption 1.4 holds under this restriction on F . Next, the continuity Assumption1.5 holds also by Remark 3 when the support B is a finite set.

7 Estimation Error Bounds

We next derive convergence rates for the approximation of the underlying distribution of heterogeneousparameters F (β) for a subset of the models allowed in the consistency theorems in Sections 3 and5.2. These results highlight, among other issues, the curse of dimensionality in the dimension of theheterogeneous parameters. Larger sample sizes are needed if the dimension K of the heterogeneousparameters is larger. We present results for the least squares criterion and do not include corollariesfor the likelihood criterion.

7.1 Discrete Approximation Estimator

Our results apply to a narrower set of true models gj(x, β) than the earlier consistency theorems. Inparticular, we require that gj(x, β) be smooth, so we allow our Example 1, the logit case, but not theother examples, where gj(x, β) involves an indicator function. Nevertheless, the random coefficientslogit is a leading model used in empirical work and the results exploiting the smoothness of gj(x, β)

are illustrative of issues, such as the curse of dimensionality, that apply also to nonsmooth choices forgj(x, β).

Our results relate to the literature on sieve estimations (e.g., Newey (1997) and Chen (2007)).Of course, the major difference is our choice of discrete basis functions, motivated by our estimator’scomputational advantages and our desire to easily constrain our estimate to be a valid CDF. We cannotdirectly apply previous results because of our different sieve space. Our proof technique to derive ourerror bounds uses results on quadrature, as we will explain.

We restrict the true distribution F0 to lie in a class of distributions that have smooth densities on acompact space of heterogeneous parameters. We define the parameter space for the choice probabilitiesP (x, F ) as the collection of those generated by such smooth densities:

H =

{P (x, F ) | max

0≤s≤ssupβ∈B|Dsf(β)| ≤ C,B =

∏K

k=1

[βk, βk

], f = dF ∈ Cs [B]

}, (13)

where Ds = ∂s

∂βα11 ...∂β

αKK

, s = α1 + · · · + αK with D0f = f , giving the collection of all derivatives of

order s. Also, Cs[B] is a space of s-times continuously differentiable density functions defined on B.Therefore we assume any element of the class of density functions that generates H is defined on aCartesian product B, is uniformly bounded by C <∞, is s-times continuously differentiable, and has

18

all own and partial derivatives uniformly bounded by C. The definition of H depends on C and s; thedegree of smoothness s will show up in our convergence rate results.

Let the space of approximating functions be

HR(N) ={P (x, F ) | F ∈ FR(N)

}. (14)

We require that the grid points accumulate in FR(N) such that HR ⊆ HR+1 ⊆ . . .. Our approach usesresults from quadrature to pick approximation choices θr such that we can approximate the choiceprobability P (x, F0) arbitrarily well using approximating functions in HR(N) as R (N)→∞.

In this section and in the corresponding proofs, we let ‖v‖E =√v′v, ‖h‖2L2,N

= 1N

∑Ni=1 ||h(xi)||2E ,

‖h‖2L2=∫||h||2Ed$ (the norm in L2), and ‖h‖2∞ = supx∈X ||h(x)||2E for any function h : X → R,

where $ denotes a probability measure on X . We introduce the linear probability model error ei,j ,as in yi,j = Pj(xi, F ) + ei,j and E [ei,j | X1, . . . , XN ] = 0. In addition to restricting the class of truedistributions, we make the following additional assumptions.

Assumption 4. (i)(ei = (ei,1, . . . , ei,J)′

)Ni=1

are independently distributed; (ii) E [ei | X1, . . . , XN ] =

0; (iii) (Xi)Ni=1 are i.i.d. with a density function bounded above; (iv) g(x, β) is s-times continuously

differentiable w.r.t. β and its (all own and partial) derivatives are uniformly bounded ; (v) the sievespace defined in ( 14) satisfies HR ⊆ HR+1 ⊂ . . . .

Assumptions 4 (i)-(iii) are about the structure of the data and in particular they allow for het-eroskedasticity for the linear probability error, which is necessary for linear probability models. Aspreviewed earlier, Assumption 4 (iv) assumes that the true model g(x, β) is differentiable. We useAssumption 4 (iv) to approximate P (x, F ) using a sieve method for F combined with a quadraturemethod for the choice of weights θr. Assumption 4 (iv) is satisfied by Example 1. Assumption 4 (v)was mentioned previously.

Any asymptotic error bound consists of two terms: the order of bias and the variance. Whileobtaining the variance term is rather standard in the sieve estimation literature, deriving the biasterm depends on the specific choice of basis function (e.g., power series or splines in the previousliterature). Because our choice of basis functions is new in the sieve literature, we first state the orderof bias, meaning the approximation error rate of our sieve approximation to arbitrary conditional choiceprobabilities P (x, F ) in H. Keep in mind this bias result is primarily about the flexibility of a class ofapproximations and has less to do with using a finite sample of data.

Lemma 1. Suppose P (x, F ) ∈ H and suppose Assumptions 4 (iv) and (v) hold. Then there exist F ∗ ∈FR(N) such that for all x ∈ X , ‖P (x, F ∗)− P (x, F )‖2E = O

(R−2s/K

). Further suppose Assumptions

4 (i)-(iii) hold. Then, ‖P (xi, F∗)− P (xi, F )‖2L2,N

= OP(R−2s/K

).

To prove this result, we combine a quadrature approximation with a power series approximationto approximate P (x, F ) =

∫g (x, β) f(β)dβ. We first approximate g (x, β) f(β) using a tensor product

power series in β. Then we approximate the integral of the tensor products approximation with respectto β using quadrature. The complete proof is in Appendix C.3.

Lemma 1 is the key ingredient that allows us to use machinery from the sieve literature for our esti-mator. Let C be a (generic) positive constant. Define ΨR ≡

(E[∑

j gj (Xi, βr) gj

(Xi, β

r′)])

1≤r,r′≤R

19

(a R×R matrix) and its smallest eigenvalue as ξmin(R). Then we obtain the following estimation errorbounds.

Theorem 5. Suppose Assumptions 1.1, 1.3, and 1.4 and Condition 1 hold. Suppose Assumption 4holds. Suppose further that R(N)2 logR(N)

N → 0 and ξmin(R) > 0 for all finite R. Then if P (x, F0) ∈ Hand C is a particular constant,∥∥∥P (xi, FN)− P (xi, F0)

∥∥∥2

L2,N

≤ C ·max

{R(N) logR(N)

ξmin(R(N)) ·N,R(N)−2s/K

}w.p.a.1.

Theorem 5 establishes a bound on the distance between the true conditional choice probability andthe approximated conditional choice probability. It shows how the finite-sample estimator is able tofit data generated by a nonparametric choice of a heterogeneity distribution. Roughly speaking, in theestimation error bound the first term in the max operator corresponds to an asymptotic variance andthe second term corresponds to an asymptotic bias (Chen 2007). Fixing N , a larger R(N) reduces thebias but increases the variance and an optimal choice of R(N) is obtained by balancing the bias andvariance. Obtaining the variance term is standard in the sieve estimation literature and obtaining thebias term requires approximation results that depend on the type of sieves (in the previous literature,e.g., power series or splines). In the proof, we obtain this bias term using Lemma 1.

There are several lessons from this error bound. The approximation error rate in the bias termshows that we have faster convergence with a smoother density function s and slower convergence witha higher dimensional β, K. These results are intuitive. In many other settings, smoother distributionsare easier to approximate and higher dimensional distributions are harder to approximate. Practically,one might want to use caution when estimating an unrestricted joint distribution of β when K is large.

Note that the condition ξmin(R) > 0 for all finite R in Theorem 5 means the regressors gj (Xi, βr)

are not linearly dependent across different βr in least squares estimation. Theorem 5 allows the caselimR→∞ ξmin(R) = 0 but this term drops out in the convergence rate if limR→∞ ξmin(R) > 0. The caselimR→∞ ξmin(R) = 0 means that ΨR, which is the probability limit (as N → ∞ with fixed R) of thesample matrix

(1N

∑Ni=1

∑Jj=1 gj(xi, β

r)gj(xi, βr′))

1≤r,r′≤R, tends to be singular as R grows to infinity.

In this case, as often proposed in the literature (e.g. Hastie, Tibshirani, and Friedman 2009), we canpotentially reduce the variances of resulting estimators using a subset selection method or a shrinkagemethod, such as LASSO and ridge/Tikhonov regression. On the other hand, these approaches cancause additional bias in finite samples.

The previous theorem was for choice probabilities, but we are also interested in the distributionfunction itself. Using the result in Theorem 5, we can approximate the distribution of heterogeneousparameters with the following convergence rate.

Theorem 6. Suppose the assumptions and conditions in Theorem 5 hold. Then for a particularconstant C, w.p.a.1

∣∣∣FN (β)− F0 (β)∣∣∣ ≤ C√ R(N)

ξmin(R(N))max

{√R(N) logR(N)

ξmin(R(N)) ·N,R(N)−s/K

}a.e. β ∈ B.

Not surprisingly, the bias term in this bound suggests that the estimator of the distribution function

20

suffers from a curse of dimensionality in the number of heterogeneous parameters K and has a fasterrate of convergence if the true generating process is smoother, as indexed by s.

7.2 Continuous Distribution Function with Compact Support

Here we consider the asymptotics for the smooth density estimator proposed in Section 5.2. In this ap-proach, we approximate the vector of choice probabilities P (xi) = (P1 (xi) , . . . , PJ (xi)) using smoothbasis functions as

P (xi) ≈R∑r=1

θr∫g(xi, β)dΦ(β | λr),

where Φ(β | λr) denotes the rth basis distribution function. We focus on simulation as the numericalintegration technique. Therefore,

P (xi) ≈R∑r=1

θr∫g(xi, β)dΦ(β | λr) ≈

R∑r=1

θr

(1

S

S∑s=1

g(xi, βr,s)

), (15)

where the (βr,s)Ss=1 are drawn from Φ(β | λr). We then can rewrite (15) as

P (xi) ≈R∑r=1

θr

(1

S

S∑s=1

g(xi, βr,s)

)=

R∑r=1

S∑s=1

θr

Sg(xi, β

r,s) =R×S∑r=1

θrg(xi, βr)

for some β r and θr. Therefore, we can interpret (15) as the discrete approximation of P (xi) with R ·Sgrid points and R · (S − 1) restrictions such that the first group of S coefficients on the first group ofS regressors are identical to θ1

S , the second group of S coefficients on the second group of S regressorsare identical to θ2

S , and so on.Therefore, the smooth mixture model can be nested in the discrete approximation case if the support

of β is bounded. To see this note that we have FN,S (β) ≡∑R(N)

r=1 θrSΦ(β | λr) and the simulation-approximated distribution estimator by FN,S (β) ≡

∑R(N)r=1 θrS

1S

∑Ss=1 1 [βr,s ≤ β] where θS solves (12).

Then we can obtain∣∣∣FN,S (β)− F0(β)∣∣∣ ≤ ∣∣∣FN,S (β)− FN,S(β)

∣∣∣+∣∣∣FN,S (β)− F0(β)

∣∣∣≤

R(N)∑r=1

θrS

∣∣∣∣∣ 1SS∑s=1

1 [βr,s ≤ β]− Φ(β | λr)

∣∣∣∣∣+∣∣∣FN,S (β)− F0(β)

∣∣∣ p→ 0 (16)

as N →∞ and S →∞. The first term in (16) goes to zero because for any given r the empirical distri-bution obtained from the simulation draws converges to the CDF of the draws (by the Glivenko–Cantellitheorem) and the second term in (16) goes to zero because FN,S (β) can be seen exactly as our dis-crete approximation estimator. We can also bound the convergence rate of the first term in (16) byR(N)/

√S. Note that the term

∣∣∣ 1√S

∑Ss=1 {1 [βr,s ≤ β]− Φ(β | λr)}

∣∣∣ = OP∗(1) for each given λr by acentral limit theorem because βr,s are iid draws from Φ(β|λr) and E∗[1 [βr,s ≤ β]] = Φ(β|λr) for eachr, where E∗[·] denotes expectation and P∗ denotes probability measure with respect to the simulation

21

draw. It follows that∑R(N)

r=1 θrS1√S

∣∣∣ 1√S

∑Ss=1 {1 [βr,s ≤ β]− Φ(β | λr)}

∣∣∣ ≤ C ·R(N)/√S w.p.a.1, which

is the bound for the first term in (16). As discussed above, Theorem 6 gives the convergence rate ofthe second term in (16), again because FN,S (β) can be seen exactly as our discrete approximationestimator.

Of course, the approach of nesting the mixture of continuous densities into the discrete approx-imation does not handle the full set of models allowed by the consistency theorem, Theorem 3. Inparticular, mixtures of normals are not allowed. Future work could explore more general results forthe rate of convergence of the estimator using a mixture of continuous densities.

8 Conclusion

Previous papers have introduced fixed grid estimators for distributions of heterogeneity in structuralmodels. We explore the asymptotic properties of the nonparametric distribution estimators. We showconsistency in the function space of all distributions under the weak topology by viewing our estimatorsas sieve estimators and verifying high-level conditions in Chen and Pouzo (2012). We also showconsistency for i) a model with homogenous parameters and ii) an estimator based on a mixture oversmooth basis distribution functions on possibly unbounded supports. We verify some of the conditionsfor consistency for three example discrete choice models, each of which is widely used in empiricalwork. We also derive the convergence rates of the least squares nonparametric estimator for a classof differentiable models, for which we can derive the approximation error rates of our nonparametricsieve space.

22

A Well-posedness

Chen (2007) and Carrasco, Florens and Renault (2007) distinguish between functional (here the distri-bution) optimization and identification problems that are well-posed and problems that are ill-posed.Using Chen’s definition, the optimization problem of maximizing the population criterion functionQ (F ) with respect to the distribution function F will be well-posed if dLP(Fn, F0) → 0 for all se-quences {Fn} in F such that Q(Fn) − Q(F0) → 0. The problem will be ill-posed if there exists asequence {Fn} in F such that Q(Fn) − Q(F0) → 0 but dLP(Fn, F0) 9 0.6 We now argue that ourproblem is well-posed.

Note that F is compact in the weak topology (Assumption 1.1). Also, Q (F ) is continuous on F byAssumption 1.5. It follows that with our choice of the criterion function and metric, our optimizationproblem is well posed in the sense of Chen (2007) because for every ε > 0 we have

infF∈FR(N):dLP(F,F0)≥ε

(Q(F )−Q(F0)) ≥ infF∈F :dLP(F,F0)≥ε

(Q(F )−Q(F0)) > 0, (17)

where the first inequality holds because FR(N) ⊂ F by construction and the second, strict inequalityholds as the minimum is attained by continuity and compactness and because the model is identified(Assumption 1.4), as we argue in the proof of Theorem 1. Therefore, our optimization problem satisfiesChen’s definition of well-posedness.

B Proofs of Consistency

B.1 Proofs of Theorems 1 and 2

We provide proof of the consistency theorem for models with homogenous parameters because theessentially same proof can be applied to the models without homogeneous parameters.

We verify the conditions of CP’s Lemma A.2 (also see Theorem 3.1 in Chen (2007)) in our con-sistency proof. To provide completeness, we first present our simplified version of CP’s Lemma A.2,which does not incorporate a penalty function. We let α = (γ, F ) ∈ A ≡ Γ× F , AR(N) ≡ Γ× FR(N),and with possible abuse of notation dLP(α1, α2) ≡ ‖γ1 − γ2‖E + dLP(F1, F2) for α1, α2 ∈ A. DefineQN (α) = 1

NJ

∑Ni=1

∥∥yi − ∫ g (xi, β, γ) dF∥∥2

Eand Q(α) ≡ E

[∥∥y − ∫ g (x, β, γ) dF∥∥2

E/J].

Lemma 2. Lemma A.2 of CP: Let αN = (γ, FN ) be such that QN (αN ) ≤ infα∈AR(N)QN (α) +

Op(νN ) with νN → 0. Suppose the following conditions (B.2.1)–(B.2.4) hold:

• (B.2.1) (i) Q(α0) < ∞; (ii) there is a positive function δ(N,R(N), ε) such that for each N ≥ 1,R ≥ 1, and ε > 0, infα∈AR(N):dLP (α,α0)≥εQ(α)−Q(α0) ≥ δ (N,R(N), ε) and lim infN→∞ δ (N,R(N), ε) ≥0 for all ε > 0.

• (B.2.2) (i) (A, dLP(·)) is a metric space; (ii) AR ⊆ AR+1 ⊆ A for all R ≥ 1, and there exists asequence ΠNα0 ∈ αR(N) such that dLP (ΠNα0, α0)→ 0.

6Whether the problem is well-posed or ill-posed also depends on the choice of the metric. For example, if one usesthe total variation distance metric instead of the the Lévy-Prokhorov metric, the problem will be ill-posed because thedistance between a continuous distribution and any discrete distribution will always be equal to one in the total variationmetric.

23

• (B.2.3) (i) QN (α) is a measurable function of the data ((yi, xi))Ni=1 for all α ∈ AR(N); (ii) αN is

well-defined and measurable with respect to the Borel σ-field generated by the topology inducedby the metric dLP(·, ·).

• (B.2.4) (i) Let cQ(R(N)) = supα∈AR(N)

∣∣∣QN (α)−Q(α)∣∣∣ p→ 0;

(ii) max{cQ(R(N)), νN , |Q (ΠNα0)−Q (α0)|

}/δ (N,R(N), ε)

p→ 0 for all ε > 0.

Then dLP(αN , α0)p→ 0.

Using Lemma 2 we now provide our consistency proof for the least squares estimator. Becauseour estimator is an extremum estimator, we can take νN to be arbitrarily small. We start with thecondition (B.2.1). The condition Q(α0) < ∞ holds because Q(α) ≤ 1 for all α ∈ A, because we havea linear probability model. Next we will verify the condition

infa∈AR(N):dLP(α,α0)≥ε

Q(α)−Q(α0) ≥ δ(N,R(N), ε) > 0 (18)

for each N ≥ 1, R(N) ≥ 1, ε > 0, and some positive function δ(N,R(N), ε) to be defined below. Wewill use our assumption of identification (Assumption 1.4). Let m (x, α) = P (x, α0)− P (x, α), whereP (x, α) =

∫gj (x, β, γ) dF (β). Note that we have

Q(α) = E[||y − P (x, α0) +m(x, α)||2E/J

]= E

[||y − P (x, α0)||2E/J

]+ E

[||m(x, α)||2E/J

](19)

because E[(y − P (x, α0))′m(x, α)] = 0 by the law of iterated expectation and E[y − P (x, α0) | x] = 0.Therefore, for each α ∈ A, we have

Q(α)−Q(α0) = E[||m(x, α)||2E/J

]− E

[||m(x, α0)||2E/J

]= E

[||m(x, α)||2E/J

](20)

because m(x, α0) = 0. The condition (18) now holds due to our assumption of identification, as thefollowing argument shows.

Consider E[||m (x, α) ||2E ], with m (x, α) defined above, as a map from A to R+ ∪ {0}. For anyα 6= α0, E[||m (x, α) ||2E ] takes on positive values for each α ∈ A, because the model is identified on aset X with positive probability. Then note that E[||m (x, α) ||2E ] is continuous in α and also note thatAR(N) is compact. Therefore E[||m (x, α) ||2E ] attains some strictly positive minimum on {α ∈ AR(N) :

dLP(α, α0) ≥ ε}. Then we can take δ(N,R(N), ε) = infα∈AR(N):dLP(α,α0)≥εE[||m(x, α)||2E/J

]> 0 for

all R(N) ≥ 1 with ε > 0.

We have shown that δ(N,R(N), ε) > 0 for all N and hence lim infN→∞ δ(N,R(N), ε) ≥ 0. This isenough for (B.2.1). However, under our assumptions indeed lim infN→∞ δ(N,R(N), ε) > 0 because

δ(N,R(N), ε) = infα∈AR(N):dLP(α,α0)≥ε

(Q(α)−Q(α0)) ≥ infα∈A:dLP(α,α0)≥ε

(Q(α)−Q(α0))

= infα∈A:dLP(α,α0)≥ε

E[||m (x, α) ||2E/J ] > 0,

where the first inequality holds because AR(N) ⊆ A by construction and the second, strict inequalityholds because the model is identified (Assumption 1.4). Here, lim infN→∞ δ(N,R(N), ε) > 0 holdsbecause the last term in the right-hand side of the above inequality does not depend on N and the

24

term is strictly positive.7 This result makes it convenient to verify (B.2.4)(ii) below.Next we consider (B.2.2). First note that (A, dLP) is a metric space and we have AR ⊆ AR+1 ⊆ A

for all R ≥ 1 by construction of our sieve space. Then we claim that there exists a sequence offunctions ΠNα0 ∈ AR(N) such that dLP(ΠNα0, α0)→ 0 as N →∞. First, BR(N) becomes dense in Bby assumption. Second, AR(N) becomes dense in A because the set of distributions FR(N) on a densesubset BR(N) ⊂ B is itself dense. To see this, remember that the class of all distributions with finitesupport is dense in the class of all distributions (Aliprantis and Border 2006, Theorem 15.10). Anydistribution with finite support can be approximated using a finite support in a dense subset BR(N)

(Huber 2004).Next we consider (B.2.3). As Theorem 1 is a special case of Theorem 2, we focus explicitly on

Theorem 2. Consider first Assumption 2.5.a. Remark A.1.(1) (a) of CP says that (B.2.3) holds if eachsieve space AR is compact and the finite-sample objective function is lower semicontinuous in (γ, F ).First note that FR is a compact subset of F for each R because BR is a compact subset of B and henceAR is also a compact subset of A.8 Second we show that for any data ((yi, xi))

Ni=1, QN (α) is continuous

on AR for each R ≥ 1. Since AR is compact, this continuity means our estimator is well defined as theminimum in (6). Because checking the continuity in γ is trivial for given models (e.g. logit model), wefocus on the continuity on FR. For any F1, F2 ∈ FR(N), applying the triangle inequality, we obtain

∣∣∣QN (γ, F1)− QN (γ, F2)∣∣∣ ≤ 2

∑N

i=1

∑J

j=1yi,j

∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ (21)

+∑N

i=1

∑J

j=1

{∫gj(xi, β, γ)(dF1 + dF2)

} ∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ≤ 4

∑N

i=1

∑J

j=1

∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ,where the second inequality holds because yi,j , gj(xi, β, γ), and

∫gj(xi, β, γ)dF (β) are uniformly

bounded by 1 for all j and xi. Then because gj(xi, β, γ) is uniformly bounded by 1 and F1 andF2 are discrete distributions with the finite support BR, in this case the weak convergence implies thatalmost surely QN (γ, F ) is continuous on FR, i.e. for any F1, F2 ∈ FR such that dLP (F1, F2) → 0, itfollows that |QN (γ, F1) − QN (γ, F2)| → 0 almost surely.9 Therefore (B.2.3) holds by Remark A.1.(1)(a) of CP.

7As we discussed in the main text the space F of distributions on B is compact in the weak topology because weassume B itself is compact.

8Alternatively we can also see that FR is compact because the simplex, ∆R(N), itself is compact as we argue below.For any given R and BR, consider two metric spaces, (FR, dLP) and (∆R, || · ||E). Then we can define a continuous mapψ : ∆R → FR because any element in ∆R determines an element in FR. The map is continuous in the sense that forany sequence θn → θ in ∆R we have ψ(θn)→ ψ(θ) in FR. Then it is a simple proof to show that if ∆R is compact, thenFR = {ψ(θ) : θ ∈ ∆R} is also compact.

Proof. Consider an arbitrary sequence {Fn}n∈N ⊆ FR. Since Fn ∈ {ψ(θ) : θ ∈ ∆R} for all n ∈ N, we know that thereexists θn ∈ ∆R with ψ(θn) = Fn for all n ∈ N. Then {θn}n∈N ⊆ ∆R. Next note that since ∆R is compact, there existssome subsequence {θln}n∈N with θln → θ ∈ ∆R. Since the map ψ is continuous, it follows that ψ(θln) → ψ(θ). Andbecause ψ(θln) = Fln , then Fln → ψ(θ) ∈ FR because θ ∈ ∆R. Therefore, we conclude FR is also compact when ∆R iscompact.

9Note that∣∣∫ gj(xi, β)(dF1 − dF2)

∣∣ ≤∑Rr=1 |θ

r1 − θr2| → 0 as dLP (F1, F2)→ 0 for any finite R.

25

For the case of Assumption 2.5.b, we directly assume (B.2.3) for the reasons stated in Remark 8.Next there are two conditions to verify in (B.2.4). We first focus on the uniform convergence of the

sample criterion function in (B.2.4). Here we need to verify the uniform convergence over AR(N) suchthat

sup(γ,F )∈Γ×FR(N)

∣∣∣QN (γ, F )−Q(γ, F )∣∣∣ p→ 0. (22)

For this purpose, it is convenient to view QN (γ, F ) and Q(γ, F ) as functions of γ and θ ∈ ∆R(N)

and write them as QN (γ, θ) and Q(γ, θ), respectively. Then define, for any R, the class of measurablefunctions

GR =

l(y, x, θ, γ) =

∥∥∥∥∥y −∑r

θrg(x, βr, γ)

∥∥∥∥∥2

E

/J : (γ, θ) ∈ Γ×∆R

and note that QN (γ, θ) = N−1

∑Ni=1 l(yi, xi, θ, γ). Then again by Pollard (1984, Theorem II.24), the

uniform convergence (22) holds if and only if the entropy satisfies logN(ε, GR, || · ||L1,N ) = op(N) forall ε > 0. Note that the entropy measure of GR is bounded by the sum of two entropies, one associatedwith FR(N) and the other one associated with Γ. Below we show that the former is op(N). We alsonote that the latter satisfies the entropy condition (and so is op(N)) under Assumption 2.5a-2.5b byTheorem 2.7.11 of van der Vaart and Wellner (1996) for the Lipschitz case and because the class ofindicator functions belongs to the Vapnik-Červonenkis class and has a uniformly bounded entropy(Theorem 2.6.7 of van der Vaart and Wellner 1996).

Now we verify the entropy condition associated with FR(N). Let ∆R(N) be the R (N) unit simplex.Using measures of complexity of spaces, let N(ε, T , || · ||) denote the covering number of the set T withballs of radius ε with an arbitrary norm || · || and let N[](ε, T , ‖·‖) denote the bracketing number ofthe set T with ε-brackets. For ease of notation, below we suppress γ (the result holds for any givenγ ∈ Γ) and define for any R, the class of measurable functions

GR =

{l(y, x, θ) = ||y −

∑r

θrg(x, βr)||2E/J : θ ∈ ∆R

}. (23)

Note (i) QN (θ) = N−1∑N

i=1 l(yi, xi, θ), (ii) {(yi, xi)}Ni=1 are i.i.d., and (iii) E[supθ∈∆R(N)

|l(y, x, θ)|] ≤1 <∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) for relateddiscussion), the entropy condition to satisfy becomes logN(ε,GR, || · ||L1,N )/N

p→ 0 for all ε > 0, where|| · ||L1,N denotes the L1(PN )-norm and PN denotes the empirical measure of the data ((yi, xi))

Ni=1.

The term l(y, x, θ) is Lipschitz in θ, as

|l(y, x, θ1)− l(y, x, θ2)| ≤ 1

J

J∑j=1

(2yj∑r

gj(x, βr)|θr1 − θr2|+

∑r

gj(x, βr)(θr1 + θr2)

∑r

gj(x, βr)|θr1 − θr2|

)

≤ M(·)∑R

r=1|θr1 − θr2| ≤M(·)

√R||θ1 − θ2||E

with some function E[M(·)2] < ∞. The first inequality is obtained by the triangle inequality andthe third inequality holds due to the Cauchy-Schwarz inequality. We also know ∆R is a compactsubset of RR. Now take M(·) = 4, noting that yj , gj(·), and

∑Rr=1 gj(x, β

r)θr are uniformly bounded

26

by 1. Then from Theorem 2.7.11 of van der Vaart and Wellner (1996), we have N[] (2ε,GR, ‖·‖) ≤

N(

ε4√R,∆R, ‖·‖E

)=(

4√Rε

)Rfor any norm ‖·‖. Therefore as long as R(N) logR(N)/N → 0, the

entropy condition associated with FR(N) holds because N(ε,GR, ‖·‖L1,N

)≤ N[]

(2ε,GR, ‖·‖L1,N

)≤(

4√Rε

)R(van der Vaart and Wellner 1996, page 84).

To satisfy the second condition in (B.2.4), we need to bound all three terms in the max{·} func-tion. We have shown the uniform convergence of the sample criterion function (this also satisfies(B.2.4) (i)) and we can take νN to be small enough. We also have |Q(ΠNα0) − Q(α0)| → 0, whichis trivially satisfied by the continuity of Q(α) at α0 and dLP (ΠNα0, α0) → 0. Therefore becauselim infN→∞ δ (N,R(N), ε) > 0, the condition (B.2.4) (ii) is satisfied.

We have verified all the conditions in Lemma 2 (Lemma A.2 of CP) and this completes the consis-tency proof.

B.2 Proof of Corollary 1

Similarly to the baseline least squares estimator, using Lemma 2 we show the consistency of the MLestimator. For ease of notation, we consider the model with heterogeneous parameters only. Thefollowing proof is parallel to the least squares case.

We start with (B.2.1). First, the condition Q(F0) <∞ holds under the assumption that Pj(x, F0)

is bounded away from zero for all j ≤ J . Similarly to the least squares case, next we show that forε > 0

lim infF∈FR(N):dLP(F,F0)≥ε

Q(F )−Q(F0) > 0. (24)

Note that for any F 6= F0, we have

Q(F )−Q(F0) = −

E J∑j=1

yj logPj(x, F )

− E J∑j=1

yj logPj(x, F0)

(25)

= −E

log

J∏j=1

Pj(x, F )/Pj(x, F0)yj

> − log

E J∏j=1

(Pj(x, F )/Pj(x, F0)) yj

= − log

E J∑j=1

Pj(x, F )

= 0,

where the inequality holds by Jensen’s inequality. Here the strict inequality holds because P (x, F0) 6=P (x, F ) with positive probability for any F 6= F0 (Assumption 1.4). The third equality holds by the lawof iterated expectation and Pr[yj = 1|x] = Pj(x, F0). The last result holds because

∑Jj=1 Pj(x, F ) = 1

by construction. Therefore, (24) is satisfied because (25) holds for any F ∈ F such that dLP(F, F0) ≥ ε,by essentially the same argument for the least squares case.

Next, (B.2.2) holds by the same argument for the proof of the least squares estimator.Next, we show (B.2.3) holds. We use Remark A.1.(i)(a) of CP. First note that FR is a compact

subset of F for each R because BR is a compact subset of B. Second we need to show for any data

27

((yi, xi))Ni=1, QN (F ) is continuous on FR for each R ≥ 1. Using the inequality that for 0 < c ≤ a, b,

| log a− log b| ≤ |a− b|/c, we obtain for some constant C

∣∣∣QN (F1)− QN (F2)∣∣∣ ≤ ∑N

i=1

J∑j=1

yi,j |logPj(xi, F1)− logPj(xi, F2)| /N

≤ C∑N

i=1

J∑j=1

yi,j |Pj(xi, F1)− Pj(xi, F2)| /N

= C∑N

i=1

J∑j=1

yi,j

∣∣∣∣∫ gj(xi, β)(dF1 − dF2)

∣∣∣∣ /Nand therefore (B.2.3) holds by the essentially same argument to the proof of the least squares case (seethe discussion below (21)).

Next there are two conditions to verify in (B.2.4). We first focus on the uniform convergence of thesample criterion function. Here we need to verify the uniform convergence over FR(N) such that

supF∈FR(N)

∣∣∣QN (F )−Q(F )∣∣∣ p→ 0. (26)

For this purpose, it is convenient to view QN (F ) and Q(F ) as functions of θ ∈ ∆R(N) and write themas QN (θ) and Q(θ), respectively. Then define, for any R, the class of measurable functions

GMLR =

l(y, x, θ) =

J∑j=1

yj log

(∑r

θrgj(x, βr)

): θ ∈ ∆R

. (27)

Note (i) QN (θ) = −N−1∑N

i=1 l(yi, xi, θ), (ii) ((yi, xi))Ni=1 are i.i.d., and (iii)E

[supθ∈∆R(N)

|l(y, x, θ)|]<

∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) for relateddiscussion), the entropy condition becomes logN(ε,GML

R , || · ||L1,N )/Np→ 0 for all ε > 0. Using the

inequality that for 0 < c ≤ a, b, | log a− log b| ≤ |a− b|/c, we obtain for some constant C

|l(y, x, θ1)− l(y, x, θ2)| =

∣∣∣∣∣∣J∑j=1

yj

{log

(∑r

θr1gj(x, βr)

)− log

(∑r

θr2gj(x, βr)

)}∣∣∣∣∣∣≤ C

J∑j=1

yj

∣∣∣∣∣∑r

θr1gj(x, βr)−

∑r

θr2gj(x, βr)

∣∣∣∣∣≤ C

J∑j=1

yj

{∑r

gj(x, βr) |θr1 − θr2|

}≤M(·)

∑R

r=1|θr1 − θr2| ≤M(·)

√R||θ1 − θ2||E

with some function E[M(·)2] < ∞. Therefore, the first condition in (B.2.4) holds by the essentiallysame argument to the proof of the least squares case. Finally, the second condition in (B.2.4) alsoholds by the essentially same argument to the proof of the least squares case. We have verified all theconditions in Lemma 2 (Lemma A.2 of CP) for the ML estimator.

28

B.3 Proof of Theorem 3

We can prove Theorem 3 by verifying conditions (B.2.1)–(B.2.4) in Lemma 2, just as in the proof of The-orem 1. Observe that (B.2.1)–(B.2.2) are clearly satisfied because the arguments that (B.2.1)–(B.2.2)are satisfied within the proof of Theorem 1 follow almost exactly if we replace g(x, βr), FR, F , andQ(F ) with g(x, λr), Pλ,R, Pλ, and Q(Pλ). The verification of (B.2.3) also follows the arguments inthe proof of Theorem 1, instead using QN (Pλ) and ΛR. Condition (B.2.4), regarding the uniformconvergence of QN (Pλ) to Q(Pλ), is satisfied by invoking the same arguments as in the proof Theorem1, using QN (Pλ) and Q(Pλ). Other arguments are also essentially identical to Theorem 1. ThereforedLP

(Pλ,N , Pλ,0

)p→ 0, which also implies dLP

(FN , F0

)p→ 0 because we assume F0 ∈ FM and because

F =∫

Φ(·|λ)dPλ(dλ) is continuous on Pλ in the Lévy-Prokhorov metric. We omit a complete proof ofTheorem 3 to eliminate redundancy.

B.4 Proof of Theorem 4

Define gS(x, λr) = 1S

∑Ss=1 g(x, βr,s) where βr,s are drawn from Φ(β|λr) and define QN,S(Pλ) =

N−1∑N

i=1

∥∥yi −∑r θrgS(xi, λ

r)∥∥2

E/J . Then define the estimator of Pλ,0 as

Pλ,N,S = argminPλ∈Pλ,R(N)QN,S(Pλ) + C · νN (28)

with some tolerance of minimization, C · νN , that tends to zero (if this is necessary).We prove the consistency by verifying conditions in Lemma 2 for the simulated distribution func-

tion estimator. Note that (B.2.1)–(B.2.2) are clearly satisfied because the proofs of (B.2.1)-(B.2.2)in Theorem 1 hold by replacing g(x, βr), FR , F , and Q(F ) with gS(x, λr), Pλ,R, Pλ, and Q(Pλ),respectively, when necessary. We focus on (B.2.3) and (B.2.4).

To show (B.2.3) holds, we use Remark A.1.(i)(a) of CP. First note that Pλ,R is a compact subsetof Pλ for each R because ΛR is a compact subset of Λ. Second we need to show that for any data((yi, xi))

Ni=1, QN,S(Pλ) is continuous on Pλ,R for each R ≥ 1. Since Pλ,R is compact, this continuity

means our estimator is well defined as the minimum in (28). Note∫gS (x, λ) dPλ,l =

∑Rr=1 θ

rl gS (x, λr)

for Pλ,l ∈ Pλ,R, l = 1, 2. Then, for any Pλ,1, Pλ,2 ∈ Pλ,R(N), applying the triangle inequality, we obtain

∣∣∣QN,S(Pλ,1)− QN,S(Pλ,2)∣∣∣ ≤ 2

∑N

i=1

∑J

j=1yi,j

∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ+

∑N

i=1

∑J

j=1

{∫gSj (xi, λ)(dPλ,1 + dPλ,2)

} ∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ≤ 4

∑N

i=1

∑J

j=1

∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ,where the second inequality holds because yi,j , gSj (xi, λ), and

∫gSj (xi, λ)dPλ are uniformly bounded

by 1 for all j and xi. Then because gSj (xi, λ) is uniformly bounded by 1 and Pλ,1 and Pλ,2 are discretedistributions with the finite support ΛR, weak convergence implies that almost surely QN,S(Pλ) iscontinuous on Pλ,R, i.e. for any Pλ,1, Pλ,2 ∈ Pλ,R such that dLP (Pλ,1, Pλ,2) → 0, it follows that|QN,S(Pλ,1) − QN,S(Pλ,2)| → 0 almost surely. Therefore, by Remark A.1.(i) (a) of CP, (B.2.3) holdswith simulated basis functions with any S.

29

Next we verify (B.2.4). (B.2.4) (ii) is clearly satisfied as in Theorem 1. We focus on (B.2.4) (i): theuniform convergence of QN,S(Pλ) to Q(Pλ) for Pλ ∈ Pλ,R(N), supPλ∈Pλ,R(N)

|QN,S(Pλ)−Q(Pλ)| p→ 0. Asin Theorem 1 it is convenient to view QN,S(Pλ) and Q(Pλ) for Pλ ∈ Pλ,R(N) as functions of θ ∈ ∆R(N)

and then write them as QN,S(θ) and Q(θ), respectively. Then the uniform convergence condition toverify becomes

supθ∈∆R(N)

∣∣∣QN,S(θ)−Q(θ)∣∣∣ p→ 0. (29)

Define, for any S, QS(θ) ≡ E[∥∥yi −∑r θ

rgS (xi, λr)∥∥2

E/J], where the expectation is taken with

respect to (yi, xi) while fixing the simulation draws in gS (xi, λr). We first show that

supθ∈∆R(N)

∣∣∣QN,S(θ)−QS(θ)∣∣∣ p→ 0.

Later we show supθ∈∆R(N)|QS(θ)−Q(θ)| p→ 0; therefore invoking the triangle inequality we obtain

(29).For any R define the class of measurable functions

GR,S =

l(y, x, θ) =

∥∥∥∥∥y −∑r

θrgS(x, λr)

∥∥∥∥∥2

E

/J : θ ∈ ∆R

.

Note (i) QN,S(θ) = N−1∑N

i=1 l(yi, xi, θ), (ii) {(yi, xi)}Ni=1 are i.i.d., and (iii) E[supθ∈∆R(N)

|l(y, x, θ)|] ≤1 < ∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) forrelated discussion), the uniform convergence (29) holds if and only if logN(ε,GR,S , || · ||L1,N )/N

p→ 0

for all ε > 0, where || · ||L1,N denotes the L1(PN )-norm and PN denotes the empirical measure of thedata {(yi, xi)}Ni=1. Then following the same steps in the proof of Theorem 1, we conclude as long asR(N) logR(N)/N → 0, the uniform convergence of QN,S(θ) to QS(θ) holds for any given S.

Next we show supθ∈∆R(N)|QS(θ) − Q(θ)| p→ 0 in P∗ where P∗ denotes probability measure w.r.t.

30

the simulation draws. Consider that

|QS(θ)−Q(θ)|

= E

∥∥∥∥∥yi −∑r

θrgS(xi, λr)

∥∥∥∥∥2

E

−

∥∥∥∥∥yi −∑r

θrg(xi, λr)

∥∥∥∥∥2

E

≤ J−1

J∑j=1

E

∣∣∣∣∣∣(yi,j −

∑r

θrgSj (xi, λr)

)2

−

(yi,j −

∑r

θrgj(xi, λr)

)2∣∣∣∣∣∣

≤ J−1J∑j=1

E

[∣∣∣∣∣(

2yi,j −∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

)(∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

)∣∣∣∣∣]

≤ 2J−1J∑j=1

E

[∣∣∣∣∣∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

∣∣∣∣∣]

≤ 2J−1J∑j=1

∑r

θrE[∣∣gSj (xi, λ

r)− gj(xi, λr)∣∣]

≤ 2J−1J∑j=1

∑r

θr maxrE

[∣∣∣∣∣ 1SS∑s=1

{gj(xi, β

r,s)−∫gj(xi, β)dΦ(β|λr)

}∣∣∣∣∣]

= oP∗(1),

where the third inequality holds because 0 <∑

r θrgSj (xi, λ

r) ≤ 1 and 0 <∑

r θrgj(xi, λ

r) ≤ 1 andthe last result holds because

∑Rr=1 θ

r = 1 and βr,s are iid draws from Φ(β|λr), so E∗[gj(x, βr,s)] =∫gj(x, β)dΦ(β|λr) < ∞ for each r and we applied a LLN to the term inside the | · | bracket for each

r, which holds for any x ∈ X . Note that the convergence of the partial sum inside the | · | bracket isindependent of r because we use S simulation draws independently drawn for each λr.

Here E∗[·] denotes expectation w.r.t. the simulation draw. Note that the above oP∗(1) result doesnot depend on θ and therefore we also have supθ∈∆R(N)

|QS(θ)−Q(θ)| p→ 0. Then invoking the triangleinequality we obtain the uniform convergence in (29).

We have verified all the conditions in Lemma 2 (Lemma A.2 of CP) and therefore showed that Pλ,N,Sis a consistent estimator for Pλ,0. This in turn implies the consistency of FN,S =

∫Φ(·|λ)Pλ,N,S(dλ)

to F0 =∫

Φ(·|λ)Pλ,0(dλ) because Φ(β|λ) is continuous in λ for all β and thus F =∫

Φ(·|λ)dPλ(dλ) isalso continuous on Pλ in the Lévy-Prokhorov metric.

C Proofs for Asymptotic Bounds Results

We let C, C1, C2, . . . denote generic positive constants. We use diag (A) to denote a diagonal matrixcomposed of diagonal elements of a matrix A. We often use the following inequality (denoted by RCS

for the Cauchy-Schwarz inequality for R terms in a sum):∑R

r=1Wr ≤√R√∑R

r=1W2r for a sequence

Wr’s. We note ‖P (x, F0)‖∞ ≤ ζ0 and ‖g (x, βr)‖∞ ≤ ζ0 for some constant ζ0 > 0 uniformly overr ≤ R(N) because probabilities are bounded by one. We also note for some c0 > 0, ‖g(x, βr)‖L2

≥ c0

uniformly over r ≤ R(N) because X , the support of X, has positive measure and the support of β isbounded. We first present preliminary lemmas that are useful to prove Theorem 5 and Theorem 6.

31

We define ΨN,R =(

1N

∑Ni=1

∑Jj=1 gj(xi, β

r)gj(xi, βr′))

1≤r,r′≤R(a R×R matrix).

Lemma 3. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then

min{

Pr{||g (·, βr) ||2L2,N

≤ 2||g(·, βr)||2L2, ∀r

},Pr

{||g(·, βr)||L2 ≤ 2||g(·, βr)||L2,N

, ∀r}}≥

1−R exp(−C1Nc

40/ζ

40

).

Proof. The claim follows from the union bound applied to the r-specific events and Hoeffding’s in-equality.

Lemma 3 implies that diag (ΨN,R) ≤ 2diag (ΨR) holds with probability approaching one (w.p.a.1)because ‖g (·, βr)‖2L2,N

’s are diagonal elements of ΨN,R and ‖g (·, βr)‖2L2’s are diagonal elements of ΨR.

The purpose of the following Lemma 4 is to show that ΨN,R ≥ ΨR/2 holds w.p.a.1.

Lemma 4. Let G = span{g(·, β1), . . . , g(·, βR)} be the linear space spanned by functions g(·, β1), . . . , g(·, βR).Suppose Assumptions 4 (i)-(iii) hold. Then Pr

{supµ∈G\{0}(‖µ‖

2L2/ ‖µ‖2L2,N

) > 2}≤ R2 exp

(−C2N/(ζ

40R

2)).

Proof. Let φ1, . . . , φM be an orthonormal basis of G in L2($) with M ≤ R. Also let ρ(D) denote thefollowing quantity for a symmetric matrix D: ρ(D) = sup

∑l |al|

∑l′ |al′ |

∣∣Dl,l′∣∣ , where the sup is taken

over sequences {al}Ml=1 with∑M

l=1 a2l = 1. Then, following Lemma 5.2 in Baraud (2002), we have

Pr

{sup

µ∈G\{0}

‖µ‖2L2

‖µ‖2L2,N

> c

}≤M2 exp

(−N ($0 − c−1)2

4$1 max{ρ2 (A) , ρ (B)

}) (30)

where Al,l′ =√E[||φl||2E ||φl′ ||2E ] and Bl,l′ =

∥∥∥∥φl∥∥E∥∥φl′∥∥E∥∥∞ for l, l′ = 1, . . . ,M and $0 and $1

denote the lower bound and upper bound of the density of X, respectively. We find∣∣Al,l′∣∣ ≤ ζ2

0 and∣∣Bl,l′∣∣ ≤ ζ20 . It follows that

ρ(A) ≤ ζ20 sup

∑l|al|∑

l′|al′ | = ζ2

0 sup(∑

l|al|)2≤ ζ2

0 supM∑

l|al|2 = ζ2

0M ≤ ζ20R

where (∑

l |al|)2 ≤M

∑l |al|

2 holds by the Cauchy-Schwarz inequality. Similarly we have ρ(B) ≤ ζ20R.

The conclusion follows from (30) and M ≤ R.

Now define an event E0 ={

ΨN,R −(ξmin(R)/4ζ2

0

)diag(ΨN,R) ≥ 0

}.

Lemma 5. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then,Pr{E0} ≥ 1−R exp

(−C1Nc

40/ζ

40

)−R2 exp

(−C2N/ζ

40R

2)

Proof. First, note that ΨR − (ξmin(R)/ζ20 )diag (ΨR) ≥ 0. Now let G be the linear space spanned by

g(·, β1), . . . , g(·, βR). Now note that, under the eventA ={||g(·, βr)||2L2,N

≤ 2||g(·, βr)||2L2∀r = 1, . . . , R

},

we have diag(ΨN,R) ≤ 2diag(ΨR). Also, under the event B ={

supµ∈G\{0}

(‖µ‖2L2

/ ‖µ‖2L2,N

)≤ 2},

we have ΨN,R ≥ ΨR/2. Therefore, under the intersection of the two events, A and B, we have

ΨN,R − ξmin(R)diag (ΨN,R) /(4ζ20 ) ≥ ΨR/2− 2ξmin(R)diag (ΨR) /(4ζ2

0 ) ≥ 0,

32

or event E0. Then by Lemma 3 and Lemma 4, we find 1 − Pr{E0} ≤ R · exp(−C1Nc

40/ζ

40

)+ R2 ·

exp(−C2N/ζ

40R

2)because Pr{Ec0} ≤ Pr{Ac ∪Bc} ≤ Pr{Ac}+ Pr{Bc}.

Lemma 6. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then, for given positivesequence ηN

Pr{∣∣∣∑

ie′ig(Xi, β

r)/N∣∣∣ ≤ ηN ||g(·, βr)||L2,N

for all r = 1, . . . , R}

≥ 1−R · exp(−CNη2

Nc20/ζ

20

)−R · exp

(−C1Nc

40/ζ

40

).

Proof. Hoeffding (1963)’s inequality implies that

EX

[Pr{∣∣∣∑

ie′ig(Xi, β

r)/N∣∣∣ ≥ ηN ||g(·, βr)||L2,N

, ∀r ≤ R |X1, . . . , XN

}](31)

≤ EX

[∑r

exp(−2Nη2

N ||g(·, βr)||2L2,N/4Jζ2

0

)]because E [e′ig (Xi, β

r) | X1, . . . , XN ] = 0, choice probabilities lie in [0, 1], and −√Jζ0 ≤ e′ig(Xi, β

r) ≤√Jζ0 (by the Cauchy-Schwarz inequality) uniformly.Now note under the event

{||g(·, βr)||L2 ≤ 2||g(·, βr)||L2,N

, ∀r = 1, . . . , R},

∑R

r=1exp

(−Nη2

N ‖g (·, βr)‖2L2,N/2Jζ2

0

)≤

∑R

r=1exp

(−Nη2

N ‖g (·, βr)‖2L2/8Jζ2

0

)(32)

≤∑R

r=1exp

(−Nη2

Nc20/8Jζ

20

)= R exp

(−CNη2

Nc20/ζ

20

).

From (31)-(32) and Lemma 3, the claim follows.

The following Lemma 7 decomposes the error bound into a bias term and a variance term.

Lemma 7. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then, for any N ≥ 1,R ≥ 2, and a > 1, we have for all F ∈ FR(N) ,

∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ a+ 1

a− 1‖P (xi, F )− P (xi, F0)‖2L2,N

+ Cη2NRζ

20

ξmin(R)

a2

a− 1,

where the inequality holds with probability greater than 1− pN,R,pN,R ≡ R exp

(−CNη2

Nc20/ζ

20

)+ 2R exp

(−C1Nc

40/ζ

40

)+R2 exp

(−C2N/ζ

20R

2).

Proof. Because P (xi, FN ) (i.e., FN ) is the solution of the minimization problem in (6), we have

||P (xi, FN )− yi||2L2,N≤ ||P (xi, F )− yi||2L2,N

(33)

for any F ∈ FR(N). Now note that

∥∥∥P (xi, FN )− yi∥∥∥2

L2,N

=∥∥∥P (xi, FN )− P (xi, F0) + P (xi, F0)− yi

∥∥∥2

L2,N

=∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

− 2∑

ie′i

(P (xi, FN )− P (xi, F0)

)/N + ||ei||2L2,N

, (34)

33

where we use the definition ei = yi − P (xi, F0). Similarly we obtain

∥∥∥P (xi, FN )− yi∥∥∥2

L2,N

= ‖P (xi, F )− P (xi, F0)‖2L2,N−2∑

ie′i (P (xi, F )− P (xi, F0)) /N+||ei||2L2,N

.

(35)

Subtracting (35) from (34) and by (33), we obtain∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑i

e′i(P (xi, FN )− P (xi, F ))/N.

Let VN,r = 1N

∑Ni=1 e

′ig(xi, β

r). Then, 1N

∑i e′i(P (xi, FN ) − P (xi, F )) =

∑Rr=1 VN,r(θ

r − θr) by con-struction of P (xi, FN ) and P (xi, F ) and we obtain by the triangle inequality,∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑r

|VN,r| · |θr − θr|. (36)

Define the event E1 =⋂Rr=1

{|VN,r | ≤ ηN ||g(·, βr)||L2,N

}. Then under {E0 ∩E1}, we have

∑R

r=1V 2N,r(θ

r − θr)2 ≤ η2N

∑R

r=1||g(·, βr)||2L2,N

(θr − θr)2 (37)

= η2N

1

N

∑N

i=1

∑R

r=1(θr − θr)2||g(xi, β

r)||2E = η2N (θ − θ)′diag(ΨN,R)(θ − θ)

≤ η2N

(ξmin(R)/4ζ2

0

)−1(θ − θ)′ΨN,R(θ − θ) = η2

N

(ξmin(R)/4ζ2

0

)−1 ||P (xi, FN )− P (xi, F )||2L2,N.

From (36) and (37), it follows that under the event {E0 ∩E1},∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑R

r=1|VN,r| · |θr − θr|

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2√R

(∑R

r=1V 2N,r

(θr − θr

)2)1/2

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2ηN

√R(ξmin(R)/4ζ2

0

)−1/2 ||P (xi, FN )− P (xi, F )||L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N

+2ηN√R(ξmin(R)/4ζ2

0

)−1/2(∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥L2,N

+ ‖P (xi, F )− P (xi, F0)‖L2,N

)by the Cauchy-Schwarz inequality, RCS, and the triangle inequality. Applying the inequality 2xy ≤x2a+ y2/a (any x,y, a > 0) to x = ηN

√R(ξmin(R)/4ζ2

0 )−1/2 and y = ||P (xi, FN )−P (xi, F0)||L2,Nand

to x = ηN√R(ξmin(R)/4ζ2

0 )−1/2 and y = ||P (xi, F )−P (xi, F0)||L2,N, respectively, we obtain under the

event {E0 ∩E1},∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

/a

+ ‖P (xi, F )− P (xi, F0)‖2L2,N/a+ 2aη2

NR(ξmin(R)/4ζ2

0

)−1

34

It follows that under the event {E0 ∩E1}, for all a > 1,∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ a+ 1

a− 1‖P (xi, F )− P (xi, F0)‖2L2,N

+ Cη2NRζ

20

ξmin(R)

a2

a− 1.

The conclusion follows from Lemma 6.

C.1 Proof of Theorem 5

We first obtain the convergence rates of the choice probability as

Lemma 8. Suppose the assumptions and the conditions in Theorem 5 hold. Then, we have∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ OP

(R(N) logR(N)

ξmin(R(N)) ·N

)+ C · inf

F∈FR(N)

‖P (xi, F )− P (xi, F0)‖2L2,N.

Then the result of Theorem 5 follows from Lemma 8 combined with Lemma 1. Lemma 8 derivesthe variance term of the asymptotic bounds and Lemma 1 derives the bias term. We prove Lemma 8below.

C.2 Proof of Lemma 8

From Lemma 7, the fastest convergence rate will be obtained when the order of ηN is as small aspossible while keeping pN,R → 0. By inspecting pN,R, we note that the optimal rate is obtained whenwe choose ηN = C

√logR(N)/N since the first term in pN,R dominates the second term in pN,R when

ηN is small enough and pN,R → 0 with this choice of ηN . The inspection of the third term in pN,Rreveals that we also require R(N) should satisfy R(N)2 logR(N)/N → 0 so that pN,R → 0. The resultof Lemma 8 follows from these requirements, Lemma 7 and∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

≤ OP

(R(N) logR(N)

ξmin(R(N)) ·N

)+ C · ‖P (xi, F )− P (xi, F0)‖2L2,N

because C η2NR(N)ζ20ξmin(R(N))

a2

a−1 = O(R(N) logR(N)ξmin(R(N))·N

)under our choice of ηN and because Lemma 7 holds for

any F ∈ FR(N).

C.3 Proof of Lemma 1

First we construct approximating power series with the length of L tensor products of higher orderpolynomials of βk’s in β as (ϕ1(β), . . . , ϕl(β), . . . , ϕL(β)), where ϕl(β) is the lth element in the L numberof tensor product polynomials. The tensor products are defined by the functions ϕl(β) = βl11 β

l22 · · ·β

lKK

with β = (β1, β2, . . . , βK) ∈ B =∏Kk=1[β

k, βk] and lk’s are exponents of each βk. For example, we can

let ϕ1(β) = 1, ϕ2(β) = β1, . . . , and ϕl(β) = βl11 βl22 · · ·β

lKK .

Now note that each gj (x, β) f(β) is a member of the Hölder class (s-smooth) of functions sinceit is uniformly bounded and all of its own and partial derivatives up to the order of s are also uni-formly bounded by our restriction on f in H and Assumption 4 (iv). Therefore, we can approximate

35

gj (x, β) f(β) well using power series (see Chen, 2007) and obtain the approximation error rate due toTiman (1963) as (we suppress dependence of al and ϕl on j for notational simplicity)

supβ∈B

∣∣∣∣gj (x, β) f(β)−∑L

l=1al(x)ϕl(β)

∣∣∣∣ = O(L−s/K

)(38)

for all x ∈ X . Let(βk

= bk,1 < bk,2 < · · · < bk,rk+1 = βk, k = 1, . . . ,K)be partitions of the intervals

[βk, βk], k = 1, . . . ,K, into r1, . . . , rK subintervals, respectively. Then we can define the R = r1r2 · · · rK

number of subcubes as {Cι1,...,ιK =∏Kk=1[bk,ιk , bk,ιk+1], ιk = 1, 2, . . . , rk}, which become a partition

P (B) of B. For any choice of R points

(bι1,...,ιK ∈ Cι1,...,ιK | ιk = 1, 2, . . . , rk,k = 1, . . . ,K)

(one bι1,...,ιK for each of R subcubes), now we can approximate a Riemann integral of∑L

l=1 al(x)ϕl(β)

using a quadrature method with R distinct weights

(c(ι1, . . . , ιK) ≡ c(bι1,...,ιK ) | ιk = 1, . . . , rk, k = 1, . . . ,K) such that

∫ ∑L

l=1al(x)ϕl(β)dβ =

∑L

l=1al(x)

∫ϕl(β)dβ

=∑L

l=1al(x)

∑Cι1,...,ιK∈P (B)

c(ι1, . . . , ιK)ϕk(bι1,...,ιK ) +R(δR)

where R(δR) denotes a remainder term with δR = max {diam(Cι1,...,ιK ) : Cι1,...,ιK ∈ P (B)}. Withoutloss of generality, we will pick δR = C · R−1/K . Noting that ϕl(β) is a product of polynomials inβk’s by construction, we can apply Theorem 6.1.2 (Generalized Cartesian Product Rules) of Krommerand Ueberhuber (1998) and so we can approximate multivariate integrals with products of univariateintegrals. Note that

∫ϕl(β)dβ =

∏Kk=1

∫ϕl,k(βk)dβk with ϕl(β) =

∏Kk=1 ϕl,k(βk). Therefore if we

approximate∫ϕl,k(βk)dβk using a univariate quadrature with weights {ck(1), . . . , ck(rk)}, generally we

obtain∫ϕl,k(βk)dβk =

∑rkιk=1 ck(ιk)ϕl,k(bk,ιk) +Rl,k(δR) where Rl,k(δR) denotes a possible remainder

term. Now we can make the univariate quadrature even become accurate (or exact) at least up to theorder of rk (Theorem 5.2.1 in Krommer and Ueberhuber (1998)) i.e.,

∫βpkdβk =

∑rkιk=1 ck(ιk)b

pk,ιk

withsuitable choice of ck(ιk) ≡ ck(bk,ιk) for all p ≤ rk such that Rl,k(δR) = 0.

For notational simplicity, we take rk = r1 for all k. Then with the L = (r1 + 1)K = (R1/K + 1)K

number of power series, we can include powers and cross products of βk’s at least up to the order ofr1. With the choice of L = (r1 + 1)K and ck(ιk)’s that make the univariate quadrature exact at leastup to the order of r1, we can let∫

ϕl(β)dβ =∏K

k=1

∫ϕl,k(βk)dβk =

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk) (39)

36

for l = 1, . . . , L. By adding and subtracting terms, it follows that for some F ∗ defined later

Pj (x, F )− Pj (x, F ∗) =

∫f(β)gj (x, β) dβ −

∫ ∑L

l=1al(x)ϕl(β)dβ (40)

+∑L

l=1al(x)

∫ϕl(β)dβ −

∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk) (41)

+∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)− Pj (x, F ∗) . (42)

We first bound (40) by the triangle inequality and (38), (40) ≤∫ ∣∣∣f(β)gj (x, β)−

∑Ll=1 al(x)ϕl(β)

∣∣∣ dβ =

O(L−s/K · vol(B)). Second note that (41) becomes zero due to (39).Next construct ϕl(br), r = 1, . . . , R such that

∑Rr=1 ϕl(b

r) =∏Kk=1

{∑rkιk=1 ck(ιk)ϕl,k(bk,ιk)

}with

R = r1 · · · rK and br = (br1, br2,, . . . , b

rK) in b ≡ {b : b = (b1,ι1 , . . . , bK,ιK ), ι1 = 1, . . . , r1, . . . , ιK = 1, . . . , rK}.

Then, for any choice of br ∈ b we can always write ϕl(br) = c1(b1,ι1) · · · cK(bK,ιK )ϕl(b) for someb = (b1,ι1 , . . . , bK,ιK ) in b. Therefore without loss of generality write

R∑r=1

ϕl(br) =

∏K

k=1

rk∑ιk=1

ck(ιk)ϕl,k(bk,ιk)

=

R∑r=1

c1(br) · · · cK(brK)ϕl(br). (43)

Define θ∗r = c1(br1) · · · cK(brK)f(br)/{∑R

r=1 c1(br1) · · · cK(brK)f(br)}

and let F ∗ be constructed usingthe weights θ∗. It follows that∣∣∣∣∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)− Pj (x, F ∗)

∣∣∣∣≤

R∑r=1

∣∣∣∣∑L

l=1al(x)ϕl(b

r)− θ∗rgj (x, br)

∣∣∣∣=

R∑r=1

∣∣∣ ∑Ll=1 al(x)c1(br1) · · · cK(brK)ϕl(b

r)− c1(br1)···cK(brK)f(br)∑Rr=1 c1(br1)···cK(brK)f(br)

gj (x, br)∣∣∣

≤R∑r=1

|c1(br1) · · · cK(brK)|

∣∣∣∣∣L∑l=1

al(x)ϕl(br)− f(br)∑R

r=1 c1(br1) · · · cK(brK)f(br)gj (x, br)

∣∣∣∣∣≤

R∑r=1

|c1(br1) · · · cK(brK)| |∑L

l=1al(x)ϕl(b

r)− f(br)gj (x, br) |

+

R∑r=1

|c1(br1) · · · cK(brK)|

∣∣∣∣∣∑R

r=1 c1(br1) · · · cK(brK)f(br)− 1∑Rr=1 c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣ |f(br)gj (x, br)|

= O(R ·R−1L−s/K

)where the first inequality holds by the triangle inequality and (43) and the first equality holds by(43) and by the definition of θ∗r. In the above note that |c1(br1) · · · cK(brK)| is at most O(R−1), whichis obtained if one uses the uniform weights, i.e., ck(b1k) = . . . = ck(b

Rk ) for k = 1, . . . ,K. Second

note that∣∣∣∑L

l=1 al(x)ϕl(br)− f(br)gj (x, br)

∣∣∣ = O(L−s/K) due to the bound in (38). Third note that

37

∑Rr=1 c1(br1) · · · cK(brK)f(br) is another quadrature approximation of the integral

∫B f(β)dβ = 1. Be-

cause f(β) itself belongs to a Hölder class, the approximation error rate of this integral becomes∣∣∣∑Rr=1 c1(br1) · · · cK(brK)f(br)− 1

∣∣∣ = O(L−s/K

). To see this, note that

∣∣∣∣∣∫Bf(β)dβ −

R∑r=1

c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣≤

∣∣∣∣∫Bf(β)dβ −

∫B

∑L

l=1alϕl(β)dβ

∣∣∣∣ (44)

+

∣∣∣∣∑L

l=1al

∫Bϕl(β)dβ −

∑L

l=1al∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)

∣∣∣∣ (45)

+

∣∣∣∣∣∑L

l=1al∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)−

∑L

l=1al

R∑r=1

c1(br) · · · cK(brK)ϕl(br)

∣∣∣∣∣ (46)

+

∣∣∣∣∣R∑r=1

c1(br) · · · cK(brK)∑L

l=1alϕl(b

r)−R∑r=1

c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣ , (47)

where al satisfies supβ∈B

∣∣∣f(β)−∑L

l=1 alϕl(β)∣∣∣ = O

(L−s/K

)(such al’s exist since f(β) belongs to a

Hölder class). Then we find (44)= O(L−s/K · vol(B)

)by the triangle inequality, (45)= 0 by (39), and

(46)= 0 by 43. Finally note (47)≤∑R

r=1 c1(br) · · · cK(brK)∣∣∣∑L

l=1 alϕl(br)− f(br)

∣∣∣ = O(R ·R−1 · L−s/K

)by the triangle equality, |c1(br) · · · cK(brK)| is at most O(R−1), and because al satisfiessupβ∈B

∣∣∣f(β)−∑L

l=1 alϕl(β)∣∣∣ = O

(L−s/K

).

Combining these results, we bound (42) as O(L−s/K). Combining these bounds, we then conclude

|Pj (x, F )− Pj (x, F ∗)| = O(L−s/K · vol(B)

)+ 0 +O

(L−s/K

)= O

(L−s/K

)= O

(((R1/K + 1

)K)−s/K)= O

((R1/K + 1

)−s)≤ O

(R−s/K

).

The second conclusion in the lemma is trivial since the above holds for all x ∈ X and j = 1, . . . , J .

C.4 Proof of Theorem 6

First we derive the distance between θ and θ∗0 where θ∗0 (i.e. F ∗0 ) satisfies ‖P (xi, F∗0 )− P (xi, F0)‖2L2,N

=

OP(R−2s/K

). Note that such F ∗0 exists by Lemma 1. Note that with probability approaching to one,

we have 2 ‖g (·, βr)‖L2,N≥ ‖g (·, βr)‖L2

≥ c0 > 0 for all r = 1, . . . , R by Lemma 3. It follows that

38

c0

2

∑R

r=1|θr − θ∗r0 | (48)

≤∑R

r=1||g(·, βr)||L2,N

|θr − θ∗r0 | ≤√R

(∑R

r=1||g(·, βr)||2L2,N

(θr − θ∗r0 )2

)1/2

≤(4ζ2

0/ξmin(R))1/2√

R||P (xi, FN )− P (xi, F∗0 )||L2,N

≤(4ζ2

0/ξmin(R))1/2√

R(||P (xi, FN )− P (xi, F0)||L2,N

+ ||P (xi, F∗0 )− P (xi, F0)||L2,N

)=

√R(N)/ξmin(R(N)) max

{OP(√%R,N

), OP(R(N)−s/K)

},

where the second inequality holds by RCS, the third inequality holds similarly with (37), and the fourthinequality holds by the triangle inequality. Therefore, by Lemma 1, the result of Theorem 5, and theCauchy-Schwarz inequality, the conclusion follows.

Next we have∣∣∣FN (β)− F0(β)

∣∣∣ ≤ ∣∣∣FN (β)− F ∗0 (β)∣∣∣+ |F ∗0 (β)− F0(β)| by the triangle inequality. It

is not difficult to see that

supβ∈B

∣∣∣FN (β)− F ∗0 (β)∣∣∣ = sup

β∈B

∣∣∣∣∑R

r=1θr1 [βr ≤ β]−

∑R

r=1θ∗r0 1 [βr ≤ β]

∣∣∣∣= sup

β∈B

∣∣∣∣∑R

r=1(θr − θ∗r0 )1 [βr ≤ β]

∣∣∣∣ ≤∑R

r=1|θr − θ∗r0 |,

where the last inequality holds by the triangle inequality. Note that F0(β) =∫f0(b)1 [b ≤ β] db and

F ∗0 (β) becomes a quadrature approximation. We use a similar strategy with the proof of Lemma 1where we approximate f0(b) with

∑Ll=1 af0,lϕl(b) such that supβ∈B

∣∣∣f0(b)−∑L

l=1 af0,lϕl(b)∣∣∣ = O

(L−s/K

)due to Timan (1963). We find F0(β)− F ∗0 (β) = O

(R−s/K

)a.e. β ∈ B. Therefore, the conclusion fol-

lows from this and Theorem 5.

39

References

[1] Ackerberg, D. (2009), “A New Use of Importance Sampling to Reduce ComputationalBurden in Simulation Estimation”, Quantitative Marketing and Economics, 7(4), 343–376.

[2] Aliprantis, C.D. and K.C. Border (2006), Infinite Dimensional Analysis: A Hitchhiker’sGuide, Springer.

[3] Bajari, P., J.T. Fox and S. Ryan (2007), “Linear Regression Estimation of Discrete ChoiceModels with Nonparametric Distributions of Random Coefficients”, American EconomicReview, 72, 2, 459–463.

[4] Baraud, Y. (2002), “Model Selection for Regression on a Random Design”, ESAIM Prob-ability & Statistics 7, 127-146.

[5] Carrasco, M, J.P. Florens, and E. Renault (2007), “Linear Inverse Problems in StructuralEconometrics Estimation Based on Spectral Decomposition and Regularization”, Hand-book of Econometrics, V6B, Elsevier.

[6] Chen, X. (2007), “Large Sample Sieve Estimation of Semi-Nonparametric Models,” Hand-book of Econometrics V7, Elsevier.

[7] Chen, X, O. Linton, and I. van Keilegom (2003), “Estimation of Semiparametric ModelsWhen the Criterion Function is Not Smooth”, Econometrica, 71, 5, 1591-1608.

[8] Chen, X. and D. Pouzo (2012), "Estimation of Nonparametric Conditional Moment Mod-els with Possibly Nonsmooth Generalized Residuals", Econometrica, 80, 1, 277–321.

[9] Fox, J.T. and A. Gandhi (2015), “Nonparametric Identification and Estimation of RandomCoefficients in Multinomial Choice Models”, University of Michigan working paper.

[10] Fox J.T., K. Kim, S. Ryan, and P. Bajari (2011), “A Simple Estimator for the Distributionof Random Coefficients”, Quantitative Economics, 2, 381-418.

[11] Fox, J.T., K. Kim, S. Ryan, and P. Bajari (2012), “The Random Coefficients Logit ModelIs Identified”, Journal of Econometrics, 166(2), 204-212.

[12] Geweke, J. and M. Keane (2007), “Smoothly mixing regressions”, Journal of Econometrics,138, 2007.

[13] Hastie, T, R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning:Data Mining, Inference, and Prediction, Springer Series in Statistics.

[14] Heiss, F. and Winschel, V. (2008), “Likelihood approximation by numerical integrationon sparse grids”, Journal of Econometrics, 144(1), 62–80.

[15] Hoeffding, W. (1963), “Probability Inequalities for Sums of Independent Random Vari-ables”, Journal of the American Statistical Association, 58, 13-30.

[16] Huber, J. (1981, 2004) Robust Statistics, Wiley.

40

[17] Ichimura, H. and T.S. Thompson (1998), “Maximum likelihood estimation of a binarychoice model with random coefficients of unknown distribution,” Journal of Econometrics,86(2), 269–295.

[18] Jacobs, R. , Jordan, M. I., Nowlan, S., and Hinton, G (1991), “Adaptive mixtures of localexperts,” Neural Comput., 3(1), 79–87.

[19] Koenker, R. and I. Mizera (2014), “Convex Optimization, Shape Constraints, CompoundDecisions, and Empirical Bayes Rules”, Journal of the American Statistical Association,109, 506, 674–685.

[20] Krommer, A.R. and C.W. Ueberhuber (1998), Computation Integration, SIAM.

[21] Li, J.Q. and A.R. Barron (2000), “Mixture density estimation”, Advances in Neural Infor-mation Processing Systems, Vol. 12, pp. 279–285.

[22] Matzkin, R.L. (2007) “Heterogeneous Choice”, Advances in Economics and Econometrics,Theory and Applications, Ninth World Congress of the Econometric Society. Cambridge.

[23] McLachlan, G.J. and D. Peel (2000), Finite Mixture Models. Wiley.

[24] Newey, W.K. (1997), “Convergence Rates and Asymptotic Normality for Series Estima-tors”, Journal of Econometrics 79, 147-168.

[25] Parthasarathy, K.R. (1967), Probability Measures on Metric Spaces, Academic Press.

[26] Petersen, B.E. (1983), Introduction to the Fourier Transform and Pseudo-DifferentialOperators, Pitman Publishing, Boston.

[27] Pollard, D. (1984), Convergence of Statistical Processes, Springer-Verlag, New York.

[28] Rust, J. (1987), “Optimal Replacement of GMC Bus Engines: An Empirical Model ofHarold Zurcher”, Econometrica, 55(5): 999–1033.

[29] Teicher, H. (1963), “Identifiability of Finite Mixtures”, Annals of Mathematical Statistics,34, 1265-1269.

[30] Timan, A.F. (1963), Theory of Approximation of Functions of a Real Variable, MacMilan,New York.

[31] Train, K. (2008), “EM Algorithms for Nonparametric Estimation of Mixing Distributions”,Journal of Choice Modeling, 1, 1, 40–69.

[32] Van der Vaart, W. and J.A. Wellner (1996), Weak Convergence and Empirical Processes,Springer Series in Statistics, New York.

[33] Zeevi, A. J. and R. Meir (1997), “Density Estimation Through Convex Combinations ofDensities: Approximation and Estimation Bounds,” Neural Networks, 10(1), 99–109.

41

asimplenonparametricapproachtoestimatingthedistribution...

Documents