partial maximum likelihood estimation of spatial probit ......march 18, 2010 abstract this paper...

43
Partial Maximum Likelihood Estimation of Spatial Probit Models Honglin Wang Michigan State University Emma M. Iglesias Michigan State University and University of Essex Jeffrey M. Wooldridge § Michigan State University March 18, 2010 Abstract This paper analyzes spatial Probit models for cross sectional dependent data in a binary choice context. Observations are divided by pairwise groups and bivariate normal distributions are specied within each group. Partial maximum likelihood estimators are introduced and they are shown to be consistent and asymptotically normal under some regularity conditions. Consistent covariance matrix estimators are also provided. Estimates of average partial eects can also be obtained once we characterize the conditional distribution of the latent error. Fi- nally, a simulation study shows the advantages of our new estimation procedure in this setting. Our proposed partial maximum likelihood estimators are shown to be more ecient than the generalized method of moments counterparts. Keywords: Spatial statistics, Maximum Likelihood, Probit Model. JEL classication: C12, C13, C21, C24, C25. We are grateful to participants at the 2009 Econometric Society European Meeting and at seminars at City University, London, London School of Economics, UCL, University of Exeter and University Carlos III for very useful comments. Any remaining errors are our own. Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038, USA. e-mail: [email protected]. Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038, USA. e-mail: [email protected]. Financial support from the MSU Intramural Research Grants Program is gratefully acknowledged. § Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038, USA. e-mail: [email protected]. 1

Upload: others

Post on 15-Feb-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Partial Maximum Likelihood Estimation of Spatial Probit

    Models∗

    Honglin Wang†

    Michigan State University

    Emma M. Iglesias‡

    Michigan State University

    and University of Essex

    Jeffrey M. Wooldridge§

    Michigan State University

    March 18, 2010

    Abstract

    This paper analyzes spatial Probit models for cross sectional dependent data in a binary

    choice context. Observations are divided by pairwise groups and bivariate normal distributions

    are specified within each group. Partial maximum likelihood estimators are introduced and

    they are shown to be consistent and asymptotically normal under some regularity conditions.

    Consistent covariance matrix estimators are also provided. Estimates of average partial effects

    can also be obtained once we characterize the conditional distribution of the latent error. Fi-

    nally, a simulation study shows the advantages of our new estimation procedure in this setting.

    Our proposed partial maximum likelihood estimators are shown to be more efficient than the

    generalized method of moments counterparts.

    Keywords: Spatial statistics, Maximum Likelihood, Probit Model.JEL classification: C12, C13, C21, C24, C25.

    ∗We are grateful to participants at the 2009 Econometric Society European Meeting and at seminars at CityUniversity, London, London School of Economics, UCL, University of Exeter and University Carlos III for very useful

    comments. Any remaining errors are our own.†Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,

    USA. e-mail: [email protected].‡Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,

    USA. e-mail: [email protected]. Financial support from the MSU Intramural Research Grants Program is gratefullyacknowledged.

    §Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,USA. e-mail: [email protected].

    1

  • 1 Introduction

    Most econometrics techniques on cross-section data are based on the assumption of independence of

    observations. However, economic activities become more and more correlated over space with modern

    communication and transportation improvements. On the other hand, technological advances in

    communications and the geographic information system (GIS) make spatial data more available

    than before. Spatial correlations among observations received more and more attentions in regional,

    real estate, agricultural, environmental and industrial organizations economics (Lee, 2004).

    Econometricians began to pay more attention on spatial dependence problems in the last two

    decades and some important advances have been done in both theoretical and empirical studies1.

    Spatial dependence not only means lack of independence between observations, but also a spatial

    structure underlying these spatial correlations (Anselin and Florax, 1995). There are two ways to

    capture spatial dependence by imposing structures on a model: one is in the domain of geostatistics

    where the spatial index is continuous (Conley, 1999), the other is that spatial sites form a countable

    lattice (Lee, 2004). Among the lattice models, there are also two types of spatial dependence models

    according to spatial correlation between variables or error terms: the spatial autoregressive dependent

    variable model (SAR) and the spatial autoregressive error model (SAE). In most applications of

    spatial models, the dependent variables are continuous (Conley, 1999; Lee, 2004; Kelejian and Prucha,

    1999, 2001; among others), and only few applications address the spatial dependence with discrete

    choice dependent variables (exceptions include: Case, 1991; McMillen, 1995; Pinkse and Slade, 1998;

    Lesage, 2000; Beron and Vijerberg, 2003). This paper is designed to address this gap and we are

    concerned about the SAE model with discrete choice dependent variables.

    As the name indicates, there are two aspects in the discrete choice model with spatial dependence.

    First, the dependent variable is discrete and the leading cases occur where the choice is binary. Probit

    and Logit are the two most popular non-linear models for binary choice problems. For the sake of

    brevity, in this study we focus on Probit model, but the approach developed here generalizes to other

    discrete choice models.

    In discrete choice models, if the observations are independent, we use maximum likelihood estima-

    tion to get efficient estimators given the correct conditional distribution of dependent variables. The

    nice part of the maximum likelihood estimator (MLE) is that we can still get consistency, asymptotic

    normality but inefficient estimators in many situations (panel data or clustering) by pseudo MLE

    even when we ignore certain dependence among observations (Poirier and Rudd, 1988). However, the

    non-linear property causes computation difficulties in estimation, and this computational difficulty

    becomes much worse when dependence occurs, which results in solving -dimensional integration.

    Dependence is the other aspect of this problem. General forms of dependence are rarely allowed for

    in cross-sectional data, although routinely allowed for in time-series data (Conley, 1999). For example,

    some scholars discussed discrete choice models with dependence in time-series data: Robinson (1982)

    1Anselin, Florax and Rey (2004) wrote a comprehensive review about econometrics for spatial models.

    2

  • relaxed Amemiya (1973) assumptions of independence in Tobit model, and proved that the MLE

    with dependent observations is strongly consistent and asymptotically normal under some regularity

    conditions. Poirier and Rudd (1988) discussed the Probit model with dependence in time-series

    data, and developed generalized conditional moment (GCM) estimators which are computational

    attractive and relatively more efficient.

    However, dependence in space is more complicated than in the time setting because of four

    reasons: first, time is one dimensional whereas space has at least two dimensions; second, time has

    natural order (direction) whereas space has no natural direction; third, time is regularly divided

    because of regular astronomical phenomena whereas spatial observations are attached to geographic

    properties of the surface of the earth; fourth, time-series observations are draws from a continuous

    process whereas, with spatial data, it is common for the sample and the population to be the same

    (Pinkse et al., 2007).

    Therefore, how to deal with dependence in space in estimation is the key to spatial econome-

    tricians. Inspired by works about dependence in time-series data, Conley (1999) uses metrics of

    economic distance to characterize dependence among agents, and shows that the GMM estimator is

    consistent and asymptotically normal under some assumptions similar to time-series data. He also

    provides how to get consistent covariance matrix estimator by an approach similar to Newey-West

    (1987). Pinkse and Slade (1998) use GMM in the discrete choice setting with the SAE model, and

    show that the GMM estimator remains consistent and asymptotically normal under some regularity

    conditions. Although Pinkse and Slade (1998) generated generalized residuals from the MLE as the

    basis of the GMM estimators, they do not take advantage of information from spatial correlations

    among observations, and hence the GMM estimator is much less efficient than full ML estimators.

    Lee (2004) examines carefully the asymptotic properties of MLE and quasi-MLE for the linear spatial

    autoregressive model (SAR), and he shows that the rate of convergence of those estimators may de-

    pend on some general features of the spatial weights matrix of the model. If each units are influenced

    by only a few neighboring units, the estimators may have√-rate of convergence and asymptotic

    normality; otherwise, it may have lower rate of convergence and estimators could be inconsistent.

    In this paper, we choose to capture spatial dependence by considering spatial sites to form a

    countable lattice, and we provide a middle-ground approach which trades off efficiency and compu-

    tation burdensome. The idea is to divide spatial dependent observations into many small groups

    (clusters) in which adjacent observations belong to one group. The implicit rationale behind this

    is adjacent observations usually account for the most important spatial correlations between obser-

    vations. If we can correctly specify the conditional joint distribution within groups, which allows

    us to utilize relatively more information of spatial correlations, estimating the model by partial ML

    (PML) will give us consistent and more efficient estimators, which should be generally better than

    GMM estimators. However, this approach is subject to biased variance-covariance matrix estimators

    because of spatial correlations among groups. To deal with this problem, we follow the methods

    proposed by Newey-West (1987) and Conley (1999) to get consistent variance-covariance matrix es-

    3

  • timators. Of course, this middle ground approach will not get the most efficient estimator. However,

    since information from adjacent observations usually capture important spatial correlations in the

    whole sample, we get a consistent and a relatively efficient estimators, and we avoid some tedious

    computations at expense of a loss of a relatively small part of efficiency.

    This paper is organized as follows. Section 2 reviews some widely applied econometric techniques

    on discrete choice models. Also, the SAE model with discrete choice dependent variable is presented

    and the regularity conditions are specified. We also show how we can estimate average partial effects

    once we characterize the conditional distribution of the latent error. Section 3 presents the bivariate

    spatial Probit model. In Section 4, we prove consistency and asymptotic normality of the PML

    estimator (PMLE) under regularity assumptions, and discuss how to get consistent covariance matrix

    estimators. Section 5 presents a simulation study showing the advantages of our new estimation

    procedure in this setting. Finally, Section 6 concludes. The proofs are collected in Appendix 1, while

    the results for the simulation study are provided in Appendix 2.

    2 Discrete Choice Models with Spatial Dependence

    2.1 A Probit Model without Dependence

    We first review the standard Probit model without dependence and the underlying linear latent

    variable model is

    ∗ = + (1)

    where ∗ is the latent dependent variable and a scalar, is a 1× vector of regressors, is a×1parameter vector to be estimated, and is a continuous random variable, independent of and

    it follows a standard normal distribution. However, we cannot observe ∗ , and we can only observethe indicator , which is related to ∗ as follows

    =

    (1 if ∗ 00 if ∗ ≤ 0

    ) (2)

    Therefore, we can get the conditional distribution of given as

    ( = 1|) = ( ∗ 0|) = ( −|) = Φ() (3)

    where Φ denotes the standard normal cumulative distribution function (cdf). It is easy to see we can

    get

    ( = 0|) = 1−Φ() (4)Since is a Bernoulli random variable, we can write the conditional density function of

    conditional on as

    (|) = [Φ()][1−Φ()]1− ∈ {0 1} (5)

    4

  • Also, given the independence assumption of random variables, the log likelihood function can be

    written as

    () =X=1

    { log[Φ()] + (1− ) log[1−Φ()]} (6)

    and the sufficient condition for uniqueness of the global maximum of () is that the function is

    strictly concave (Gourieroux, 2000). We can solve then b from the first order condition()

    =

    X=1

    −Φ()Φ()[1−Φ()]()

    0 = 0 (7)

    where is the probability density function (pdf) of the standard normal distribution. However,

    the simple closed-form expressions for the MLE are not available because the cdf of the normal

    distribution has no close-form solution. So the MLE must be solved by using numerical algorithms2.

    In general, we can prove that the conditional MLE is consistent and the most efficient estimator

    given some regularity conditions3 such as correctly specifying a parametric model, an identified

    and a log-likelihood function that is continuous in .

    2.2 A Probit Model with Spatial Dependence

    When spatial dependence is present in a binary response model, the dependence not only brings us

    the difficulties in estimating parameters, but also raises a question: what is the estimation of interest?

    because the magnitudes of the partial effects depend not only on the value of the covariates, but also

    on the value of the unobserved heterogeneity caused by the spatial dependence.

    For example, a binary response model with spatial correlation in the latent errors is given by

    = 1[ + ≥ 0] (8)

    where {} is allowed to be spatially correlated. Distance between points is almost always assumedto be exogenous, and so we can think of having marginal distributions

    (| ) (9)

    where is the matrix of covariates for all observations and is a matrix containing pairwise

    distances. The model that just allows spatial correlation in the errors essentially assumes

    (| ) = (| ) (10)

    and, in fact, the dependence on appears only in the variance. In particular, for probit,

    | ∼ (0 ()) (11)2Commonly used numerical solutions are all derived from Newton’s method. (see Gourieroux, 2000 for details).3See details in Wooldridge (2002, page 391).

    5

  • where (· ·) is known (in principal) from the spatial autocorrelation assumption. Literature hasfocused on estimating the parameters, and (the latter is the spatial dependence coefficient, a

    scalar in practice), but we need to understand that estimating and is not our final goal. We should

    pay more attention to estimating average partial effects based on the average structural function,

    which is directly correlated to parameters estimating (especially spatial dependence coefficient ).

    () = nΦh

    p()

    io(12)

    which is (hopefully) consistently estimated by

    −1X=1

    Φ

    ∙̂

    q( ̂)

    ¸ (13)

    If we knew the unconditional distribution of , we would not have to use the variance function at

    all; but we have the conditional distribution. We proceed now to provide a new method to estimate

    the parameters of a spatial probit model.

    2.3 A Probit Model with Spatial Error Correlation

    Consider the Probit model with spatial error correlation (SAE), where the underlying linear latent

    variable model is

    ∗ = + (14)

    = X

    =1

    + (15)

    where is an element in the spatial weights matrix which can be defined by different spatial

    distances such as the Euclidean distance. is the spatial autoregressive error coefficient and we have

    a random variable ∼ (0 1). We can write equations (14) and (15) in matrix form as follows

    ∗ = + (16)

    = ( − )−1 (17)

    so that the variance-covariance matrix for the model is

    Ω ≡ (| ) = [( − )0( − )]−1 (18)

    If ∗ is observable, equation (16) becomes a linear function, and we can use the Jacobian trans-formation of into ∗ and write the log likelihood function as

    ( ) = −2ln(2)− 1

    2( ∗ −)00( ∗ −) + ln || (19)

    where = − , and then the estimate of can be solved as b = ( 00)−1 00 ∗6

  • However, in practice we cannot observe ∗, and we can only observe , and it implies thatwe have to deal with a non-linear Probit model because of the normal distributional assumption.

    Moreover the errors are correlated, and the full likelihood function becomes

    = (1 = 1 2 = 2 · · · = ) =1Z

    −∞

    · · ·Z

    −∞

    () (20)

    () = (2)−2 |Ω|−1− 12 (0Ω−1) (21)

    Although theoretically, if we take the first derivatives subject to and the spatial coefficient ,

    we obtain

    =

    {1Z

    −∞

    · · ·Z

    −∞

    (2)−2 |( − )0( − )|−12 [0(− )0(− )]}

    = 0 (22)

    =

    {1Z

    −∞

    · · ·Z

    −∞

    (2)−2 |( − )0( − )|−12 [0(− )0(− )]}

    = 0 (23)

    The expression of the first derivatives are quite complicated, but if we have sufficient computational

    ability and and are identifiable, we can get consistent and efficient estimates of and by using

    numerical methods. However, in practice, it would be a formidable computational task even for a

    moderate size sample. We propose in this paper a more attractive procedure in the next sections

    that involves to deal with partial maximum likelihood.

    2.4 Probit Models with Other Forms of Spatial Correlation

    Generally, there is no reason to think that spatial correlation is properly modeled by (15). Other

    forms are possible. For example, one might assume that, outside of a certain geographic radius from

    a given observation , is uncorrelated with shocks to the outlying regions. So, for example, we

    might assume a constant correlation with any unit within a given radius — similar to a random effects

    structure for unbalanced panel data.

    Alternatively, we may prefer more of a moving average structure, such as

    = +

    ÃX6=

    ! (24)

    where the are i.i.d. with unit variance. This formulation is attractive because it is relatively easy

    to find variances and pairwise correlations, which we will use in the PMLEs described in the next

    section. For example,

    (| ) = 1 + 2ÃX

    6= 2

    ! (25)

    7

  • Clearly, methods that use only the variance in estimation can only identify 2 (but we almost

    always think 0, anyway). Pairwise covariances can also be obtained from

    ( | ) = + + 2Ã X

    6=6=

    ! (26)

    Expressions like this for the covariance between different errors are important for applying grouped

    PMLE methods.

    3 Using Partial MLEs to Estimate General Spatial Probit

    Models

    Estimating a probit spatial autocorrelation model by full MLE is a prodigious task, although several

    approaches have been applied. The EM algorithm can be used (McMillen, 1992), the RIS simulator

    (Beron and Vijverberg, 2003), and the Bayesian Gibbs sampler (Lesage, 2000). But each of these

    approaches is still computationally burdensome. To combine such approaches with simulation studies,

    or to be able to quickly estimate a range of models, is outside the abilities of even current computation

    capabilities for even moderate sample sizes.

    To get an estimator that is computationally feasible, Pinkse and Slade (1998) proposed using gen-

    eralized method of moments (GMM) using information on the marginal distributions of the binary

    responses. In particular, the generalized residuals from the marginal probit log likelihood are used to

    construct moment conditions for the GMM method. Pinske and Slade show that, under conditions

    very similar to those in this paper, the GMM estimator is consistent and asymptotically normal.

    The consistent variance-covariance matrix can also be obtained theoretically without a covariance

    stationary assumption, although Pinske and Slade do not discuss estimation of the asymptotic vari-

    ance. Therefore, the GMM estimator is almost practically useful, but it is fundamentally based on

    the marginal probit models. Thus, while a GMM estimator can be obtained that is efficient given

    the information on the marginal likelihood, the method throws out much useful information. This

    is very similar to just using a pooled probit for time series or panel and ignoring dynamics. We

    describe a simplified version of this approach in Section 3.1, which, in effect, uses a heteroskedastic

    probit model to estimate the along with any spatial autocorrelation parameter.

    Using only the marginal distribution of , conditional on the covariates and weights, likely results

    in serious loss of information for estimating both and the spatial autocorrelation parameters.

    Our key contribution in this paper is to explore the use of partial maximum likelihood where we

    group small numbers of nearby observations and obtain the joint distribution of those observations.

    Naturally, these distributions are determined by the fully specified spatial autocorrelation model —

    just as we must obtain the implied variance to apply marginal probit methods. Once the covariances

    between observations are found as a function of the weights and , we can use that information in

    8

  • multivariate probit estimation. Section 3.2 covers the case of where we describe a bivariate probit

    approach, with heteroskedasticity and covariance implied by the particular spatial autocorrelation

    model. Using a single covariance in addition to the variance seems likely to improve efficiency of

    estimation.

    3.1 Univariate Probit Partial MLE

    One way to estimate the coefficients along with spatial correlation parameters is to derive the

    marginal distributions, ( = 1| ) as a function of all of the weights (and the parameters, and , of course). Under the joint normality assumption, the model will be a form of probit with

    heteroskedasticity. In particular, given any spatial probit model such that the variances are well

    defined, we can find

    ( = 1| ) = Φ(()) (27)where 2 () = (| ) = (| ) is a function of all weights, , and the spatial correlationparameters, . As is well-known in time series contexts — see, for example, Poirier and Ruud (1988)

    or Robinson (1982) — using probit while ignoring the time series correlation leads to consistent

    estimation under standard regularity conditions, provided the data are weakly dependent. Thus, it is

    not surprising that pooled probit that accounts for the heteroskedasticity in the marginal distribution

    is generally consistent for spatially correlated data, too — provided, of course, we limit the amount

    of spatial correlation.

    The log likelihood can be written generically as

    () =X=1

    { log[Φ(())] + (1− ) log[1−Φ(())]} (28)

    Assuming that and are identified, and that the conditions in Section 4 hold, the pooled het-

    eroskedastic probit is generally consistent and√-asymptotically normal. But, for reasons we dis-

    cussed above, it is likely to be very inefficient relative to the full MLE. Further, estimators that use

    some information on the spatial correlation across observations seem more promising in terms of

    increasing precision.

    3.2 Bivariate Probit Partial MLE

    We now turn to using information on pairs of “nearby” observations to identify and . There is

    nothing special about using pairs; we could use, say, triplets, or even larger groups. But the bivariate

    case is easy to illustrate and is computationally quite feasible.

    For illustration, assume a sample includes 2 observations, and we divide the 2 observations

    into pairwise groups according to the spatial Euclidean distance between them (see Graph 1). In

    other words, each group includes two observations, with the idea being that the internal correlation

    9

  • between the two observations is more important than external correlations with observations in other

    groups. Within a group, the two observations follow a conditional bivariate normal distribution be-

    cause error terms are assumed to have a joint normal distribution.

    Graph 1: (2 observations =⇒ groups)

    Let© ∗ : = 1

    ªwhere, for notational simplicity, we assume 2 Observations. Then ∗ =¡

    ∗1 ∗2

    ¢follows something similar to a random effects probit model, but there is heteroskedasticity

    in the variances and covariances. In group , we have

    ∗1 = 111 + 212 + + 1 + 1 (29)

    ∗2 = 121 + 222 + + 2 + 2 = 1 2 (30)

    We can rewrite the above equations in latent variable matrix form as

    ∗1 = 1 + 1 (31)

    ∗2 = 2 + 2 = 1 2 (32)

    where 1 and 2 are 1× vectors of regressors and is a ×1 vector. 1 and 2 are scalars. Ingroup , observation and observation are not only correlated with each other, but also correlated

    with other observations over space. Therefore, the variances and covariance between 1 and 2 not

    only depend on the weight within group, but also weights with other observations out of the group,

    and the parameters, as well. See, for example equation (26).

    10

  • It is easy to see that (1|1 ) = (2|2 ) = 0, and the covariance-variance matrix forgroup under assumption (??), is defined as Ω ≡ (| )

    (| ) ≡ Ω() ="Ω11 Ω12

    Ω21 Ω22

    # (33)

    where we suppress the dependence on and in what follows for notational simplicity. The variance

    terms are the same as in the Pinske and Slade (1998) approach. In our case, the covariance must

    also be computed, although this will be the source to improve the precision of the estimates of and

    Note here that elements in Ω depend not only on the weight between two observations in group

    , but also weights for every observation in the whole sample, because two observations in group

    not only correlated with each other, but also correlated with other observations over space. Since we

    define two nearby observations as one group, we pick up the corresponding part (Ω) from the whole

    covariance-variance matrix (see equation (26)).

    Since we cannot observe ∗1 and ∗2, as we discussed in the univariate Probit model, we define

    =

    (1 if ∗ 00 if ∗ ≤ 0

    ) (34)

    Therefore the conditional bivariate normal distribution of 1 and 2 given is given as

    (1 = 1 2 = 1|) = (1 + 1 0 2 + 2 0|) (35)= (1 1 2 2|) = Φ2( 1p

    Ω112pΩ22

    ) (36)

    =(1 2)p

    (1)p (2)

    =Ω12pΩ11Ω22

    (37)

    where Φ2 is the standard bivariate normal distribution, 2 is the standard density function of the

    bivariate normal distribution and is the standardized covariance between two error terms. Esti-

    mation in this context is similar to “random effects” probit with 2 observations, although one needs

    to properly compute the variances and covariances of the latent errors. We can also use random

    matched pairs of even groups of more than two observations. The idea behind is that the more

    correlations used, the better.

    Given that (1 2) has a joint normal distribution, we can write

    1 = 12 + 1 (38)

    where

    1 =(1 2)

    (2) (39)

    and 1 is independent of and 2

    11

  • Because of the joint normality of (1 2) 1 is also normally distributed with (1) = 0 and

    (1) = (1)− 21 (2) (40)

    Thus, we can write the conditional distribution of 1 as

    (1| 2) ∼ (0 (1)) (41)

    Substitute equation (38) back to ∗1 = 1 + 1, and we can get

    ∗1 = 1 + 12 + 1 (42)

    Therefore

    (1 = 1| 2) = ΦÃ1 + 12p

    (1)

    ! (43)

    The reason we want to find (43) is to retrieve (1 = 1 2 = 1|). Since

    (1 = 1 2 = 1|) = (1 = 1|2 = 1)× (2 = 1|) (44)

    it is easy to see that (2 = 1|) = Φ( 2√ (2)

    ), and thus it remains to get (1 = 1|2 = 1)First, since 2 = 1 if and only if 2 −2 and 2 follows a normal distribution and it is

    independent of , then the density of 2 given 2 −2 is( 2√

    (2))

    (2 −2) =( 2√

    (2))

    Φ( 2√ (2)

    ) (45)

    Therefore,

    (1 = 1|2 = 1) = [ (1 = 1| 2)|2 = 1) (46)= [Φ(

    1 + 12p (1)

    )|2 = 1 ] (47)

    =1

    Φ( 2√ (2)

    )

    Z ∞−2

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (48)

    and it is easy to see that (1 = 0|2 = 1) = 1 − (1 = 1|2 = 1) because 1 is thebinary variable.

    Similarly, we can get

    (1 = 1|2 = 0) = 11−Φ( 2√

    (2))

    Z 2−∞

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (49)

    and (1 = 0|2 = 0) = 1− (1 = 1|2 = 0)

    12

  • Now we are ready to get (1 = 1 2 = 1|) as follows

    (1 = 1 2 = 1|) = 1Φ( 2√

    (2))

    Z ∞−2

    Φ(1 + 12p

    (1))(

    2p (2)

    )2

    ×Φ( 2p (2)

    ) (50)

    =

    Z ∞−2

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (51)

    and similarly we can obtain finally

    (1 = 0 2 = 1|) = Φ( 2p (2)

    )−Z ∞−2

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (52)

    (1 = 1 2 = 0|) =Z 2−∞

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (53)

    (1 = 0 2 = 0|)= [1− Φ( 2p

    (2))]−

    Z 2−∞

    Φ(1 + 12p

    (1))(

    2p (2)

    )2 (54)

    4 Partial Maximum Likelihood Estimation

    As we discussed in the introduction, if the observations are independent, we can simplify the mul-

    tivariate distribution into the product of univariate distributions, and then the ML estimator can

    be obtained easily. However, spatial correlations among observations do not allow the simplification

    any more. Under spatial correlation, the situation is kind of similar to the panel data case. In panel

    data, we cannot assume independence among observations over different periods for the same person

    (or firm), which means we are not likely to specify the full conditional density of given correctly.

    Therefore, we need to relax the assumption in the panel data case. The way we deal with the problem

    is that if we have a correctly specified model for the density of given we can define the partial

    log likelihood function as

    ∈Θ

    X=1

    X=1

    log (| ) (55)

    where (| ) is the density for given for each . The partial log likelihood function worksbecause 0 (the true value) maximizes the expected value of the above equation provided we have

    the densities (| ) correctly specified (Wooldridge, 2002).We can apply a similar idea to the spatial Probit model: if we have the bivariate normal densities

    2(12| ) correctly specified for each group, we could get a consistent estimator by partialML. However, there are several differences between panel data and spatial dependent data: first, the

    panel data model assumes that the cross section dimension () is sufficiently large relative to the

    13

  • time dimension ( ), but in spatial data we do not have this assumption. Second, in the panel data

    model, we view the cross section observations as independent, while in the spatial data model, even

    though we divided the sample into groups, however, we are definitely not assuming independence

    among groups. Observations in different groups are still correlated, but the correlations are assumed

    to decay as distances become further away. Third, as we discussed before, dependence in space is

    more complicated than dependence in time, and we need to assume that the correlations between

    groups die out quickly enough as distance goes further away. In short, we need to examine carefully

    how the weak law of large numbers (WLLN) and central limit theorem (CLT) can be applied in the

    spatial dependent case. We will discuss these issues and provide proofs in the following sections.

    Rather than “filling in”, we use a type of asymptotics where area size is increasing. Essentially, we

    need the process to be mixing in a spatial sense.

    First, we can write the partial log likelihood function as

    =X

    =1

    {12 log(1 = 1 2 = 1|) + 1(1− 2) log(1 = 1 2 = 0|)

    +(1− 1)2 log(1 = 0 2 = 1|) + (1− 1)(1− 2) log(1 = 0 2 = 0|)} = 1 2(56)

    and for the sake of brevity, we define

    (1 1) ≡ log(1 = 1 2 = 1|); (1 0) ≡ log(1 = 1 2 = 0|); (57)(0 1) ≡ log(1 = 0 2 = 1|) and (0 0) ≡ log(1 = 0 2 = 0|) (58)

    Therefore, we can rewrite the partial log likelihood function as

    =X

    =1

    {12(1 1)+1(1−2)(1 0)+(1−1)2(0 1)+(1−1)(1−2)(0 0)} (59)

    We proceed now to prove the consistency and asymptotic normality of the PMLE.

    4.1 Consistency of Bivariate Probit Estimation

    Consistent estimators b ≡ (b b)0 are the ones that converge in probability to the true value 0 ≡(0 0)

    0, i.e. b −→ 0, as the sample size goes to infinity for all possible true values. In this section,to make the asymptotic arguments formal, we distinguish between the true value, 0, and a generic

    parameter value .

    In the bivariate probit estimation, the estimator b is defined as: b maximizes () subject to ∈ Θ, where Θ is the parameters set. The objective function () is defined as

    () ≡ 1

    X=1

    {12(1 1) + 1(1− 2)(1 0)

    +(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (60)

    14

  • i.e, in other words, b = argmax∈Θ

    () (61)

    Remember that this objective function represents a partial log likelihood, not a fully log likelihood:

    we are only using information on the conditional distribution (1 2| ) across the groups .We are not using (1 2 | ) as in a full maximum likelihood setting.The identification condition requires that () is uniquely maximized at the true value 0 where

    () is defined as

    () ≡ lim→∞

    [ ()] (62)

    This condition typically holds for well-specified models when there is not perfect collinearity among

    the regressors. Following Pinske and Slade (1998), we impose the identification condition in all our

    analysis. Further, one needs to be a little careful in parameterizing the spatial autocorrelation, but

    standard models of spatial autocorrelation cause no problems.

    The following Theorem 1 states the main consistency result for a broad class of spatial probit

    models. We define () ≡ () and lim

    →∞[ ()] = ()

    Theorem 1. If (i) 0 is the interior of a compact set Θ which is the closure of a concave set,

    (ii) attains a unique maximum over the compact set Θ at 0(iii) is continuous on Θ , (iv)

    the density of observations in any region whose area exceeds a fixed minimum is bounded, (v) as

    →∞ sup(°°° 1Pr(1=12=1|) + 1Pr(1=12=0|) + 1Pr(1=02=1|) + 1Pr(1=02=0|)°°°) ∞ (vi)

    as →∞ sup(kk+kk) = (1) (vii) sup |( )| ≤ () = 1 2 where denotesthe distance between group and , and () → 0 as → ∞ and (viii) lim

    →∞[ ()] exists, (ix)

    sup kk ∞, then b − 0 = (1).Proof: Given in Appendix 1.

    Condition (i) is a standard assumption from set theory. Condition (ii) is the identification condi-

    tion for MLE. Condition (iii) assumes that the function is continuous in the metric space, which

    is a reasonable assumption and necessary for the proof that () is stochastically equicontinuous.

    Condition (iv) simply excludes that an infinite number of observations crowd in one bounded area.

    The minimum area restriction is imposed because an infinitesimal area around a single observation

    has infinite density. Condition (v) makes sure any one of these four situations will be present in a

    sufficiently large sample. Condition (vi) makes sure the regressors are deterministic and uniformly

    bounded, which is not a strong assumption in this literature. Condition (vii) is the key assumption

    for this theorem, and it requires that the dependence among groups decays sufficiently quick when

    the distance between groups become further apart. This assumption employs the concept from -

    mixing to define the rate of dependence decreasing as distance increases. Condition (viii) assumes

    the limit of [ ()] exists as →∞ which is not a strong assumption. Condition (ix) is actuallyimplied by the rule of dividing groups, which just excludes that the two groups are exactly in the

    same location.

    15

  • 4.2 Asymptotic Normality

    As we discussed in the introduction, the spatial dependence is more complicated than time-series

    dependence at least in four perspectives. These differences cause that central limit theorems (CLT)

    need stronger conditions for the spatial dependence case. To deal with general dependence problems,

    the common way in the literature is to use the so called "Bernstein Sums", which break up into

    blocks (partial sums), and we consider the sequence of blocks. Each block must be so large, relative

    to the rate at which the memory of the sequence decays, that the degree to which the next block

    can be predicted from current information is negligible. But at the same time, the number of blocks

    must increase with so that the CLT argument can be applied to this derived sequence (Davidson,

    1994).

    In this section, we show under what assumptions we are able to apply McLeish’s central limit the-

    orem (1974) to spatial dependence cases to get asymptotic normality for the spatial Probit estimator.

    This is presented in the following Theorem. denotes the transpose of matrix

    Theorem 2: If the assumptions of Theorem 1 hold, and in addition: (i) as →∞ 2(∗)(∗) = (1)

    for all fixed ∗ 0 (ii) the sampling area grows uniformly at a rate of√ in two non-opposing

    directions, (iii) (0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformlypositive definite matrices; then

    √(b − 0) → [0 (0)−1(0)(0)−1] where (0) ≡ (0)

    and (0) =2

    (0)

    Proof: Given in Appendix 1.

    Condition (i) is stronger than condition (vii) in Theorem 1, and it is also stronger than the usual

    condition in time series data because spatial dependent data has more dimension correlations than

    time series data. It shows that how dependence decays when distance between groups gets further

    away, and the dependence decays at the rate fast enough. Condition (ii) just repeats the assumption

    in the Bernstein’s blocking method, the two non-opposing directions just exclude sampling area grows

    at two parallel directions, which does not make much sense in spatial dependent case. Conditions in

    (iii) are natural conditions about matrices, which are implied by the previous assumptions. Matrices

    are semidefinite if some extreme situations happen such as Pr(1 = 1 2 = 1|) = 0 which areassumed to be excluded in the previous assumptions.

    4.3 Estimation of Variance-covariance Matrices

    Consistent estimation of the asymptotic covariance matrix is important for the construction of as-

    ymptotic confidence intervals and hypothesis tests (Newey and West, 1987). The estimations of

    (i.e. ̂ = (b)) are relatively easy, usually just obtaining sample analogues of 0 with b; but theestimation of (i.e. b = (b)) is more difficult and more important because of the correlationsamong groups. Newey-West (1987) proposed a method to estimate the variance-covariance matrix in

    16

  • settings of dependence of infinite order under a covariance stationary condition, and they suggested

    modified Bartlett weights to make sure the estimated variance and test statistics were positive.

    Andrews (1991) established the consistency of kernel HAC (Heteroskedasticity and Autocorrelation

    Consistent) estimators under more general conditions. Pinkse and Slade (1998) also showed that we

    can obtain (b) −(0) = (1) under regularity assumptions, where () ≡ [() ()] (seeLemma 9 in Appendix 1). This approach is feasible in practice only if we can get closed form expres-

    sions for [() ()], which should be a function of , and then plug in b for 0 in the function toget consistent covariance estimators. However, it is difficult to get closed form expressions for (b)in practice, and hence we follow an alternative approach proposed by Conley (1999).

    A feasible way to obtain a consistent estimate of a variance-covariance matrix that allows for a

    wider range of dependence is to apply the approach of Conley (1999) along the lines of Newey-West

    (1987). We follow this procedure in the following Theorem 3.

    Let ΞΛ be the −algebra generated by a given random field ∈ Λ with Λ compact, andlet |Λ| be the number of ∈ Λ. Let Υ (Λ1Λ2) denote the minimum Euclidean distance from anelement of Λ1 to an element of Λ2 There exists also a regular lattice index random field ∗ thatis equal to one if location ∈ 2 is sampled and zero otherwise. ∗ is assumed to be independentof the underlying random field and to have a finite expectation and to be stationary. The mixing

    coefficient is defined as

    () ≡ sup {| ( ∩)− () ()|} ∈ ΞΛ1 ∈ ΞΛ2 and|Λ1| ≤ |Λ2| ≤ Υ (Λ1Λ2) ≥

    We also define a new process () such as

    () =

    ( () if ∗ = 10 if ∗ = 0

    )

    Then

    Theorem 3. If (i) Λ grows uniformly in two non-opposing directions as −→ ∞ (ii)(0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformly positive definitematrices, (iii) as defined in Theorem 1 = 1 2 and ∗ are mixing where () con-verges to zero as → ∞; () is Borel measurable for all ∈ Θ and continuous on Θ andfirst moment continuous on Θ (iv)

    P∞=1 () ∞ for + ≤ 4 (v) 1∞ () = (−2)

    (vi) for some 0 (k (0)k)2+ ∞ andP∞

    =111 ()(2+) ∞ (vii) () is Borel

    measurable for all ∈ Θ continuous on Θ and second moment continuous, (0) exists and isfull rank, (viii)

    P∈2 (0 (0) (0)) is a non-singular matrix, (ix) the ( ) are uni-

    formly bounded and ( ) −→ 1, −→ ∞ as −→ ∞ ( −→ ∞) = ¡13

    ¢and

    = ¡ 13

    ¢, (x) for some 0 (k (0)k)4+ ∞ and as defined in Theorem 1

    17

  • = 1 2 and ∗ are mixing where ∞∞ ()(2+) = (−4) (xi) supΘ k ()k2 ∞ and

    supΘ k() [ ()]k2 ∞ thenb −(0) = (1) as −→∞where we split = [ ], Λ is a rectangle so that ∈ {1 2 } and ∈ {1 2 } and

    b = −1 X=0

    X=0

    X=+1

    X=+1

    ( )

    ⎛⎝ ³b´−− ³b´ +−−

    ³b´ ³b´⎞⎠

    −−1X

    =1

    X=1

    ³b´ ³b´

    To ensure positive semi-definite covariance matrix estimates, we need to choose an appropriate

    two-dimensional weights function that is a Bartlett window in each dimension

    ( ) =

    ½(1− ||

    )(1− ||

    ) || ||

    0 else

    ¾

    Proof: It follows from Conley (1999), Proposition 3.

    5 Simulation Study

    In the previous section, we have proved that the partial maximum likelihood estimator (PMLE)

    based on the bivariate normal distribution is consistent and asymptotically normal. Moreover, one

    of the most attractive properties of our new PMLE is that we can get a more efficient estimator

    compared to the GMM estimator, and the approach is much less computational demanding when

    compared to full information methods. In order to learn about the gains in efficiency that we obtain

    in the context of a Bivariate Spatial Probit model when using PML versus GMM, we conduct in this

    Section a simulation study to show the efficiency gains of PML.

    5.1 Simulation Design and Results

    Instead of comparing our PMLE to the GMM estimator of Pinkse and Slade (1998) directly, we

    choose to compare the PMLE to the heteroskedastic Probit estimator (HPE) because of two reasons:

    First, the HPE uses similar information with the GMM estimator because both methods use gener-

    alized residuals from the Probit estimation to construct the moment conditions, which means that

    both methods use the information from the heterogeneities of the diagonal terms of the variance-

    covariance matrix, while our PMLE uses both diagonal and off-diagonal correlations information

    between two closest neighbors. Second, the STATA4 source codes for bivariate probit estimation and

    4See http://www.stata.com/

    18

  • heteroskedastic Probit estimation are available online, and we can easily add the spatial parts into

    these existing source codes to compare PML estimators with Heteroskedastic Probit Estimators. We

    consider two simulation settings allowing for different types of spatial dependence.

    5.1.1 Case 1

    According to the theoretical framework given in previous sections, we could generate a dataset which

    allows a general correlation structure across groups as equations (14) and (15), and it requires to

    specify the exact formula (as functions of and ) for the elements of Ω However, it is quite

    difficult to derive the pairwise covariances for a bivariate probit because the exact formula for Ω12(and of Ω11Ω22) is very complicated, which is an element of the inverse matrix with 2 spatially

    correlated observations as follows

    Ω =

    "Ω11 Ω12

    Ω21 Ω22

    #= [( − )0( − )]−1 =

    ⎡⎢⎢⎢⎢⎢⎢⎣Ω111

    Ω11 Ω12

    Ω21 Ω22

    Ω22

    ⎤⎥⎥⎥⎥⎥⎥⎦ (63)

    Therefore, it seems reasonable to do the following. Let be the weighting matrix which can be

    generated in STATA5 according to the distance between observations

    ∗ = 11 +22 +33 + (64)

    = (65)

    where ∼ (0 ). The weighting matrix is standardized so that the diagonal elements areones, and then the elements of shrink as distance is increasing . The reason we do this is because it

    is easier to determine () and ( ) to apply the HP and the bivariate probit estimators.

    In this way, we still allow general correlation across groups, and we are able to compare the efficiency

    gains from only using the diagonal information (the HP approach) to using both diagonal and off-

    diagonal information (bivariate probit), and we do not require to know the exact formula for the

    elements in Ω (given in equation (63)) to reach the same goal.

    Therefore, we generate the dataset according to equations (64) and (65), which allows spatial

    correlation between any two observations, and we set the true parameter values for 1 2 and

    3 equal to 1, 1 and 1 respectively. Since our main focus in this study is on the estimation of the

    spatial parameter , we also set different true values for each simulated sample: = 02; 04; 06

    and 08 to test for the performance of the two estimation methods (PML and HP). These values for

    5The STATA command is “Spatwmat”. Since the speed to calculate the inverse of a matrix is much slower as the

    size of matrix increases, and moreover the maximum matrix size in Stata is 800, we allow here each observation to bespatially correlated to nearby 99 observations.

    19

  • are in the range of the estimated value in the empirical application of Pinkse and Slade (1998). In

    this setting and with 1000 replications, we consider a sample size of = 1000 observations (where

    the sample size is divided into 500 pairwise groups). Finally, we also simulate samples of sizes 500 and

    1500 (with 250 and 750 pairwise groups respectively) to check the performance of the two methods

    in different samples sizes. The simulation results are reported in Tables 1 (for the spatial parameter

    ) and 2 (for the 1 2 and 3) in Appendix 2.

    From Table 2, we can observe that both the HPE and the PMLE of 1 2 and 3 converge to true

    parameter values across the different parameter values as sample size increasing. Also the the PML

    estimator has much less bias than the HPE. Moreover, as expected, PML always provides smaller

    standard errors than the HP estimation method and bias and standard errors decrease in general

    when sample size increases.

    Furthermore, it is in Table 1 where we can observe the largest advantages of using PML versus HP.

    We can see that the PMLE is much better than the HPE in terms of estimating the spatial parameter

    . The PMLE is always much closer to the true parameter values and with small standard errors

    across different sample sizes and parameter values (as expected from our theoretical results), while

    the HPE is much further away from true parameter values and it is has a much larger standard

    deviation over the different sample sizes, even though HPE also shows the trend to converge to the

    true values in general as the sample increases. The HPE has always much larger standard deviation

    than the PMLE, showing clearly the gains in efficiency of PML versus HPE/GMM as predicted

    by our theory. Since both the HPE and the GMM estimator use generalized residuals from Probit

    estimation to construct the moment conditions, we conjecture that the GMM estimator is subject to

    similar inefficiency problems in estimating the spatial coefficient. Also, as it is expected, the bias of

    the PMLE decreases when increases.

    In summary, from the simulation results of Tables 1 and 2, we see how the PMLE outperforms

    clearly the HPE (i.e., the GMM estimator of Pinkse and Slade (1998)), specially when estimating the

    spatial parameter , which implies that the PMLE is much more robust and efficient in the context

    of the spatial probit model. The simulation results provide clear evidence of the gains in efficiency

    that can be obtained by PML versus GMM, as predicted by our theoretical results in the previous

    section.

    5.1.2 Case 2

    We considered a second data generating process given as (64), where again we set the true parameter

    values for 1 2 and 3 equal to 1, 1 and 1 respectively, and the covariates are jointly normal and

    independent. In this case we assume (24) where the are i.i.d. (0 1) and is the

    reciprocal of the distance between and . Then we can get closed form expressions for the variance

    and covariance given in (25) and (26). Results for 1,000 replications are provided in Tables 3 and 4

    in Appendix 2. Again, the PMLE provides improvements versus GMM, especially when estimating

    20

  • Efficiency gains in estimating are smaller in Table 4, probably because we chose in this setting

    pairwise observations to be closest to each other.

    6 Conclusions

    The idea of this paper is simple and intuitive: instead of just using information contained in moment

    conditions (GMM), we divide observations into pairwise groups. Provided we correctly specify the

    conditional joint distribution within these pairwise groups, we show that the spatial bivariate Pro-

    bit model allows us to use the most important information of spatial correlations among adjacent

    observations and to get a more efficient estimator: the partial MLE (PMLE). We prove that the

    PMLE is consistent and asymptotically normal under some regularity conditions. We also discuss

    how to get consistent covariance matrix estimators under general spatial dependence by following the

    approach of Conley (1999) and Newey-West (1987), which is more usable in practice compared to

    the proposal of Pinkse and Slade (1998). An attractive aspect of our approach is that we can obtain

    a more efficient estimator than GMM without introducing stronger assumptions, and the approach

    is much less computationally demanding compared to full information methods. In order to learn

    about the gains in efficiency that we obtain in the bivariate Probit model with PML versus the GMM

    estimator, we provide a simulation study in Section 5. The advantages in terms of bias and efficiency

    of our new estimation procedure proposed in this paper are clearly demonstrated. Moreover, if we

    extend this method to the trivariate or higher dimensional multivariate Probit models, we can obtain

    even more efficient estimators, but it comes at the expense of more computational demands. We

    have also shown how the estimation of the spatial probit model through PML also allows to obtain

    estimates of partial average effects. The results in this paper can be generalized to consider models

    with distributed lags, and also to extend the performance of the PMLE to lots of other nonlinear

    models, including Tobit, count and switching. This will be the subject of future work.

    7 Appendix 1

    7.1 Proofs to Theorems

    Proof of Theorem 1. If we can prove that ()→ () uniformly, by the information inequality,

    () has a unique maximum at the true parameter when 0 is identified. Then under technical

    conditions for the limit of the maximum to be the maximum of the limit, b should converge inprobability to 0. Sufficient conditions for the maximum of the limit to be the limit of maximum

    are that the convergence in probability is uniform and the parameter set is compact (Newey and

    Mcfadden, 1994).

    To prove consistency, the proof includes three parts:

    (i) has a unique maximum at 0

    21

  • (ii) ()− () = (1) at all ∈ Θ(iii) () is stochastically equicontinuous and is continuous on Θ

    Condition (i) and to be continuous on Θ are assumed. The proof of condition (ii) is provided

    in Lemma 1, and the proof that () is stochastically equicontinuous can be found in Lemma 2.

    Proof of Theorem 2. To find out the asymptotic normality of the Partial MLE for spatialbivariate Probit model, we start the proof from mean value theorem. Since

    (b) = 0 and by using

    the mean value theorem

    (b) = 0 =

    (0) +2

    (∗)(b − 0) (66)

    ⇒ (b − 0) = −[ 2

    (∗)]−1

    (0) (67)

    where ∗ lies between b and 0First, let us discuss the term

    2

    (∗) to find out the asymptotic properties of 2

    (∗) Recall

    that

    () =1

    X=1

    {12(1 1) + 1(1− 2)(1 0)

    +(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (68)where (1 1) ≡ log(1 = 1 2 = 1|) etc. Also

    2

    () =

    1

    X=1

    {122(1 1)

    + 1(1− 2)

    2(1 0)

    +(1− 1)(2)2(0 1)

    + (1− 1)(1− 2)

    2(0 0)

    (69)

    where2(1 1)

    =

    −1[Pr(1 = 1 2 = 1|)]2 [

    Pr(1 = 1 2 = 1|)

    ]2

    +1

    Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

    (70)

    and all other terms behave similar.

    As before, we only discuss one of these terms, and the same logic applies to the other terms. We

    know that

    1

    X=1

    [122(1 1)

    (∗)]

    =1

    X=1

    12{ −1[Pr(1 = 1 2 = 1|)]2 [

    Pr(1 = 1 2 = 1|)

    (∗)]2

    +1

    Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

    (∗)} (71)

    22

  • Look at the first term of the above equation given by

    1

    X=1

    12{ −1[Pr(1 = 1 2 = 1|)]2 [

    Pr(1 = 1 2 = 1|)

    (∗)]2} (72)

    Since°°° 1[Pr(1=12=1|)]2°°° ∞, we can write this term as

    1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (∗)]2 (73)

    where 11 ≡ 12 −1[Pr(1=12=1|)]2 In order to prove

    1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (∗)]2

    → 1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (0)]

    2 (74)

    we need to show that it holds for all kk = 1 Set 11 = and then

    {1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (b)]2

    −1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (0)]

    2} (75)

    =1

    X=1

    11{[ Pr(1 = 1 2 = 1|)

    (b)]2 − [ Pr(1 = 1 2 = 1|)

    (0)]2} (76)

    = (b − 0) 2

    X=1

    11 Pr(1 = 1 2 = 1|)

    (∗)×

    2 Pr(1 = 1 2 = 1|)

    (∗) (77)

    From the proof of Theorem 1, we know that sup°°° Pr(1=12=1|) °°° ∞ From Lemma 3,

    sup

    °°°2 Pr(1=12=1|)

    °°° ∞ From Theorem 1, we also know that b − 0 = (1) and hence(b − 0) 2

    X=1

    11 Pr(1 = 1 2 = 1|)

    (∗)×

    2 Pr(1 = 1 2 = 1|)

    (∗) = (1) (78)

    =⇒ {1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (b)]2 − 1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (0)]

    2} = (1)

    (79)

    =⇒ 1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (∗)]2

    → 1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (0)]

    2 (80)

    By definition,

    lim→∞

    1

    X=1

    11[ Pr(1 = 1 2 = 1|)

    (0)]

    2 = {11[ Pr(1 = 1 2 = 1|)

    (0)]2} (81)

    23

  • and therefore,

    1

    X=1

    12{ −1[Pr(1 = 1 2 = 1|)]2 [

    Pr(1 = 1 2 = 1|)

    (∗)]2} → (82)

    {12 −1[Pr(1 = 1 2 = 1|)]2 [

    Pr(1 = 1 2 = 1|)

    (0)]2} (83)

    Similarly, we can prove in relation to the second term in (71) that

    1

    X=1

    121

    Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

    (∗) (84)

    → {12 1Pr(1 = 1 2 = 1|)

    2[Pr(1 = 1 2 = 1|)]

    (0)} (85)

    As usual, we apply repeatedly the above arguments to the other terms. Finally, we can get that

    lim→∞

    2

    (∗)

    → [ 2

    (0)] (86)

    If we define

    ≡ {122(1 1)

    + 1(1− 2)

    2(1 0)

    +(1− 1)(2)2(0 1)

    + (1− 1)(1− 2)

    2(0 0)

    } (87)

    where denotes the Hessian, equation (87) can be rewritten as

    lim→∞

    1

    X=1

    (∗)→ lim

    →∞[(0)] (88)

    Therefore, it remains to show the asymptotic normality of the score term: (0) For the sake

    of brevity, redefine the score as: (0) ≡ (0) Then

    (0) =1

    X=1

    {12(1 1)

    (0) + 1(1− 2)(1 0)

    (0)

    +(1− 1)2(0 1)

    (0) + (1− 1)(1− 2)(0 0)

    (0)} (89)

    We need to show that −12 (0)(0)→ (0 ) where () ≡ lim

    →∞[()

    ()] Note that

    the information matrix equality does not hold here, i.e. −[(0)] 6= [() ()] because thescore terms are correlated with each other over space. In this part, we follow Pinkse and Slade

    (1998) and we use Bernstein’s blocking methods and the McLeish’s (1974) central limit theorem for

    dependent processes. First, define ≡ Π=1(1+ ) where 2 = −1 and ( = 1 2) is

    24

  • an array of random variables on the probability triple (Ωz ) is a real number. McLeish’s (1974)central limit theorem for dependent processes requires the following four conditions

    (i) {}is uniformly integrable,(ii) → 1(iii)

    P=1

    2

    → 1(iv)

    ≤|| → 0

    Now we need to define in our case. Let 0 ≡ {√(0)√(0)

    } = − 12 P=1 for implicitlydefine In order to prove 0

    → (0 1) we need to establish that the property holds for allkk = 1 using the Cramer-Wold device. As in the proof of Lemma 1 in Theorem 1, we split theregion in which observations are located up to an area of size

    √ ×

    √ We also know that

    increases faster than√ and slower, where and are integers such that = Let

    and be constructed such that (√) → 0 Let −12 × 1 uniformly in , for some

    fixed 0 12 Let Λ denote the set of indices corresponding to the observations in area . By

    assumption a number 0 exists such that (#Λ) Define ≡ − 12P

    ∈Λ ,and hence we can write 0 =

    P=1

    Now we are ready to discuss the four conditions for Mcleish’s (1974) central limit theorem. First,

    look at condition (iv), which requires that ≤

    || = (1)

    || =≤

    |− 12X∈Λ

    | (90)

    Since by assumption

    (#Λ) ⇒≤

    |− 12X∈Λ

    | ≤ × − 12 sup kk (91)

    where # denotes the number of objects, by definition we have that

    {√(0)p(0)

    } = − 12X

    =1

    = 1√

    0{12(1 1)

    (0) + 1(1− 2)(1 0)

    (0)

    +(1− 1)2(0 1)

    (0) + (1− 1)(1− 2)(0 0)

    (0)} (92)

    Since(0) is positive definite, (0)−12 is bounded as →∞ and we have that sup kk ∞ by

    assumption (vi) in Theorem 1. We have also proved that sup°°°(11) °°° ∞ in Lemma 2. Therefore,

    we are able to prove that sup kk ∞. Then × − 12 sup kk = ( × − 12 ) = (1) byconstruction of . Hence we can get that

    ≤|| = (1)

    Second, let us discuss condition (i): {} is uniformly integrable. Following Davidson (1994),if a random variable is integrable, the contribution to the integer of extreme random variable values

    must be negligible. In other words, if | | ∞ (||1| |) → 0 as → ∞ it is

    25

  • equivalent to say [sup || ] = 0 for some 0 as →∞ Here we follow the proof ofLemma 10 in Pinkse and Slade (1998). We have that

    [sup

    || ] = [sup

    |Π=1(1 + )| ] (93)

    ≤ [sup

    |Π=1(q1 + 22)| ] (94)

    = { [sup

    |Π=1(q1 + 22)| |( sup

    || ≤ )]× [sup || ≤ ]

    + [sup

    |Π=1(q1 + 22)| |( sup

    || )]× [sup || ]} (95)

    ≤ { [sup

    |Π=1(q1 + 22)| |( sup

    || ≤ )] + [sup || ] (96)

    where is a uniform upper bound toP

    ∈Λ Therefore,

    [sup || ] = [sup |− 12X∈Λ

    | ] (97)

    = [sup−12

    X∈Λ

    || ] ≤ [sup− 12 X∈Λ

    || ] = 0 (98)

    since − 12 1 and by construction of Then,

    [sup

    |Π=1(q1 + 22)| |( sup

    || ≤ )] ≤ [sup

    |(1 + 2−22)2 | ] = 0 (99)

    provided we set sufficiently large. Therefore, we proved that [sup || ] = 0⇒ {}is uniformly integrable.

    Third, condition (ii) requires that → 1 which is equivalent to say that − 1 = (1);see proof in Lemma 4.

    Fourth, in order to prove (iii):P

    =12

    → 1 by Lemma 8, P=12 − 1 = P=1(2) −1 + (1) and

    X=1

    (2)− 1 + (1) = ( 20)− 1−X6=

    () + (1) = (1) (100)

    by construction of 0 since ( 20) = 1 It remains to show thatP

    6= () = (1) Thiscondition is proved in Lemmas 5-76.

    6Lemmas 5-8 are along the lines of those in Pinkse and Slade (1998), which are a simplified version of the proofsin Davidson (1994).

    26

  • 7.2 Technical Lemmas

    The proofs of Theorems 1-2 require the use of the following Lemmas 1-8.

    Lemma 1: Under the assumptions in Theorem 1, ()− () = (1) for all ∈ ΘProof: we can rewrite () as

    () =1

    X=1

    {12[(1 1)− (1 0)− (0 1) + (0 0)]

    +1[(1 0)− (0 0)] + 2[(0 1)− (0 0)] + (0 0)} (101)

    Since we assume that lim→∞

    [ ()] exists, and by definition () ≡ lim→∞

    [ ()] this implies

    that: () − [ ()] = (1). In order to prove () − () = (1) we only need to showthat () − [ ()] = (1) That is equivalent to prove that the distance between () and[ ()] is infinitely small as → ∞ That is: k ()−[ ()]k2 → 0 as → ∞ and bydefinition, it is equivalent to [ ()]→ 0 as →∞It is easy to see that

    [ ()] =

    1

    2

    X=1

    X=1

    {11(12 12) + 212(12 1) + 213(12 2)

    +22(1 1) + 223(1 2) + 33(2 2) (102)

    where 1 = [(1 1) − (1 0) − (0 1) + (0 0)] 2 = [(1 0) − (0 0)] and 3 =[(0 1)− (0 0)] The same definition applies to 1 2and 3Note that here

    (1 1) = log{Z ∞−2

    Φ(1 + 12p

    (1))(

    2p (2)

    )2} (103)

    which is not a function of or Hence 1 is not a function of or The same logic applies to

    the other terms (2 3 1 2 and 3). Since we can set 1 ≤ (1 1) ≤ 2 for finite 1 and2 the same applies to (1 0) (0 1) and (0 0) We normalize to 0 ≤ (1 1) ≤ 1 Therefore,it is easy to see that || ≤ 2 and the same || and hence || ≤ 4 = 1 2Therefore, we can write

    | [ ()]| =1

    2

    X=1

    X=1

    {4(12 12) + 8(12 1) + 8(12 2)

    +4(1 1) + 8(1 2) + 4(2 2) (104)

    27

  • In the previous equation, firstly, let us look at the term 12

    P=1

    P=1

    4(1 1)

    1

    2

    X=1

    X=1

    4(1 1) ≤ 12

    X=1

    X=1

    4|(1 1)| ≤ 42

    X=1

    X=1

    () (105)

    by assumption (vii). Therefore, we need to prove that

    4

    2

    X=1

    X=1

    () = (1) as →∞ (106)

    Following Pinkse and Slade (1998), we also use the Bernstein’s (1927) blocking method to prove

    this as follows. We split the region in which observations are located up to an area of size

    1√ × 2

    √ We also know that increases faster than

    √ and slower, where and are

    integers such that = Without loss of generality, we assume 1 = 2 = 1, and let and be

    constructed such that (√) → 0 Let − 12 × 1 uniformly in , for some fixed 0 12

    By construction of (−12 ) = (1) Then we are able to apply the same idea to our case. In

    our case, the groups and take the role of and where one grows faster and the other grows

    slower than√We also know the is the distance between |−|. So we can find an upper bound

    for |−| as the maximum between group and . Let us suppose that is the one that grows fasterthan

    √ and is the one that grows slower than

    √. Then we can cancel one of the summations

    corresponding to with −1 Moreover, since grows faster than√ but slower than −1 one way

    is to define

    √P

    =1

    () as the one that grows faster than√ but slower than in such a way that

    X=1

    X=1

    () = (1

    √X

    =1

    ()) (107)

    Finally,

    √P

    =1

    () grows slower than and therefore, ( 1

    √P

    =1

    ()) = (1) So, we can get

    1

    2

    X=1

    X=1

    4(1 1) ≤ 42

    X=1

    X=1

    () = (1) (108)

    We can apply the same logic to 12

    P=1

    P=1

    8(1 2) and 12P

    =1

    P4

    =1

    (2 2) Let us consider

    12

    P=1

    P=1

    4(12 12) If we define = 12 and = 12 we can apply the same logic

    to prove that 12

    P=1

    P=1

    4( ) ≤ 42P

    =1

    P=1

    () = (1) Therefore, we are able to show that

    k ()−[ ()]k2 ≤ | [ ()]| ≤ 362

    X=1

    X=1

    () = (1) (109)

    28

  • Hence, ()−[ ()] = (1) =⇒ ()− () = (1) at all ∈ Θ

    Lemma 2 Under the assumptions in Theorem 1, ()− () is stochastically equicontinuous.Proof: The proof requires only to show that () is stochastically equicontinuous because

    () is continuous by assumption (iii). We have that

    ()−³e´ = 1

    X=1

    {12[(1 1 )− (1 1e)]+1(1− 2)[(1 0 )− (1 0e)]+(1− 1)2[(0 1 )− 01(0 1e)]+(1− 1)(1− 2)[(0 0 )− (0 0e)]} (110)

    By the mean value theorem

    ()−³e´ = 1

    X=1

    {12[(1 1)

    (∗)( − e)] + 1(1− 2)[(1 0)

    (∗)( − e)]+(1− 1)2[(0 1)

    (∗)( − e)] + (1− 1)(1− 2)[(0 0)

    (∗)( − e)]} (111)

    where ∗ lies between and e In order to prove () is stochastically equicontinuous, it is sufficientto show that

    sup∈Θ|1

    X=1

    12(1 1)

    ()| = (1) (112)

    and the same requirement applies to other terms. For simplicity issues we just prove one of them

    and the rest follow the same argument. Recall that

    (1 1) ≡ log(1 = 1 2 = 1|) (113)

    and note that (1 = 1 2 = 1|) = Φ2( 1√Ω11

    2√Ω22

    |) where Φ2 is the bivariate normaldistribution function. Also

    (1 1)

    =

    [logΦ2(1√Ω11

    2√Ω22

    )]

    (114)

    and since ≡ ( )(1 1)

    () =

    ((11)

    ()

    (11)

    ()

    ) (115)

    We focus first on (11)

    () where

    (1 1)

    =

    [logΦ2(1√Ω11

    2√Ω22

    )]

    =

    11√Ω11

    + 22√Ω22

    Φ2(1√Ω11

    2√Ω22

    ) (116)

    29

  • with

    1 = (1pΩ11

    )Φ(( 2√

    Ω22− 1√

    Ω11)p

    1− 2) (117)

    2 = (2pΩ22

    )Φ(( 1√

    Ω11− 2√

    Ω22)p

    1− 2) (118)

    By assumption (v)

    sup

    °°°°°° 1Φ2( 1√Ω11

    2√Ω22

    )

    °°°°°° = sup°°°° 1Pr(1 = 1 2 = 1|)

    °°°° ∞ (119)and it is easy to see that

    °°°° 11√Ω11 + 22√Ω22°°°° ∞ provided that sup(kk) ∞ Therefore,

    sup

    °°°°(1 1) ()°°°° ∞ (120)

    We now discuss the second term (1=12=1|)

    () where

    (1 1)

    =

    [logΦ2(1√Ω11

    2√Ω22

    )]

    (121)

    =2(

    1√Ω11

    2√Ω22

    )

    Φ2(1√Ω11

    2√Ω22

    2(1√Ω11

    2√Ω22

    )

    (122)

    and after some algebra, we can prove that sup

    °°°°2( 1√Ω11 2√Ω22 ) °°°° ∞ provided that sup kk ∞Therefore, it easy to see when sup

    °°°(11) °°° ∞ and sup °°°(11) °°° ∞ we can getsup

    °°°°(1 1)°°°° ∞. (123)

    We apply the same logic to the other terms, and we can prove that sup°°°(10)

    ()°°°, sup °°°(01) ()°°°

    and sup°°°(00)

    ()°°° are also bounded.

    Therefore, finally sup∈Θ| 1

    P=1 12

    (11)

    ()| = (1) given sup(kk) = (1) and hence we

    can prove that ()− () is stochastically equicontinuous.

    Lemma 3: Under the assumptions in Theorem 2, sup°°°2 Pr(1=12=1|)

    °°° ∞30

  • Proof: From Lemma 2, we know that

    Pr(1 = 1 2 = 1|)

    =( 1√

    Ω11)Φ(

    (2√Ω22

    − 1√Ω11

    )√1−2

    )1pΩ11

    +( 2√

    Ω22)Φ(

    (1√Ω11

    − 2√Ω22

    )√1−2

    )2pΩ22

    (124)

    =⇒ 2 Pr(1 = 1 2 = 1|)

    =1(

    1√Ω11

    ){1 1√Ω11

    Φ[(2√Ω22

    − 1√Ω11

    )√1−2

    ] + [(2√Ω22

    − 1√Ω11

    )√1−2

    ](

    2√Ω22

    − 1√Ω11

    )√1−2

    }pΩ11

    +2(

    2√Ω22

    ){2 2√Ω22

    Φ[(1√Ω11

    − 2√Ω22

    )√1−2

    ] + [(1√Ω11

    − 2√Ω22

    )√1−2

    ](

    1√Ω11

    − 2√Ω22

    )√1−2

    }pΩ22

    (125)

    and even though the above expression is complicated, it is easy to see that all the terms are bounded

    provided the assumptions in Theorem 2 hold. This is equivalent to

    sup

    °°°°2 Pr(1 = 1 2 = 1|)°°°° ∞ (126)

    Pr(1 = 1 2 = 1|)

    =2(

    1√Ω11

    2√Ω22

    )

    Φ2(1√Ω11

    2√Ω22

    2(1√Ω11

    2√Ω22

    )

    (127)

    =⇒ 2 Pr(1 = 1 2 = 1|)

    2

    =(2)2(Φ2 − 22)(Φ2)2

    +2Φ2

    222

    (128)

    It is easy to see that the first term of the above equation is bounded from previous results (i.e.

    sup

    °°°2 °°° ∞) and the second term can be also proved bounded since 222 can be proved tobe bounded given that sup kk ∞ after some algebra. Hence sup

    °°°2 Pr(1=12=1|)

    °°° ∞

    Lemma 4. Under the assumptions in Theorem 2, −1 = (1)where ≡ Π=1(1+)Proof: By definition, = Π

    =1(1 + ) = −1 + −1 By repeatedly multi-

    plying out, we finally get = 1 + P

    =1 −1 Hence,

    − 1 = (X=1

    −1) (129)

    31

  • In order to prove − 1 = (1) we just need to show that: (P

    =1 −1) = (1)This is equivalent to prove that (−1) = (−1 ) We can rewrite −1 as −1 = Π

    −1=1(1 +

    )We know there are − 1 groups of in −1We split these − 1 groups into two parts:groups adjacent to group , and groups that are not adjacent to group . We then define the area

    Ξ−1 as the area which is adjacent to group . Therefore, −1 = Π∈Ξ−1(1+ )Π∈Ξ−1(1+) = Π∈Ξ−1(1 + )where ≡ Π∈Ξ−1(1 + ) which includes the groupswhich are not adjacent to group .

    Since −1 = Π∈Ξ−1(1 + )we just need to prove

    [(Π∈Ξ−1(1 + ))] = [(Π∈Ξ−1(1 + ))] = (−1 ) (130)

    We know that

    [(Π∈Ξ−1(1 + ))] = [(1 + X

    ∈Ξ−1−1)] (131)

    = [] +[(X

    ∈Ξ−1−1)] (132)

    First, we look at the term [] Since ≡ Π∈Ξ−1(1 + ) that means thegroup is not adjacent to group . By Bernstein’s method, we split the region in such a way that

    the distance between group and non-adjacent group is at least 12 Hence, |[]| =

    |( ) = (√) provided () = 0 and by assumption (vi) in Theorem 1. By

    construction of and (√) = (1) and hence we obtain |[]| = (−1 )

    Second, we look at the term [(P

    ∈Ξ−1 −1)] We have that

    [(X

    ∈Ξ−1−1)] =

    X∈Ξ−1

    [Π∈Ξ−1)] (133)

    Consider [)] first. We know that [)] = ( ) provided

    () = 0 Since ( ) → ( ) as → ∞ because gets more andmore terms (all groups not adjacent to group ), while keeps the same amount. In the first

    step, we have proved that ( ) = (−1 ) and by the same argument ( ) =(−1 )Therefore, we can prove that (−1) = (−1 )⇒ − 1 = (1)

    Lemma 5. Under the assumptions in Theorem 2,P

    6= () = (1)Proof: We know that

    P6= () =

    P=1

    P=1()−

    P= () = (1) if

    we can show that P

    =1 |()| = (−1 ) This is equivalent to proveP

    6= () =(1) because the summation over contains − 1 terms.

    32

  • Define Ξ as the set of indices corresponding to blocks that have blocks removed from every

    direction from block . In other words, we assume there are no more than 8 blocks within distance

    Hence,

    X=1

    |()| ≤ √X

    =1

    X∈Ξ

    |()| (134)

    ≤ X∈Ξ

    |()|+√X

    =2

    X∈Ξ

    |()| (135)

    The first term is proved to be (−1) = (−1 ) in Lemma 6. The second term can be alsoproved to be (−1 ) in Lemma 7.

    Lemma 6: Under the assumptions in Theorem 2, P

    6= |()| = (−1) = (−1 )Proof: Since = −

    12

    P∈Λ by definition

    X6=|()| = ∈|−1

    X∈Λ∈Λ

    ()| (136)

    ≤ ∈1−1X

    ∈Λ∈Λ() (137)

    because () = ( ) = 1() where 1 0

    To compute the upper bound of the correlation between and , we just need to consider the

    strongest case, e.g. the and are adjacent each other. By Bernsteins’ blocking method, the number

    of ( ) combinations that are within distance is bounded by 2√

    2 where 2 0 Hence we

    can get

    ∈1−1X

    ∈Λ∈Λ() ≤ 3∈−1

    p

    4√X

    =0

    2() (138)

    where 3 = 12 4 0

    By assumption (ii) in Theorem 2, 2()→ 0 as →∞ Therefore,

    3∈−1p

    4√X

    =0

    2() = (−1) (139)

    Since = by construction, (−1) = (−1 )

    Lemma 7: Under the assumptions in Theorem 2, P√

    =2

    P∈Ξ |()| = (−1 )

    33

  • Proof: Because ∈Ξ ×∈Λ ×∈Λ|() = (√( − 1)) we have that

    √X

    =2

    X∈Ξ

    |()| ≤ 5√X

    =2

    #Ξ−1 ×#Λ ×#Λ(

    p( − 1)) (140)

    ≤ 6−12√X

    =1

    (p) = (

    −1

    √X

    =1

    () = (−1) (141)

    = (−1 ) (142)

    where # denotes the number of objects, and (−1P√

    =1 () = (−1) follows from assumption

    (i): as →∞ 2(∗)(∗) = (1).

    Lemma 8: Under the assumptions in Theorem 2,P

    =12 =

    P=1(

    2) + (1)

    Proof: In order to proveP

    =12 =

    P=1(

    2) + (1) it suffices to show that

    X=1

    X=1

    (2 2) = (1) (143)

    We have thatX=1

    X=1

    (22) =

    X=1

    X=1

    {[2 −(2)][2 −(2)]} (144)

    ≤ 78√X

    =0

    ( + 1)(p)(

    4) (145)

    where 7, 8 0 are large enough. Also

    (4) ≤ −2X

    1234∈Λ|[1 2 3 4] (146)

    ≤ 9−2X

    1234∈Λ{(12) + + (34)} (147)

    ≤ 10−2X

    12∈Λ{(12)} (148)

    ≤ 11−22X

    1∈Λ

    12√X

    =0

    () = (−23) (149)

    where 9 10 11 12 0 |P∞

    =0 ()| ∞ Therefore finally

    7

    8√X

    =0

    ( + 1)(p)(

    4) = (

    −23) = (1) (150)

    because = and −12 → 0 as →∞

    34

  • Finally, the following Lemma 9 generalizes Pinkse and Slade (1998) results as a way to obtain

    consistent estimates of the variance covariance matrix.

    Lemma 9: If assumptions in Theorem 2 hold, and sup°°Φ4

    + Φ3

    °° ∞ then (b)−(0) =(1) and (b) −(0) = (1); where () ≡ [() ()] and () ≡ −[()]Proof: First, we prove that (b)−(0) = (1) We know that (b) = − 1P=1(b) and

    by definition, lim→∞

    (0) = (0) So we just need prove that {(b)− lim→∞

    (0)} = (1) forall kk = 1 From the proof of Theorem 2, we have already proved that

    1

    X=1

    (b)→ 1

    X=1

    (0) (151)

    as → ∞ provided that b − 0 = (1) which is proved in Theorem 1. Therefore, we can get(b)−(0) = (1)Second, we consider how to show (b)− (0) = (1) As before, it is sufficient to show that

    (b)−(0) = (1) as →∞We know that (0) = [(0) (0)] = ((0)) given(0) = 0 Recall from the proof of Theorem 2 that

    (0) =1

    X=1

    {12(1 1)

    (0) + 1(1− 2)(1 0)

    (0)

    +(1− 1)2(0 1)

    (0) + (1− 1)(1− 2)(0 0)

    (0)} (152)

    and we can rewrite it as

    (0) =1

    X=1

    {12[(1 1)

    (0)− (1 0)

    (0)− (0 1)

    (0) +(0 0)

    (0)]

    +1[(1 0)

    (0)− (0 0)

    (0)] + 2[

    (0 1)

    (0)− (0 0)

    (0)]

    +(0 0)

    (0)} (153)

    For the sake of brevity, we redefine

    1 ≡ [(1 1)

    (0)− (1 0)

    (0)− (0 1)

    (0) +

    (0 0)

    (0)] (154)

    2 ≡ [(1 0)

    (0)− (0 0)

    (0)] (155)

    3 ≡ [(0 1)

    (0)− (0 0)

    (0)] (156)

    4 ≡(0 0)

    (0) (157)

    35

  • Therefore,

    ((0)) = −1(0)

    = −2X

    =1

    X=1

    {11(12 12) + 212(12 1)

    +213(12 2) + 22(1 1)

    +223(1 2) + 33(2 2) (158)

    where 1 2 3 are defined similarly as 1 2 3

    As before, we just need to provide the proof for one of these terms, and the same logic applies to

    other terms. We consider the most complicated term and the rest follow the same argument

    −1X

    =1

    X=1

    [11(12 12)]

    = −1X

    =1

    X=1

    11[(1212)−(12)(12)] (159)

    (1212) = Pr(1 = 1 2 = 1 1 = 1 2 = 1|) (160)= Φ4(1 2 1 2 12 13 14 23 24 34) (161)

    where Φ4 is the cdf for the quadvariate standard normal distribution, 1 =1√

    (1)etc. Similarly,

    (12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 12) (162)(12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 34) (163)

    and therefore,

    (12)(12) = Φ2(1 2 12)×Φ2(1 2 34) (164)= Φ4(1 2 1 2 12 0 0 0 0 34) (165)

    so we can write the first term as

    (0) = −1

    X=1

    X=1

    11[(1212)−(12)(12)] (166)

    = −1X

    =1

    X=1

    11[Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))

    −Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))] (167)Similarly, we can write the first term of (b) as

    −1X

    =1

    X=1

    11[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))] (168)

    36

  • By the mean value theorem, the first term of (b) −(0) is given as−1

    X=1

    X=1

    11{[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))]−[Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))−Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))]} (169)

    = −1(b − 0) X=1

    X=1

    11{Φ4(1 2 1 2 12(

    ∗) 13(∗) 14(

    ∗) 23(∗) 24(

    ∗) 34(∗)

    −Φ4(1 2 1 2 12(∗) 0 0 0 0 34(

    ∗))

    } (170)

    Since sup°°1°° ∞ by the proof in Theorem 2, we just need to assume

    sup

    °°°°Φ4(1 2 1 2 12(∗) 13(∗) 14(∗) 23(∗) 24(∗) 34(∗)°°°° ∞ (171)

    and the same argument applies to

    sup

    °°°°Φ4(1 2 1 2 12(∗) 0 0 0 0 34(∗))°°°° ∞ (172)

    so that

    −1(b − 0) X=1

    X=1

    11{Φ4(1 2 1 2 12(

    ∗) 13(∗) 14(

    ∗) 23(∗) 24(

    ∗) 34(∗)

    )

    −Φ4(1 2 1 2 12(∗) 0 0 0 0 34(

    ∗))

    }→ 0 (173)

    because (b − 0)→ 0 and the other terms are bounded.Repeat the proofs to the other terms, plus the new assumption about sup

    °°Φ3

    °° ∞ and thenwe can prove (b) −(0) = (1)

    37

  • 8 Appendix 2

    TABLE 1∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

    = 02 = 04 = 06 = 08

    HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 3.938 0.514 6.177 0.519 7.698 0.571 7.735 0.634

    bias 3.738 0.314 5.777 0.319 7.098 -0.029 6.935 -0.166(s.d.) (12.158) (0.120) (15.776) (0.205) (16.929) (0.151) (16.202) (0.289)

    = 1000 mean 3.174 0.512 4.668 0.518 5.456 0.581 5.914 0.672bias 2.974 0.312 4.268 0.118 4.856 -0.019 5.114 -0.128(s.d) (8.844) (0.107) (9.100) (0.133) (9.631) (0.149) (10.173) (0.276)

    = 1500 mean 2.746 0.511 4.050 0.507 4.872 0.609 5.426 0.708bias 2.546 0.311 3.650 0.107 4.272 0.009 4.626 -0.092(s.d.) (6.423) (0.099) (7.414) (0.124) (8.598) (0.149) (8.514) (0.253)

    ∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).

    38

  • TABLE 2∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

    1= 1 2= 1 3= 1

    HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 5.322 2.618 5.333 2.619 5.329 2.623

    (s.d.) (8.844) (0.839) (8.872) (0.855) (8.863) (0.870) = 1000 mean 5.308 2.616 5.296 2.616 5.289 2.618

    (s.d) (7.612) (0.560) (7.570) (0.560) (7.568) (0.564) = 1500 mean 5.247 2.604 5.239 2.602 5.235 2.604

    (s.d.) (6.624) (0.540) (6.606) (0.536) (6.613) (0.543) = 04 = 500 mean 3.610 1.329 3.614 1.329 3.608 1.328

    (s.d.) (5.305) (0.362) (5.311) (0.365) (5.290) (0.366) = 1000 mean 3.600 1.318 3.593 1.316 3.588 1.315

    (s.d.) (4.192) (0.355) (4.177) (0.355) (4.178) (0.353) = 1500 mean 3.456 1.281 3.441 1.281 3.438 1.278

    (s.d.) (3.818) (0.342) (3.793) (0.343) (3.798) (0.339) = 06 = 500 mean 2.898 0.972 2.876 0.966 2.885 0.969

    (s.d.) (3.761) (0.271) (3.723) (0.268) (3.735) (0.271) = 1000 mean 2.669 0.981 2.669 0.979 2.657 0.978

    (s.d.) (2.951) (0.261) (2.953) (0.261) (2.916) (0.259) = 1500 mean 2.508 1.016 2.499 1.015 2.501 1.016

    (s.d.) (2.726) (0.250) (2.706) (0.250) (2.708) (0.253) = 08 = 500 mean 2.246 0.805 2.237 0.801 2.249 0.802

    (s.d.) (2.810) (0.373) (2.803) (0.373) (2.841) (0.392) = 1000 mean 2.098 0.843 2.096 0.843 2.082 0.843

    (s.d.) (2.281) (0.349) (2.279) (0.349) (2.246) (0.340) = 1500 mean 2.086 0.884 2.096 0.886 2.094 0.886

    (s.d.) (2.059) (0.316) (2.071) (0.314) (2.073) (0.318)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the

    Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard

    deviations (s.d.).

    39

  • TABLE 3∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

    = 02 = 04 = 06 = 08

    HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 2.151 0.381 2.575 0.667 2.491 0.970 2.876 1.202

    bias 1.951 0.181 1.175 0.267 1.891 0.370 2.076 0.402(s.d.) (4.630) (0.844) (5.073) (0.923) (4.996) (0.913) (6.213) (0.966)

    = 1000 mean 1.013 0.356 1.089 0.606 1.307 0.863 1.660 1.160bias 0.813 0.156 0.689 0.206 0.707 0.263 0.860 0.360(s.d) (2.131) (0.606) (2.241) (0.622) (2.424) (0.671) (2.675) (0.813)

    = 1500 mean 0.684 0.324 0.792 0.592 0.906 0.860 1.305 1.156bias 0.484 0.124 0.392 0.192 0.306 0.260 0.505 0.356(s.d.) (1.508) (0.484) (1.566) (0.515) (1.611) (0.601) (1.910) (0.706)

    ∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).

    40

  • TABLE 4∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

    1= 1 2= 1 3= 1

    HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 1.043 1.021 1.042 1.020 1.042 1.020

    (s.d.) (0.120) (0.109) (0.114) (0.104) (0.122) (0.108) = 1000 mean 1.017 1.006 1.022 1.011 1.016 1.005

    (s.d) (0.079) (0.071) (0.078) (0.071) (0.079) (0.070) = 1500 mean 1.012 1.004 1.013 1.005 1.010 1.002

    (s.d.) (0.065) (0.059) (0.063) (0.058) (0.064) (0.057) = 04 = 500 mean 1.043 1.017 1.042 1.018 1.043 1.019

    (s.d.) (0.125) (0.112) (0.112) (0.105) (0.119) (0.110) = 1000 mean 1.017 1.005 1.020 1.008 1.014 1.002

    (s.d.) (0.079) (0.072) (0.080) (0.072) (0.077) (0.069) = 1500 mean 1.014 1.005 1.013 1.004 1.009 1.000

    (s.d.) (0.065) (0.058) (0.061) (0.056) (0.062) (0.056) = 06 = 500 mean 1.046 1.022 1.045 1.022 1.043 1.020

    (s.d.) (0.123) (0.107) (0.126) (0.111) (0.126) (0.108) = 1000 mean 1.019 1.006 1.020 1.007 1.015 1.003

    (s.d.) (0.083) (0.075) (0.084) (0.075) (0.079) (0.071) = 1500 mean 1.010 1.002 1.012 1.004 1.010 1.002

    (s.d.) (0.066) (0.060) (0.065) (0.060) (0.063) (0.059) = 08 = 500 mean 1.035 1.013 1.036 1.014 1.037 1.015

    (s.d.) (0.125) (0.115) (0.125) (0.111) (0.125) (0.115) = 1000 mean 1.016 1.003 1.017 1.005 1.016 1.004

    (s.d.) (0.083) (0.086) (0.082) (0.090) (0.084) (0.088) = 1500 mean 1.011 1.000 1.012 1.001 1.012 1.001

    (s.d.) (0.065) (0.058) (0.064) (0.057) (0.064) (0.056)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the

    Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard

    deviations (s.d.).

    41

  • References

    [1] Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent Covariance

    Matrix Estimation”, Econometrica 59, 3, 817-858.

    [2] Amemiya, T. (1973): “Regression Analysis when the Dependent Variable is Truncated Normal”,

    Econometrica 41, 997-1016.

    [3] Anselin, L. and Florax, R. J. G. M. (1995): “New Direction in Spatial Econometrics”, Springer-

    Verlag, Berlin, Germany.

    [4] Anselin, L. Florax, R. J. G. M., and Rey, J. S. (2004): “Econometrics for Spatial Models: Recent

    Advances”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany,1-28.

    [5] Beron, K. J. and Vijverberg, W. P. (2003): “Probit in a Spatial Context: A Monte Carlo

    Approach”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany, 169-196.

    [6] Bernstein, S. (1927): “Sur l’Extension du Theoreme du Calcul des Probabilities aux Sommes de

    Quantities Dependantes”, Mathematische Annalen 97, 1-59.

    [7] Case, A. C. (1991): “Spatial Patterns in Household Demand”, Econometrica 59, 953-965.

    [8] Conley, T. G. (1999): “GMM Estimation with Cross Sectional Dependence”, Journal of Econo-

    metrics 92, 1-45.

    [9] Davidson, J. (1994): “Stochastic Limit Theory”, Oxford: Oxford University Press.

    [10] Gourieroux, C. (2000): “Econometrics of Qualitative Dependent Variables”, Cambridge Univer-

    sity Press.

    [11] Kelejian, H. H. and Prucha, I. R. (1999): “A Generalized Moments Estimator for the Autore-

    gressive Parameter in a Spatial Model”, International Economic Review 40, 509-533.

    [12] Kelejian, H. H. and Prucha, I. R. (2001): “On the Asymptotic Distribution of the Moran I Test

    Statistic with Applications”, Journal of Econometrics 104, 219-257.

    [13] Lee, L.-F. (2004): “Asymptotic Distribution of Quasi-maximum Likelihood Estimators for Spa-

    tial Autoregressive Models”, Econometrica 72, 6, 1899-1925.

    [14] Lesage, J. P. (2000): “ Bayesian Estimation of Limit Dependent Variable Spatial Autoregressive

    Models”, Geographical Analysis 32, 19-35.

    [15] McLeish, D. L. (1974): “Dependent Central Limit Theorems and Invariance Principals”, Annals

    of Probability 2, 620-628.

    42

  • [16] McMillen, D. P. (1992): “Probit with S