partial maximum likelihood estimation of spatial probit ......march 18, 2010 abstract this paper...

Partial Maximum Likelihood Estimation of Spatial Probit

Models∗

Honglin Wang†

Michigan State University

Emma M. Iglesias‡


and University of Essex

Jeffrey M. Wooldridge§


March 18, 2010

Abstract

This paper analyzes spatial Probit models for cross sectional dependent data in a binary

choice context. Observations are divided by pairwise groups and bivariate normal distributions

are specified within each group. Partial maximum likelihood estimators are introduced and

they are shown to be consistent and asymptotically normal under some regularity conditions.

Consistent covariance matrix estimators are also provided. Estimates of average partial effects

can also be obtained once we characterize the conditional distribution of the latent error. Fi-

nally, a simulation study shows the advantages of our new estimation procedure in this setting.

Our proposed partial maximum likelihood estimators are shown to be more efficient than the

generalized method of moments counterparts.

Keywords: Spatial statistics, Maximum Likelihood, Probit Model.JEL classification: C12, C13, C21, C24, C25.

∗We are grateful to participants at the 2009 Econometric Society European Meeting and at seminars at CityUniversity, London, London School of Economics, UCL, University of Exeter and University Carlos III for very useful

comments. Any remaining errors are our own.†Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,

USA. e-mail: [email protected].‡Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,

USA. e-mail: [email protected]. Financial support from the MSU Intramural Research Grants Program is gratefullyacknowledged.

§Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,USA. e-mail: [email protected].

1

1 Introduction

Most econometrics techniques on cross-section data are based on the assumption of independence of

observations. However, economic activities become more and more correlated over space with modern

communication and transportation improvements. On the other hand, technological advances in

communications and the geographic information system (GIS) make spatial data more available

than before. Spatial correlations among observations received more and more attentions in regional,

real estate, agricultural, environmental and industrial organizations economics (Lee, 2004).

Econometricians began to pay more attention on spatial dependence problems in the last two

decades and some important advances have been done in both theoretical and empirical studies1.

Spatial dependence not only means lack of independence between observations, but also a spatial

structure underlying these spatial correlations (Anselin and Florax, 1995). There are two ways to

capture spatial dependence by imposing structures on a model: one is in the domain of geostatistics

where the spatial index is continuous (Conley, 1999), the other is that spatial sites form a countable

lattice (Lee, 2004). Among the lattice models, there are also two types of spatial dependence models

according to spatial correlation between variables or error terms: the spatial autoregressive dependent

variable model (SAR) and the spatial autoregressive error model (SAE). In most applications of

spatial models, the dependent variables are continuous (Conley, 1999; Lee, 2004; Kelejian and Prucha,

1999, 2001; among others), and only few applications address the spatial dependence with discrete

choice dependent variables (exceptions include: Case, 1991; McMillen, 1995; Pinkse and Slade, 1998;

Lesage, 2000; Beron and Vijerberg, 2003). This paper is designed to address this gap and we are

concerned about the SAE model with discrete choice dependent variables.

As the name indicates, there are two aspects in the discrete choice model with spatial dependence.

First, the dependent variable is discrete and the leading cases occur where the choice is binary. Probit

and Logit are the two most popular non-linear models for binary choice problems. For the sake of

brevity, in this study we focus on Probit model, but the approach developed here generalizes to other

discrete choice models.

In discrete choice models, if the observations are independent, we use maximum likelihood estima-

tion to get efficient estimators given the correct conditional distribution of dependent variables. The

nice part of the maximum likelihood estimator (MLE) is that we can still get consistency, asymptotic

normality but inefficient estimators in many situations (panel data or clustering) by pseudo MLE

even when we ignore certain dependence among observations (Poirier and Rudd, 1988). However, the

non-linear property causes computation difficulties in estimation, and this computational difficulty

becomes much worse when dependence occurs, which results in solving -dimensional integration.

Dependence is the other aspect of this problem. General forms of dependence are rarely allowed for

in cross-sectional data, although routinely allowed for in time-series data (Conley, 1999). For example,

some scholars discussed discrete choice models with dependence in time-series data: Robinson (1982)

1Anselin, Florax and Rey (2004) wrote a comprehensive review about econometrics for spatial models.

2

relaxed Amemiya (1973) assumptions of independence in Tobit model, and proved that the MLE

with dependent observations is strongly consistent and asymptotically normal under some regularity

conditions. Poirier and Rudd (1988) discussed the Probit model with dependence in time-series

data, and developed generalized conditional moment (GCM) estimators which are computational

attractive and relatively more efficient.

However, dependence in space is more complicated than in the time setting because of four

reasons: first, time is one dimensional whereas space has at least two dimensions; second, time has

natural order (direction) whereas space has no natural direction; third, time is regularly divided

because of regular astronomical phenomena whereas spatial observations are attached to geographic

properties of the surface of the earth; fourth, time-series observations are draws from a continuous

process whereas, with spatial data, it is common for the sample and the population to be the same

(Pinkse et al., 2007).

Therefore, how to deal with dependence in space in estimation is the key to spatial econome-

tricians. Inspired by works about dependence in time-series data, Conley (1999) uses metrics of

economic distance to characterize dependence among agents, and shows that the GMM estimator is

consistent and asymptotically normal under some assumptions similar to time-series data. He also

provides how to get consistent covariance matrix estimator by an approach similar to Newey-West

(1987). Pinkse and Slade (1998) use GMM in the discrete choice setting with the SAE model, and

show that the GMM estimator remains consistent and asymptotically normal under some regularity

conditions. Although Pinkse and Slade (1998) generated generalized residuals from the MLE as the

basis of the GMM estimators, they do not take advantage of information from spatial correlations

among observations, and hence the GMM estimator is much less efficient than full ML estimators.

Lee (2004) examines carefully the asymptotic properties of MLE and quasi-MLE for the linear spatial

autoregressive model (SAR), and he shows that the rate of convergence of those estimators may de-

pend on some general features of the spatial weights matrix of the model. If each units are influenced

by only a few neighboring units, the estimators may have√-rate of convergence and asymptotic

normality; otherwise, it may have lower rate of convergence and estimators could be inconsistent.

In this paper, we choose to capture spatial dependence by considering spatial sites to form a

countable lattice, and we provide a middle-ground approach which trades off efficiency and compu-

tation burdensome. The idea is to divide spatial dependent observations into many small groups

(clusters) in which adjacent observations belong to one group. The implicit rationale behind this

is adjacent observations usually account for the most important spatial correlations between obser-

vations. If we can correctly specify the conditional joint distribution within groups, which allows

us to utilize relatively more information of spatial correlations, estimating the model by partial ML

(PML) will give us consistent and more efficient estimators, which should be generally better than

GMM estimators. However, this approach is subject to biased variance-covariance matrix estimators

because of spatial correlations among groups. To deal with this problem, we follow the methods

proposed by Newey-West (1987) and Conley (1999) to get consistent variance-covariance matrix es-

3

timators. Of course, this middle ground approach will not get the most efficient estimator. However,

since information from adjacent observations usually capture important spatial correlations in the

whole sample, we get a consistent and a relatively efficient estimators, and we avoid some tedious

computations at expense of a loss of a relatively small part of efficiency.

This paper is organized as follows. Section 2 reviews some widely applied econometric techniques

on discrete choice models. Also, the SAE model with discrete choice dependent variable is presented

and the regularity conditions are specified. We also show how we can estimate average partial effects

once we characterize the conditional distribution of the latent error. Section 3 presents the bivariate

spatial Probit model. In Section 4, we prove consistency and asymptotic normality of the PML

estimator (PMLE) under regularity assumptions, and discuss how to get consistent covariance matrix

estimators. Section 5 presents a simulation study showing the advantages of our new estimation

procedure in this setting. Finally, Section 6 concludes. The proofs are collected in Appendix 1, while

the results for the simulation study are provided in Appendix 2.

2 Discrete Choice Models with Spatial Dependence

2.1 A Probit Model without Dependence

We first review the standard Probit model without dependence and the underlying linear latent

variable model is

∗ = + (1)

where ∗ is the latent dependent variable and a scalar, is a 1× vector of regressors, is a×1parameter vector to be estimated, and is a continuous random variable, independent of and

it follows a standard normal distribution. However, we cannot observe ∗ , and we can only observethe indicator , which is related to ∗ as follows

=

(1 if ∗ 00 if ∗ ≤ 0

) (2)

Therefore, we can get the conditional distribution of given as

( = 1|) = ( ∗ 0|) = ( −|) = Φ() (3)

where Φ denotes the standard normal cumulative distribution function (cdf). It is easy to see we can

get

( = 0|) = 1−Φ() (4)Since is a Bernoulli random variable, we can write the conditional density function of

conditional on as

(|) = [Φ()][1−Φ()]1− ∈ {0 1} (5)

4

Also, given the independence assumption of random variables, the log likelihood function can be

written as

() =X=1

{ log[Φ()] + (1− ) log[1−Φ()]} (6)

and the sufficient condition for uniqueness of the global maximum of () is that the function is

strictly concave (Gourieroux, 2000). We can solve then b from the first order condition()

=

X=1

−Φ()Φ()[1−Φ()]()

0 = 0 (7)

where is the probability density function (pdf) of the standard normal distribution. However,

the simple closed-form expressions for the MLE are not available because the cdf of the normal

distribution has no close-form solution. So the MLE must be solved by using numerical algorithms2.

In general, we can prove that the conditional MLE is consistent and the most efficient estimator

given some regularity conditions3 such as correctly specifying a parametric model, an identified

and a log-likelihood function that is continuous in .

2.2 A Probit Model with Spatial Dependence

When spatial dependence is present in a binary response model, the dependence not only brings us

the difficulties in estimating parameters, but also raises a question: what is the estimation of interest?

because the magnitudes of the partial effects depend not only on the value of the covariates, but also

on the value of the unobserved heterogeneity caused by the spatial dependence.

For example, a binary response model with spatial correlation in the latent errors is given by

= 1[ + ≥ 0] (8)

where {} is allowed to be spatially correlated. Distance between points is almost always assumedto be exogenous, and so we can think of having marginal distributions

(| ) (9)

where is the matrix of covariates for all observations and is a matrix containing pairwise

distances. The model that just allows spatial correlation in the errors essentially assumes

(| ) = (| ) (10)

and, in fact, the dependence on appears only in the variance. In particular, for probit,

| ∼ (0 ()) (11)2Commonly used numerical solutions are all derived from Newton’s method. (see Gourieroux, 2000 for details).3See details in Wooldridge (2002, page 391).

5

where (· ·) is known (in principal) from the spatial autocorrelation assumption. Literature hasfocused on estimating the parameters, and (the latter is the spatial dependence coefficient, a

scalar in practice), but we need to understand that estimating and is not our final goal. We should

pay more attention to estimating average partial effects based on the average structural function,

which is directly correlated to parameters estimating (especially spatial dependence coefficient ).

() = nΦh

p()

io(12)

which is (hopefully) consistently estimated by

−1X=1

Φ

∙̂

q( ̂)

¸ (13)

If we knew the unconditional distribution of , we would not have to use the variance function at

all; but we have the conditional distribution. We proceed now to provide a new method to estimate

the parameters of a spatial probit model.

2.3 A Probit Model with Spatial Error Correlation

Consider the Probit model with spatial error correlation (SAE), where the underlying linear latent

variable model is

∗ = + (14)

= X

=1

+ (15)

where is an element in the spatial weights matrix which can be defined by different spatial

distances such as the Euclidean distance. is the spatial autoregressive error coefficient and we have

a random variable ∼ (0 1). We can write equations (14) and (15) in matrix form as follows

∗ = + (16)

= ( − )−1 (17)

so that the variance-covariance matrix for the model is

Ω ≡ (| ) = [( − )0( − )]−1 (18)

If ∗ is observable, equation (16) becomes a linear function, and we can use the Jacobian trans-formation of into ∗ and write the log likelihood function as

( ) = −2ln(2)− 1

2( ∗ −)00( ∗ −) + ln || (19)

where = − , and then the estimate of can be solved as b = ( 00)−1 00 ∗6

However, in practice we cannot observe ∗, and we can only observe , and it implies thatwe have to deal with a non-linear Probit model because of the normal distributional assumption.

Moreover the errors are correlated, and the full likelihood function becomes

= (1 = 1 2 = 2 · · · = ) =1Z

−∞

· · ·Z

−∞

() (20)

() = (2)−2 |Ω|−1− 12 (0Ω−1) (21)

Although theoretically, if we take the first derivatives subject to and the spatial coefficient ,

we obtain

=

{1Z

−∞

· · ·Z

−∞

(2)−2 |( − )0( − )|−12 [0(− )0(− )]}

= 0 (22)

=

{1Z

−∞

· · ·Z

−∞

(2)−2 |( − )0( − )|−12 [0(− )0(− )]}

= 0 (23)

The expression of the first derivatives are quite complicated, but if we have sufficient computational

ability and and are identifiable, we can get consistent and efficient estimates of and by using

numerical methods. However, in practice, it would be a formidable computational task even for a

moderate size sample. We propose in this paper a more attractive procedure in the next sections

that involves to deal with partial maximum likelihood.

2.4 Probit Models with Other Forms of Spatial Correlation

Generally, there is no reason to think that spatial correlation is properly modeled by (15). Other

forms are possible. For example, one might assume that, outside of a certain geographic radius from

a given observation , is uncorrelated with shocks to the outlying regions. So, for example, we

might assume a constant correlation with any unit within a given radius — similar to a random effects

structure for unbalanced panel data.

Alternatively, we may prefer more of a moving average structure, such as

= +

ÃX6=

! (24)

where the are i.i.d. with unit variance. This formulation is attractive because it is relatively easy

to find variances and pairwise correlations, which we will use in the PMLEs described in the next

section. For example,

(| ) = 1 + 2ÃX

6= 2

! (25)

7

Clearly, methods that use only the variance in estimation can only identify 2 (but we almost

always think 0, anyway). Pairwise covariances can also be obtained from

( | ) = + + 2Ã X

6=6=

! (26)

Expressions like this for the covariance between different errors are important for applying grouped

PMLE methods.

3 Using Partial MLEs to Estimate General Spatial Probit

Models

Estimating a probit spatial autocorrelation model by full MLE is a prodigious task, although several

approaches have been applied. The EM algorithm can be used (McMillen, 1992), the RIS simulator

(Beron and Vijverberg, 2003), and the Bayesian Gibbs sampler (Lesage, 2000). But each of these

approaches is still computationally burdensome. To combine such approaches with simulation studies,

or to be able to quickly estimate a range of models, is outside the abilities of even current computation

capabilities for even moderate sample sizes.

To get an estimator that is computationally feasible, Pinkse and Slade (1998) proposed using gen-

eralized method of moments (GMM) using information on the marginal distributions of the binary

responses. In particular, the generalized residuals from the marginal probit log likelihood are used to

construct moment conditions for the GMM method. Pinske and Slade show that, under conditions

very similar to those in this paper, the GMM estimator is consistent and asymptotically normal.

The consistent variance-covariance matrix can also be obtained theoretically without a covariance

stationary assumption, although Pinske and Slade do not discuss estimation of the asymptotic vari-

ance. Therefore, the GMM estimator is almost practically useful, but it is fundamentally based on

the marginal probit models. Thus, while a GMM estimator can be obtained that is efficient given

the information on the marginal likelihood, the method throws out much useful information. This

is very similar to just using a pooled probit for time series or panel and ignoring dynamics. We

describe a simplified version of this approach in Section 3.1, which, in effect, uses a heteroskedastic

probit model to estimate the along with any spatial autocorrelation parameter.

Using only the marginal distribution of , conditional on the covariates and weights, likely results

in serious loss of information for estimating both and the spatial autocorrelation parameters.

Our key contribution in this paper is to explore the use of partial maximum likelihood where we

group small numbers of nearby observations and obtain the joint distribution of those observations.

Naturally, these distributions are determined by the fully specified spatial autocorrelation model —

just as we must obtain the implied variance to apply marginal probit methods. Once the covariances

between observations are found as a function of the weights and , we can use that information in

8

multivariate probit estimation. Section 3.2 covers the case of where we describe a bivariate probit

approach, with heteroskedasticity and covariance implied by the particular spatial autocorrelation

model. Using a single covariance in addition to the variance seems likely to improve efficiency of

estimation.

3.1 Univariate Probit Partial MLE

One way to estimate the coefficients along with spatial correlation parameters is to derive the

marginal distributions, ( = 1| ) as a function of all of the weights (and the parameters, and , of course). Under the joint normality assumption, the model will be a form of probit with

heteroskedasticity. In particular, given any spatial probit model such that the variances are well

defined, we can find

( = 1| ) = Φ(()) (27)where 2 () = (| ) = (| ) is a function of all weights, , and the spatial correlationparameters, . As is well-known in time series contexts — see, for example, Poirier and Ruud (1988)

or Robinson (1982) — using probit while ignoring the time series correlation leads to consistent

estimation under standard regularity conditions, provided the data are weakly dependent. Thus, it is

not surprising that pooled probit that accounts for the heteroskedasticity in the marginal distribution

is generally consistent for spatially correlated data, too — provided, of course, we limit the amount

of spatial correlation.

The log likelihood can be written generically as

() =X=1

{ log[Φ(())] + (1− ) log[1−Φ(())]} (28)

Assuming that and are identified, and that the conditions in Section 4 hold, the pooled het-

eroskedastic probit is generally consistent and√-asymptotically normal. But, for reasons we dis-

cussed above, it is likely to be very inefficient relative to the full MLE. Further, estimators that use

some information on the spatial correlation across observations seem more promising in terms of

increasing precision.

3.2 Bivariate Probit Partial MLE

We now turn to using information on pairs of “nearby” observations to identify and . There is

nothing special about using pairs; we could use, say, triplets, or even larger groups. But the bivariate

case is easy to illustrate and is computationally quite feasible.

For illustration, assume a sample includes 2 observations, and we divide the 2 observations

into pairwise groups according to the spatial Euclidean distance between them (see Graph 1). In

other words, each group includes two observations, with the idea being that the internal correlation

9

between the two observations is more important than external correlations with observations in other

groups. Within a group, the two observations follow a conditional bivariate normal distribution be-

cause error terms are assumed to have a joint normal distribution.

Graph 1: (2 observations =⇒ groups)

Let© ∗ : = 1

ªwhere, for notational simplicity, we assume 2 Observations. Then ∗ =¡

∗1 ∗2

¢follows something similar to a random effects probit model, but there is heteroskedasticity

in the variances and covariances. In group , we have

∗1 = 111 + 212 + + 1 + 1 (29)

∗2 = 121 + 222 + + 2 + 2 = 1 2 (30)

We can rewrite the above equations in latent variable matrix form as

∗1 = 1 + 1 (31)

∗2 = 2 + 2 = 1 2 (32)

where 1 and 2 are 1× vectors of regressors and is a ×1 vector. 1 and 2 are scalars. Ingroup , observation and observation are not only correlated with each other, but also correlated

with other observations over space. Therefore, the variances and covariance between 1 and 2 not

only depend on the weight within group, but also weights with other observations out of the group,

and the parameters, as well. See, for example equation (26).

10

It is easy to see that (1|1 ) = (2|2 ) = 0, and the covariance-variance matrix forgroup under assumption (??), is defined as Ω ≡ (| )

(| ) ≡ Ω() ="Ω11 Ω12

Ω21 Ω22

# (33)

where we suppress the dependence on and in what follows for notational simplicity. The variance

terms are the same as in the Pinske and Slade (1998) approach. In our case, the covariance must

also be computed, although this will be the source to improve the precision of the estimates of and

Note here that elements in Ω depend not only on the weight between two observations in group

, but also weights for every observation in the whole sample, because two observations in group

not only correlated with each other, but also correlated with other observations over space. Since we

define two nearby observations as one group, we pick up the corresponding part (Ω) from the whole

covariance-variance matrix (see equation (26)).

Since we cannot observe ∗1 and ∗2, as we discussed in the univariate Probit model, we define

=

(1 if ∗ 00 if ∗ ≤ 0

) (34)

Therefore the conditional bivariate normal distribution of 1 and 2 given is given as

(1 = 1 2 = 1|) = (1 + 1 0 2 + 2 0|) (35)= (1 1 2 2|) = Φ2( 1p

Ω112pΩ22

) (36)

=(1 2)p

(1)p (2)

=Ω12pΩ11Ω22

(37)

where Φ2 is the standard bivariate normal distribution, 2 is the standard density function of the

bivariate normal distribution and is the standardized covariance between two error terms. Esti-

mation in this context is similar to “random effects” probit with 2 observations, although one needs

to properly compute the variances and covariances of the latent errors. We can also use random

matched pairs of even groups of more than two observations. The idea behind is that the more

correlations used, the better.

Given that (1 2) has a joint normal distribution, we can write

1 = 12 + 1 (38)

where

1 =(1 2)

(2) (39)

and 1 is independent of and 2

11

Because of the joint normality of (1 2) 1 is also normally distributed with (1) = 0 and

(1) = (1)− 21 (2) (40)

Thus, we can write the conditional distribution of 1 as

(1| 2) ∼ (0 (1)) (41)

Substitute equation (38) back to ∗1 = 1 + 1, and we can get

∗1 = 1 + 12 + 1 (42)

Therefore

(1 = 1| 2) = ΦÃ1 + 12p

(1)

! (43)

The reason we want to find (43) is to retrieve (1 = 1 2 = 1|). Since

(1 = 1 2 = 1|) = (1 = 1|2 = 1)× (2 = 1|) (44)

it is easy to see that (2 = 1|) = Φ( 2√ (2)

), and thus it remains to get (1 = 1|2 = 1)First, since 2 = 1 if and only if 2 −2 and 2 follows a normal distribution and it is

independent of , then the density of 2 given 2 −2 is( 2√

(2))

(2 −2) =( 2√

(2))

Φ( 2√ (2)

) (45)

Therefore,

(1 = 1|2 = 1) = [ (1 = 1| 2)|2 = 1) (46)= [Φ(

1 + 12p (1)

)|2 = 1 ] (47)

=1

Φ( 2√ (2)

)

Z ∞−2

Φ(1 + 12p

(1))(

2p (2)

)2 (48)

and it is easy to see that (1 = 0|2 = 1) = 1 − (1 = 1|2 = 1) because 1 is thebinary variable.

Similarly, we can get

(1 = 1|2 = 0) = 11−Φ( 2√

(2))

Z 2−∞

Φ(1 + 12p

(1))(

2p (2)

)2 (49)

and (1 = 0|2 = 0) = 1− (1 = 1|2 = 0)

12

Now we are ready to get (1 = 1 2 = 1|) as follows

(1 = 1 2 = 1|) = 1Φ( 2√

(2))

Z ∞−2

Φ(1 + 12p

(1))(

2p (2)

)2

×Φ( 2p (2)

) (50)

=

Z ∞−2

Φ(1 + 12p

(1))(

2p (2)

)2 (51)

and similarly we can obtain finally

(1 = 0 2 = 1|) = Φ( 2p (2)

)−Z ∞−2

Φ(1 + 12p

(1))(

2p (2)

)2 (52)

(1 = 1 2 = 0|) =Z 2−∞

Φ(1 + 12p

(1))(

2p (2)

)2 (53)

(1 = 0 2 = 0|)= [1− Φ( 2p

(2))]−

Z 2−∞

Φ(1 + 12p

(1))(

2p (2)

)2 (54)

4 Partial Maximum Likelihood Estimation

As we discussed in the introduction, if the observations are independent, we can simplify the mul-

tivariate distribution into the product of univariate distributions, and then the ML estimator can

be obtained easily. However, spatial correlations among observations do not allow the simplification

any more. Under spatial correlation, the situation is kind of similar to the panel data case. In panel

data, we cannot assume independence among observations over different periods for the same person

(or firm), which means we are not likely to specify the full conditional density of given correctly.

Therefore, we need to relax the assumption in the panel data case. The way we deal with the problem

is that if we have a correctly specified model for the density of given we can define the partial

log likelihood function as

∈Θ

X=1

X=1

log (| ) (55)

where (| ) is the density for given for each . The partial log likelihood function worksbecause 0 (the true value) maximizes the expected value of the above equation provided we have

the densities (| ) correctly specified (Wooldridge, 2002).We can apply a similar idea to the spatial Probit model: if we have the bivariate normal densities

2(12| ) correctly specified for each group, we could get a consistent estimator by partialML. However, there are several differences between panel data and spatial dependent data: first, the

panel data model assumes that the cross section dimension () is sufficiently large relative to the

13

time dimension ( ), but in spatial data we do not have this assumption. Second, in the panel data

model, we view the cross section observations as independent, while in the spatial data model, even

though we divided the sample into groups, however, we are definitely not assuming independence

among groups. Observations in different groups are still correlated, but the correlations are assumed

to decay as distances become further away. Third, as we discussed before, dependence in space is

more complicated than dependence in time, and we need to assume that the correlations between

groups die out quickly enough as distance goes further away. In short, we need to examine carefully

how the weak law of large numbers (WLLN) and central limit theorem (CLT) can be applied in the

spatial dependent case. We will discuss these issues and provide proofs in the following sections.

Rather than “filling in”, we use a type of asymptotics where area size is increasing. Essentially, we

need the process to be mixing in a spatial sense.

First, we can write the partial log likelihood function as

=X

=1

{12 log(1 = 1 2 = 1|) + 1(1− 2) log(1 = 1 2 = 0|)

+(1− 1)2 log(1 = 0 2 = 1|) + (1− 1)(1− 2) log(1 = 0 2 = 0|)} = 1 2(56)

and for the sake of brevity, we define

(1 1) ≡ log(1 = 1 2 = 1|); (1 0) ≡ log(1 = 1 2 = 0|); (57)(0 1) ≡ log(1 = 0 2 = 1|) and (0 0) ≡ log(1 = 0 2 = 0|) (58)

Therefore, we can rewrite the partial log likelihood function as

=X

=1

{12(1 1)+1(1−2)(1 0)+(1−1)2(0 1)+(1−1)(1−2)(0 0)} (59)

We proceed now to prove the consistency and asymptotic normality of the PMLE.

4.1 Consistency of Bivariate Probit Estimation

Consistent estimators b ≡ (b b)0 are the ones that converge in probability to the true value 0 ≡(0 0)

0, i.e. b −→ 0, as the sample size goes to infinity for all possible true values. In this section,to make the asymptotic arguments formal, we distinguish between the true value, 0, and a generic

parameter value .

In the bivariate probit estimation, the estimator b is defined as: b maximizes () subject to ∈ Θ, where Θ is the parameters set. The objective function () is defined as

() ≡ 1

X=1

{12(1 1) + 1(1− 2)(1 0)

+(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (60)

14

i.e, in other words, b = argmax∈Θ

() (61)

Remember that this objective function represents a partial log likelihood, not a fully log likelihood:

we are only using information on the conditional distribution (1 2| ) across the groups .We are not using (1 2 | ) as in a full maximum likelihood setting.The identification condition requires that () is uniquely maximized at the true value 0 where

() is defined as

() ≡ lim→∞

[ ()] (62)

This condition typically holds for well-specified models when there is not perfect collinearity among

the regressors. Following Pinske and Slade (1998), we impose the identification condition in all our

analysis. Further, one needs to be a little careful in parameterizing the spatial autocorrelation, but

standard models of spatial autocorrelation cause no problems.

The following Theorem 1 states the main consistency result for a broad class of spatial probit

models. We define () ≡ () and lim

→∞[ ()] = ()

Theorem 1. If (i) 0 is the interior of a compact set Θ which is the closure of a concave set,

(ii) attains a unique maximum over the compact set Θ at 0(iii) is continuous on Θ , (iv)

the density of observations in any region whose area exceeds a fixed minimum is bounded, (v) as

→∞ sup(°°° 1Pr(1=12=1|) + 1Pr(1=12=0|) + 1Pr(1=02=1|) + 1Pr(1=02=0|)°°°) ∞ (vi)

as →∞ sup(kk+kk) = (1) (vii) sup |( )| ≤ () = 1 2 where denotesthe distance between group and , and () → 0 as → ∞ and (viii) lim

→∞[ ()] exists, (ix)

sup kk ∞, then b − 0 = (1).Proof: Given in Appendix 1.

Condition (i) is a standard assumption from set theory. Condition (ii) is the identification condi-

tion for MLE. Condition (iii) assumes that the function is continuous in the metric space, which

is a reasonable assumption and necessary for the proof that () is stochastically equicontinuous.

Condition (iv) simply excludes that an infinite number of observations crowd in one bounded area.

The minimum area restriction is imposed because an infinitesimal area around a single observation

has infinite density. Condition (v) makes sure any one of these four situations will be present in a

sufficiently large sample. Condition (vi) makes sure the regressors are deterministic and uniformly

bounded, which is not a strong assumption in this literature. Condition (vii) is the key assumption

for this theorem, and it requires that the dependence among groups decays sufficiently quick when

the distance between groups become further apart. This assumption employs the concept from -

mixing to define the rate of dependence decreasing as distance increases. Condition (viii) assumes

the limit of [ ()] exists as →∞ which is not a strong assumption. Condition (ix) is actuallyimplied by the rule of dividing groups, which just excludes that the two groups are exactly in the

same location.

15

4.2 Asymptotic Normality

As we discussed in the introduction, the spatial dependence is more complicated than time-series

dependence at least in four perspectives. These differences cause that central limit theorems (CLT)

need stronger conditions for the spatial dependence case. To deal with general dependence problems,

the common way in the literature is to use the so called "Bernstein Sums", which break up into

blocks (partial sums), and we consider the sequence of blocks. Each block must be so large, relative

to the rate at which the memory of the sequence decays, that the degree to which the next block

can be predicted from current information is negligible. But at the same time, the number of blocks

must increase with so that the CLT argument can be applied to this derived sequence (Davidson,

1994).

In this section, we show under what assumptions we are able to apply McLeish’s central limit the-

orem (1974) to spatial dependence cases to get asymptotic normality for the spatial Probit estimator.

This is presented in the following Theorem. denotes the transpose of matrix

Theorem 2: If the assumptions of Theorem 1 hold, and in addition: (i) as →∞ 2(∗)(∗) = (1)

for all fixed ∗ 0 (ii) the sampling area grows uniformly at a rate of√ in two non-opposing

directions, (iii) (0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformlypositive definite matrices; then

√(b − 0) → [0 (0)−1(0)(0)−1] where (0) ≡ (0)

and (0) =2

(0)

Proof: Given in Appendix 1.

Condition (i) is stronger than condition (vii) in Theorem 1, and it is also stronger than the usual

condition in time series data because spatial dependent data has more dimension correlations than

time series data. It shows that how dependence decays when distance between groups gets further

away, and the dependence decays at the rate fast enough. Condition (ii) just repeats the assumption

in the Bernstein’s blocking method, the two non-opposing directions just exclude sampling area grows

at two parallel directions, which does not make much sense in spatial dependent case. Conditions in

(iii) are natural conditions about matrices, which are implied by the previous assumptions. Matrices

are semidefinite if some extreme situations happen such as Pr(1 = 1 2 = 1|) = 0 which areassumed to be excluded in the previous assumptions.

4.3 Estimation of Variance-covariance Matrices

Consistent estimation of the asymptotic covariance matrix is important for the construction of as-

ymptotic confidence intervals and hypothesis tests (Newey and West, 1987). The estimations of

(i.e. ̂ = (b)) are relatively easy, usually just obtaining sample analogues of 0 with b; but theestimation of (i.e. b = (b)) is more difficult and more important because of the correlationsamong groups. Newey-West (1987) proposed a method to estimate the variance-covariance matrix in

16

settings of dependence of infinite order under a covariance stationary condition, and they suggested

modified Bartlett weights to make sure the estimated variance and test statistics were positive.

Andrews (1991) established the consistency of kernel HAC (Heteroskedasticity and Autocorrelation

Consistent) estimators under more general conditions. Pinkse and Slade (1998) also showed that we

can obtain (b) −(0) = (1) under regularity assumptions, where () ≡ [() ()] (seeLemma 9 in Appendix 1). This approach is feasible in practice only if we can get closed form expres-

sions for [() ()], which should be a function of , and then plug in b for 0 in the function toget consistent covariance estimators. However, it is difficult to get closed form expressions for (b)in practice, and hence we follow an alternative approach proposed by Conley (1999).

A feasible way to obtain a consistent estimate of a variance-covariance matrix that allows for a

wider range of dependence is to apply the approach of Conley (1999) along the lines of Newey-West

(1987). We follow this procedure in the following Theorem 3.

Let ΞΛ be the −algebra generated by a given random field ∈ Λ with Λ compact, andlet |Λ| be the number of ∈ Λ. Let Υ (Λ1Λ2) denote the minimum Euclidean distance from anelement of Λ1 to an element of Λ2 There exists also a regular lattice index random field ∗ thatis equal to one if location ∈ 2 is sampled and zero otherwise. ∗ is assumed to be independentof the underlying random field and to have a finite expectation and to be stationary. The mixing

coefficient is defined as

() ≡ sup {| ( ∩)− () ()|} ∈ ΞΛ1 ∈ ΞΛ2 and|Λ1| ≤ |Λ2| ≤ Υ (Λ1Λ2) ≥

We also define a new process () such as

() =

( () if ∗ = 10 if ∗ = 0

)

Then

Theorem 3. If (i) Λ grows uniformly in two non-opposing directions as −→ ∞ (ii)(0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformly positive definitematrices, (iii) as defined in Theorem 1 = 1 2 and ∗ are mixing where () con-verges to zero as → ∞; () is Borel measurable for all ∈ Θ and continuous on Θ andfirst moment continuous on Θ (iv)

P∞=1 () ∞ for + ≤ 4 (v) 1∞ () = (−2)

(vi) for some 0 (k (0)k)2+ ∞ andP∞

=111 ()(2+) ∞ (vii) () is Borel

measurable for all ∈ Θ continuous on Θ and second moment continuous, (0) exists and isfull rank, (viii)

P∈2 (0 (0) (0)) is a non-singular matrix, (ix) the ( ) are uni-

formly bounded and ( ) −→ 1, −→ ∞ as −→ ∞ ( −→ ∞) = ¡13

¢and

= ¡ 13

¢, (x) for some 0 (k (0)k)4+ ∞ and as defined in Theorem 1

17

= 1 2 and ∗ are mixing where ∞∞ ()(2+) = (−4) (xi) supΘ k ()k2 ∞ and

supΘ k() [ ()]k2 ∞ thenb −(0) = (1) as −→∞where we split = [ ], Λ is a rectangle so that ∈ {1 2 } and ∈ {1 2 } and

b = −1 X=0

X=0

X=+1

X=+1

( )

⎛⎝ ³b´−− ³b´ +−−

³b´ ³b´⎞⎠

−−1X

=1

X=1

³b´ ³b´

To ensure positive semi-definite covariance matrix estimates, we need to choose an appropriate

two-dimensional weights function that is a Bartlett window in each dimension

( ) =

½(1− ||

)(1− ||

) || ||

0 else

¾

Proof: It follows from Conley (1999), Proposition 3.

5 Simulation Study

In the previous section, we have proved that the partial maximum likelihood estimator (PMLE)

based on the bivariate normal distribution is consistent and asymptotically normal. Moreover, one

of the most attractive properties of our new PMLE is that we can get a more efficient estimator

compared to the GMM estimator, and the approach is much less computational demanding when

compared to full information methods. In order to learn about the gains in efficiency that we obtain

in the context of a Bivariate Spatial Probit model when using PML versus GMM, we conduct in this

Section a simulation study to show the efficiency gains of PML.

5.1 Simulation Design and Results

Instead of comparing our PMLE to the GMM estimator of Pinkse and Slade (1998) directly, we

choose to compare the PMLE to the heteroskedastic Probit estimator (HPE) because of two reasons:

First, the HPE uses similar information with the GMM estimator because both methods use gener-

alized residuals from the Probit estimation to construct the moment conditions, which means that

both methods use the information from the heterogeneities of the diagonal terms of the variance-

covariance matrix, while our PMLE uses both diagonal and off-diagonal correlations information

between two closest neighbors. Second, the STATA4 source codes for bivariate probit estimation and

4See http://www.stata.com/

18

heteroskedastic Probit estimation are available online, and we can easily add the spatial parts into

these existing source codes to compare PML estimators with Heteroskedastic Probit Estimators. We

consider two simulation settings allowing for different types of spatial dependence.

5.1.1 Case 1

According to the theoretical framework given in previous sections, we could generate a dataset which

allows a general correlation structure across groups as equations (14) and (15), and it requires to

specify the exact formula (as functions of and ) for the elements of Ω However, it is quite

difficult to derive the pairwise covariances for a bivariate probit because the exact formula for Ω12(and of Ω11Ω22) is very complicated, which is an element of the inverse matrix with 2 spatially

correlated observations as follows

Ω =

"Ω11 Ω12

Ω21 Ω22

#= [( − )0( − )]−1 =

⎡⎢⎢⎢⎢⎢⎢⎣Ω111

Ω11 Ω12

Ω21 Ω22

Ω22

⎤⎥⎥⎥⎥⎥⎥⎦ (63)

Therefore, it seems reasonable to do the following. Let be the weighting matrix which can be

generated in STATA5 according to the distance between observations

∗ = 11 +22 +33 + (64)

= (65)

where ∼ (0 ). The weighting matrix is standardized so that the diagonal elements areones, and then the elements of shrink as distance is increasing . The reason we do this is because it

is easier to determine () and ( ) to apply the HP and the bivariate probit estimators.

In this way, we still allow general correlation across groups, and we are able to compare the efficiency

gains from only using the diagonal information (the HP approach) to using both diagonal and off-

diagonal information (bivariate probit), and we do not require to know the exact formula for the

elements in Ω (given in equation (63)) to reach the same goal.

Therefore, we generate the dataset according to equations (64) and (65), which allows spatial

correlation between any two observations, and we set the true parameter values for 1 2 and

3 equal to 1, 1 and 1 respectively. Since our main focus in this study is on the estimation of the

spatial parameter , we also set different true values for each simulated sample: = 02; 04; 06

and 08 to test for the performance of the two estimation methods (PML and HP). These values for

5The STATA command is “Spatwmat”. Since the speed to calculate the inverse of a matrix is much slower as the

size of matrix increases, and moreover the maximum matrix size in Stata is 800, we allow here each observation to bespatially correlated to nearby 99 observations.

19

are in the range of the estimated value in the empirical application of Pinkse and Slade (1998). In

this setting and with 1000 replications, we consider a sample size of = 1000 observations (where

the sample size is divided into 500 pairwise groups). Finally, we also simulate samples of sizes 500 and

1500 (with 250 and 750 pairwise groups respectively) to check the performance of the two methods

in different samples sizes. The simulation results are reported in Tables 1 (for the spatial parameter

) and 2 (for the 1 2 and 3) in Appendix 2.

From Table 2, we can observe that both the HPE and the PMLE of 1 2 and 3 converge to true

parameter values across the different parameter values as sample size increasing. Also the the PML

estimator has much less bias than the HPE. Moreover, as expected, PML always provides smaller

standard errors than the HP estimation method and bias and standard errors decrease in general

when sample size increases.

Furthermore, it is in Table 1 where we can observe the largest advantages of using PML versus HP.

We can see that the PMLE is much better than the HPE in terms of estimating the spatial parameter

. The PMLE is always much closer to the true parameter values and with small standard errors

across different sample sizes and parameter values (as expected from our theoretical results), while

the HPE is much further away from true parameter values and it is has a much larger standard

deviation over the different sample sizes, even though HPE also shows the trend to converge to the

true values in general as the sample increases. The HPE has always much larger standard deviation

than the PMLE, showing clearly the gains in efficiency of PML versus HPE/GMM as predicted

by our theory. Since both the HPE and the GMM estimator use generalized residuals from Probit

estimation to construct the moment conditions, we conjecture that the GMM estimator is subject to

similar inefficiency problems in estimating the spatial coefficient. Also, as it is expected, the bias of

the PMLE decreases when increases.

In summary, from the simulation results of Tables 1 and 2, we see how the PMLE outperforms

clearly the HPE (i.e., the GMM estimator of Pinkse and Slade (1998)), specially when estimating the

spatial parameter , which implies that the PMLE is much more robust and efficient in the context

of the spatial probit model. The simulation results provide clear evidence of the gains in efficiency

that can be obtained by PML versus GMM, as predicted by our theoretical results in the previous

section.

5.1.2 Case 2

We considered a second data generating process given as (64), where again we set the true parameter

values for 1 2 and 3 equal to 1, 1 and 1 respectively, and the covariates are jointly normal and

independent. In this case we assume (24) where the are i.i.d. (0 1) and is the

reciprocal of the distance between and . Then we can get closed form expressions for the variance

and covariance given in (25) and (26). Results for 1,000 replications are provided in Tables 3 and 4

in Appendix 2. Again, the PMLE provides improvements versus GMM, especially when estimating

20

Efficiency gains in estimating are smaller in Table 4, probably because we chose in this setting

pairwise observations to be closest to each other.

6 Conclusions

The idea of this paper is simple and intuitive: instead of just using information contained in moment

conditions (GMM), we divide observations into pairwise groups. Provided we correctly specify the

conditional joint distribution within these pairwise groups, we show that the spatial bivariate Pro-

bit model allows us to use the most important information of spatial correlations among adjacent

observations and to get a more efficient estimator: the partial MLE (PMLE). We prove that the

PMLE is consistent and asymptotically normal under some regularity conditions. We also discuss

how to get consistent covariance matrix estimators under general spatial dependence by following the

approach of Conley (1999) and Newey-West (1987), which is more usable in practice compared to

the proposal of Pinkse and Slade (1998). An attractive aspect of our approach is that we can obtain

a more efficient estimator than GMM without introducing stronger assumptions, and the approach

is much less computationally demanding compared to full information methods. In order to learn

about the gains in efficiency that we obtain in the bivariate Probit model with PML versus the GMM

estimator, we provide a simulation study in Section 5. The advantages in terms of bias and efficiency

of our new estimation procedure proposed in this paper are clearly demonstrated. Moreover, if we

extend this method to the trivariate or higher dimensional multivariate Probit models, we can obtain

even more efficient estimators, but it comes at the expense of more computational demands. We

have also shown how the estimation of the spatial probit model through PML also allows to obtain

estimates of partial average effects. The results in this paper can be generalized to consider models

with distributed lags, and also to extend the performance of the PMLE to lots of other nonlinear

models, including Tobit, count and switching. This will be the subject of future work.

7 Appendix 1

7.1 Proofs to Theorems

Proof of Theorem 1. If we can prove that ()→ () uniformly, by the information inequality,

() has a unique maximum at the true parameter when 0 is identified. Then under technical

conditions for the limit of the maximum to be the maximum of the limit, b should converge inprobability to 0. Sufficient conditions for the maximum of the limit to be the limit of maximum

are that the convergence in probability is uniform and the parameter set is compact (Newey and

Mcfadden, 1994).

To prove consistency, the proof includes three parts:

(i) has a unique maximum at 0

21

(ii) ()− () = (1) at all ∈ Θ(iii) () is stochastically equicontinuous and is continuous on Θ

Condition (i) and to be continuous on Θ are assumed. The proof of condition (ii) is provided

in Lemma 1, and the proof that () is stochastically equicontinuous can be found in Lemma 2.

Proof of Theorem 2. To find out the asymptotic normality of the Partial MLE for spatialbivariate Probit model, we start the proof from mean value theorem. Since

(b) = 0 and by using

the mean value theorem

(b) = 0 =

(0) +2

(∗)(b − 0) (66)

⇒ (b − 0) = −[ 2

(∗)]−1

(0) (67)

where ∗ lies between b and 0First, let us discuss the term

2

(∗) to find out the asymptotic properties of 2

(∗) Recall

that

() =1

X=1

{12(1 1) + 1(1− 2)(1 0)

+(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (68)where (1 1) ≡ log(1 = 1 2 = 1|) etc. Also

2

() =

1

X=1

{122(1 1)

+ 1(1− 2)

2(1 0)

+(1− 1)(2)2(0 1)

+ (1− 1)(1− 2)

2(0 0)

(69)

where2(1 1)

=

−1[Pr(1 = 1 2 = 1|)]2 [

Pr(1 = 1 2 = 1|)

]2

+1

Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

(70)

and all other terms behave similar.

As before, we only discuss one of these terms, and the same logic applies to the other terms. We

know that

1

X=1

[122(1 1)

(∗)]

=1

X=1

12{ −1[Pr(1 = 1 2 = 1|)]2 [

Pr(1 = 1 2 = 1|)

(∗)]2

+1

Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

(∗)} (71)

22

Look at the first term of the above equation given by

1

X=1

12{ −1[Pr(1 = 1 2 = 1|)]2 [

Pr(1 = 1 2 = 1|)

(∗)]2} (72)

Since°°° 1[Pr(1=12=1|)]2°°° ∞, we can write this term as

1

X=1

11[ Pr(1 = 1 2 = 1|)

(∗)]2 (73)

where 11 ≡ 12 −1[Pr(1=12=1|)]2 In order to prove

1

X=1

11[ Pr(1 = 1 2 = 1|)

(∗)]2

→ 1

X=1

11[ Pr(1 = 1 2 = 1|)

(0)]

2 (74)

we need to show that it holds for all kk = 1 Set 11 = and then

{1

X=1

11[ Pr(1 = 1 2 = 1|)

(b)]2

−1

X=1

11[ Pr(1 = 1 2 = 1|)

(0)]

2} (75)

=1

X=1

11{[ Pr(1 = 1 2 = 1|)

(b)]2 − [ Pr(1 = 1 2 = 1|)

(0)]2} (76)

= (b − 0) 2

X=1

11 Pr(1 = 1 2 = 1|)

(∗)×

2 Pr(1 = 1 2 = 1|)

(∗) (77)

From the proof of Theorem 1, we know that sup°°° Pr(1=12=1|) °°° ∞ From Lemma 3,

sup

°°°2 Pr(1=12=1|)

°°° ∞ From Theorem 1, we also know that b − 0 = (1) and hence(b − 0) 2

X=1

11 Pr(1 = 1 2 = 1|)

(∗)×

2 Pr(1 = 1 2 = 1|)

(∗) = (1) (78)

=⇒ {1

X=1

11[ Pr(1 = 1 2 = 1|)

(b)]2 − 1

X=1

11[ Pr(1 = 1 2 = 1|)

(0)]

2} = (1)

(79)

=⇒ 1

X=1

11[ Pr(1 = 1 2 = 1|)

(∗)]2

→ 1

X=1

11[ Pr(1 = 1 2 = 1|)

(0)]

2 (80)

By definition,

lim→∞

1

X=1

11[ Pr(1 = 1 2 = 1|)

(0)]

2 = {11[ Pr(1 = 1 2 = 1|)

(0)]2} (81)

23

and therefore,

1

X=1

12{ −1[Pr(1 = 1 2 = 1|)]2 [

Pr(1 = 1 2 = 1|)

(∗)]2} → (82)

{12 −1[Pr(1 = 1 2 = 1|)]2 [

Pr(1 = 1 2 = 1|)

(0)]2} (83)

Similarly, we can prove in relation to the second term in (71) that

1

X=1

121

Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]

(∗) (84)

→ {12 1Pr(1 = 1 2 = 1|)

2[Pr(1 = 1 2 = 1|)]

(0)} (85)

As usual, we apply repeatedly the above arguments to the other terms. Finally, we can get that

lim→∞

2

(∗)

→ [ 2

(0)] (86)

If we define

≡ {122(1 1)

+ 1(1− 2)

2(1 0)

+(1− 1)(2)2(0 1)

+ (1− 1)(1− 2)

2(0 0)

} (87)

where denotes the Hessian, equation (87) can be rewritten as

lim→∞

1

X=1

(∗)→ lim

→∞[(0)] (88)

Therefore, it remains to show the asymptotic normality of the score term: (0) For the sake

of brevity, redefine the score as: (0) ≡ (0) Then

(0) =1

X=1

{12(1 1)

(0) + 1(1− 2)(1 0)

(0)

+(1− 1)2(0 1)

(0) + (1− 1)(1− 2)(0 0)

(0)} (89)

We need to show that −12 (0)(0)→ (0 ) where () ≡ lim

→∞[()

()] Note that

the information matrix equality does not hold here, i.e. −[(0)] 6= [() ()] because thescore terms are correlated with each other over space. In this part, we follow Pinkse and Slade

(1998) and we use Bernstein’s blocking methods and the McLeish’s (1974) central limit theorem for

dependent processes. First, define ≡ Π=1(1+ ) where 2 = −1 and ( = 1 2) is

24

an array of random variables on the probability triple (Ωz ) is a real number. McLeish’s (1974)central limit theorem for dependent processes requires the following four conditions

(i) {}is uniformly integrable,(ii) → 1(iii)

P=1

2

→ 1(iv)

≤|| → 0

Now we need to define in our case. Let 0 ≡ {√(0)√(0)

} = − 12 P=1 for implicitlydefine In order to prove 0

→ (0 1) we need to establish that the property holds for allkk = 1 using the Cramer-Wold device. As in the proof of Lemma 1 in Theorem 1, we split theregion in which observations are located up to an area of size

√ ×

√ We also know that

increases faster than√ and slower, where and are integers such that = Let

and be constructed such that (√) → 0 Let −12 × 1 uniformly in , for some

fixed 0 12 Let Λ denote the set of indices corresponding to the observations in area . By

assumption a number 0 exists such that (#Λ) Define ≡ − 12P

∈Λ ,and hence we can write 0 =

P=1

Now we are ready to discuss the four conditions for Mcleish’s (1974) central limit theorem. First,

look at condition (iv), which requires that ≤

|| = (1)

≤

|| =≤

|− 12X∈Λ

| (90)

Since by assumption

(#Λ) ⇒≤

|− 12X∈Λ

| ≤ × − 12 sup kk (91)

where # denotes the number of objects, by definition we have that

{√(0)p(0)

} = − 12X

=1

= 1√

0{12(1 1)

(0) + 1(1− 2)(1 0)

(0)

+(1− 1)2(0 1)

(0) + (1− 1)(1− 2)(0 0)

(0)} (92)

Since(0) is positive definite, (0)−12 is bounded as →∞ and we have that sup kk ∞ by

assumption (vi) in Theorem 1. We have also proved that sup°°°(11) °°° ∞ in Lemma 2. Therefore,

we are able to prove that sup kk ∞. Then × − 12 sup kk = ( × − 12 ) = (1) byconstruction of . Hence we can get that

≤|| = (1)

Second, let us discuss condition (i): {} is uniformly integrable. Following Davidson (1994),if a random variable is integrable, the contribution to the integer of extreme random variable values

must be negligible. In other words, if | | ∞ (||1| |) → 0 as → ∞ it is

25

equivalent to say [sup || ] = 0 for some 0 as →∞ Here we follow the proof ofLemma 10 in Pinkse and Slade (1998). We have that

[sup

|| ] = [sup

|Π=1(1 + )| ] (93)

≤ [sup

|Π=1(q1 + 22)| ] (94)

= { [sup

|Π=1(q1 + 22)| |( sup

|| ≤ )]× [sup || ≤ ]

+ [sup

|Π=1(q1 + 22)| |( sup

|| )]× [sup || ]} (95)

≤ { [sup

|Π=1(q1 + 22)| |( sup

|| ≤ )] + [sup || ] (96)

where is a uniform upper bound toP

∈Λ Therefore,

[sup || ] = [sup |− 12X∈Λ

| ] (97)

= [sup−12

X∈Λ

|| ] ≤ [sup− 12 X∈Λ

|| ] = 0 (98)

since − 12 1 and by construction of Then,

[sup

|Π=1(q1 + 22)| |( sup

|| ≤ )] ≤ [sup

|(1 + 2−22)2 | ] = 0 (99)

provided we set sufficiently large. Therefore, we proved that [sup || ] = 0⇒ {}is uniformly integrable.

Third, condition (ii) requires that → 1 which is equivalent to say that − 1 = (1);see proof in Lemma 4.

Fourth, in order to prove (iii):P

=12

→ 1 by Lemma 8, P=12 − 1 = P=1(2) −1 + (1) and

X=1

(2)− 1 + (1) = ( 20)− 1−X6=

() + (1) = (1) (100)

by construction of 0 since ( 20) = 1 It remains to show thatP

6= () = (1) Thiscondition is proved in Lemmas 5-76.

6Lemmas 5-8 are along the lines of those in Pinkse and Slade (1998), which are a simplified version of the proofsin Davidson (1994).

26

7.2 Technical Lemmas

The proofs of Theorems 1-2 require the use of the following Lemmas 1-8.

Lemma 1: Under the assumptions in Theorem 1, ()− () = (1) for all ∈ ΘProof: we can rewrite () as

() =1

X=1

{12[(1 1)− (1 0)− (0 1) + (0 0)]

+1[(1 0)− (0 0)] + 2[(0 1)− (0 0)] + (0 0)} (101)

Since we assume that lim→∞

[ ()] exists, and by definition () ≡ lim→∞

[ ()] this implies

that: () − [ ()] = (1). In order to prove () − () = (1) we only need to showthat () − [ ()] = (1) That is equivalent to prove that the distance between () and[ ()] is infinitely small as → ∞ That is: k ()−[ ()]k2 → 0 as → ∞ and bydefinition, it is equivalent to [ ()]→ 0 as →∞It is easy to see that

[ ()] =

1

2

X=1

X=1

{11(12 12) + 212(12 1) + 213(12 2)

+22(1 1) + 223(1 2) + 33(2 2) (102)

where 1 = [(1 1) − (1 0) − (0 1) + (0 0)] 2 = [(1 0) − (0 0)] and 3 =[(0 1)− (0 0)] The same definition applies to 1 2and 3Note that here

(1 1) = log{Z ∞−2

Φ(1 + 12p

(1))(

2p (2)

)2} (103)

which is not a function of or Hence 1 is not a function of or The same logic applies to

the other terms (2 3 1 2 and 3). Since we can set 1 ≤ (1 1) ≤ 2 for finite 1 and2 the same applies to (1 0) (0 1) and (0 0) We normalize to 0 ≤ (1 1) ≤ 1 Therefore,it is easy to see that || ≤ 2 and the same || and hence || ≤ 4 = 1 2Therefore, we can write

| [ ()]| =1

2

X=1

X=1

{4(12 12) + 8(12 1) + 8(12 2)

+4(1 1) + 8(1 2) + 4(2 2) (104)

27

In the previous equation, firstly, let us look at the term 12

P=1

P=1

4(1 1)

1

2

X=1

X=1

4(1 1) ≤ 12

X=1

X=1

4|(1 1)| ≤ 42

X=1

X=1

() (105)

by assumption (vii). Therefore, we need to prove that

4

2

X=1

X=1

() = (1) as →∞ (106)

Following Pinkse and Slade (1998), we also use the Bernstein’s (1927) blocking method to prove

this as follows. We split the region in which observations are located up to an area of size

1√ × 2

√ We also know that increases faster than

√ and slower, where and are

integers such that = Without loss of generality, we assume 1 = 2 = 1, and let and be

constructed such that (√) → 0 Let − 12 × 1 uniformly in , for some fixed 0 12

By construction of (−12 ) = (1) Then we are able to apply the same idea to our case. In

our case, the groups and take the role of and where one grows faster and the other grows

slower than√We also know the is the distance between |−|. So we can find an upper bound

for |−| as the maximum between group and . Let us suppose that is the one that grows fasterthan

√ and is the one that grows slower than

√. Then we can cancel one of the summations

corresponding to with −1 Moreover, since grows faster than√ but slower than −1 one way

is to define

√P

=1

() as the one that grows faster than√ but slower than in such a way that

X=1

X=1

() = (1

√X

=1

()) (107)

Finally,

√P

=1

() grows slower than and therefore, ( 1

√P

=1

()) = (1) So, we can get

1

2

X=1

X=1

4(1 1) ≤ 42

X=1

X=1

() = (1) (108)

We can apply the same logic to 12

P=1

P=1

8(1 2) and 12P

=1

P4

=1

(2 2) Let us consider

12

P=1

P=1

4(12 12) If we define = 12 and = 12 we can apply the same logic

to prove that 12

P=1

P=1

4( ) ≤ 42P

=1

P=1

() = (1) Therefore, we are able to show that

k ()−[ ()]k2 ≤ | [ ()]| ≤ 362

X=1

X=1

() = (1) (109)

28

Hence, ()−[ ()] = (1) =⇒ ()− () = (1) at all ∈ Θ

Lemma 2 Under the assumptions in Theorem 1, ()− () is stochastically equicontinuous.Proof: The proof requires only to show that () is stochastically equicontinuous because

() is continuous by assumption (iii). We have that

()−³e´ = 1

X=1

{12[(1 1 )− (1 1e)]+1(1− 2)[(1 0 )− (1 0e)]+(1− 1)2[(0 1 )− 01(0 1e)]+(1− 1)(1− 2)[(0 0 )− (0 0e)]} (110)

By the mean value theorem

()−³e´ = 1

X=1

{12[(1 1)

(∗)( − e)] + 1(1− 2)[(1 0)

(∗)( − e)]+(1− 1)2[(0 1)

(∗)( − e)] + (1− 1)(1− 2)[(0 0)

(∗)( − e)]} (111)

where ∗ lies between and e In order to prove () is stochastically equicontinuous, it is sufficientto show that

sup∈Θ|1

X=1

12(1 1)

()| = (1) (112)

and the same requirement applies to other terms. For simplicity issues we just prove one of them

and the rest follow the same argument. Recall that

(1 1) ≡ log(1 = 1 2 = 1|) (113)

and note that (1 = 1 2 = 1|) = Φ2( 1√Ω11

2√Ω22

|) where Φ2 is the bivariate normaldistribution function. Also

(1 1)

=

[logΦ2(1√Ω11

2√Ω22

)]

(114)

and since ≡ ( )(1 1)

() =

((11)

()

(11)

()

) (115)

We focus first on (11)

() where

(1 1)

=

[logΦ2(1√Ω11

2√Ω22

)]

=

11√Ω11

+ 22√Ω22

Φ2(1√Ω11

2√Ω22

) (116)

29

with

1 = (1pΩ11

)Φ(( 2√

Ω22− 1√

Ω11)p

1− 2) (117)

2 = (2pΩ22

)Φ(( 1√

Ω11− 2√

Ω22)p

1− 2) (118)

By assumption (v)

sup

°°°°°° 1Φ2( 1√Ω11

2√Ω22

)

°°°°°° = sup°°°° 1Pr(1 = 1 2 = 1|)

°°°° ∞ (119)and it is easy to see that

°°°° 11√Ω11 + 22√Ω22°°°° ∞ provided that sup(kk) ∞ Therefore,

sup

°°°°(1 1) ()°°°° ∞ (120)

We now discuss the second term (1=12=1|)

() where

(1 1)

=

[logΦ2(1√Ω11

2√Ω22

)]

(121)

=2(

1√Ω11

2√Ω22

)

Φ2(1√Ω11

2√Ω22

)×

2(1√Ω11

2√Ω22

)

(122)

and after some algebra, we can prove that sup

°°°°2( 1√Ω11 2√Ω22 ) °°°° ∞ provided that sup kk ∞Therefore, it easy to see when sup

°°°(11) °°° ∞ and sup °°°(11) °°° ∞ we can getsup

°°°°(1 1)°°°° ∞. (123)

We apply the same logic to the other terms, and we can prove that sup°°°(10)

()°°°, sup °°°(01) ()°°°

and sup°°°(00)

()°°° are also bounded.

Therefore, finally sup∈Θ| 1

P=1 12

(11)

()| = (1) given sup(kk) = (1) and hence we

can prove that ()− () is stochastically equicontinuous.

Lemma 3: Under the assumptions in Theorem 2, sup°°°2 Pr(1=12=1|)

°°° ∞30

Proof: From Lemma 2, we know that

Pr(1 = 1 2 = 1|)

=( 1√

Ω11)Φ(

(2√Ω22

− 1√Ω11

)√1−2

)1pΩ11

+( 2√

Ω22)Φ(

(1√Ω11

− 2√Ω22

)√1−2

)2pΩ22

(124)

=⇒ 2 Pr(1 = 1 2 = 1|)

=1(

1√Ω11

){1 1√Ω11

Φ[(2√Ω22

− 1√Ω11

)√1−2

] + [(2√Ω22

− 1√Ω11

)√1−2

](

2√Ω22

− 1√Ω11

)√1−2

}pΩ11

+2(

2√Ω22

){2 2√Ω22

Φ[(1√Ω11

− 2√Ω22

)√1−2

] + [(1√Ω11

− 2√Ω22

)√1−2

](

1√Ω11

− 2√Ω22

)√1−2

}pΩ22

(125)

and even though the above expression is complicated, it is easy to see that all the terms are bounded

provided the assumptions in Theorem 2 hold. This is equivalent to

sup

°°°°2 Pr(1 = 1 2 = 1|)°°°° ∞ (126)

Pr(1 = 1 2 = 1|)

=2(

1√Ω11

2√Ω22

)

Φ2(1√Ω11

2√Ω22

)×

2(1√Ω11

2√Ω22

)

(127)

=⇒ 2 Pr(1 = 1 2 = 1|)

2

=(2)2(Φ2 − 22)(Φ2)2

+2Φ2

222

(128)

It is easy to see that the first term of the above equation is bounded from previous results (i.e.

sup

°°°2 °°° ∞) and the second term can be also proved bounded since 222 can be proved tobe bounded given that sup kk ∞ after some algebra. Hence sup

°°°2 Pr(1=12=1|)

°°° ∞

Lemma 4. Under the assumptions in Theorem 2, −1 = (1)where ≡ Π=1(1+)Proof: By definition, = Π

=1(1 + ) = −1 + −1 By repeatedly multi-

plying out, we finally get = 1 + P

=1 −1 Hence,

− 1 = (X=1

−1) (129)

31

In order to prove − 1 = (1) we just need to show that: (P

=1 −1) = (1)This is equivalent to prove that (−1) = (−1 ) We can rewrite −1 as −1 = Π

−1=1(1 +

)We know there are − 1 groups of in −1We split these − 1 groups into two parts:groups adjacent to group , and groups that are not adjacent to group . We then define the area

Ξ−1 as the area which is adjacent to group . Therefore, −1 = Π∈Ξ−1(1+ )Π∈Ξ−1(1+) = Π∈Ξ−1(1 + )where ≡ Π∈Ξ−1(1 + ) which includes the groupswhich are not adjacent to group .

Since −1 = Π∈Ξ−1(1 + )we just need to prove

[(Π∈Ξ−1(1 + ))] = [(Π∈Ξ−1(1 + ))] = (−1 ) (130)

We know that

[(Π∈Ξ−1(1 + ))] = [(1 + X

∈Ξ−1−1)] (131)

= [] +[(X

∈Ξ−1−1)] (132)

First, we look at the term [] Since ≡ Π∈Ξ−1(1 + ) that means thegroup is not adjacent to group . By Bernstein’s method, we split the region in such a way that

the distance between group and non-adjacent group is at least 12 Hence, |[]| =

|( ) = (√) provided () = 0 and by assumption (vi) in Theorem 1. By

construction of and (√) = (1) and hence we obtain |[]| = (−1 )

Second, we look at the term [(P

∈Ξ−1 −1)] We have that

[(X

∈Ξ−1−1)] =

X∈Ξ−1

[Π∈Ξ−1)] (133)

Consider [)] first. We know that [)] = ( ) provided

() = 0 Since ( ) → ( ) as → ∞ because gets more andmore terms (all groups not adjacent to group ), while keeps the same amount. In the first

step, we have proved that ( ) = (−1 ) and by the same argument ( ) =(−1 )Therefore, we can prove that (−1) = (−1 )⇒ − 1 = (1)

Lemma 5. Under the assumptions in Theorem 2,P

6= () = (1)Proof: We know that

P6= () =

P=1

P=1()−

P= () = (1) if

we can show that P

=1 |()| = (−1 ) This is equivalent to proveP

6= () =(1) because the summation over contains − 1 terms.

32

Define Ξ as the set of indices corresponding to blocks that have blocks removed from every

direction from block . In other words, we assume there are no more than 8 blocks within distance

Hence,

X=1

|()| ≤ √X

=1

X∈Ξ

|()| (134)

≤ X∈Ξ

|()|+√X

=2

X∈Ξ

|()| (135)

The first term is proved to be (−1) = (−1 ) in Lemma 6. The second term can be alsoproved to be (−1 ) in Lemma 7.

Lemma 6: Under the assumptions in Theorem 2, P

6= |()| = (−1) = (−1 )Proof: Since = −

12

P∈Λ by definition

X6=|()| = ∈|−1

X∈Λ∈Λ

()| (136)

≤ ∈1−1X

∈Λ∈Λ() (137)

because () = ( ) = 1() where 1 0

To compute the upper bound of the correlation between and , we just need to consider the

strongest case, e.g. the and are adjacent each other. By Bernsteins’ blocking method, the number

of ( ) combinations that are within distance is bounded by 2√

2 where 2 0 Hence we

can get

∈1−1X

∈Λ∈Λ() ≤ 3∈−1

p

4√X

=0

2() (138)

where 3 = 12 4 0

By assumption (ii) in Theorem 2, 2()→ 0 as →∞ Therefore,

3∈−1p

4√X

=0

2() = (−1) (139)

Since = by construction, (−1) = (−1 )

Lemma 7: Under the assumptions in Theorem 2, P√

=2

P∈Ξ |()| = (−1 )

33

Proof: Because ∈Ξ ×∈Λ ×∈Λ|() = (√( − 1)) we have that

√X

=2

X∈Ξ

|()| ≤ 5√X

=2

#Ξ−1 ×#Λ ×#Λ(

p( − 1)) (140)

≤ 6−12√X

=1

(p) = (

−1

√X

=1

() = (−1) (141)

= (−1 ) (142)

where # denotes the number of objects, and (−1P√

=1 () = (−1) follows from assumption

(i): as →∞ 2(∗)(∗) = (1).

Lemma 8: Under the assumptions in Theorem 2,P

=12 =

P=1(

2) + (1)

Proof: In order to proveP

=12 =

P=1(

2) + (1) it suffices to show that

X=1

X=1

(2 2) = (1) (143)

We have thatX=1

X=1

(22) =

X=1

X=1

{[2 −(2)][2 −(2)]} (144)

≤ 78√X

=0

( + 1)(p)(

4) (145)

where 7, 8 0 are large enough. Also

(4) ≤ −2X

1234∈Λ|[1 2 3 4] (146)

≤ 9−2X

1234∈Λ{(12) + + (34)} (147)

≤ 10−2X

12∈Λ{(12)} (148)

≤ 11−22X

1∈Λ

12√X

=0

() = (−23) (149)

where 9 10 11 12 0 |P∞

=0 ()| ∞ Therefore finally

7

8√X

=0

( + 1)(p)(

4) = (

−23) = (1) (150)

because = and −12 → 0 as →∞

34

Finally, the following Lemma 9 generalizes Pinkse and Slade (1998) results as a way to obtain

consistent estimates of the variance covariance matrix.

Lemma 9: If assumptions in Theorem 2 hold, and sup°°Φ4

+ Φ3

°° ∞ then (b)−(0) =(1) and (b) −(0) = (1); where () ≡ [() ()] and () ≡ −[()]Proof: First, we prove that (b)−(0) = (1) We know that (b) = − 1P=1(b) and

by definition, lim→∞

(0) = (0) So we just need prove that {(b)− lim→∞

(0)} = (1) forall kk = 1 From the proof of Theorem 2, we have already proved that

1

X=1

(b)→ 1

X=1

(0) (151)

as → ∞ provided that b − 0 = (1) which is proved in Theorem 1. Therefore, we can get(b)−(0) = (1)Second, we consider how to show (b)− (0) = (1) As before, it is sufficient to show that

(b)−(0) = (1) as →∞We know that (0) = [(0) (0)] = ((0)) given(0) = 0 Recall from the proof of Theorem 2 that

(0) =1

X=1

{12(1 1)

(0) + 1(1− 2)(1 0)

(0)

+(1− 1)2(0 1)

(0) + (1− 1)(1− 2)(0 0)

(0)} (152)

and we can rewrite it as

(0) =1

X=1

{12[(1 1)

(0)− (1 0)

(0)− (0 1)

(0) +(0 0)

(0)]

+1[(1 0)

(0)− (0 0)

(0)] + 2[

(0 1)

(0)− (0 0)

(0)]

+(0 0)

(0)} (153)

For the sake of brevity, we redefine

1 ≡ [(1 1)

(0)− (1 0)

(0)− (0 1)

(0) +

(0 0)

(0)] (154)

2 ≡ [(1 0)

(0)− (0 0)

(0)] (155)

3 ≡ [(0 1)

(0)− (0 0)

(0)] (156)

4 ≡(0 0)

(0) (157)

35

Therefore,

((0)) = −1(0)

= −2X

=1

X=1

{11(12 12) + 212(12 1)

+213(12 2) + 22(1 1)

+223(1 2) + 33(2 2) (158)

where 1 2 3 are defined similarly as 1 2 3

As before, we just need to provide the proof for one of these terms, and the same logic applies to

other terms. We consider the most complicated term and the rest follow the same argument

−1X

=1

X=1

[11(12 12)]

= −1X

=1

X=1

11[(1212)−(12)(12)] (159)

(1212) = Pr(1 = 1 2 = 1 1 = 1 2 = 1|) (160)= Φ4(1 2 1 2 12 13 14 23 24 34) (161)

where Φ4 is the cdf for the quadvariate standard normal distribution, 1 =1√

(1)etc. Similarly,

(12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 12) (162)(12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 34) (163)

and therefore,

(12)(12) = Φ2(1 2 12)×Φ2(1 2 34) (164)= Φ4(1 2 1 2 12 0 0 0 0 34) (165)

so we can write the first term as

(0) = −1

X=1

X=1

11[(1212)−(12)(12)] (166)

= −1X

=1

X=1

11[Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))

−Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))] (167)Similarly, we can write the first term of (b) as

−1X

=1

X=1

11[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))] (168)

36

By the mean value theorem, the first term of (b) −(0) is given as−1

X=1

X=1

11{[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))]−[Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))−Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))]} (169)

= −1(b − 0) X=1

X=1

11{Φ4(1 2 1 2 12(

∗) 13(∗) 14(

∗) 23(∗) 24(

∗) 34(∗)

−Φ4(1 2 1 2 12(∗) 0 0 0 0 34(

∗))

} (170)

Since sup°°1°° ∞ by the proof in Theorem 2, we just need to assume

sup

°°°°Φ4(1 2 1 2 12(∗) 13(∗) 14(∗) 23(∗) 24(∗) 34(∗)°°°° ∞ (171)

and the same argument applies to

sup

°°°°Φ4(1 2 1 2 12(∗) 0 0 0 0 34(∗))°°°° ∞ (172)

so that

−1(b − 0) X=1

X=1

11{Φ4(1 2 1 2 12(

∗) 13(∗) 14(

∗) 23(∗) 24(

∗) 34(∗)

)

−Φ4(1 2 1 2 12(∗) 0 0 0 0 34(

∗))

}→ 0 (173)

because (b − 0)→ 0 and the other terms are bounded.Repeat the proofs to the other terms, plus the new assumption about sup

°°Φ3

°° ∞ and thenwe can prove (b) −(0) = (1)

37

8 Appendix 2

TABLE 1∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

= 02 = 04 = 06 = 08

HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 3.938 0.514 6.177 0.519 7.698 0.571 7.735 0.634

bias 3.738 0.314 5.777 0.319 7.098 -0.029 6.935 -0.166(s.d.) (12.158) (0.120) (15.776) (0.205) (16.929) (0.151) (16.202) (0.289)

= 1000 mean 3.174 0.512 4.668 0.518 5.456 0.581 5.914 0.672bias 2.974 0.312 4.268 0.118 4.856 -0.019 5.114 -0.128(s.d) (8.844) (0.107) (9.100) (0.133) (9.631) (0.149) (10.173) (0.276)

= 1500 mean 2.746 0.511 4.050 0.507 4.872 0.609 5.426 0.708bias 2.546 0.311 3.650 0.107 4.272 0.009 4.626 -0.092(s.d.) (6.423) (0.099) (7.414) (0.124) (8.598) (0.149) (8.514) (0.253)

∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).

38

TABLE 2∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

1= 1 2= 1 3= 1

HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 5.322 2.618 5.333 2.619 5.329 2.623

(s.d.) (8.844) (0.839) (8.872) (0.855) (8.863) (0.870) = 1000 mean 5.308 2.616 5.296 2.616 5.289 2.618

(s.d) (7.612) (0.560) (7.570) (0.560) (7.568) (0.564) = 1500 mean 5.247 2.604 5.239 2.602 5.235 2.604

(s.d.) (6.624) (0.540) (6.606) (0.536) (6.613) (0.543) = 04 = 500 mean 3.610 1.329 3.614 1.329 3.608 1.328

(s.d.) (5.305) (0.362) (5.311) (0.365) (5.290) (0.366) = 1000 mean 3.600 1.318 3.593 1.316 3.588 1.315

(s.d.) (4.192) (0.355) (4.177) (0.355) (4.178) (0.353) = 1500 mean 3.456 1.281 3.441 1.281 3.438 1.278

(s.d.) (3.818) (0.342) (3.793) (0.343) (3.798) (0.339) = 06 = 500 mean 2.898 0.972 2.876 0.966 2.885 0.969

(s.d.) (3.761) (0.271) (3.723) (0.268) (3.735) (0.271) = 1000 mean 2.669 0.981 2.669 0.979 2.657 0.978

(s.d.) (2.951) (0.261) (2.953) (0.261) (2.916) (0.259) = 1500 mean 2.508 1.016 2.499 1.015 2.501 1.016

(s.d.) (2.726) (0.250) (2.706) (0.250) (2.708) (0.253) = 08 = 500 mean 2.246 0.805 2.237 0.801 2.249 0.802

(s.d.) (2.810) (0.373) (2.803) (0.373) (2.841) (0.392) = 1000 mean 2.098 0.843 2.096 0.843 2.082 0.843

(s.d.) (2.281) (0.349) (2.279) (0.349) (2.246) (0.340) = 1500 mean 2.086 0.884 2.096 0.886 2.094 0.886

(s.d.) (2.059) (0.316) (2.071) (0.314) (2.073) (0.318)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the

Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard

deviations (s.d.).

39

TABLE 3∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

= 02 = 04 = 06 = 08

HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 2.151 0.381 2.575 0.667 2.491 0.970 2.876 1.202

bias 1.951 0.181 1.175 0.267 1.891 0.370 2.076 0.402(s.d.) (4.630) (0.844) (5.073) (0.923) (4.996) (0.913) (6.213) (0.966)

= 1000 mean 1.013 0.356 1.089 0.606 1.307 0.863 1.660 1.160bias 0.813 0.156 0.689 0.206 0.707 0.263 0.860 0.360(s.d) (2.131) (0.606) (2.241) (0.622) (2.424) (0.671) (2.675) (0.813)

= 1500 mean 0.684 0.324 0.792 0.592 0.906 0.860 1.305 1.156bias 0.484 0.124 0.392 0.192 0.306 0.260 0.505 0.356(s.d.) (1.508) (0.484) (1.566) (0.515) (1.611) (0.601) (1.910) (0.706)

∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).

40

TABLE 4∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.

1= 1 2= 1 3= 1

HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 1.043 1.021 1.042 1.020 1.042 1.020

(s.d.) (0.120) (0.109) (0.114) (0.104) (0.122) (0.108) = 1000 mean 1.017 1.006 1.022 1.011 1.016 1.005

(s.d) (0.079) (0.071) (0.078) (0.071) (0.079) (0.070) = 1500 mean 1.012 1.004 1.013 1.005 1.010 1.002

(s.d.) (0.065) (0.059) (0.063) (0.058) (0.064) (0.057) = 04 = 500 mean 1.043 1.017 1.042 1.018 1.043 1.019

(s.d.) (0.125) (0.112) (0.112) (0.105) (0.119) (0.110) = 1000 mean 1.017 1.005 1.020 1.008 1.014 1.002

(s.d.) (0.079) (0.072) (0.080) (0.072) (0.077) (0.069) = 1500 mean 1.014 1.005 1.013 1.004 1.009 1.000

(s.d.) (0.065) (0.058) (0.061) (0.056) (0.062) (0.056) = 06 = 500 mean 1.046 1.022 1.045 1.022 1.043 1.020

(s.d.) (0.123) (0.107) (0.126) (0.111) (0.126) (0.108) = 1000 mean 1.019 1.006 1.020 1.007 1.015 1.003

(s.d.) (0.083) (0.075) (0.084) (0.075) (0.079) (0.071) = 1500 mean 1.010 1.002 1.012 1.004 1.010 1.002

(s.d.) (0.066) (0.060) (0.065) (0.060) (0.063) (0.059) = 08 = 500 mean 1.035 1.013 1.036 1.014 1.037 1.015

(s.d.) (0.125) (0.115) (0.125) (0.111) (0.125) (0.115) = 1000 mean 1.016 1.003 1.017 1.005 1.016 1.004

(s.d.) (0.083) (0.086) (0.082) (0.090) (0.084) (0.088) = 1500 mean 1.011 1.000 1.012 1.001 1.012 1.001

(s.d.) (0.065) (0.058) (0.064) (0.057) (0.064) (0.056)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the

Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard

deviations (s.d.).

41

References

[1] Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent Covariance

Matrix Estimation”, Econometrica 59, 3, 817-858.

[2] Amemiya, T. (1973): “Regression Analysis when the Dependent Variable is Truncated Normal”,

Econometrica 41, 997-1016.

[3] Anselin, L. and Florax, R. J. G. M. (1995): “New Direction in Spatial Econometrics”, Springer-

Verlag, Berlin, Germany.

[4] Anselin, L. Florax, R. J. G. M., and Rey, J. S. (2004): “Econometrics for Spatial Models: Recent

Advances”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany,1-28.

[5] Beron, K. J. and Vijverberg, W. P. (2003): “Probit in a Spatial Context: A Monte Carlo

Approach”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany, 169-196.

[6] Bernstein, S. (1927): “Sur l’Extension du Theoreme du Calcul des Probabilities aux Sommes de

Quantities Dependantes”, Mathematische Annalen 97, 1-59.

[7] Case, A. C. (1991): “Spatial Patterns in Household Demand”, Econometrica 59, 953-965.

[8] Conley, T. G. (1999): “GMM Estimation with Cross Sectional Dependence”, Journal of Econo-

metrics 92, 1-45.

[9] Davidson, J. (1994): “Stochastic Limit Theory”, Oxford: Oxford University Press.

[10] Gourieroux, C. (2000): “Econometrics of Qualitative Dependent Variables”, Cambridge Univer-

sity Press.

[11] Kelejian, H. H. and Prucha, I. R. (1999): “A Generalized Moments Estimator for the Autore-

gressive Parameter in a Spatial Model”, International Economic Review 40, 509-533.

[12] Kelejian, H. H. and Prucha, I. R. (2001): “On the Asymptotic Distribution of the Moran I Test

Statistic with Applications”, Journal of Econometrics 104, 219-257.

[13] Lee, L.-F. (2004): “Asymptotic Distribution of Quasi-maximum Likelihood Estimators for Spa-

tial Autoregressive Models”, Econometrica 72, 6, 1899-1925.

[14] Lesage, J. P. (2000): “ Bayesian Estimation of Limit Dependent Variable Spatial Autoregressive

Models”, Geographical Analysis 32, 19-35.

[15] McLeish, D. L. (1974): “Dependent Central Limit Theorems and Invariance Principals”, Annals

of Probability 2, 620-628.

42

[16] McMillen, D. P. (1992): “Probit with S

partial maximum likelihood estimation of spatial probit ......march 18, 2010 abstract this paper...

Documents