partial maximum likelihood estimation of spatial probit ......march 18, 2010 abstract this paper...
TRANSCRIPT
-
Partial Maximum Likelihood Estimation of Spatial Probit
Models∗
Honglin Wang†
Michigan State University
Emma M. Iglesias‡
Michigan State University
and University of Essex
Jeffrey M. Wooldridge§
Michigan State University
March 18, 2010
Abstract
This paper analyzes spatial Probit models for cross sectional dependent data in a binary
choice context. Observations are divided by pairwise groups and bivariate normal distributions
are specified within each group. Partial maximum likelihood estimators are introduced and
they are shown to be consistent and asymptotically normal under some regularity conditions.
Consistent covariance matrix estimators are also provided. Estimates of average partial effects
can also be obtained once we characterize the conditional distribution of the latent error. Fi-
nally, a simulation study shows the advantages of our new estimation procedure in this setting.
Our proposed partial maximum likelihood estimators are shown to be more efficient than the
generalized method of moments counterparts.
Keywords: Spatial statistics, Maximum Likelihood, Probit Model.JEL classification: C12, C13, C21, C24, C25.
∗We are grateful to participants at the 2009 Econometric Society European Meeting and at seminars at CityUniversity, London, London School of Economics, UCL, University of Exeter and University Carlos III for very useful
comments. Any remaining errors are our own.†Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,
USA. e-mail: [email protected].‡Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,
USA. e-mail: [email protected]. Financial support from the MSU Intramural Research Grants Program is gratefullyacknowledged.
§Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038,USA. e-mail: [email protected].
1
-
1 Introduction
Most econometrics techniques on cross-section data are based on the assumption of independence of
observations. However, economic activities become more and more correlated over space with modern
communication and transportation improvements. On the other hand, technological advances in
communications and the geographic information system (GIS) make spatial data more available
than before. Spatial correlations among observations received more and more attentions in regional,
real estate, agricultural, environmental and industrial organizations economics (Lee, 2004).
Econometricians began to pay more attention on spatial dependence problems in the last two
decades and some important advances have been done in both theoretical and empirical studies1.
Spatial dependence not only means lack of independence between observations, but also a spatial
structure underlying these spatial correlations (Anselin and Florax, 1995). There are two ways to
capture spatial dependence by imposing structures on a model: one is in the domain of geostatistics
where the spatial index is continuous (Conley, 1999), the other is that spatial sites form a countable
lattice (Lee, 2004). Among the lattice models, there are also two types of spatial dependence models
according to spatial correlation between variables or error terms: the spatial autoregressive dependent
variable model (SAR) and the spatial autoregressive error model (SAE). In most applications of
spatial models, the dependent variables are continuous (Conley, 1999; Lee, 2004; Kelejian and Prucha,
1999, 2001; among others), and only few applications address the spatial dependence with discrete
choice dependent variables (exceptions include: Case, 1991; McMillen, 1995; Pinkse and Slade, 1998;
Lesage, 2000; Beron and Vijerberg, 2003). This paper is designed to address this gap and we are
concerned about the SAE model with discrete choice dependent variables.
As the name indicates, there are two aspects in the discrete choice model with spatial dependence.
First, the dependent variable is discrete and the leading cases occur where the choice is binary. Probit
and Logit are the two most popular non-linear models for binary choice problems. For the sake of
brevity, in this study we focus on Probit model, but the approach developed here generalizes to other
discrete choice models.
In discrete choice models, if the observations are independent, we use maximum likelihood estima-
tion to get efficient estimators given the correct conditional distribution of dependent variables. The
nice part of the maximum likelihood estimator (MLE) is that we can still get consistency, asymptotic
normality but inefficient estimators in many situations (panel data or clustering) by pseudo MLE
even when we ignore certain dependence among observations (Poirier and Rudd, 1988). However, the
non-linear property causes computation difficulties in estimation, and this computational difficulty
becomes much worse when dependence occurs, which results in solving -dimensional integration.
Dependence is the other aspect of this problem. General forms of dependence are rarely allowed for
in cross-sectional data, although routinely allowed for in time-series data (Conley, 1999). For example,
some scholars discussed discrete choice models with dependence in time-series data: Robinson (1982)
1Anselin, Florax and Rey (2004) wrote a comprehensive review about econometrics for spatial models.
2
-
relaxed Amemiya (1973) assumptions of independence in Tobit model, and proved that the MLE
with dependent observations is strongly consistent and asymptotically normal under some regularity
conditions. Poirier and Rudd (1988) discussed the Probit model with dependence in time-series
data, and developed generalized conditional moment (GCM) estimators which are computational
attractive and relatively more efficient.
However, dependence in space is more complicated than in the time setting because of four
reasons: first, time is one dimensional whereas space has at least two dimensions; second, time has
natural order (direction) whereas space has no natural direction; third, time is regularly divided
because of regular astronomical phenomena whereas spatial observations are attached to geographic
properties of the surface of the earth; fourth, time-series observations are draws from a continuous
process whereas, with spatial data, it is common for the sample and the population to be the same
(Pinkse et al., 2007).
Therefore, how to deal with dependence in space in estimation is the key to spatial econome-
tricians. Inspired by works about dependence in time-series data, Conley (1999) uses metrics of
economic distance to characterize dependence among agents, and shows that the GMM estimator is
consistent and asymptotically normal under some assumptions similar to time-series data. He also
provides how to get consistent covariance matrix estimator by an approach similar to Newey-West
(1987). Pinkse and Slade (1998) use GMM in the discrete choice setting with the SAE model, and
show that the GMM estimator remains consistent and asymptotically normal under some regularity
conditions. Although Pinkse and Slade (1998) generated generalized residuals from the MLE as the
basis of the GMM estimators, they do not take advantage of information from spatial correlations
among observations, and hence the GMM estimator is much less efficient than full ML estimators.
Lee (2004) examines carefully the asymptotic properties of MLE and quasi-MLE for the linear spatial
autoregressive model (SAR), and he shows that the rate of convergence of those estimators may de-
pend on some general features of the spatial weights matrix of the model. If each units are influenced
by only a few neighboring units, the estimators may have√-rate of convergence and asymptotic
normality; otherwise, it may have lower rate of convergence and estimators could be inconsistent.
In this paper, we choose to capture spatial dependence by considering spatial sites to form a
countable lattice, and we provide a middle-ground approach which trades off efficiency and compu-
tation burdensome. The idea is to divide spatial dependent observations into many small groups
(clusters) in which adjacent observations belong to one group. The implicit rationale behind this
is adjacent observations usually account for the most important spatial correlations between obser-
vations. If we can correctly specify the conditional joint distribution within groups, which allows
us to utilize relatively more information of spatial correlations, estimating the model by partial ML
(PML) will give us consistent and more efficient estimators, which should be generally better than
GMM estimators. However, this approach is subject to biased variance-covariance matrix estimators
because of spatial correlations among groups. To deal with this problem, we follow the methods
proposed by Newey-West (1987) and Conley (1999) to get consistent variance-covariance matrix es-
3
-
timators. Of course, this middle ground approach will not get the most efficient estimator. However,
since information from adjacent observations usually capture important spatial correlations in the
whole sample, we get a consistent and a relatively efficient estimators, and we avoid some tedious
computations at expense of a loss of a relatively small part of efficiency.
This paper is organized as follows. Section 2 reviews some widely applied econometric techniques
on discrete choice models. Also, the SAE model with discrete choice dependent variable is presented
and the regularity conditions are specified. We also show how we can estimate average partial effects
once we characterize the conditional distribution of the latent error. Section 3 presents the bivariate
spatial Probit model. In Section 4, we prove consistency and asymptotic normality of the PML
estimator (PMLE) under regularity assumptions, and discuss how to get consistent covariance matrix
estimators. Section 5 presents a simulation study showing the advantages of our new estimation
procedure in this setting. Finally, Section 6 concludes. The proofs are collected in Appendix 1, while
the results for the simulation study are provided in Appendix 2.
2 Discrete Choice Models with Spatial Dependence
2.1 A Probit Model without Dependence
We first review the standard Probit model without dependence and the underlying linear latent
variable model is
∗ = + (1)
where ∗ is the latent dependent variable and a scalar, is a 1× vector of regressors, is a×1parameter vector to be estimated, and is a continuous random variable, independent of and
it follows a standard normal distribution. However, we cannot observe ∗ , and we can only observethe indicator , which is related to ∗ as follows
=
(1 if ∗ 00 if ∗ ≤ 0
) (2)
Therefore, we can get the conditional distribution of given as
( = 1|) = ( ∗ 0|) = ( −|) = Φ() (3)
where Φ denotes the standard normal cumulative distribution function (cdf). It is easy to see we can
get
( = 0|) = 1−Φ() (4)Since is a Bernoulli random variable, we can write the conditional density function of
conditional on as
(|) = [Φ()][1−Φ()]1− ∈ {0 1} (5)
4
-
Also, given the independence assumption of random variables, the log likelihood function can be
written as
() =X=1
{ log[Φ()] + (1− ) log[1−Φ()]} (6)
and the sufficient condition for uniqueness of the global maximum of () is that the function is
strictly concave (Gourieroux, 2000). We can solve then b from the first order condition()
=
X=1
−Φ()Φ()[1−Φ()]()
0 = 0 (7)
where is the probability density function (pdf) of the standard normal distribution. However,
the simple closed-form expressions for the MLE are not available because the cdf of the normal
distribution has no close-form solution. So the MLE must be solved by using numerical algorithms2.
In general, we can prove that the conditional MLE is consistent and the most efficient estimator
given some regularity conditions3 such as correctly specifying a parametric model, an identified
and a log-likelihood function that is continuous in .
2.2 A Probit Model with Spatial Dependence
When spatial dependence is present in a binary response model, the dependence not only brings us
the difficulties in estimating parameters, but also raises a question: what is the estimation of interest?
because the magnitudes of the partial effects depend not only on the value of the covariates, but also
on the value of the unobserved heterogeneity caused by the spatial dependence.
For example, a binary response model with spatial correlation in the latent errors is given by
= 1[ + ≥ 0] (8)
where {} is allowed to be spatially correlated. Distance between points is almost always assumedto be exogenous, and so we can think of having marginal distributions
(| ) (9)
where is the matrix of covariates for all observations and is a matrix containing pairwise
distances. The model that just allows spatial correlation in the errors essentially assumes
(| ) = (| ) (10)
and, in fact, the dependence on appears only in the variance. In particular, for probit,
| ∼ (0 ()) (11)2Commonly used numerical solutions are all derived from Newton’s method. (see Gourieroux, 2000 for details).3See details in Wooldridge (2002, page 391).
5
-
where (· ·) is known (in principal) from the spatial autocorrelation assumption. Literature hasfocused on estimating the parameters, and (the latter is the spatial dependence coefficient, a
scalar in practice), but we need to understand that estimating and is not our final goal. We should
pay more attention to estimating average partial effects based on the average structural function,
which is directly correlated to parameters estimating (especially spatial dependence coefficient ).
() = nΦh
p()
io(12)
which is (hopefully) consistently estimated by
−1X=1
Φ
∙̂
q( ̂)
¸ (13)
If we knew the unconditional distribution of , we would not have to use the variance function at
all; but we have the conditional distribution. We proceed now to provide a new method to estimate
the parameters of a spatial probit model.
2.3 A Probit Model with Spatial Error Correlation
Consider the Probit model with spatial error correlation (SAE), where the underlying linear latent
variable model is
∗ = + (14)
= X
=1
+ (15)
where is an element in the spatial weights matrix which can be defined by different spatial
distances such as the Euclidean distance. is the spatial autoregressive error coefficient and we have
a random variable ∼ (0 1). We can write equations (14) and (15) in matrix form as follows
∗ = + (16)
= ( − )−1 (17)
so that the variance-covariance matrix for the model is
Ω ≡ (| ) = [( − )0( − )]−1 (18)
If ∗ is observable, equation (16) becomes a linear function, and we can use the Jacobian trans-formation of into ∗ and write the log likelihood function as
( ) = −2ln(2)− 1
2( ∗ −)00( ∗ −) + ln || (19)
where = − , and then the estimate of can be solved as b = ( 00)−1 00 ∗6
-
However, in practice we cannot observe ∗, and we can only observe , and it implies thatwe have to deal with a non-linear Probit model because of the normal distributional assumption.
Moreover the errors are correlated, and the full likelihood function becomes
= (1 = 1 2 = 2 · · · = ) =1Z
−∞
· · ·Z
−∞
() (20)
() = (2)−2 |Ω|−1− 12 (0Ω−1) (21)
Although theoretically, if we take the first derivatives subject to and the spatial coefficient ,
we obtain
=
{1Z
−∞
· · ·Z
−∞
(2)−2 |( − )0( − )|−12 [0(− )0(− )]}
= 0 (22)
=
{1Z
−∞
· · ·Z
−∞
(2)−2 |( − )0( − )|−12 [0(− )0(− )]}
= 0 (23)
The expression of the first derivatives are quite complicated, but if we have sufficient computational
ability and and are identifiable, we can get consistent and efficient estimates of and by using
numerical methods. However, in practice, it would be a formidable computational task even for a
moderate size sample. We propose in this paper a more attractive procedure in the next sections
that involves to deal with partial maximum likelihood.
2.4 Probit Models with Other Forms of Spatial Correlation
Generally, there is no reason to think that spatial correlation is properly modeled by (15). Other
forms are possible. For example, one might assume that, outside of a certain geographic radius from
a given observation , is uncorrelated with shocks to the outlying regions. So, for example, we
might assume a constant correlation with any unit within a given radius — similar to a random effects
structure for unbalanced panel data.
Alternatively, we may prefer more of a moving average structure, such as
= +
ÃX6=
! (24)
where the are i.i.d. with unit variance. This formulation is attractive because it is relatively easy
to find variances and pairwise correlations, which we will use in the PMLEs described in the next
section. For example,
(| ) = 1 + 2ÃX
6= 2
! (25)
7
-
Clearly, methods that use only the variance in estimation can only identify 2 (but we almost
always think 0, anyway). Pairwise covariances can also be obtained from
( | ) = + + 2Ã X
6=6=
! (26)
Expressions like this for the covariance between different errors are important for applying grouped
PMLE methods.
3 Using Partial MLEs to Estimate General Spatial Probit
Models
Estimating a probit spatial autocorrelation model by full MLE is a prodigious task, although several
approaches have been applied. The EM algorithm can be used (McMillen, 1992), the RIS simulator
(Beron and Vijverberg, 2003), and the Bayesian Gibbs sampler (Lesage, 2000). But each of these
approaches is still computationally burdensome. To combine such approaches with simulation studies,
or to be able to quickly estimate a range of models, is outside the abilities of even current computation
capabilities for even moderate sample sizes.
To get an estimator that is computationally feasible, Pinkse and Slade (1998) proposed using gen-
eralized method of moments (GMM) using information on the marginal distributions of the binary
responses. In particular, the generalized residuals from the marginal probit log likelihood are used to
construct moment conditions for the GMM method. Pinske and Slade show that, under conditions
very similar to those in this paper, the GMM estimator is consistent and asymptotically normal.
The consistent variance-covariance matrix can also be obtained theoretically without a covariance
stationary assumption, although Pinske and Slade do not discuss estimation of the asymptotic vari-
ance. Therefore, the GMM estimator is almost practically useful, but it is fundamentally based on
the marginal probit models. Thus, while a GMM estimator can be obtained that is efficient given
the information on the marginal likelihood, the method throws out much useful information. This
is very similar to just using a pooled probit for time series or panel and ignoring dynamics. We
describe a simplified version of this approach in Section 3.1, which, in effect, uses a heteroskedastic
probit model to estimate the along with any spatial autocorrelation parameter.
Using only the marginal distribution of , conditional on the covariates and weights, likely results
in serious loss of information for estimating both and the spatial autocorrelation parameters.
Our key contribution in this paper is to explore the use of partial maximum likelihood where we
group small numbers of nearby observations and obtain the joint distribution of those observations.
Naturally, these distributions are determined by the fully specified spatial autocorrelation model —
just as we must obtain the implied variance to apply marginal probit methods. Once the covariances
between observations are found as a function of the weights and , we can use that information in
8
-
multivariate probit estimation. Section 3.2 covers the case of where we describe a bivariate probit
approach, with heteroskedasticity and covariance implied by the particular spatial autocorrelation
model. Using a single covariance in addition to the variance seems likely to improve efficiency of
estimation.
3.1 Univariate Probit Partial MLE
One way to estimate the coefficients along with spatial correlation parameters is to derive the
marginal distributions, ( = 1| ) as a function of all of the weights (and the parameters, and , of course). Under the joint normality assumption, the model will be a form of probit with
heteroskedasticity. In particular, given any spatial probit model such that the variances are well
defined, we can find
( = 1| ) = Φ(()) (27)where 2 () = (| ) = (| ) is a function of all weights, , and the spatial correlationparameters, . As is well-known in time series contexts — see, for example, Poirier and Ruud (1988)
or Robinson (1982) — using probit while ignoring the time series correlation leads to consistent
estimation under standard regularity conditions, provided the data are weakly dependent. Thus, it is
not surprising that pooled probit that accounts for the heteroskedasticity in the marginal distribution
is generally consistent for spatially correlated data, too — provided, of course, we limit the amount
of spatial correlation.
The log likelihood can be written generically as
() =X=1
{ log[Φ(())] + (1− ) log[1−Φ(())]} (28)
Assuming that and are identified, and that the conditions in Section 4 hold, the pooled het-
eroskedastic probit is generally consistent and√-asymptotically normal. But, for reasons we dis-
cussed above, it is likely to be very inefficient relative to the full MLE. Further, estimators that use
some information on the spatial correlation across observations seem more promising in terms of
increasing precision.
3.2 Bivariate Probit Partial MLE
We now turn to using information on pairs of “nearby” observations to identify and . There is
nothing special about using pairs; we could use, say, triplets, or even larger groups. But the bivariate
case is easy to illustrate and is computationally quite feasible.
For illustration, assume a sample includes 2 observations, and we divide the 2 observations
into pairwise groups according to the spatial Euclidean distance between them (see Graph 1). In
other words, each group includes two observations, with the idea being that the internal correlation
9
-
between the two observations is more important than external correlations with observations in other
groups. Within a group, the two observations follow a conditional bivariate normal distribution be-
cause error terms are assumed to have a joint normal distribution.
Graph 1: (2 observations =⇒ groups)
Let© ∗ : = 1
ªwhere, for notational simplicity, we assume 2 Observations. Then ∗ =¡
∗1 ∗2
¢follows something similar to a random effects probit model, but there is heteroskedasticity
in the variances and covariances. In group , we have
∗1 = 111 + 212 + + 1 + 1 (29)
∗2 = 121 + 222 + + 2 + 2 = 1 2 (30)
We can rewrite the above equations in latent variable matrix form as
∗1 = 1 + 1 (31)
∗2 = 2 + 2 = 1 2 (32)
where 1 and 2 are 1× vectors of regressors and is a ×1 vector. 1 and 2 are scalars. Ingroup , observation and observation are not only correlated with each other, but also correlated
with other observations over space. Therefore, the variances and covariance between 1 and 2 not
only depend on the weight within group, but also weights with other observations out of the group,
and the parameters, as well. See, for example equation (26).
10
-
It is easy to see that (1|1 ) = (2|2 ) = 0, and the covariance-variance matrix forgroup under assumption (??), is defined as Ω ≡ (| )
(| ) ≡ Ω() ="Ω11 Ω12
Ω21 Ω22
# (33)
where we suppress the dependence on and in what follows for notational simplicity. The variance
terms are the same as in the Pinske and Slade (1998) approach. In our case, the covariance must
also be computed, although this will be the source to improve the precision of the estimates of and
Note here that elements in Ω depend not only on the weight between two observations in group
, but also weights for every observation in the whole sample, because two observations in group
not only correlated with each other, but also correlated with other observations over space. Since we
define two nearby observations as one group, we pick up the corresponding part (Ω) from the whole
covariance-variance matrix (see equation (26)).
Since we cannot observe ∗1 and ∗2, as we discussed in the univariate Probit model, we define
=
(1 if ∗ 00 if ∗ ≤ 0
) (34)
Therefore the conditional bivariate normal distribution of 1 and 2 given is given as
(1 = 1 2 = 1|) = (1 + 1 0 2 + 2 0|) (35)= (1 1 2 2|) = Φ2( 1p
Ω112pΩ22
) (36)
=(1 2)p
(1)p (2)
=Ω12pΩ11Ω22
(37)
where Φ2 is the standard bivariate normal distribution, 2 is the standard density function of the
bivariate normal distribution and is the standardized covariance between two error terms. Esti-
mation in this context is similar to “random effects” probit with 2 observations, although one needs
to properly compute the variances and covariances of the latent errors. We can also use random
matched pairs of even groups of more than two observations. The idea behind is that the more
correlations used, the better.
Given that (1 2) has a joint normal distribution, we can write
1 = 12 + 1 (38)
where
1 =(1 2)
(2) (39)
and 1 is independent of and 2
11
-
Because of the joint normality of (1 2) 1 is also normally distributed with (1) = 0 and
(1) = (1)− 21 (2) (40)
Thus, we can write the conditional distribution of 1 as
(1| 2) ∼ (0 (1)) (41)
Substitute equation (38) back to ∗1 = 1 + 1, and we can get
∗1 = 1 + 12 + 1 (42)
Therefore
(1 = 1| 2) = ΦÃ1 + 12p
(1)
! (43)
The reason we want to find (43) is to retrieve (1 = 1 2 = 1|). Since
(1 = 1 2 = 1|) = (1 = 1|2 = 1)× (2 = 1|) (44)
it is easy to see that (2 = 1|) = Φ( 2√ (2)
), and thus it remains to get (1 = 1|2 = 1)First, since 2 = 1 if and only if 2 −2 and 2 follows a normal distribution and it is
independent of , then the density of 2 given 2 −2 is( 2√
(2))
(2 −2) =( 2√
(2))
Φ( 2√ (2)
) (45)
Therefore,
(1 = 1|2 = 1) = [ (1 = 1| 2)|2 = 1) (46)= [Φ(
1 + 12p (1)
)|2 = 1 ] (47)
=1
Φ( 2√ (2)
)
Z ∞−2
Φ(1 + 12p
(1))(
2p (2)
)2 (48)
and it is easy to see that (1 = 0|2 = 1) = 1 − (1 = 1|2 = 1) because 1 is thebinary variable.
Similarly, we can get
(1 = 1|2 = 0) = 11−Φ( 2√
(2))
Z 2−∞
Φ(1 + 12p
(1))(
2p (2)
)2 (49)
and (1 = 0|2 = 0) = 1− (1 = 1|2 = 0)
12
-
Now we are ready to get (1 = 1 2 = 1|) as follows
(1 = 1 2 = 1|) = 1Φ( 2√
(2))
Z ∞−2
Φ(1 + 12p
(1))(
2p (2)
)2
×Φ( 2p (2)
) (50)
=
Z ∞−2
Φ(1 + 12p
(1))(
2p (2)
)2 (51)
and similarly we can obtain finally
(1 = 0 2 = 1|) = Φ( 2p (2)
)−Z ∞−2
Φ(1 + 12p
(1))(
2p (2)
)2 (52)
(1 = 1 2 = 0|) =Z 2−∞
Φ(1 + 12p
(1))(
2p (2)
)2 (53)
(1 = 0 2 = 0|)= [1− Φ( 2p
(2))]−
Z 2−∞
Φ(1 + 12p
(1))(
2p (2)
)2 (54)
4 Partial Maximum Likelihood Estimation
As we discussed in the introduction, if the observations are independent, we can simplify the mul-
tivariate distribution into the product of univariate distributions, and then the ML estimator can
be obtained easily. However, spatial correlations among observations do not allow the simplification
any more. Under spatial correlation, the situation is kind of similar to the panel data case. In panel
data, we cannot assume independence among observations over different periods for the same person
(or firm), which means we are not likely to specify the full conditional density of given correctly.
Therefore, we need to relax the assumption in the panel data case. The way we deal with the problem
is that if we have a correctly specified model for the density of given we can define the partial
log likelihood function as
∈Θ
X=1
X=1
log (| ) (55)
where (| ) is the density for given for each . The partial log likelihood function worksbecause 0 (the true value) maximizes the expected value of the above equation provided we have
the densities (| ) correctly specified (Wooldridge, 2002).We can apply a similar idea to the spatial Probit model: if we have the bivariate normal densities
2(12| ) correctly specified for each group, we could get a consistent estimator by partialML. However, there are several differences between panel data and spatial dependent data: first, the
panel data model assumes that the cross section dimension () is sufficiently large relative to the
13
-
time dimension ( ), but in spatial data we do not have this assumption. Second, in the panel data
model, we view the cross section observations as independent, while in the spatial data model, even
though we divided the sample into groups, however, we are definitely not assuming independence
among groups. Observations in different groups are still correlated, but the correlations are assumed
to decay as distances become further away. Third, as we discussed before, dependence in space is
more complicated than dependence in time, and we need to assume that the correlations between
groups die out quickly enough as distance goes further away. In short, we need to examine carefully
how the weak law of large numbers (WLLN) and central limit theorem (CLT) can be applied in the
spatial dependent case. We will discuss these issues and provide proofs in the following sections.
Rather than “filling in”, we use a type of asymptotics where area size is increasing. Essentially, we
need the process to be mixing in a spatial sense.
First, we can write the partial log likelihood function as
=X
=1
{12 log(1 = 1 2 = 1|) + 1(1− 2) log(1 = 1 2 = 0|)
+(1− 1)2 log(1 = 0 2 = 1|) + (1− 1)(1− 2) log(1 = 0 2 = 0|)} = 1 2(56)
and for the sake of brevity, we define
(1 1) ≡ log(1 = 1 2 = 1|); (1 0) ≡ log(1 = 1 2 = 0|); (57)(0 1) ≡ log(1 = 0 2 = 1|) and (0 0) ≡ log(1 = 0 2 = 0|) (58)
Therefore, we can rewrite the partial log likelihood function as
=X
=1
{12(1 1)+1(1−2)(1 0)+(1−1)2(0 1)+(1−1)(1−2)(0 0)} (59)
We proceed now to prove the consistency and asymptotic normality of the PMLE.
4.1 Consistency of Bivariate Probit Estimation
Consistent estimators b ≡ (b b)0 are the ones that converge in probability to the true value 0 ≡(0 0)
0, i.e. b −→ 0, as the sample size goes to infinity for all possible true values. In this section,to make the asymptotic arguments formal, we distinguish between the true value, 0, and a generic
parameter value .
In the bivariate probit estimation, the estimator b is defined as: b maximizes () subject to ∈ Θ, where Θ is the parameters set. The objective function () is defined as
() ≡ 1
X=1
{12(1 1) + 1(1− 2)(1 0)
+(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (60)
14
-
i.e, in other words, b = argmax∈Θ
() (61)
Remember that this objective function represents a partial log likelihood, not a fully log likelihood:
we are only using information on the conditional distribution (1 2| ) across the groups .We are not using (1 2 | ) as in a full maximum likelihood setting.The identification condition requires that () is uniquely maximized at the true value 0 where
() is defined as
() ≡ lim→∞
[ ()] (62)
This condition typically holds for well-specified models when there is not perfect collinearity among
the regressors. Following Pinske and Slade (1998), we impose the identification condition in all our
analysis. Further, one needs to be a little careful in parameterizing the spatial autocorrelation, but
standard models of spatial autocorrelation cause no problems.
The following Theorem 1 states the main consistency result for a broad class of spatial probit
models. We define () ≡ () and lim
→∞[ ()] = ()
Theorem 1. If (i) 0 is the interior of a compact set Θ which is the closure of a concave set,
(ii) attains a unique maximum over the compact set Θ at 0(iii) is continuous on Θ , (iv)
the density of observations in any region whose area exceeds a fixed minimum is bounded, (v) as
→∞ sup(°°° 1Pr(1=12=1|) + 1Pr(1=12=0|) + 1Pr(1=02=1|) + 1Pr(1=02=0|)°°°) ∞ (vi)
as →∞ sup(kk+kk) = (1) (vii) sup |( )| ≤ () = 1 2 where denotesthe distance between group and , and () → 0 as → ∞ and (viii) lim
→∞[ ()] exists, (ix)
sup kk ∞, then b − 0 = (1).Proof: Given in Appendix 1.
Condition (i) is a standard assumption from set theory. Condition (ii) is the identification condi-
tion for MLE. Condition (iii) assumes that the function is continuous in the metric space, which
is a reasonable assumption and necessary for the proof that () is stochastically equicontinuous.
Condition (iv) simply excludes that an infinite number of observations crowd in one bounded area.
The minimum area restriction is imposed because an infinitesimal area around a single observation
has infinite density. Condition (v) makes sure any one of these four situations will be present in a
sufficiently large sample. Condition (vi) makes sure the regressors are deterministic and uniformly
bounded, which is not a strong assumption in this literature. Condition (vii) is the key assumption
for this theorem, and it requires that the dependence among groups decays sufficiently quick when
the distance between groups become further apart. This assumption employs the concept from -
mixing to define the rate of dependence decreasing as distance increases. Condition (viii) assumes
the limit of [ ()] exists as →∞ which is not a strong assumption. Condition (ix) is actuallyimplied by the rule of dividing groups, which just excludes that the two groups are exactly in the
same location.
15
-
4.2 Asymptotic Normality
As we discussed in the introduction, the spatial dependence is more complicated than time-series
dependence at least in four perspectives. These differences cause that central limit theorems (CLT)
need stronger conditions for the spatial dependence case. To deal with general dependence problems,
the common way in the literature is to use the so called "Bernstein Sums", which break up into
blocks (partial sums), and we consider the sequence of blocks. Each block must be so large, relative
to the rate at which the memory of the sequence decays, that the degree to which the next block
can be predicted from current information is negligible. But at the same time, the number of blocks
must increase with so that the CLT argument can be applied to this derived sequence (Davidson,
1994).
In this section, we show under what assumptions we are able to apply McLeish’s central limit the-
orem (1974) to spatial dependence cases to get asymptotic normality for the spatial Probit estimator.
This is presented in the following Theorem. denotes the transpose of matrix
Theorem 2: If the assumptions of Theorem 1 hold, and in addition: (i) as →∞ 2(∗)(∗) = (1)
for all fixed ∗ 0 (ii) the sampling area grows uniformly at a rate of√ in two non-opposing
directions, (iii) (0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformlypositive definite matrices; then
√(b − 0) → [0 (0)−1(0)(0)−1] where (0) ≡ (0)
and (0) =2
(0)
Proof: Given in Appendix 1.
Condition (i) is stronger than condition (vii) in Theorem 1, and it is also stronger than the usual
condition in time series data because spatial dependent data has more dimension correlations than
time series data. It shows that how dependence decays when distance between groups gets further
away, and the dependence decays at the rate fast enough. Condition (ii) just repeats the assumption
in the Bernstein’s blocking method, the two non-opposing directions just exclude sampling area grows
at two parallel directions, which does not make much sense in spatial dependent case. Conditions in
(iii) are natural conditions about matrices, which are implied by the previous assumptions. Matrices
are semidefinite if some extreme situations happen such as Pr(1 = 1 2 = 1|) = 0 which areassumed to be excluded in the previous assumptions.
4.3 Estimation of Variance-covariance Matrices
Consistent estimation of the asymptotic covariance matrix is important for the construction of as-
ymptotic confidence intervals and hypothesis tests (Newey and West, 1987). The estimations of
(i.e. ̂ = (b)) are relatively easy, usually just obtaining sample analogues of 0 with b; but theestimation of (i.e. b = (b)) is more difficult and more important because of the correlationsamong groups. Newey-West (1987) proposed a method to estimate the variance-covariance matrix in
16
-
settings of dependence of infinite order under a covariance stationary condition, and they suggested
modified Bartlett weights to make sure the estimated variance and test statistics were positive.
Andrews (1991) established the consistency of kernel HAC (Heteroskedasticity and Autocorrelation
Consistent) estimators under more general conditions. Pinkse and Slade (1998) also showed that we
can obtain (b) −(0) = (1) under regularity assumptions, where () ≡ [() ()] (seeLemma 9 in Appendix 1). This approach is feasible in practice only if we can get closed form expres-
sions for [() ()], which should be a function of , and then plug in b for 0 in the function toget consistent covariance estimators. However, it is difficult to get closed form expressions for (b)in practice, and hence we follow an alternative approach proposed by Conley (1999).
A feasible way to obtain a consistent estimate of a variance-covariance matrix that allows for a
wider range of dependence is to apply the approach of Conley (1999) along the lines of Newey-West
(1987). We follow this procedure in the following Theorem 3.
Let ΞΛ be the −algebra generated by a given random field ∈ Λ with Λ compact, andlet |Λ| be the number of ∈ Λ. Let Υ (Λ1Λ2) denote the minimum Euclidean distance from anelement of Λ1 to an element of Λ2 There exists also a regular lattice index random field ∗ thatis equal to one if location ∈ 2 is sampled and zero otherwise. ∗ is assumed to be independentof the underlying random field and to have a finite expectation and to be stationary. The mixing
coefficient is defined as
() ≡ sup {| ( ∩)− () ()|} ∈ ΞΛ1 ∈ ΞΛ2 and|Λ1| ≤ |Λ2| ≤ Υ (Λ1Λ2) ≥
We also define a new process () such as
() =
( () if ∗ = 10 if ∗ = 0
)
Then
Theorem 3. If (i) Λ grows uniformly in two non-opposing directions as −→ ∞ (ii)(0) ≡ lim→∞[(0) (0)] and (0) ≡ lim→∞−[(0)] are uniformly positive definitematrices, (iii) as defined in Theorem 1 = 1 2 and ∗ are mixing where () con-verges to zero as → ∞; () is Borel measurable for all ∈ Θ and continuous on Θ andfirst moment continuous on Θ (iv)
P∞=1 () ∞ for + ≤ 4 (v) 1∞ () = (−2)
(vi) for some 0 (k (0)k)2+ ∞ andP∞
=111 ()(2+) ∞ (vii) () is Borel
measurable for all ∈ Θ continuous on Θ and second moment continuous, (0) exists and isfull rank, (viii)
P∈2 (0 (0) (0)) is a non-singular matrix, (ix) the ( ) are uni-
formly bounded and ( ) −→ 1, −→ ∞ as −→ ∞ ( −→ ∞) = ¡13
¢and
= ¡ 13
¢, (x) for some 0 (k (0)k)4+ ∞ and as defined in Theorem 1
17
-
= 1 2 and ∗ are mixing where ∞∞ ()(2+) = (−4) (xi) supΘ k ()k2 ∞ and
supΘ k() [ ()]k2 ∞ thenb −(0) = (1) as −→∞where we split = [ ], Λ is a rectangle so that ∈ {1 2 } and ∈ {1 2 } and
b = −1 X=0
X=0
X=+1
X=+1
( )
⎛⎝ ³b´−− ³b´ +−−
³b´ ³b´⎞⎠
−−1X
=1
X=1
³b´ ³b´
To ensure positive semi-definite covariance matrix estimates, we need to choose an appropriate
two-dimensional weights function that is a Bartlett window in each dimension
( ) =
½(1− ||
)(1− ||
) || ||
0 else
¾
Proof: It follows from Conley (1999), Proposition 3.
5 Simulation Study
In the previous section, we have proved that the partial maximum likelihood estimator (PMLE)
based on the bivariate normal distribution is consistent and asymptotically normal. Moreover, one
of the most attractive properties of our new PMLE is that we can get a more efficient estimator
compared to the GMM estimator, and the approach is much less computational demanding when
compared to full information methods. In order to learn about the gains in efficiency that we obtain
in the context of a Bivariate Spatial Probit model when using PML versus GMM, we conduct in this
Section a simulation study to show the efficiency gains of PML.
5.1 Simulation Design and Results
Instead of comparing our PMLE to the GMM estimator of Pinkse and Slade (1998) directly, we
choose to compare the PMLE to the heteroskedastic Probit estimator (HPE) because of two reasons:
First, the HPE uses similar information with the GMM estimator because both methods use gener-
alized residuals from the Probit estimation to construct the moment conditions, which means that
both methods use the information from the heterogeneities of the diagonal terms of the variance-
covariance matrix, while our PMLE uses both diagonal and off-diagonal correlations information
between two closest neighbors. Second, the STATA4 source codes for bivariate probit estimation and
4See http://www.stata.com/
18
-
heteroskedastic Probit estimation are available online, and we can easily add the spatial parts into
these existing source codes to compare PML estimators with Heteroskedastic Probit Estimators. We
consider two simulation settings allowing for different types of spatial dependence.
5.1.1 Case 1
According to the theoretical framework given in previous sections, we could generate a dataset which
allows a general correlation structure across groups as equations (14) and (15), and it requires to
specify the exact formula (as functions of and ) for the elements of Ω However, it is quite
difficult to derive the pairwise covariances for a bivariate probit because the exact formula for Ω12(and of Ω11Ω22) is very complicated, which is an element of the inverse matrix with 2 spatially
correlated observations as follows
Ω =
"Ω11 Ω12
Ω21 Ω22
#= [( − )0( − )]−1 =
⎡⎢⎢⎢⎢⎢⎢⎣Ω111
Ω11 Ω12
Ω21 Ω22
Ω22
⎤⎥⎥⎥⎥⎥⎥⎦ (63)
Therefore, it seems reasonable to do the following. Let be the weighting matrix which can be
generated in STATA5 according to the distance between observations
∗ = 11 +22 +33 + (64)
= (65)
where ∼ (0 ). The weighting matrix is standardized so that the diagonal elements areones, and then the elements of shrink as distance is increasing . The reason we do this is because it
is easier to determine () and ( ) to apply the HP and the bivariate probit estimators.
In this way, we still allow general correlation across groups, and we are able to compare the efficiency
gains from only using the diagonal information (the HP approach) to using both diagonal and off-
diagonal information (bivariate probit), and we do not require to know the exact formula for the
elements in Ω (given in equation (63)) to reach the same goal.
Therefore, we generate the dataset according to equations (64) and (65), which allows spatial
correlation between any two observations, and we set the true parameter values for 1 2 and
3 equal to 1, 1 and 1 respectively. Since our main focus in this study is on the estimation of the
spatial parameter , we also set different true values for each simulated sample: = 02; 04; 06
and 08 to test for the performance of the two estimation methods (PML and HP). These values for
5The STATA command is “Spatwmat”. Since the speed to calculate the inverse of a matrix is much slower as the
size of matrix increases, and moreover the maximum matrix size in Stata is 800, we allow here each observation to bespatially correlated to nearby 99 observations.
19
-
are in the range of the estimated value in the empirical application of Pinkse and Slade (1998). In
this setting and with 1000 replications, we consider a sample size of = 1000 observations (where
the sample size is divided into 500 pairwise groups). Finally, we also simulate samples of sizes 500 and
1500 (with 250 and 750 pairwise groups respectively) to check the performance of the two methods
in different samples sizes. The simulation results are reported in Tables 1 (for the spatial parameter
) and 2 (for the 1 2 and 3) in Appendix 2.
From Table 2, we can observe that both the HPE and the PMLE of 1 2 and 3 converge to true
parameter values across the different parameter values as sample size increasing. Also the the PML
estimator has much less bias than the HPE. Moreover, as expected, PML always provides smaller
standard errors than the HP estimation method and bias and standard errors decrease in general
when sample size increases.
Furthermore, it is in Table 1 where we can observe the largest advantages of using PML versus HP.
We can see that the PMLE is much better than the HPE in terms of estimating the spatial parameter
. The PMLE is always much closer to the true parameter values and with small standard errors
across different sample sizes and parameter values (as expected from our theoretical results), while
the HPE is much further away from true parameter values and it is has a much larger standard
deviation over the different sample sizes, even though HPE also shows the trend to converge to the
true values in general as the sample increases. The HPE has always much larger standard deviation
than the PMLE, showing clearly the gains in efficiency of PML versus HPE/GMM as predicted
by our theory. Since both the HPE and the GMM estimator use generalized residuals from Probit
estimation to construct the moment conditions, we conjecture that the GMM estimator is subject to
similar inefficiency problems in estimating the spatial coefficient. Also, as it is expected, the bias of
the PMLE decreases when increases.
In summary, from the simulation results of Tables 1 and 2, we see how the PMLE outperforms
clearly the HPE (i.e., the GMM estimator of Pinkse and Slade (1998)), specially when estimating the
spatial parameter , which implies that the PMLE is much more robust and efficient in the context
of the spatial probit model. The simulation results provide clear evidence of the gains in efficiency
that can be obtained by PML versus GMM, as predicted by our theoretical results in the previous
section.
5.1.2 Case 2
We considered a second data generating process given as (64), where again we set the true parameter
values for 1 2 and 3 equal to 1, 1 and 1 respectively, and the covariates are jointly normal and
independent. In this case we assume (24) where the are i.i.d. (0 1) and is the
reciprocal of the distance between and . Then we can get closed form expressions for the variance
and covariance given in (25) and (26). Results for 1,000 replications are provided in Tables 3 and 4
in Appendix 2. Again, the PMLE provides improvements versus GMM, especially when estimating
20
-
Efficiency gains in estimating are smaller in Table 4, probably because we chose in this setting
pairwise observations to be closest to each other.
6 Conclusions
The idea of this paper is simple and intuitive: instead of just using information contained in moment
conditions (GMM), we divide observations into pairwise groups. Provided we correctly specify the
conditional joint distribution within these pairwise groups, we show that the spatial bivariate Pro-
bit model allows us to use the most important information of spatial correlations among adjacent
observations and to get a more efficient estimator: the partial MLE (PMLE). We prove that the
PMLE is consistent and asymptotically normal under some regularity conditions. We also discuss
how to get consistent covariance matrix estimators under general spatial dependence by following the
approach of Conley (1999) and Newey-West (1987), which is more usable in practice compared to
the proposal of Pinkse and Slade (1998). An attractive aspect of our approach is that we can obtain
a more efficient estimator than GMM without introducing stronger assumptions, and the approach
is much less computationally demanding compared to full information methods. In order to learn
about the gains in efficiency that we obtain in the bivariate Probit model with PML versus the GMM
estimator, we provide a simulation study in Section 5. The advantages in terms of bias and efficiency
of our new estimation procedure proposed in this paper are clearly demonstrated. Moreover, if we
extend this method to the trivariate or higher dimensional multivariate Probit models, we can obtain
even more efficient estimators, but it comes at the expense of more computational demands. We
have also shown how the estimation of the spatial probit model through PML also allows to obtain
estimates of partial average effects. The results in this paper can be generalized to consider models
with distributed lags, and also to extend the performance of the PMLE to lots of other nonlinear
models, including Tobit, count and switching. This will be the subject of future work.
7 Appendix 1
7.1 Proofs to Theorems
Proof of Theorem 1. If we can prove that ()→ () uniformly, by the information inequality,
() has a unique maximum at the true parameter when 0 is identified. Then under technical
conditions for the limit of the maximum to be the maximum of the limit, b should converge inprobability to 0. Sufficient conditions for the maximum of the limit to be the limit of maximum
are that the convergence in probability is uniform and the parameter set is compact (Newey and
Mcfadden, 1994).
To prove consistency, the proof includes three parts:
(i) has a unique maximum at 0
21
-
(ii) ()− () = (1) at all ∈ Θ(iii) () is stochastically equicontinuous and is continuous on Θ
Condition (i) and to be continuous on Θ are assumed. The proof of condition (ii) is provided
in Lemma 1, and the proof that () is stochastically equicontinuous can be found in Lemma 2.
Proof of Theorem 2. To find out the asymptotic normality of the Partial MLE for spatialbivariate Probit model, we start the proof from mean value theorem. Since
(b) = 0 and by using
the mean value theorem
(b) = 0 =
(0) +2
(∗)(b − 0) (66)
⇒ (b − 0) = −[ 2
(∗)]−1
(0) (67)
where ∗ lies between b and 0First, let us discuss the term
2
(∗) to find out the asymptotic properties of 2
(∗) Recall
that
() =1
X=1
{12(1 1) + 1(1− 2)(1 0)
+(1− 1)2(0 1) + (1− 1)(1− 2)(0 0)} (68)where (1 1) ≡ log(1 = 1 2 = 1|) etc. Also
2
() =
1
X=1
{122(1 1)
+ 1(1− 2)
2(1 0)
+(1− 1)(2)2(0 1)
+ (1− 1)(1− 2)
2(0 0)
(69)
where2(1 1)
=
−1[Pr(1 = 1 2 = 1|)]2 [
Pr(1 = 1 2 = 1|)
]2
+1
Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]
(70)
and all other terms behave similar.
As before, we only discuss one of these terms, and the same logic applies to the other terms. We
know that
1
X=1
[122(1 1)
(∗)]
=1
X=1
12{ −1[Pr(1 = 1 2 = 1|)]2 [
Pr(1 = 1 2 = 1|)
(∗)]2
+1
Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]
(∗)} (71)
22
-
Look at the first term of the above equation given by
1
X=1
12{ −1[Pr(1 = 1 2 = 1|)]2 [
Pr(1 = 1 2 = 1|)
(∗)]2} (72)
Since°°° 1[Pr(1=12=1|)]2°°° ∞, we can write this term as
1
X=1
11[ Pr(1 = 1 2 = 1|)
(∗)]2 (73)
where 11 ≡ 12 −1[Pr(1=12=1|)]2 In order to prove
1
X=1
11[ Pr(1 = 1 2 = 1|)
(∗)]2
→ 1
X=1
11[ Pr(1 = 1 2 = 1|)
(0)]
2 (74)
we need to show that it holds for all kk = 1 Set 11 = and then
{1
X=1
11[ Pr(1 = 1 2 = 1|)
(b)]2
−1
X=1
11[ Pr(1 = 1 2 = 1|)
(0)]
2} (75)
=1
X=1
11{[ Pr(1 = 1 2 = 1|)
(b)]2 − [ Pr(1 = 1 2 = 1|)
(0)]2} (76)
= (b − 0) 2
X=1
11 Pr(1 = 1 2 = 1|)
(∗)×
2 Pr(1 = 1 2 = 1|)
(∗) (77)
From the proof of Theorem 1, we know that sup°°° Pr(1=12=1|) °°° ∞ From Lemma 3,
sup
°°°2 Pr(1=12=1|)
°°° ∞ From Theorem 1, we also know that b − 0 = (1) and hence(b − 0) 2
X=1
11 Pr(1 = 1 2 = 1|)
(∗)×
2 Pr(1 = 1 2 = 1|)
(∗) = (1) (78)
=⇒ {1
X=1
11[ Pr(1 = 1 2 = 1|)
(b)]2 − 1
X=1
11[ Pr(1 = 1 2 = 1|)
(0)]
2} = (1)
(79)
=⇒ 1
X=1
11[ Pr(1 = 1 2 = 1|)
(∗)]2
→ 1
X=1
11[ Pr(1 = 1 2 = 1|)
(0)]
2 (80)
By definition,
lim→∞
1
X=1
11[ Pr(1 = 1 2 = 1|)
(0)]
2 = {11[ Pr(1 = 1 2 = 1|)
(0)]2} (81)
23
-
and therefore,
1
X=1
12{ −1[Pr(1 = 1 2 = 1|)]2 [
Pr(1 = 1 2 = 1|)
(∗)]2} → (82)
{12 −1[Pr(1 = 1 2 = 1|)]2 [
Pr(1 = 1 2 = 1|)
(0)]2} (83)
Similarly, we can prove in relation to the second term in (71) that
1
X=1
121
Pr(1 = 1 2 = 1|)2[Pr(1 = 1 2 = 1|)]
(∗) (84)
→ {12 1Pr(1 = 1 2 = 1|)
2[Pr(1 = 1 2 = 1|)]
(0)} (85)
As usual, we apply repeatedly the above arguments to the other terms. Finally, we can get that
lim→∞
2
(∗)
→ [ 2
(0)] (86)
If we define
≡ {122(1 1)
+ 1(1− 2)
2(1 0)
+(1− 1)(2)2(0 1)
+ (1− 1)(1− 2)
2(0 0)
} (87)
where denotes the Hessian, equation (87) can be rewritten as
lim→∞
1
X=1
(∗)→ lim
→∞[(0)] (88)
Therefore, it remains to show the asymptotic normality of the score term: (0) For the sake
of brevity, redefine the score as: (0) ≡ (0) Then
(0) =1
X=1
{12(1 1)
(0) + 1(1− 2)(1 0)
(0)
+(1− 1)2(0 1)
(0) + (1− 1)(1− 2)(0 0)
(0)} (89)
We need to show that −12 (0)(0)→ (0 ) where () ≡ lim
→∞[()
()] Note that
the information matrix equality does not hold here, i.e. −[(0)] 6= [() ()] because thescore terms are correlated with each other over space. In this part, we follow Pinkse and Slade
(1998) and we use Bernstein’s blocking methods and the McLeish’s (1974) central limit theorem for
dependent processes. First, define ≡ Π=1(1+ ) where 2 = −1 and ( = 1 2) is
24
-
an array of random variables on the probability triple (Ωz ) is a real number. McLeish’s (1974)central limit theorem for dependent processes requires the following four conditions
(i) {}is uniformly integrable,(ii) → 1(iii)
P=1
2
→ 1(iv)
≤|| → 0
Now we need to define in our case. Let 0 ≡ {√(0)√(0)
} = − 12 P=1 for implicitlydefine In order to prove 0
→ (0 1) we need to establish that the property holds for allkk = 1 using the Cramer-Wold device. As in the proof of Lemma 1 in Theorem 1, we split theregion in which observations are located up to an area of size
√ ×
√ We also know that
increases faster than√ and slower, where and are integers such that = Let
and be constructed such that (√) → 0 Let −12 × 1 uniformly in , for some
fixed 0 12 Let Λ denote the set of indices corresponding to the observations in area . By
assumption a number 0 exists such that (#Λ) Define ≡ − 12P
∈Λ ,and hence we can write 0 =
P=1
Now we are ready to discuss the four conditions for Mcleish’s (1974) central limit theorem. First,
look at condition (iv), which requires that ≤
|| = (1)
≤
|| =≤
|− 12X∈Λ
| (90)
Since by assumption
(#Λ) ⇒≤
|− 12X∈Λ
| ≤ × − 12 sup kk (91)
where # denotes the number of objects, by definition we have that
{√(0)p(0)
} = − 12X
=1
= 1√
0{12(1 1)
(0) + 1(1− 2)(1 0)
(0)
+(1− 1)2(0 1)
(0) + (1− 1)(1− 2)(0 0)
(0)} (92)
Since(0) is positive definite, (0)−12 is bounded as →∞ and we have that sup kk ∞ by
assumption (vi) in Theorem 1. We have also proved that sup°°°(11) °°° ∞ in Lemma 2. Therefore,
we are able to prove that sup kk ∞. Then × − 12 sup kk = ( × − 12 ) = (1) byconstruction of . Hence we can get that
≤|| = (1)
Second, let us discuss condition (i): {} is uniformly integrable. Following Davidson (1994),if a random variable is integrable, the contribution to the integer of extreme random variable values
must be negligible. In other words, if | | ∞ (||1| |) → 0 as → ∞ it is
25
-
equivalent to say [sup || ] = 0 for some 0 as →∞ Here we follow the proof ofLemma 10 in Pinkse and Slade (1998). We have that
[sup
|| ] = [sup
|Π=1(1 + )| ] (93)
≤ [sup
|Π=1(q1 + 22)| ] (94)
= { [sup
|Π=1(q1 + 22)| |( sup
|| ≤ )]× [sup || ≤ ]
+ [sup
|Π=1(q1 + 22)| |( sup
|| )]× [sup || ]} (95)
≤ { [sup
|Π=1(q1 + 22)| |( sup
|| ≤ )] + [sup || ] (96)
where is a uniform upper bound toP
∈Λ Therefore,
[sup || ] = [sup |− 12X∈Λ
| ] (97)
= [sup−12
X∈Λ
|| ] ≤ [sup− 12 X∈Λ
|| ] = 0 (98)
since − 12 1 and by construction of Then,
[sup
|Π=1(q1 + 22)| |( sup
|| ≤ )] ≤ [sup
|(1 + 2−22)2 | ] = 0 (99)
provided we set sufficiently large. Therefore, we proved that [sup || ] = 0⇒ {}is uniformly integrable.
Third, condition (ii) requires that → 1 which is equivalent to say that − 1 = (1);see proof in Lemma 4.
Fourth, in order to prove (iii):P
=12
→ 1 by Lemma 8, P=12 − 1 = P=1(2) −1 + (1) and
X=1
(2)− 1 + (1) = ( 20)− 1−X6=
() + (1) = (1) (100)
by construction of 0 since ( 20) = 1 It remains to show thatP
6= () = (1) Thiscondition is proved in Lemmas 5-76.
6Lemmas 5-8 are along the lines of those in Pinkse and Slade (1998), which are a simplified version of the proofsin Davidson (1994).
26
-
7.2 Technical Lemmas
The proofs of Theorems 1-2 require the use of the following Lemmas 1-8.
Lemma 1: Under the assumptions in Theorem 1, ()− () = (1) for all ∈ ΘProof: we can rewrite () as
() =1
X=1
{12[(1 1)− (1 0)− (0 1) + (0 0)]
+1[(1 0)− (0 0)] + 2[(0 1)− (0 0)] + (0 0)} (101)
Since we assume that lim→∞
[ ()] exists, and by definition () ≡ lim→∞
[ ()] this implies
that: () − [ ()] = (1). In order to prove () − () = (1) we only need to showthat () − [ ()] = (1) That is equivalent to prove that the distance between () and[ ()] is infinitely small as → ∞ That is: k ()−[ ()]k2 → 0 as → ∞ and bydefinition, it is equivalent to [ ()]→ 0 as →∞It is easy to see that
[ ()] =
1
2
X=1
X=1
{11(12 12) + 212(12 1) + 213(12 2)
+22(1 1) + 223(1 2) + 33(2 2) (102)
where 1 = [(1 1) − (1 0) − (0 1) + (0 0)] 2 = [(1 0) − (0 0)] and 3 =[(0 1)− (0 0)] The same definition applies to 1 2and 3Note that here
(1 1) = log{Z ∞−2
Φ(1 + 12p
(1))(
2p (2)
)2} (103)
which is not a function of or Hence 1 is not a function of or The same logic applies to
the other terms (2 3 1 2 and 3). Since we can set 1 ≤ (1 1) ≤ 2 for finite 1 and2 the same applies to (1 0) (0 1) and (0 0) We normalize to 0 ≤ (1 1) ≤ 1 Therefore,it is easy to see that || ≤ 2 and the same || and hence || ≤ 4 = 1 2Therefore, we can write
| [ ()]| =1
2
X=1
X=1
{4(12 12) + 8(12 1) + 8(12 2)
+4(1 1) + 8(1 2) + 4(2 2) (104)
27
-
In the previous equation, firstly, let us look at the term 12
P=1
P=1
4(1 1)
1
2
X=1
X=1
4(1 1) ≤ 12
X=1
X=1
4|(1 1)| ≤ 42
X=1
X=1
() (105)
by assumption (vii). Therefore, we need to prove that
4
2
X=1
X=1
() = (1) as →∞ (106)
Following Pinkse and Slade (1998), we also use the Bernstein’s (1927) blocking method to prove
this as follows. We split the region in which observations are located up to an area of size
1√ × 2
√ We also know that increases faster than
√ and slower, where and are
integers such that = Without loss of generality, we assume 1 = 2 = 1, and let and be
constructed such that (√) → 0 Let − 12 × 1 uniformly in , for some fixed 0 12
By construction of (−12 ) = (1) Then we are able to apply the same idea to our case. In
our case, the groups and take the role of and where one grows faster and the other grows
slower than√We also know the is the distance between |−|. So we can find an upper bound
for |−| as the maximum between group and . Let us suppose that is the one that grows fasterthan
√ and is the one that grows slower than
√. Then we can cancel one of the summations
corresponding to with −1 Moreover, since grows faster than√ but slower than −1 one way
is to define
√P
=1
() as the one that grows faster than√ but slower than in such a way that
X=1
X=1
() = (1
√X
=1
()) (107)
Finally,
√P
=1
() grows slower than and therefore, ( 1
√P
=1
()) = (1) So, we can get
1
2
X=1
X=1
4(1 1) ≤ 42
X=1
X=1
() = (1) (108)
We can apply the same logic to 12
P=1
P=1
8(1 2) and 12P
=1
P4
=1
(2 2) Let us consider
12
P=1
P=1
4(12 12) If we define = 12 and = 12 we can apply the same logic
to prove that 12
P=1
P=1
4( ) ≤ 42P
=1
P=1
() = (1) Therefore, we are able to show that
k ()−[ ()]k2 ≤ | [ ()]| ≤ 362
X=1
X=1
() = (1) (109)
28
-
Hence, ()−[ ()] = (1) =⇒ ()− () = (1) at all ∈ Θ
Lemma 2 Under the assumptions in Theorem 1, ()− () is stochastically equicontinuous.Proof: The proof requires only to show that () is stochastically equicontinuous because
() is continuous by assumption (iii). We have that
()−³e´ = 1
X=1
{12[(1 1 )− (1 1e)]+1(1− 2)[(1 0 )− (1 0e)]+(1− 1)2[(0 1 )− 01(0 1e)]+(1− 1)(1− 2)[(0 0 )− (0 0e)]} (110)
By the mean value theorem
()−³e´ = 1
X=1
{12[(1 1)
(∗)( − e)] + 1(1− 2)[(1 0)
(∗)( − e)]+(1− 1)2[(0 1)
(∗)( − e)] + (1− 1)(1− 2)[(0 0)
(∗)( − e)]} (111)
where ∗ lies between and e In order to prove () is stochastically equicontinuous, it is sufficientto show that
sup∈Θ|1
X=1
12(1 1)
()| = (1) (112)
and the same requirement applies to other terms. For simplicity issues we just prove one of them
and the rest follow the same argument. Recall that
(1 1) ≡ log(1 = 1 2 = 1|) (113)
and note that (1 = 1 2 = 1|) = Φ2( 1√Ω11
2√Ω22
|) where Φ2 is the bivariate normaldistribution function. Also
(1 1)
=
[logΦ2(1√Ω11
2√Ω22
)]
(114)
and since ≡ ( )(1 1)
() =
((11)
()
(11)
()
) (115)
We focus first on (11)
() where
(1 1)
=
[logΦ2(1√Ω11
2√Ω22
)]
=
11√Ω11
+ 22√Ω22
Φ2(1√Ω11
2√Ω22
) (116)
29
-
with
1 = (1pΩ11
)Φ(( 2√
Ω22− 1√
Ω11)p
1− 2) (117)
2 = (2pΩ22
)Φ(( 1√
Ω11− 2√
Ω22)p
1− 2) (118)
By assumption (v)
sup
°°°°°° 1Φ2( 1√Ω11
2√Ω22
)
°°°°°° = sup°°°° 1Pr(1 = 1 2 = 1|)
°°°° ∞ (119)and it is easy to see that
°°°° 11√Ω11 + 22√Ω22°°°° ∞ provided that sup(kk) ∞ Therefore,
sup
°°°°(1 1) ()°°°° ∞ (120)
We now discuss the second term (1=12=1|)
() where
(1 1)
=
[logΦ2(1√Ω11
2√Ω22
)]
(121)
=2(
1√Ω11
2√Ω22
)
Φ2(1√Ω11
2√Ω22
)×
2(1√Ω11
2√Ω22
)
(122)
and after some algebra, we can prove that sup
°°°°2( 1√Ω11 2√Ω22 ) °°°° ∞ provided that sup kk ∞Therefore, it easy to see when sup
°°°(11) °°° ∞ and sup °°°(11) °°° ∞ we can getsup
°°°°(1 1)°°°° ∞. (123)
We apply the same logic to the other terms, and we can prove that sup°°°(10)
()°°°, sup °°°(01) ()°°°
and sup°°°(00)
()°°° are also bounded.
Therefore, finally sup∈Θ| 1
P=1 12
(11)
()| = (1) given sup(kk) = (1) and hence we
can prove that ()− () is stochastically equicontinuous.
Lemma 3: Under the assumptions in Theorem 2, sup°°°2 Pr(1=12=1|)
°°° ∞30
-
Proof: From Lemma 2, we know that
Pr(1 = 1 2 = 1|)
=( 1√
Ω11)Φ(
(2√Ω22
− 1√Ω11
)√1−2
)1pΩ11
+( 2√
Ω22)Φ(
(1√Ω11
− 2√Ω22
)√1−2
)2pΩ22
(124)
=⇒ 2 Pr(1 = 1 2 = 1|)
=1(
1√Ω11
){1 1√Ω11
Φ[(2√Ω22
− 1√Ω11
)√1−2
] + [(2√Ω22
− 1√Ω11
)√1−2
](
2√Ω22
− 1√Ω11
)√1−2
}pΩ11
+2(
2√Ω22
){2 2√Ω22
Φ[(1√Ω11
− 2√Ω22
)√1−2
] + [(1√Ω11
− 2√Ω22
)√1−2
](
1√Ω11
− 2√Ω22
)√1−2
}pΩ22
(125)
and even though the above expression is complicated, it is easy to see that all the terms are bounded
provided the assumptions in Theorem 2 hold. This is equivalent to
sup
°°°°2 Pr(1 = 1 2 = 1|)°°°° ∞ (126)
Pr(1 = 1 2 = 1|)
=2(
1√Ω11
2√Ω22
)
Φ2(1√Ω11
2√Ω22
)×
2(1√Ω11
2√Ω22
)
(127)
=⇒ 2 Pr(1 = 1 2 = 1|)
2
=(2)2(Φ2 − 22)(Φ2)2
+2Φ2
222
(128)
It is easy to see that the first term of the above equation is bounded from previous results (i.e.
sup
°°°2 °°° ∞) and the second term can be also proved bounded since 222 can be proved tobe bounded given that sup kk ∞ after some algebra. Hence sup
°°°2 Pr(1=12=1|)
°°° ∞
Lemma 4. Under the assumptions in Theorem 2, −1 = (1)where ≡ Π=1(1+)Proof: By definition, = Π
=1(1 + ) = −1 + −1 By repeatedly multi-
plying out, we finally get = 1 + P
=1 −1 Hence,
− 1 = (X=1
−1) (129)
31
-
In order to prove − 1 = (1) we just need to show that: (P
=1 −1) = (1)This is equivalent to prove that (−1) = (−1 ) We can rewrite −1 as −1 = Π
−1=1(1 +
)We know there are − 1 groups of in −1We split these − 1 groups into two parts:groups adjacent to group , and groups that are not adjacent to group . We then define the area
Ξ−1 as the area which is adjacent to group . Therefore, −1 = Π∈Ξ−1(1+ )Π∈Ξ−1(1+) = Π∈Ξ−1(1 + )where ≡ Π∈Ξ−1(1 + ) which includes the groupswhich are not adjacent to group .
Since −1 = Π∈Ξ−1(1 + )we just need to prove
[(Π∈Ξ−1(1 + ))] = [(Π∈Ξ−1(1 + ))] = (−1 ) (130)
We know that
[(Π∈Ξ−1(1 + ))] = [(1 + X
∈Ξ−1−1)] (131)
= [] +[(X
∈Ξ−1−1)] (132)
First, we look at the term [] Since ≡ Π∈Ξ−1(1 + ) that means thegroup is not adjacent to group . By Bernstein’s method, we split the region in such a way that
the distance between group and non-adjacent group is at least 12 Hence, |[]| =
|( ) = (√) provided () = 0 and by assumption (vi) in Theorem 1. By
construction of and (√) = (1) and hence we obtain |[]| = (−1 )
Second, we look at the term [(P
∈Ξ−1 −1)] We have that
[(X
∈Ξ−1−1)] =
X∈Ξ−1
[Π∈Ξ−1)] (133)
Consider [)] first. We know that [)] = ( ) provided
() = 0 Since ( ) → ( ) as → ∞ because gets more andmore terms (all groups not adjacent to group ), while keeps the same amount. In the first
step, we have proved that ( ) = (−1 ) and by the same argument ( ) =(−1 )Therefore, we can prove that (−1) = (−1 )⇒ − 1 = (1)
Lemma 5. Under the assumptions in Theorem 2,P
6= () = (1)Proof: We know that
P6= () =
P=1
P=1()−
P= () = (1) if
we can show that P
=1 |()| = (−1 ) This is equivalent to proveP
6= () =(1) because the summation over contains − 1 terms.
32
-
Define Ξ as the set of indices corresponding to blocks that have blocks removed from every
direction from block . In other words, we assume there are no more than 8 blocks within distance
Hence,
X=1
|()| ≤ √X
=1
X∈Ξ
|()| (134)
≤ X∈Ξ
|()|+√X
=2
X∈Ξ
|()| (135)
The first term is proved to be (−1) = (−1 ) in Lemma 6. The second term can be alsoproved to be (−1 ) in Lemma 7.
Lemma 6: Under the assumptions in Theorem 2, P
6= |()| = (−1) = (−1 )Proof: Since = −
12
P∈Λ by definition
X6=|()| = ∈|−1
X∈Λ∈Λ
()| (136)
≤ ∈1−1X
∈Λ∈Λ() (137)
because () = ( ) = 1() where 1 0
To compute the upper bound of the correlation between and , we just need to consider the
strongest case, e.g. the and are adjacent each other. By Bernsteins’ blocking method, the number
of ( ) combinations that are within distance is bounded by 2√
2 where 2 0 Hence we
can get
∈1−1X
∈Λ∈Λ() ≤ 3∈−1
p
4√X
=0
2() (138)
where 3 = 12 4 0
By assumption (ii) in Theorem 2, 2()→ 0 as →∞ Therefore,
3∈−1p
4√X
=0
2() = (−1) (139)
Since = by construction, (−1) = (−1 )
Lemma 7: Under the assumptions in Theorem 2, P√
=2
P∈Ξ |()| = (−1 )
33
-
Proof: Because ∈Ξ ×∈Λ ×∈Λ|() = (√( − 1)) we have that
√X
=2
X∈Ξ
|()| ≤ 5√X
=2
#Ξ−1 ×#Λ ×#Λ(
p( − 1)) (140)
≤ 6−12√X
=1
(p) = (
−1
√X
=1
() = (−1) (141)
= (−1 ) (142)
where # denotes the number of objects, and (−1P√
=1 () = (−1) follows from assumption
(i): as →∞ 2(∗)(∗) = (1).
Lemma 8: Under the assumptions in Theorem 2,P
=12 =
P=1(
2) + (1)
Proof: In order to proveP
=12 =
P=1(
2) + (1) it suffices to show that
X=1
X=1
(2 2) = (1) (143)
We have thatX=1
X=1
(22) =
X=1
X=1
{[2 −(2)][2 −(2)]} (144)
≤ 78√X
=0
( + 1)(p)(
4) (145)
where 7, 8 0 are large enough. Also
(4) ≤ −2X
1234∈Λ|[1 2 3 4] (146)
≤ 9−2X
1234∈Λ{(12) + + (34)} (147)
≤ 10−2X
12∈Λ{(12)} (148)
≤ 11−22X
1∈Λ
12√X
=0
() = (−23) (149)
where 9 10 11 12 0 |P∞
=0 ()| ∞ Therefore finally
7
8√X
=0
( + 1)(p)(
4) = (
−23) = (1) (150)
because = and −12 → 0 as →∞
34
-
Finally, the following Lemma 9 generalizes Pinkse and Slade (1998) results as a way to obtain
consistent estimates of the variance covariance matrix.
Lemma 9: If assumptions in Theorem 2 hold, and sup°°Φ4
+ Φ3
°° ∞ then (b)−(0) =(1) and (b) −(0) = (1); where () ≡ [() ()] and () ≡ −[()]Proof: First, we prove that (b)−(0) = (1) We know that (b) = − 1P=1(b) and
by definition, lim→∞
(0) = (0) So we just need prove that {(b)− lim→∞
(0)} = (1) forall kk = 1 From the proof of Theorem 2, we have already proved that
1
X=1
(b)→ 1
X=1
(0) (151)
as → ∞ provided that b − 0 = (1) which is proved in Theorem 1. Therefore, we can get(b)−(0) = (1)Second, we consider how to show (b)− (0) = (1) As before, it is sufficient to show that
(b)−(0) = (1) as →∞We know that (0) = [(0) (0)] = ((0)) given(0) = 0 Recall from the proof of Theorem 2 that
(0) =1
X=1
{12(1 1)
(0) + 1(1− 2)(1 0)
(0)
+(1− 1)2(0 1)
(0) + (1− 1)(1− 2)(0 0)
(0)} (152)
and we can rewrite it as
(0) =1
X=1
{12[(1 1)
(0)− (1 0)
(0)− (0 1)
(0) +(0 0)
(0)]
+1[(1 0)
(0)− (0 0)
(0)] + 2[
(0 1)
(0)− (0 0)
(0)]
+(0 0)
(0)} (153)
For the sake of brevity, we redefine
1 ≡ [(1 1)
(0)− (1 0)
(0)− (0 1)
(0) +
(0 0)
(0)] (154)
2 ≡ [(1 0)
(0)− (0 0)
(0)] (155)
3 ≡ [(0 1)
(0)− (0 0)
(0)] (156)
4 ≡(0 0)
(0) (157)
35
-
Therefore,
((0)) = −1(0)
= −2X
=1
X=1
{11(12 12) + 212(12 1)
+213(12 2) + 22(1 1)
+223(1 2) + 33(2 2) (158)
where 1 2 3 are defined similarly as 1 2 3
As before, we just need to provide the proof for one of these terms, and the same logic applies to
other terms. We consider the most complicated term and the rest follow the same argument
−1X
=1
X=1
[11(12 12)]
= −1X
=1
X=1
11[(1212)−(12)(12)] (159)
(1212) = Pr(1 = 1 2 = 1 1 = 1 2 = 1|) (160)= Φ4(1 2 1 2 12 13 14 23 24 34) (161)
where Φ4 is the cdf for the quadvariate standard normal distribution, 1 =1√
(1)etc. Similarly,
(12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 12) (162)(12) = Pr(1 = 1 2 = 1|) = Φ2(1 2 34) (163)
and therefore,
(12)(12) = Φ2(1 2 12)×Φ2(1 2 34) (164)= Φ4(1 2 1 2 12 0 0 0 0 34) (165)
so we can write the first term as
(0) = −1
X=1
X=1
11[(1212)−(12)(12)] (166)
= −1X
=1
X=1
11[Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))
−Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))] (167)Similarly, we can write the first term of (b) as
−1X
=1
X=1
11[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))] (168)
36
-
By the mean value theorem, the first term of (b) −(0) is given as−1
X=1
X=1
11{[Φ4(1 2 1 2 12(b) 13(b) 14(b) 23(b) 24(b) 34(b))−Φ4(1 2 1 2 12(0) 13(0) 14(0) 23(0) 24(0) 34(0))]−[Φ4(1 2 1 2 12(b) 0 0 0 0 34(b))−Φ4(1 2 1 2 12(0) 0 0 0 0 34(0))]} (169)
= −1(b − 0) X=1
X=1
11{Φ4(1 2 1 2 12(
∗) 13(∗) 14(
∗) 23(∗) 24(
∗) 34(∗)
−Φ4(1 2 1 2 12(∗) 0 0 0 0 34(
∗))
} (170)
Since sup°°1°° ∞ by the proof in Theorem 2, we just need to assume
sup
°°°°Φ4(1 2 1 2 12(∗) 13(∗) 14(∗) 23(∗) 24(∗) 34(∗)°°°° ∞ (171)
and the same argument applies to
sup
°°°°Φ4(1 2 1 2 12(∗) 0 0 0 0 34(∗))°°°° ∞ (172)
so that
−1(b − 0) X=1
X=1
11{Φ4(1 2 1 2 12(
∗) 13(∗) 14(
∗) 23(∗) 24(
∗) 34(∗)
)
−Φ4(1 2 1 2 12(∗) 0 0 0 0 34(
∗))
}→ 0 (173)
because (b − 0)→ 0 and the other terms are bounded.Repeat the proofs to the other terms, plus the new assumption about sup
°°Φ3
°° ∞ and thenwe can prove (b) −(0) = (1)
37
-
8 Appendix 2
TABLE 1∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.
= 02 = 04 = 06 = 08
HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 3.938 0.514 6.177 0.519 7.698 0.571 7.735 0.634
bias 3.738 0.314 5.777 0.319 7.098 -0.029 6.935 -0.166(s.d.) (12.158) (0.120) (15.776) (0.205) (16.929) (0.151) (16.202) (0.289)
= 1000 mean 3.174 0.512 4.668 0.518 5.456 0.581 5.914 0.672bias 2.974 0.312 4.268 0.118 4.856 -0.019 5.114 -0.128(s.d) (8.844) (0.107) (9.100) (0.133) (9.631) (0.149) (10.173) (0.276)
= 1500 mean 2.746 0.511 4.050 0.507 4.872 0.609 5.426 0.708bias 2.546 0.311 3.650 0.107 4.272 0.009 4.626 -0.092(s.d.) (6.423) (0.099) (7.414) (0.124) (8.598) (0.149) (8.514) (0.253)
∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).
38
-
TABLE 2∗. CASE 1: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.
1= 1 2= 1 3= 1
HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 5.322 2.618 5.333 2.619 5.329 2.623
(s.d.) (8.844) (0.839) (8.872) (0.855) (8.863) (0.870) = 1000 mean 5.308 2.616 5.296 2.616 5.289 2.618
(s.d) (7.612) (0.560) (7.570) (0.560) (7.568) (0.564) = 1500 mean 5.247 2.604 5.239 2.602 5.235 2.604
(s.d.) (6.624) (0.540) (6.606) (0.536) (6.613) (0.543) = 04 = 500 mean 3.610 1.329 3.614 1.329 3.608 1.328
(s.d.) (5.305) (0.362) (5.311) (0.365) (5.290) (0.366) = 1000 mean 3.600 1.318 3.593 1.316 3.588 1.315
(s.d.) (4.192) (0.355) (4.177) (0.355) (4.178) (0.353) = 1500 mean 3.456 1.281 3.441 1.281 3.438 1.278
(s.d.) (3.818) (0.342) (3.793) (0.343) (3.798) (0.339) = 06 = 500 mean 2.898 0.972 2.876 0.966 2.885 0.969
(s.d.) (3.761) (0.271) (3.723) (0.268) (3.735) (0.271) = 1000 mean 2.669 0.981 2.669 0.979 2.657 0.978
(s.d.) (2.951) (0.261) (2.953) (0.261) (2.916) (0.259) = 1500 mean 2.508 1.016 2.499 1.015 2.501 1.016
(s.d.) (2.726) (0.250) (2.706) (0.250) (2.708) (0.253) = 08 = 500 mean 2.246 0.805 2.237 0.801 2.249 0.802
(s.d.) (2.810) (0.373) (2.803) (0.373) (2.841) (0.392) = 1000 mean 2.098 0.843 2.096 0.843 2.082 0.843
(s.d.) (2.281) (0.349) (2.279) (0.349) (2.246) (0.340) = 1500 mean 2.086 0.884 2.096 0.886 2.094 0.886
(s.d.) (2.059) (0.316) (2.071) (0.314) (2.073) (0.318)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the
Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard
deviations (s.d.).
39
-
TABLE 3∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF IN THECONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.
= 02 = 04 = 06 = 08
HPE PMLE HPE PMLE HPE PMLE HPE PMLE = 500 mean 2.151 0.381 2.575 0.667 2.491 0.970 2.876 1.202
bias 1.951 0.181 1.175 0.267 1.891 0.370 2.076 0.402(s.d.) (4.630) (0.844) (5.073) (0.923) (4.996) (0.913) (6.213) (0.966)
= 1000 mean 1.013 0.356 1.089 0.606 1.307 0.863 1.660 1.160bias 0.813 0.156 0.689 0.206 0.707 0.263 0.860 0.360(s.d) (2.131) (0.606) (2.241) (0.622) (2.424) (0.671) (2.675) (0.813)
= 1500 mean 0.684 0.324 0.792 0.592 0.906 0.860 1.305 1.156bias 0.484 0.124 0.392 0.192 0.306 0.260 0.505 0.356(s.d.) (1.508) (0.484) (1.566) (0.515) (1.611) (0.601) (1.910) (0.706)
∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and theHeteroskedastic Probit Estimator (HPE) of . Numbers in brackets show standard deviations (s.d.).
40
-
TABLE 4∗. CASE 2: SIMULATION RESULTS OF DIFFERENT ESTIMATORS OF 1 2 AND3 IN THE CONTEXT OF THE BIVARIATE SPATIAL PROBIT MODEL.
1= 1 2= 1 3= 1
HPE PMLE HPE PMLE HPE PMLE = 02 = 500 mean 1.043 1.021 1.042 1.020 1.042 1.020
(s.d.) (0.120) (0.109) (0.114) (0.104) (0.122) (0.108) = 1000 mean 1.017 1.006 1.022 1.011 1.016 1.005
(s.d) (0.079) (0.071) (0.078) (0.071) (0.079) (0.070) = 1500 mean 1.012 1.004 1.013 1.005 1.010 1.002
(s.d.) (0.065) (0.059) (0.063) (0.058) (0.064) (0.057) = 04 = 500 mean 1.043 1.017 1.042 1.018 1.043 1.019
(s.d.) (0.125) (0.112) (0.112) (0.105) (0.119) (0.110) = 1000 mean 1.017 1.005 1.020 1.008 1.014 1.002
(s.d.) (0.079) (0.072) (0.080) (0.072) (0.077) (0.069) = 1500 mean 1.014 1.005 1.013 1.004 1.009 1.000
(s.d.) (0.065) (0.058) (0.061) (0.056) (0.062) (0.056) = 06 = 500 mean 1.046 1.022 1.045 1.022 1.043 1.020
(s.d.) (0.123) (0.107) (0.126) (0.111) (0.126) (0.108) = 1000 mean 1.019 1.006 1.020 1.007 1.015 1.003
(s.d.) (0.083) (0.075) (0.084) (0.075) (0.079) (0.071) = 1500 mean 1.010 1.002 1.012 1.004 1.010 1.002
(s.d.) (0.066) (0.060) (0.065) (0.060) (0.063) (0.059) = 08 = 500 mean 1.035 1.013 1.036 1.014 1.037 1.015
(s.d.) (0.125) (0.115) (0.125) (0.111) (0.125) (0.115) = 1000 mean 1.016 1.003 1.017 1.005 1.016 1.004
(s.d.) (0.083) (0.086) (0.082) (0.090) (0.084) (0.088) = 1500 mean 1.011 1.000 1.012 1.001 1.012 1.001
(s.d.) (0.065) (0.058) (0.064) (0.057) (0.064) (0.056)∗Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the
Heteroskedastic Probit Estimator (HPE) of 1 2 and 3. Numbers in brackets show standard
deviations (s.d.).
41
-
References
[1] Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent Covariance
Matrix Estimation”, Econometrica 59, 3, 817-858.
[2] Amemiya, T. (1973): “Regression Analysis when the Dependent Variable is Truncated Normal”,
Econometrica 41, 997-1016.
[3] Anselin, L. and Florax, R. J. G. M. (1995): “New Direction in Spatial Econometrics”, Springer-
Verlag, Berlin, Germany.
[4] Anselin, L. Florax, R. J. G. M., and Rey, J. S. (2004): “Econometrics for Spatial Models: Recent
Advances”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany,1-28.
[5] Beron, K. J. and Vijverberg, W. P. (2003): “Probit in a Spatial Context: A Monte Carlo
Approach”, Advances in Spatial econometrics. Springer-Verlag, Berlin, Germany, 169-196.
[6] Bernstein, S. (1927): “Sur l’Extension du Theoreme du Calcul des Probabilities aux Sommes de
Quantities Dependantes”, Mathematische Annalen 97, 1-59.
[7] Case, A. C. (1991): “Spatial Patterns in Household Demand”, Econometrica 59, 953-965.
[8] Conley, T. G. (1999): “GMM Estimation with Cross Sectional Dependence”, Journal of Econo-
metrics 92, 1-45.
[9] Davidson, J. (1994): “Stochastic Limit Theory”, Oxford: Oxford University Press.
[10] Gourieroux, C. (2000): “Econometrics of Qualitative Dependent Variables”, Cambridge Univer-
sity Press.
[11] Kelejian, H. H. and Prucha, I. R. (1999): “A Generalized Moments Estimator for the Autore-
gressive Parameter in a Spatial Model”, International Economic Review 40, 509-533.
[12] Kelejian, H. H. and Prucha, I. R. (2001): “On the Asymptotic Distribution of the Moran I Test
Statistic with Applications”, Journal of Econometrics 104, 219-257.
[13] Lee, L.-F. (2004): “Asymptotic Distribution of Quasi-maximum Likelihood Estimators for Spa-
tial Autoregressive Models”, Econometrica 72, 6, 1899-1925.
[14] Lesage, J. P. (2000): “ Bayesian Estimation of Limit Dependent Variable Spatial Autoregressive
Models”, Geographical Analysis 32, 19-35.
[15] McLeish, D. L. (1974): “Dependent Central Limit Theorems and Invariance Principals”, Annals
of Probability 2, 620-628.
42
-
[16] McMillen, D. P. (1992): “Probit with S